
Table of Contents

Course Description
Chapter 1 – Basic Concepts in Assessment
Chapter 2 – Principles of High Quality Assessment
Chapter 3 – Measures of Central Tendency and Variability
Chapter 1 – Basic Concepts in Assessment
At the end of this chapter, the students will be able to:

1. Distinguish among test, measurement, evaluation and assessment.


2. Explain the meaning of assessment FOR, OF, and AS learning.

1.1 Basic Concepts


a. Test is defined as an instrument, tool or technique used to obtain
a sample of an individual's behaviour using standardized
procedures.
b. Measurement is a set of rules for assigning numbers to
represent objects, traits, attributes, or behaviors.
c. Evaluation is the process of making judgments based on criteria
and evidence, and determining the extent to which instructional
objectives are attained.
d. Assessment is the process of describing, collecting (gathering/
documenting), recording, scoring, and interpreting information
about learning.

1.2 Purposes of Assessment


a. Assessment FOR learning
The preposition "for" in assessment for learning implies that
assessment is done to improve and ensure learning. This is
referred to as FORmative assessment, assessment that is given
while the teacher is in the process of student formation. It ensures
that learning is going on while the teacher is in the process of teaching.

b. Assessment OF learning
It is usually given at the end of a unit, grading period or a term like
a semester. It is meant to assess learning for grading purposes,
thus the term assessment of learning.

c. Assessment AS learning
It is associated with self-assessment. As the term implies,
assessment by itself is already a form of learning for the students.
As students assess their own work (e.g. a paragraph) and/or with
their peers with the use of scoring rubrics, they learn on their own
what a good paragraph is. At the same time, as they are engaged in
self-assessment, they learn about themselves as learners and become
aware of how they learn. In short, in assessment AS learning,
students set their targets, actively monitor and evaluate their own
learning in relation to their set target. As a consequence, they
become self-directed or independent learners. By assessing their
own learning, they are learning at the same time.

[Figure: Various Approaches to Assessment – assessment FOR learning
(placement assessment, diagnostic assessment, formative assessment),
assessment OF learning (summative assessment), and assessment AS
learning (self-assessment).]

Other terms in assessment include:

• Placement assessment – used to place students according to


prior achievement or personal characteristics, at the most
appropriate point in an instructional sequence, in a unique
instructional strategy, or with a suitable teacher.
• Diagnostic assessment – used to identify the strengths and
weaknesses of the students.
• Summative assessment – is generally carried out at the end of a
course or project. In an educational setting, summative
assessments are typically used to assign students a course grade.
Summative assessments are evaluative. Summative assessments
are made to summarize what the students have learned, to
determine whether they understand the subject matter well.

Chapter 2 – Principles of High Quality Assessment


At the end of this chapter, the students will be able to:

1. Discuss the different learning domains.


2. Distinguish among validity, reliability, practicability and other properties of
assessment methods.

2.1 Clarity of Learning Targets


Assessment can be made precise, accurate and dependable only if
what are to be achieved are clearly stated and feasible. To this end,
we consider learning targets involving knowledge, reasoning skills,
products and effects. Learning targets need to be stated in behavioral
terms or terms that denote something which can be observed through the
behavior of the students.

2.1.1 Cognitive Targets


As early as the 1950s, Bloom (1956) proposed a hierarchy of
educational objectives at the cognitive level. These are:

Level 1. Knowledge refers to the acquisition of facts, concepts and


theories.
Level 2. Comprehension refers to the same concept of
"understanding". It is a step higher than mere acquisition of facts and
involves a cognition or awareness of the interrelationships of facts and
concepts.

Level 3. Application refers to the transfer of knowledge from one field of


study to another or from one concept to another concept in the same
discipline.

Level 4. Analysis refers to the breaking down of a concept or idea into its
components and explaining the concept as a composition of these components.

Level 5. Synthesis refers to the opposite of analysis and entails putting


together the components in order to summarize the concept.

Level 6. Evaluation refers to valuing and judgment or putting worth to a


concept or principle.

2.1.2 Skills, Competencies and Abilities Targets


Skills refer to specific activities or tasks that a student can proficiently
do. Skills can be clustered together to form specific competencies.
Related competencies characterize a student's ability. It is important to
recognize a student‘s ability in order that the program of study can be so
designed as to optimize his/her innate abilities.

Abilities can be roughly categorized into: cognitive, psychomotor and


affective abilities. For instance, the ability to work well with others and to
be trusted by every classmate (affective ability) is an indication that the
student can most likely succeed in work that requires leadership abilities.
On the other hand, other students are better at doing things alone like
programming and web designing (cognitive ability) and, therefore, they
would be good at highly technical individualized work.

2.1.3 Products, Outputs and Projects Targets


Products, outputs and projects are tangible and concrete evidence of
a student's ability. A clear target for products and projects needs to
clearly specify the level of workmanship of such projects. For instance, an
expert output may be characterized by the indicator "at most two (2)
imperfections noted" while a skilled level output can be characterized by
the indicator "at most four (4) imperfections noted", etc.

2.2 Appropriateness of Assessment Methods


Once the learning targets are clearly set, it is now necessary to determine
an appropriate assessment procedure or method. We discuss the general
categories of assessment methods or instruments below.

2.2.1 Written-Response Instruments


Written-response instruments include objective tests (multiple choice,
true-false, matching or short answer tests), essays and checklists.
Objective tests are appropriate for assessing the various levels of
hierarchy of educational objectives. Multiple choice tests in particular
can be constructed in such a way as to test higher order thinking skills.
Essays, when properly planned, can test the student's grasp of the higher
level of cognitive skills. However, when the essay question is not
sufficiently precise and when the parameters are not properly defined,
there is a tendency for the students to write irrelevant and unnecessary
things just to fill in blank spaces. When this happens, both the teacher and
the students will experience difficulty and frustration.

2.2.2 Product Rating Scales


A teacher is often tasked to rate products. Examples of products that are
frequently rated in education are book reports, maps, charts, diagrams,
notebooks, essays and creative endeavors of all sorts. An example of a
product rating scale is the classic 'handwriting' scale used in the
California Achievement Test, Form W (1957), which presents prototype
handwriting specimens of pupils and students. The sample handwriting of
a student is then moved along the scale until the quality of the handwriting
sample is most similar to one of the prototypes. To rate products in
education, the teacher must build up a stock of prototype products over
his/her years of experience.

2.2.3 Performance Tests
One of the most frequently used measurement instruments is the
checklist. A performance checklist consists of a list of behaviors that
make up a certain type of performance. It is used to determine whether or
not an individual behaves in a certain way when asked to complete a
particular task. If a particular behavior is present when an individual is
observed, the teacher places a check opposite it on the list.

2.2.4 Oral Questioning


The ancient Greeks used oral questioning extensively as an
assessment method. Socrates himself, considered the epitome of a
teacher, was said to have handled his classes solely based on questioning
and oral interactions.

Oral questioning is an appropriate assessment method when the
objectives are: (a) to assess the student's stock knowledge; and
(b) to determine the student's ability to communicate ideas in coherent
verbal sentences.
While oral questioning is indeed an option for assessment, several factors
need to be considered when using this option. Of particular significance
are the student's state of mind and feelings, anxiety and nervousness in
making oral presentations, which could mask the student's true ability.

2.2.5 Observation and Self-Reports


A tally sheet is a device often used by teachers to record the frequency
of student behaviors, activities or remarks. How many high school
students follow instructions during a fire drill, for example? How many
instances of aggression or helpfulness are observed when elementary
students are observed in the playground? Observational tally sheets are
most useful in answering these kinds of questions.

A self-checklist is a list of several characteristics or activities


presented to the subjects of a study. The individuals are asked to study
the list and then to place a mark opposite the characteristics which they
possess or the activities which they have engaged in for a particular length
of time. Self-checklists are often employed by teachers when they want to

diagnose or to appraise the performance of students from the point of view
of the students themselves.

Observation and self-reports are useful supplementary assessment


methods when used in conjunction with oral questioning and performance
tests. Such methods can offset the negative impact on the students
brought about by their fears and anxieties during oral questioning or when
performing an actual task under observation. However, since there is a
tendency to overestimate one's capabilities, it may be useful to consider
weighing self-assessment and observational reports against the results of
oral questioning and performance tests.

2.3 Properties of Assessment Methods


The quality of the assessment instrument and method used in education is
very important since the evaluation and judgment that the teacher gives
on a student are based on the information he obtains using these
instruments. Accordingly, teachers follow a number of procedures to
ensure that the entire assessment process is valid and reliable.

2.3.1 Validity

Validity is the extent to which a test measures what it is supposed to
measure; it also refers to the appropriateness, correctness,
meaningfulness and usefulness of the specific decisions a teacher makes
based on the test results.
The first definition refers to the test itself while the second refers to the
decisions made by the teacher based on the test. A test is valid when it
is aligned with the learning outcome.
A teacher who conducts test validation might want to gather different kinds
of evidence.
There are essentially three (3) main types of evidence that may be
collected:
a. Content-related evidence of validity refers to the content
and format of the instrument. How appropriate is the content?
How comprehensive? Does it logically get at the intended variable?
How adequately does the sample of items or questions represent
the content to be assessed?
b. Criterion-related evidence of validity refers to the
relationship between scores obtained using the instrument and
scores obtained using one or more other tests (often called
criterion). How strong is this relationship? How well do such scores
estimate present or predict future performance of a certain type?
c. Construct-related evidence of validity refers to the nature
of the psychological construct or characteristic being
measured by the test. How well does a measure of the construct
explain differences in the behaviour of the individuals or their
performance on a certain task?
The usual procedure for determining content validity may be described
as follows:
The teacher writes out the objectives of the test based on the Table of
Specifications and then gives these together with the test to at least two
(2) experts along with a description of the intended test takers. The
experts look at the objectives, read over the items in the test and place a
check mark in front of each question or item that they feel does not
measure one or more objectives. They also place a check mark in front of
each objective not assessed by any item in the test. The teacher then
rewrites any item checked and resubmits to the experts and/or writes new
items to cover those objectives not covered by the existing test. This
continues until the experts approve of all items and also until the experts
agree that all of the objectives are sufficiently covered by the test.
In order to obtain evidence of criterion-related validity, the teacher
usually compares scores on the test in question with the scores on
some other independent criterion test which presumably has already
high validity. For example, if a test is designed to measure mathematics
ability of students and it correlates highly with a standardized mathematics
achievement test (external criterion), then we say we have high criterion-
related evidence of validity.
In particular, this type of criterion-related validity is called its
concurrent validity.

Another type of criterion-related validity is called predictive validity
wherein the test scores in the instrument are correlated with scores
on a later performance (criterion measure) of the students. For
example, the mathematics ability test constructed by the teacher may be
correlated with their later performance in a division-wide mathematics
achievement test.
Another type of validity is face validity, the extent to which a test is
subjectively viewed as covering the concept it tries to measure.
2.3.2 Reliability
Reliability refers to the consistency of the scores obtained – how
consistent they are for each individual from one administration of an
instrument to another and from one set of items to another.
Reliability and validity are related concepts. If an instrument is
unreliable, it cannot yield valid outcomes. As reliability improves,
validity may also improve (or it may not); however, if an instrument is
shown scientifically to be valid, then it is almost certain that it is also
reliable.
Something reliable is something that works well and that you can trust.
A reliable test is a consistent measure of what it is supposed to
measure.

The following table is a standard followed almost universally in educational
test and measurement.
Reliability        Interpretation
0.90 and above     Excellent reliability; at the level of the best standardized tests
0.80 – 0.90        Very good for a classroom test
0.70 – 0.80        Good for a classroom test; in the range of most classroom tests. There are probably a few items which could be improved
0.60 – 0.70        Somewhat low. This test needs to be supplemented by other measures (more tests) to determine grades. There are probably some items which could be improved
0.50 – 0.60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (more tests) for grading
Below 0.50         Questionable reliability. This test should not contribute heavily to the course grade and it needs revision

The Reliability Coefficient


The reliability coefficient is symbolized with the letter "r" and a subscript
that contains two of the same letters or numbers (e.g., r_xx). The
subscript indicates that the correlation coefficient was calculated by
correlating a test with itself rather than with some other measure.
Note that a reliability coefficient does not provide any information
about what is actually being measured by a test!
A reliability coefficient only indicates whether the attribute measured by
the test— whatever it is—is being assessed in a consistent, precise way.

Methods for Estimating Reliability
The selection of a method for estimating reliability depends on the nature
of the test. Each method not only entails different procedures but is also
affected by different sources of error. For many tests, more than one
method should be used.
a. Test – retest Reliability - The test-retest method for estimating
reliability involves administering the same test to the same group of
examinees on two different occasions and then correlating the two sets of
scores.
When using this method, the reliability coefficient indicates the degree of
stability (consistency) of examinees' scores over time and is also known
as the coefficient of stability.
The primary sources of measurement error for test-retest reliability are any
random factors related to the time that passes between the two
administrations of the test. These time sampling factors include random
fluctuations in examinees over time (e.g., changes in anxiety or
motivation) and random variations in the testing situation.
Memory and practice also contribute to error when they have random
carryover effects; i.e., when they affect many or all examinees but not in
the same way.
Test-retest reliability is appropriate for determining the reliability of tests
designed to measure attributes that are relatively stable over time and that
are not affected by repeated measurement. It would be appropriate for a
test of aptitude, which is a stable characteristic, but not for a test of mood,
since mood fluctuates over time, or a test of creativity, which might be
affected by previous exposure to test items.
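To make the computation concrete, here is a minimal Python sketch (not from the module itself): the coefficient of stability is simply the Pearson correlation between the two administrations. The scores for the second administration are hypothetical.

```python
# Minimal sketch: test-retest reliability (coefficient of stability) as the
# Pearson correlation between two administrations of the same test.
# The second administration's scores are hypothetical.

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

first_administration = [70, 72, 75, 77, 78, 80, 84, 87, 90, 92]
second_administration = [72, 70, 76, 75, 80, 81, 83, 88, 91, 90]

print(round(pearson_r(first_administration, second_administration), 2))
```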
b. Alternate (Equivalent, Parallel) Forms Reliability
To assess a test's alternate forms reliability, two equivalent forms of the
test are administered to the same group of examinees and the two sets of
scores are correlated.
Alternate forms reliability indicates the consistency of responding to
different item samples

(the two test forms) and, when the forms are administered at different
times, the consistency of responding over time.
The alternate forms reliability coefficient is also called the coefficient of
equivalence when the two forms are administered at about the same time,
and the coefficient of equivalence and stability when a relatively long
period of time separates administration of the two forms.
The primary source of measurement error for alternate forms reliability is
content sampling, or error introduced by an interaction between different
examinees' knowledge and the different content assessed by the items
included in the two forms (e.g.: Form A and Form B). The items in Form A
might be a better match of one examinee's knowledge than items in Form
B, while the opposite is true for another examinee.
In this situation, the two scores obtained by each examinee will differ,
which will lower the alternate forms reliability coefficient. When
administration of the two forms is separated by a period of time, time
sampling factors also contribute to error.
Like test-retest reliability, alternate forms reliability is not appropriate when
the attribute measured by the test is likely to fluctuate over time (and the
forms will be administered at different times) or when scores are likely to
be affected by repeated measurement.
If the same strategies required to solve problems on Form A are used to
solve problems on Form B, even if the problems on the two forms are not
identical, there are likely to be practice effects. When these effects differ
for different examinees (i.e., are random), practice will serve as a source
of measurement error.
Although alternate forms reliability is considered by some experts to be the
most rigorous (and best) method for estimating reliability, it is not often
assessed due to the difficulty in developing forms that are truly equivalent.

c. Internal Consistency Reliability


Reliability can also be estimated by measuring the internal consistency of
a test.

Split-half reliability and coefficient alpha are two methods for evaluating
internal consistency. Both involve administering the test once to a single
group of examinees, and both yield a reliability coefficient that is also
known as the coefficient of internal consistency.
To determine a test's split-half reliability, the test is split into equal halves
so that each examinee has two scores (one for each half of the test).
Scores on the two halves are then correlated. Tests can be split in several
ways, but probably the most common way is to divide the test on the basis
of odd- versus even-numbered items.
A problem with the split-half method is that it produces a reliability
coefficient that is based on test scores that were derived from one-half of
the entire length of the test. If a test contains 30 items, each score is
based on 15 items. Because reliability tends to decrease as the length of a
test decreases, the split-half reliability coefficient usually underestimates a
test's true reliability.
For this reason, the split-half reliability coefficient is ordinarily corrected
using the
Spearman-Brown prophecy formula, which provides an estimate of
what the reliability coefficient would have been had it been based on
the full length of the test.
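As an illustration (a sketch using a small hypothetical score matrix, not part of the module), split-half reliability can be computed by splitting a dichotomously scored test on odd- versus even-numbered items, correlating the half scores, and applying the Spearman-Brown correction for a test of double length:

```python
# Minimal sketch: split-half reliability with Spearman-Brown correction.
# item_scores is hypothetical: one row per examinee, one column per item,
# scored 1 (right) or 0 (wrong).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
]

# Split on odd- versus even-numbered items (the most common way).
odd_half = [sum(row[0::2]) for row in item_scores]
even_half = [sum(row[1::2]) for row in item_scores]

r_half = pearson_r(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown, doubled length

print(round(r_half, 2), round(r_full, 2))
```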
Cronbach's coefficient alpha also involves administering the test
once to a single group of examinees. However, rather than splitting the
test in half, a special formula is used to determine the average degree of
inter-item consistency.
One way to interpret coefficient alpha is as the average reliability that
would be obtained from all possible splits of the test. Coefficient alpha
tends to be conservative and can be considered the lower boundary of a
test's reliability (Novick and Lewis, 1967).
When test items are scored dichotomously (right or wrong), a variation of
coefficient alpha known as the Kuder-Richardson Formula 20 (KR-20)
can be used.
The Kuder-Richardson formulas, particularly KR20 and KR21, are the
most frequently employed for determining internal consistency. We present

the latter formula (KR21) since KR20 is more difficult to calculate and
requires a computer program:

KR21 = (k / (k − 1)) × (1 − M(k − M) / (k · s²))

where

k = the number of items on the test
M = mean of the test scores
s² = variance of the test scores

Example:
A 30-item test was administered to a group of 30 students. The mean
score was 25 while the standard deviation was 3. Compute the KR21
index of reliability.

Here k = 30, M = 25 and s² = 3² = 9. So,

KR21 = (30/29) × (1 − 25(30 − 25) / (30 × 9)) = 1.0345 × (1 − 0.4630) ≈ 0.56
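The computation is easy to script; the short sketch below simply reproduces the worked example (k = 30, M = 25, s = 3):

```python
# Minimal sketch: KR-21 from the number of items, the mean and the SD,
# reproducing the worked example above (k = 30, M = 25, s = 3).

def kr21(k, mean, sd):
    variance = sd ** 2
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(round(kr21(30, 25, 3), 2))  # 0.56
```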

Content sampling is a source of error for both split-half reliability and


coefficient alpha.
For split-half reliability, content sampling refers to the error resulting from
differences between the content of the two halves of the test (i.e., the
items included in one half may better fit the knowledge of some
examinees than items in the other half);
For coefficient alpha, content (item) sampling refers to differences
between individual test items rather than between test halves. Coefficient
alpha also has as a source of error, the heterogeneity of the content
domain. A test is heterogeneous with regard to content domain when its
items measure several different domains of knowledge or behavior.
The greater the heterogeneity of the content domain, the lower the inter-
item correlations and the lower the magnitude of coefficient alpha.

Coefficient alpha could be expected to be smaller for a 200-item test that
contains items assessing knowledge of test construction, statistics, ethics,
epidemiology, environmental health, social and behavioral sciences,
rehabilitation counseling, etc. than for a 200-item test that contains
questions on test construction only.
The methods for assessing internal consistency reliability are useful when
a test is designed to measure a single characteristic, when the
characteristic measured by the test fluctuates over time, or when scores
are likely to be affected by repeated exposure to the test. They are not
appropriate for assessing the reliability of speed tests because, for these
tests, they tend to produce spuriously high coefficients. (For speed tests,
alternate forms reliability is usually the best choice.)
d. Inter-Rater (Inter-scorer, Inter-Observer) Reliability
Inter-rater reliability is of concern whenever test scores depend on a
rater's judgment.
A test constructor would want to make sure that an essay test, a
behavioral observation scale, or a projective personality test have
adequate inter-rater reliability. This type of reliability is assessed either by
calculating a correlation coefficient (e.g., a kappa coefficient or coefficient
of concordance) or by determining the percent agreement between two or
more raters.
Although the latter technique is frequently used, it can lead to erroneous
conclusions since it does not take into account the level of agreement that
would have occurred by chance alone. This is a particular problem for
behavioral observation scales that require raters to record the frequency
of a specific behavior.
In this situation, the degree of chance agreement is high whenever the
behavior has a high rate of occurrence, and percent agreement will
provide an inflated estimate of the measure's reliability.
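The contrast is easy to see in a short sketch (the ratings are hypothetical): two raters agree on 8 of 10 observations, but because one category occurs frequently, much of that agreement is expected by chance, so kappa comes out considerably lower than the raw percent agreement.

```python
# Minimal sketch: percent agreement versus Cohen's kappa for two raters.
# The category labels are hypothetical.

rater_a = ["on", "on", "on", "off", "on", "on", "off", "on", "on", "on"]
rater_b = ["on", "on", "on", "on", "on", "on", "off", "on", "off", "on"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement from each rater's marginal proportions.
categories = set(rater_a) | set(rater_b)
expected = sum(
    (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
)

kappa = (observed - expected) / (1 - expected)
print(round(observed, 2), round(kappa, 2))  # 0.8 versus 0.38
```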
Sources of error for inter-rater reliability include factors related to the
raters such as lack of motivation and rater biases and characteristics of
the measuring device.

An inter-rater reliability coefficient is likely to be low, for instance, when
rating categories are not exhaustive (i.e., don't include all possible
responses or behaviors) and/or are not mutually exclusive.
The inter-rater reliability of a behavioral rating scale can also be affected
by consensual observer drift, which occurs when two (or more) observers
working together influence each other's ratings so that they both assign
ratings in a similarly idiosyncratic way. (Observer drift can also affect a
single observer's ratings when he or she assigns ratings in a consistently
deviant way.) Unlike other sources of error, consensual observer drift
tends to artificially inflate inter-rater reliability.
The reliability (and validity) of ratings can be improved in several ways:
• Consensual observer drift can be eliminated by having raters
work independently or by alternating raters.
• Rating accuracy is also improved when raters are told that
their ratings will be checked.
• Overall, the best way to improve both inter- and intra-rater
accuracy is to provide raters with training that emphasizes
the distinction between observation and interpretation.
Factors that affect the Reliability Coefficient
The magnitude of the reliability coefficient is affected not only by the
sources of error discussed earlier, but also by the length of the test, the
range of the test scores, and the probability that the correct response to
items can be selected by guessing.
a. Test Length - The larger the sample of the attribute being
measured by a test, the less the relative effects of measurement
error and the more likely the sample will provide dependable,
consistent information.
Consequently, a general rule is that the longer the test, the larger the test's
reliability coefficient.
The Spearman-Brown prophecy formula is most associated with split-half
reliability but can actually be used whenever a test developer wants to

estimate the effects of lengthening or shortening a test on its reliability
coefficient.
For instance, if a 100-item test has a reliability coefficient of .84, the
Spearman-Brown formula could be used to estimate the effects of
increasing the number of items to 150 or reducing the number to 50. A
problem with the Spearman-Brown formula is that it does not always yield
an accurate estimate of reliability: In general, it tends to overestimate a
test's true reliability. This is most likely to be the case when the added
items do not measure the same content domain as the original items
and/or are more susceptible to the effects of measurement error.
Note that, when used to correct the split-half reliability coefficient, the
situation is more complex, and this generalization does not always apply:
When the two halves are not equivalent in terms of their means and
standard deviations, the Spearman-Brown formula may either over- or
underestimate the test's actual reliability.

r_kk = (k × r_11) / (1 + (k − 1) × r_11)

where:

r_kk = reliability of a test "k" times as long as the original test
r_11 = reliability of the original test
k = factor by which the length of the test is changed. To find k,
divide the number of items on the new test by the number of items
on the original. If you had 10 items on the original and 20 on the
new, k would be 20 / 10 = 2

Example:
A test made up of 12 items has reliability (r_11) of 0.68. If the number of
items is doubled to 24, will the reliability of the test improve?

Solution: r_11 = 0.68, k = 24 / 12 = 2. So,

r_kk = (2 × 0.68) / (1 + (2 − 1) × 0.68) = 1.36 / 1.68 ≈ 0.81

Doubling the test increases the reliability from .68 to .81.


Note: for the formula to work properly, the two tests must be equivalent
in difficulty. If you double a test and add only easy questions, the results
will be invalid.

b. Range of Test Scores
Since the reliability coefficient is a correlation coefficient, it is maximized
when the range of scores is unrestricted.
The range is directly affected by the degree of similarity of examinees with
regard to the attribute measured by the test.
When examinees are heterogeneous, the range of scores is maximized.
The range is also affected by the difficulty level of the test items. When all
items are either very difficult or very easy, all examinees will obtain either
low or high scores, resulting in a restricted range.
Therefore, the best strategy is to choose items so that the average
difficulty level is in the mid-range (p = .50).
c. Guessing
A test's reliability coefficient is also affected by the probability that
examinees can guess the correct answers to test items. As the probability
of correctly guessing answers increases, the reliability coefficient
decreases.
All other things being equal, a true/false test will have a lower reliability
coefficient than a four-alternative multiple-choice test which, in turn, will
have a lower reliability coefficient than a free recall test.

2.3.3 Fairness
An assessment procedure needs to be fair. This means many things:

First, students need to know exactly what the learning targets are and
what method of assessment will be used. If students do not know what
they are supposed to be achieving, then they could get lost in the maze of
concepts being discussed in class. Likewise, students have to be informed
how their progress will be assessed in order to allow them to strategize
and optimize their performance.

Second, assessment has to be viewed as an opportunity to learn rather


than an opportunity to weed out poor and slow learners. The goal should
be that of diagnosing the learning process rather than the learning object.

Third, fairness also implies freedom from teacher-stereotyping. Some


examples of stereotyping include: boys are better than girls in Math or girls
are better than boys in language. Such stereotyped images and thinking
could lead to unnecessary and unwanted biases in the way that teachers
assess their students.

2.3.4 Practicality and Efficiency


Another quality of a good assessment procedure is practicality and
efficiency. An assessment procedure should be practical in the sense that
the teacher should be familiar with it, does not require too much time and
is in fact, implementable. A complex assessment procedure tends to be
difficult to score and interpret resulting in a lot of misdiagnosis or too long
for a feedback period which may render the test inefficient.

2.3.5 Ethics in Assessment


The term "ethics" refers to questions of right and wrong. When teachers
think about ethics, they need to ask themselves if it is right to assess a
specific knowledge or investigate a certain question. Are there some
aspects of the teaching-learning situation that should not be assessed?
Here are some situations in which assessment may not be called for:
• Requiring students to answer a checklist of their sexual fantasies;
• Asking elementary pupils to answer sensitive questions without the
consent of their parents;
• Testing the mental abilities of pupils using an instrument whose
validity and reliability are unknown
When a teacher thinks about ethics, the basic question to ask in this
regard is "Will any physical or psychological harm come to anyone as a
result of the assessment or testing?" Naturally, no teacher would want this
to happen to any of his/her students.

Webster defines ethical (behavior) as 'conforming to the standards of
conduct of a given profession or group.' What teachers consider ethical is
therefore largely a matter of agreement among them. Perhaps, the most
important ethical consideration of all is the fundamental responsibility of a
teacher to do all in his or her power to ensure that participants in an
assessment program are protected from physical or psychological harm,
discomfort or danger that may arise due to the testing procedure. For
instance, a teacher who wishes to test a student's physical endurance
may ask students to climb a very steep mountain thus endangering them
physically.

Test results and assessment results are confidential. Such should
be known only by the student concerned and the teacher. Results should
be communicated to the students in such a way that other students would
not be in possession of information pertaining to any specific member of
the class.

The third ethical issue in assessment is deception. Should students be


deceived? There are instances in which it is necessary to conceal the
objective of the assessment from the students in order to ensure fair and
impartial results. When this is the case, the teacher has a special
responsibility to determine whether the use of such techniques is justified
by the educational value of the assessment, determine whether alternative
procedures are available that do not make use of concealment, and
ensure that students are provided with sufficient explanation as soon as
possible.

Finally, the temptation to assist certain individuals in class during


assessment or testing is ever present. In this case, it is best if the teacher
does not administer the test himself if he believes that such a concern
may, at a later time, be considered unethical.

Chapter 3 – Measures of Central Tendency and Variability
3.1 Introduction
A measure of central tendency is a single value that attempts to
describe a set of data (like scores) by identifying the central position
within that set of data or scores. As such, measures of central tendency
are sometimes called measures of central location.
Central Tendency refers to the center of a distribution of observations.
Where do scores tend to congregate? In a test of 100 items, where are
most of the scores? Do they tend to group around the mean score of 50 or
80?

There are three measures of central tendency – the mean,


median and the mode. Perhaps you are most familiar with the mean (often
called the average). But there are two other measures of central tendency,
namely the median and the mode. Is there such a thing as the best
measure of central tendency?
If the measures of central tendency indicate where scores congregate, the
measures of variability indicate how spread out a group of scores is, how
varied the scores are, or how far they are from the mean. Common
measures of dispersion or variability are range, variance and standard
deviation.

3.2 Measures of Central Tendency


3.2.1 Ungrouped data

The mean, median and mode are valid measures of central tendency but
under different conditions, one measure becomes more appropriate than
the others. For example, if the distribution contains extremely high or
extremely low scores, the median is a better measure of central tendency,
since the mean is affected by such extreme scores.

Mean

The mean or average or arithmetic mean is the most popular and most
well-known measure of central tendency. The mean is equal to the sum
of all the values in the data set divided by the number of values in the data
set. For example, 10 students in a Graduate School class got the following
scores in a 100-item test: 70, 72, 75, 77, 78, 80, 84, 87, 90 and 92.

The mean score of the group of 10 students is the sum of all their scores
divided by 10. The mean, therefore, is 805/10 equals 80.5.

80.5 is the average score of the group. There are 6 scores below the
average score (mean) of the group (70, 72, 75, 77, 78 and 80) and there
are 4 scores above the mean of the group (84, 87, 90 and 92).

The symbol we use for the mean is x̄ (read as "x-bar").

When not to use the mean

The mean has one main disadvantage. It is particularly susceptible to the


influence of outliers. These are values that are unusual compared to the
rest of the data set by being especially small or large in numerical value.

For example, consider the scores of 10 Grade 12 students in a 100-item


Statistics test below:

5 38 56 60 67 70 73 78 79 95

The mean score for these ten Grade 12 students is 62.1. However,
inspecting the raw data suggests that this mean may not be the best way
to accurately reflect the score of the typical Grade 12 student, as most
students have scores between 56 and 79. The mean is being pulled down
by the single extremely low score of 5. Therefore, in this situation,
we would like to have a better measure of central tendency. As we will find
out later, taking the median would be a better measure of central tendency
in this situation.
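A short sketch makes the comparison concrete: for these ten scores the mean is 62.1 while the median is 68.5, because the single outlier (5) pulls the mean down.

```python
# Minimal sketch: the outlier (5) pulls the mean below the median for the
# ten Statistics scores above.

scores = [5, 38, 56, 60, 67, 70, 73, 78, 79, 95]

mean = sum(scores) / len(scores)

ordered = sorted(scores)
mid = len(ordered) // 2
median = (ordered[mid - 1] + ordered[mid]) / 2  # even count: average middle pair

print(mean, median)  # 62.1 versus 68.5
```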

Median

The median is the middle score for a set of scores arranged from lowest
to highest. The median is less affected by extremely low and extremely
high scores.

The symbol for the median is x̃ (read as "x-tilde").


How do we find the median? Suppose we have the following data:

65 55 89 56 35 14 56 55 87 45 92

To determine the median, first we have to rearrange the scores into order
of magnitude (from smallest to largest).

14 35 45 55 55 56 56 65 87 89 92

Our median is the score at the middle of the distribution. In this case 56 is
the middle score. There are 5 scores before it and 5 scores after it. This
works fine when you have an odd number of scores, but what happens
when you have an even number of scores? What if you have 10 scores
like the scores below?

65 55 89 56 35 14 56 55 87 45

Arrange the data in order of magnitude (from smallest to
largest), then take the two middle scores (55 and 56) and compute the
average of the two scores. The median is 55.5. This gives us a more
reliable picture of the tendency of the scores.

Mode

This is the simplest both in concept and in application. By definition,


the mode is referred to as the most frequent value in the distribution.
We shall use the symbol x̂ (read as "x-hat") to represent the mode.

Study the score distribution below:

14 35 45 55 55 56 56 65 84 89

There are two most frequent scores, 55 and 56, so we have a score
distribution with two modes, hence a bimodal distribution.

3.2.2 Grouped data

Mean

To compute the value of the mean of data presented in a frequency
distribution, we will consider the midpoint method.
In using the Midpoint Method, the midpoint of each class interval is taken
as the representative of each class. These midpoints are multiplied by
their corresponding frequencies. The product is added and the sum is
divided by the total number of frequencies. The value obtained is
considered the mean of the grouped data. The formula is:

x̄ = ΣfX / n

where

f – represents the frequency of each class
X – the midpoint of each class
n – the total number of frequencies or sample size

To be able to apply the formula for the mean of grouped data, we shall
follow the steps below:

Step 1. Get the midpoint of each class


Step 2. Multiply each midpoint by its corresponding frequency
Step 3. Get the sum of the products in Step 2
Step 4. Divide the sum obtained in Step 3 by the total number of
frequencies. The result shall be rounded off to two decimal places.
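For readers who want to check their work, here is a minimal Python sketch of the midpoint method; it uses the frequency table from the "Try this!" exercise later in this section.

```python
# Minimal sketch of the midpoint method, using the frequency table from
# the "Try this!" exercise later in this chapter.

classes = [  # (lower limit, upper limit, frequency)
    (11, 22, 3), (23, 34, 5), (35, 46, 11), (47, 58, 19),
    (59, 70, 14), (71, 82, 6), (83, 94, 2),
]

n = sum(f for _, _, f in classes)                    # Step 4 divisor
fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)  # Steps 1-3

print(round(fx / n, 2))  # 52.9
```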

Median

Just like the mean, the computation of the value of the median is done
through interpolation. The procedure requires the construction of the less
than cumulative frequency column (<cf). The first step in finding the value
of the median is to divide the total number of frequencies by 2. This is
consistent with the definition of the median. The value n/2 shall be used to
determine the cumulative frequency before the median class, denoted by
cf_b. cf_b refers to the highest value under the <cf column that is less
than n/2. The median class refers to the interval that contains the median,
that is, where the n/2-th value is located. Hence, among the entries under
the <cf column which are greater than or equal to n/2, the class with the
smallest entry is the median class. If a distribution contains an interval
where the cumulative frequency is exactly n/2, the upper boundary of that
class will be the median and no interpolation is needed.
After identifying the median class, we shall approximate the position of the
median within the median class. This approximation shall be done by
subtracting the value of cf_b from n/2. The difference is then divided by
the frequency of the median class and multiplied by the size of the class
interval. The result is then added to the lower boundary of the median
class to get the median of the distribution.
The computing formula for the median for grouped data is given below.

Median = LB + ((n/2 − cf_b) / f_m) × i

where

LB – refers to the lower boundary of the median class
f_m – the frequency of the median class
cf_b – the less than cumulative frequency before the median class
i – the size of the class interval
n – the total number of frequencies or sample size

To be able to apply the formula for the median for grouped data, we shall
follow the steps below:
Step 1. Get n/2.
Step 2. Determine the value of cf_b.
Step 3. Determine the median class.
Step 4. Determine the lower boundary and the frequency of the median
class and the size of the class interval.
Step 5. Substitute the values obtained in Steps 1 – 4 into the formula.
Round off the final result to two decimal places.
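The same frequency table can be used to sketch the median computation in Python, as a check against the worked example that follows:

```python
# Minimal sketch of the grouped median, using the same frequency table
# as the grouped-mean sketch above.

classes = [
    (11, 22, 3), (23, 34, 5), (35, 46, 11), (47, 58, 19),
    (59, 70, 14), (71, 82, 6), (83, 94, 2),
]

n = sum(f for _, _, f in classes)
half = n / 2

cum = 0
for lo, hi, f in classes:
    if cum + f >= half:               # this interval contains the n/2-th value
        lb = lo - 0.5                 # lower class boundary
        i = hi - lo + 1               # class interval size
        median = lb + (half - cum) / f * i
        break
    cum += f                          # cum ends as <cf before the median class

print(round(median, 2))  # 53.45
```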

Mode

In the computation of the value of the mode for grouped data, it is


necessary to identify the class interval that contains the mode. This
interval, called the modal class, contains the highest frequency in the
distribution. The next step after getting the modal class is to determine the
mode within the class. This value may be approximated by getting the
differences between the frequency of the modal class and the frequencies
of the classes before and after it. If we let d1 be the difference between
the frequency of the modal class and the frequency of the interval
preceding the modal class, and d2 be the difference between the
frequency of the modal class and the frequency of the interval after the
modal class, then the mode within the class shall be approximated using
the expression:

(d1 / (d1 + d2)) × i
If this expression is added to the lower boundary of the modal class, then
we can come up with the computing formula for the value of the mode for
grouped data. The formula is:

Mode = LB + (d1 / (d1 + d2)) × i

where LB is the lower boundary of the modal class and i is the size of the
class interval.
To be able to apply the formula for the mode for grouped data, we shall
consider the following steps:

Step 1. Determine the modal class


Step 2. Get the value of d1
Step 3. Get the value of d2
Step 4. Get the lower boundary of the modal class
Step 5. Apply the formula by substituting the values obtained in the
preceding steps
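A matching sketch for the mode, again using the same frequency table:

```python
# Minimal sketch of the grouped mode for the same frequency table.

classes = [
    (11, 22, 3), (23, 34, 5), (35, 46, 11), (47, 58, 19),
    (59, 70, 14), (71, 82, 6), (83, 94, 2),
]

freqs = [f for _, _, f in classes]
m = freqs.index(max(freqs))           # index of the modal class

d1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
d2 = freqs[m] - (freqs[m + 1] if m < len(freqs) - 1 else 0)

lo, hi, _ = classes[m]
lb = lo - 0.5                         # lower boundary of the modal class
i = hi - lo + 1                       # class interval size

mode = lb + d1 / (d1 + d2) * i
print(round(mode, 2))  # 53.88
```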

Try this!

Find the mean, median and mode of this frequency table.


Scores        f
11 – 22       3
23 – 34       5
35 – 46      11
47 – 58      19
59 – 70      14
71 – 82       6
83 – 94       2

To be able to compute the value of the mean, we shall follow the steps
discussed earlier.
Step 1. Get the midpoint of each class. The midpoints are shown in the
3rd column.

Scores        f       X
11 – 22       3      16.5
23 – 34       5      28.5
35 – 46      11      40.5
47 – 58      19      52.5
59 – 70      14      64.5
71 – 82       6      76.5
83 – 94       2      88.5

Step 2. Multiply each midpoint by its corresponding frequency. The


products are shown in the 4th column.
Scores        f       X        fX
11 – 22       3      16.5      49.5
23 – 34       5      28.5     142.5
35 – 46      11      40.5     445.5
47 – 58      19      52.5     997.5
59 – 70      14      64.5     903.0
71 – 82       6      76.5     459.0
83 – 94       2      88.5     177.0

Step 3. Get the sum of the products in Step 2.


Scores        f       X        fX
11 – 22       3      16.5      49.5
23 – 34       5      28.5     142.5
35 – 46      11      40.5     445.5
47 – 58      19      52.5     997.5
59 – 70      14      64.5     903.0
71 – 82       6      76.5     459.0
83 – 94       2      88.5     177.0
Total        60               3174.0

Step 4. Divide the result in Step 3 by the sample size. The result is the
mean of the distribution. Hence,

x̄ = ΣfX / n = 3174.0 / 60 = 52.90

To compute for the median, we shall construct the less than cumulative
frequency column.
We can use the existing table when we solved for the mean.
Scores        f       X        fX      <cf
11 – 22       3      16.5      49.5      3
23 – 34       5      28.5     142.5      8
35 – 46      11      40.5     445.5     19
47 – 58      19      52.5     997.5     38   ← Median class
59 – 70      14      64.5     903.0     52
71 – 82       6      76.5     459.0     58
83 – 94       2      88.5     177.0     60

Step 1. n/2 = 60/2 = 30
Step 2. cf_b = 19
Step 3. Median class: 47 – 58, so the class interval i = 12
Step 4. LB = 46.5; f_m = 19
Step 5. Median = 46.5 + ((30 − 19) / 19) × 12 = 46.5 + 6.95 = 53.45

To compute for the mode, we can still use the existing table.
Scores        f       X        fX      <cf
11 – 22       3      16.5      49.5      3
23 – 34       5      28.5     142.5      8
35 – 46      11      40.5     445.5     19
47 – 58      19      52.5     997.5     38   ← Modal class
59 – 70      14      64.5     903.0     52
71 – 82       6      76.5     459.0     58
83 – 94       2      88.5     177.0     60

To get the values of d1 and d2, we have:

d1 = 19 − 11 = 8
d2 = 19 − 14 = 5

Substituting these values into the formula, we have

Mode = 46.5 + (8 / (8 + 5)) × 12 = 46.5 + 7.38 = 53.88

3.2.3 Comparison

Although there are many types of averages, the three measures that were
discussed are considered the simplest and the most important of all.
In the case of the mean, the following are some of the observations that
can be made.
a) The mean always exists in any distribution. This implies that
for any set of data, the mean can always be computed
b) The value of the mean in any distribution is unique. This
implies that for any distribution, there is only one possible value of
the mean
c) In the computation for this measure, it takes into
consideration all the values in the distribution
In the case of the median, we have the following observations.
a) Like the mean, the median also exists in any distribution
b) The value of the median is also unique
c) This is a positional measure
For the third measure, the mode has the following characteristics.
a) It does not always exist
b) If the mode exists, it is not always unique
c) In determining the value of the mode, it does not take into account
all the values in the distribution
Skewness

Of the three measures of central tendency, the mean is considered the


most important. Since all values are considered in the computation, it can
be used in higher statistical treatment.
There are some instances, however, when the mean is not a good
representative of a set of data. This happens when a set of data contains
extreme values either to the left or to the right of the average. In this
situation, the value of the mean is pulled to the direction of these extreme
values. Thus, the median should be used instead.
When a set of data is symmetric or normally distributed, the three
measures are identical or approximately equal. When the distribution is
skewed, that is, either negatively or positively skewed, the three averages
diverge. In any case, however, the value of the median will always be
between the mode and the mean.
A set of data is said to be positively skewed when the graph of the
distribution has a longer tail to the right. The data is said to be
negatively skewed when the longer tail is at the left.
3.3 Measures of Variability
The measures of central tendency discussed earlier simply approximate
the central value of the distribution but such descriptions are not enough
to be able to adequately describe the characteristics of a set of data.
Hence, there is a need to consider how the values are scattered on either
side of the center. Values used to determine the scatter of values in a
distribution are called measures of variation. We will discuss in this part
the range, the variance and the standard deviation.
3.3.1 Range
Among the measures of variation, the range is considered the simplest.
Earlier, we defined the range as the difference between the highest and
the lowest value in the distribution. For example, if the lowest value in the
distribution is 12 and the highest value is 125, then the range is the
difference between 125 and 12 which is 113. In symbols, if we let R be the
range, then
R = H – L
Where H – represents the highest value
L – represents the lowest value
In the case of grouped data, the difference between the highest upper
class boundary and the lowest lower class boundary is considered the
range. The rationale is that the class boundaries are considered the true
limits.
The range, of course, has some disadvantages. First, this value is always
affected by extreme values. Second, in the process of computing the
value of the range, not all values are considered. Thus, the range does not

consider the variation of the items relative to the central value of the
distribution.
3.3.2 Variance
Variability can also be defined in terms of how close the scores in the
distribution are to the middle of the distribution. Using the mean as the
measure of the middle of the distribution, the variance is defined as the
average squared difference of the scores from the mean. The formula for
variance (s²) is given below:

s² = Σ f(X − x̄)² / n

where

f – frequency of each class
X – midpoint of each class interval
x̄ – mean
n – sample size
To be able to apply the formula for the variance, we shall consider the
steps below
Step 1. Compute the value of the mean
Step 2. Determine the deviation by subtracting the mean from
the midpoint of each class interval
Step 3. Square the deviations obtained in Step 2
Step 4. Multiply the frequencies by their corresponding squared
deviations
Step 5. Add the results in Step 4
Step 6. Divide the result in Step 5 by the sample size

3.3.3 Standard Deviation


We are now going to consider one of the most important measures of
variation – the standard deviation. Recall that, in the computation of the
variance, the deviation x – x was squared. This implies that the variance
33 | P a g e
is expressed in square units. Extracting the square root of the value of the
variance will give the value of the standard deviation.
If we let be σ (sigma) the standard deviation, then

or simply, the standard deviation is just the square root of the variance.

Try this!
Compute the Range, Variance and Standard Deviation of the example
given earlier (Computation of Measures of Central Tendency).
Range
R = H – L = 94 – 11 = 83

Variance
First, we will reproduce the frequency distribution. Applying the steps
stated before, we have
Scores        f       X        fX      X − x̄   (X − x̄)²   f(X − x̄)²
11 – 22       3      16.5      49.5    −36.4   1324.96     3974.88
23 – 34       5      28.5     142.5    −24.4    595.36     2976.80
35 – 46      11      40.5     445.5    −12.4    153.76     1691.36
47 – 58      19      52.5     997.5     −0.4      0.16        3.04
59 – 70      14      64.5     903.0     11.6    134.56     1883.84
71 – 82       6      76.5     459.0     23.6    556.96     3341.76
83 – 94       2      88.5     177.0     35.6   1267.36     2534.72
Total        60                                            16406.40

s² = Σ f(X − x̄)² / n = 16406.40 / 60 = 273.44

Standard Deviation
It is just the square root of the variance so,
σ=

σ=
σ = 16.54
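As a check (a sketch, not part of the module), the same figures can be reproduced in a few lines of Python:

```python
# Minimal sketch verifying the grouped variance and standard deviation
# just computed.

classes = [
    (11, 22, 3), (23, 34, 5), (35, 46, 11), (47, 58, 19),
    (59, 70, 14), (71, 82, 6), (83, 94, 2),
]

n = sum(f for _, _, f in classes)
mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n  # 52.9

variance = sum(
    f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes
) / n

sd = variance ** 0.5
print(round(variance, 2), round(sd, 2))  # 273.44 16.54
```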

3.3.4 Sample Variance and Sample Standard Deviation


Sometimes, our data are only a sample of the whole population.
Example: Sam has 20 rose bushes, but only counted the flowers on 6 of
them.
The population is all 20 rose bushes, and the sample is the 6 bushes that
Sam counted among the 20. Let us say that Sam's flower counts are 9, 4,
6, 13, 18 and 13; we can still estimate the Variance and Standard
Deviation.
When we use the sample as an estimate of the whole population, the
formula for the variance will change to:

s² = Σ(x − x̄)² / (n − 1)

and the Standard Deviation formula is

s = √( Σ(x − x̄)² / (n − 1) )

Just remember that Standard Deviation will always be the square root of
the Variance.
The important change in the formula is "n − 1" instead of "n" (which is
called Bessel's correction); the procedure otherwise stays the same. The
symbol will also change to reflect that we are working on a sample
instead of the whole population (σ is changed to s when using the
sample SD).

Why take a sample?
Mostly because it is easier and cheaper. Imagine you want to know what
the whole university thinks. You cannot ask thousands of people, so
instead you may ask maybe only 300 people. Samuel Johnson once said,
"You don't have to eat the whole ox to know that the meat is tough."
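A small sketch with Sam's six counts shows the effect of the n − 1 divisor; the population-formula result (σ) is included only for comparison.

```python
# Minimal sketch using Sam's six flower counts: the sample formulas divide
# by n - 1 instead of n (Bessel's correction).

counts = [9, 4, 6, 13, 18, 13]
n = len(counts)
mean = sum(counts) / n                     # 10.5

ss = sum((x - mean) ** 2 for x in counts)  # sum of squared deviations: 133.5

population_var = ss / n                    # 22.25
sample_var = ss / (n - 1)                  # 26.7

print(round(population_var ** 0.5, 2))     # sigma = 4.72
print(round(sample_var ** 0.5, 2))         # s = 5.17
```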
More notes on Standard Deviation
The Standard Deviation is simply the square root of the variance. It is an
especially useful measure of variability when the distribution is normal or
approximately normal because the proportion of the distribution within a
given number of standard deviations from the mean can be calculated.
For example, 68% of the distribution is within one standard deviation of the
mean and approximately 95% of the distribution is within two standard
deviations of the mean. Therefore, if you have a normal distribution with a
mean of 50 and a standard deviation of 10, then 68% of the distribution
would be between 50 – 10 = 40 and 50 + 10 = 60. Similarly, about 95% of
the distribution would be between 50 – (2 x 10) = 30 and 50 + (2 x 10) =
70. The symbol for the population standard deviation is σ.
Standard deviation is a measure of dispersion: the more dispersed the
data, the less consistent the data are. A lower standard deviation means
that the data are more clustered around the mean and hence the data set
is more consistent.

TRY THESE AGAIN!

Choose the letter of the best answer.

1. The most stable measure of central tendency


a. Mean
b. Median
c. Mode
d. Range
2. A measure of central tendency that is sensitive to extreme values
a. Mean
b. Median
c. Mode
d. Range
3. It is calculated by adding all values, then dividing the sum by the
number of values
a. Mean
b. Rough median
c. Mode
d. Counting median
4. The most stable measure of variability
a. Average deviation
b. Range
c. Quartile deviation
d. Standard deviation
5. It is used with the median to classify a class into four groups
a. Average deviation
b. Range
c. Quartile deviation
d. Standard deviation
