Measurement Module
TRAINING INSTITUTE
By
Addis Ababa
Contents
AN OVERVIEW OF MEASUREMENT AND EVALUATION
UNIT INTRODUCTION
1.1 LEARNING OBJECTIVES OF THE UNIT
1.2 INTRODUCTION
1.3 PURPOSES/FUNCTIONS OF MEASUREMENT AND EVALUATION
BLOOM'S TAXONOMY OF EDUCATIONAL OBJECTIVES
2.1 Definition of Objectives
2.2 The Importance of Stating Instructional Objectives
2.3 Taxonomy of Educational Objectives
2.4 Criteria for Selecting/Writing Appropriate Objectives
2.5 Steps for Stating Instructional Objectives
CLASSROOM ACHIEVEMENT TESTS AND ASSESSMENTS
3.1 INTRODUCTION
3.2 LEARNING OBJECTIVES OF THE UNIT
3.3 TYPES OF TESTS USED IN THE CLASSROOM
TEST DEVELOPMENT – PLANNING THE CLASSROOM TEST
4.1 Test Development – Planning the Classroom Test
4.2 Item Writing
ASSEMBLING, REPRODUCING, ADMINISTERING, AND SCORING OF CLASSROOM TESTS
5.1 UNIT INTRODUCTION
5.2 LEARNING OBJECTIVES OF THE UNIT
5.3 INTRODUCTION
5.4 ARRANGING THE TEST ITEMS
5.5 WRITING TEST DIRECTIONS
5.6 REPRODUCING TEST ITEMS
5.7 ADMINISTERING THE TEST
5.8 SCORING THE TEST
SUMMARIZING AND INTERPRETING TEST SCORES
6.1 Methods of Interpreting Test Scores
6.2 Descriptive Statistics
RELIABILITY AND VALIDITY OF A TEST
7.1 Test Reliability
7.2 Validity of a Test
JUDGING THE QUALITY OF A CLASSROOM TEST
8.1 Judging the Quality of a Classroom Test
8.2 The Process of Item Analysis for Norm-Referenced Classroom Tests
8.3 Item Analysis and Criterion-Referenced Mastery Tests
8.4 Building a Test Item File (Item Bank)
UNIT ONE
AN OVERVIEW OF MEASUREMENT AND EVALUATION
UNIT INTRODUCTION
In this unit you will be introduced to definitions of basic terms and concepts, the functions of measurement and evaluation, issues raised in measurement and evaluation, measuring scales, the various ways of classifying tests, and the like.
1.2 INTRODUCTION
Teachers have always been concerned with measuring and evaluating students' learning progress. Schools certify learners based on the knowledge and skills they acquire, parents want to know how much their children are achieving in their education, and employers seek to ascertain the knowledge and skills possessed by individuals before recruitment and selection.
Nevertheless, there have been criticisms of the quality of educational products. For example, there are students who are unable to read and write effectively, lack fundamental arithmetic skills, and cannot engage in higher-order thinking processes. These and other problems require TVET teachers and educators to be more concerned with valid and reliable measurement of educational products. Indeed, measurement and evaluation in education is the backbone of educational practice and curriculum implementation, serving to maintain the quality of school products, including those of the TVETs.
Learning Task 1
_________________________________________________________________
__________________________________________________________
Terms in educational measurement and evaluation are often used interchangeably. Let us see their definitions and make comparisons among them.
A test is a measuring tool or instrument in education. More specifically, a test is a kind or class of measurement device typically used to find out something about a person. Most of the time, when you finish a lesson or a week's lessons, your teacher gives you a test. This test is an instrument the teacher gives you in order to obtain data on which you are judged. It is a common educational device that an individual completes; the intent is to determine changes or gains, using such instruments as inventories, questionnaires, scales, etc.
Testing, on the other hand, is the process of administering the test to the pupils. In other words, the process of having you take the test in order to obtain a quantitative representation of the cognitive or non-cognitive traits you possess is called testing. So the instrument or tool is the test, and the process of administering the test is testing.
Assessment
There are many definitions and explanations of assessment in education.
Let us look at a few of them.
i. Freeman and Lewis (1998): to assess is to judge the extent of students' learning.
ii. Rowntree (1977): Assessment in education can be thought of as occurring whenever one person,
in some kind of interaction, direct or indirect, with another, is conscious of obtaining and
interpreting information about the knowledge and understanding, of abilities and attitudes of that
other person. To some extent or other, it is an attempt to know the person.
iii. Erwin (in Brown and Knight, 1994): Assessment is a systematic basis for making inferences about the learning and development of students… the process of defining, selecting, designing, collecting, analysing, interpreting and using information to increase students' learning and development.
You will have to note from these definitions that:
Assessment is a human activity.
Assessment involves interaction, which aims at understanding what the learners have achieved.
Assessment can be formal or informal.
Assessment may be descriptive rather than judgmental in nature.
Its role is to increase students' learning and development.
It helps learners to diagnose their problems and to improve the quality of their subsequent learning.
In conducting assessment, one can use a variety of techniques, such as self-reports, observations, interviews, projective tests, paper-and-pencil tests, performance tests, surveys, projects, etc.
Learning Task 2
Dear teacher, please list the assessment techniques you are using in your
teaching activities.
_____________________________________________________________________
What is the role of measurement and evaluation in maintaining the quality
of educational products?
_________________________________________________________________
Have you noticed some important assessment techniques that can be used to
obtain information about your students' characteristics and behaviour? Yes
________ No ______
What plans will you set in order to improve your techniques for assessing your
students, the effectiveness of your teaching methods, the curriculum, etc.?
1.1.1 Measurement
In simple terms, measurement refers to assigning a numerical value to a certain attribute or behaviour. It is a systematic process of obtaining the quantified degree to which a trait or an attribute is present in an individual or object. In other words, it is a systematic assignment of numerical values or figures to a trait or an attribute of a person or object. For instance: What is the height of your friend? What is the weight of the meat? What is the length of the classroom?
Measurement conveys a broad meaning. It uses a variety of ways to obtain information in quantitative form: paper-and-pencil tests, rating scales, and observations can all be used to assign a numerical value to a given trait or behaviour. Measurement can also refer both to the score obtained by the measuring device and to the process used to obtain the score.
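As an illustration (not part of the original module), measurement as number assignment can be sketched in a few lines of code: a paper-and-pencil test is the measuring instrument, and the count of correct answers is the numerical value assigned to the trait. The answer key and responses below are hypothetical.

```python
# Illustrative only: the answer key and responses are hypothetical.
answer_key = {1: "B", 2: "D", 3: "A", 4: "C"}
student_answers = {1: "B", 2: "D", 3: "C", 4: "C"}

# The test is the measuring instrument; the count of correct answers
# is the numerical value assigned to the trait (achievement).
score = sum(1 for q, correct in answer_key.items()
            if student_answers.get(q) == correct)
print(f"{score}/{len(answer_key)}")  # prints 3/4
```

The score 3/4 is the measurement; judging whether 3/4 is good enough would be evaluation.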
Learning Task 3
Evaluation
What is evaluation to you? How does it differ from assessment and measurement?
In the context of teaching-learning, evaluation refers to the process of making judgment about the value of
a student’s performance.
Evaluation is formative when it is conducted over small bodies of content to provide feedback for directing further instruction and student learning. Formative evaluation thus refers to an ongoing process carried out before instruction, during instruction, and at the end of a term or unit.
Summative evaluation, on the other hand, is an evaluation conducted over the larger outcomes of an extended instructional sequence, such as an entire course or a large part of it.
Summative evaluation may serve for reporting a student’s overall achievement, licensing and certifying,
predicting success in related courses, assigning marks, and reporting overall achievement of a class.
Evaluation in the classroom context is directed at the improvement of student learning by supporting the instructional process, as follows. Evaluation:
Serves to assess students' abilities and needs using appraisal techniques. This may help in deciding on the proper placement of students in learning sequences.
Identifies specific learning deficiencies and helps in devising mechanisms to overcome learning difficulties.
Assists in following up student progress during instruction through formative means.
Provides evidence for judging the effectiveness of instruction or the curriculum through summative evaluation.
In short, evaluation is taken as a professional judgement made to check the desirability or value of
something. It can also be interpreted as the determination of the match or mismatch between learning
achievement and learning objectives.
Self-check exercise
Types of evaluation
The different types of evaluation are: placement, formative, diagnostic and summative evaluations.
Placement Evaluation
This is a type of evaluation carried out in order to place students in the appropriate group or class. In some schools, for instance, students are assigned to classes according to their subject combinations, such as science, technical, arts, commercial, etc. Before this is done, an examination is carried out in the form of a pretest or aptitude test. Placement evaluation can also be made by the teacher to find out the entry behaviour of the students before teaching starts. This may help the teacher to adjust the lesson plan. Tests like readiness tests, ability tests, aptitude tests and achievement tests can be used.
Formative Evaluation
This is a type of evaluation designed to help both the student and the teacher pinpoint areas where the student has failed to learn, so that the failure may be rectified. It provides feedback to the teacher and the student and thus helps in estimating teaching success, e.g. through weekly tests, terminal examinations, etc.
Diagnostic Evaluation
This type of evaluation is carried out, most of the time, as a follow-up to formative evaluation. As a teacher, you have used formative evaluation to identify some weaknesses in your students. You have also applied some corrective measures which have not shown success. What you will now do is design a diagnostic test, applied during instruction, to find out the underlying causes of students' persistent learning difficulties. Diagnostic tests can take the form of achievement tests, performance tests, self-ratings, interviews, observations, etc.
Summative evaluation:
This is the type of evaluation carried out at the end of the course of instruction to determine the extent to
which the objectives have been achieved. It is called a summarizing evaluation because it looks at the entire
course of instruction or program and can pass judgment on the teacher and students, the curriculum and the
entire system. It is used for certification. Think of the educational certificates you have acquired from
examination bodies. These were awarded to you after you had gone through some types of examination.
This is an example of summative evaluation.
Instructional Functions
Measurement and evaluation stimulates teachers to clarify and refine meaningful learning objectives. Tests
provide means of feedback to the teacher. Feedback from tests helps the teacher provide more appropriate
instructional guidance for individual students as well as for the class as a whole. Well-designed tests may
also be of value for student self-diagnosis, as they help students identify specific weaknesses in their
learning.
Similarly, properly constructed tests can motivate learning. As a rule, students pursue mastery of learning objectives more diligently if they expect to be evaluated. One can also say that tests are a useful means of overlearning. When a student reviews, interacts with, or practices skills and concepts even after they have been mastered, he or she is engaging in what psychologists call overlearning. Even if a student correctly answered every question on a test, he or she may be engaging in behaviour that is instructionally valuable apart from the evaluation function served by the test. And this is important for long-term retention.
Self-check exercise
1. Explain the way you are using measurement and evaluation (testing) to improve
instruction/your teaching.
______________________________________________________________________
______________________________________________________________________
2. How helpful is testing for improving students' learning?
3. What is the relevance of quality control in the context of school and learning process?
__________________________________________________________________
Administrative functions
Measurement and evaluation are very useful in providing a mechanism of quality control for a school or school system: they facilitate program evaluation and research, support better classification and placement decisions, contribute to the quality of selection decisions, and are useful for accreditation, mastery, or certification purposes.
Students must have accurate self-concepts in order to make sound decisions. They depend, to some extent, on the school to help them develop those self-concepts. Tests of aptitude and achievement, and interest and personality inventories, provide students with information on salient traits and help them develop realistic self-concepts. The schoolteacher can also help, particularly by providing students with information concerning their mastery of the subject matter.
Tests can also be used in diagnosing an individual’s special abilities and aptitudes. Assisting a student with
educational and vocational choices, guiding him or her in the selection of curricular and extra-curricular
activities, and helping him or her solve personal and social adjustment problems, all require an objective
knowledge of the students’ abilities, interests, attitudes, and other personal characteristics or traits.
In using testing for the guidance function, a series of assessments needs to be administered, including an interview, an interest inventory, a personality questionnaire, various aptitude tests, and an achievement battery. Information from these assessments, along with additional background information, facilitates a student's decision-making processes for educational and vocational choices.
Self-check exercise
How do you gather information in order to provide guidance services for your students?
The following are also some of the more specific functions of measurement and evaluation:
i. Placement of students, which involves bringing students appropriately into the learning sequence and the classification or streaming of students according to ability or subjects.
ii. Selecting students for courses: general, professional, technical, commercial, etc.
iii. Certification: this helps to certify that a student has achieved a particular level of performance.
iv. Stimulating learning: this can be motivating the student or teacher, providing feedback, suggesting suitable practice, etc.
v. Improving teaching: by helping to review the effectiveness of teaching arrangements.
vi. For research purposes.
vii. For purposes of curriculum modification.
viii. For the purpose of selecting students for employment.
ix. For modification of teaching methods.
x. For purposes of student promotion.
xi. For reporting students' progress to their parents.
xii. For the award of scholarships and merit awards.
xiii. For the admission of students into educational institutions.
xiv. For decisions on students' continuation.
Norm-Referenced and Criterion Referenced Measures
Dear teacher, try to define and conceptualize the following terminologies and phrases:
Group---------
Norm---------
Reference------------
Cut point--------------
Standard -----------
Criterion ----------------
Group performance--------------
Group average--------------
We hope you can make distinctions among the terminologies you defined above. When we consider the evaluation or judgment aspect, any evaluation instrument can be either norm-referenced or criterion-referenced in the interpretation of individual scores.
Dear teacher, now give your own definition for the following phrases:
1. Norm-referenced tests
2. Criterion referenced tests
How do you judge the standing of a tested individual based on norm-referencing (referring to the group's performance) and criterion-referencing (referring to a given set standard or cut point in test scores)?
Norm-referenced tests
These are tests used to compare the performance of an individual with those of other individuals of
comparable background. In other words, the score of an individual in a norm-referenced testing has meaning
only when it is viewed in relation to the scores of other individuals on the test. The success or failure of an
individual on this kind of test is, therefore, determined on the basis of how he/she performs in relation to
his/her colleagues’ performance on the test.
A score of 35% might indicate superior ability if it happens to be one of the highest scores, whereas an individual with a score as high as 75% might be labelled "weak" in relative terms if that score happens to be one of the lowest in the array of scores derived from the test.
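The relative interpretation described above can be sketched in a few lines of code. This is an illustrative example, not part of the module: the class scores are hypothetical, and percentile rank is just one common way of expressing an individual's standing relative to the group.

```python
def percentile_rank(score, group_scores):
    """Percent of the comparison group scoring below the given score."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)

# Hypothetical class results (%). Here 35% happens to be the top score,
# so in norm-referenced terms it signals superior relative standing.
class_scores = [20, 22, 25, 28, 30, 32, 33, 35]
print(percentile_rank(35, class_scores))  # prints 87.5
```

Note that the same raw score of 35% would rank near the bottom in a stronger group; the number only acquires meaning from the group it is compared with.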
Criterion-referenced tests
In contrast to norm-referenced tests, criterion-referenced tests are tests in which the score of an individual is related to a specific performance standard for interpretation purposes. Such tests are labelled criterion-referenced, the criterion in this respect being the specific performance standard.
If a given score of an examinee is equal to or greater than the specified standard (i.e. the criterion), the examinee is said to have passed; otherwise he/she is deemed to have failed the test or examination. Therefore, the success or failure of an individual on a criterion-referenced test (CRT) depends on what he/she scores on the test in relation to the set standard, and this may depend to a large extent on the test content itself, or on the relative strictness of the marker if the test is not an objective one. With any criterion measure, therefore, it is possible for all the members of a class to pass or to fail the test, depending on a number of obvious factors such as the easiness or difficulty of the test items.
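By contrast, a criterion-referenced interpretation ignores the group entirely and compares each score to the set standard. The following is a minimal illustrative sketch, not part of the module, assuming a hypothetical cut score of 70%.

```python
CUT_SCORE = 70  # hypothetical performance standard (the criterion), in %

def crt_interpret(score, cut_score=CUT_SCORE):
    """Pass/fail judgment against the fixed standard, ignoring the group."""
    return "pass" if score >= cut_score else "fail"

# Every member of a class can pass (or fail) regardless of classmates.
scores = [45, 70, 88, 62]
print([crt_interpret(s) for s in scores])  # prints ['fail', 'pass', 'pass', 'fail']
```

Because the judgment depends only on the cut score, it is entirely possible for all scores in a class to fall on one side of it, which is exactly the point made above.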
SELF CHECK
1. Explain the difference between measurement and evaluation, assessment and testing.
2. What are the types of evaluation?
3. What is the major difference between test and testing?
4. In your own words define Assessment.
5. Give an example of a test.
6. What are the major differences and similarities between formative evaluation and diagnostic
evaluation?
7. List five purposes of measurement and evaluation.
8. List the instruments you can use to measure the following: weight, height, length, achievement in Mathematics, performance of students in technical drawing, and workers' attitude towards delay in the payment of salaries.
UNIT TWO
BLOOM’S TAXONOMY OF EDUCATIONAL OBJECTIVES
Introduction
Dear Trainees! Welcome to the second unit of this module. In this unit, you are going to study Bloom's taxonomy of educational objectives. In addition, you will also look at the steps for stating instructional objectives.
Objectives
Dear learner, at the end of this unit you will be able to:
Dear Trainees, before you read the sections below, try to define the following
terms.
1. Objective
2. Aim
3. Goal
Educational aim: a very broad educational objective, which may be stated by the government of the country or the ministry of education, describing what the learners are to become. E.g. to develop an all-rounded personality.
Educational goal: a general purpose of education that is stated as a broad, long-term outcome to work toward. Goals are primarily used in policy making and general program planning. E.g. to develop proficiency in the skills of reading, writing and arithmetic.
General instructional objective: an intended outcome of instruction that has been stated in terms general enough to encompass a set of specific learning outcomes. E.g. understand technical terms.
Specific instructional objectives (learning outcomes): sets of more detailed statements that specify the means by which the various goals of the course, course units, and educational package will be met. E.g. define technical terms in his or her own words.
2.2 The Importance of Stating Instructional Objectives
Well-stated instructional objectives serve as a guide for teaching and for testing, evaluation, and assessment. Specifically, they are useful to:
Help a teacher guide and monitor students' learning
Provide criteria for evaluating students' outcomes
Help in selecting or constructing assessment techniques
Help in communicating to parents, students, administrators and others what is expected of students
Help in selecting appropriate instructional methods, materials, activities, contents and the like
Serve as feedback on how far the educational goals have been achieved.
2.3 Taxonomy of Educational Objectives
Benjamin Bloom and a group of educators came up with a classification of levels of difficulty in what you can do with what you know. This classification of the different levels at which you can approach a problem is called a taxonomy, and it comprises the Cognitive, Affective, and Psychomotor domains.
The Cognitive Domain is concerned with knowledge outcomes and intellectual abilities and skills.
The Affective Domain is concerned with attitudes, interests, appreciation, and modes of adjustment.
The Psychomotor Domain is concerned with motor skills.
Each of these three domains is further divided into categories and subcategories as follows.
Cognitive Domain

Knowledge
Definition: Recall and remember information.
Sample verbs: defines, describes, identifies, knows, labels, lists, matches, names, outlines, recalls, recognizes, reproduces, selects, states, memorizes, tells, repeats

Comprehension
Definition: Understand the meaning, translation, interpolation, and interpretation of instructions and problems; state a problem in one's own words; establish relationships between data, principles, generalizations or values.
Sample verbs: comprehends, converts, defends, distinguishes, estimates, explains, extends, generalizes, gives examples, infers, interprets, paraphrases, predicts, rewrites, summarizes, translates, shows relationship of, characterizes, associates, differentiates, classifies, compares

Application
Definition: Use a concept in a new situation, or unprompted use of an abstraction; apply what was learned in the classroom to novel situations in the workplace; facilitate transfer of knowledge to new or unique situations.
Sample verbs: applies, changes, computes, constructs, demonstrates, discovers, manipulates, modifies, operates, predicts, prepares, produces, relates, solves, uses, systematizes, experiments, practices, exercises, utilizes, organizes

Analysis
Definition: Separate material or concepts into component parts so that the organizational structure may be understood; distinguish between facts and inferences.
Sample verbs: analyzes, breaks down, compares, contrasts, diagrams, deconstructs, differentiates, discriminates, distinguishes, identifies, illustrates, infers, outlines, relates, selects, separates, investigates, discovers, determines, observes, examines

Synthesis
Definition: Build a structure or pattern from diverse elements; put parts together to form a whole, with emphasis on creating a new meaning or structure; originality and creativity.
Sample verbs: categorizes, combines, compiles, composes, creates, devises, designs, explains, generates, modifies, organizes, plans, rearranges, reconstructs, relates, reorganizes, revises, rewrites, summarizes, tells, writes, synthesizes, imagines, conceives, concludes, invents, theorizes, constructs

Evaluation
Definition: Make judgments about the value of ideas or materials.
Sample verbs: appraises, compares, concludes, contrasts, criticizes, critiques, defends, describes, discriminates, evaluates, explains, interprets, justifies, relates, summarizes, supports, calculates, estimates, consults, judges, measures, decides, discusses, values, accepts/rejects
Affective Domain

Receiving Phenomena
Definition: Awareness, willingness to hear, selected attention.
Sample verbs: asks, chooses, describes, follows, gives, holds, identifies, locates, names, points to, selects, sits, erects, replies, uses

Responding to Phenomena
Definition: Active participation on the part of the learners; attends and reacts to a particular phenomenon. Learning outcomes may emphasize compliance in responding, willingness to respond, or satisfaction in responding (motivation).
Sample verbs: answers, assists, aids, complies, conforms, discusses, greets, helps, labels, performs, practices, presents, reads, recites, reports, selects, tells, writes

Valuing
Definition: The worth or value a person attaches to a particular object, phenomenon, or behavior; this ranges from simple acceptance to the more complex state of commitment.
Sample verbs: completes, demonstrates, differentiates, explains, follows, forms, initiates, invites, joins, justifies, proposes, reads, reports, selects, shares, studies, works

Organization
Definition: Organizes values into priorities by contrasting different values, resolving conflicts between them, and creating a unique value system; the emphasis is on comparing, relating, and synthesizing values.
Sample verbs: adheres, alters, arranges, combines, compares, completes, defends, explains, formulates, generalizes, identifies, integrates, modifies, orders, organizes, prepares, relates, synthesizes

Internalizing Values
Definition: Has a value system that controls the learner's behavior; the behavior is pervasive, consistent, predictable, and, most importantly, characteristic of the learner.
Sample verbs: acts, discriminates, displays, influences, listens, modifies, performs, practices, proposes, qualifies, questions, revises, serves, solves, verifies
Psychomotor Domain

Imitation
Definition: Repeating an act that has been demonstrated or explained, including trial and error until an appropriate response is achieved.
Sample verbs: begin, assemble, attempt, carry out, copy, calibrate, construct, dissect, duplicate, follow, mimic, move, practice, proceed, repeat, reproduce, respond, organize, sketch, start

Manipulation
Definition: Repeating an act after being given oral or written instruction on how to do a certain activity.
Sample verbs: (similar to imitation) acquire, assemble, complete, conduct, do, execute, improve, maintain, make, manipulate, operate, pace, perform, produce, progress, use

Precision
Definition: Response is complex and performed without hesitation.
Sample verbs: achieve, accomplish, advance, exceed, excel, master, reach, refine, succeed, surpass, transcend

Articulation
Definition: Skills are so well developed that the individual can modify movement patterns to fit special requirements or to meet a problem situation.
Sample verbs: adapt, alter, change, excel, rearrange, reorganize, revise, surpass

Naturalization
Definition: Response is automatic; one acts "without thinking."
Sample verbs: arrange, combine, compose, construct, create, design, refine, originate, transcend
Learning Task 2
Suppose that in your major area 50% of the content covers the theoretical aspect and 50% covers the practical aspect. As a teacher, which instructional objectives do you think measure the theory and which measure the practice? Give an example of each.
___________________________________________________________________________
___________________________________________________________________________
____________________________________________________________________
State each general objective at the proper level of generality; it should encompass a
readily definable domain of responses.
Stating specific learning outcomes:
List beneath each general instructional objective a representative sample of specific
learning outcomes that describe the terminal performance students are expected to
demonstrate.
Begin each specific learning outcome with an action verb that specifies observable
performance, like "identifies" or "describes".
Make sure that each specific learning outcome is relevant to the general objective it
describes.
Include a sufficient number of specific learning outcomes to adequately describe the
performance of students who have attained the objective.
Self-Check Exercises
Dear Trainees! You have now completed your study of chapter two. Therefore, you are expected to answer
the following self-test questions.
Instruction I: Match the domain descriptions listed under column A with their appropriate domains
under column B
Column A Column B
2. The ability to use learned materials in new and concrete situations B. Evaluation
4. The ability to judge the value of a material for a given purpose D. Analysis
G. Knowledge
H. Organization
Instruction II Choose the best answer from the alternatives given for each item
7. Which one of the following action verbs can be used in synthesis level of cognitive domain?
8. Which one of the following action verbs does not indicate learning outcome at the
evaluation level?
A. Decide B. Conclude C. Validate D. Discriminate
9. At which one of the following levels of the affective domain is the learner expected to develop a
consistent philosophy of life?
A. Organization C. Characterization
B. Valuing D. Responding
10. Which one of the following action verbs can be used in synthesis level of cognitive domain?
A. Write C. Transfer
B. Distinguish D. Interpret
UNIT THREE
CLASSROOM ACHIEVEMENT TESTS AND ASSESSMENTS
3.1 INTRODUCTION
The classroom test, otherwise called a teacher-made test, is an instrument of measurement and
evaluation. It is a major technique used for the assessment of students' learning outcomes. A classroom
test can be an achievement test, a performance test and/or any other type of test (such as a practical test)
prepared by the teacher for his specific class and purpose, based on what he has taught.
Tests may be classified into two broad categories on the basis of the nature of the measurement: measures
of maximum performance and measures of typical performance. Measures of maximum performance are
those procedures used to determine a person's ability. They are concerned with how well an individual
performs when motivated to obtain as high a score as possible, and the result indicates what individuals
can do when they put forth their best effort. Can you recall any test that should be included in this
category? Examples are aptitude tests, achievement tests and intelligence tests.
On the other hand, measures of typical performance are those designed to reflect a person's typical
behaviour. They fall into the general area of personality appraisal, such as interests, attitudes and various
aspects of personal-social adjustment. Because testing instruments cannot adequately be used to measure
these attributes, self-report and observational techniques, such as interviews, questionnaires, anecdotal
records and ratings, are sometimes used. These techniques are used in relevant combinations to provide
the desired results on which accurate judgment concerning the learner's progress and change can be made.
3.3. TYPES OF TESTS USED IN THE CLASSROOM
There are different types of test forms used in the classroom. These can be the essay test, the objective
test, the norm-referenced test or the criterion-referenced test. But we are going to concentrate on the essay
test and the objective test. These are the most common tests, which you can easily construct for your
purpose in the class.
3.3.1. Objective tests
Objective tests are those tests whose items are set in such a way that one and only one correct answer is
available for a given item. In this case every scorer would arrive at the same score for each item on each
examination, even on repeated scoring occasions. This type of item sometimes calls on examinees to
recall and write down, or to supply, a word or phrase as an answer (free-response type). It could also
require the examinees to recognize and select, from a given set of possible answers or options, the one
that is correct or most correct (fixed-response type). This implies that the objective test consists of items
measuring specific skills, with a specific correct response to each of the items irrespective of the scorer's
personal opinion, bias, mood or health at the time of scoring.
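Because each item carries exactly one keyed answer, scoring an objective test can be reduced to comparing responses against the key. The short Python sketch below (item numbers and answer key are hypothetical) illustrates why every scorer, human or machine, arrives at the same total: the score depends only on the key and the responses, never on who does the scoring.

```python
# Score an objective test against a fixed answer key.
def score_objective_test(key, responses):
    """Return the number of responses that match the keyed answer."""
    return sum(1 for item, answer in key.items()
               if responses.get(item) == answer)

# Hypothetical five-item test (fixed-response type).
key = {1: "B", 2: "D", 3: "A", 4: "C", 5: "B"}
student = {1: "B", 2: "D", 3: "C", 4: "C", 5: "A"}

print(score_objective_test(key, student))  # 3 items match the key
```

Omitted items simply fail to match the key, so they earn no credit without any special handling.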
True/False or two option items
The true-false type of test is representative of a somewhat larger group called alternate-response items,
such as yes-no, correct-incorrect, agree-disagree and right-wrong. This group consists of any question in
which the student is confronted with two possible answers. Since most of the points discussed here are
equally applicable to all alternative-response items, and since teachers are familiar with the true-false
type, the following discussion will concentrate on true-false items.
Advantages of true/false items
It is commonly used to measure the ability to identify the correctness of statements of fact,
definitions of terms, statements of principles and other relatively simple learning outcomes for
which a declarative statement might be used with any of several methods of responding.
It is also used to measure an examinee's ability to distinguish fact from opinion and superstition
from scientific belief.
It is used to measure the ability to recognize cause-and-effect relationships.
It is best used in situations in which there are only two possible alternatives, such as right or
wrong, more or less, and so on.
It is easy to construct an alternative-response item, but the validity and reliability of such an item
depend on the skill of the item constructor. Constructing an unambiguous alternative-response
item that measures significant learning outcomes requires much skill.
A large number of alternative-response items covering a wide area of sampled course material can
be obtained, and the examinees can respond to them in a short period of time.
Disadvantages of true/false items
It requires course material that can be phrased so that the statement is true or false without
qualification or exception, which is difficult in areas such as the Social Sciences.
It is limited to learning outcomes in the knowledge area, except for distinguishing between fact
and opinion or identifying cause-and-effect relationships.
It is susceptible to guessing, with a fifty-fifty chance of the examinee selecting the correct answer
by chance alone. The chance selection of correct answers has the following effects:
i. It reduces the reliability of each item, thereby making it necessary to include many items
in order to obtain a reliable measure of achievement.
ii. The diagnostic value of answers to guessed test items is practically nil, because analysis
based on such responses is meaningless.
iii. The validity of examinees' responses is also questionable because of response set.
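The fifty-fifty guessing chance can be made concrete: the number of items a blind guesser answers correctly on a true-false test follows a binomial distribution with p = 0.5, so lengthening the test sharply reduces the chance of a respectable score by luck alone. A small illustrative sketch (the item counts are arbitrary):

```python
from math import comb

def p_score_by_guessing(n_items, score, p=0.5):
    """Probability that pure guessing yields exactly `score` correct
    answers on `n_items` two-option items (binomial distribution)."""
    return comb(n_items, score) * p**score * (1 - p)**(n_items - score)

def p_at_least(n_items, score, p=0.5):
    """Probability of guessing `score` or more items correctly."""
    return sum(p_score_by_guessing(n_items, k, p)
               for k in range(score, n_items + 1))

# On a 10-item true-false test, a blind guesser scores 60%+ fairly often:
print(round(p_at_least(10, 6), 3))   # ≈ 0.377
# On a 50-item test, a 60% chance score is much rarer:
print(round(p_at_least(50, 30), 3))
```

This is the arithmetic behind the recommendation above to include many items in order to obtain a reliable measure.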
Guidelines for preparing true-false items
Matching items
A matching test item usually consists of two parallel columns. One column contains a list of words,
numbers, symbols or other stimuli (premises) to be matched to a word, sentence, phrase or other possible
answer from the other column's list (responses). The examinee is directed to match the responses to the
appropriate premises. Usually, the two lists have some sort of relationship. Although the basis for
matching responses to premises is sometimes self-evident, more often it must be explained in the
directions.
The examinee's task, then, is to identify the pairs of items that are to be associated on the basis indicated.
Sometimes the premises and responses lists form an imperfect match, with more entries in one of the two
columns and directions indicating what is to be done. For instance, the examinee may be required to use
an item once, more than once, or not at all. This deliberate procedure is used to prevent examinees from
matching the final pair of items on the basis of elimination.
Advantages of matching items
It is used whenever learning outcomes emphasize the ability to identify the relationship
between things and a sufficient number of homogeneous premises and responses can be
obtained.
It is essentially used to relate two things that have some logical basis for association.
It is adequate for measuring factual knowledge, like testing the knowledge of terms, definitions,
dates, events, and references to maps and diagrams.
The major advantage of the matching exercise is that one matching item consists of many
problems. This compact form makes it possible to measure a large amount of related factual
material in a relatively short time.
It enables the sampling of larger content, which results in relatively higher content validity.
The guess factor can be controlled by skilfully constructing the items such that the correct
response for each premise must also serve as a plausible response for the other premises.
The scoring is simple and objective and can be done by machine.
Disadvantages of matching items
It is restricted to the measurement of factual information based on rote learning, because the
materials tested must lend themselves to the listing of a number of important and related concepts.
Many topics are unique and cannot be conveniently grouped into homogeneous matching clusters,
and it is sometimes difficult to get homogeneous clusters of premises and responses that match
sufficiently, even for content that is adaptable to clustering.
It requires extreme care during construction in order to avoid encouraging serial memorization
rather than association, and to avoid irrelevant clues to the correct answer.
Guidelines for preparing matching items
Use only homogeneous material in a set of matching items (e.g., dates and places should not be
in the same set).
Use the more involved expressions in the stem and keep the responses short and simple.
Supply directions that clearly state the basis for the matching, indicating whether or not a
response can be used more than once, and stating where the answer should be placed.
Make sure that there are never multiple correct responses for one stem (although a response
may be used as the correct answer for more than one stem).
Avoid giving inadvertent grammatical clues to the correct response (e.g., using a/an, singular/
plural verb forms).
Arrange items in the response column in some logical order (alphabetical, numerical, and
chronological) so that students can find them easily.
Avoid breaking a set of items (stems and responses) over two pages.
Use no more than 15 items in one set.
Provide more responses than stems to make process-of-elimination guessing less effective.
Number each stem for ease in later discussions.
Use capital letters for the response signs rather than lower-case letters.
Example:
Directions: 1. On the line to the right of each phrase in Column I, write the letter of the word in
Column II that best matches the phrase.
2. Each word in Column II may be used once, more than once, or not at all.

Column I                                          Column II
1. Name of the answer in addition problems        A. Difference
2. Name of the answer in subtraction problems     B. Dividend
3. Name of the answer in multiplication problems  C. Multiplicand
4. Name of the answer in division problems        D. Product
                                                  E. Quotient
                                                  F. Subtrahend
                                                  G. Sum
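Scoring a matching exercise is mechanical once the key is fixed, which is why the text notes it can be done by machine. A minimal sketch, assuming the conventional answers for the arithmetic example (1-G Sum, 2-A Difference, 3-D Product, 4-E Quotient):

```python
# Answer key for the arithmetic matching example:
# 1 addition -> G (Sum), 2 subtraction -> A (Difference),
# 3 multiplication -> D (Product), 4 division -> E (Quotient).
KEY = {1: "G", 2: "A", 3: "D", 4: "E"}

def score_matching(key, responses):
    """One point per premise whose chosen response letter matches the key."""
    return sum(1 for premise, letter in key.items()
               if responses.get(premise, "").upper() == letter)

# A student who swaps the multiplication and division answers:
student = {1: "G", 2: "A", 3: "E", 4: "D"}
print(score_matching(KEY, student))  # 2 of 4 premises matched correctly
```

Note that a response letter such as B (Dividend) can simply go unused, matching the "more responses than stems" guideline above.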
The multiple choice items (MCQs)
The multiple-choice item consists of two parts: a problem and a list of suggested solutions. The problem,
generally referred to as the stem, may be stated as a direct question or an incomplete statement, while the
suggested solutions, generally referred to as the alternatives, choices or options, may include words,
numbers, symbols or phrases. In its standard form, one of the options of the multiple-choice item is the
correct or best answer, and the others are intended to mislead, foil, or distract examinees from the correct
option and are therefore called distracters, foils or decoys. These incorrect alternatives receive their name
from their intended function: to distract the examinees who are in doubt about the correct answer.
Advantages of the MCQs
The multiple-choice item is the most widely used of the types of test available. It can be
used to measure a variety of learning outcomes from simple to complex.
It is adaptable to any subject matter content and educational objective at the knowledge
and understanding levels.
It can be used to measure knowledge outcomes concerned with vocabulary, facts,
principles, method and procedures and also aspects of understanding relating to the
application and interpretation of facts, principles and methods.
Most commercially developed and standardized achievement and aptitude tests make use
of multiple-choice items.
The main advantage of multiple-choice test is its wide applicability in the measurement of
various phases of achievement.
It is the most desirable of all the test formats, being free of many of the disadvantages of other
forms of objective items. For instance, it presents a more well-defined problem than the
short-answer item, avoids the need for the homogeneous material necessary for the matching
item, reduces the clues and susceptibility to guessing characteristic of the true-false item,
and is relatively free from response sets.
It is useful in diagnosis, and it enables fine discrimination among examinees on the basis
of how much of the measured attribute each possesses.
It can be scored with a machine.
Disadvantages/limitations of the MCQs
It measures problem-solving behaviour at the verbal level only.
It is inappropriate for measuring learning outcomes requiring the ability to recall, organize
or present ideas because it requires selection of correct answer.
It is very difficult and time consuming to construct.
It requires more response time than any other type of objective item and may favour the
test-wise examinees if not adequately and skilfully constructed.
Measuring evaluation and synthesis can be difficult.
Inappropriate for measuring outcomes that require skilled performance
Guidelines for preparing multiple-choice items
Use the stem to present the problem or question as clearly as possible; eliminate excessive
wordiness and irrelevant information.
Use direct questions rather than incomplete statements for the stem.
Include as much of the item as possible in the stem so that alternatives can be kept brief. Include
in the stem words that would otherwise be repeated in each option.
In testing for definitions, include the term in the stem rather than as one of the alternatives.
List alternatives on separate lines rather than including them as part of the stem so that they
can be clearly distinguished.
Keep all alternatives in a similar format (e. g. All phrases, all sentences, etc.).
Make sure that all options are plausible responses to the stem. (Poor alternatives should not be
included just for the sake of having more options.)
Check to see that all choices are grammatically consistent with the stem.
Try to make alternatives for an item approximately the same length. (Making the correct
response consistently longer is a common error.)
Use misconceptions which students have indicated in class or errors commonly made by
students in the class as the basis for incorrect alternatives.
Use “all of the above” and “none of the above” sparingly since these alternatives are often
chosen on the basis of incomplete knowledge. Words such as “all,” “always,” and “never” are
likely to signal incorrect options.
Use capital letters (A, B, C, D, and E) on tests as responses rather than lower-case letters ("a" gets
confused with "d" and "c" with "e" if the type or duplication is poor). Instruct students to use
capital letters when answering (for the same reason), or have them circle the letter or the whole
correct answer, or use scannable answer sheets.
Try to write items with equal numbers of alternatives in order to avoid asking students to
continually adjust to a new pattern caused by different numbers.
Put the incomplete part of the sentence at the end rather than the beginning of the stem. Phrase
the item as a statement rather than a direct question.
Use negatively stated items sparingly. (When they are used, it helps to underline or otherwise
visually emphasize the negative work.)
Make sure that there is only one best or correct response to the stem. If there are multiple correct
responses, instruct students to “choose the best response.”
Limit the number of alternatives to five or less. (The more alternatives used, the lower the
probability of getting the correct answer by guessing. Beyond five alternatives, however,
confusion and poor alternatives are likely.)
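Chance success on multiple-choice items is sometimes handled with the conventional correction-for-guessing formula S = R − W/(k−1), where R is the number right, W the number wrong (omits excluded) and k the number of options per item. This formula is a standard convention, not one prescribed by this module; a minimal sketch:

```python
def corrected_score(rights, wrongs, n_options):
    """Classic correction for guessing: S = R - W / (k - 1).
    Omitted items count as neither right nor wrong."""
    if n_options < 2:
        raise ValueError("need at least two options per item")
    return rights - wrongs / (n_options - 1)

# 40 items with 5 options each: 28 right, 8 wrong, 4 omitted.
print(corrected_score(28, 8, 5))  # 28 - 8/4 = 26.0

# With only 2 options (true-false), the same wrongs cost far more:
print(corrected_score(28, 8, 2))  # 28 - 8/1 = 20.0
```

The comparison shows why more alternatives reduce the guessing penalty needed: with five options a random guess succeeds only one time in five.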
Supply type items (short-answer and completion)
This is the type of test item which requires the testee to give very brief answers to the questions. These
answers may be a word, a phrase, a number, a symbol or symbols, etc. Supply test items can be in
short-answer or completion form. Both are supply-type test items consisting of direct questions which
require a short answer (short-answer type) or an incomplete statement or question to which a response
must be supplied by the examinee (completion type). The answers to such questions could be a word,
phrase, number or symbol. They are easy to develop and, if well developed, the answers are definite and
specific and can be scored quickly and accurately.
Advantages of supply type items
Measure the ability to interpret diagrams, charts, graphs and pictorial data.
Most effective for measuring a specific learning outcome such as computational learning
outcomes in mathematics and sciences
to this question, there is also the possibility of spelling mistakes associated with free-response
questions that the scorer has to contend with.
Guidelines for preparing short- answer items
Questions must be carefully worded so that all students understand the specific nature of the
question asked and the answer required.
Better: In what battle fought in 1869 did Tewodros II defeat Ras Ali?
Word completion or fill-in-the-blank questions so that the missing information is at, or near, the
end of the sentence. This makes reading and responding easier.
Better: If a room measures 7 meters by 4 meters, the perimeter is ______ meters (or m).
Do not use too many blanks in completion items. The emphasis should be on knowledge and
comprehension, not mind reading.
Consider: In the year __________, Prime Minister ___________ signed the _________, which led
to a __________ which was ____________.
Word each item in specific terms with clear meanings so that the intended answer is the only one
possible, and so that the answer is a single word, brief phrase, or number.
In supply items, present much of the statement and blank the key word.
Poor: ___________ ____________ are words that refer to particular ________, ________,
or ___________.
Better: Proper nouns are words that refer to particular _________, __________ or
________.
Best: Words that refer to particular persons, objects or things are ____________.
Essay test items are of two types:
Extended/Unrestricted/Open-ended/free response
Restricted/Closed-ended

Extended response items
No restrictions on the response
No restrictions on the number of pages
Originality required
Applicable in measuring higher-level learning outcomes of the cognitive domain, such as the
analysis, synthesis and evaluation levels
Examples:
Describe the processes of producing or cutting screw threads in the school technical workshop.
Why should the classroom teacher state his instructional objectives to cover the three domains of
educational objectives?
Open and Distance Learning is a viable option for the eradication of illiteracy in Ethiopia. Discuss.

Restricted response items
Such items are directional questions aimed at the desired responses.
Examples:
Require the instructor to give critical comments.
Many educators use testing strategies that do not focus entirely on recalling facts. Instead, they ask students
to demonstrate practical skills and concepts they have learned. This strategy is called authentic assessment.
Authentic assessment aims to evaluate students' abilities in 'real-world' contexts. In other words, students
learn how to apply their skills to authentic tasks and projects. Authentic assessment does not encourage rote
learning and passive test-taking. Instead, it focuses on students' analytical skills, ability to integrate what
they learn, creativity, ability to work collaboratively and written and oral expression skills. It values the
learning process as much as the finished product.
One of the focal areas of educational measurement and evaluation is assessing students' non-cognitive
performances, such as their interests, skills in performing various physical activities, carrying out
laboratory experiments, attitudes, project activities, workshop products and the like. These affective and
psychomotor aspects of the learner should be well assessed. Identification of learner characteristics and
adapting classroom instruction to those characteristics could help all students master the knowledge and
skills they get from schooling. Classroom teachers thus should make sure that their students are attaining
the instructional objectives they are supposed to achieve by examining students' non-cognitive
performances and tangible products.
Authentic assessment, also identified as performance assessment, involves measurement activities that
ask students to demonstrate skills similar to those required in real-life settings.
provide more specific and usable information
Limitations/disadvantages
Require more time and effort on an instructor’s/teacher’s part to develop,
More difficult to grade:
o often useful to create a grading rubric and authentic tools
Tools for authentic assessment
Observation (and observation devices)
Group work
Project work
Reflection
Portfolio
Role playing
Drama
Concept map
For practical subjects, this is the most obvious form of assessment: watching someone doing something to
see if they can do it properly. It is the recommended form for competency-based programs. For any area in
which performance itself is not enough, direct observation needs to be supplemented by other methods.
One observation is not enough, but there is a trade-off, because observation is an extremely expensive way
of assessing.
Direct observation is valuable for collecting real information and checking the actual performance of the
learners. Reliability is only assured when everyone engaged in the assessment process is perfectly clear
about what is being looked for, and what evidence is required to determine competence. Developing
observation protocols is not a trivial activity. During observation we may use different instruments and
devices that help make our observation more critical and evidence-based. Some of the commonly used
devices are: checklists, anecdotal records, rating scales and running records.
I. Checklist
A checklist is a type of informational job aid used to reduce failure by compensating for potential limits of
human memory and attention. It helps to ensure consistency and completeness in carrying out a task. A
basic example is the "to do list."
It is prepared to check that all the important tasks have been done in the proper order. Such a list may be
used to prepare a shopping checklist, tasks checklist, to-do checklist, invitation checklist, guest checklist,
packing checklist, etc. A checklist is a list of items you need to verify, check and inspect. Checklists are
used in every imaginable field, from building inspections to complex medical surgeries.
Example: An ICT instructor wants to evaluate students' performance on an Excel document, scoring the
review out of ten points.
1. Are there no merged cells in the data area of the table (e.g., merged cells used only for headers
and titles)?
2. Do all active worksheets in the workbook have clear and concise names that allow the user to
identify the source and contents of the table?
3. Are tables prefixed with the table name and table number?
4. Are table header rows formatted to repeat at the top of the table as it goes from one page to
another?
6. Does each separate table have its own worksheet, and is each of those worksheets named?
10. Has the document been reviewed in print preview for a final visual check?
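Because each checklist item is a yes/no judgment, tallying the result is trivial. A minimal sketch (item texts are abbreviated and equal weighting is assumed) counts the satisfied items:

```python
# Each checklist item is marked yes (True) or no (False); the score is
# the count of satisfied items. Item texts are abbreviated here.
def checklist_score(marks):
    """Return (points_earned, points_possible) for a yes/no checklist."""
    return sum(marks.values()), len(marks)

excel_review = {
    "no merged cells in data area": True,
    "worksheets clearly named": True,
    "tables prefixed with name and number": False,
    "header rows repeat across pages": True,
    "one table per named worksheet": True,
    "checked in print preview": False,
}

earned, possible = checklist_score(excel_review)
print(f"{earned}/{possible}")  # 4/6 items satisfied
```

The same structure extends to the full ten-point version simply by listing all ten items.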
II. Anecdotal records
An anecdotal record is like a short story that teachers use to record a significant incident that they have
observed. Anecdotal records are usually relatively short and may contain descriptions of behaviors and
students' actual performance.
• Records typical or unusual performance, skills and behaviors
Anecdotes capture the richness and complexity of the moment as students interact with one another and
with materials. These records of student behavior and learning, accumulated over time, enhance the
teacher's understanding of the individual student as patterns or profiles begin to emerge. Behavior and
performance change can be tracked, documented, and placed in the student's portfolio, resulting in
suggestions for future observations and planning.
Anecdotal notes for a particular student can be periodically shared with that student or be shared at the
student’s request. They can also be shared with students and parents at parent–teacher–student conferences.
Tesfaye was in the art area during free choice. He was making letters, rolling
the paper and then he tied the paper roll with a string. He demonstrated this
process to Aster, Abdela and Mesfin who were also in the art area.
III. Rating scales
A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative
attribute. In the social sciences, particularly psychology, common examples are the Likert scale and 1-10
rating scales in which a person selects the number which is considered to reflect the perceived quality of a
product. A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the
rated object, as a measure of some rated attribute.
Please indicate the student's skill in each of the following respects, as evidenced by this assignment, by
checking the appropriate box.
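Marks gathered on such a scale are typically summarized numerically per attribute. A minimal sketch, assuming a hypothetical 1-5 scale and illustrative attribute names:

```python
# Summarize 1-5 rating-scale marks for one assignment.
# Attribute names and the five-point scale are illustrative only.
def mean_rating(marks):
    """Average of the numeric ratings, rounded to two decimals."""
    return round(sum(marks.values()) / len(marks), 2)

assignment_ratings = {
    "organization of ideas": 4,
    "accuracy of content": 5,
    "clarity of expression": 3,
    "use of sources": 4,
}

print(mean_rating(assignment_ratings))  # 4.0
```

Averaging across raters works the same way, one dictionary of marks per rater.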
A. Running Records
A running record is a tool that helps teachers to identify patterns or sequences in student practical work,
reading behaviors, laboratory experiment procedures, drawing procedures and so on. These patterns allow
a teacher to see the strategies a student uses to carry out the practical tasks. A running record collects
information in narrative and pictorial form, step by step, following the flow of a certain action. Hence, a
running record shows students' practical performance from beginning to end.
For instance, a running record is one method of assessing a child's reading level by examining both accuracy
and the types of errors made. It is most often utilized as part of a reading recovery session in school or any
education center. A running record gives the teacher an indication of whether material currently being read
is too easy or too difficult for the child, and it serves as an indicator of the areas where a child's reading can
improve. For example, if a child frequently makes word substitutions that begin with the same letter as the
printed word, the teacher will know to focus on getting the child to look beyond the first letter of a word.
A running record can also be used for practical subjects besides reading.
B. Group and project work
Group project work has become a common feature of higher education. Many practitioners have recognized
that if students are going to be effective at group work, they often need to improve their group-work skills.
The skills of interacting in a group and working together to achieve a common goal within a specific
timescale need to be learnt.
Project work challenges students to think beyond the boundaries of the classroom, helping them develop
the skills, behaviors and confidence. Designing learning environments that help students question, analyze,
evaluate and extrapolate their plans, conclusions and ideas, leading them to higher order thinking, requires
feedback and evaluation that goes beyond a letter or number grade.
Since project work requires students to apply knowledge and skills throughout the project-building process,
teachers will have many opportunities to assess work quality, understanding and participation from the
moment students begin working.
For example, teachers’ evaluation can include tangible documents like the project vision, storyboard and
rough draft, verbal behaviors such as participation in group discussions and sharing of resources and ideas,
and non-verbal cognitive tasks such as risk taking and evaluation of information. They can also capture
snapshots of learning throughout the process by having students complete a project journal, a self-
assessment and by making a discussion of the process one component of the final presentation.
Project assessment may be conducted in two ways: process assessment and product assessment.
Assessing the process: evaluating individual teamwork skills and interaction. This includes a list of
skills to assess, such as:
Assessing the product: Measuring the quantity and quality of individual work in a group project.
Process assessments are subjective and students are not always straightforward when evaluating
one another or themselves. However, in combination with product assessments and individual
assessments, they can offer valuable glimpses into how teams function and alert you to major
problems (e.g., particularly problematic team members or serious conflict), which can help to
inform your feedback and grading.
An example of peer assessment in team project members
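One common way to combine such peer ratings is to average, for each member, the scores received from teammates, excluding self-ratings. A minimal sketch (the names reuse the earlier anecdote; the 1-5 scores are hypothetical):

```python
# Each rater scores every teammate (1-5); a member's process score is
# the mean of the ratings received from the other members.
def peer_scores(ratings):
    """ratings: {rater: {ratee: score}}. Returns {member: mean score},
    ignoring any self-ratings."""
    received = {}
    for rater, given in ratings.items():
        for ratee, score in given.items():
            if ratee != rater:              # exclude self-assessment
                received.setdefault(ratee, []).append(score)
    return {m: round(sum(s) / len(s), 2) for m, s in received.items()}

team = {
    "Aster":  {"Mesfin": 4, "Abdela": 5},
    "Mesfin": {"Aster": 5, "Abdela": 4},
    "Abdela": {"Aster": 4, "Mesfin": 3},
}
print(peer_scores(team))  # {'Mesfin': 3.5, 'Abdela': 4.5, 'Aster': 4.5}
```

Large gaps between a member's received average and the team mean are exactly the kind of signal the text says can alert you to problematic team dynamics.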
C. Portfolio
What is a portfolio?
A portfolio is the systematic collection of student work measured against predetermined scoring criteria.
These criteria may include scoring guides, rubrics, checklists and rating scales. In authentic assessment,
information or data is collected from various sources, through multiple methods, and over multiple points
in time.
Assessment portfolios can include performance-based assessments, such as writing samples that illustrate
different genres, solutions to math problems that show problem-solving ability, lab reports that demonstrate
an understanding of a scientific approach, or social studies research reports that show the ability to use
multiple sources. In addition, assessment portfolios can include scores on standardized and program specific
tests.
A portfolio is a collection of a student's work specifically selected to tell a particular story about the
student. Student portfolios take many forms, so it is not easy to describe them. A portfolio is not the pile of student
work that accumulates over a semester or year. Rather, a portfolio contains a purposefully selected subset
of student work. "Purposefully" selecting student work means deciding what type of story you want the
portfolio to tell. For example, do you want it to highlight or celebrate the progress a student has made?
Then, the portfolio might contain samples of earlier and later work, often with the student commenting
upon or assessing the growth. Do you want the portfolio to capture the process of learning and growth?
Then, the student and/or teacher might select items that illustrate the development of one or more skills
with reflection upon the process that led to that development. Or, do you want the portfolio to showcase the
final products or best work of a student? In that case, the portfolio would likely contain samples that best
exemplify the student's current ability to apply relevant knowledge and skills.
D. Self-Assessment
Self-assessment is the process of looking at oneself in order to assess aspects that are important to one's
identity. It is one of the motives that drive self-evaluation, along with self-verification and self-
enhancement.
The ultimate aim of education is to produce lifelong and independent learners. An essential component of
autonomous learning is the ability to assess one's own progress and deficiencies. Student self-assessment
should be incorporated into every evaluation process. Its specific form may vary with the developmental
level of the student, but the very youngest students can begin to examine and evaluate their own behavior
and accomplishments.
Instead of grading all assignments, allow students to correct some themselves. You may choose to randomly
collect these and check for accuracy. Share the specific evaluation criteria (or rubric) students should
employ in assessing various tasks or assignments. Provide them with criteria check sheets (or have the class
generate them) that specify exactly what constitutes a good product.
Encourage the student to apply specific criteria in making the self-assessment. Self-assessment can take
many forms, including:
6) Teacher-student interviews
These types of self-assessment ask students to review their work to determine what they have learned and
what areas of confusion still exist. Although each method differs slightly, all should include enough time
for students to consider thoughtfully and evaluate their progress.
An example: a student's self-evaluation of his/her reading performance, based on the following checklist.
Name………………………………………
Year………………………………………..
Department……………………………….
Date………………………………………..
No Activity Yes No
SELF CHECK
1. Briefly explain the meaning of an objective test.
2. What are the two major advantages of objective test items over essay test items?
3. What is the major feature of an objective test that distinguishes it from an essay test?
4. Which types of objective test items would you recommend for a school-wide test, and why?
5. How would you make essay questions less subjective?
6. What are the two sub-divisions of supply test items?
7. Answer True or False:
(a) Subjectivity in scoring is a major limitation of the essay test. True / False
(b) Essay questions cover the content and objectives of a course as comprehensively as possible. True / False
(c) Grading essay questions is time consuming. True / False
(d) Multiple choice questions should have only two options. True / False
8. Construct five multiple choice questions in any course of your choice.
9. Give 5 examples of free response or extended response questions
10. Briefly identify the two most outstanding weaknesses of essay test as a measuring instrument.
11. When should essay test be used?
12. Why do teachers use authentic assessment during students' practical performance in the automotive workshop?
13. How can teachers assess their students during project work? Explain the mechanism of project assessment.
14. What are the advantages of authentic assessment?
15. Can we measure affective and psychomotor domains by using authentic assessment?
16. What are the basic characteristics of anecdotal records?
17. Explain the difference between running records and anecdotal records.
UNIT FOUR
TEST DEVELOPMENT – PLANNING THE CLASSROOM TEST
Introduction
Dear Trainees! Welcome to the fourth chapter of this module. In this unit, you will learn how to
plan a classroom test: what to consider at the planning stage, and how to carry out a content
survey and scrutinize the instructional objectives as relevant inputs to the development of a
table of specification (test blueprint).
Objectives
Dear learner, by the time you finish this unit, you will be able to:
Identify the sequence of planning a classroom test,
Prepare a table of specifications for a classroom test in a given subject,
Recognize some common problems of teacher-made tests, and
Carry out a content survey in the development of a table of specifications.
4.1 Test Development – Planning the Classroom Test
Learning Task 1
Dear Trainees, before you read the sections below, try to answer the following
questions.
1. As a teacher, what steps do you follow when you prepare written test items for
your students?
________________________________________________________________
______________________________________________
2. What do you know about test blue print?
________________________________________________________________
____________________________________________________
3. What are some of the common problems observed in teacher made tests?
________________________________________________________________
_______________________________________________
The development of good questions or items for a classroom test cannot be taken for granted. An
inexperienced teacher may write good items by chance, but this is not always possible. The
development of good items must follow a number of principles, without which no one can guarantee
that the responses given to the test will be relevant and consistent. In this unit, we shall
examine the various aspects of the teacher's own test.
The development of valid, reliable and usable questions involves proper planning. The plan entails
designing a framework that can guide the test developers through the item development process. This
is necessary because the classroom test is a key factor in the evaluation of learning outcomes.
The validity, reliability and usability of such a test depend on the care with which the test is
planned and prepared. Planning helps to ensure that the test covers the pre-specified instructional
objectives and the subject matter (content) under consideration. Planning a classroom test
therefore entails identifying the instructional objectives stated earlier and the subject matter
(content) covered during the teaching/learning process. This leads to the preparation of a table
of specification (the test blueprint), while bearing in mind the type of test that would be
relevant for the purpose of testing.
As a teacher, you will face several problems when it comes to one of your most important
functions: evaluating learning outcomes. You are expected to observe your students in the
class, workshop, laboratory, field of play, etc., and rate their activities under these varied
conditions. You are required to correct and grade assignments and homework, and to give
weekly tests and end-of-term examinations. Most of the time, you are expected to decide on the
fitness of your students for promotion on the basis of continuous assessment exercises,
cumulative end-of-term examination results, and the promotion examination given towards the end
of the school year. Given these conditions, it becomes very important that you become familiar
with the planning, construction and administration of good quality tests. An outline of the
framework for planning the classroom test is discussed later.
4.1.1 Some Pitfalls in Teacher-Made Tests
You were told that testable educational objectives are classified by Bloom et al. (1956) into
knowledge (recall or memory), comprehension, application, analysis, synthesis, and evaluation. It
means that you should not only set objectives along these levels but also test along them. The
following observations have been made about teacher-made tests. They are listed below so that
you can avoid them when you construct questions for your class tests.
Most teacher-made tests are not appropriate to the different levels of learning outcomes.
Teachers specify instructional objectives covering the whole range from simple recall to
evaluation, yet their items fall within the recall of specific facts only.
Many of the test exercises fail to measure what they are supposed to measure. In other
words, most teacher-made tests are not valid. You may wonder what validity is. It is
a very important quality of a good test, which implies that a test is valid if it measures what
it is supposed to measure. You will read about it in detail later in this course.
Some classroom tests do not comprehensively cover the topics taught. One of the qualities
of a good test is that it should represent the entire content taught, but these tests cannot be
said to be a representative sample of the whole content taught.
Most tests prepared by teachers lack clarity in their wording. The questions are
ambiguous, imprecise, and often carelessly worded; many are general or global questions.
Most teacher-made tests fail item analysis: they do not discriminate properly and are not
designed according to difficulty levels.
These are not the only pitfalls, but you should try to avoid both the ones mentioned here and those
not mentioned here. Now let us look at how to develop test items.
4.1.2 Considerations in Planning a Classroom Test.
To plan a classroom test that is both practical and effective in providing evidence of mastery
of the instructional objectives and content covered requires careful consideration. Hence, the
following serve as a guide in planning a classroom test:
Determine the purpose of the test;
Describe the instructional objectives and content to be measured.
Determine the relative emphasis to be given to each learning outcome;
Select the most appropriate item formats (essay or objective);
Develop the test blue print to guide the test construction;
Prepare test items that are relevant to the learning outcomes specified in the test plan;
Decide on the pattern of scoring and the interpretation of result;
Decide on the length and duration of the test, and
Assemble the items into a test, prepare direction and administer the test.
The table of specification is planned to take care of the coverage of content and objectives in the right
proportion, according to the degree of relevance and emphasis (weight) attached to them in the
teaching-learning process. The table of specifications is a two-dimensional table that specifies the
level of objectives in relation to the content of the course, or the type of the items in relation to the
content of the course. These are called the table of specifications by objective and the table of
specifications by test type, respectively. A hypothetical table of specification by objective is
illustrated in Table 4.1 below:
Table 4.1: Hypothetical table of specification by objective

Content   Weight   Knowledge   Comprehension   Application   Analysis   Synthesis   Evaluation   Total
Set A     15%      -           1               -             2          -           -            3
Set B     15%      -           1               -             2          -           -            3
Set C     25%      1           -               1             1          1           1            5
Set D     25%      1           -               1             1          1           1            5
Set E     20%      -           1               1             -          -           2            4
Total     100%     2           3               3             6          2           4            20
1. The first consideration in the development of the test blueprint is the weight to be assigned
to higher order questions and lower order questions (that is, to educational objectives
at the higher and lower cognitive levels). This weight is used to allocate the number of
questions to be developed in each cell of the content and objective dimensions. In the
hypothetical case under consideration, the weight for lower order questions
(knowledge to application) is 40%, while that for higher order questions (analysis
to evaluation) is 60%. This means that 40% of the total questions should be lower order
questions while 60% are higher order questions. The learners in this case
are assumed to be at the senior secondary level of education. Also, an attempt should be
made, as above, to ensure that the questions are spread across all the levels of Bloom's
(1956) Taxonomy of Educational Objectives.
2. The blueprint is prepared by drawing a two-dimensional framework with the list of contents
arranged vertically (left column) and the objectives horizontally (top row), as shown in
Table 4.1 above.
3. Weights are assigned in percentages to both the content and objective dimensions, as desired
and as already stated earlier.
4. The decision on the total number of items to be set is the basis for determining the number
of items for each content area. For instance, in Table 4.1, Set A is weighted 15% and 20
items are to be generated in all. The total number of items for each content area is
therefore obtained thus:
- Set A, weight 15%: 15% of 20 items = 3 items
- Set B, weight 15%: 15% of 20 items = 3 items
- Set C, weight 25%: 25% of 20 items = 5 items
- Set D, weight 25%: 25% of 20 items = 5 items
- Set E, weight 20%: 20% of 20 items = 4 items.
The worked-out values are then listed against each content area at the extreme right
(Total column), corresponding to each particular content.
5. The same procedure is repeated for the objective dimension, just as above:
- Knowledge: weight 10% of 20 items = 2 items
- Comprehension: weight 15% of 20 items = 3 items
- Application: weight 15% of 20 items = 3 items
- Analysis: weight 30% of 20 items = 6 items
- Synthesis: weight 10% of 20 items = 2 items
- Evaluation: weight 20% of 20 items = 4 items.
Here also the worked-out values are listed against each objective in the last horizontal
row, alongside the provision for the total.
6. Finally, the items for each content are distributed to the relevant objectives in the
appropriate cells, as indicated in Table 4.1 above. The completed table of specification
then serves as a guide for constructing the test items. It should be noted that in the
table the knowledge, comprehension and application levels have 2, 3, and 3 items
respectively, that is, 2 + 3 + 3 = 8 items out of 20, representing 40% of the total test
items, while analysis, synthesis and evaluation have 6, 2 and 4 items respectively, that
is, 6 + 2 + 4 = 12 items out of 20, representing 60% of the total items.
7. The development of the table of specification is followed by item writing. Once the table of
specification is adhered to in the item writing, the items will have appropriate content
validity at the required level of difficulty. The table of specification is applicable both to
writing essay items (subjective questions) and to writing objective items (multiple choice
questions, matching sets, completion items, true/false items).
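The weight arithmetic in steps 4 and 5 can be sketched in a few lines of Python. The weights below are the hypothetical Table 4.1 values; the rounding rule is an assumption, since the module does not say how fractional allocations should be handled:

```python
# Allocate test items to content areas and objectives from percentage weights,
# following steps 4 and 5 above (hypothetical Table 4.1 values).
total_items = 20

content_weights = {"Set A": 15, "Set B": 15, "Set C": 25, "Set D": 25, "Set E": 20}
objective_weights = {"Knowledge": 10, "Comprehension": 15, "Application": 15,
                     "Analysis": 30, "Synthesis": 10, "Evaluation": 20}

def allocate(weights, total):
    """Return the number of items per category: weight% of the total."""
    return {name: round(pct / 100 * total) for name, pct in weights.items()}

content_items = allocate(content_weights, total_items)
objective_items = allocate(objective_weights, total_items)

# Both dimensions must account for every item in the test.
assert sum(content_items.values()) == total_items
assert sum(objective_items.values()) == total_items

print(content_items)    # Set A: 3, Set B: 3, Set C: 5, Set D: 5, Set E: 4
print(objective_items)  # Knowledge: 2, ..., Analysis: 6, ..., Evaluation: 4
```

Note that with awkward weights the rounded allocations may not sum exactly to the total; the asserts catch that case so the weights can be adjusted.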
Table 4.2: Hypothetical table of specification by test type (the five item-format columns are
shown generically as Format 1 to Format 5)

Content   Weight   Format 1   Format 2   Format 3   Format 4   Format 5   Total
Set A     15%      -          1          -          2          -          3
Set B     15%      -          1          -          2          -          3
Set C     25%      1          -          -          2          1          4
Set D     25%      1          -          3          1          1          6
Set E     20%      -          1          -          3          -          4
Total     100%     2          3          3          10         2          20
3. Use unambiguous language so that the demands of the item are clearly understood.
4. Endeavour to write the items at the appropriate levels of difficulty specified in the
table of specification. You may refer to Bloom's (1956) taxonomy of educational objectives
for the action verbs appropriate to each level of objective.
5. Give enough time to allow an average student to complete the task.
6. Build in a good scoring guide at the point of writing the test items.
7. Have the test exercises examined and critiqued by one or more colleagues, then subject
the items to scrutiny by relevant experts, including experts in measurement and evaluation
and specialists in the subject concerned. Incorporate the experts' critical comments when
modifying the items.
8. Review the items and select the best according to the laid-down table of specification/test
blueprint. Also associated with test development is statistical analysis: item analysis,
which is used to appraise the effectiveness of the individual items, and reliability
analysis. Both are treated in subsequent units. Item analysis and validity are determined
by trial-testing the developed items on a sample drawn from the population for which the
test is developed.
Self-Check Exercises
Dear Trainees! You have now completed your study of chapter four. Therefore, you are expected to answer
the following self-test questions.
UNIT FIVE
ASSEMBLING, REPRODUCING, ADMINISTERING, AND SCORING OF CLASSROOM TESTS
5.1 UNIT INTRODUCTION
In the last two units you learned about types of tests and how to construct them. Here you will learn
about the administration and scoring of classroom tests. You will learn how to ensure quality,
credibility and civility in test administration. Furthermore, you will learn how to score essay and
objective test items using various methods.
5.3 INTRODUCTION
Brainstorming
Think about arranging various items formats in a test. Which item format should appear first and
which should appear last?
Similarly, consider arranging different items within a particular format. How do you think it can be
done?
Think about giving directions for different test formats. How do you think this can be done?
In most cases, different item formats cannot be administered orally or easily written on the
chalkboard. This means that tests must be reproduced, and during reproduction care must be taken to
ensure that:
5.4 ARRANGING THE TEST ITEMS
It is essential that the tasks presented to students be as clear as possible. To achieve this, we
group all items of the same format together rather than interspersing them throughout the test. The
advantages of doing this are that:
Younger children may not realize that the first set of directions is applicable to all items of a
particular format and may become confused;
It makes it easier for the examinee to maintain a particular mental set rather than having to change
from one to another;
It makes it easier for the teacher to score the test, especially when scoring is done by hand.
In arranging item formats, due emphasis should be given to the complexity of the mental activity
demanded in answering them. Item formats should therefore be arranged so that they progress from
the simple to the complex: for instance, items that measure simple recall should precede those that
measure understanding and application.
According to Gronlund (1985), item formats can be arranged in the following way, which roughly
approximates the complexity of the instructional objectives measured. Hence:
Similarly, teachers should pay due attention to arranging the items within each item format. They
should group together those items dealing with the same instructional objective; when they do this,
teachers can ascertain which learning activities appear to be most readily understood by their
students. As a rule, items should be arranged so that they progress from easy to difficult: when a
test opens with easy items that almost everyone can answer correctly, even the less able students
are encouraged to do their best on the remaining items.
In some subjects, tests may include drawings or diagrams. In such cases the drawing should be
placed above the stem so as not to create a break or gap between the stem and the options.
Generally, the organization of the various test items in the final test should follow the
principles outlined above.
Self-Check Activity
1. Enumerate the advantage of arranging test items from the easy to the difficult within a test
2. What would happen if test items are presented in an interspersed fashion in a test rather than in a
group way?
3. Genuinely, evaluate yourself and your practice in arranging test formats and test items within a
specific test format.
5.5 WRITING TEST DIRECTIONS
Teachers should be aware of the significance of providing clear and concise directions. The
directions should convey clear information concerning what to do, how to do it, and where to record
answers. In other words, directions should tell students:
a) Each item format should have a specific set of directions: besides a general set of
instructions, a specific set of instructions must be provided for each particular item format.
For computational problems, we have to state the degree of precision required and the units
students are expected to use.
b) For objective tests at the elementary level, give the students examples and/or practice exercises so
that they will see exactly what and how they are to perform their tasks.
c) Students should be told how the test will be scored. Concerning this issue, they should be
informed about matters such as whether punctuation, spelling or other conditions will be taken
into consideration in scoring essay questions, or whether, in an arithmetic test, students will
receive part marks for showing a correct procedure even though they obtained an incorrect
answer, and the like.
d) Above the second grade level, all directions should be written out. For younger children, the
directions may be read aloud in addition to being printed and available to each student.
Teachers should also give directions orally to certain groups of students, such as slow
learners or students with reading problems.
Although it is not common practice in our educational system, elsewhere instructions are given
about guessing. Since guessing and faking are persistent sources of error in cognitively oriented
tests and affectively oriented tests respectively, various procedures have been devised to combat
these problems, including the application of a correction formula. However, the research evidence
suggests that students should be instructed to answer every item and that no correction for
guessing be applied.
Activity
1. Identify the common errors that you have made in writing clear and concise directions.
2. Write sample directions
General direction for exam to be scored by a machine
Direction for specific item formats
- True-False
- Matching
- Short Answer
- Essay
These conditions involve the following issues:
Activity
A marking scheme is prepared, together with expected answers to the questions, during test
preparation and the construction of the test items. The marking scheme takes into consideration
the facts required to answer the questions and the extent to which the language used meets the
requirements of the subject. The actual marking is done following the procedures for scoring essay
questions (for essay questions) and for scoring objective items (for objective items).
The construction and scoring of essay questions are interrelated processes that require attention
if a valid and reliable measure of achievement is to be obtained. In the essay test the examiner is
an active part of the measurement instrument; therefore, variability within and between examiners
affects the resulting scores of examinees. This variability is a source of error which affects the
reliability of the essay test if not adequately controlled. Hence, for the essay test result to
serve a useful purpose as a valid measurement instrument, a conscious effort is made to score the
test objectively: appropriate methods are used to minimize the effect of personal biases and
idiosyncrasies on the resulting scores, and standards are applied to ensure that only relevant
factors indicated in the course objectives and called for during the test construction are
considered during the scoring. There are two common methods of scoring essay questions. These are:
I. The analytic (point) scoring method
This method is generally satisfactory for scoring restricted response questions. This is made
possible by the limited number of characteristics elicited by a single answer, which allows the
degree of quality to be defined precisely enough to assign point values. With analytic scoring it
is also possible to identify the particular weaknesses or strengths of each examinee. It is
desirable to rate each aspect of the item separately; this provides greater objectivity and
increases the diagnostic value of the result.
II. The global/holistic rating method
In this method the examiner first sorts the responses into categories of varying quality based on
his general or global impression on reading each response. The standard of quality helps to
establish a relative scale, which forms the basis for ranking responses from those of the poorest
quality to those of the highest quality. Usually between five and ten categories are used, with
each pile representing a degree of quality that determines the credit to be assigned. For example,
where five categories are used, the responses are awarded five letter grades (A, B, C, D and E)
and sorted into A-quality, B-quality, C-quality, D-quality and E-quality piles. There is usually a
need to re-read the responses and re-classify any that were misclassified.
This method is ideal for extended response questions, where relative judgments (not exact
numerical scores) are made concerning the relevance of ideas, the organization of the material,
and similar qualities evaluated in answers to extended response questions. Using this method
requires considerable skill and time in determining the standard response for each quality
category. It is desirable to rate each characteristic separately; this provides greater
objectivity and increases the diagnostic value of the results. The following are procedures for
scoring essay questions objectively to enhance reliability.
Prepare the marking scheme or ideal answer or outline of expected answer immediately after
constructing the test items and indicate how marks are to be awarded for each section of the
expected response.
Use the scoring method that is most appropriate for the test item.
Decide how to handle factors that are irrelevant to the learning outcomes being measured. These
factors may include legibility of handwriting, spelling, sentence structure, punctuation and
neatness. These factors should be controlled when judging the content of the answers. Also decide
in advance how to handle the inclusion of irrelevant materials (uncalled for responses).
Score only one item across all the scripts at a time. This helps to control the "halo" effect in
scoring.
Evaluate the responses anonymously, without knowing whose script you are scoring. This helps to
control bias in scoring the essay questions.
Try out the marking scheme (scoring key) before actual scoring by scoring a random sample of
examinees' actual responses. This gives a general idea of the quality of the responses to be
expected and may call for a revision of the scoring key before actual scoring commences.
Make comments during the scoring of each essay item. These comments act as feedback to
examinees and a source of remediation for both examinees and examiner.
Obtain two or more independent ratings if important decisions are to be based on the results. The
results of the different scorers should be compared and the ratings moderated to resolve
discrepancies, for more reliable results.
Unlike the essay test, the objective test can be scored with ease by various methods. Various
techniques are used to speed up the scoring, and the technique to use sometimes depends on the
type of objective test. Some of these techniques are as follows:
i. Machine Scoring
ii. Stencil Scoring
iii. Manual Scoring
UNIT SIX
SUMMARIZING AND INTERPRETING TEST SCORES
Introduction
Dear Trainees! Welcome to the sixth chapter of this module. In the previous unit, you learnt
how to administer and score tests. In this unit, you will learn how to interpret test scores using
different statistical tools such as measures of central tendency, measures of dispersion, measures
of relative position and measures of association. In addition, you will learn about standard
scores, covering the standard deviation and the normal curve, the z-score and the T-score.
Objectives
Dear learner, by the time you finish this unit, you will be able to:
Interpret classroom test scores in criterion-referenced or norm-referenced terms,
Calculate the average result of a given class of test scores,
Decide if there is a relationship between two factors,
Convert raw scores to z-scores and T-scores,
Compare one's score with the score of a group,
Convert from one standard score to another, and
Use standard scores in interpreting test scores.
6.1 Methods of Interpreting Test Scores
Test interpretation is a process of assigning meaning and usefulness to the scores obtained from a
classroom test. This is necessary because a raw score obtained from a test rarely has meaning
standing on its own. For instance, a score of 50% in one mathematics test cannot be said to be
better than a score of 40% obtained by the same testee in another mathematics test. Test scores on
their own lack a true zero point and equal units. Moreover, they are not based on the same standard
of measurement, and as such no meaning can be read into the scores on which academic
and psychological decisions might be based. To compensate for these missing properties and to
make test scores more readily interpretable, various methods of expressing test scores have been
devised to give meaning to a raw score. Generally, a score is given meaning either by converting
it into a description of the specific tasks that the learner can perform, or by converting it into some
type of derived score that indicates the learner's relative position in a clearly defined reference
group. The former method of interpretation is referred to as criterion-referenced interpretation,
while the latter is referred to as norm-referenced interpretation.
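Norm-referenced interpretation relies on derived scores such as the z-score and T-score mentioned in the unit objectives. The following is a minimal sketch of the standard conversions, z = (X − mean)/SD and T = 50 + 10z, both treated later in this unit; the class scores are illustrative, and the population standard deviation (dividing by N) is assumed:

```python
# Convert a raw score to a z-score and a T-score relative to a reference
# group: the basic step in norm-referenced interpretation.
def mean(scores):
    return sum(scores) / len(scores)

def std_dev(scores):
    """Population standard deviation (divides by N, not N - 1)."""
    m = mean(scores)
    return (sum((x - m) ** 2 for x in scores) / len(scores)) ** 0.5

def z_score(x, scores):
    return (x - mean(scores)) / std_dev(scores)

def t_score(x, scores):
    return 50 + 10 * z_score(x, scores)  # T = 50 + 10z

# Illustrative class scores; interpreting a raw score of 60 in this group:
group = [40, 45, 50, 55, 60]
print(round(z_score(60, group), 2))  # 1.41
print(round(t_score(60, group), 1))  # 64.1
```

A raw score equal to the group mean maps to z = 0 and T = 50, which is what makes the derived scores comparable across tests.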
Learning Task1
You are a teacher and you have been giving your students different tests/exams.
What do you do with the test results? How do you interpret them?
_________________________________________________________________
_________________________________________________________________
_________________________________________________________
In criterion-referenced interpretation, because each test item is tied to a clearly specified
task, enough items are used for each interpretation to enable dependable and informed decisions
concerning the types of tasks a learner can perform.
Learning Task 2
Dear Trainees, before you read the sections below, try to define the following
terms.
1. Statistics
2. Descriptive Statistics
3. Inferential Statistics
____________________________________________________________
____________________________________________________________
____________________________________________________________
The purpose of descriptive statistical analysis is to describe the data that you have. Bear in
mind that descriptive statistics do just what they say: they describe the data that you have.
They do not tell you anything about data that you do not have. For example, if you carry out a
study and find that the average number of times the students in your study are pecked by ducks is
once per year, you cannot conclude that all students are pecked by ducks once per year. That would
be going beyond the information you have.
One of the major purposes of statistics in test use is to allow us to describe and summarize data
(for example, test scores) in efficient and useful ways.
1. Describe
2. Interpret
3. Pass judgment
What is the average / most popular / mean / typical / "middle" / most common data value?
A measure of central location is a value (i.e. a single number) used to represent where the
majority of the data values lie for a given random variable.
Three commonly used central location measures are:
1. Mean
2. Median
3. Mode
1. The Mean (X̄)
The mean is the arithmetic average of the observed scores. It is the most commonly used, and the
most popular and useful, measure of central location.
Mean from ungrouped data:
1. Mean (X̄) = ∑X / N
where ∑ = summation, X = raw score, N = total number of students, X̄ = mean.
Mean from an ungrouped frequency distribution:
2. Mean (X̄) = ∑fX / N
where f = frequency.
Example: for the scores 5, 3, 8, −2 and 5:
Mean (X̄) = (5 + 3 + 8 + (−2) + 5) / 5 = 19/5 = 3.8
Table 6.1
Score (x):     2   4   6   8   10
Frequency (f): 5   4   3   2   1

Mean (X̄) = ∑fX / N = [(2 × 5) + (4 × 4) + (6 × 3) + (8 × 2) + (10 × 1)] / (5 + 4 + 3 + 2 + 1)
         = (10 + 16 + 18 + 16 + 10) / 15 = 70/15 ≈ 4.67
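The worked example can be checked with a few lines of Python, using the values from Table 6.1:

```python
# Mean from an ungrouped frequency distribution: X-bar = sum(f * X) / sum(f).
scores      = [2, 4, 6, 8, 10]   # X values from Table 6.1
frequencies = [5, 4, 3, 2, 1]    # f values

sum_fx = sum(f * x for x, f in zip(scores, frequencies))  # 70
n      = sum(frequencies)                                 # 15 students
mean   = sum_fx / n

print(round(mean, 2))  # 4.67
```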
ADVANTAGES OF USING THE MEAN
It is a single number
It takes every data point into account
It is simple to understand and is understood by most people
It is an unbiased measure, meaning that it neither overestimates nor underestimates the
actual central value
DISADVANTAGES OF USING THE MEAN
It can only be calculated for quantitative variables.
It is affected by extreme values, commonly known as outliers (extreme values, either extremely
small or extremely large, relative to the majority of the other data values).
Exercise 6.1: Calculate the mean of the following frequency distribution.
Score (x):     7   4   9   12   5
Frequency (f): 3   2   5   2    4
2. The Median
The median of a random variable is the value which divides the ranked data into two equal
halves, i.e. it is the middle number of an ordered set of data.
The median is valid for quantitative random variables only.
It is the score that divides a score distribution into two equal parts.
It is most appropriate when dealing with a small number of students.
Advantages of the median
It is easy to understand.
It is not affected by outliers
One of its disadvantages is that it can only be calculated for quantitative variables.
Example 1: Here are the scores of six people: 14, 11, 8, 6, 7, 9. Calculate the median.
Ordered: 6, 7, 8, 9, 11, 14. Since N = 6 is even, the median is the average of the two middle
scores: Mdn = (8 + 9)/2 = 8.5.
Example 2: Calculate the median of the following scores: 14, 11, 9, 6, 8, 7, 5.
Ordered: 5, 6, 7, 8, 9, 11, 14. Since N = 7 is odd, the median is the (7 + 1)/2 = 4th score:
Mdn = 8.
Example 3: Ages in years of seven security staff.
Table 6.2
Age: 23, 23, 28, 30, 38, 58, 63
Answer: N = 7, so the median is the (7 + 1)/2 = 4th value: Mdn = 30.
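The three examples above can be verified with a short Python function implementing the (N + 1)/2 rule:

```python
# Median: middle value of the ordered scores; for an even N,
# the average of the two middle values.
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd N: the (N + 1)/2-th score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: mean of the middle pair

print(median([14, 11, 8, 6, 7, 9]))          # Example 1: 8.5
print(median([14, 11, 9, 6, 8, 7, 5]))       # Example 2: 8
print(median([23, 23, 28, 30, 38, 58, 63]))  # Example 3: 30
```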
3. The Mode
It is the most frequent value. Mode means most "popular". The score(s) with the highest frequency
is/are the mode of the distribution. The mode is valid for both quantitative and qualitative random
variables. A set of data may have one mode, or two or more modes.
Advantages of the mode
1. It is easy to calculate.
2. It is valid for all data types.
3. It is not affected by outliers.
4. It is most appropriate when numerical values in a data set are labels for categories (nominal).
Disadvantages of the mode
1. There could be more than one mode, which could lead to confusion since no single value
is then representative of the data.
2. It could be a random event and not truly representative of the data, especially in a relatively
small data set.
Calculations:
Put the individual values in numerical order, from small to large.
Take the value which occurs most frequently: this is the mode.
Example 1: Number of vacation days taken by 10 security personnel last year.
Table 6.3

Officer        1  2  3  4   5   6   7   8   9   10
Vacation days  3  5  6  10  10  10  12  15  15  15
Example 3:
Table 6.4
Score(x) 10 11 12 13
Frequency(f) 3 4 1 5
Mode=13
Note: A distribution may have one mode, two modes, multiple modes or no mode.
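Python's `statistics.multimode` (Python 3.8+) reports all modes, which makes the bimodal case in Table 6.3 explicit. A small check using the data above:

```python
import statistics

# Example 1 (Table 6.3): vacation days; 10 and 15 each occur three times,
# so the distribution is bimodal.
days = [3, 5, 6, 10, 10, 10, 12, 15, 15, 15]
print(statistics.multimode(days))        # [10, 15]

# Example 3 (Table 6.4): the score with the highest frequency is the mode.
raw = [10]*3 + [11]*4 + [12]*1 + [13]*5
print(statistics.multimode(raw))         # [13]
```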
Exercise 6.2
1. Calculate the mean, median, and mode for the following scores.
Table 6.5

Score (X)   Frequency (f)   Cumulative frequency (Cf)   fX
15          1               120                         15
14          2               119                         28
13          3               117                         39
12          6               114                         72
11          12              108                         132
10          15              96                          150
9           22              81                          198
8           31              59                          248
7           18              28                          126
6           6               10                          36
5           2               4                           10
4           2               2                           8
            ∑f = 120
2. A survey of the ages of residents of Teacher's homes yielded the following measures of central
tendency: Mean = 70, Median = 78 and Mode = 83. In which direction is the distribution likely
to be skewed?
Shapes of distributions
Fig 2 Skewed distribution
Skewness describes the direction in which scores pile up in a distribution, and in testing it
gives a general indication of the level of difficulty of a test.
In a symmetrical distribution the mean, median, and mode are equal and the discrimination power is
high. For graph "a" in Fig 3 above, the discrimination power is greater than 0.3, while for
graphs "e" and "f" in Fig 3, the discrimination power is less than 0.3.
6.2.2 Measures of Variability
Distribution I 37 37 37 37 37
Distribution II 33 36 37 38 41
Note: the distributions above have the same arithmetic mean = 37, but there is a marked difference
in how the scores spread around it.
Therefore, the topic of dispersion of scores is concerned with measures which show the
amount of variability among data.
1. The Range
2. The Variance
3. The Standard Deviation
1. The range
The range is the difference between the highest and the lowest scores in a distribution. The higher
the value of the range, the greater the difference among the students in academic achievement.
However, it is a crude measure of variability because it depends on only the two extreme scores.
2. The variance
Variance is the arithmetic mean of the squared deviations of individual scores from the mean. It is
expressed in squared units. It shows a spread or dispersion of scores i.e. a tendency for any set of
scores to depart from a central point or any other point.
Ungrouped score distribution
Definitional formulae:

σ² = ∑(X − X̄)² / N        or, for a frequency distribution,        σ² = ∑f(X − X̄)² / N

Where σ² = variance
X = raw score
X̄ = mean of the distribution
f = frequency
N = total number of students

Computational formula:   σ² = [N∑fX² − (∑fX)²] / N²
3. The standard deviation
The standard deviation is a measure of how much a set of scores varies on the average around the
mean of the scores. In other words, it reveals how closely scores tend to cluster about the mean.
The standard deviation is the positive square root of the variance and measures the extent to which
scores tend to deviate from the mean:

σ = √[∑(X − X̄)² / N]         or         σ = √[∑f(X − X̄)² / N]

- The larger the σ, the greater the differences in academic achievement; the smaller the
standard deviation, the less the scores tend to vary from the mean.
Example 1: Given the following data calculate the variance and the standard deviation.
Table 6.6
X    f    Cf    fX     X²    fX²
8    31   59    248    64    1984
9    21   80    189    81    1701
7    18   28    126    49    882
6    6    10    36     36    216
5    2    4     10     25    50
4    2    2     8      16    32
σ² = [N∑fX² − (∑fX)²] / N²
   = [119(9805) − (1053)²] / 119²
   = 57986 / 14161
   = 4.09

σ = √4.09 = 2.02
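A quick check of the computational formula using the column totals from the worked example (N = 119, ∑fX = 1053, ∑fX² = 9805):

```python
import math

# Computational formula: variance = (N*sum(fX^2) - (sum(fX))^2) / N^2,
# using the column totals from the worked example.
n, sum_fx, sum_fx2 = 119, 1053, 9805

variance = (n * sum_fx2 - sum_fx ** 2) / n ** 2
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))   # 4.09 2.02
```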
Exercise 6.3
Calculate the variance and the standard deviation of the following frequency distribution.

Score (X)      7   4   9   12   5
Frequency (f)  3   2   5   2    4
Learning Task 3
Dear Trainees, before you read the sections below, try to answer the
following questions.
1. If a student scored 82% in Physics and 74% in Mathematics, can we say that the
student's performance is better in Physics than in Mathematics? How?
____________________________________________________________
____________________________________________________________
The measures of position or location are used to locate the relative position of a specific data value
in relation to the rest of the data. The most popular measures of position are:
Percentiles
Z- score
Deciles
Quartiles
A. Percentiles and Percentile Ranks
Percentiles and percentile ranks are frequently used as indicators of performance in both the
academic and corporate worlds. Percentiles and percentile ranks provide information about how a
person or thing relates to a larger group. Relative measures of this type are often extremely
valuable to researchers employing statistical techniques.
Percentiles
A percentile is the point in a distribution at or below which a given percentage of scores is found.
OR The value below which P% of the values fall is called the Pth percentile
For example, the 5th percentile is denoted by P5, the 10th by P10 and 95th by P95.
Percentile Rank
A percentile rank is used to determine where a particular score or value fits within a broader
distribution. For example: A student receives a score of 75 out of 100 on an exam and wishes to
determine how her score compares to the rest of the class. She calculates a percentile rank for a
score of 75 based on the reported scores of the entire class. Her percentile rank in this example
would be 80, meaning that 80 percent of scores on the exam were at or below 75.
Notes:
I. A Percentile is a value in the data set.
II. The percentile rank of a given value is a percent that indicates what percentage of the data
values are smaller than that value.
III. Percentiles are not the same as percentages.
Calculation of Percentiles and Percentile Ranks
A. In case of ranked raw data:
The (approximate) value of the kth percentile, Pk, is calculated by the formula

Pk = value of the (kn/100)th term

where k is the percentile one wishes to calculate and
n is the total number of values in the distribution.

The percentile rank (PR) of a given value, xi, is obtained by the formula

PR of xi = (number of values less than xi / total number of values) × 100%
Example 1: The following data is a score of 70 students in Measurement and Evaluation.
Table 6.7
Score(X) Frequency(f) Cumulative frequency(Cf)
95 8 70
92 6 62
85 11 56
84 9 45
80 12 36
75 6 24
70 5 18
68 13 13
A. P90 = value of the (kn/100)th term
       = value of the (90 × 70/100)th term
       = value of the 63rd term
       = 95

B. PR(70) = (number of values less than 70 / total number of values) × 100%
          = (13/70) × 100%
          = 18.6 %
This means that 18.6 percent of the scores on the exam were below 70.
C. PR(84) = (number of values less than 84 / total number of values) × 100%
          = (36/70) × 100%
          = 51.4 %
This means that 51.4 percent of the scores on the exam were below 84.
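Both calculations can be verified in Python by expanding Table 6.7 into the 70 individual scores:

```python
# Percentile and percentile-rank calculations for Table 6.7, expanding the
# frequency distribution into a ranked list of the 70 individual scores.
table = [(68, 13), (70, 5), (75, 6), (80, 12), (84, 9), (85, 11), (92, 6), (95, 8)]
data = sorted(x for x, f in table for _ in range(f))

# P90 = value of the (k*n/100)-th ordered term.
k, n = 90, len(data)
p90 = data[k * n // 100 - 1]          # the 63rd term (1-based)

# PR(x) = (number of values less than x / total number of values) * 100.
pr_84 = 100 * sum(v < 84 for v in data) / n

print(p90, round(pr_84, 1))   # 95 51.4
```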
Exercise 6.4
B. The Z – Scores
The Z – score is the simple standard score which expresses test performance simply and directly
as the number of standard deviation units a raw score is above or below the mean. The Z-score is
computed by using the formula.
Z-score = (X − X̄) / SD
Where
X = any raw score
X = arithmetic mean of the raw scores
SD = standard deviation
When the raw score is smaller than the mean, the Z-score is negative (−), which can
cause serious problems in test interpretation if not well noted. Hence Z-scores are often transformed
into a standard score system that uses only positive values.
Example 1: For a student with a raw score of 30, calculate the Z-score if the mean is 40 and SD = 4.

Z-score = (X − X̄) / SD = (30 − 40) / 4 = −2.5

When we change this score into a standard score (T-score) it becomes
T-score = 50 + 10(z) = 50 + 10(−2.5) = 50 − 25 = 25
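The conversion chain from raw score to Z-score to T-score for this example, as a small Python sketch:

```python
# Z- and T-score for the worked example: raw score 30, mean 40, SD 4.
x, mean, sd = 30, 40, 4

z = (x - mean) / sd          # -2.5: the score lies 2.5 SDs below the mean
t = 50 + 10 * z              # the T-score shifts z onto a positive scale

print(z, t)   # -2.5 25.0
```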
6.2.4 Measures of Relationship
Measures of association provide a means of summarizing the size of the association between two
variables. Most measures of association are scaled so that they reach a maximum numerical value
of 1 when the two variables have a perfect relationship with each other. They are also scaled so
that they have a value of 0 when there is no relationship between two variables. While there are
exceptions to these rules, most measures of association are of this sort. Some measures of
association are constructed to have a range of only 0 to 1; other measures have a range from -1 to
+1. The latter provide a means of determining whether the two variables have a positive or negative
association with each other.
Correlation
Chi-square
A. Correlation
A correlation coefficient is used to measure the strength of the relationship between numeric
variables (e.g., weight and height)
If the coefficient is between 0 and 1, as one variable increases, the other also increases. This is
called a positive correlation. For example, height and weight are positively correlated because
taller people usually weigh more.
If the correlation coefficient is between -1 and 0, as one variable increases the other decreases.
This is called a negative correlation. For example, age and hours slept per night are negatively
correlated because older people usually sleep fewer hours per night.
There are two common methods of computing correlation coefficient. These are:
Pearson Product-Moment Correlation.
Spearman Rank-Difference Correlation
Pearson Product-Moment Correlation:
This is the most widely used method and the coefficient is denoted by the symbol r. This method
is favoured when the number of scores is large and it’s also easier to apply to large group. The
computation is easier with ungrouped test scores and would be illustrated here. The computation
with grouped data appears more complicated and can be obtained from standard statistics test book.
The following steps listed below will serve as guide for computing a product-moment correlation
coefficient (r) from ungrouped data.
Step 1 - Begin by writing the pairs of score to be studied in two columns. Make certain that the
pair of scores for each examinee is in the same row. Call one Column X and the other Y
Step 2 - Square each of the entries in the X column and enter the result in the X2 column
Step 3 - Square each of the entries in the Y column and enter the result in the Y2 column
Step 4 - In each row, multiply the entry in the X column by the entry in the Y column, and enter the
result in the XY column
Step 5 - Add the entries in each column to find the sum (∑) of each column.
r = [∑XY/N − (∑X/N)(∑Y/N)] / √{[∑X²/N − (∑X/N)²] [∑Y²/N − (∑Y/N)²]}

OR r = (∑XY/N − MX·MY) / (SDX·SDY)
where
MX = mean of scores in X column
MY= mean of scores in Y column
SDX= standard deviation of scores in X column
SDY = standard deviation of scores in Y column
Example 1: Table 6.8: Computing Pearson Product-Moment Correlation for a pair of the
Hypothetical Ungrouped Data
Student Maths(X) Physics(Y) X2 Y2 XY
Spearman Rank-Difference Correlation:
This method is satisfactory when the number of scores to be correlated is small (less than 30). It is
easier to compute with a small number of cases than the Pearson Product-Moment Correlation. It
is a simple, practical technique for most classroom purposes. To use the Spearman Rank-Difference
Method, the steps listed below should be taken.
Computing Procedure for the Spearman Rank-Difference Correlation
Step 1 -Arrange pairs of scores for each examinee in columns (Columns 1 and 2)
Step 2 - Rank examinees from 1 to N (number in group) for each set of scores
Step 3 - Find the difference (D) in ranks by subtracting the rank in the right-hand column from
the rank in the left-hand column
Step 4- Square each difference in rank to obtain difference squared (D2)
Step 5 - Sum the squared differences to obtain ∑D²
Rows 16–20 of the ranked data (the full table is given as Table 7.2):

Student  Test X score  Test Y score  Rank X  Rank Y   D    D²
16       79            46            16      18       -2     4
17       77            44            17      20       -3     9
18       76            45            18      19       -1     1
19       75            62            19       9       10   100
20       74            77            20       1       19   361

                                                   ∑D² = 514

ρ (rho) = 1 − 6∑D² / [N(N² − 1)]
        = 1 − (6 × 514) / [20(400 − 1)]
        = 1 − 3084 / 7980
        = 0.61
Self-Check Exercises
Dear Trainees! You have now completed your study of chapter six. Therefore, you are expected
to answer the following self-test questions.
UNIT SEVEN
RELIABILITY AND VALIDITY OF A TEST
INTRODUCTION
Dear trainee, in this unit you will learn about test reliability and the methods of estimating
reliability. Specifically, you will learn about test retest method, equivalent form method, split half
method and Kuder Richardson Method. Furthermore, you will learn about the factors influencing
reliability measures such as length of test, spread of scores, difficulty of test and objectivity. In
addition to this, in this unit, you will also learn about validity, which is the single most important
criteria for judging the adequacy of a measurement instrument. You will learn types of validity
namely content, criterion and construct validity. Finally, you will learn validity of criterion-
referenced mastery tests and factors influencing validity.
OBJECTIVES
By the time you finish this unit you will be able to:
Define reliability of a test
State the various forms of reliability
Explain the factors that influence reliability measures
Compare and contrast the different forms of estimating reliability
Define validity as well as content, criterion and construct validity
Differentiate between content, criterion and construct validity
Describe how each of the three types of validity are determined
Interpret different validity estimates
Identify the different factors that affect validity
Assess different test items based on the principle of validity
7.1 Test Reliability
What is meant by the reliability of a test or an assessment? What are the types of reliability?
What are the major factors that affect reliability measures of a test?
____________________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
Dear trainee, in the previous units you grasped the concepts of test administration and
assembling, and how essential test administration, assembling and scoring are in course
assessment. In this part you are going to gain good insight into reliability, the types of
reliability and the factors that affect reliability.
Reliability of a test may be defined as the degree to which a test is consistent, stable, dependable
or trustworthy in measuring what it is measuring. This definition implies that the reliability of a
test tries to answer questions like: How can we rely on the results from a test? How dependable
are scores from the test? How well are the items in the test consistent in measuring whatever it is
measuring? In general, reliability of a test seeks to find if the ability of a set of testees are
determined based on testing them two different times using the same test, or using two parallel
forms of the same test, or using scores on the same test marked by two different examiners, will
the relative standing of the testees on each of the pair of scores remain the same.
Reliability refers to the accuracy, consistency and stability with which a test measures whatever it
is measuring. The more the pair of scores observed for the same testee varies from each other, the
less reliable the measure is. The variation between this pair of scores is caused by numerous factors
other than the characteristic being measured that influence test scores.
Such extraneous factors introduce a certain amount of error into all test scores. Thus, methods of
determining reliability are essential means of determining how much error is present under different
conditions. Hence, the more consistent our test results are from one measurement to another, the
less error there will be and, consequently, the greater the reliability.
7.1.1 Types of Reliability Measures
There are different types of reliability measures, and these measures are estimated by different
methods. The chief methods of estimating reliability measures are illustrated in Table 7.1.
Table 7.1: Methods of Estimating Reliability
Method                    Type of Reliability Measure       Procedure
Test-retest method        Measure of stability              Give the same test twice to the same group,
                                                            with any time interval between tests
Equivalent-forms method   Measure of equivalence            Give two forms of the test to the same group
                                                            in close succession
Split-half method         Measure of internal consistency   Give the test once. Score two equivalent
                                                            halves (e.g. odd- and even-numbered items);
                                                            correct the reliability coefficient to fit the
                                                            whole test by the Spearman-Brown formula
Kuder-Richardson method   Measure of internal consistency   Give the test once. Score the total test and
                                                            apply the Kuder-Richardson formula
The test-retest method provides a measure of stability. How long the time interval should be
between tests is determined largely by the use to be made of the results. If the results of both
administrations of the test are highly stable, the testees whose scores are high on one
administration of the test will tend to score high on the other administration, while the other
testees will tend to stay in the same relative positions on both administrations. Such stability
would be indicated by a large correlation coefficient.
An important factor in interpreting measures of stability is the time interval between tests. A short
time interval such as a day or two inflates the consistency of the result since the testees will
remember some of their answers from the first test to the second. On the other hand, if the time
interval between tests is long about a year, the results will be influenced by the instability of the
testing procedure and by the actual changes in the learners over a period of time. Generally, the
longer the time interval between test and retest, the more the results will be influenced by changes
in the learners’ characteristics being measured and the smaller the reliability coefficient will be.
The split-half method provides a measure of internal consistency. The coefficient indicates the
degree to which equivalent results are obtained from the two halves of the test. The reliability
of the full test is usually obtained by applying the Spearman-Brown formula.
That is, Reliability of full test = (2 × reliability on ½ test) / (1 + reliability on ½ test)
The split-half method, like the equivalent forms method indicates the extent to which the sample
of test items is a dependable sample of the content being measured. In this case, a high correlation
between scores on the two halves of a test denotes the equivalence of the two halves and
consequently the adequacy of the sampling. Also like the equivalent forms method, it tells nothing
about changes in the individual from one time to another.
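A small sketch of the Spearman-Brown correction described above; the half-test correlation of 0.64 is an illustrative figure, not taken from the text:

```python
# Spearman-Brown correction: estimate full-test reliability from the
# correlation between two half-tests. The 0.64 value is illustrative.
def spearman_brown(r_half):
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.64), 2))   # 0.78
```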
The Kuder-Richardson (KR-21) estimate is obtained from the formula

Reliability = [K/(K − 1)] × [1 − M(K − M)/(K·S²)]

Where:
K = the number of items in the test
M = the mean of the test scores
S² = the variance of the test scores.
This method of reliability estimation tests whether the items in the test are homogeneous. In other
words, it seeks to know whether each test item measures the same quality or characteristic as
every other. If this is established, then the reliability estimate will be similar to that provided by
the split-half method. On the other hand, if the test lacks homogeneity, an estimate smaller than
the split-half reliability will result.
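A sketch of the KR-21 computation; the test statistics used (K = 40 items, mean 28, variance 16) are hypothetical values chosen for illustration:

```python
# Kuder-Richardson (KR-21) estimate from K (number of items), M (mean score)
# and s2 (variance of the scores). The numbers below are illustrative only.
def kr21(k, m, s2):
    return (k / (k - 1)) * (1 - m * (k - m) / (k * s2))

print(round(kr21(40, 28, 16), 3))   # 0.487
```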
The Kuder-Richardson method and the Split-half method are widely used in determining reliability
because they are simple to apply. Nevertheless, the following limitations restrict their value. The
limitations are:
They are not appropriate for speeded tests, for which the test-retest or equivalent-forms
methods give better estimates.
They, like the equivalent-forms method, do not indicate the constancy of a testee's responses
from day to day. Only the test-retest procedure indicates the extent to which test
results are generalizable over different periods of time.
They are, however, adequate for most teacher-made tests, because these are usually power tests.
Computing Procedure for the Spearman Rank-Difference Correlation
Step 1. Arrange pairs of scores for each examinee in columns (Columns 1 and 2)
Step 2. Rank examinees from 1 to N (number in group) for each set of scores
Step 3. Find the difference (D) in ranks by subtracting the rank in the right-hand column from
the rank in the left-hand column
Step 4. Square each difference in rank to obtain difference squared (D2)
Step 5. Sum the squared differences to obtain ∑D²
Table 7.2: Computing Spearman Rank Difference Correlation for a Pair of Hypothetical
Data
Student Hydraulics Pneumatics Hydraulics Pneumatics D D2
Score Rank
Number Score Rank (RH-RP)
1 98 76 1 2 -1 1
2 97 75 2 3 -1 1
3 95 72 3 4 -1 1
4 94 70 4 5 -1 1
5 93 68 5 6 -1 1
6 91 66 6 7 -1 1
7 90 64 7 8 -1 1
8 89 60 8 10 -2 4
9 88 58 9 11 -2 4
10 87 57 10 12 -2 4
11 86 56 11 13 -2 4
12 84 54 12 14 -2 4
13 83 52 13 15 -2 4
14 81 50 14 16 -2 4
15 80 48 15 17 -2 4
16 79 46 16 18 -2 4
17 77 44 17 20 -3 9
18 76 45 18 19 -1 1
19 75 62 19 9 10 100
20 74 77 20 1 19 361
∑D² = 514

ρ (rho) = 1 − 6∑D² / [N(N² − 1)]
        = 1 − (6 × 514) / [20(400 − 1)]
        = 1 − 3084 / 7980
        = 0.61
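The same computation can be reproduced in Python from the two rank columns of Table 7.2:

```python
# Spearman rank-difference correlation for Table 7.2: D is the difference
# between each student's Hydraulics rank and Pneumatics rank.
hyd_rank = list(range(1, 21))
pne_rank = [2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 19, 9, 1]

n = len(hyd_rank)
sum_d2 = sum((h - p) ** 2 for h, p in zip(hyd_rank, pne_rank))   # 514

rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rho, 2))   # 514 0.61
```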
The following steps will serve as a guide for computing a product-moment correlation
coefficient (r) from ungrouped data.
Step 1. Begin by writing the pairs of score to be studied in two columns. Make certain that the pair
of scores for each examinee is in the same row. Call one Column X and the other Y
Step 2. Square each of the entries in the X column and enter the result in the X2 column
Step 3. Square each of the entries in the Y column and enter the result in the Y2 column
Step 4. In each row, multiply the entry in the X column by the entry in the Y column, and enter
the result in the XY column
Step 5. Add the entries in each column to find the sum of each column.
r = [∑XY/N − (∑X/N)(∑Y/N)] / √{[∑X²/N − (∑X/N)²] [∑Y²/N − (∑Y/N)²]}

OR r = (∑XY/N − MX·MY) / (SDX·SDY)
Where:
MX = Mean of scores in X column
MY = Mean of scores in Y column
SDX = Standard deviation of scores in X column
SDY = standard deviation of scores in Y column
Table 7.3: Computing Pearson Product-Moment Correlation for a pair of the Hypothetical
Ungrouped Data
Student Hydraulics(X) Pneumatics(Y) X2 Y2 XY
No
1 98 76 9604 5776 7448
2 97 75 9409 5625 7275
3 95 72 9025 5184 6840
4 94 70 8836 4900 6580
5 93 68 8649 4624 6324
6 91 66 8281 4356 6006
7 90 64 8100 4096 5760
8 89 60 7921 3600 5340
9 88 58 7744 3364 5104
10 87 57 7569 3249 4959
11 86 56 7396 3136 4816
12 84 54 7056 2916 4536
13 83 52 6889 2704 4316
14 81 50 6561 2500 4050
15 80 48 6400 2304 3840
16 79 46 6241 2116 3634
17 77 44 5929 1936 3388
18 76 45 5776 2025 3420
19 75 62 5625 3844 4650
20 74 77 5476 5929 5698
N = 20    ∑X = 1717    ∑Y = 1200    ∑X² = 148487    ∑Y² = 74184    ∑XY = 103984
Substituting these values into the formula gives a correlation between the Hydraulics (X) and
Pneumatics (Y) scores of r = 0.63.
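The r = 0.63 result can be reproduced in Python directly from the raw Table 7.3 scores:

```python
import math

# Pearson product-moment correlation for the Table 7.3 scores, using the
# raw-score computational form of the formula.
x = [98, 97, 95, 94, 93, 91, 90, 89, 88, 87, 86, 84, 83, 81, 80, 79, 77, 76, 75, 74]
y = [76, 75, 72, 70, 68, 66, 64, 60, 58, 57, 56, 54, 52, 50, 48, 46, 44, 45, 62, 77]

n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 2))   # 0.63
```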
The correlation coefficient (r) is never more than +1.00 and never less than -1.00. The two extreme
values represent the same degree or level of relationship, but while the first indicates a direct
relation, the second indicates an inverse relation.
A guide for interpreting correlation coefficient (r) values obtained by correlating any two sets of
test scores is presented in Table 7.4 below:
Table 7.4: Interpretation of r-values
Correlation(r) Value Interpretation
+1.00 Perfect positive relationship
+0.80 – +0.99 Very high positive relationship
+0.60 – +0.79 High positive relationship
+0.40 – +0.59 Moderate positive relationship
+0.20 – +0.39 Low positive relationship
+0.01– +0.19 Negligible relationship
0.00 No relationship at all
-0.01 – -0.19 Negligible relationship
-0.20 – -0.39 Low negative relationship
-0.40 – -0.59 Moderate negative relationship
-0.60 – -0.79 High negative relationship
-0.80 – -0.99 Very high negative relationship
-1.00 Perfect negative relationship
Kuder-Richardson methods: This method, like the split-half method, is based on a single
testing, and so the reliability index is an overestimate. Its value is, however, lower than the
value obtained for the split-half method.
It is clear from the above illustration that the size of the reliability coefficient resulting from a
method of estimating reliability is directly attributable to the type of consistency included in each
method. Thus, the more rigorous methods of estimating reliability yield smaller reliability
coefficients than the less rigorous methods. It is therefore essential, when estimating the
reliability of a measurement instrument, that the method used, the time lapse between repeated
administrations and the intervening experience be noted, as well as the assumptions and
limitations of the method used, for a clear understanding of the resulting reliability estimate.
If the quality of the test items and the nature of the testees can be assumed to remain the same,
then the relationship of reliability to length can be expressed by the simple formula stated as
follows:
rnn = n·rii / [1 + (n − 1)·rii]

Where:
rnn = the reliability of a test n times as long as the original test
rii = the reliability of the original test
n = the factor by which the length of the test is increased
Increasing the length of a test makes the test scores depend more closely upon the characteristics
of the person being measured, so a more accurate appraisal of the person is obtained. However,
lengthening a test is limited by a number of practical considerations: the amount of time available
for testing, fatigue and boredom on the part of the testees, and the inability of classroom teachers
to construct additional equally good test items. Nevertheless, within these limits, reliability can
be increased as needed by lengthening the test.
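A sketch of the lengthening formula; the original reliability of 0.60 and the doubling factor are illustrative values, not from the text:

```python
# Spearman-Brown prophecy: reliability of a test lengthened by a factor n.
# Example: doubling (n = 2) a test whose original reliability is 0.60.
def lengthened_reliability(r_ii, n):
    return n * r_ii / (1 + (n - 1) * r_ii)

print(round(lengthened_reliability(0.60, 2), 2))   # 0.75
```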
A test that yields a wide spread of scores is well suited to be used in measuring differences among
individuals. This is because the bigger the spread of scores, the greater the likelihood that its
measured differences will be reliable.
7.1.2.4 Objectivity
This refers to the degree to which equally competent scorers obtain the same results in scoring a
test. Objective tests easily lend themselves to objectivity because they are usually constructed so
that they can be accurately scored by trained individuals or by machine. For tests constructed
using such highly objective procedures, the reliability of the results is not affected by the scoring
procedures. The teacher-made classroom test likewise calls for objectivity, which is necessary in
obtaining a reliable measure of achievement. This is most obvious in essay testing and in various
observational procedures, where the results depend to a large extent on the person doing the
scoring. Sometimes even the same scorer may get different results at different times. This
inconsistency in scoring has an adverse effect on the reliability of the measures obtained: the
resulting test scores reflect the opinions and biases of the scorer rather than only the differences
among testees in the characteristics being measured.
Objectivity can be controlled by ensuring that the evaluation procedures selected for evaluating the
behaviour required in a test are both appropriate and as objective as possible. In the case of essay
tests, objectivity can be increased by careful framing of the questions and by establishing a standard
set of rules for scoring. Objectivity increased in this manner will increase reliability without
undermining validity.
What are the major factors that affect validity? Write your answer on the space provided.
____________________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________
Dear trainee, in the previous subtitles you saw what reliability is all about. In the next topics
you are going to gain good insight into validity, the types of validity and the factors that
affect validity.
Validity is the most important quality you have to consider when constructing or selecting a test.
It refers to the meaningfulness or appropriateness of the interpretations to be made from test scores
and other evaluation results. Validity is a measure or the degree to which a test measures what it
is intended to measure. It is always concerned with the specific use of the results and the soundness
of our proposed interpretations. Hence, to the extent that a test score is determined by factors or
abilities other than those which the test was designed or used to measure, its validity is impaired.
These concerns of validity are related, and in each case the determination is based on knowledge
of the interrelationship between scores on the test and performance on other tasks or tests that
accurately represent the actual behavior. The three approaches to test validation are briefly
discussed in Table 7.5 below.
Table 7.5: Approaches to Test Validation

Evidence                     Procedure                                   Meaning
Content-related evidence     Compare the test tasks to the test          How well the sample of test tasks
                             specifications describing the task          represents the domain of tasks to
                             domain under consideration                  be measured
Criterion-related evidence   Compare test scores with another            How well test performance
                             measure of performance obtained at a        predicts future performance or
                             later date (for prediction) or with         estimates current performance on
                             another measure of performance              some valued measure other than
                             obtained concurrently                       the test itself (a criterion)
Construct-related evidence   Establish the meaning of the scores         How well test performance can be
                             on the test by controlling (or              interpreted as a measure of some
                             examining) the development of the           characteristic or quality
                             test and experimentally determining
                             what factors influence test performance
Content Validity:
When we want to find out whether the entire content of the behaviour or area is represented in the
test, we compare the test tasks with the content of the behaviour.
Criterion Validity:
When you expect a future performance based on currently obtained scores, correlate the scores
with the later performance. The later performance is called the criterion and the current score is
the predictor. This is an empirical check on the value of the test: a criterion-oriented or
predictive validation.
Construct Validity:
Construct validity is the degree to which a test measures an intended hypothetical construct. Many
times psychologists assess/measure abstract attributes or constructs. The process of validating the
interpretations about that construct as indicated by the test score is construct validation. This can
be done experimentally, e.g., if we want to validate a measure of anxiety. We have a hypothesis
103
that anxiety increases when subjects are under the threat of an electric shock, then the threat of an
electric shock should increase anxiety scores.
SUMMARY
Dear trainee, in this unit you have discussed the concept of reliability and validity. Furthermore,
you have identified the types of reliability measures and methods of estimating them. Specifically,
you discussed about measure of equivalence, measure of stability and measure of internal
consistency. In addition you have learned about the test-retest method of estimating reliability, the
equivalent forms methods and the Kuder-Richardson method. Finally, you have learned the factors
such as length of test, spread of scores, and difficulty of test and objectivity that influences
reliability measures.
The second concept you have seen in this unit was test validity. You have learned about
content validity, criterion validity and construct validity. In addition, you have gone through and
identified the validity of criterion-referenced mastery tests and the factors that influence validity.
Validity is
a measure or the degree to which a test measures what it is intended to measure. There are three
types of validity. Content validity is the process of determining the extent to which a set of test
tasks provided a relevant and representative sample of the domain of tasks under consideration.
Criterion validity is the process of determining the extent to which test performance is related to
some other valued measure of performance.
Construct validity is the process of determining the extent to which test performance can be
interpreted in terms of one or more psychological construct. A construct is a psychological quality
that are assumed exists in order to explain some aspect of behavior. Criterion – referenced mastery
tests are not designed to discriminate among individuals. Therefore, statistical validation
procedures play a less prominent role.
On the other hand, many factors may influence the validity of a test. These factors include factors
in the test itself, factors in the test administration, factors in the examinee’s response, functioning
content and teaching procedures as well as nature of the group and the criterion.
105
6. What are the factors that affect validity?
7. Explain the relationship between reliability and validity?
PART II. READ EACH OF THE FOLLOWING QUESTION AND CHOOSE THE MOST
APPROPRIATE ONE FROM THE GIVEN ALTERNATIVES.
1. A correlation in which both variables increase by increments of equal magnitude is called ______?
A. Perfect positive relation
B. Perfect negative relation
C. No relation at all
D. Very high positive relation
2. Which of the following is not true about correlation?
A. It measures association
B. It is scaled between -1 to +1
C. It shows the relation of two variables
D. It estimates the average value of two variables
3. Which one is not true about negative correlation?
A. As one variable increases, the other will decrease
B. The coefficient lies between-1 and less than 0
C. Both variables show increment in the same direction
D. Each variable may increase or decrease in the opposite direction
4. What is the meaning of correlation coefficient r=0.00?
A. Perfect positive relation
B. There is no relation at all
C. There is negligible relation
D. Perfect negative relation
5. Which reliability measure uses a single exam two times for the same group?
A. Equivalence
B. Stability
C. Internal consistency
D. Split half
6. In split half method of estimating reliability, even and odd number items correlation is 0.64. What will
be the total items reliability?
A. 0.74 B. 0.87 C. 0.64 D. 0.78
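Question 6 applies the Spearman-Brown prophecy formula, which steps up the correlation between the two half-tests to an estimate of the full test's reliability. A minimal Python sketch (the function name is illustrative):

```python
def spearman_brown(half_test_r):
    """Estimate full-test reliability from a split-half correlation
    using the Spearman-Brown prophecy formula."""
    return (2 * half_test_r) / (1 + half_test_r)

# Odd-even correlation of 0.64 from question 6:
print(round(spearman_brown(0.64), 2))  # 0.78, i.e. option D
```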
7. Which one of the following does not influence the reliability measures?
A. Content validity
B. Criterion validity
C. Construct validity
D. None of the above
9. Which type of validity is guaranteed if the test represents the entire content of the course?
A. Content validity
B. Criterion validity
C. Construct validity
D. Face validity
11. The method of computing the degree of relationship between two sets of scores is called__________?
12. The type of correlation that shows as one variable increase the other decreases?
A. Positive correlation
B. Negative correlation
C. Zero correlation
D. None of the above
UNIT EIGHT
JUDGING THE QUALITY OF A CLASSROOM TEST
INTRODUCTION
Dear trainee, in this unit you will learn how to judge the quality of a classroom test and of individual test items. Specifically, you will learn about item analysis, its purposes and uses. Furthermore, you will learn the process of item analysis for classroom tests and the computations involved. In addition, you will learn the item analysis of criterion-referenced mastery items. Finally, you will learn about building a test item file or item bank.
OBJECTIVES
By the end of this unit you will be able to:
Differentiate distinctively between item difficulty, item discrimination and the distraction
power of an option
Recognize the need for item analysis, its place and importance in test development
Conduct item analysis of a classroom test
Calculate the value of each item parameter for different types of items
Appraise an item based on the results of item analysis
What are the steps of item analysis? What are the major purposes of item analysis? How could
you compute item difficulty index and discrimination index of a test? How can you evaluate the
effectiveness of distracters? How can you build a test item file or item bank?
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
The administration and scoring of a classroom test is closely followed by the appraisal of the test results. This is done to obtain evidence concerning the quality of the test that was used, such as identifying defective items. It also helps the teacher to better appreciate the careful planning and hard work that went into the preparation of the test. Moreover, the items identified as effective are used to build up a file of high-quality items, usually called a question bank, for future use.
The appraisal also includes determining the effectiveness of each option. The decision on the quality of an item depends on the purpose for which the test is designed. However, for an item to measure effectively what the entire test is measuring and to provide valid and useful information, it should be neither too easy nor too difficult. Moreover, its options should discriminate validly between high- and low-performing learners in the class.
8.2 The Process of Item Analysis for Norm-Referenced Classroom Tests
The methods for analyzing the effectiveness of test items differ for norm-referenced and criterion-referenced test items because the two serve different functions. In norm-referenced tests, special emphasis is placed on item difficulty and item discriminating power. The process of item analysis begins after the test has been administered or trial-tested, scored and recorded. For most norm-referenced classroom tests, a simplified form of item analysis is used.
The analysis is carried out using two contrasting groups composed of the upper and lower 25% of the testees to whom the items were administered or trial-tested. For a normal distribution, the upper and lower 25% is the optimum point at which a balance is obtained between the sensitivity of the groups in making adequate differentiations and the reliability of the results. These extreme groups also give a better estimate of the actual discrimination value: they are significantly different, whereas the middle scores do not discriminate sufficiently. To form the groups, the graded test papers are arranged in descending order from the highest score to the lowest. The best 25% of papers are picked from the top and the poorest 25% from the bottom, while the middle papers are set aside.
To illustrate the method, consider a class of 40 learners taking a 10-item test that has been administered and scored, using 25% test groups. The item analysis procedure follows these basic steps.
Step 1. Arrange the 40 test papers by ranking them in order from the highest to the lowest score.
Step 2. Select the best 10 papers (upper 25% of 40 testees) with the highest total scores and the
least 10 papers (lower 25% of 40 testees) with the lowest total scores.
Step 3. Drop the middle 20 papers (the remaining 50% of the 40 testees) because they will no
longer be needed in the analysis.
Step 4. Draw a table as shown in table 8.1 in readiness for the tallying of responses for item
analysis.
Step 5. For each of the 10 test items, tabulate the number of testees in the upper and lower groups
who got the answer right or who selected each alternative (for multiple choice items).
Step 6. Compute the difficulty of each item (percentage of testees who got the item right).
Step 7. Compute the discriminating power of each item (difference between the number of testees
in the upper and lower groups who got the item right).
Step 8. Evaluate the effectiveness of the distracters in each item (attractiveness of the incorrect
alternatives) for multiple choice test items.
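Steps 6 and 7 can be sketched in Python. The figures below are item 1's tallies from the worked example later in this section (10 correct in the upper group, 4 in the lower group, 10 testees per group); the function names are illustrative:

```python
def item_difficulty(upper_correct, lower_correct, group_size):
    """P: proportion of testees in the combined upper and lower
    groups who got the item right."""
    return (upper_correct + lower_correct) / (2 * group_size)

def item_discrimination(upper_correct, lower_correct, group_size):
    """D: difference between the upper- and lower-group
    proportions of correct answers."""
    return (upper_correct - lower_correct) / group_size

print(item_difficulty(10, 4, 10))      # 0.7
print(item_discrimination(10, 4, 10))  # 0.6
```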
Table 8.1 Format for Tallying Responses of Examinees for Item Analysis
(* marks the keyed answer; P = difficulty, D = discriminating power, DP = distraction power of each option. The row for item 1 was lost in reproduction; its totals appear in the worked example below.)

Item  Group        A      B      C      D      E      Omit   Total
2     Upper 25%    1      1      0      7*     1      0      10
      Lower 25%    1      2      1      4      1      1      10
      P = 0.55   D = 0.30   DP:   0.00   0.10   0.10   *      0.00
3     Upper 25%    3      0      1      2      4*     0      10
      Lower 25%    1      0      1      3      5      0      10
      P = 0.45   D = -0.10  DP:  -0.20   0.00   0.00   0.10   *
Hence for item 1 in Table 8.1, the item discriminating power D is obtained thus:

D = (H - L) / n = (10 - 4) / 10 = 6 / 10 = 0.60

where H and L are the numbers of testees in the upper and lower groups who got the item right, and n is the number of testees in each group (here 10).
Item discrimination values range from -1.00 to +1.00. The higher the discriminating index, the better the item differentiates between high and low achievers. The item discriminating power takes a:
Positive value when a larger proportion of those in the high-scoring group than of those in the low-scoring group get the item right;
Negative value when more testees in the lower group than in the upper group get the item right;
Zero (0) value when an equal number of testees in both groups get the item right; and
One (1.00) when all testees in the upper group get the item right and all testees in the lower group get the item wrong.
Incorrect options with positive distraction power are good distracters, while those with negative distraction power must be changed or revised, and those with zero distraction power should be improved because they fail to distract the low achievers.
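A definition of distraction power consistent with the values shown in Table 8.1 is the lower-group tally for an incorrect option minus the upper-group tally, divided by the group size. A short Python sketch using the item 3 tallies (key = E, 10 testees per group; the names are illustrative):

```python
def distraction_power(upper_picks, lower_picks, group_size):
    """DP for an incorrect option: positive when the option attracts
    more low achievers than high achievers, as a good distracter should."""
    return (lower_picks - upper_picks) / group_size

# Item 3 of Table 8.1: tallies for the incorrect options only.
upper = {"A": 3, "B": 0, "C": 1, "D": 2}
lower = {"A": 1, "B": 0, "C": 1, "D": 3}
for option in upper:
    print(option, distraction_power(upper[option], lower[option], 10))
# A: -0.2 (revise), B: 0.0 (improve), C: 0.0 (improve), D: 0.1 (good)
```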
8.3.3 Analysis of Criterion-Referenced Mastery Items
Ideally, a criterion-referenced mastery test is analyzed to determine the extent to which its items measure the effects of instruction. In order to provide such evidence, the same test items are given before instruction (pretest) and after instruction (posttest), and the results of the two administrations are compared. The analysis is done by means of an item response chart. The chart is prepared by listing the item numbers across the top, listing the testees' names or identification numbers down the side, and recording correct (+) and incorrect (-) responses for each testee on the pretest (B) and the posttest (A). This is illustrated in Table 8.2 for an arbitrary 10 testees.
Table 8.2: An Item Response Chart Showing Correct (+) And Incorrect (-) Responses For
Pretest And Posttest Given Before (B) And After (A) Instructions (Teaching -
Learning Process) Respectively
[Item response chart: one pretest (B) row and one posttest (A) row of + and - entries per testee for each item. Most of the chart was lost in reproduction; for item 1, all 10 testees answered incorrectly on the pretest and correctly on the posttest.]
An index of item effectiveness for each item is obtained using the formula for a measure of Sensitivity to Instructional Effects (S):

S = (RA - RB) / T
Where:
RA = Number of testees who got the item right after the teaching-learning process.
RB = Number of testees who got the item right before the teaching-learning process.
T = Total number of testees who tried the item both times.
For example, for item 1 of Table 8.2, the index of sensitivity to instructional effects (S) is:

S = (RA - RB) / T = (10 - 0) / 10 = 1.00
Usually for a criterion-referenced mastery test with respect to the index of sensitivity to
instructional effect,
An ideal item yields a value of 1.00.
Effective items fall between 0.00 and 1.00, the higher the positive value, the more sensitive
the item to instructional effects; and
Items with zero and negative values do not reflect the intended effects of instruction.
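The pretest/posttest comparison above can be sketched as follows, representing each row of the item response chart as a list of True (+) and False (-) entries (the representation and function name are our assumptions):

```python
def sensitivity_index(pretest, posttest):
    """S = (RA - RB) / T for one item, where both lists cover the
    testees who tried the item on both administrations."""
    t = len(pretest)
    r_b = sum(pretest)   # RB: correct before instruction
    r_a = sum(posttest)  # RA: correct after instruction
    return (r_a - r_b) / t

# Item 1 of Table 8.2: all 10 testees wrong before, right after.
print(sensitivity_index([False] * 10, [True] * 10))  # 1.0
# An item everyone already knew would score 0.0: no sensitivity.
print(sensitivity_index([True] * 10, [True] * 10))   # 0.0
```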
Each item in the file can be classified by the content and objective it measures. This makes it possible to select items in accordance with any table of specifications in the particular area covered by the file.
Building an item file is a gradual process that progresses over time. At first it seems to be additional work without immediate usefulness, but its usefulness becomes obvious once it is possible to start using some of the items in the file and supplementing them with newly constructed ones. As the file grows into an item bank, most items can be selected from the bank without frequent repetition. Some of the advantages of an item bank are that:
Parallel tests can be generated from the bank, allowing learners who were ill or otherwise unavoidably absent for a test to take it later;
It is cost-effective, since new questions do not have to be generated at the same rate from year to year;
The quality of items gradually improves as existing ones are modified over time; and
The burden of test preparation is considerably lightened once enough high-quality items have been assembled in the bank.
SUMMARY
In this unit you have learned how to judge the quality of a classroom test and of individual test items. In particular, you have understood the item analysis procedure and how to compute the numerical indices of an item. You have also discussed item analysis, its purposes and uses. Besides, you have gone through the process of item analysis for classroom tests and the computations involved, using practical examples. You have examined the two forms of item analysis procedures, for norm-referenced tests and for criterion-referenced mastery items. Finally, you have seen how teachers can build a test item file or item bank and how they can use items from the bank for different purposes.
SELF ASSESSMENT EXERCISE
PART I. READ THE FOLLOWING QUESTION AND GIVE SHORT AND PRECISE
ANSWER.
1. Explain the meaning of item analysis?
2. List and explain the purposes of item analysis?
3. Show norm referenced item analysis procedure?
4. When do teachers use norm reference and criterion referenced item analysis procedure?
5. How can you compute item difficulty and discrimination?
6. How do you evaluate the effectiveness of distractor in item analysis?
7. Explain the purposes of building item bank?
8. Explain the relationship between reliability and validity?
9. What is the difference between index of discriminating power (D) and index of sensitivity
to instructional effects (S)?
10. Do you think that item analysis could help teachers to improve their skill of classroom test
preparation? Why?
PART II. BASED ON THE FOLLOWING DATA FOR SIX ITEMS ANSWERED BY 54 EXAMINEES, DETERMINE THE DIFFICULTY LEVEL (P), THE DISCRIMINATION LEVEL (D) AND THE EFFECTIVENESS OF EACH DISTRACTER OR OPTION (DP), AND INTERPRET THE RESULTS.
Item  Group         A     B     C     D     Omit
1     Upper Group   8*    5     7     7     0
      Lower Group   2*    9     8     8     0
-     Lower Group   5     1     9*    12    0
-     Lower Group   11    2*    8     6     0
-     Lower Group   0     7     12    8*    0
6     Upper Group   8     6     8*    5     0
      Lower Group   4     3     18    2     0
(The upper-group rows for the middle items were lost in reproduction.)
Assignment (30%)
Prepare a classroom achievement test on a subject matter of your choice.
3. What elements are included in the table of specification? What are the uses of table
of specification?
4. What do you think is the reason that multiple choice items are used most frequently?
Compute the correlation for the following pair of hypothetical data. How do you interpret the results?
Number    X       Y
1         1.50    55
2         1.80    65
3         1.68    72
4         1.78    70
5         1.64    68
6         1.90    80
7         1.75    64
8         1.62    60
9         1.85    76
10        1.87    78
7. Calculate the Mean, Median, Mode, Range, Variance and Standard Deviation of the following frequency distribution:

Score (x)       10    5    8    2    7
Frequency (f)    3    2    5    6    4
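The statistics asked for in question 7 can be computed directly from a frequency table. The sketch below uses a small hypothetical table (not the assignment data) and population formulas, i.e. dividing by N:

```python
import math

scores = [2, 4, 6]   # hypothetical score values
freqs  = [1, 2, 1]   # hypothetical frequencies

n = sum(freqs)                                        # 4 testees
mean = sum(x * f for x, f in zip(scores, freqs)) / n  # 4.0

# Expand the table into the individual scores to find the median.
data = sorted(x for x, f in zip(scores, freqs) for _ in range(f))
median = (data[n // 2 - 1] + data[n // 2]) / 2        # 4.0 (n is even here)

mode = scores[freqs.index(max(freqs))]                # 4 (highest frequency)
score_range = max(scores) - min(scores)               # 6 - 2 = 4
variance = sum(f * (x - mean) ** 2
               for x, f in zip(scores, freqs)) / n    # 2.0
std_dev = math.sqrt(variance)                         # about 1.41
print(mean, median, mode, score_range, variance, round(std_dev, 2))
```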
8. Based on the following data, determine the P value, the D value and the effectiveness of the distracters.

Group          A     B     C     D     Omit
Lower Group    2*    9     8     8     0
Lower Group    0     7     12    8*    0
Lower Group    11    2*    8     6     0