
TECHNICAL VOCATIONAL EDUCATION AND

TRAINING INSTITUTE

EDUCATIONAL MEASUREMENT AND


EVALUATION OF LEARNING MODULE (VPD 201)
Module for the Course Educational Measurement and
Evaluation of Learning (VPD 201)

By

Gedefaw Kassie (PhD)

Tigist Bayleyegn (MA)

Temesgen Tadele (MA)

Federal TVET Institute

Addis Ababa

Faculty of Basic Science, Language and Pedagogy

Department of Vocational Pedagogy

Contents
AN OVERVIEW OF MEASUREMENT AND EVALUATION
UNIT INTRODUCTION
1.1 LEARNING OBJECTIVES OF THE UNIT
1.2 INTRODUCTION
1.3 PURPOSES/FUNCTIONS OF MEASUREMENT AND EVALUATION
BLOOM'S TAXONOMY OF EDUCATIONAL OBJECTIVES
2.1 Definition of Objectives
2.2 The Importance of Stating Instructional Objectives
2.3 Taxonomy of Educational Objectives
2.4 Criteria for Selecting/Writing Appropriate Objectives
2.5 Steps for Stating Instructional Objectives
CLASSROOM ACHIEVEMENT TESTS AND ASSESSMENTS
3.1 INTRODUCTION
3.2 LEARNING OBJECTIVES OF THE UNIT
3.3 TYPES OF TESTS USED IN THE CLASSROOM
TEST DEVELOPMENT – PLANNING THE CLASSROOM TEST
4.1 Test Development – Planning the Classroom Test
4.2 Item Writing
ASSEMBLING, REPRODUCING, ADMINISTERING, AND SCORING OF CLASSROOM TESTS
5.1 UNIT INTRODUCTION
5.2 LEARNING OBJECTIVES OF THE UNIT
5.3 INTRODUCTION
5.4 ARRANGING THE TEST ITEMS
5.5 WRITING TEST DIRECTIONS
5.6 REPRODUCING TEST ITEMS
5.7 ADMINISTERING THE TEST
5.8 SCORING THE TEST
SUMMARIZING AND INTERPRETING TEST SCORES
6.1 Methods of Interpreting Test Scores
6.2 Descriptive Statistics
RELIABILITY AND VALIDITY OF A TEST
7.1 Test Reliability
7.2 Validity of a Test
JUDGING THE QUALITY OF A CLASSROOM TEST
8.1 Judging the Quality of a Classroom Test
8.2 The Process of Item Analysis for Norm-Referenced Classroom Tests
8.3 Item Analysis and Criterion-Referenced Mastery Tests
8.4 Building a Test Item File (Item Bank)

UNIT ONE
AN OVERVIEW OF MEASUREMENT AND EVALUATION
UNIT INTRODUCTION
In this unit you will be introduced to definitions of basic terms and concepts, the functions of measurement and
evaluation, issues raised in measurement and evaluation, the measuring scales, the various ways of
classifying tests, and the like.

1.1 LEARNING OBJECTIVES OF THE UNIT


Dear learner, after reading through this unit and completing the tasks and activities, you will be able to:

 Define basic terminologies in measurement and evaluation.


 Recognize the functions of measurement and evaluation.
 Recall the important issues to be raised in measurement and evaluation.
 Classify measurement tools based on their functions.
 Differentiate the use of norm-referenced and criterion-referenced measures.
 Distinguish among the different types of tests.
 Review your knowledge and skill in measurement and evaluation.

1.2 INTRODUCTION
Teachers have always been concerned with measuring and evaluating students' learning progress. Schools
certify learners based on the knowledge and skills acquired, parents are interested in knowing how much
their children are achieving in their education, and employers pay attention to ascertaining the knowledge
and skills possessed by individuals before recruitment and selection.

Nevertheless, there have been criticisms of the quality of educational products. For example, there are
students who are unable to read and write effectively, lack fundamental arithmetic skills, and cannot
engage in higher-order thinking processes. These problems and others require TVET teachers and
educators to be more concerned with valid and reliable measurement of educational products. Without
question, measurement and evaluation in education is the backbone of educational practice and curriculum
implementation for maintaining the quality of school products, including those of the TVETs.

Learning Task 1

 Write a short report on your own practices of using measurement and


evaluation in making instruction decisions.

_________________________________________________________________
__________________________________________________________

 What would be the role of measurement and evaluation in maintaining the


quality of educational products?

Terms in educational measurement and evaluation are often used interchangeably. Let us see their
definitions and make comparisons among them.

Test and testing

A test is a measuring tool or instrument in education. More specifically, a test is considered to be a kind or
class of measurement device typically used to find out something about a person. Most of the time, when
you finish a lesson or lessons in a week, your teacher gives you a test. This test is an instrument given to
you by the teacher in order to obtain data on which you are judged. It is a common type of educational
device which an individual completes in order to determine changes or gains in learning; related devices
include instruments such as the inventory, the questionnaire, the scale, etc.

Testing, on the other hand, is the process of administering the test to the pupils. In other words, the process
of making you or letting you take the test in order to obtain a quantitative representation of the cognitive or
non-cognitive traits you possess is called testing. So the instrument or tool is the test, and the process of
administering the test is testing.
Assessment
There are many definitions and explanations of assessment in education.
Let us look at a few of them.
i. Freeman and Lewis (1998): to assess is to judge the extent of students' learning.
ii. Rowntree (1977): Assessment in education can be thought of as occurring whenever one person,
in some kind of interaction, direct or indirect, with another, is conscious of obtaining and
interpreting information about the knowledge and understanding, abilities and attitudes of that
other person. To some extent or other, it is an attempt to know the person.

iii. Erwin (in Brown and Knight, 1994): Assessment is a systematic basis for making inferences about
the learning and development of students… the process of defining, selecting, designing,
collecting, analysing, interpreting and using information to increase students' learning and
development.
You will have to note from these definitions that:
 Assessment is a human activity.
 Assessment involves interaction, which aims at understanding what the learners have achieved.
 Assessment can be formal or informal.
 Assessment may be descriptive rather than judgmental in nature.
 Its role is to increase students' learning and development.
 It helps learners to diagnose their problems and to improve the quality of their subsequent learning.
In conducting assessment, one can use a variety of techniques, such as self-reports, observations, interviews,
projective tests, paper-and-pencil tests, performance tests, surveys, projects, etc.

Learning Task 2

Dear teacher, please list down the assessment techniques you are using in your
teaching activities.

_____________________________________________________________________


Have you noticed some important assessment techniques that can be used in
obtaining information about your students’ characteristics and behaviour? Yes
________ NO ______

What plans will you set in order to improve your assessment of your students, the effectiveness of
your teaching methods, the curriculum, etc.?
Measurement

In simple terms, measurement refers to giving or assigning a number value to a certain attribute or
behaviour. It is a systematic process of obtaining the quantified degree to which a trait or an attribute is
present in an individual or object. In other words, it is a systematic assignment of numerical values or figures
to a trait or an attribute of a person or object. For instance: What is the height of your friend? What is the
weight of the meat? What is the length of the classroom?

Measurement conveys a broader meaning. Measurement uses a variety of ways to obtain information in a
quantitative form. Measurement can use paper-and-pencil tests, rating scales, and observations to assign a
number value to a given trait or behaviour. Measurement can also refer both to the score obtained by the
measuring device and to the process used to obtain the score.

Some features of measurement

 Measurement is a way of making an observation and assigning a number value to it.
 Measurement has rules by which we assign numerical descriptions to observations of some attribute of
an object, person or situation (a brief sketch of such a rule follows this list).
 In the school context, the attributes most often measured are cognitive achievements, scholastic abilities
and cognitive aspects of skill development.
 An observation made by a measurement process always involves a measuring instrument such as a
paper-and-pencil test, a performance test, checklists, etc.
 In quantifying an attribute of interest in the measurement process, the attribute has to be defined
and a measuring device has to be prepared for it.
 Results of measurement from classroom tests may be used to:
- Direct prescriptive study for the students who took the test.
- Indicate whether a student has mastered a well-defined body of subject matter or developed a
particular skill.
- Indicate where a student stands in relation to others who took the test.
- Make terminal judgements about student progress and the effectiveness of programs.
 A classroom measure can be either criterion-referenced or norm-referenced, based on the evaluation
being made.
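To make the idea of rule-governed assignment of numbers concrete, here is a minimal sketch in Python (the answer key and the responses are invented for this illustration; the module itself prescribes no particular tool) of how scoring a 10-item paper-and-pencil test turns an observation into a number:

    # A minimal illustration of measurement as rule-governed assignment of
    # numbers: scoring a 10-item paper-and-pencil test against a key.
    # The key and the responses below are invented for this example.
    key       = ["A", "C", "B", "D", "A", "B", "C", "A", "D", "B"]
    responses = ["A", "C", "D", "D", "A", "B", "B", "A", "D", "C"]

    # The rule: one point for every response that matches the key, zero otherwise.
    score = sum(1 for r, k in zip(responses, key) if r == k)
    print(f"Measured achievement: {score}/{len(key)}")  # Measured achievement: 7/10

The number 7 by itself is the measurement; whether 7 out of 10 is "good" is a question for evaluation, not for measurement.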

Learning Task 3

1. Define measurement in your own words in the context of teaching-learning


process.
______________________________________________________________
________________________________________________
2. Based on concrete examples from your day-to-day teaching activities, explain the
similarities and differences between educational assessment and
measurement.
______________________________________________________________
3. Discuss the major features of educational measurement.
___________________________________________________________________

Evaluation

What is evaluation for you? How does it differ from assessment and measurement?

In the context of teaching-learning, evaluation refers to the process of making judgment about the value of
a student’s performance.

 Evaluation connotes the checking of the effects of a curriculum, a course, a method of instruction, etc.


 Evaluation is a judgement process, which is used to classify individuals for purposes of
grading, certification, placement or promotion. It is also used to determine the
effectiveness of curriculum, method of instruction, or a specific instructor.
 Evaluation is inseparable from measurement.
 Evaluation is described as formative and summative.

Evaluation is formative when conducted over small bodies of content to provide feedback for directing
further instruction and student learning. Formative evaluation thus refers to an ongoing process carried out
before instruction, during instruction, and at the end of a term or unit.

Summative evaluation, on the other hand, is an evaluation conducted over the larger outcomes of an extended
instructional sequence, such as an entire course or a large part of it.

Summative evaluation may serve for reporting a student’s overall achievement, licensing and certifying,
predicting success in related courses, assigning marks, and reporting overall achievement of a class.

Evaluation in the classroom context is directed to the improvement of student learning by supporting the
instructional process as in the following. Evaluation:

 Serves to assess students' abilities and needs using appraisal techniques. This may help in deciding
on the proper placement of students in learning sequences.
 Identifies specific learning deficiencies and helps in devising mechanisms to overcome learning
difficulties.
 Assists in following up student progress during instruction through formative means.
 Provides evidence for judging the effectiveness of instruction or the curriculum through summative
evaluation.

In conducting an evaluation process, the teacher has to:

 Identify and formulate evaluation goals.


 Select and devise the appropriate tools or instruments for measuring progress towards the goals.
 Use the tools to quantify or give a number value for a given attribute of behaviour.
 Formulate judgments based on results obtained.

In short, evaluation is taken as a professional judgement made to check the desirability or value of
something. It can also be interpreted as the determination of the match or mismatch between learning
achievement and learning objectives.

Self-check exercise

1. Explain the conceptual difference between measurement and evaluation.


2. As a teacher, what procedures do you follow in conducting an evaluation process?
3. How do you use evaluation results to improve your students' learning?
4. Were your evaluation practices effective? To what extent did they support your teaching?
5. Can you tell the major purposes of formative and summative evaluations?

Types of evaluation

The different types of evaluation are: placement, formative, diagnostic and summative evaluations.
 Placement Evaluation
This is a type of evaluation carried out in order to place students in the appropriate group or class. In
some schools, for instance, students are assigned to classes according to their subject combinations, such as
science, technical, arts, commercial, etc. Before this is done, an examination is carried out, in the
form of a pretest or aptitude test. It can also be a type of evaluation made by the teacher to find out the entry
behaviour of his students before he starts teaching. This may help the teacher to adjust his lesson plan. Tests
like readiness tests, ability tests, aptitude tests and achievement tests can be used.
 Formative Evaluation
This is a type of evaluation designed to help both the student and the teacher to pinpoint areas where the student
has failed to learn, so that this failure may be rectified. It provides feedback to the teacher and the student
and thus helps in estimating teaching success; examples include weekly tests, terminal examinations, etc.
 Diagnostic Evaluation
This type of evaluation is carried out, most of the time, as a follow-up to formative evaluation. As
a teacher, you have used formative evaluation to identify some weaknesses in your students. You have also
applied some corrective measures which have not shown success. What you will now do is design a
diagnostic test, which is applied during instruction to find out the underlying causes of students'
persistent learning difficulties. These diagnostic tests can take the form of achievement tests, performance
tests, self-ratings, interviews, observations, etc.
 Summative evaluation:
This is the type of evaluation carried out at the end of the course of instruction to determine the extent to
which the objectives have been achieved. It is called a summarizing evaluation because it looks at the entire
course of instruction or program and can pass judgment on the teacher and students, the curriculum and the
entire system. It is used for certification. Think of the educational certificates you have acquired from
examination bodies. These were awarded to you after you had gone through some types of examination.
This is an example of summative evaluation.

1.3 PURPOSES/FUNCTIONS OF MEASUREMENT AND EVALUATION


Measurement and evaluation serve a variety of functions. These functions can be classified under three
interrelated categories, namely: (1) instructional, (2) administrative, and (3) guidance and counselling.

What are the instructional functions of a measurement and evaluation process?

Instructional Functions

Measurement and evaluation stimulates teachers to clarify and refine meaningful learning objectives. Tests
provide means of feedback to the teacher. Feedback from tests helps the teacher provide more appropriate
instructional guidance for individual students as well as for the class as a whole. Well-designed tests may
also be of value for student self-diagnosis, as they help students identify specific weaknesses in their
learning.

Similarly, properly constructed tests can motivate learning. As a rule, students pursue mastery of learning
objectives more diligently if they expect to be evaluated. One can say that tests are a useful means of
promoting overlearning. When a student reviews, interacts with, or practices skills and concepts even after
they have been mastered, he or she is engaging in what psychologists call overlearning. Even if a student
correctly answers every question on a test, he or she may be engaging in behaviour that is instructionally
valuable apart from the evaluation function served by the test. And this is important for long-term retention.

Self-check exercise

1. Explain the way you are using measurement and evaluation (testing) to improve
instruction/your teaching.
______________________________________________________________________
______________________________________________________________________
2. How helpful is testing for improving students' learning?

3. What is the relevance of quality control in the context of school and learning process?

__________________________________________________________________

4. Why do scholars say that tests may sometime demotivate students?

Administrative functions

Measurement and evaluation are very useful in providing a mechanism of quality control for a school
or school system: they facilitate program evaluation and research, allow better classification and
placement decisions, contribute to the quality of selection decisions, and are useful for accreditation,
mastery, or certification purposes. In general, they serve the administrative functions just mentioned.

Guidance and counselling functions

Explain the following:

1. What is vocational planning?


2. What is planning for a field of study?
3. How can one choose his/her major field of study in the college?

Students must have accurate self-concepts in order to make sound decisions. They depend, to some extent,
on the school to help them develop those self-concepts. Tests of aptitude and achievement, and interest and
personality inventories, provide students with information on salient traits and help them develop realistic
self-concepts.

The schoolteacher can also help, particularly by providing students with information concerning their
mastery of subject matter.

Tests can also be used in diagnosing an individual’s special abilities and aptitudes. Assisting a student with
educational and vocational choices, guiding him or her in the selection of curricular and extra-curricular
activities, and helping him or her solve personal and social adjustment problems, all require an objective
knowledge of the students’ abilities, interests, attitudes, and other personal characteristics or traits.

In using testing for the guidance function, a series of assessments needs to be administered, including an
interview, an interest inventory, a personality questionnaire, various aptitude tests, and an achievement battery.
Information from these assessments, along with additional background information, facilitates a student's
decision-making processes for educational and vocational choices.

Self-check exercise

How do you gather information in order to provide guidance services for your students?

The following are also some of the more specific functions of measurement and evaluation:

i. Placement of students, which involves placing students appropriately in the learning sequence and
classifying or streaming students according to ability or subjects.
ii. Selecting the students for courses – general, professional, technical, commercial etc.
iii. Certification: This helps to certify that a student has achieved a particular level of performance.
iv. Stimulating learning: this can mean motivating the student or teacher, providing feedback, suggesting
suitable practice, etc.
v. Improving teaching, by helping to review the effectiveness of teaching arrangements.
vi. For research purposes.
vii. For purposes of curriculum modification.
viii. For the purpose of selecting students for employment.
ix. For the modification of teaching methods.
x. For purposes of student promotion.
xi. For reporting students' progress to their parents.
xii. For the award of scholarships and merit awards.
xiii. For the admission of students into educational institutions.
xiv. For decisions about the continuation of students.

Norm-Referenced and Criterion Referenced Measures
Dear teacher, try to define and conceptualize the following terminologies and phrases:
 Group---------
 Norm---------
 Reference------------
 Cut point--------------
 Standard -----------
 Criterion ----------------
 Group performance--------------
 Group average--------------
Hopefully, you can now make distinctions among the terminologies you defined above. When we consider the
evaluation or judgment aspect, any evaluation instrument can be either norm-referenced or criterion-
referenced in the interpretation of individual scores.

Dear teacher, now give your own definition for the following phrases:

1. Norm-referenced tests
2. Criterion-referenced tests

How do you judge the position of a tested individual based on norm-referencing (referring to the group
performance) and on criterion-referencing (referring to a given set standard or cut point in test scores)?

Give your answers below

Norm-referenced tests
These are tests used to compare the performance of an individual with those of other individuals of
comparable background. In other words, the score of an individual in norm-referenced testing has meaning
only when it is viewed in relation to the scores of other individuals on the test. The success or failure of an
individual on this kind of test is, therefore, determined on the basis of how he/she performs in relation to
his/her colleagues’ performance on the test.

A score of 35% might indicate superior ability if it happens to be one of the highest scores, whereas an
individual with a score as high as 75% might be labelled "weak" in relative terms if that score happens
to be one of the lowest in the array of scores derived from the test.

Criterion-referenced tests
In contrast to norm-referenced tests, criterion-referenced tests are tests in which the score of an individual on a
given test is related to a specific performance standard for interpretation purposes. Such tests are labelled
criterion-referenced, the criterion in this respect being the specific performance standard.

If a given score of an examinee is equal to or greater than a specified standard (i.e. the criterion) the
examinee is said to have passed; otherwise she/he is deemed to have failed the test or examination.
Therefore, the success or failure of an individual on a criterion-referenced test (CRT) depends on what
he/she scores on the test in relation to the set standard, and this may depend, to a large extent, on the test
content itself or the relative strictness of the marker if the test is not an objective one. On any
criterion measure, therefore, it is possible for all the members of a class to pass or to fail the test, depending
on a number of obvious reasons such as the easiness or difficulty of the test items.
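To make the two interpretations concrete, here is a minimal sketch in Python; the score list and the cut point of 50 are invented for illustration and are not prescribed by the module:

    # Contrasting two interpretations of the same raw score.
    scores = [35, 42, 48, 55, 61, 67, 73, 75, 80, 88]  # one class's raw scores (invented)

    def norm_referenced(score, group):
        # Interpret a score by its standing relative to the group.
        below = sum(1 for s in group if s < score)
        return f"{score} is above {100 * below / len(group):.0f}% of the group"

    def criterion_referenced(score, cut_point=50):
        # Interpret a score against a fixed performance standard.
        return f"{score}: {'pass' if score >= cut_point else 'fail'} (cut point {cut_point})"

    print(norm_referenced(75, scores))   # 75 is above 70% of the group
    print(criterion_referenced(75))      # 75: pass (cut point 50)
    print(norm_referenced(35, scores))   # 35 is above 0% of the group
    print(criterion_referenced(35))      # 35: fail (cut point 50)

The score of 35 fails against the cut point and is also the weakest in relative standing here; but note that the same criterion-referenced verdict would hold even if every classmate had scored lower, which is exactly where the two interpretations part ways.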

SELF CHECK
1. Explain the difference between measurement and evaluation, assessment and testing.
2. What are the types of evaluation?
3. What is the major difference between test and testing?
4. In your own words define Assessment.
5. Give an example of a test.
6. What are the major differences and similarities between formative evaluation and diagnostic
evaluation?
7. List five purposes of measurement and evaluation.
8. List the instruments which you can use to measure the following: weight, height, length,
achievement in Mathematics, performance of students in Technical drawing, and the attitude of
workers towards delay in the payment of salaries.

UNIT TWO
BLOOM’S TAXONOMY OF EDUCATIONAL OBJECTIVES
Introduction
Dear Trainees! Welcome to the second chapter of this module. In this unit, you are going to study
Bloom's taxonomy of educational objectives. In addition, you will also look at the
steps for stating instructional objectives.
Objectives
Dear learner, at the end of this unit you will be able to:

 Explain the role of objectives in assessment and evaluation


 Classify instructional objectives into the cognitive, affective and psychomotor domains.
 Follow the guidelines in stating or writing instructional objectives.
 List the categories of each taxonomy of educational objectives.
 Classify action verbs used in each category.
 Write appropriate instructional objectives.
 State general learning outcomes.
2.1 Definition of objectives
Learning Task 1

 Dear Trainees, before you read the sections below, try to define the following
terms.
1. Objective
2. Aim
3. Goal

 Educational aim: a very broad educational outcome that may be stated by the government of the
country or the ministry of education, describing what the learners are to become. E.g., to develop
an all-rounded personality.

 Educational goal: the general purpose of education, stated as a broad, long-term outcome
to work toward. Goals are primarily used in policy making and general program planning.
E.g., to develop proficiency in the skills of reading, writing and arithmetic.

 General instructional objective: an intended outcome of instruction that has been
stated in terms general enough to encompass a set of specific learning outcomes. E.g.,
understands technical terms.
 Specific instructional objectives (learning outcomes):
sets of more detailed statements that specify the means by which the various goals of the
course, course units, and educational package will be met. E.g., defines technical terms in his or her
own words.
2.2 The Importance of Stating Instructional Objectives
Well-stated instructional objectives serve as a guide for teaching, testing/evaluation and
assessment. Specifically, they are useful to:
 Help a teacher guide and monitor students' learning
 Provide criteria for evaluating student outcomes
 Help in selecting or constructing assessment techniques
 Help in communicating with parents, students, administrators and others about what is expected of
students
 Help in selecting appropriate instructional methods, materials, activities, contents and the
like
 Serve as feedback on how far the educational goals have been achieved.
2.3 Taxonomy of Educational Objectives
Benjamin Bloom and a group of educators developed a classification of the levels of
what you can do with what you know. This classification, or taxonomy, of how learners
approach a problem comprises three domains: the Cognitive, Affective, and
Psychomotor domains.

 The Cognitive Domain is concerned with knowledge outcomes and intellectual abilities
and skills.
 The Affective Domain is concerned with the attitudes, interests, appreciation, and modes
of adjustment.
 The Psychomotor Domain is concerned with motor skills.

Each of these three domains is further divided into categories and subcategories as follows.
Cognitive Domain

Knowledge
Definition: Recall and remember information.
Sample verbs: defines, describes, identifies, knows, labels, lists, matches, names, outlines, recalls, recognizes, reproduces, selects, states, memorizes, tells, repeats

Comprehension
Definition: Understand the meaning, translation, interpolation, and interpretation of instructions and problems; state a problem in one's own words; establish relationships between dates, principles, generalizations or values.
Sample verbs: comprehends, converts, defends, distinguishes, estimates, explains, extends, generalizes, gives examples, infers, interprets, paraphrases, predicts, rewrites, summarizes, translates, shows relationship of, characterizes, associates, differentiates, classifies, compares

Application
Definition: Use a concept in a new situation or unprompted use of an abstraction; apply what was learned in the classroom to novel situations in the workplace; facilitate transfer of knowledge to new or unique situations.
Sample verbs: applies, changes, computes, constructs, demonstrates, discovers, manipulates, modifies, operates, predicts, prepares, produces, relates, solves, uses, systematizes, experiments, practices, exercises, utilizes, organizes

Analysis
Definition: Separate material or concepts into component parts so that the organizational structure may be understood; distinguish between facts and inferences.
Sample verbs: analyzes, breaks down, compares, contrasts, diagrams, deconstructs, differentiates, discriminates, distinguishes, identifies, illustrates, infers, outlines, relates, selects, separates, investigates, discovers, determines, observes, examines

Synthesis
Definition: Build a structure or pattern from diverse elements; put parts together to form a whole, with emphasis on creating a new meaning or structure; originality and creativity.
Sample verbs: categorizes, combines, compiles, composes, creates, devises, designs, explains, generates, modifies, organizes, plans, rearranges, reconstructs, relates, reorganizes, revises, rewrites, summarizes, tells, writes, synthesizes, imagines, conceives, concludes, invents, theorizes, constructs

Evaluation
Definition: Make judgments about the value of ideas or materials.
Sample verbs: appraises, compares, concludes, contrasts, criticizes, critiques, defends, describes, discriminates, evaluates, explains, interprets, justifies, relates, summarizes, supports, calculates, estimates, consults, judges, measures, decides, discusses, values, accepts/rejects

Affective Domain

Receiving Phenomena
Definition: Awareness, willingness to hear, selected attention.
Sample verbs: asks, chooses, describes, follows, gives, holds, identifies, locates, names, points to, selects, sits erect, replies, uses

Responding to Phenomena
Definition: Active participation on the part of the learners; attends and reacts to a particular phenomenon. Learning outcomes may emphasize compliance in responding, willingness to respond, or satisfaction in responding (motivation).
Sample verbs: answers, assists, aids, complies, conforms, discusses, greets, helps, labels, performs, practices, presents, reads, recites, reports, selects, tells, writes

Valuing
Definition: The worth or value a person attaches to a particular object, phenomenon, or behavior; this ranges from simple acceptance to the more complex state of commitment.
Sample verbs: completes, demonstrates, differentiates, explains, follows, forms, initiates, invites, joins, justifies, proposes, reads, reports, selects, shares, studies, works

Organization
Definition: Organizes values into priorities by contrasting different values, resolving conflicts between them, and creating a unique value system; the emphasis is on comparing, relating, and synthesizing values.
Sample verbs: adheres, alters, arranges, combines, compares, completes, defends, explains, formulates, generalizes, identifies, integrates, modifies, orders, organizes, prepares, relates, synthesizes

Internalizing Values
Definition: Has a value system that controls behavior; the behavior is pervasive, consistent, predictable, and, most importantly, characteristic of the learner.
Sample verbs: acts, discriminates, displays, influences, listens, modifies, performs, practices, proposes, qualifies, questions, revises, serves, solves, verifies

Psychomotor Domain

Imitation
Definition: Repeating an act that has been demonstrated or explained, including trial and error until an appropriate response is achieved.
Sample verbs: begin, assemble, attempt, carry out, copy, calibrate, construct, dissect, duplicate, follow, mimic, move, practice, proceed, repeat, reproduce, respond, organize, sketch, start

Manipulation
Definition: Repeating an act after being given oral or written instructions on how to do a certain activity.
Sample verbs: (similar to imitation) acquire, assemble, complete, conduct, do, execute, improve, maintain, make, manipulate, operate, pace, perform, produce, progress, use

Precision
Definition: The response is complex and performed without hesitation.
Sample verbs: achieve, accomplish, advance, exceed, excel, master, reach, refine, succeed, surpass, transcend

Articulation
Definition: Skills are so well developed that the individual can modify movement patterns to fit special requirements or to meet a problem situation.
Sample verbs: adapt, alter, change, excel, rearrange, reorganize, revise, surpass

Naturalization
Definition: The response is automatic; one acts "without thinking."
Sample verbs: arrange, combine, compose, construct, create, design, refine, originate, transcend

Learning Task 2

 In your major area, it is believed that 50% of the content covers the theoretical aspect and 50% the
practical aspect. As a teacher, which instructional objectives do you think measure the
theory and which measure the practice? Give an example of each.
___________________________________________________________________________
___________________________________________________________________________
____________________________________________________________________

2.4 Criteria for Selecting/writing Appropriate Objectives


Answers to the following questions serve as criteria for this purpose:
 Do the objectives include all important outcomes of the course?
 Are the objectives in harmony with the general goals of the institute/school?
 Are the objectives in harmony with sound principles of learning?
 Are the objectives realistic in terms of the abilities of students, the time and facilities
available?
2.5 Steps for Stating Instructional Objectives
The list of objectives for a course or unit should include all important learning outcomes
(cognitive, affective, and psychomotor) and should be stated in a manner that clearly
conveys what students should be like at the end of the learning process. The following summary of steps
provides guidelines for obtaining a clear statement of instructional objectives.
 Stating general instructional objectives
 State each general objective as an intended learning outcome, i.e., in terms of students'
terminal performance
 Begin each general objective with a verb; like knows, applies, interprets
 State each general objective to include only one general learning outcome; that is,
objectives should be unitary (not "knows and understands")

 State each general objective at the proper level of generality; it should encompass a
readily definable domain of response
 Stating specific learning outcomes
 List beneath each general instructional objective a representative sample of specific
learning outcomes that describes the terminal performance students are expected to
demonstrate
 Begin each specific learning outcome with an action verb that specifies observable
performance like identifies, describes
 Make sure that each specific learning outcome is relevant to the general objective it
describes
 Include a sufficient number of specific learning outcomes to describe adequately the
performance of students who have attained the objective (a hypothetical worked example follows)
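As a hypothetical illustration of these steps (the topic and the outcomes below are ours, chosen to fit this module's subject matter, not prescribed by it), a general objective with a representative sample of specific learning outcomes might read:

General instructional objective: Understands the concept of test reliability.
Specific learning outcomes:
1. Defines reliability in his or her own words.
2. Distinguishes reliability from validity.
3. Identifies factors that lower the reliability of a classroom test.

Note that the general objective begins with a general verb (understands), while each specific outcome begins with an action verb (defines, distinguishes, identifies) naming observable performance.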
Self-Check Exercises

Dear Trainees! You have now completed your study of chapter two. Therefore, you are expected to answer
the following self-test questions.

Instruction I: Match the domain descriptions listed under Column A with their appropriate domains
under Column B.

Column A
1. The ability to break down a material into its components
2. The ability to use learned materials in new and concrete situations
3. The ability to put parts together to form a new whole
4. The ability to judge the value of a material for a given purpose
5. The ability to react by doing something
6. The ability to remember previously learned material

Column B
A. Application
B. Evaluation
C. Comprehension
D. Analysis
E. Synthesis
F. Responding
G. Knowledge
H. Organization

Instruction II: Choose the best answer from the alternatives given for each item.

7. Which one of the following action verbs can be used in synthesis level of cognitive domain?

A. Write B. Transfer C. Distinguish D. Interpret

8. Which one of the following action verbs does not indicate learning outcome at the
evaluation level?
A. Decide B. Conclude C. Validate D. Discriminate
9. In which one of the following affective domain levels is the learner expected to develop a
consistent philosophy of life?

A. Organization C. Characterization

B. Valuing D. Responding


UNIT THREE
CLASSROOM ACHIEVEMENT TESTS AND ASSESSMENTS
3.1 INTRODUCTION
The classroom test, otherwise called the teacher-made test, is an instrument of measurement and
evaluation. It is a major technique used for the assessment of students' learning outcomes. Classroom
tests can be achievement or performance tests and/or any other type of test (a practical test, etc.)
prepared by the teacher for his specific class and purpose, based on what he has taught.

Tests may be classified into two broad categories on the basis of nature of the measurement. These are:
Measures of maximum performance and measures of typical performance. In measures of maximum
performance, you have those procedures used to determine a person’s ability. They are concerned with how
well an individual performs when motivated to obtain as high a score as possible and the result indicates
what individuals can do when they put forth their best effort. Can you recall any test that should be included
in this category? Examples are aptitude tests, achievement tests and intelligence tests.

On the other hand, measures of typical performance are those designed to reflect a person’s typical
behaviour. They fall into the general area of personality appraisal such as interests, attitudes and various
aspects of personal-social adjustment. Because testing instruments cannot adequately be used to measure
these attributes, self-report and observational techniques, such as interviews, questionnaires, anecdotal
records and ratings, are sometimes used. These techniques are used in relevant combinations to provide the
desired results on which accurate judgments concerning learners' progress and change can be made.

3.2. LEARNING OBJECTIVES OF THE UNIT


After going through this unit the learner should be able to:
 List the different types of items used in classroom test.
 Describe the different types of objective questions.
 Describe the different types of essay questions.
 Compare the characteristics of objective and essay tests.
 Explain the meaning of objective test,
 State the advantages and disadvantages of objective test;
 Identify when to use objective test;
 Enumerate the various types of objective test and their peculiar advantages and disadvantages;

3.3. TYPES OF TESTS USED IN THE CLASSROOM
There are different types of test forms used in the classroom. These can be essay tests, objective tests,
norm-referenced tests or criterion-referenced tests. But we are going to concentrate on the essay test and the
objective test. These are the most common tests which you can easily construct for your purposes in the class.
3.3.1. Objective tests
Objective tests are those whose items are set in such a way that one and only one correct answer is available
for a given item. In this case every scorer would arrive at the same score for each item for each examinee,
even on repeated scoring occasions. This type of item sometimes calls on examinees to recall and write
down, or supply, a word or phrase as an answer (free-response type). It could also require the examinees
to recognize and select, from a given set of possible answers or options, the one that is correct or most correct
(fixed-response type). This implies that the objective test consists of items measuring specific skills, with a
specific correct response to each item irrespective of the scorer's personal opinion, bias, mood or
health at the time of scoring.

Types of objective tests


The objective test can be classified into those that require the examinee to supply the answer to the test
items (free-response type) and those that require the examinee to select the answer from a given number of
alternatives (fixed-response type). The free-response type consists of short-answer and completion items,
while the fixed-response type is commonly further divided into true-false (alternative-response), matching
and multiple-choice items.
Objective test items fall into two families:
 Supply test items: short answer, completion
 Selection test items: multiple choice, matching, arrangement, true/false

Fig 1. Types of objective test items

Let us look at them briefly one by one.


I. Selection type items
This is the type in which possible alternatives are provided for the testee to choose the most appropriate or
correct option. Can you mention them? Let us take them one by one.

 True/False or two-option items
The true-false type of test is representative of a somewhat larger group called alternative-response items,
such as yes-no, correct-incorrect, agree-disagree, right-wrong, etc. This group consists of any question in which
the student is confronted with two possible answers. Since most of the points discussed here are equally
applicable to all alternative-response items, and since teachers are most familiar with the true-false type, the
following discussion will concentrate on true-false items.
 Advantages of true/false items
 It is commonly used to measure the ability to identify the correctness of statements of fact,
definitions of terms, statements of principles and other relatively simple learning outcomes for
which a declarative statement might be used with any of the several methods of responding.
 It is also used to measure an examinee's ability to distinguish fact from opinion, or superstition from
scientific belief.
 It is used to measure the ability to recognize cause-and-effect relationships.
 It is best used in situations in which there are only two possible alternatives such as right or wrong,
more or less, and so on.
 It is easy to construct alternative-response items, but the validity and reliability of such items depend
on the skill of the item constructor. Constructing unambiguous alternative-response items that
measure significant learning outcomes requires much skill.
 A large number of alternative-response items covering a wide area of sampled course material can
be obtained, and examinees can respond to them in a short period of time.
 Disadvantages of true/false items
 It requires course material that can be phrased so that a statement is true or false without
qualification or exception, which is often difficult, for example in the Social Sciences.
 It is limited to learning outcomes in the knowledge area, except for distinguishing between fact
and opinion or identifying cause-and-effect relationships.
 It is susceptible to guessing, with a fifty-fifty chance of the examinee selecting the correct answer
by chance alone (see the sketch following this list). The chance selection of correct answers has the following effects:
i. It reduces the reliability of each item, thereby making it necessary to include many items
in order to obtain a reliable measure of achievement.
ii. The diagnostic value of answers to guessed test items is practically nil, because analysis
based on such responses is meaningless.
iii. The validity of examinees' responses is also questionable because of response set.
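To see how serious the fifty-fifty guessing problem is, here is a minimal Python sketch. It also applies the classical correction-for-guessing formula R - W/(k-1), a standard device from classical test theory which, for two-option items, reduces to rights minus wrongs; the module itself does not mandate this correction:

    # Expected score from blind guessing on alternative-response items, plus the
    # classical correction-for-guessing formula (standard practice in classical
    # test theory, not a rule stated in this module).

    def expected_chance_score(n_items, n_options=2):
        # Expected number right if every item is answered by pure guessing.
        return n_items / n_options

    def corrected_score(right, wrong, n_options=2):
        # Classical correction: R - W/(k-1); for true-false items this is R - W.
        return right - wrong / (n_options - 1)

    print(expected_chance_score(50))            # 25.0 items right by chance alone
    print(corrected_score(right=35, wrong=15))  # 20.0 after penalizing likely guesses

This is why a true-false test needs many items before it yields a reliable measure of achievement.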

 Guidelines for preparing true-false items

 Keep language as simple and clear as possible.


 Avoid using universal descriptors such as "never", "none", "always", and "all"; test-wise students
will recognize that there are few absolutes.
 Use certain key words sparingly, since they tip students off to the correct answer. (The words all,
always, never, none, and only usually indicate a false statement, whereas the words generally,
sometimes, usually, maybe, and often are frequently used in true statements.)
 Do not include two ideas in one statement unless you are evaluating students' understanding of
cause-and-effect relationships.
- Poor: Porpoises are able to communicate because they are mammals. T F
- Better: Porpoises are mammals. T F
- Porpoises are able to communicate. T F
 Include more false than true statements in any given test, because:
- there is a tendency to mark more statements true than false; and
- discrimination between those who know the content and those who do not is greater for false
statements.
 Avoid using negative statements, especially double negatives; under the demands of the testing
situation, students may fail to see the negative qualifier.
- Poor: None of the steps in the planning stage of test construction are not important. T F
- Better: All of the steps in the planning stage of test construction are important. T F
 Test important ideas, knowledge, or understanding (rather than trivia, general knowledge, or
common sense). Look at the following examples:
o Artists live longer than farmers. T F
o The coefficient of correlation shows the cause-and-effect relationship between two
paired variables. T F
 Avoid copying statements directly from textbooks and other written materials.
 Keep the word length of true statements about the same as that of false statements.
 Make sure that the statements used are entirely true or entirely false. (Partially or marginally true
or false statements cause unnecessary ambiguity.)

Matching items
The matching test item usually consists of two parallel columns. One column contains a list of words, numbers,
symbols or other stimuli (premises) to be matched to the words, sentences, phrases or other possible answers
in the other column (responses). The examinee is directed to match the responses to the appropriate
premises. Usually, the two lists have some sort of relationship. Although the basis for matching responses
to premises is sometimes self-evident, more often it must be explained in the directions.

The examinee's task, then, is to identify the pairs of items that are to be associated on the basis indicated.
Sometimes the premises and responses lists are an imperfect match, with more entries in one of the two columns
and the directions indicating what is to be done; for instance, the examinee may be required to use an item
once, more than once, or not at all. This deliberate procedure is used to prevent examinees from matching
the final pair of items on the basis of elimination.
 Advantages of matching items
 It is used whenever learning outcomes emphasize the ability to identify the relationship
between things and a sufficient number of homogenous premises and responses can be
obtained.
 Essentially used to relate two things that have some logical basis for association.
 It is adequate for measuring factual knowledge like testing the knowledge of terms, definitions,
dates, events, references to maps and diagrams.
 The major advantage of matching exercise is that one matching item consists of many
problems. This compact form makes it possible to measure a large amount of related factual
material in a relatively short time.
 It enables the sampling of larger content, which results in relatively higher content validity.
 The guess factor can be controlled by skilfully constructing the items such that the correct
response for each premise must also serve as a plausible response for the other premises.
 The scoring is simple and objective and can be done by machine.
 Disadvantages of matching items
 It is restricted to the measurement of factual information based on rote learning, because the
material tested lends itself to the listing of a number of important and related concepts.
 Many topics are unique and cannot be conveniently grouped into homogenous matching clusters,
and it is sometimes difficult to obtain homogenous clusters of premises and responses that match
sufficiently, even for content that is adaptable to clustering.
 It requires extreme care during construction in order to avoid encouraging serial memorization
rather than association and to avoid irrelevant clues to the correct answer.

 Guidelines for preparing matching items

 Use only homogeneous material in a set of matching items (i.e., dates and places should not be
in the same set).
 Use the more involved expressions in the stem and keep the responses short and simple.
 Supply directions that clearly state the basis for the matching, indicating whether or not a
response can be used more than once, and stating where the answer should be placed.
 Make sure that there are never multiple correct responses for one stem (although a response
may be used as the correct answer for more than one stem).
 Avoid giving inadvertent grammatical clues to the correct response (e.g., using a/an, singular/
plural verb forms).
 Arrange items in the response column in some logical order (alphabetical, numerical, and
chronological) so that students can find them easily.
 Avoid breaking a set of items (stems and responses) over two pages.
 Use no more than 15 items in one set.
 Provide more responses than stems to make process-of-elimination guessing less effective (see the numerical sketch after the example below).
 Number each stem for ease in later discussions.
 Use capital letters for the response signs rather than lower-case letters.
Example:
Directions: 1. On the line to the right of each phrase in Column I, write the letter of the word in
Column II that best matches the phrase.
2. Each word in Column II may be used once, more than once, or not at all.

Column I
1. Name of the answer in addition problems
2. Name of the answer in subtraction problems
3. Name of the answer in multiplication problems
4. Name of the answer in division problems

Column II
A. Difference
B. Dividend
C. Multiplicand
D. Product
E. Quotient
F. Subtrahend
G. Sum
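The guideline on providing more responses than stems can be put in numbers. Here is a minimal Python sketch (the figures are invented for illustration) of the chance of guessing the one remaining premise correctly after an examinee has matched all the premises he or she actually knows, assuming each response may be used at most once:

    # Why extra responses blunt process-of-elimination guessing.
    def p_guess_last(n_responses, known_stems):
        # Chance of guessing one remaining stem correctly after eliminating
        # the responses already used for the known stems.
        return 1 / (n_responses - known_stems)

    print(p_guess_last(n_responses=4, known_stems=3))  # 1.0: the last match is a free point
    print(p_guess_last(n_responses=7, known_stems=3))  # 0.25: the examinee must still guess

With four stems and only four responses, knowing three matches guarantees the fourth; with seven responses, four candidates remain, so elimination no longer hands over the answer.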

The multiple-choice items (MCQs)
The multiple-choice item consists of two parts: a problem and a list of suggested solutions. The problem,
generally referred to as the stem, may be stated as a direct question or an incomplete statement, while the
suggested solutions, generally referred to as the alternatives, choices or options, may include words,
numbers, symbols or phrases. In its standard form, one of the options of the multiple-choice item is the
correct or best answer, and the others are intended to mislead, foil, or distract examinees from the correct
option; they are therefore called distracters, foils or decoys. These incorrect alternatives receive their name
from their intended function: to distract the examinees who are in doubt about the correct answer.
 Advantages of the MCQs
 The multiple-choice item is the most widely used of the available item types. It can be
used to measure a variety of learning outcomes, from simple to complex.
 It is adaptable to any subject matter content and educational objective at the knowledge
and understanding levels.
 It can be used to measure knowledge outcomes concerned with vocabulary, facts,
principles, method and procedures and also aspects of understanding relating to the
application and interpretation of facts, principles and methods.
 Most commercially developed and standardized achievement and aptitude tests make use
of multiple-choice items.
 The main advantage of the multiple-choice test is its wide applicability in the measurement of
various phases of achievement.
 It is the most desirable of all the test formats, being free of many of the disadvantages of other
forms of objective items. For instance, it presents a more well-defined problem than the
short-answer item, avoids the need for the homogenous material necessary for the matching
item, reduces the clues and susceptibility to guessing characteristic of the true-false item,
and is relatively free from response sets.
 It is useful in diagnosis, and it enables fine discrimination among examinees on the basis
of how much of what is being measured they possess.
 It can be scored with a machine.
 Disadvantages/limitations of the MCQs
 It measures problem-solving behaviour at the verbal level only.
 It is inappropriate for measuring learning outcomes requiring the ability to recall, organize
or present ideas, because it requires only the selection of the correct answer.
 It is very difficult and time consuming to construct.

 It requires more response time than any other type of objective item and may favour
test-wise examinees if not adequately and skilfully constructed.
 Measuring evaluation and synthesis can be difficult.
 Inappropriate for measuring outcomes that require skilled performance
 Guidelines for preparing multiple-choice items

 Use the stem to present the problem or question as clearly as possible; eliminate excessive
wordiness and irrelevant information.
 Use direct questions rather than incomplete statements for the stem.
 Include as much of the item as possible in the stem so that alternatives can be kept brief. Include
in the stem words that would otherwise be repeated in each option.
 In testing for definitions, include the term in the stem rather than as one of the alternatives.
 List alternatives on separate lines rather than including them as part of the stem so that they
can be clearly distinguished.
 Keep all alternatives in a similar format (e. g. All phrases, all sentences, etc.).
 Make sure that all options are plausible responses to the stem. (Poor alternatives should not be
included just for the sake of having more options.)
 Check to see that all choices are grammatically consistent with the stem.
 Try to make alternatives for an item approximately the same length. (Making the correct
response consistently longer is a common error.)
 Use misconceptions which students have indicated in class or errors commonly made by
students in the class as the basis for incorrect alternatives.
 Use “all of the above” and “none of the above” sparingly since these alternatives are often
chosen on the basis of incomplete knowledge. Words such as “all,” “always,” and “never” are
likely to signal incorrect options.
 Use capital letters (A, B, C, D, and E) on tests as responses rather than lower-case letters ("a" gets
confused with "d" and "c" with "e" if the type or duplication is poor). Instruct students to use
capital letters when answering (for the same reason), or have them circle the letter or the whole
correct answer, or use scannable answer sheets.
 Try to write items with equal numbers of alternatives in order to avoid asking students to
continually adjust to a new pattern caused by different numbers.
 If an incomplete statement is used as the stem, put the incomplete part at the end rather than at the
beginning of the statement.
 Use negatively stated items sparingly. (When they are used, it helps to underline or otherwise
visually emphasize the negative word.)
 Make sure that there is only one best or correct response to the stem. If there are multiple correct
responses, instruct students to “choose the best response.”
 Limit the number of alternatives to five or fewer. (The more alternatives used, the lower the
probability of getting the correct answer by guessing; beyond five alternatives, however,
confusion and poor alternatives are likely.)
II. Supply type items
This is the type of test item which requires the testee to give very brief answers to the questions. These
answers may be a word, a phrase, a number, a symbol or symbols, etc. Supply test items can be in short-answer
or completion form. Both are supply-type test items consisting of direct questions which
require a short answer (short-answer type) or an incomplete statement or question to which a response must
be supplied by the examinee (completion type). The answers to such questions could be a word, phrase,
number or symbol. Such items are easy to develop, and if well developed, the answers are definite and
specific and can be scored quickly and accurately.
 Advantages of supply type items

 Suitable for measuring simple learning outcomes
 Measure the ability to interpret diagrams, charts, graphs and pictorial data
 Most effective for measuring specific learning outcomes such as computational learning
outcomes in mathematics and the sciences
 Measure simple learning outcomes, which makes them easier to construct
 Minimize guessing
 Disadvantages of supply type items
 They are not suitable for measuring complex learning outcomes. They tend to measure only factual
knowledge and not the ability to apply such knowledge, and they encourage memorization if
used excessively.
 They cannot be scored by machine because the test item can, if not properly worded, elicit more
than one correct answer. Hence the scorer must make decisions about the correctness of various
responses. For example, a question such as "Where was Menelik II born?" could be answered
by the name of the town, state, country or even continent. Apart from the multiple correct answers
to this question, there is also the possibility of spelling mistakes associated with free-response
questions that the scorer has to contend with.
 Guidelines for preparing short-answer items
 Questions must be carefully worded so that all students understand the specific nature of the
question asked and the answer required.

_Poor: Tewodros II defeated Ras Ali in _______________?

_Better: In what battle fought in 1853 did Tewodros II defeat Ras Ali?

Or: In what year did Tewodros II defeat Ras Ali at Ayshal?

 Word completion or fill-in-the-blank questions so that the missing information is at, or near, the end of
the sentence. This makes reading and responding easier.

_Poor: In the year ______________ Canada turned 100 yrs old.

_Better: Canada turned 100 yrs old in the year____________.

 When an answer is to be expressed in numerical units, the unit should be stated.

_ Poor: If a room measures 7 meters by 4 meters, the perimeter is _____________.

- Better: If a room measures 7 meters by 4 meters, the perimeter is ______meters (or m).

 Do not use too many blanks in completion items. The emphasis should be on knowledge and
comprehension not mind reading.

Consider: In the year __________, Prime Minister ___________ signed the _________, which led
to a __________ which was ____________.

 Word each item in specific terms with clear meanings so that the intended answer is the only one
possible, and so that the answer is a single word, brief phrase, or number.

 Omit important (key) words.

_In supply items, present much of the statement and blank the key word.

Poor: ___________ ____________ are words that refer to particular ________, ________,
or ___________.

_Better: Proper nouns are words that refer to particular _________, __________ or
________.

_Best: Words that refer to particular persons, objects or things are ____________.

3.3.2. Essay type items

Essay test items are of two types:

 Extended/Unrestricted/Open-ended/free response
 Restricted/Closed-ended
 Extended response items

 No restrictions on the form of the response
 No restriction on the number of pages
 Originality required
 No bound on the depth, breadth and organization of the response
 Expose individual differences in attitudes, values and creative ability
 Applicable in measuring higher-level cognitive outcomes such as those at the analysis,
synthesis and evaluation levels

Limitations/disadvantages of extended response essay items

o Scoring is difficult and unreliable (scorer unreliability)

o Insufficient for measuring knowledge of facts

Examples-extended responses type:

 Describe the sampling techniques used in research studies.

 Explain the various ways of preventing accidents in a school workshop or laboratory.

 Describe the processes of producing or cutting screw threads in the school technical workshop.

 Describe the processes involved in cement production.

 Why should the classroom teacher state his instructional objectives to cover the three domains of
educational objectives?

 Open and Distance Learning is a viable option for the eradication of illiteracy in Ethiopia. Discuss

 Which of the following alternatives would you favor & why?

 Explain why you agree or disagree with the following statement:

 Restricted essay types

 Such items are directional questions aimed at eliciting the desired responses.
 Useful for measuring learning outcomes at the lower cognitive levels:
o knowledge, comprehension and application
 More efficient for measuring knowledge of factual information
 More reliable in scoring compared to the extended type
 Reduce scoring difficulty
 Examples:
o Give three advantages and two disadvantages of essay tests.
o State four uses of tests in education.
o Explain five factors which influence the choice of a building site.
o Mention five rules for preventing accidents in a workshop.
o State five technical drawing instruments and their uses.

 Advantages of essay tests

 They reveal how ideas are related to each other.
 They increase test security.
 They are relatively easy to construct compared to objective items.
 They measure higher-level learning outcomes.
 They have a positive influence on students' study habits.
 They require the instructor to give critical comments.
 They are easy and economical to administer.
 They promote the development of problem-solving skills.


 Disadvantages/limitations
 Scoring is time consuming, subjective and difficult.
 Low content validity: inadequate sampling of content
 Scorer unreliability/subjectivity in scoring
 Not suitable for item analysis; difficulty levels and discrimination powers cannot readily
be determined
 How to make essay questions less subjective
 Avoid open-ended questions.
 Let all students answer the same questions; avoid optional questions.
 Use students' numbers instead of their names, to conceal their identity.
 Score all the answers to each question for all students at a time.
 Do not allow the score on one question to influence you while marking the next. Always rearrange
the papers before you mark.
 Do not allow your feelings or emotions to influence your marking.
 Decide on a policy for handling irrelevant or incorrect responses, covering matters such as:
_ Handwriting
_ Spelling errors
_ Sentence structure
_ Punctuation
_ Neatness
_ Poor grammar
_ Bluffing
 Write comments on the paper.
3.3.3. Authentic assessment

What is authentic assessment?

Many educators use testing strategies that do not focus entirely on recalling facts. Instead, they ask students
to demonstrate practical skills and concepts they have learned. This strategy is called authentic assessment.

Authentic assessment aims to evaluate students' abilities in 'real-world' contexts. In other words, students
learn how to apply their skills to authentic tasks and projects. Authentic assessment does not encourage rote
learning and passive test-taking. Instead, it focuses on students' analytical skills, ability to integrate what
they learn, creativity, ability to work collaboratively and written and oral expression skills. It values the
learning process as much as the finished product.

One of the focal areas of educational measurement and evaluation is assessing students' non-cognitive
performances such as their interests, attitudes, skills in performing various physical activities, carrying
out laboratory experiments, project activities, workshop products and the like. These affective and
psychomotor aspects of the learner should be well assessed. Identification of learner characteristics and
adapting classroom instruction to those characteristics can help all students master the knowledge and
skills they gain from schooling. Classroom teachers should therefore make sure that their students are
attaining the intended instructional objectives by examining students' non-cognitive performances and
tangible products.

Authentic assessment, also known as performance assessment, involves measurement activities that
ask students to demonstrate skills similar to those required in real-life settings.

In authentic assessment, students:

 show laboratory and workshop procedures
 simulate and replicate real products
 do science experiments
 conduct social-science research
 write stories and reports
 read and interpret literature
 solve math problems that have real-world applications
 perform tasks that require judgment and innovation
 replicate or simulate tasks from the workplace
 Advantages of authentic assessment
 They are likely to be more valid than conventional tests for higher-order thinking (HOT)
skills and practical workshop products, because they involve real-world tasks.
 Real-world tasks are more interesting for students and thus more motivating.
 They provide more specific and usable information about student performance.
 Limitations/disadvantages
 They require more time and effort on an instructor's/teacher's part to develop.
 They are more difficult to grade:
o it is often useful to create a grading rubric and other scoring tools.
 Tools for authentic assessment
 Observation (and observation devices)
 Group work
 Project work
 Reflection
 Portfolio
 Role playing
 Drama
 Concept map

For practical subjects, this is the most obvious form of assessment: watching someone do something to
see if they can do it properly. It is the recommended form for competency-based programs. For any area in
which performance itself is not enough, direct observation needs to be supplemented by other methods.
One observation is not enough, but there is a trade-off, because observation is an extremely expensive way
of assessing.

Direct observation is valuable for collecting real information and checking the actual performance of the
learners. Reliability is only assured when everyone engaged in the assessment process is perfectly clear about
what is being looked for, and what evidence is required to determine competence. Developing observation
protocols is not a trivial activity. During observation we may use different instruments and devices that
make our observation more critical and evidence-based. Some commonly used devices are: checklists,
anecdotal records, rating scales and running records.

I. Checklist

A checklist is a type of informational job aid used to reduce failure by compensating for potential limits of
human memory and attention. It helps to ensure consistency and completeness in carrying out a task. A
basic example is the "to do list."

It is prepared to check that all important tasks have been done in the proper order. Such lists may be
prepared as shopping checklists, task checklists, to-do checklists, invitation checklists, guest checklists,
packing checklists, etc. A checklist is a list of items you need to verify, check and inspect. Checklists are
used in every imaginable field, from building inspections to complex medical surgeries.

Example: An ICT instructor wants to evaluate students' performance on an Excel document task
out of ten points.

NO General layout and formatting requirement Yes No

1 Are merged cells absent from the data area of the table (i.e. merged cells appear only in headers
and titles)?

2 Do all active worksheets in the workbook have clear and concise names that allow
the user to identify the source and contents of the table?

3 Are tables prefixed with the table name and table number?

4 Are table header rows formatted to repeat on the top of the table as it goes from
one page to another?

5 If color is used to emphasize the importance of text, is there an alternate method?

6 Does each separate table have its own worksheet and are each of those worksheets
named?

7 Have track changes been accepted or rejected and turned off?

8 Have extraneous comments been removed?

9 Is the document free of text boxes?

10 Has the document been reviewed in print preview for a final visual check?
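
Where each "Yes" on such a checklist is worth one point, tallying the result is a simple count. The
following Python sketch is illustrative only (the responses shown are hypothetical):

# Minimal sketch: score the ten-item Excel checklist above, assuming each
# "Yes" earns one point, for a result out of ten.
responses = {
    1: "Yes", 2: "Yes", 3: "No", 4: "Yes", 5: "Yes",
    6: "No", 7: "Yes", 8: "Yes", 9: "Yes", 10: "Yes",
}  # hypothetical observations for one student

score = sum(1 for answer in responses.values() if answer == "Yes")
print(f"Checklist score: {score}/10")  # -> Checklist score: 8/10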

II. Anecdotal records

What is an anecdotal record?

An anecdotal record is like a short story that teachers use to record a significant incident that they have
observed. Anecdotal records are usually relatively short and may contain descriptions of behaviors and
students' actual performance.

Major characteristics of anecdotal records:

• Simple reports of practical performance

• Results of direct observation in laboratory, field work and workshop settings

• Accurate and specific descriptions of students' behavior

• Give the context of students' behavior

• Record typical or unusual performance, skills and behaviors

Anecdotes capture the richness and complexity of the moment as students interact with one another and
with materials. These records of student behavior and learning accumulated over time enhance the teacher's
understanding of the individual student as patterns or profiles begin to emerge. Behavior and performance
change can be tracked and documented, and placed in the student’s portfolio resulting in suggestions for
future observations and planning.

Anecdotal notes for a particular student can be periodically shared with that student or be shared at the
student’s request. They can also be shared with students and parents at parent–teacher–student conferences.

The purpose of anecdotal notes is to:

1) Provide information regarding a student's skill development over a period of time

2) Provide ongoing records about individual instructional needs

3) Capture observations of significant skills and behaviors

4) Provide ongoing documentation of learning

Example of Anecdotal Records:

Student's Name: Tesfaye Tolossa

Date & Time: 18/03/16 8:45 am

Place/ Learning Center: free choice-art area

Observed Event & skill: Drawing skill

Tesfaye was in the art area during free choice. He was making letters, rolling
the paper and then he tied the paper roll with a string. He demonstrated this
process to Aster, Abdela and Mesfin who were also in the art area.

III. Rating scales

A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative
attribute. In the social sciences, particularly psychology, common examples are the Likert scale and 1-10
rating scales in which a person selects the number considered to reflect the perceived quality of a
product. A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the
rated object, as a measure of some rated attribute.

An example of a rating scale for an information literacy assignment

Please indicate the student's skill in each of the following respects, as evidenced by this assignment, by
checking the appropriate box.

Rating categories: Outstanding / Very good / Acceptable / Marginally acceptable / Inadequate

No  Skill area (check one category per row)
1   Identify, locate and access sources of information
2   Critically evaluate information, including its legitimacy, validity and appropriateness
3   Organize information to present a sound central idea supported by relevant material in logical order
4   Use information to answer questions and solve problems
5   Clearly articulate information and ideas
6   Use information technologies to communicate, manage, and process information
7   Use information technologies to solve problems
8   Use the work of others accurately and ethically
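
If the ratings need to be summarized numerically, the category labels can be mapped onto a Likert-style
numeric scale. The following Python sketch is illustrative only (the 5-to-1 mapping and the sample ratings
are assumptions, not part of the module):

# Minimal sketch: convert rating-scale categories to numbers and average them.
SCALE = {"Outstanding": 5, "Very good": 4, "Acceptable": 3,
         "Marginally acceptable": 2, "Inadequate": 1}

# hypothetical ratings for one student's assignment on the eight skill areas
ratings = ["Very good", "Acceptable", "Outstanding", "Very good",
           "Acceptable", "Marginally acceptable", "Acceptable", "Very good"]

numeric = [SCALE[r] for r in ratings]
average = sum(numeric) / len(numeric)
print(f"Average rating: {average:.2f} out of 5")  # -> Average rating: 3.50 out of 5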

IV. Running Records

A running record is a tool that helps teachers to identify patterns or sequences in students' practical work,
reading behaviors, laboratory experiment procedures, drawing procedures and so on. These patterns allow a
teacher to see the strategies a student uses to carry out practical tasks. A running record collects information
in narrative and pictorial form, step by step, following the flow of a certain action. Hence, a running record
shows a student's practical performance from beginning to end.

For instance, a running record is one method of assessing a child's reading level by examining both accuracy
and the types of errors made. It is most often utilized as part of a reading recovery session in school or any
education center. A running record gives the teacher an indication of whether material currently being read
is too easy or too difficult for the child, and it serves as an indicator of the areas where a child's reading can
improve. For example, if a child frequently makes word substitutions that begin with the same letter as the
printed word, the teacher will know to focus on getting the child to look beyond the first letter of a word.
A running record can also be used for practical subjects besides reading.

V. Project Work Assessment


Working as part of a group is a key skill for students, yet assessing work produced by a group can be a real
challenge: how do we assess the group work and allocate marks fairly to the individual members of the
group?

Group project work has become a common feature of higher education. Many practitioners have recognized
that if students are going to be effective at group work they often need to improve their group work skills.
The skills of interacting in a group and working together to achieve a common goal within a specific
timescale need to be learnt.

Project work challenges students to think beyond the boundaries of the classroom, helping them develop
skills, behaviors and confidence. Designing learning environments that help students question, analyze,
evaluate and extrapolate their plans, conclusions and ideas, leading them to higher order thinking, requires
feedback and evaluation that goes beyond a letter or number grade.

Since project work requires students to apply knowledge and skills throughout the project-building process,
teachers will have many opportunities to assess work quality, understanding and participation from the
moment students begin working.

For example, teachers’ evaluation can include tangible documents like the project vision, storyboard and
rough draft, verbal behaviors such as participation in group discussions and sharing of resources and ideas,

and non-verbal cognitive tasks such as risk taking and evaluation of information. They can also capture
snapshots of learning throughout the process by having students complete a project journal, a self-
assessment and by making a discussion of the process one component of the final presentation.

Project assessment may be conducted in two ways: process assessment and product assessment.

 Assessing the process: evaluating individual teamwork skills and interaction. This includes a list of
skills to assess, such as:

 adoption of complementary team roles
 cooperative behavior
 time and task management
 creative problem solving
 use of a range of working methods
 negotiation

 Assessing the product: Measuring the quantity and quality of individual work in a group project.
Process assessments are subjective and students are not always straightforward when evaluating
one another or themselves. However, in combination with product assessments and individual
assessments, they can offer valuable glimpses into how teams function and alert you to major
problems (e.g., particularly problematic team members or serious conflict), which can help to
inform your feedback and grading.
An example of peer assessment among team project members

Area: Project contributions
  Below expectation: Made few substantive contributions to the team's final product
  Good: Contributed a fair share of substance to the team's final product
  Exceptional: Contributed considerable substance to the team's final product

Area: Leadership
  Below expectation: Rarely or never exercised leadership
  Good: Accepted a fair share of leadership responsibilities
  Exceptional: Routinely provided excellent leadership

Area: Collaboration
  Below expectation: Undermined group discussions or often failed to participate
  Good: Respected others' opinions and contributed to the group's discussion
  Exceptional: Respected others' opinions and made major contributions to the group's discussion

VI. Portfolio
What is a portfolio?

A portfolio is a systematic collection of student work measured against predetermined scoring criteria.
These criteria may include scoring guides, rubrics, checklists and rating scales. In authentic assessment,
information or data is collected from various sources, through multiple methods, and over multiple points
in time.

Assessment portfolios can include performance-based assessments, such as writing samples that illustrate
different genres, solutions to math problems that show problem-solving ability, lab reports that demonstrate
an understanding of a scientific approach, or social studies research reports that show the ability to use
multiple sources. In addition, assessment portfolios can include scores on standardized and program specific
tests.

A portfolio is a collection of a student's work specifically selected to tell a particular story about the student.
Student portfolios take many forms, so it is not easy to describe them. A portfolio is not the pile of student
work that accumulates over a semester or year. Rather, a portfolio contains a purposefully selected subset
of student work. "Purposefully" selecting student work means deciding what type of story you want the
portfolio to tell. For example, do you want it to highlight or celebrate the progress a student has made?
Then, the portfolio might contain samples of earlier and later work, often with the student commenting
upon or assessing the growth. Do you want the portfolio to capture the process of learning and growth?
Then, the student and/or teacher might select items that illustrate the development of one or more skills
with reflection upon the process that led to that development. Or, do you want the portfolio to showcase the
final products or best work of a student? In that case, the portfolio would likely contain samples that best
exemplify the student's current ability to apply relevant knowledge and skills.

VII. Self-Assessment
Self-assessment is the process of looking at oneself in order to assess aspects that are important to one's
identity. It is one of the motives that drive self-evaluation, along with self-verification and self-
enhancement.

The ultimate aim of education is to produce lifelong and independent learners. An essential component of
autonomous learning is the ability to assess one's own progress and deficiencies. Student self-assessment
should be incorporated into every evaluation process. Its specific form may vary with the developmental
level of the student, but the very youngest students can begin to examine and evaluate their own behavior
and accomplishments.

Instead of grading all assignments, allow students to correct some themselves. You may choose to randomly
collect these and check for accuracy. Share the specific evaluation criteria (or rubric) students should
employ in assessing various tasks or assignments. Provide them with criteria check sheets (or have the class
generate them) that specify exactly what constitutes a good product.

Encourage the student to apply specific criteria in making the self-assessment. Self-assessment can take
many forms, including:

1) Writing workshop performance
2) Discussion (whole-class and small-group)
3) Reflection on practical products
4) Weekly laboratory self-evaluations
5) Self-assessment checklists and inventories
6) Teacher-student interviews

These types of self-assessment ask students to review their work to determine what they have learned and
what areas of confusion still exist. Although each method differs slightly, all should include enough time
for students to consider thoughtfully and evaluate their progress.

An example of a self-evaluation checklist that a student can use to assess his/her reading performance
is given below.

Name………………………………………

Year………………………………………..

Department……………………………….

Date………………………………………..

No  Activity                                                    Yes  No

1   Do I choose the correct level?
2   Do I choose a variety of books?
3   Do I listen to suggestions from others?
4   Do I use all available sources?
5   Do I enjoy silent reading time?
6   Do I choose to read at other times?
7   Do I read different books for different purposes?
8   Do I know what to do when I do not understand something?
9   Do I know what to do when I do not know a word?
10  Do I use a dictionary frequently?

SELF CHECK
1. Briefly explain the meaning of objective test.
2. What are the two major advantages of objective test items over the Essay test item?

3. What is the major feature of an objective test that distinguishes it from an essay test?
4. Which type of objective test item would you recommend for a school-wide test, and why?
5. How would you make essay questions less subjective?
6. What are the two sub-divisions of the supply test items?
7. (a) Subjectivity in scoring is a major limitation of the essay test.
True / False
(b) Essay questions cover the content of a course and the objectives as comprehensively as
possible. True / False
(c) Grading of essay questions is time consuming.
True / False
(d) Multiple choice questions should have only two options.
True / False
8. Construct five multiple choice questions in any course of your choice.
9. Give five examples of free-response or extended-response questions.
10. Briefly identify the two most outstanding weaknesses of the essay test as a measuring instrument.
11. When should an essay test be used?
12. Why do teachers use authentic assessment during students' practical performance in the automotive
workshop?
13. How can teachers assess their students during project work? Explain the mechanisms of project
assessment.
14. What are the advantages of authentic assessment?
15. Can we measure the affective and psychomotor domains by using authentic assessment?
16. What are the basic characteristics of anecdotal records?
17. Explain the difference between running records and anecdotal records.

UNIT FOUR
TEST DEVELOPMENT – PLANNING THE CLASSROOM TEST
Introduction
Dear Trainees! Welcome to the fourth chapter of this module. In this unit, you will learn how to
plan a classroom test. You will learn what to consider in the planning stage, and how to carry out
a content survey and scrutinize the instructional objectives as relevant factors in the development
of the table of specifications/test blueprint.

Objectives
Dear learner, by the time you finish this unit, you will be able to:
 Identify the sequence of planning a classroom test;
 Prepare a table of specifications for a classroom test in a given subject;
 Recognize some common problems of teacher-made tests;
 Carry out a content survey in the development of a table of specifications.
4.1 Test Development – Planning the Classroom Test

Learning Task 1

 Dear Trainees, before you read the sections below, try to answer the following
questions.
1. You as a teacher, what steps do you follow when you prepare written test items for
your students?
________________________________________________________________
______________________________________________
2. What do you know about test blue print?
________________________________________________________________
____________________________________________________
3. What are some of the common problems observed in teacher made tests?
________________________________________________________________
_______________________________________________

The development of good questions or items for a classroom test cannot be taken for granted. An
inexperienced teacher may write good items by chance, but this is not always possible. The development
of good questions or items must follow a number of principles, without which no one can guarantee that
the responses given to the tests will be relevant and consistent. In this unit, we shall examine the various
aspects of the teacher's own test.

The development of valid, reliable and usable questions involves proper planning. The plan entails
designing a framework that can guide the test developers in the item development process. This
is necessary because the classroom test is a key factor in the evaluation of learning outcomes.

The validity, reliability and usability of such a test depend on the care with which the test is planned
and prepared. Planning helps to ensure that the test covers the pre-specified instructional objectives
and the subject matter (content) under consideration. Hence, planning a classroom test entails
identifying the instructional objectives stated earlier and the subject matter (content) covered during
the teaching/learning process. This leads to the preparation of the table of specifications (the test blue
print) for the test, while bearing in mind the type of test that would be relevant for the purpose of
testing.

As a teacher, you will be faced with several problems when it comes to one of your most important
functions – evaluating learning outcomes. You are expected to observe your students in the
class, workshop, laboratory, field of play, etc. and rate their activities under these varied conditions.
You are required to correct and grade assignments and homework. You are required to give
weekly tests and end-of-term examinations. Most of the time, you are expected to decide on the
fitness of your students for promotion on the basis of continuous assessment exercises, end-of-term
examinations' cumulative results and promotion examinations given towards the end of the school
year. Given these conditions, it becomes very important that you become familiar with the planning,
construction and administration of good quality tests. An outline of the framework for planning
the classroom test will be discussed later.

4.1.1 Some Pitfalls in Teacher-Made Tests
You were told that testable educational objectives are classified by Bloom et al. (1956) into knowledge
(recall or memory), comprehension, application, analysis, synthesis, and evaluation. This means
that you should not only set objectives along these levels but also test along them. The
following observations have been made about teacher-made tests. They are listed below in order
to help you avoid them when you construct questions for your class tests.
 Most teacher-made tests are not appropriate to the different levels of learning outcomes. Teachers
specify their instructional objectives covering the whole range from simple recall to
evaluation, yet their items fall within the recall of specific facts only.
 Many of the test exercises fail to measure what they are supposed to measure. In other
words, most teacher-made tests are not valid. You may wonder what validity is. It is
a very important quality of a good test: a test is valid if it measures what
it is supposed to measure. You will read about it in detail later in this course.
 Some classroom tests do not comprehensively cover the topics taught. One of the qualities
of a good test is that it should represent the entire body of topics taught. But these tests cannot be
said to be a representative sample of the whole body of topics taught.
 Most tests prepared by teachers lack clarity in their wording. The questions are
ambiguous, imprecise, unclear and often carelessly worded. Many of the
questions are general or global questions.
 Most teacher-made tests fail item analysis: they fail to discriminate properly and are not
designed according to appropriate difficulty levels.

These are not the only pitfalls, but you should try to avoid both the ones mentioned here and those
not mentioned. Now let us look at how to develop test items.
4.1.2 Considerations in Planning a Classroom Test.
To plan a classroom test that will be both practical and effective in providing evidence of mastery
of the instructional objectives and content covered requires relevant considerations. Hence, the
following serve as a guide in planning a classroom test:
 Determine the purpose of the test;
 Describe the instructional objectives and content to be measured.
 Determine the relative emphasis to be given to each learning outcome;

 Select the most appropriate item formats (essay or objective);
 Develop the test blue print to guide the test construction;
 Prepare test items that are relevant to the learning outcomes specified in the test plan;
 Decide on the pattern of scoring and the interpretation of result;
 Decide on the length and duration of the test, and
 Assemble the items into a test, prepare direction and administer the test.

I. Analysis of the Instructional Objectives


The instructional objectives of the course are critically considered while developing the test items.
This is because the instructional objectives are the intended behavioral changes or intended
learning outcomes of instructional programs which students are expected to possess at the end of
the course or program of study. The instructional objectives usually stated for the assessment of
behavior in the cognitive domain of educational objectives are classified by Bloom (1956) in his
taxonomy of educational objectives into knowledge, comprehension, application, analysis,
synthesis and evaluation. The objectives are also given relative weights with respect to the level of
importance and emphasis given to them. Educational objectives and the content of a course form
the nucleus around which test development revolves.

II. Content Survey


This is an outline of the content (subject matter or topics) of a course or program to be covered in
the test. The test developer assigns relative weights to the outlined content – the topics and subtopics
to be covered in the test. The weighting depends on the importance and emphasis given to each
content area. A content survey is necessary since the content is the means by which the objectives are
achieved and the level of mastery determined.

4.1.3 Planning the table of specifications/test blue print


The table of specifications is a two-dimensional table that specifies the level of objectives in relation
to the content of the course. A well planned table of specifications enhances the content validity of the
test for which it is planned. The two dimensions (content and objectives) are put together in a table
by listing the objectives across the top of the table (horizontally) and the content down the table
(vertically) to provide the complete framework for the development of the test items. The table of
specifications is planned to take care of the coverage of content and objectives in the right
proportion according to the degree of relevance and emphasis (weight) attached to them in the
teaching-learning process. A table may specify either the level of objectives in relation to the content
of the course or the type of items in relation to the content of the course; these are called the table of
specifications by objective and the table of specifications by test type, respectively. A hypothetical
table of specifications by objective is illustrated in table 4.1 below:

Table 4.1 A Hypothetical Test Blue Print/Table of Specifications by objective


Content area   Weight   Knowl.   Compreh.   Applicat.   Analys.   Synth.   Eval.   Total
                        10%      15%        15%         30%       10%      20%     100%

Set A          15%      -        1          -           2         -        -       3
Set B          15%      -        1          -           2         -        -       3
Set C          25%      1        -          1           1         1        1       5
Set D          25%      1        -          1           1         1        1       5
Set E          20%      -        1          1           -         -        2       4

Total          100%     2        3          3           6         2        4       20

1. The first consideration in the development of the test blueprint is the weight to be assigned
to higher order questions and lower order questions (that is, to educational objectives
at higher and at lower cognitive levels). This is utilized in the allocation of the numbers of
questions to be developed in each cell under the content and objective dimensions. In the
hypothetical case under consideration, the weight for lower order questions
(range: knowledge to application) is 40% while that for higher order questions (range: analysis
to evaluation) is 60%. This means that 40% of the total questions should be lower order
questions while 60% of the questions are higher order questions. The learners in this case
are assumed to be at the senior secondary level of education. Also, an attempt should be
made, as in the above, to ensure that the questions are spread across all the levels of Bloom's
(1956) Taxonomy of Educational Objectives.
2. The blueprint is prepared by drawing a two-dimensional framework with the list of contents
vertically (left column) and the objectives horizontally (top row), as shown in table 4.1 above.
3. Weights are assigned in percentages to both the content and objectives dimensions, as desired
and as already stated earlier.
4. Decisions on the number of items to be set and used are the basis for determining the items for
each content area (see also the computational sketch after this list). For instance, in table 4.1, set A
is weighted 15% and 20 items are to be generated in all. Therefore, the total number of items for
set A is obtained thus:
- Set A, weight: 15% of 20 items = 3 items
- Set B, weight: 15% of 20 items = 3 items
- Set C, weight: 25% of 20 items = 5 items
- Set D, weight: 25% of 20 items = 5 items
- Set E, weight: 20% of 20 items = 4 items.
The worked out values are then listed against each content area at the extreme right
(Total column) to correspond with its particular content.
5. The same procedure is repeated for the objective dimension. Just like in the above.
- Knowledge: weight 10% of 20 items = 2 items
- Comprehension: weight 15% of 20 items = 3 items
- Application: weight 15% of 20 items = 3 items
- Analysis: weight 30% of 20 items = 6 items
- Synthesis: weight 10% of 20 items = 2 items
- Evaluation: weight 20% of 20 items = 4 items.
Here also the worked out values are listed against each objective at the last horizontal
row, alongside the provision for total.
6. Finally, the items for each content area are distributed to the relevant objectives in the
appropriate cells. This has also been indicated in table 4.1 above. The table of
specifications, now completed, serves as a guide for constructing the test items. It should be
noted that in the table, the knowledge, comprehension and application levels have 2, 3, and 3
items respectively; that is, 2+3+3 = 8 items out of 20, representing 40% of the total
test items, while analysis, synthesis and evaluation have 6, 2 and 4 items respectively; that
is, 6+2+4 = 12 items out of 20, representing 60% of the total items.
7. The development of the table of specifications is followed by item writing. Once the table of
specifications is adhered to in the item writing, the items will have appropriate content
validity at the required level of difficulty. The table of specifications is applicable both for
writing essay items (subjective questions) and for writing objective items (multiple choice
questions, matching items, completion items, true/false items).
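
The arithmetic in steps 4 and 5 can be scripted directly from the weights. The following Python sketch is
illustrative only (the weights are copied from table 4.1; the rounding rule is an assumption):

# Minimal sketch: convert percentage weights into item counts for a 20-item test.
TOTAL_ITEMS = 20

content_weights = {"Set A": 15, "Set B": 15, "Set C": 25, "Set D": 25, "Set E": 20}
objective_weights = {"Knowledge": 10, "Comprehension": 15, "Application": 15,
                     "Analysis": 30, "Synthesis": 10, "Evaluation": 20}

def allocate(weights, total):
    """Return item counts: each weight's percentage of the total, rounded to whole items."""
    return {name: round(pct * total / 100) for name, pct in weights.items()}

print(allocate(content_weights, TOTAL_ITEMS))
# -> {'Set A': 3, 'Set B': 3, 'Set C': 5, 'Set D': 5, 'Set E': 4}
print(allocate(objective_weights, TOTAL_ITEMS))
# -> {'Knowledge': 2, 'Comprehension': 3, 'Application': 3,
#     'Analysis': 6, 'Synthesis': 2, 'Evaluation': 4}

In practice, check that the rounded counts still sum to the intended total, since rounding can leave a
surplus or deficit for other weight choices.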

Table 4.2 A Hypothetical Test Blue Print/Table of Specifications by test type

Content area   Weight   True/False   Completion   Matching   Multiple choice   Essay   Total
                        10%          15%          15%        50%               10%     100%

Set A          15%      -            1            -          2                 -       3
Set B          15%      -            1            -          2                 -       3
Set C          25%      1            -            -          2                 1       4
Set D          25%      1            -            3          1                 1       6
Set E          20%      -            1            -          3                 -       4

Total          100%     2            3            3          10                2       20

4.2 Item Writing


The next task in planning the classroom test is to prepare the actual test items. The following is a
guide for item writing:
1. Keep the test blueprint in mind and in view as you are writing the test items. The blueprint
represents the master plan and should readily guide you in item writing and review.
2. Generate more items than specified in the table of specifications. This gives room for
items that do not survive the item analysis hurdles.

3. Use unambiguous language so that the demands of the item will be clearly understood.
4. Endeavour to generate the items at the appropriate levels of difficulty as specified in the
table of specifications. You may refer to Bloom's (1956) taxonomy of educational objectives
for the appropriate action verbs required for each level of objective.
5. Give enough time to allow an average student to complete the task.
6. Build in a good scoring guide at the point of writing the test items.
7. Have the test exercises examined and critiqued by one or more colleagues. Then subject
the items to scrutiny by relevant experts. The experts should include experts in
measurement and evaluation and the specific subject specialist. Incorporate the critical
comments of the experts in the modification of the items.
8. Review the items and select the best according to the laid-down table of specifications/test
blueprint. Also associated with test development is statistical analysis – item
analysis, which is used to appraise the effectiveness of the individual items. Another
important factor is reliability analysis. Both item analysis and reliability analysis will be
treated in subsequent units. Item analysis and validity are determined by trial testing
the developed items using a sample from the population for which the test is developed
(a brief sketch of the item analysis indices follows below).
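
The two classical item analysis indices – the difficulty level and the discrimination power of an item –
can be sketched briefly. The following Python code is a hypothetical illustration (the upper-group/lower-group
split at 27% is one common convention, assumed here, not prescribed by this module):

# Minimal sketch of classical item analysis: difficulty (p) and
# discrimination (D), with item scores of 1 (correct) / 0 (incorrect).
def difficulty(item_scores):
    """Proportion of examinees answering the item correctly (p-value)."""
    return sum(item_scores) / len(item_scores)

def discrimination(item_scores, total_scores, fraction=0.27):
    """D = p(upper group) - p(lower group), groups formed by total test score."""
    n = max(1, int(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = ranked[:n], ranked[-n:]
    p_upper = sum(item_scores[i] for i in upper) / n
    p_lower = sum(item_scores[i] for i in lower) / n
    return p_upper - p_lower

# hypothetical data: one item's scores and the examinees' total test scores
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
totals = [18, 7, 15, 16, 9, 14, 6, 17, 12, 8]
print(f"p = {difficulty(item):.2f}, D = {discrimination(item, totals):.2f}")
# -> p = 0.60, D = 1.00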

Self-Check Exercises

Dear Trainees! You have now completed your study of chapter four. Therefore, you are expected to answer
the following self-test questions.

1. Identify five important considerations in planning a classroom test.


2. Why is it necessary to prepare a table of specification or test blue print before writing test items?
3. What are the first steps to follow in writing test items?
4. Write five pitfalls of teacher-made tests.

54
UNIT FIVE
ASSEMBLING, REPRODUCING, ADMINISTERING, AND SCORING OF CLASSROOM TESTS
5.1 UNIT INTRODUCTION
In the last two units you learned about types of tests and how to construct tests. Here you will learn about
the administration and scoring of classroom tests. You will learn how to ensure quality in test administration,
as well as credibility and civility in test administration. Furthermore, you will learn how to score essay and
objective test items using various methods.

5.2 LEARNING OBJECTIVES OF THE UNIT


By the time you finish this unit you will be able to:
 Explain the meaning of test administration
 State the steps involved in test administration
 Identify the need for civility and credibility in test administration
 State the factors to be considered for credible and civil test administration
 Score both essay and objective test using various methods.

5.3 INTRODUCTION
Brainstorming

 Think about arranging various item formats in a test. Which item format should appear first and
which should appear last?
 Similarly, consider arranging different items within a particular format. How do you think it can be
done?
 Think about giving directions for different test formats. How do you think this can be done?

In most cases, different item formats cannot be administered orally or easily written on the
chalkboard. This means that there is a necessity for reproducing tests, and during reproduction
care must be taken to ensure that:

 test items are legible;
 similar item formats are grouped together;
 clear and concise directions are given; and
 the manner in which students will record their answers is specified.

5.4 ARRANGING THE TEST ITEMS
It is essential that the tasks presented to students be as clear as possible. To achieve this we have to
group all items of the same format together rather than intersperse them throughout the test. The
advantages of doing this are that:

 Younger children may not realize that the first set of directions is applicable to all items of a
particular format and may become confused;
 It makes it easier for the examinee to maintain a particular mental set rather than having to change
from one to another;
 It makes it easier for the teacher to score the test, mainly when hand scoring is done.

In arranging item formats, due emphasis should be given to the complexity of mental activity they demand.
Accordingly, we have to arrange item formats so that they progress from the simple to the
complex. For instance, items that measure simple recall should precede those that measure understanding
and application.

According to Gronlund (1985), item formats can be arranged in the following way, which roughly
approximates the complexity of the instructional objectives measured. Hence:

 True-false or alternative-response items;


 Matching items;
 Short-answer items;
 Multiple choice items;
 Interpretive exercise; and
 Essay questions should appear in a test in this order.

Similarly, teachers should pay due attention to arranging items within each item format. In line with this,
they should group items dealing with the same instructional objectives together. When they do this,
teachers can ascertain which learning activities appear to be most readily understood by their students. As
a rule, items should be arranged so that they progress from easy to difficult; if the test begins with easy
items that almost everyone can answer correctly, even the less able students may be encouraged to do
their best on the remaining items.

In some subjects, tests may consist of drawings or diagrams. In this case the drawings should be placed
above the stem so as not to create a break or a gap between the stem and the options.

Generally, the organization of the various test items in the final test should

- Have separate sections for each item format.


- Be arranged so that these sections progress from the easy to the complex.
- Group the items within each section so that the very easy ones are at the beginning and the items
progress in difficulty.
- Space the items so that they are not crowded and can be easily read.
- Keep all stems and options together on the same page; if possible, diagrams and questions should
be kept together.
- If a diagram is used for multiple choice exercise, have the diagram come above the stem.
- Avoid a definite response pattern to the correct answer.

Self-Check Activity
1. Enumerate the advantages of arranging test items from easy to difficult within a test.
2. What would happen if test items were presented in an interspersed fashion in a test rather than
grouped by format?
3. Genuinely, evaluate yourself and your practice in arranging test formats and test items within a
specific test format.

5.5 WRITING TEST DIRECTIONS
Teachers should be aware of the significance of providing clear and concise directions. The directions given
should transmit clear information concerning what to do, how to do it, and where to record answers. In other
words, directions should tell students:

 The time to be allotted to the various sections


 The value of the items, and
 Whether or not students should guess at any answers they may be unsure of.

While Writing Test Directions:

a) Each item format should have a specific set of directions. Besides a general set of instructions, a
specific set of instructions must be provided for each particular item format. For computational
problems, we have to state the degree of precision required and the proper units students are required
to use.
b) For objective tests at the elementary level, give the students examples and/or practice exercises so
that they will see exactly what and how they are to perform their tasks.
c) Students should be told how the test will be scored. Concerning this issue they should be informed
of matters such as whether punctuation, spelling or other conditions will be taken into consideration in
scoring essay questions; whether, in an arithmetic test, students will receive part scores for showing
a correct procedure even though they may have obtained an incorrect answer; and the like.
d) Above the second grade level, all directions should be written out. For younger
children, the directions may be read aloud in addition to being printed and available to each student.
Besides, teachers should orally give directions to certain groups of students such as slow learners, or
students with reading problems.

Even though it is not common practice in our educational system, elsewhere instructions are given
concerning guessing. Since guessing and faking are persistent sources of error in cognitively oriented tests
and affectively oriented tests respectively, various procedures have been devised, including the application
of a correction formula, to combat these problems. However, research evidence suggests that students should
be instructed to answer every item and that no correction for guessing be applied.
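
For reference, the classical correction formula (the standard form; this module does not prescribe it) is
corrected score = R - W / (k - 1), where R is the number of right answers, W the number of wrong
answers (omissions excluded) and k the number of options per item. A minimal Python sketch:

# Minimal sketch of the classical correction-for-guessing formula
# (standard form; not prescribed by this module).
def corrected_score(num_right, num_wrong, options_per_item):
    return num_right - num_wrong / (options_per_item - 1)

# e.g. 30 right, 12 wrong on a 4-option multiple-choice test:
print(corrected_score(30, 12, 4))  # -> 26.0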

Activity

1. Identify the common errors that you have made in writing clear and concise directions.
2. Write sample directions
 General direction for exam to be scored by a machine

 Direction for specific item formats
- True-False
- Matching
- Short Answer
- Essay

5.6 REPRODUCING TEST ITEMS

1. Space the items so that they are not crowded; if items are tightly crammed together, it is
difficult for students to read and answer the questions easily. This is especially true of
multiple choice items; hence for multiple choice tests the options should be placed in a vertical
column below the test item rather than in paragraph fashion.
2. For the alternate-response test, have a column of T's and F's at either the right or left hand side of
the items. If we do this, students need only circle or underline the correct response.
3. For matching exercises, have the two lists on the same page.
4. When we use an interpretive exercise, the introductory material – be it a graph, chart, diagram, or
piece of prose – and the items based on it should be on the same page. If the material used is too
long, facing pages should be used when possible.
5. All items should be numbered consecutively. For matching and multiple choice items, the
material in the list to be matched and/or the options to be used should be lettered.
6. If the responses are recorded directly on the test booklet, it will make scoring easier if all responses
to objective items are recorded on one side of the page, regardless of the item format used.
7. In the elementary grades, if workspace is needed to solve numerical problems, provide this space
in the test booklet rather than having examinees use scratch paper. This minimizes the recording
errors that might occur when students transfer their work between the test booklet and scratch
paper.
8. All illustrative material used should be clear, legible, and accurate.
9. Proofread the test carefully before it is reproduced. If you can, it is better if a teacher who teaches
the same subject checks for errors for early correction. If errors are found after the test
has been reproduced, they should be called to the students' attention before the actual test begins.
10. Even for essay tests, every student should have a copy of the test.
Teachers should not write the questions on the blackboard.

5.7 ADMINISTERING THE TEST


While administering the test, there are certain conditions that we have to establish so that each examinee
can do his/her best.

These conditions involve the following issues:

 The physical conditions should be as comfortable as possible;
 The examinees should be as relaxed as possible;
 If we find that the physical and environmental conditions are not what we would like them to
be, the interpretation of the students' scores must be made accordingly;
 Do not give tests when the examinees would normally be doing pleasant activities such as eating
their lunch, or when they are on vacation or in their sporting season, and the like. Besides, do not
give tests immediately before or after a long vacation or a holiday;
 Try to establish a positive mental attitude in the students who will be tested, since individuals
usually perform better at any activity if they approach the experience with a positive attitude;
 Similarly, teachers should do their best to lessen the tension and nervousness of students, because
people cannot do their best when they are excessively tense and nervous; tests do induce anxiety,
more so in some students than in others, and test anxiety affects the optimum performance of
students;
 Furthermore, teachers should make sure that the students understand the directions and that answer
sheets are being used correctly;
 Moreover, teachers should keep the students informed of the time remaining, for instance by
writing the time left on the blackboard at 15-minute intervals; and
 Careful proctoring should take place so that cheating is eliminated, discouraged and/or detected
(proctoring is different from merely "being present in the room"), and the single most effective
method to minimize cheating is careful proctoring.

Activity

1. Enumerate the conditions that should be fulfilled during administering tests.


2. How do you think psychological conditions are related to the performance of students on a given
test? Identify these psychological conditions. What can be done by teachers to alleviate problems
emanating from these conditions?
3. Proctoring is different from “being present in the classroom.” Discuss

5.8 SCORING THE TEST


In the evaluation of classroom learning outcomes, marking schemes are prepared alongside the construction
of the test items in order to score the test objectively. The marking scheme describes how marks are to be
distributed among the questions and between the various parts of each question. This distribution depends
on the objectives stated for the learning outcomes during teaching and the weight assigned to the questions
during test preparation and construction. The marking scheme takes into consideration the facts required to
answer the questions and the extent to which the language used meets the requirements of the subject. The
actual marking is done following the procedures for scoring essay questions (for essay questions) and for
scoring objective items (for objective items).

1. Scoring essay tests

The construction and scoring of essay questions are interrelated processes that require attention if a valid
and reliable measure of achievement is to be obtained. In the essay test the examiner is an active part of the
measurement instrument. Therefore, variabilities within and between examiners affect the resulting scores of
examinees. This variability is a source of error which affects the reliability of the essay test if not adequately
controlled. Hence, for the essay test result to serve a useful purpose as a valid measurement instrument,
conscious effort is made to score the test objectively by using appropriate methods to minimize the effect
of personal biases and idiosyncrasies on the resulting scores, and by applying standards to ensure that only
relevant factors indicated in the course objectives and called for during the test construction are considered
during the scoring. There are two common methods of scoring essay questions. These are:

I. The point or analytic method


II. The global/holistic rating method

I. The point or analytic method


In this method each answer is compared with an already prepared ideal marking scheme (scoring key) and
marks are assigned according to the adequacy of the answer. When used conscientiously, the analytic
method provides a means for maintaining uniformity in scoring between scorers and between scripts thus
improving the reliability of the scoring.

This method is generally used satisfactorily to score restricted response questions. This is made possible by
the limited number of characteristics elicited by a single answer, which thus defines the degree of quality
precisely enough to assign point values to them. It is also possible to identify the particular weakness or
strength of each examinee with analytic scoring. Moreover, it is desirable to rate each aspect of the item
separately. This has the advantage of providing greater objectivity, which increases the diagnostic value of
the result.

II. The global/holistic rating method
In this method the examiner first sorts the response into categories of varying quality based on his general
or global impression on reading the response. The standard of quality helps to establish a relative scale,
which forms the basis for ranking responses from those with the poorest quality response to those that have
the highest quality response. Usually between five and ten categories are used with the rating method,
each pile representing a degree of quality that determines the credit to be assigned. For example,
where five categories are used, the responses are awarded five letter grades: A, B, C, D and E; that is,
the responses are sorted into five categories of A-quality, B-quality, C-quality, D-quality and E-quality
responses. There is usually the need to re-read the responses and to re-classify the misclassified ones.
This method is ideal for the extended response questions where relative judgments are made (no exact
numerical scores) concerning the relevance of ideas, organization of the material and similar qualities
evaluated in answers to extended response questions. Using this method requires a lot of skill and time in
determining the standard response for each quality category. It is desirable to rate each characteristic
separately. This provides for greater objectivity and increases the diagnostic value of the results. The
following are procedures for scoring essay questions objectively to enhance reliability.

 Prepare the marking scheme or ideal answer or outline of expected answer immediately after
constructing the test items and indicate how marks are to be awarded for each section of the
expected response.
 Use the scoring method that is most appropriate for the test item.
 Decide how to handle factors that are irrelevant to the learning outcomes being measured. These
factors may include legibility of handwriting, spelling, sentence structure, punctuation and
neatness. These factors should be controlled when judging the content of the answers. Also decide
in advance how to handle the inclusion of irrelevant materials (uncalled for responses).
 Score only one item in all the scripts at a time. This helps to control the “halo” effect in scoring.
 Evaluate the responses anonymously, without knowing which examinee’s script you are scoring.
This helps in controlling bias in scoring the essay questions.
 Evaluate the marking scheme (scoring key) before actual scoring by scoring a random sample of
examinees actual responses. This provides a general idea of the quality of the response to be
expected and might call for a revision of the scoring key before commencing actual scoring.
 Make comments during the scoring of each essay item. These comments act as feedback to
examinees and a source of remediation to both examinees and examiner

 Obtain two or more independent ratings if important decisions are to be based on the results. The
results of the different scorers should be compared and the ratings moderated to resolve
discrepancies, for more reliable results.

2. Scoring objective tests

Unlike the essay test, an objective test can be scored with ease by various methods. Various techniques are
used to speed up the scoring, and the technique to use sometimes depends on the type of objective test.
Some of these techniques (illustrated by the sketch after this list) are as follows:
i. Machine Scoring
ii. Stencil Scoring
iii. Manual Scoring
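
Whichever technique is used, the underlying operation is the same: each response is compared with a fixed
answer key. Below is a minimal Python sketch of such key-based scoring; the answer key and the responses
are hypothetical examples, not taken from the module.

# Minimal sketch of key-based scoring of an objective test.
# The answer key and the student's responses are hypothetical.
answer_key = ["B", "D", "A", "C", "B"]

def score_objective(responses, key=answer_key):
    """Return the number of items answered in agreement with the key."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)

print(score_objective(["B", "D", "C", "C", "B"]))   # -> 4 out of 5 correct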

SELF ASSESSMENT EXERCISE


1. What is test administration?

2. What are the steps involved in test administration?

3. List the methods for scoring

i. Essay test item


ii. Objective test items.
4. What is a “halo” effect?
5. Discuss different conditions that should be taken into account while scoring objective and essay
tests.
6. What different techniques can be applied in using hand scoring of objective type tests?
7. Discuss the advantages and limitations of the analytic method and of the global method of scoring
essay tests.

UNIT SIX
SUMMARIZING AND INTERPRETING TEST SCORES
Introduction
Dear Trainees! Welcome to the sixth chapter of this module. In the previous unit, you have learnt
how to administer and score tests. In this unit, you will learn how to interpret the test scores using
different statistical tools such as measures of central tendency, measures of dispersion, measures
of relative position and measures of association. In addition you will learn about the standard
scores which comprise the standard deviation and the normal curve, the Z- score and the T – score.
Objectives
Dear learner, by the time you finish this unit, you will be able to:
 Interpret classroom test scores using criterion-referenced or norm-referenced approaches
 Calculate the average result of a given class of test scores
 Decide if there is a relationship between two factors
 Convert raw scores to z-scores and T-scores
 Compare one’s score with the score of a group
 Convert from one standard score to another and
 Use the standard score in interpreting test scores.
6.1 Methods of Interpreting Test Scores
Test interpretation is a process of assigning meaning and usefulness to the scores obtained from
a classroom test. This is necessary because a raw score obtained from a test, standing on its own,
rarely has meaning. For instance, a score of 50% in one mathematics test cannot be said to be
better than a score of 40% obtained by the same testee in another mathematics test. The test scores on
their own lack a true zero point and equal units. Moreover, they are not based on the same standard
of measurement and as such meaning cannot be read into the scores on the basis of which academic
and psychological decisions may be taken. To compensate for these missing properties and to
make test scores more readily interpretable various methods of expressing test scores have been
devised to give meaning to a raw score. Generally, a score is given meaning by either converting
it into a description of the specific tasks that the learner can perform or by converting it into some
type of derived score that indicates the learner’s relative position in a clearly defined reference
group. The former method of interpretation is referred to as criterion-referenced interpretation
while the latter is referred to as norm-referenced interpretation.

Learning Task 1

 You are a teacher and you have been giving your students different tests/exams.
What do you do with the test results? How do you interpret them?

_________________________________________________________________
_________________________________________________________________
_________________________________________________________

6.1.1 Criterion – Referenced Interpretation


Criterion - referenced interpretation is the interpretation of test raw score based on the conversion
of the raw score into a description of the specific tasks that the learner can perform. That is, a score
is given meaning by comparing it with the standard of performance that is set before the test is
given. It permits the description of a learner’s test performance without referring to the
performance of others. This is essentially done in terms of some universally understood measure
of proficiency like speed, precision or the percentage correct score in some clearly defined domain
of learning tasks. Examples of criterion referenced interpretation are:
 Types 60 words per minute without error.
 Measures the room temperature within ±0.1 degree of accuracy (precision).
 Defines 75% of the elementary concepts of electricity items correctly (percentage-correct
score).
Such interpretation is appropriate for tests that are focused on a single objective and for which
standards of performance can be either empirically or logically derived. The percentage-correct
score is widely used in criterion-referenced interpretation. This type of interpretation is primarily
useful in mastery testing where a clearly defined and delimited domain of learning tasks can be
most readily obtained. For Criterion-referenced test to be meaningful, the test has to be specifically
designed to measure a set of clearly stated learning tasks. Therefore, in order to be able to describe
test performance in terms of a learner’s mastery or non-mastery of the predefined, delimited and

clearly specified task, enough items are used for each interpretation to enable dependable and
informed decisions concerning the types of tasks a learner can perform.

6.1.2 Norm-Referenced Interpretation


Norm – referenced interpretation is the interpretation of raw score based on the conversion of the
raw score into some type of derived score that indicates the learner’s relative position in a clearly
defined referenced group. This type of interpretation reveals how a learner compares with other
learners who have taken the same test. Norm-referenced interpretation is usually used in
classroom test interpretation by ranking the testees’ raw scores from highest to lowest. It is
then interpreted by noting the position of an individual’s score relative to that of other testees in
the classroom test. The interpretation such as third position from highest position or about average
position in the class provides a meaningful report for the teacher and the testees on which to base
decision. In this type of test score interpretation, what is important is a sufficient spread of test
scores to provide reliable ranking. The percentage score or the relative ease or difficulty of the
test is not necessarily important in the interpretation of test scores in terms of relative performance.
6.2 Descriptive statistics

Learning Task 2

 Dear Trainees, before you read the sections below, try to define the following
terms.
1. Statistics
2. Descriptive Statistics
3. Inferential Statistics

____________________________________________________________
____________________________________________________________
____________________________________________________________

The purpose of descriptive statistical analysis is to describe the data that you have. You should
bear in mind that descriptive statistics do just what they say they will do – they describe the data
that you have. They don’t tell you anything about the data that you don’t have. For example, if you
carry out a study and find that the average number of times students in your study are pecked by

ducks is once per year, you cannot conclude that all students are pecked by ducks once per year.
This would be going beyond the information that you had.

One of the major purposes of statistics in test use is to allow us to describe and summarize data –
for example, test scores – in efficient and useful ways; that is, to:

1. Describe
2. Interpret
3. Pass judgment

For this we need to apply statistical procedures, which include the following:

1. Measures of Central tendency


2. Measures of Variability
3. Measures of Relative position
4. Measures of Relationship

6.2.1 Measures of central tendency/measures of location

 What is the central data point?

 What is the average/ most popular/ mean / typical/ “middle” / most common data value?
 A value (i.e. single number) that is used to represent where the majority of the data values
lie for a given random variable.
 Three commonly used central location measures are:
1. Mean
2. Median
3. Mode
1. The Mean (X̄)

It is the arithmetic average of the observed scores, and the most popular, most useful and most
commonly used measure of central location.

Mean from ungrouped frequency distribution

mean = (sum of all data points) / (number of data points)

1. Mean (X̄) = ∑X / N
2. Mean (X̄) = ∑fX / N

Where:
∑ = summation
X = raw score
X̄ = mean
f = frequency
N = total number of students

Example 1: The mean of the sample of five measurements 5, 3, 8, −2, 5 is given by

Mean (X̄) = (5 + 3 + 8 + (−2) + 5) / 5 = 19/5 = 3.8

Example 2: Calculate the mean of the following data.

Table 6.1

Score(x) 2 4 6 8 10

Frequency(f) 5 4 3 2 1


Mean (X̄) = ∑fX / N = [(2 × 5) + (4 × 4) + (6 × 3) + (8 × 2) + (10 × 1)] / (5 + 4 + 3 + 2 + 1)
= (10 + 16 + 18 + 16 + 10) / 15 = 70/15 ≈ 4.67

ADVANTAGES OF USING THE MEAN
 It is a single number
 It takes every data point into account
 It is simple to understand and is understood by most people
 It is an unbiased measure, meaning that it neither overestimates nor underestimates the
actual central value
DISADVANTAGES OF USING THE MEAN
 It can only be calculated for quantitative variables.
 It is affected by extreme values, commonly known as outliers (outliers are extreme values,
either extremely small or extremely large, relative to the majority of the other data
values).
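
As a quick illustration, the mean of Example 2 can be computed from its frequency table in a few lines of
Python (a minimal sketch, not part of the module):

# Mean from an ungrouped frequency distribution (data of Example 2, Table 6.1).
scores      = [2, 4, 6, 8, 10]
frequencies = [5, 4, 3, 2, 1]

n    = sum(frequencies)                                     # N = 15
mean = sum(x * f for x, f in zip(scores, frequencies)) / n  # ∑fX / N
print(round(mean, 2))   # -> 4.67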

Exercise 6.1

1. Calculate the mean of the sample of seven measurements: 7, 13, 11, 20, 15, 18, 24
2. Calculate the mean of the following data.

Score(x) 7 4 9 12 5

Frequency(f) 3 2 5 2 4

2. The Median

 The median of a random variable is the value which divides ranked data into two equal
halves, i.e. it is the middle number of an ordered set of data.
 The median is valid for quantitative random variables only.
 It is the score that divides a score distribution into two equal parts.
 It is most appropriate when dealing with small number of students.

Advantages of the median

 It is easy to understand.
 It is not affected by outliers

One of its disadvantages is that it can only be calculated for quantitative variables.

For ungrouped distribution

a) Even number of scores


- Arrange the scores in decreasing /increasing order.
- Take the sum of the two middle scores and divide by two.

Example 1: Here are the scores of six people: 14, 11, 8, 6, 7, 9. Calculate the median.

Step 1: Arrange the scores: 6, 7, 8, 9, 11, 14

Remark: Median = [(N/2)th term + (N/2 + 1)th term] / 2 = [(6/2)th term + (6/2 + 1)th term] / 2

Mdn = (3rd term + 4th term) / 2 = (8 + 9) / 2 = 17/2 = 8.50

b) Odd number of scores


- Arrange the scores in descending/ascending order.
- Take the middle score.

Example 2: Calculate the median of the following nine scores: 14, 11, 9, 6, 8, 7, 5, 17, 22.

Step 1: Arrange the scores: 5, 6, 7, 8, 9, 11, 14, 17, 22

Remark: Mdn = [(N + 1)/2]th term

Mdn = [(9 + 1)/2]th term = 5th term

Mdn = 9.

Example 3: Age in years of seven security staff.

Table 6.2

Name of the Belay John Solomon Lilly Kibrom Jemal Petros


staff

Age 23 23 28 30 38 58 63

Answer

• There are 7 pieces of data (an odd number).

Mdn = [(7 + 1)/2]th term = 4th term

Median = 30 years of age (Lilly’s age)

3. The Mode

It is the most frequent value. Mode means most ‘’popular’’. The score/s with the highest frequency
is/are the mode of the distribution. The mode is valid for both quantitative and qualitative random
variables. A set of data may have one mode, or two or more modes.

Advantages of using the mode

1. It is easy to calculate
2. It is valid for all data types.
3. It is not affected by outliers
4. Most appropriate when numerical values in a data set are labels for categories (nominal)

Disadvantages of using the mode

1. There could be more than one mode, which could lead to confusion since no single value
is then representative of the data
2. It could be a random event and not truly representative of the data, especially in a relatively
small data set.

Calculations:
 Put the individual values in numerical order ------small to large.
 Take the value which occurs the most frequently = mode.
Example 1: Number of vacation days taken by 10 security personnel last year.

Table 6.3

Officer:       1  2  3  4   5   6   7   8   9   10
Vacation days: 3  5  6  10  10  10  12  15  15  15

Mode = 10 and 15; two modes = bimodal

Example 2:

Table 6.4

Score(x) 10 11 12 13

Frequency(f) 3 4 1 5

Mode=13

Note: 1. The mode is appropriate for nominal or categorical type of data.

2. A distribution may have one mode, two modes, multiple modes or no mode.
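
The rules above are easy to express in code. The following Python sketch reproduces the module’s own
examples; the simple mode function returns every value tied for the highest frequency and does not
distinguish the “no mode” case:

from collections import Counter

def median(scores):
    # Arrange the scores in order, then take the middle score
    # (or the average of the two middle scores if N is even).
    s = sorted(scores)
    n = len(s)
    mid = n // 2
    return (s[mid - 1] + s[mid]) / 2 if n % 2 == 0 else s[mid]

def modes(scores):
    # Return all values occurring with the highest frequency.
    counts = Counter(scores)
    top = max(counts.values())
    return [value for value, freq in counts.items() if freq == top]

print(median([14, 11, 8, 6, 7, 9]))                    # -> 8.5 (Example 1)
print(modes([3, 5, 6, 10, 10, 10, 12, 15, 15, 15]))    # -> [10, 15], bimodal (Table 6.3)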

Exercise 6.2

1. Calculate the mean, median, and mode for the following scores.
Table 6.5

Score(X)   Frequency(f)   Cumulative Frequency(Cf)   fX

15 1 120 15

14 2 119 28

13 3 117 39

12 6 114 72

11 12 108 132

10 15 96 150

9 22 81 198

8 31 59 248

7 18 28 126

6 6 10 36

5 2 4 10

4 2 2 8

∑f=120

2. A survey of the ages of residents of teachers’ homes yielded the following measures of central
tendency: Mean = 70, Median = 78 and Mode = 83. In which direction is the distribution likely
to be skewed?

Shapes of distributions

Fig 1: Normal distribution

Fig 2: Skewed distribution

Fig 3: Different shapes of frequency distribution

Skewness describes the direction in which scores pile up in a distribution and gives a general
indication of the difficulty level of a test.

In a symmetrical distribution the mean, median and mode are equal and the discrimination power is
high. For Fig 3, graph “a” above, the discrimination power is greater than 0.3, while in Fig 3, in
graphs “e” and “f” above, the discrimination power is less than 0.3.

6.2.2 Measures of Variability

Consider the following three distributions:

Distribution I 37 37 37 37 37

Distribution II 33 36 37 38 41

Distribution III 1 11 19 20 134

Note: the three distributions have the same arithmetic mean=37. But there is marked difference
among the distributions.

- In Distribution I, the five values are identical.


- In Distribution II, there is small scatter of values.
- In Distribution III, there is a great dispersion of values.

Therefore, the topic on dispersion of scores is concerned with studying measures which show the
amount of variability among data.

1. The Range
2. The Variance
3. The Standard Deviation

1. The range

The range is the difference between the highest and the lowest scores in a distribution. The higher
the value of the range, the greater the difference between the students in academic achievement.
However, it is a crude measure of variability.

2. The variance

Variance is the arithmetic mean of the squared deviations of individual scores from the mean. It is
expressed in squared units. It shows a spread or dispersion of scores i.e. a tendency for any set of
scores to depart from a central point or any other point.

Ungrouped score distribution

Definitional formulae:

σ² = ∑(X − X̄)² / N  or  σ² = ∑f(X − X̄)² / N

Where:
σ² = variance
X = raw score
X̄ = mean of the distribution
f = frequency
N = total number of students

Computational formula: σ² = [N∑fX² − (∑fX)²] / N²

3. The standard deviation

The standard deviation is a measure of how much a set of scores varies on the average around the
mean of scores. In other words, it reveals how closely scores tend to vary from the mean.

Standard deviation is the positive square root of variance. It measures the extent to which scores
tend to deviate from the mean. It is useful for:

a) Comparing one or more sets of observations /scores


b) Comparing an individual’s performance to that of a group.
For ungrouped distribution:

σ = √[∑(X − X̄)² / N]  or  σ = √[∑f(X − X̄)² / N]

- The larger the σ, the greater the difference in academic achievement; the smaller the
standard deviation, the less the scores tend to vary from the mean.

Example 1: Given the following data calculate the variance and the standard deviation.

Table 6.6

X    f    Cf    fX    X²    fX²

15   1    119   15    225   225

14   2    118   28    196   392

13   3    116   39    169   507

12   6    113   72    144   864

11   12   107   132   121   1452

10   15   95    150   100   1500

9    21   80    189   81    1701

8    31   59    248   64    1984

7    18   28    126   49    882

6    6    10    36    36    216

5    2    4     10    25    50

4    2    2     8     16    32

∑f=119          ∑fX=1053   ∑X²=1226   ∑fX²=9805

σ² = [N∑fX² − (∑fX)²] / N²

= [119(9805) − (1053)²] / 119²

= (1166795 − 1108809) / 14161

= 4.09

σ = √4.09 = 2.02
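
Such computations are easy to verify with a short script. The following Python sketch (not part of the
module) reproduces the Table 6.6 result with the computational formula:

from math import sqrt

# Frequency distribution of Table 6.6
scores = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4]
freqs  = [ 1,  2,  3,  6, 12, 15, 21, 31, 18, 6, 2, 2]

n       = sum(freqs)                                     # N = 119
sum_fx  = sum(f * x for x, f in zip(scores, freqs))      # ∑fX = 1053
sum_fx2 = sum(f * x * x for x, f in zip(scores, freqs))  # ∑fX² = 9805
variance = (n * sum_fx2 - sum_fx ** 2) / n ** 2          # computational formula
print(round(variance, 2), round(sqrt(variance), 2))      # -> 4.09 2.02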

Exercise 6.3

1. Calculate the variance and standard deviation of the following data.

Score(x) 7 4 9 12 5

Frequency(f) 3 2 5 2 4

6.2.3 Measures of Position

Learning Task 3

 Dear Trainees, before you read the sections below, try to answer the
following questions.
1. If a student scored 82% in Physics and this student scored 74% in
Mathematics, can we say that the student’s performance is better in Physics
than in Mathematics? How?
____________________________________________________________
____________________________________________________________

The measures of position or location are used to locate the relative position of a specific data value
in relation to the rest of the data. The most popular measures of position are:
 Percentiles
 Z- score
 Deciles
 Quartiles

A. Percentiles and Percentile Ranks
Percentiles and percentile ranks are frequently used as indicators of performance in both the
academic and corporate worlds. Percentiles and percentile ranks provide information about how a
person or thing relates to a larger group. Relative measures of this type are often extremely
valuable to researchers employing statistical techniques.
Percentiles
A percentile is the point in a distribution at or below which a given percentage of scores is found.
OR The value below which P% of the values fall is called the Pth percentile
For example, the 5th percentile is denoted by P5, the 10th by P10 and 95th by P95.
Percentile Rank
A percentile rank is used to determine where a particular score or value fits within a broader
distribution. For example: A student receives a score of 75 out of 100 on an exam and wishes to
determine how her score compares to the rest of the class. She calculates a percentile rank for a
score of 75 based on the reported scores of the entire class. Her percentile rank in this example
would be 80, meaning that 80 percent of scores on the exam were at or below 75.
Notes:
I. A Percentile is a value in the data set.
II. A Percentile rank of a given value is a percent that indicates the percentage of data is
smaller than the value.
III. Percentiles are not the same as percentage.
Calculation of Percentiles and Percentile ranks
A. In case of ranked raw data:
The (approximate) value of the Kth percentile, Pk, is calculated by the formula:

Pk = value of the (Kn/100)th term

Where:
K = the percentile one wishes to calculate
n = the total number of values in the distribution

The percentile rank (PR) of a given value, xi, is obtained by the formula:

Percentile Rank (PR of xi) = (number of values less than xi / total number of values) × 100%

Example 1: The following data is a score of 70 students in Measurement and Evaluation.
Table 6.7
Score(X) Frequency(f) Cumulative frequency(Cf)

95 8 70

92 6 62

85 11 56

84 9 45

80 12 36

75 6 24

70 5 18

68 13 13

Based on the data given in the table 6.7 above, calculate


A. The 50th percentile
B. The 90th percentile
Solutions
A. P50 = (Kn/100)th term = (50 × 70/100)th term = 35th term = 80

B. P90 = (Kn/100)th term = (90 × 70/100)th term = 63rd term = 95

1. Based on the data given in the table above, calculate


A. The percentile rank of 70
B. The percentile rank of 84
Solutions
A. PR(70) = (number of values less than 70 / total number of values) × 100%
= (13/70) × 100%
= 18.6%

 This means that 18.6 percent of the scores on the exam were below 70.

B. PR(84) = (number of values less than 84 / total number of values) × 100%
= (36/70) × 100%
= 51.4%

 This means that 51.4 percent of the scores on the exam were below 84.
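
These formulas translate directly into code. The Python sketch below expands Table 6.7 into its 70
individual scores; rounding the term position to the nearest whole number is an assumption made here for
the “approximate” percentile formula:

def percentile(sorted_scores, k):
    # Value of the (k*n/100)th term of the ranked data (1-based position).
    n = len(sorted_scores)
    position = max(1, round(k * n / 100))
    return sorted_scores[position - 1]

def percentile_rank(sorted_scores, x):
    # Percentage of values strictly below x.
    below = sum(1 for s in sorted_scores if s < x)
    return 100 * below / len(sorted_scores)

# Expand Table 6.7 (score, frequency) into 70 individual scores.
table = [(95, 8), (92, 6), (85, 11), (84, 9), (80, 12), (75, 6), (70, 5), (68, 13)]
data = sorted(score for score, freq in table for _ in range(freq))

print(percentile(data, 50))                 # -> 80
print(percentile(data, 90))                 # -> 95
print(round(percentile_rank(data, 70), 1))  # -> 18.6
print(round(percentile_rank(data, 84), 1))  # -> 51.4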

Exercise 6.4

 Dear learners based on the data given in table 6.7 Calculate


A. The 75th percentile
B. The 25th percentile
C. The percentile rank of 80
D. The percentile rank of 92

B. The Z – Scores
The Z – score is the simple standard score which expresses test performance simply and directly
as the number of standard deviation units a raw score is above or below the mean. The Z-score is
computed by using the formula.

XX
Z – Score =
SD
Where
X = any raw score

X̄ = arithmetic mean of the raw scores
SD = standard deviation
When the raw score is smaller than the mean the Z –score results in a negative (-) value which can
cause a serious problem if not well noted in test interpretation. Hence Z-scores are transformed
into a standard score system that utilizes only positive values.
Example 1: For a student with a raw score of 30, calculate the Z-score if the mean is 40 and SD = 4.

 Z – score = (X − X̄) / SD = (30 − 40) / 4 = −2.5

 When we change this score into a standard score (T-score) it becomes:

 T – score = 50 + 10(Z) = 50 + 10(−2.5) = 50 − 25 = 25
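
Both conversions are mechanical and can be scripted; the Python sketch below reproduces Example 1
(the function names are illustrative):

def z_score(x, mean, sd):
    # Number of standard deviation units the raw score lies from the mean.
    return (x - mean) / sd

def t_score(z):
    # T-score places the Z-score on a scale with mean 50 and SD 10.
    return 50 + 10 * z

z = z_score(30, mean=40, sd=4)
print(z, t_score(z))   # -> -2.5 25.0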
6.2.4 Measures of Relationship

Measures of association provide a means of summarizing the size of the association between two
variables. Most measures of association are scaled so that they reach a maximum numerical value
of 1 when the two variables have a perfect relationship with each other. They are also scaled so
that they have a value of 0 when there is no relationship between two variables. While there are
exceptions to these rules, most measures of association are of this sort. Some measures of
association are constructed to have a range of only 0 to 1; other measures have a range from -1 to
+1. The latter provide a means of determining whether the two variables have a positive or negative
association with each other.

The two commonly used measures are:

 Correlation
 Chi-square

A. Correlation
 A correlation coefficient is used to measure the strength of the relationship between numeric
variables (e.g., weight and height)
 If the coefficient is between 0 and 1, as one variable increases, the other also increases. This is
called a positive correlation. For example, height and weight are positively correlated because
taller people usually weigh more.
 If the correlation coefficient is between -1 and 0, as one variable increases the other decreases.
This is called a negative correlation. For example, age and hours slept per night are negatively
correlated because older people usually sleep fewer hours per night.
 There are two common methods of computing correlation coefficient. These are:
 Pearson Product-Moment Correlation.
 Spearman Rank-Difference Correlation

Pearson Product-Moment Correlation:
This is the most widely used method and the coefficient is denoted by the symbol r. This method
is favoured when the number of scores is large, and it is also easier to apply to a large group. The
computation is easier with ungrouped test scores and is illustrated here. The computation
with grouped data appears more complicated and can be obtained from standard statistics textbooks.
The following steps listed below will serve as guide for computing a product-moment correlation
coefficient (r) from ungrouped data.
Step 1 - Begin by writing the pairs of score to be studied in two columns. Make certain that the
pair of scores for each examinee is in the same row. Call one Column X and the other Y
Step 2 - Square each of the entries in the X column and enter the result in the X2 column
Step 3 - Square each of the entries in the Y column and enter the result in the Y2 column
Step 4 - In each row, multiply the entry in the X column by the entry in the Y column, and enter the
result in the XY column
Step 5 - Add the entries in each column to find the sum (∑) of each column.

Step 6 - Apply the following formula:

r = [∑XY/N − (∑X/N)(∑Y/N)] / √{[∑X²/N − (∑X/N)²] × [∑Y²/N − (∑Y/N)²]}

OR r = (∑XY/N − MXMY) / (SDX·SDY)

where
MX = mean of scores in X column
MY = mean of scores in Y column
SDX = standard deviation of scores in X column
SDY = standard deviation of scores in Y column

Example 1: Table 6.8: Computing Pearson Product-Moment Correlation for a pair of the
Hypothetical Ungrouped Data
Student No   Maths(X)   Physics(Y)   X²     Y²     XY
1            98         76           9604   5776   7448
2            97         75           9409   5625   7275
3            95         72           9025   5184   6840
3 95 72 9025 5184 6840
4 94 70 8836 4900 6580
5 93 68 8649 4624 6324
6 91 66 8281 4356 6006
7 90 64 8100 4096 5760
8 89 60 7921 3600 5340
9 88 58 7744 3364 5104
10 87 57 7569 3249 4959
11 86 56 7396 3136 4816
12 84 54 7056 2916 4536
13 83 52 6889 2704 4316
14 81 50 6561 2500 4050
15 80 48 6400 2304 3840
16 79 46 6241 2116 3634
17 77 44 5929 1936 3388
18 76 45 5776 2025 3420
19 75 62 5625 3844 4650
20 74 77 5476 5929 5698
N=20   ∑X = 1717   ∑Y = 1200   ∑X² = 148487   ∑Y² = 74184   ∑XY = 103984

r = [∑XY/N − (∑X/N)(∑Y/N)] / √{[∑X²/N − (∑X/N)²] × [∑Y²/N − (∑Y/N)²]}

r = [103984/20 − (1717/20)(1200/20)] / √{[148487/20 − (1717/20)²] × [74184/20 − (1200/20)²]}

r = (5199.2 − 5151) / √(54.13 × 109.2)

r = 48.2 / 76.88 = 0.63
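
As a check on such hand computation, the raw-score formula translates directly into code. Below is a
minimal Python sketch using the paired scores of Table 6.8 (the function name is illustrative):

from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov   = sum(a * b for a, b in zip(x, y)) / n - mx * my   # ∑XY/N − MxMy
    var_x = sum(a * a for a in x) / n - mx ** 2              # ∑X²/N − Mx²
    var_y = sum(b * b for b in y) / n - my ** 2              # ∑Y²/N − My²
    return cov / sqrt(var_x * var_y)

maths   = [98, 97, 95, 94, 93, 91, 90, 89, 88, 87, 86, 84, 83, 81, 80, 79, 77, 76, 75, 74]
physics = [76, 75, 72, 70, 68, 66, 64, 60, 58, 57, 56, 54, 52, 50, 48, 46, 44, 45, 62, 77]
print(round(pearson_r(maths, physics), 2))   # -> 0.63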

Spearman Rank-Difference Correlation:
This method is satisfactory when the number of scores to be correlated is small (less than 30). It is
easier to compute with a small number of cases than the Pearson Product-Moment Correlation. It
is a simple practical technique for most classroom purposes. To use the Spearman Rank-Difference
Method, the following steps listed under should be taken.
Computing Procedure for the Spearman Rank-Difference Correlation
Step 1 -Arrange pairs of scores for each examinee in columns (Columns 1 and 2)
Step 2 - Rank examinees from 1 to N (number in group) for each set of scores
Step 3 - Find the difference (D) in ranks by subtracting the rank in the right-hand column from
the rank in the left-hand column
Step 4 - Square each difference in rank to obtain the difference squared (D²)
Step 5 - Sum the squared differences to obtain ∑D²

Step 6 - Apply the following formula:

ρ (rho) = 1 − 6∑D² / [N(N² − 1)]

Where
D = difference in rank and N = total number of students
Table 6.9: Computing Spearman Rank Difference Correlation for a Pair of Hypothetical
Data
Student Number   Maths Score   Physics Score   Maths Rank   Physics Rank   D (RM−RP)   D²

1 98 76 1 2 -1 1
2 97 75 2 3 -1 1
3 95 72 3 4 -1 1
4 94 70 4 5 -1 1
5 93 68 5 6 -1 1
6 91 66 6 7 -1 1
7 90 64 7 8 -1 1
8 89 60 8 10 -2 4
9 88 58 9 11 -2 4
10 87 57 10 12 -2 4
11 86 56 11 13 -2 4
12 84 54 12 14 -2 4
13 83 52 13 15 -2 4
14 81 50 14 16 -2 4
15 80 48 15 17 -2 4

16 79 46 16 18 -2 4
17 77 44 17 20 -3 9
18 76 45 18 19 -1 1
19 75 62 19 9 10 100
20 74 77 20 1 19 361

ρ (rho) = 1 − 6∑D² / [N(N² − 1)] = 1 − (6 × 514) / [20(400 − 1)] = 1 − 3084/7980 = 0.61
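
The rank-difference procedure can be scripted as well. The Python sketch below reproduces the
Table 6.9 result; note that this simple ranking helper assigns ranks 1 to N from highest to lowest and
does not handle tied scores:

def ranks(scores):
    # Rank 1 goes to the highest score, rank N to the lowest (no tie handling).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))  # ∑D²
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

maths   = [98, 97, 95, 94, 93, 91, 90, 89, 88, 87, 86, 84, 83, 81, 80, 79, 77, 76, 75, 74]
physics = [76, 75, 72, 70, 68, 66, 64, 60, 58, 57, 56, 54, 52, 50, 48, 46, 44, 45, 62, 77]
print(round(spearman_rho(maths, physics), 2))   # -> 0.61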

Table 6.10: Interpretation of r-values


Correlation(r) value Interpretation
+1.00 Perfect positive relationship
+0.80 – +0.99 Very high positive relationship
+0.60 – +0.79 High positive relationship
+0.40 – +0.59 Moderate positive relationship
+0.20 – +0.39 Low positive relationship
+0.01– +0.19 Negligible relationship
0.00 No relationship at all
-0.01 – -0.19 Negligible relationship
-0.20 – -0.39 Low negative relationship
-0.40 – -0.59 Moderate negative relationship
-0.60 – -0.79 High negative relationship
-0.80 – -0.99 Very high negative relationship
-1.00 Perfect negative relationship

Self-Check Exercises

Dear Trainees! You have now completed your study of chapter six. Therefore, you are expected
to answer the following self-test questions.

1. What is test interpretation and why is it necessary to interpret classroom test?


2. Highlight the major difference between criterion-reference interpretation and Norm-
referenced interpretation.
3. For a student with a raw score of 62, calculate the Z-score and the T-score if the mean is
92 and SD = 6.

UNIT SEVEN
RELIABILITY AND VALIDITY OF A TEST

INTRODUCTION

Dear trainee, in this unit you will learn about test reliability and the methods of estimating
reliability. Specifically, you will learn about test retest method, equivalent form method, split half
method and Kuder Richardson Method. Furthermore, you will learn about the factors influencing
reliability measures such as length of test, spread of scores, difficulty of test and objectivity. In
addition to this, in this unit, you will also learn about validity, which is the single most important
criteria for judging the adequacy of a measurement instrument. You will learn types of validity
namely content, criterion and construct validity. Finally, you will learn validity of criterion-
referenced mastery tests and factors influencing validity.

OBJECTIVES
By the time you finish this unit you will be able to:
 Define reliability of a test
 State the various forms of reliability
 Explain the factors that influence reliability measures
 Compare and contrast the different forms of estimating reliability
 Define validity as well as content, criterion and construct validity
 Differentiate between content, criterion and construct validity
 Describe how each of the three types of validity is determined
 Interpret different validity estimates
 Identify the different factors that affect validity
 Assess different test items based on the principle of validity

7.1 Test Reliability

Dear trainee, what do you think reliability means in testing and assessment? What are the types of
reliability? What are the major factors that affect reliability measures of a test?
____________________________________________________________________________________

_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________

Dear trainee, in the previous units you have grasped the concepts of test assembling, administration
and scoring, and how essential they are in course assessment. In this part you will gain good
insight into reliability, the types of reliability and the factors that affect reliability.

Reliability of a test may be defined as the degree to which a test is consistent, stable, dependable
or trustworthy in measuring what it is measuring. This definition implies that the reliability of a
test tries to answer questions like: How can we rely on the results from a test? How dependable
are scores from the test? How well are the items in the test consistent in measuring whatever it is
measuring? In general, reliability of a test seeks to find if the ability of a set of testees are
determined based on testing them two different times using the same test, or using two parallel
forms of the same test, or using scores on the same test marked by two different examiners, will
the relative standing of the testees on each of the pair of scores remain the same.

Reliability refers to the accuracy, consistency and stability with which a test measures whatever it
is measuring. The more the pair of scores observed for the same testee varies from each other, the
less reliable the measure is. The variation between this pair of scores is caused by numerous factors,
other than the characteristic being measured, that influence test scores.

Such extraneous factors introduce a certain amount of error into all test scores. Thus, methods of
determining reliability are essential means of determining how much error is present under different

conditions. Hence, the more consistent our test results are from one measurement to another, the
less error there will be and consequently, the greater the reliability.
7.1.1 Types of Reliability Measures
There are different types of reliability measures. These measures are estimated by different
methods. The chief methods of estimating reliability measures are illustrated in table 7.1.
Table 7.1: Methods of Estimating Reliability
Test-retest method (measure of stability): give the same test twice to the same group with any time
interval between tests.

Equivalent-forms method (measure of equivalence): give two forms of the test to the same group in
close succession.

Split-half method (measure of internal consistency): give the test once; score two equivalent halves,
say odd- and even-numbered items; correct the reliability coefficient to fit the whole test by the
Spearman-Brown formula.

Kuder-Richardson method (measure of internal consistency): give the test once; score the total test
and apply the Kuder-Richardson formula.

7.1.1.1 Test Retest Method - Measure Of Stability


Estimating reliability by means of test-retest method requires the same test to be administered
twice to the same group of learners with a given time interval between the two administrations.
The resulting test scores are correlated and the correlation coefficient provides a measure of

stability. How long the time interval should be between tests is determined largely by the use to
be made of the results. If the results of both administrations of the test are highly stable, the testees
whose scores are high on one administration of the test will tend to score high on other
administration of the test while the other testees will tend to stay in the same relative positions on
both administration of the test. Such stability would be indicated by a large correlation coefficient.

An important factor in interpreting measures of stability is the time interval between tests. A short
time interval such as a day or two inflates the consistency of the result since the testees will
remember some of their answers from the first test to the second. On the other hand, if the time
interval between tests is long about a year, the results will be influenced by the instability of the
testing procedure and by the actual changes in the learners over a period of time. Generally, the
longer the time interval between test and retest, the more the results will be influenced by changes
in the learners’ characteristics being measured and the smaller the reliability coefficient will be.

7.1.1.2 Equivalent Forms Method-Measure of Equivalence


To estimate reliability by means of equivalent or parallel form method involves the use of two
different but equivalent forms of the test. The two forms of the tests are administered to the same
group of learners in close succession and the resulting test scores are correlated. The resulting
correlation coefficient provides a measure of equivalence. That is, the correlation coefficient indicates
the degree to which both forms of the test measure the same aspects of behavior. This
method reflects the extent to which the test represents an adequate sample of the characteristics
being measured rather than the stability of the testee. It eliminates the problem of selecting a proper
time interval between tests, as in the test-retest method, but introduces the need for two equivalent forms of
the test. The need for equivalent forms of the test restricts its use almost entirely to standardized
testing where it is widely used.

7.1.1.3 Split-Half Method- Measure of Internal Consistency


This is a method of estimating the reliability of test scores by the means of single administration
of a single form of a test. The test is administered to a group of testees and is then divided into two
equivalent halves, usually the odd- and even-numbered items, for scoring purposes. The two
halves produce two scores for each testee which, when correlated, provide a measure of internal

consistency. The coefficient indicates the degree to which equivalent results are obtained from the
two halves of the test. The reliability of the full test is usually obtained by applying the Spearman-Brown
formula.

That is, reliability of full test = (2 × reliability of ½ test) / (1 + reliability of ½ test)
The split-half method, like the equivalent forms method indicates the extent to which the sample
of test items is a dependable sample of the content being measured. In this case, a high correlation
between scores on the two halves of a test denotes the equivalence of the two halves and
consequently the adequacy of the sampling. Also like the equivalent forms method, it tells nothing
about changes in the individual from one time to another.
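
The correction itself is a one-line computation. In the Python sketch below, the half-test correlation of
0.60 is only an illustrative value:

def spearman_brown(half_reliability):
    # Reliability of the full test from the correlation between its two halves.
    return 2 * half_reliability / (1 + half_reliability)

print(round(spearman_brown(0.60), 2))   # a half-test correlation of 0.60 -> 0.75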

7.1.1.4 Kuder-Richardson Method - Measure of Internal Consistency


This is a method of estimating the reliability of test scores from a single administration of a single
form of a test by means of formulas such as those developed by Kuder and Richardson. Like the
split-half method, these formulas provide a measure of internal consistency. Kuder-Richardson
formula 21, a less accurate but simpler formula to compute, can be applied to the results of any test
that has been scored on the basis of the number of correct answers. A modified version of the
formula is:

Reliability estimate (KR21) = [K/(K − 1)] × [1 − M(K − M)/(K·S²)]

Where:
K = the number of items in the test
M = the mean of the test scores
S² = the variance of the test scores

This method of reliability estimation tests whether the items in the test are homogeneous. In other
words, it seeks to know whether each test item measures the same quality or characteristic as
every other. If this is established, then the reliability estimate will be similar to that provided by

the split-half method. On the other hand, if the test lacks homogeneity an estimate smaller than
split-half reliability will result.
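
The KR-21 formula above is likewise straightforward to compute. In the Python sketch below, the
40-item test with a mean of 25 and a variance of 36 is a hypothetical example:

def kr21(k, mean, variance):
    # K = number of items, M = mean score, S² = variance of the scores.
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(round(kr21(40, 25, 36), 2))   # -> 0.76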
The Kuder-Richardson method and the Split-half method are widely used in determining reliability
because they are simple to apply. Nevertheless, the following limitations restrict their value. The
limitations are:
 They are not appropriate for speed tests, for which the test-retest or equivalent-forms methods
give better estimates.
 They, like the equivalent form method, do not indicate the constancy of a testee response
from day to day. It is only the test-retest procedures that indicate the extent to which test
results are generalizable over different periods of time.
 They are adequate for teacher-made tests because these are usually power tests.

Methods of Computing Correlation Coefficient


A correlation coefficient expresses the degree of relationship between two sets of scores by
numbers ranging from +1.00 to −1.00. A perfect positive correlation is indicated by a coefficient
of +1.00 and a perfect negative correlation by a coefficient of −1.00. A correlation coefficient of
0.00 lies midway between these extremes and indicates no relationship between the two sets of
scores. The larger the coefficient (positive or negative), the higher the degree of relationship
expressed. There are two common methods of computing correlation coefficient. These are:
 Spearman Rank-Difference Correlation
 Pearson Product-Moment Correlation
You will note that correlation indicates the degree of relationship between two scores but not the
causation of the relationship. Usually, further study is needed to determine the cause of any
particular relationship. The following are illustrations of the computation of correlation
coefficients using both methods.

I. Spearman Rank-Difference Correlation:


This method is satisfactory when the number of scores to be correlated is small, which is less than
30. It is easier to compute with a small number of cases than the Pearson Product-Moment
Correlation. It is a simple practical technique for most classroom purposes. To use the Spearman
Rank-Difference Method, the following steps listed under should be taken.

Computing Procedure for the Spearman Rank-Difference Correlation
Step 1. Arrange pairs of scores for each examinee in columns (Columns 1 and 2)
Step 2. Rank examinees from 1 to N (number in group) for each set of scores
Step 3. Find the difference (D) in ranks by subtracting the rank in the right-hand column from
the rank in the left-hand column
Step 4. Square each difference in rank to obtain the difference squared (D²)
Step 5. Sum the squared differences to obtain ∑D²

Step 6. Apply the following formula:

ρ (rho) = 1 − 6∑D² / [N(N² − 1)]

Where:
∑ = sum of
D = difference in rank
N = number of examinees

Table 7.2: Computing Spearman Rank Difference Correlation for a Pair of Hypothetical
Data
Student Number   Hydraulics Score   Pneumatics Score   Hydraulics Rank   Pneumatics Rank   D (RH−RP)   D²
1 98 76 1 2 -1 1
2 97 75 2 3 -1 1
3 95 72 3 4 -1 1
4 94 70 4 5 -1 1
5 93 68 5 6 -1 1
6 91 66 6 7 -1 1
7 90 64 7 8 -1 1
8 89 60 8 10 -2 4
9 88 58 9 11 -2 4
10 87 57 10 12 -2 4
11 86 56 11 13 -2 4
12 84 54 12 14 -2 4
13 83 52 13 15 -2 4
14 81 50 14 16 -2 4
15 80 48 15 17 -2 4
16 79 46 16 18 -2 4
17 77 44 17 20 -3 9
18 76 45 18 19 -1 1
19 75 62 19 9 10 100
20 74 77 20 1 19 361

ρ (rho) = 1 − 6∑D² / [N(N² − 1)] = 1 − (6 × 514) / [20(400 − 1)] = 1 − 3084/7980 = 0.61

II. Pearson Product-Moment Correlation:


This is the most widely used method and the coefficient is denoted by the symbol r. This method
is favored when the number of scores is large, and it is also easier to apply to a large group. The
computation is easier with ungrouped test scores and is illustrated here. The computation
with grouped data appears more complicated and can be obtained from standard statistics textbooks.

The following steps listed below will serve as guide for computing a product-moment correlation
coefficient (r) from ungrouped data.
Step 1. Begin by writing the pairs of score to be studied in two columns. Make certain that the pair
of scores for each examinee is in the same row. Call one Column X and the other Y
Step 2. Square each of the entries in the X column and enter the result in the X2 column
Step 3. Square each of the entries in the Y column and enter the result in the Y2 column
Step 4. In each row, multiply the entry in the X column by the entry in the Y column, and enter
the result in the XY column
Step 5. Add the entries in each column to find the sum of  each column.

Step 6. Apply the following formula:

r = [∑XY/N − (∑X/N)(∑Y/N)] / √{[∑X²/N − (∑X/N)²] × [∑Y²/N − (∑Y/N)²]}

OR r = (∑XY/N − MXMY) / (SDX·SDY)

Where:
MX = mean of scores in X column
MY = mean of scores in Y column
SDX = standard deviation of scores in X column
SDY = standard deviation of scores in Y column
Table 7.3: Computing Pearson Product-Moment Correlation for a pair of the Hypothetical
Ungrouped Data
Student Hydraulics(X) Pneumatics(Y) X2 Y2 XY
No
1 98 76 9604 5776 7448
2 97 75 9409 5625 7275
3 95 72 9025 5184 6840
4 94 70 8836 4900 6580
5 93 68 8649 4624 6324
6 91 66 8281 4356 6006
7 90 64 8100 4096 5760
8 89 60 7921 3600 5340
9 88 58 7744 3364 5104
10 87 57 7569 3249 4959
11 86 56 7396 3136 4816
12 84 54 7056 2916 4536
13 83 52 6889 2704 4316
14 81 50 6561 2500 4050
15 80 48 6400 2304 3840
16 79 46 6241 2116 3634
17 77 44 5929 1936 3388
18 76 45 5776 2025 3420
19 75 62 5625 3844 4650
20 74 77 5476 5929 5698
N=20   ∑X = 1717   ∑Y = 1200   ∑X² = 148487   ∑Y² = 74184   ∑XY = 103984

Computation shows that the correlation between the Hydraulics (X) and Pneumatics (Y) scores is
r = 0.63.

Interpretation of Correlation Coefficient (r) Values


Correlation coefficient indicates the degree of relationship between two sets of scores by numbers
ranging from +1.00 to −1.00. You may know that a perfect positive correlation is indicated by a coefficient
of +1.00 and a perfect negative correlation by a coefficient of −1.00. Thus, a correlation coefficient
of 0.00 lies midway between these extremes and indicates no relationship between the two sets of
scores. In addition to the direction of relationship which is indicated by a positive sign (+) for a
positive (direct relation) or a negative sign (-) for a negative (inverse relation), correlation also has
a size, a number which indicates the level or degree of relationship. The larger this number the
more closely or highly two set of scores relate. We have said that correlation coefficient or index
take on values between +1.00 and -1.00. That is, the size of relationship between two set of scores

is never more than +1.00 and never less than -1.00. The two values have the same degree or level
of relationship but while the first indicates a direct relation the second indicates an inverse relation.
A guide for interpreting correlation coefficient (r) values obtained by correlating any two sets of
test scores is presented in table 7.4 below:
Table 7.4: Interpretation of r-values
Correlation(r) Value Interpretation
+1.00 Perfect positive relationship
+0.80 – +0.99 Very high positive relationship
+0.60 – +0.79 High positive relationship
+0.40 – +0.59 Moderate positive relationship
+0.20 – +0.39 Low positive relationship
+0.01– +0.19 Negligible relationship
0.00 No relationship at all
-0.01 – -0.19 Negligible relationship
-0.20 – -0.39 Low negative relationship
-0.40 – -0.59 Moderate negative relationship
-0.60 – -0.79 High negative relationship
-0.80 – -0.99 Very high negative relationship
-1.00 Perfect negative relationship

Comparisons On Results Of Reliability Estimated By Different Methods


Different methods of estimating reliability of a test yield different values of reliability estimates
even for the same test. This is because of the differences in the way each of the procedures defines
measurement error. In other words, the size of the reliability coefficient is related to the method of
estimating reliability. This is clarified below:
 Test-retest method: Reliability coefficient based on test retest method is always lower
than reliability obtained from split-half method but higher than those obtained through
equivalent forms method. This is because test-retest method is affected by time to time
fluctuation.
 Equivalent forms method: This method has the least value of reliability which real
reliability could possibly take. Equivalent forms method is affected by both time to time
and form to form fluctuations.
 Spilt-half method: Estimation of reliability from the spilt-half method gives the largest
value which the real reliability can possibly take. This is because any reliability estimate
that is based on the result of a single testing will result in an overestimation of the reliability
index.

 Kuder-Richardson methods: This method like the spilt-half method is based on single
testing and so the reliability index is over estimation. Its value is however lower than the
value obtained for the spilt-half method.

It is clear from the above illustration that the size of the reliability coefficient resulting from the
method of estimating reliability is directly attributable to the type of consistency included in each
method. Thus, the more rigorous methods of estimating reliability yield smaller reliability
coefficients than the less rigorous methods. It is therefore essential that, when estimating the
reliability of a measurement instrument, the method used, the time lapse between repeated
administrations and the intervening experience be noted, as well as the assumptions and
limitations of the method used, for a clear understanding of the resulting reliability estimate.

7.1.2 Factors Influencing Reliability Measures


The reliability of classroom tests is affected by several factors. These factors can be controlled
through adequate care during test construction. Knowledge of these factors is therefore necessary
for classroom teachers, to enable them to control the factors and thus build more reliability into
norm-referenced classroom tests.

7.1.2.1 Length of Test


The reliability of a test is affected by its length. The longer a test is, the higher its
reliability will be. This is because a longer test provides a more adequate sample of the behaviour
being measured, and its scores are apt to be less distorted by chance factors such as guessing.

If the quality of the test items and the nature of the testees can be assumed to remain the same,
then the relationship of reliability to length can be expressed by the simple formula:

rnn = n·rii / [1 + (n − 1)·rii]

Where:
rnn = the reliability of a test n times as long as the original test
rii = the reliability of the original test
n = the factor by which the length of the test is increased

Increasing the length of a test makes the test scores depend more closely upon the characteristics
of the person being measured, so a more accurate appraisal of the person is obtained. However,
lengthening a test is limited by a number of practical considerations: the amount of time available
for testing, fatigue and boredom on the part of the testees, and the inability of classroom teachers
to construct more equally good test items. Nevertheless, reliability can be increased as needed by
lengthening the test within these limits.

7.1.2.2 Spread of Scores


The reliability coefficients of a test are directly influenced by the spread of scores in the group
tested. The larger the spread of scores is, the higher the estimate of reliability will be if all other
factors are kept constant. Larger reliability coefficients result when individuals tend to stay in same
relative position in a group from one testing to another. It therefore follows that anything that
reduces the possibility of shifting positions in the group also contributes to larger reliability
coefficient. This means that greater differences between the scores of individuals reduce the
possibility of shifting positions. Hence, errors of measurement have less influence on the relative
position of individuals when the differences among group members are large, that is, when there is
a wide spread of scores.

7.1.2.3 Difficulty of Test


When a norm-referenced test is too easy or too difficult for the group taking it, it tends
to produce scores of low reliability. This is because both easy and difficult tests result in a
restricted spread of scores. In the case of an easy test, the scores are clustered together at the top of the
scale, while for a difficult test the scores are grouped together at the bottom end of the scale.
Thus for both easy and difficult tests, the differences among individuals are small and tend to be
unreliable. Therefore, a norm-referenced test of ideal difficulty is desired to enable the scores to
spread out over the full range of the scale. This implies that classroom achievement tests are to be
designed to measure differences among testees. This can be achieved by constructing test items
with at least average scores of 50 percent and with the scores ranging from zero to near perfect
scores. Constructing tests that match this level of difficulty permits the full range of possible scores

to be used in measuring differences among individuals. This is because the bigger the spread of
scores, the greater the likelihood of its measured differences to be reliable.

7.1.2.4 Objectivity
This refers to the degree to which equally competent scorers obtain the same results in scoring a
test. Objective tests easily lend themselves to objectivity because they are usually constructed so
that they can be accurately scored by trained individuals and by the use of machines. For such test
constructed using highly objective procedures, the reliability of the test results is not affected by
the scoring procedures. Therefore, the teacher made classroom test calls for objectivity. This is
necessary in obtaining reliable measure of achievement. This is more obvious in essay testing and
various observational procedures where the results of testing depend to a large extent on the person
doing the scoring. Sometimes even the same scorer may get different results at different times.
This inconsistency in scoring has an adverse effect on the reliability of the measures obtained. The
resulting test scores reflect the opinions and biases of the scorer as well as the differences among testees
in the characteristics being measured.

Objectivity can be controlled by ensuring that evaluation procedures selected for the evaluation of
behaviour required in a test is both appropriate and as objective as possible. In the case of essay
test, objectivity can be increased by careful framing of the questions and by establishing a standard
set of rules for scoring. Objectivity increased in this manner will increase reliability without
undermining validity.

7.2 Validity of a Test

Dear trainee, what is validity of a test? What are the major factors that affect validity? Write your answer on the space provided.
____________________________________________________________________________________

_____________________________________________________________________________
_____________________________________________________________

Dear trainee, in the previous subtopics you have seen what reliability is all about. In the next
topic and subtopics you are going to gain good insights into validity, the types of validity and the
factors that affect validity.

Validity is the most important quality you have to consider when constructing or selecting a test.
It refers to the meaningfulness or appropriateness of the interpretations to be made from test scores
and other evaluation results; that is, it is the degree to which a test measures what it is intended
to measure. Validity is always concerned with the specific use of the results and the soundness of
our proposed interpretations. Hence, to the extent that a test score is determined by factors or
abilities other than those the test was designed or used to measure, its validity is impaired.

7.2.1 Types of Validity


The concerns of validity are basically three:
 Determining the extent to which performance on a test represents the level of knowledge of
the subject-matter content the test was designed to measure – the content validity of the test.
 Determining the extent to which performance on a test represents the amount of the attribute
being measured that the examinee possesses – the construct validity of the test.
 Determining the extent to which performance on a test represents an examinee's probable
performance on some other valued task – the criterion (concurrent and predictive) validity of the test.

These concerns of validity are related, and in each case the determination is based on knowledge of
the interrelationship between scores on the test and performance on another task or test that
accurately represents the actual behavior. The three approaches to test validation are briefly
discussed in Table 7.5 below.

Table 7.5: Approaches to Test Validation

Content-Related Evidence
Procedure: Compare the test tasks to the test specifications describing the task domain under
consideration.
Meaning: How well the sample of test tasks represents the domain of tasks to be measured.

Criterion-Related Evidence
Procedure: Compare test scores with another measure of performance obtained at a later date (for
prediction) or with another measure of performance obtained concurrently (for estimating present
status).
Meaning: How well test performance predicts future performance or estimates current performance on
some valued measure other than the test itself (a criterion).

Construct-Related Evidence
Procedure: Establish the meaning of the scores on the test by controlling (or examining) the
development of the test and experimentally determining what factors influence test performance.
Meaning: How well test performance can be interpreted as a measure of some characteristic or quality.

Content Validity:
When we want to find out whether the entire content of the behavior or area is represented in the
test, we compare the test tasks with the content of the behavior domain.
Criterion Validity:
When you expect a future performance based on the scores currently obtained by the measure,
correlate the scores obtained with the later performance. The later performance is called the
criterion and the current score is the predictor. This is an empirical check on the value of the test
– a criterion-oriented or predictive validation.
Construct Validity:
Construct validity is the degree to which a test measures an intended hypothetical construct.
Psychologists often assess abstract attributes, or constructs. The process of validating the
interpretations about such a construct as indicated by the test score is construct validation. This
can be done experimentally. For example, if we want to validate a measure of anxiety and we have the
hypothesis that anxiety increases when subjects are under the threat of an electric shock, then the
threat of an electric shock should increase anxiety scores.

7.2.2 Factors Influencing Validity


Many factors tend to influence the validity of test interpretations, including factors in the test
itself. The following factors in the test itself can prevent the test items from functioning as
intended and thereby lower the validity of the interpretations made from the test scores:
 Unclear directions on how examinees should respond to test items
 Reading vocabulary and sentence structure that are too difficult
 Inappropriate level of difficulty of the test items
 Poorly structured test items
 Ambiguity leading to misinterpretation of test items
 Test items inappropriate for the outcomes being measured
 A test too short to provide an adequate sample
 Improper arrangement of items in the test; and
 An identifiable pattern of answers that encourages guessing.
Thus, in order to ensure validity, a conscious effort should be made during the construction,
selection and use of tests and other evaluation instruments to control those factors that have an
adverse effect on validity and on the interpretation of results.

SUMMARY
Dear trainee, in this unit you have discussed the concepts of reliability and validity. You have
identified the types of reliability measures and the methods of estimating them. Specifically, you
discussed the measure of equivalence, the measure of stability and the measure of internal
consistency. In addition, you have learned about the test-retest method of estimating reliability,
the equivalent-forms method and the Kuder-Richardson method. Finally, you have learned about the
factors that influence reliability measures: length of test, spread of scores, difficulty of test
and objectivity.

The second concept you have seen in this chapter was test validity. You have learned about content
validity, criterion validity and construct validity. In addition, you have gone through the validity
of criterion-referenced mastery tests and identified the factors that influence validity. Validity
is the degree to which a test measures what it is intended to measure, and there are three types.
Content validity is the process of determining the extent to which a set of test tasks provides a
relevant and representative sample of the domain of tasks under consideration. Criterion validity is
the process of determining the extent to which test performance is related to some other valued
measure of performance.

Construct validity is the process of determining the extent to which test performance can be
interpreted in terms of one or more psychological constructs. A construct is a psychological quality
that is assumed to exist in order to explain some aspect of behavior. Criterion-referenced mastery
tests are not designed to discriminate among individuals; therefore, statistical validation
procedures play a less prominent role for them.

On the other hand, many factors may influence the validity of a test. These include factors in the
test itself, factors in the test administration, factors in the examinees' responses, the functioning
content and teaching procedures, as well as the nature of the group and of the criterion.

SELF ASSESSMENT EXERCISE


PART I. READ THE FOLLOWING QUESTIONS AND GIVE SHORT AND PRECISE
ANSWERS.
1. Define the reliability of a test.
2. Mention methods of estimating reliability and the type of reliability measure associated
with each of them.
3. What are the factors that influence reliability measures?
4. Define the following terms:
i. Content Validity
ii. Criterion related Validity
iii. Construct Validity
5. What are the three main concerns of validity of a test?

6. What are the factors that affect validity?
7. Explain the relationship between reliability and validity?

PART II. READ EACH OF THE FOLLOWING QUESTIONS AND CHOOSE THE MOST
APPROPRIATE ANSWER FROM THE GIVEN ALTERNATIVES.
1. A form of correlation in which both variables increase by equal magnitudes is
called______?
A. Perfect positive relation
B. Perfect negative relation
C. No relation at all
D. Very high positive relation
2. Which of the following is not true about correlation?
A. It measures association
B. It is scaled between -1 to +1
C. It shows the relation of two variables
D. It estimates the average value of two variables
3. Which one is not true about negative correlation?
A. As one variable increases, the other will decrease
B. The coefficient lies between-1 and less than 0
C. Both variables show increment in the same direction
D. Each variable may increase or decrease in the opposite direction
4. What is the meaning of correlation coefficient r=0.00?
A. Perfect positive relation
B. There is no relation at all
C. There is negligible relation
D. Perfect negative relation
5. Which reliability measure uses a single exam two times for the same group?
A. Equivalence
B. Stability
C. Internal consistency
D. Split half
6. In the split-half method of estimating reliability, the correlation between the even- and odd-numbered items is 0.64. What
will be the reliability of the full-length test?
A. 0.74 B. 0.87 C. 0.64 D. 0.78

7. Which one of the following does not influence the reliability measures?

A. Length of the test


B. Too easy or too difficult questions
C. The variability of score distribution
D. The nature of the subject matter

8. Which type of validity focuses on a standard?

A. Content validity
B. Criterion validity
C. Construct validity
D. None of the above

9. Which type of validity is guaranteed if the test preparation represents the entire content of the course?

A. Content validity

B. Criterion validity

C. Construct validity

D. Face validity

10. Which one of the following does not influence the reliability measures?

A. Length of the test

B. Too easy or too difficult questions

C. The variability of score distribution

D. The nature of the subject matter

11. The method of computing the degree of relationship between two sets of scores is called__________?

A. Correlation B. Mode C. Variance D. Average

12. Which type of correlation shows that as one variable increases, the other decreases?

A. Positive correlation

B. Negative correlation

C. Zero correlation

D. None of the above

13. The correlation coefficient of r = 0.99 can be interpreted as:

A. Perfect positive relationship

B. Very high positive relationship

C. Low positive relationship

D. Perfect negative relationship

14. Measurement reliability refers to the:

A. Comprehensiveness of the scores

B. Consistency of the scores

C. Dependency of the scores

D. Accuracy of the scores

15. If a measure is consistent over multiple occasions, it has


A. Construct validity
B. Inter-rater reliability
C. Test-retest reliability
D. Internal validity
16. The validity of a measure refers to the:
A. Consistency of the measurement.
B. Particular type of construct specification
C. Accuracy with which it measures the construct.
D. Comprehensiveness with which it measures the construct
17. A measure has high internal consistency reliability when:
A. Multiple observers obtain the same score every time they use the measure
B. Multiple observers make the same ratings using the measure
C. Participants score at the high end of the scale every time they complete the measure
D. Each of the items correlates with other items on the measure

UNIT EIGHT
JUDGING THE QUALITY OF A CLASSROOM TEST

INTRODUCTION

Dear trainee, in this unit you will learn how to judge the quality of a classroom test and its
item types. Specifically, you will learn about item analysis, its purpose and uses. Furthermore, you
will learn the process of item analysis for classroom tests and the computations involved. In
addition, you will learn about the item analysis of criterion-referenced mastery items. Finally, you
will learn about building a test item file, or item bank.

OBJECTIVES
By the end of this unit you will be able to:
 Distinguish clearly between item difficulty, item discrimination and the distraction
power of an option
 Recognize the need for item analysis, its place and importance in test development
 Conduct item analysis of a classroom test
 Calculate the value of each item parameter for different types of items
 Appraise an item based on the results of item analysis

8.1 Judging The Quality Of A Classroom Test

What are the steps of item analysis? What are the major purposes of item analysis? How could
you compute item difficulty index and discrimination index of a test? How can you evaluate the
effectiveness of distracters? How can you build a test item file or item bank?
____________________________________________________________________________________

____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________

The administration and scoring of a classroom test is closely followed by an appraisal of the results
of the test. This is done to obtain evidence concerning the quality of the test that was used, such
as identifying defective items. It also helps the teacher to better appreciate the careful planning
and hard work that went into the preparation of the test. Moreover, the identified effective test
items are used to build up a file of high-quality items, usually called a question bank or item bank,
for future use.

8.1.1 Item Analysis


Item analysis is the process of examining the testees' responses to each item to ascertain whether
the item is functioning properly in measuring what the entire test is measuring. As already
mentioned, item analysis begins after the test has been administered and scored. It involves a
detailed and systematic examination of the responses to each item to determine the difficulty level
and discriminating power of the item.

It also includes determining the effectiveness of each option. The decision on the quality of an
item depends on the purpose for which the test is designed. However, for an item to effectively
measure what the entire test is measuring and provide valid and useful information, it should be
neither too easy nor too difficult, and its options should discriminate validly between high- and
low-performing learners in the class.

8.1.2 Purpose and Uses of Item Analysis


Item analysis is usually designed to help determine whether an item functions as intended:
discriminating between high and low achievers in a norm-referenced test, or measuring the effects of
instruction in a criterion-referenced test. It is also a means of identifying items that have the
desirable qualities of a measuring instrument, items that need revision for future use, and even
deficiencies in the teaching-learning process. In addition, item analysis provides data on which to
base discussion of the test results, remediation of learning deficiencies and subsequent improvement
of classroom instruction. Moreover, item analysis procedures provide a basis for increased skill in
test construction.

8.2 The Process of Item Analysis for Norm Referenced Classroom Test
The method for analyzing the effectiveness of test items differs for norm-referenced and
criterion-referenced test items, because they serve different functions. In a norm-referenced test,
special emphasis is placed on item difficulty and item discriminating power. The process of item
analysis begins after the test has been administered or trial tested, scored and recorded. For most
norm-referenced classroom tests, a simplified form of item analysis is used.

The process of item analysis is carried out using two contrasting test groups composed of the upper
and lower 25% of the testees to whom the items were administered or trial tested. For a normal
distribution, the upper and lower 25% is the optimum point at which a balance is obtained between
the sensitivity of the groups in making adequate differentiation and the reliability of the results.
The upper and lower 25%, when used, give a better estimate of the actual discrimination value: they
are significantly different, whereas the middle values do not discriminate sufficiently. In order to
form the groups, the graded test papers are arranged from the highest score to the lowest score in
descending order. The best 25% are picked from the top and the poorest 25% from the bottom, while
the middle test papers are set aside.

To illustrate the method of item analysis, consider a class of 40 learners taking a 10-item test
that has been administered and scored, using 25% test groups. The item analysis procedure follows
these basic steps (a short code sketch of the first steps is given after the list).
Step 1. Arrange the 40 test papers by ranking them in order from the highest to the lowest score.
Step 2. Select the best 10 papers (upper 25% of 40 testees) with the highest total scores and the
least 10 papers (lower 25% of 40 testees) with the lowest total scores.
Step 3. Drop the middle 20 papers (the remaining 50% of the 40 testees) because they will no
longer be needed in the analysis.
Step 4. Draw a table as shown in table 8.1 in readiness for the tallying of responses for item
analysis.
Step 5. For each of the 10 test items, tabulate the number of testees in the upper and lower groups
who got the answer right or who selected each alternative (for multiple choice items).
Step 6. Compute the difficulty of each item (percentage of testees who got the item right).

Step 7. Compute the discriminating power of each item (difference between the number of testees
in the upper and lower groups who got the item right).
Step 8. Evaluate the effectiveness of the distracters in each item (attractiveness of the incorrect
alternatives) for multiple choice test items.
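
As a minimal Python sketch of Steps 1 to 3, assuming the scored papers are available as a list, the
selection of the two contrasting groups might look like the following. The names Paper and
select_groups are illustrative assumptions, not part of the module.

from dataclasses import dataclass

@dataclass
class Paper:
    testee_id: str
    total_score: int

def select_groups(papers, fraction=0.25):
    # Step 1: arrange the papers from the highest to the lowest score.
    ranked = sorted(papers, key=lambda p: p.total_score, reverse=True)
    # Step 2: take the best and the poorest 25% of the papers.
    n = round(len(ranked) * fraction)   # 25% of 40 papers -> 10
    upper = ranked[:n]
    lower = ranked[-n:]
    # Step 3: the middle papers are simply not returned.
    return upper, lower

With 40 Paper objects, select_groups returns the 10 best and the 10 poorest papers, and the middle
20 are dropped, matching Steps 1 to 3 above.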

Table 8.1: Format for Tallying Responses of Examinees for Item Analysis
(* marks the correct option; the distracter index is listed per option A to E)

Item  Testees     A    B    C    D    E   Omit  Total  P-Value  D-Value  Distracter Index (A, B, C, D, E)
1     Upper 25%   0    10*  0    0    0   0     10     0.70     0.60     0.20, *, 0.10, 0.30, 0.00
      Lower 25%   2    4    1    3    0   0     10
2     Upper 25%   1    1    0    7*   1   0     10     0.55     0.30     0.00, 0.10, 0.10, *, 0.00
      Lower 25%   1    2    1    4    1   1     10
3     Upper 25%   3    0    1    2    4*  0     10     0.45     -0.10    -0.20, 0.00, 0.00, 0.10, *
      Lower 25%   1    0    1    3    5   0     10
4     Upper 25%   0    0    10*  0    0   0     10     1.00     0.00     0.00, 0.00, *, 0.00, 0.00
      Lower 25%   0    0    10   0    0   0     10
5     Upper 25%   2    3    3    1    1*  0     10     0.10     0.00     0.10, 0.00, -0.20, 0.10, *
      Lower 25%   3    3    1    2    1   0     10
...
25    Upper 25%   6*   1    1    1    0   1     10     0.45     0.30     *, 0.10, 0.10, 0.10, 0.10
      Lower 25%   3    2    2    2    1   0     10

8.2.1 Computing Item Difficulty


The difficulty index P for each item is obtained by using the formula:

Item Difficulty (P) = Number of testees who got the item right (T) / Total number of testees
responding to the item (N), i.e. P = T/N

Thus for item 1 in Table 8.1, P = 14/20 = 0.70.

The item difficulty indicates the proportion of testees who got the item right in the two groups
used for the analysis, that is, 0.70 x 100% = 70%.
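
As a minimal sketch of this computation in code, using the item 1 figures from Table 8.1 (the
function name item_difficulty is an illustrative assumption):

def item_difficulty(num_correct, num_testees):
    # P = T / N
    return num_correct / num_testees

p = item_difficulty(10 + 4, 20)   # item 1: 10 upper + 4 lower correct, 20 testees in all
print(round(p, 2))                # 0.7, i.e. 70% of the analysis groups got the item right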

8.2.2 Computing Item Discriminating Power (D)


Item discriminating power is an index that indicates how well an item distinguishes between high
achievers and low achievers on what the test is measuring. That is, it refers to the degree to which
the item discriminates between testees with high and low achievement. It is obtained from the
formula:

D = (H - L) / n

Where:
D = item discriminating power
H = number of high scorers who got the item right
L = number of low scorers who got the item right
n = total number of examinees in the upper or lower group

Hence for item 1 in Table 8.1, the item discriminating power D is obtained thus:

D = (H - L) / n = (10 - 4) / 10 = 6/10 = 0.60

A short code sketch of this computation follows the interpretation list below.
Item discrimination values range from -1.00 to +1.00. The higher the discriminating index, the
better the item differentiates between high and low achievers. The item discriminating power takes a:
 Positive value when a larger proportion of the high-scoring group than of the low-scoring
group gets the item right;
 Negative value when more testees in the lower group than in the upper group get the item
right;
 Zero (0) value when an equal number of testees in both groups get the item right; and
 One (1.00) when all testees in the upper group get the item right and all testees in the
lower group get the item wrong.
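
A minimal sketch of the discrimination computation, again with the item 1 figures from Table 8.1
(the function name discriminating_power is an illustrative assumption):

def discriminating_power(high_correct, low_correct, group_size):
    # D = (H - L) / n
    return (high_correct - low_correct) / group_size

d = discriminating_power(10, 4, 10)   # item 1 of Table 8.1
print(d)                              # 0.6 -- positive: favours the upper group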

8.2.3 Evaluating the Effectiveness of Distracters


The distraction power of a distracter is its ability to differentiate between those who do not know
and those who know what the item is measuring. That is, a good distracter attracts more testees from
the lower group than from the upper group. The distraction power, or effectiveness, of each
distracter (incorrect option) of an item is obtained using the formula:

Do = (L - H) / n

Where:
Do = option distraction power
H = number of high scorers who marked the option
L = number of low scorers who marked the option
n = total number of examinees in the upper or lower group

For item 1 of Table 8.1 the effectiveness of the distracters is:

For option A: Do = (L - H) / n = (2 - 0) / 10 = 0.20
For option C: Do = (L - H) / n = (1 - 0) / 10 = 0.10
For option D: Do = (L - H) / n = (3 - 0) / 10 = 0.30

Incorrect options with positive distraction power are good distracters, while those with negative
distraction power must be changed or revised, and those with zero distraction power should be
improved, because they have failed to distract the low achievers.
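
A minimal sketch computing the distraction power of every incorrect option of item 1 (correct
option B); the tally dictionaries mirror Table 8.1 and the function name is an illustrative
assumption:

def distracter_power(low_marks, high_marks, group_size):
    # Do = (L - H) / n
    return (low_marks - high_marks) / group_size

# Marks on the incorrect options of item 1 by each analysis group.
upper = {"A": 0, "C": 0, "D": 0, "E": 0}
lower = {"A": 2, "C": 1, "D": 3, "E": 0}

for option in upper:
    do = distracter_power(lower[option], upper[option], 10)
    print(option, do)   # A 0.2, C 0.1, D 0.3, E 0.0 (E should be improved)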

8.3 Item Analysis and Criterion Referenced Mastery Tests


The item analysis procedures we used earlier for norm-referenced tests are not directly applicable
to criterion-referenced mastery tests. In this case, indexes of item difficulty and item
discriminating power are less meaningful, because criterion-referenced tests are designed to
describe learners in terms of the types of learning tasks they can perform, unlike norm-referenced
tests, where a reliable ranking of testees is desired.

8.3.1 Item Difficulty


In criterion-referenced mastery tests, the desired level of difficulty of each test item is
determined by the learning outcome it is designed to measure, not, as stated earlier, by the item's
ability to discriminate between high and low achievers. The standard formula for determining item
difficulty can still be applied here, but the results are not usually used to select test items or
to manipulate item difficulty; rather, they are used for diagnostic purposes. Also, when instruction
is effective, most items will have a large difficulty index, since a large percentage of the testees
pass them.

8.3.2 Item Discriminating Power


As you know, the ability of test items to discriminate between high and low achievers is not crucial
to evaluating the effectiveness of criterion-referenced tests. This is because some of the best items
might have low or zero indexes of discrimination. This usually occurs when all testees answer a test
item correctly at the end of the teaching-learning process, implying that both the teaching-learning
process and the item are effective. Such items provide useful information concerning the testees'
mastery, unlike in a norm-referenced test, where they would be eliminated for failing to discriminate
between the high and the low achievers. Therefore, the traditional indexes of discriminating power
are of little value for judging the quality of criterion-referenced test items, since the purpose and
emphasis of a criterion-referenced test is to describe what learners can do rather than to
discriminate among them.

8.3.3 Analysis of Criterion Referenced Mastery Items
Ideally, a criterion-referenced mastery test is analyzed to determine the extent to which the test
items measure the effects of instruction. In order to provide such evidence, the same test items are
given before instruction (pretest) and after instruction (posttest), and the results of the two
administrations are compared. The analysis is done by the use of an item response chart. The chart
is prepared by listing the numbers of the items across the top, listing the testees' names or
identification numbers down the side, and recording the correct (+) and incorrect (-) responses of
each testee on the pretest (B) and the posttest (A). This is illustrated in Table 8.2 for an
arbitrary group of 10 testees.

Table 8.2: An Item Response Chart Showing Correct (+) and Incorrect (-) Responses for the Pretest (B)
and Posttest (A), Given Before and After Instruction (the Teaching-Learning Process) Respectively

                     Testee Identification Number
Item                 001  002  003  004  005  ...  010   Remark
1    Pretest (B)      -    -    -    -    -   ...   -    Ideal
     Posttest (A)     +    +    +    +    +   ...   +
2    Pretest (B)      +    +    +    +    +   ...   +    Too easy
     Posttest (A)     +    +    +    +    +   ...   +
3    Pretest (B)      -    -    -    -    -   ...   -    Too difficult
     Posttest (A)     -    -    -    -    -   ...   -
4    Pretest (B)      +    +    +    +    +   ...   +    Defective
     Posttest (A)     -    -    -    -    -   ...   -
5    Pretest (B)      -    +    -    -    +   ...   -    Effective
     Posttest (A)     +    +    +    +    +   ...   -

An index of item effectiveness for each item is obtained by using the formula for a measure of
Sensitivity to Instructional Effects (S), given by:

S = (RA - RB) / T

Where:
RA = number of testees who got the item right after the teaching-learning process
RB = number of testees who got the item right before the teaching-learning process
T = total number of testees who tried the item both times

For example, for item 1 of Table 8.2, the index of sensitivity to instructional effect (S) is:

S = (RA - RB) / T = (10 - 0) / 10 = 1.00

A short code sketch of this computation follows the interpretation list below.

Usually, for a criterion-referenced mastery test, with respect to the index of sensitivity to
instructional effects:
 An ideal item yields a value of 1.00;
 Effective items fall between 0.00 and 1.00: the higher the positive value, the more sensitive
the item to instructional effects; and
 Items with zero or negative values do not reflect the intended effects of instruction.
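
A minimal sketch of the sensitivity computation for item 1 of Table 8.2 (the function name
sensitivity is an illustrative assumption):

def sensitivity(right_after, right_before, total_testees):
    # S = (RA - RB) / T
    return (right_after - right_before) / total_testees

s = sensitivity(10, 0, 10)   # item 1: none right before, all 10 right after
print(s)                     # 1.0 -- the ideal value for a mastery item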

8.4 Building A Test Item File (Item Bank)


This entails the gradual collection and compilation of items that have been administered, analyzed
and selected on the basis of the effectiveness and psychometric characteristics identified through
item analysis over time. Such a file of effective items can be built and maintained easily by
recording the items on item cards and adding the item analysis information, indicating both the
objective and the content area each item measures. Because the file is organized by both content and
objective categories, it becomes possible to select items in accordance with any table of
specifications in the particular area covered by the file.
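
If the item file is kept electronically rather than on cards, one record might be sketched as below.
The field names, the function select_items and the 0.30 cut-off for D are illustrative assumptions,
not a format prescribed by the module.

from dataclasses import dataclass

@dataclass
class BankedItem:
    item_text: str
    objective: str      # instructional objective the item measures
    content_area: str   # content category from the table of specifications
    p_value: float      # difficulty index from item analysis
    d_value: float      # discrimination index from item analysis

def select_items(bank, content_area, min_d=0.30):
    # Pick effective items for one cell of a table of specifications.
    return [item for item in bank
            if item.content_area == content_area and item.d_value >= min_d]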

Building an item file is a gradual process that progresses over time. At first it seems to be
additional work without immediate usefulness, but its usefulness becomes obvious once it is possible
to start using some of the items in the file and supplementing them with newly constructed ones. As
the file grows into an item bank, most of the items can be selected from the bank without frequent
repetition. Some of the advantages of an item bank are that:
 Parallel tests can be generated from the bank, which allows learners who were ill or
otherwise unavoidably absent for a test to take it later;
 It is cost effective, since new questions do not have to be generated at the same rate
from year to year;
 The quality of the items gradually improves as existing ones are modified over time;
and
 The burden of test preparation is considerably lightened once enough high-quality items
have been assembled in the item bank.

SUMMARY
In this unit you have learned how to judge the quality of a classroom test and its item types.
In particular, you have seen the item analysis procedure and how to work out the mathematical
indexes of an item. You have also discussed item analysis, its purpose and uses. Besides, you have
gone through the process of item analysis for classroom tests and the computations involved, using
practical examples. You have examined the two forms of item analysis procedures, for
criterion-referenced mastery items and for norm-referenced tests. Finally, you have come to realize
how teachers can build a test item file, or item bank, and how they can use items from the bank for
different purposes.

SELF ASSESSMENT EXERCISE

PART I. READ THE FOLLOWING QUESTIONS AND GIVE SHORT AND PRECISE
ANSWERS.
1. Explain the meaning of item analysis?
2. List and explain the purposes of item analysis?
3. Show norm referenced item analysis procedure?
4. When do teachers use norm reference and criterion referenced item analysis procedure?
5. How can you compute item difficulty and discrimination?
6. How do you evaluate the effectiveness of a distracter in item analysis?
7. Explain the purposes of building item bank?
8. Explain the relationship between reliability and validity?
9. What is the difference between index of discriminating power (D) and index of sensitivity
to instructional effects (S)?
10. Do you think that item analysis could help teachers to improve their skill of classroom test
preparation? Why?

PART II. BASED ON THE FOLLOWING DATA FOR 54 EXAMINEES, DETERMINE THE DIFFICULTY
LEVEL (P), THE DISCRIMINATION LEVEL (D) AND THE EFFECTIVENESS OF EACH DISTRACTER
OR OPTION (Do) OF THE FOLLOWING SIX QUESTIONS/ITEMS, AND INTERPRET THE
RESULTS.

Item Number  Group         A    B    C    D    Omit/Skip
1            Upper Group   8*   5    7    7    0
             Lower Group   2*   9    8    8    0
2            Upper Group   13   2    10*  2    0
             Lower Group   5    1    9*   12   0
3            Upper Group   2    20*  3    2    0
             Lower Group   11   2*   8    6    0
4            Upper Group   0    4    3    20*  0
             Lower Group   0    7    12   8*   0
5            Upper Group   27*  0    0    0    0
             Lower Group   27*  0    0    0    0
6            Upper Group   8    6    8*   5    0
             Lower Group   4    3    18   2    0

1. Compute the item difficulty of each of the above six items.
2. Compute the discriminating power of each of the above six items.
3. Describe the effectiveness of the distracters of each of the above six items.

Assignment (30%)

Take any course taught in English at a TVET college or above and prepare a classroom achievement
test based on the subject matter of your choice.

1. Planning the Classroom Test:

1.1. State Specific Instructional Objectives.

1.2. Construct Table of Specification.

2. Preparing Test Items

2.1. Prepare 4 True/False Items.

2.2. Prepare 5 Matching Items.

2.3. Prepare 3 Completion/short answer Items.

2.4. Prepare 5 Multiple choice Items

2.5. Prepare 3 Essay Items.

2.5.1. Two of them should be restricted type

2.5.2. One of them should be extended type

3. What elements are included in the table of specification? What are the uses of table

of specification?

4. What do you think is the reason that multiple choice items are used most frequently

at all levels of education and courses/subjects?

5. Explain the difference between validity and reliability?

6. Computing Spearman Rank Difference Correlation and Pearson product moment

correlation for the following Pair of Hypothetical Data. How do you interpret

the results?

Student Number   Height (m)   Weight (kg)
1                1.50         55
2                1.80         65
3                1.68         72
4                1.78         70
5                1.64         68
6                1.90         80
7                1.75         64
8                1.62         60
9                1.85         76
10               1.87         78

7. Calculate the Mean, Median, Mode, Range, Variance and Standard Deviation of

the following data.

Score (x)       10   5   8   2   7
Frequency (f)   3    2   5   6   4

8. Based on the following data, determine P value, D value and the effectiveness

of distracters?

Item     Group         A    B    C    D    Omit
Item 1   Upper Group   8*   5    7    7    0
         Lower Group   2*   9    8    8    0
Item 2   Upper Group   0    4    3    20*  0
         Lower Group   0    7    12   8*   0
Item 3   Upper Group   2    20*  3    2    0
         Lower Group   11   2*   8    6    0
