PURPOSE OF THE COURSE
To help the teacher trainee gain proper skills in constructing tests and interpreting the test
result for high quality teaching
COURSE CONTENT
Test construction; importance of validity and reliability of tests, different ways of ascertaining
validity and reliability of tests. Construction of test items; guidelines in the making of
valid test items, testing for validity and reliability of test items, preparation of marking
schemes; scoring of tests, moderation of tests. Test item analysis; difficulty index,
discrimination index.
Interpreting test results; scoring and grading, analyzing examination results, reporting
test results. Types and categories of evaluation; formative evaluation, Summative
evaluation, placement evaluation, diagnostic evaluation.
Instructional Materials and Equipment
Projector
Course Assessment
Examination 70%
Continuous Assessments (Exercises and Tests) 30%
Total 100%
Objectives
At the end of this topic, you should be able to:
Define terms used in educational measurement and evaluation
Explain the importance of measurement and evaluation in education
Identify and explain approaches used in educational measurement and evaluation
1.0 Introduction
Educational measurement and evaluation is the study of methods, approaches and strategies used to
measure, assess and evaluate in the educational setting. Evaluation has been conceived either as the
assessment of the merit and worth of educational programmes (Guba and Lincoln, 1981; Glatthorn,
1987; and Scriven, 1991), or as the acquisition and analysis of information on a given educational
programme for the purpose of decision making (Nevo, 1986; Shiundu & Omulando, 1992). The course
involves the study of tests, test construction and item construction as well as statistical procedures used
to analyze tests and test results.
Test: A method to determine a student's ability to complete certain tasks or demonstrate mastery
of a skill or knowledge of content. Some types would be multiple choice tests, or a weekly
spelling test. While it is commonly used interchangeably with assessment, or even evaluation, it
can be distinguished by the fact that a test is one form of an assessment. A test or assessment
yields information relative to an objective or goal. In that sense, we test or assess to determine
whether or not an objective or goal has been obtained.
knowledge in a variety of ways, but there is always a leap, an inference that we make about
what a person does in relation to what it signifies about what he knows.
Evaluation: Procedures used to determine whether the subject (i.e. the student) meets preset
criteria, such as qualifying for special education services. This uses assessment (remember that
an assessment may be a test) to make a determination of qualification in accordance with
predetermined criteria. Evaluation is perhaps the most complex and least understood of the
terms. Inherent in the idea of evaluation is "value." When we evaluate, what we are doing is
engaging in some process that is designed to provide information that will help us make a
judgment about a given situation. Generally, any evaluation process requires information about
the situation in question. When we evaluate, we are saying that the process will yield
information regarding the worthiness, appropriateness, goodness, validity, legality, etc., of
something for which a reliable measurement or assessment has been made.
We evaluate every day. Teachers, in particular, are constantly evaluating students, and such
evaluations are usually done in the context of comparisons between what was intended
(learning, progress, behavior) and what was obtained.
When used in a learning objective, the definition of evaluate is: To classify objects, situations,
people, conditions, etc., according to defined criteria of quality. Indication of quality must be
given in the defined criteria of each class category.
Measurement refers to the process by which the attributes or dimensions of some physical
object are determined. One exception seems to be in the use of the word measure in
determining the IQ of a person, attitudes or preferences.
However, when we measure, we generally use some standard instrument to determine how big,
tall, heavy, voluminous, hot, cold, fast, or straight something actually is. Standard instruments
refer to instruments such as rulers, scales, thermometers, pressure gauges, etc. We measure to
obtain information about what is. Such information may or may not be useful, depending on
the accuracy of the instruments we use, and our skill at using them.
To sum up, we measure distance, we assess learning, and we evaluate results in terms of
some set of criteria. These three terms are certainly connected, but it is useful to think of them
as separate but connected ideas and processes.
More specifically, assessment is a method the teacher uses to make decisions on learners’
progress. It is an essential process in teaching and learning as it enables the teacher to evaluate
the level and extent of learners’ achievement of the set objectives.
ii. Determines how much knowledge the learners have grasped
iii. Establishes how the learners have mastered skills taught and acquired attitudes
iv. Detects the difficulties and challenges learners are encountering, which forms the basis
for remedial teaching
v. Checks the effectiveness of the use of resources and methods of instruction
vi. Provides basis for learner promotion and reward.
vii. Provide information to school administration, parents and stakeholders for necessary
action.
viii. Motivates and directs learning
ix. Provides feedback to students on their performance
x. Provides feedback on instruction and/or the curriculum
Good assessment can help students become more effective self-directed learners (Angelo
and Cross, 1993). Well-designed assessment strategies also play a critical role in educational
decision-making and are a vital component of ongoing quality improvement processes at the
lesson, course and/or curriculum level.
Summative assessment is used primarily to make decisions for grading or determine readiness
for progression. Typically summative assessment occurs at the end of an educational activity and
is designed to judge the learner’s overall performance. In addition to providing the basis for
grade assignment, summative assessment is used to communicate students’ abilities to external
stakeholders, e.g., administrators and employers.
Formal assessment occurs when students are aware that the task that they are doing is for
assessment purposes, e.g., a written examination or OSCE. Most formal assessments also
are summative in nature and thus tend to have greater motivational impact and are associated
with increased stress. Given their role in decision-making, formal assessments should be
held to higher standards of reliability and validity than informal assessments.
Final (or terminal) assessment is that which takes place only at the end of a learning activity. It
is most appropriate when learning can only be assessed as a complete whole rather than as
constituent parts. Typically, final assessment is used for summative decision-making. Obviously,
due to its timing, final assessment cannot be used for formative purposes.
A convergent assessment has only one correct response (per item). Objective test items are the
best example and demonstrate the value of this approach in assessing knowledge. Obviously,
convergent assessments are easier to evaluate or score than divergent assessments.
Unfortunately, this “ease of use” often leads to widespread application of this approach
even when it runs contrary to good assessment practices.
CHAPTER 2
INSTRUCTIONAL OBJECTIVES
Objectives
At the end of this topic, you should be able to:
i. Formulate instructional objectives
ii. Explain the importance of Bloom's taxonomy of educational objectives
2.0 Introduction
Instructional objectives are statements of what is to be achieved at the end of the instructional
process. They are, therefore, the subject of assessment and evaluation. This chapter discusses the
importance of instructional objectives, and their formulation.
OR
Without using a calculator, calculate the average of a list of numbers.
Specify application criteria - identify any desired levels of speed, accuracy, quality, quantity, etc.
For example: Given a calculator, calculate averages from a list of numbers correctly, all the time.
OR
Given a spreadsheet package, compute variances from a list of numbers rounded to the second
decimal point.
Review each learning outcome to be sure it is complete, clear and concise.
Comprehension
Comprehension entails the understanding of information so as to be able to translate,
perceive and interpret instruction and problems. Learners should be able to classify, cite,
convert, describe, discuss, explain, give examples, paraphrase, restate in own words,
summarize, understand, distinguish and rewrite.
Application
Application is using previously learnt information in new situations to solve problems. Those
experiencing this level of learning should show the capacity to apply, change, compute, modify,
predict, prepare, relate, solve, show, use and produce.
Analysis
Analysis refers to the ability to break down informational materials into their component parts
examine them and understand the organizational structure. It may involve identifying motives or
causes, making inferences or finding evidence to support generalizations. At this level, learners
should be able to break down, correlate, discriminate, differentiate, distinguish, focus, illustrate,
infer, limit, outline, point out, prioritize, recognize, separate, subdivide, select and compare.
Synthesis
Synthesis refers to the building of structures or patterns from various kinds of elements.
Learners at this level can put parts together to form a whole that has a new meaning or structure.
Key words in use for this level of learning include categorize, combine, compose, create and design.
Evaluation
Evaluation refers to the process of making judgments about information, its value and quality.
Learning at this level includes appraising, comparing and contrasting, defending, judging,
interpreting, justifying, discriminating and evaluating.
Valuing
This describes the value or worth that a learner attaches to a particular object,
phenomenon or behaviour.
Key terms include complete, join, demonstrate, differentiate, explain, form, initiate,
write, justify, propose, report, share, study and work.
Perception
This is the ability to use the senses to guide motor activity. Individuals experiencing learning at
this level should be able to choose, describe, detect, differentiate, distinguish, identify, isolate,
select.
Set
This indicates the readiness to act. At this level, individuals should be able to display, explain,
move, proceed, restate and volunteer.
Guided response
This refers to the early stages of learning a complex skill that include imitation, and trial and
error. Achievement at this level is attained by practicing. Learning is demonstrated by copying,
tracing, following, reacting and reproducing.
Mechanism
This refers to the intermediate stage of learning a complex skill. Learnt responses should
become habitual, confident and proficient. Learners can assemble, construct, dismantle, fasten,
fix, grind, heat, measure, mend, sketch, organize and calibrate.
Adaptation
Individuals experiencing learning at this level have well-developed skills and are able to make
modifications to fit special requirements. Learners at this level adapt, alter, change, rearrange
and reorganize.
Origination
This involves creativity. The learner can create new movement patterns to suit different
situations. Individuals at this level should demonstrate the ability to arrange, build, combine,
compose, construct, design, initiate and make.
CHAPTER 3
TEST CONSTRUCTION
Objectives
At the end of this topic, you should be able to:
i. Define basic terms
ii. Explain the importance of validity and reliability of tests
iii. Discuss different ways of ascertaining validity and reliability of tests
3.0 Introduction
Test construction is the process of building a test. For a test to be deemed good, its
reliability and validity must be determined. This chapter discusses test validity and reliability.
3.2 Types of Reliability
1. Test-retest reliability is the degree to which scores are consistent over time. Test-retest
reliability is a measure of reliability obtained by administering the same test twice over a period
of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in
order to evaluate the test for stability over time. Example: A test designed to assess student
learning in psychology could be given to a group of students twice, with the second
administration perhaps coming a week after the first. The obtained correlation coefficient
would indicate the stability of the scores.
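The correlation step described above can be sketched in code. This is only an illustration: the Pearson formula is standard, but the student scores below are invented for the example.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores of five students on the same psychology test,
# administered twice, one week apart.
time1 = [78, 65, 90, 52, 70]
time2 = [80, 63, 88, 55, 72]

print(round(pearson(time1, time2), 2))  # a value near 1 indicates stable scores
```

A coefficient close to 1 indicates stability over time; a low coefficient suggests the scores are not consistent between administrations.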
2. Parallel forms reliability/ Equivalent-Forms or Alternate-Forms Reliability:
Two tests that are identical in every way except for the actual items included. Used when
it is likely that test takers will recall responses made during the first session and when alternate
forms are available. Correlate the two scores. The obtained coefficient is called the coefficient of
stability or coefficient of equivalence. Problem: Difficulty of constructing two forms that are
essentially equivalent.
It is a measure of reliability obtained by administering different versions of an assessment tool
(both versions must contain items that probe the same construct, skill, knowledge base, etc.) to
the same group of individuals. The scores from the two versions can then be correlated in order
to evaluate the consistency of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment, you
might create a large set of items that all pertain to critical thinking and then randomly split
the questions up into two sets, which would represent the parallel forms. Both of the above
require two administrations.
3. Inter-rater reliability is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-rater reliability is useful
because human observers will not necessarily interpret answers the same way; raters may
disagree as to how well certain responses or material demonstrate knowledge of the construct
or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating the degree
to which art portfolios meet certain standards. Inter-rater reliability is especially useful when
judgments can be considered relatively subjective. Thus, the use of this type of reliability would
probably be more likely when evaluating artwork as opposed to math problems.
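One way the degree of agreement between two raters might be quantified is simple percent agreement together with Cohen's kappa, which corrects for agreement expected by chance. The sketch below is illustrative; the two judges and their portfolio ratings are invented.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of items on which two raters give the same rating."""
    return sum(1 for a, b in zip(r1, r2) if a == b) / len(r1)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Probability the raters would agree by chance, from their rating frequencies.
    expected = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings by two judges of ten art portfolios against a pass/fail standard.
judge1 = ["pass", "pass", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]
judge2 = ["pass", "fail", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]

print(percent_agreement(judge1, judge2))        # 0.9
print(round(cohens_kappa(judge1, judge2), 2))   # 0.78
```

Kappa is lower than raw agreement because some of the raters' matching decisions would be expected even if they rated at random.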
4. Internal consistency reliability. It is determining how all items on the test relate to all other
items. It is a measure of reliability used to evaluate the degree to which different test items that
probe the same construct produce similar results.
A. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained
by taking all of the items on a test that probe the same construct (e.g., reading comprehension),
determining the correlation coefficient for each pair of items, and finally taking the average of
all of these correlation coefficients. This final step yields the average inter-item correlation.
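The steps just described (pair up the items, correlate each pair, average the coefficients) can be sketched as follows. The pearson helper and the item scores are illustrative assumptions, not part of the original text.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def average_inter_item_correlation(items):
    """items: one list of student scores per test item, all probing the same construct."""
    pairs = list(combinations(items, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores of six students on three reading-comprehension items.
item_scores = [
    [4, 3, 5, 2, 4, 3],
    [5, 3, 4, 2, 5, 3],
    [4, 2, 5, 3, 4, 2],
]
print(round(average_inter_item_correlation(item_scores), 2))
```

A high average suggests the items are measuring the same construct consistently.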
B. Split-half reliability is another subtype of internal consistency reliability. The process of
obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to
probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The
entire test is administered to a group of individuals, the total score for each “set” is computed,
and finally the split-half reliability is obtained by determining the correlation between the two
total “set” scores. Requires only one administration. Especially appropriate when the test is very
long. The most commonly used method to split the test into two is using the odd-even strategy.
Since longer tests tend to be more reliable, and since split-half reliability represents the reliability
of a test only half as long as the actual test, a correction formula must be applied to the
coefficient: the Spearman-Brown prophecy formula. Split-half reliability is a form of internal
consistency reliability.
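The whole procedure, including the Spearman-Brown correction, can be sketched as follows. The odd-even split is the strategy named above; the item-response data are invented for the example.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_brown(r_half):
    """Spearman-Brown prophecy formula: corrects a half-test correlation
    up to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(scores):
    """scores: one list of item scores (1 = correct, 0 = wrong) per student.
    Splits the items odd-even, totals each half, correlates the two half
    totals, then applies the Spearman-Brown correction."""
    odd_totals = [sum(s[0::2]) for s in scores]
    even_totals = [sum(s[1::2]) for s in scores]
    return spearman_brown(pearson(odd_totals, even_totals))

# Hypothetical responses of five students to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
]
print(round(split_half_reliability(responses), 2))
```

Note how the correction raises the half-test correlation: for instance, a half-test correlation of 0.6 corrects to 2(0.6)/(1 + 0.6) = 0.75 for the full test.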
3.3 Validity
Validity refers to how well a test measures what it is purported to measure or the extent to which
a test measures what it is supposed to measure.
Why is it necessary?
While reliability is necessary, it alone is not sufficient. For a test to be valid, it must also be
reliable, yet a reliable test is not necessarily valid. For example, if your scale is off by 5 lbs, it
reads your weight every day with an excess of 5 lbs. The scale is reliable because it consistently
reports the same weight every day, but it is not valid because it adds 5 lbs to your true weight. It
is not a valid measure of your weight.
3.4 Types of Validity
1. Content Validity:
When we want to find out if the entire content of the behavior/construct/area is
represented in the test we compare the test task with the content of the behavior. This is a logical
method, not an empirical one. Example, if we want to test knowledge on American Geography it
is not fair to have most questions limited to the geography of New England.
2. Face Validity:
Basically face validity refers to the degree to which a test appears to measure what it
purports to measure. Face Validity ascertains that the measure appears to be assessing the
intended construct under study. The stakeholders can easily assess face validity. Although this is
not a very “scientific” type of validity, it may be an essential component in enlisting motivation
of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the
ability, they may become disengaged with the task.
Example: If a measure of art appreciation is created all of the items should be related to the
different components and types of art. If the questions are regarding historical time periods, with
no reference to any artistic movement, stakeholders may not be motivated to give their best effort
or invest in this measure because they do not believe it is a true assessment of art appreciation.
3. Construct Validity.
Construct validity is the degree to which a test measures an intended hypothetical construct. It
is used to ensure that the measure actually measures what it is intended to measure (i.e. the
construct), and not other variables. Using a panel of “experts” familiar with the construct is a
way in which this type of validity can be assessed. The experts can examine the items and decide
what each specific item is intended to measure. Students can be involved in this process to obtain
their feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. The questions are written with complicated wording and phrasing.
This can cause the test to inadvertently become a test of reading comprehension, rather than a
test of women’s studies. It is important that the measure is actually assessing the intended
construct, rather than an extraneous factor.
4. Criterion-Related Validity
When you are expecting a future performance based on the scores obtained currently by the
measure, correlate the scores obtained with the performance. The later performance is called the
criterion and the current score is the prediction. This is an empirical check on the value of the
test – a criterion-oriented or predictive validation. It is used to predict future or current
performance - it correlates test results with another criterion of interest.
Example: Suppose a physics program designed a measure to assess cumulative student learning
throughout the major. The new measure could be correlated with a standardized measure of
ability in this discipline, such as an ETS field test or the GRE subject test. The higher the
correlation between the established measure and new measure, the more faith stakeholders
can have in the new assessment tool.
5. Formative Validity
When applied to outcomes assessment it is used to assess how well a measure is able to
provide information to help improve the program under study. Example: When designing a
rubric for history, one could assess students' knowledge across the discipline. If the measure can
provide information that students are lacking knowledge in a certain area, for instance the Civil
Rights Movement, then that assessment tool is providing meaningful information that can be
used to improve the course or program requirements.
6. Concurrent Validity:
Concurrent validity is the degree to which the scores on a test are related to the scores on
another, already established, test administered at the same time, or to some other valid criterion
available at the same time. Example: a new simple test is to be used in place of an old,
cumbersome one which is considered useful; measurements are obtained on both at the same
time. Logically, predictive and concurrent validation are the same; the term concurrent
validation is used to indicate that no time elapses between measures.
7. Sampling Validity (similar to content validity)
Ensures that the measure covers the broad range of areas within the concept under study. Not
everything can be covered, so items need to be sampled from all of the domains. This may need
to be completed using a panel of “experts” to ensure that the content area is adequately sampled.
Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an individual
personally feels are the most important or relevant areas).
Example: When designing an assessment of learning in the theatre department, it would not be
sufficient to only cover issues related to acting. Other areas of theatre such as lighting, sound,
functions of stage managers should all be included. The assessment should reflect the content
area in its entirety.
CHAPTER 4
TYPES OF TESTS
Objectives
At the end of this topic, you should be able to:
1. Discuss different types of tests
2. Explain Intelligence tests
4.0 Introduction
In chapter one a test was defined as a method to determine a student's ability to complete
certain tasks or demonstrate mastery of a skill or knowledge of content. Some types would be
multiple choice tests, or a weekly spelling test. This chapter discusses the types of tests as well
as the qualities of a good test.
4.1 Types of Tests
Oral questions
These are questions that the teacher asks on a continuous basis to assess the learners'
progress during a lesson.
Quizzes
These are short-answer questions which the teacher uses to determine the level of
mastery of specific content.
Observation
Some teaching and learning activities can be best assessed through observation. These include
learners’ participation in debates and discussions and group work. The teacher can use check
lists and observation schedules as shown in the table below.
Name           Activity     Behavior to be observed      Observation made
Tabitha Neema  Group work   Preparation                  Displayed adequate preparation in terms of content
                            Willingness to contribute    Contributed adequately
                            Attitude                     Positive attitude
                            Expression                   Not audible and orderly
The information gained from such observation would be the basis for personal academic
counseling and guidance
Projects
The teacher should assign projects to students as individuals or as groups. When assigning
projects, the learners need to be given enough information with regard to the scope of the project
and the mode of reporting the findings as well as the source of materials. Although different
projects have a different focus, when assessing, generally look out for:
i. Neatness
ii. Relevance
iii. Accuracy
iv. Completeness
Assignments
Assignments are tasks or responsibilities given by the teacher to learners to perform. The teacher
then grades the performance. A number of topics in geography give opportunity to assign
learners tasks that can be evaluated. These assignments include:
reading selected texts and reference materials
making guided notes
writing reports on specific topics or field visits
writing essays
evaluation is coming up. Exams can be great motivators.
To add variety to student learning. Exams are a form of learning activity. They
can enable students to see the material from a different perspective. They also
provide feedback that students can then use to improve their understanding.
To identify faults and correct them. Exams enable both students and instructors to
identify which areas of the material taught are not being understood properly. This
allows students to seek help, and instructors to address areas that may need more
attention, thus enabling student progression and improvement.
To obtain feedback. You can use exams to evaluate your own teaching. Students’
performance on the exam will pinpoint areas where you should spend more time or
change your current approach.
To provide statistics for the course or institution. Institutions often want information
on how students are doing. How many are passing and failing, and what is the average
achievement in class? Exams can provide this information.
To accredit qualified students. Certain professions demand that students demonstrate the
acquisition of certain skills or knowledge. An exam can provide such proof – for
example,.
ensuring standards of progression are met
should be able to identify the characteristics of a satisfactory answer and understand the
relative importance of those characteristics. This can be achieved in many ways; you
can provide feedback on assignments, describe your expectations in class, or post model
solutions on a course website.
Timely. Spread exams out over the semester. Giving two exams one week apart doesn’t
give students adequate time to receive and respond to the feedback provided by the first
exam. When possible, plan the exams to fit logically within the flow of the course
material. It might be helpful to place tests at the end of important learning units rather than
simply give a midterm halfway through the semester.
Review Questions
1. What is a test?
2. Why are tests given?
3. Identify and discuss any 4 types of tests
4. Analyze the characteristics of a good test
CHAPTER 5
PREPARATION OF TABLE OF SPECIFICATIONS
Objectives
At the end of this topic, you should be able to:
i. Define basic terms
ii. Explain the Importance of table of specification
iii. Describe the Preparation of table of specification
5.0 Introduction
The Table of Specifications is a blueprint for the preparation of an exam. It serves as the "map"
or guide to assigning the appropriate number of items to topics included in the course or subject.
Essentially, a table of specifications is a chart that breaks down the topics that will be on a
test and the number of test questions or the percentage of weight each section will have on the
final test grade.
This type of table is mainly used by teachers to help break down their testing outline on a
specific subject. Some teachers use this particular table as their teaching guideline. For many
teachers, a table of specification is both part of the process of test building and a product of
the test building process.
iii. Decide on the number of items that you would like the test to have. Let's say
you wanted a 16-item test; the number of items per topic would then be:
a. weather station - 25 % ------------------------4
b. elements of the weather – 37.5 % ---------- 6
c. the atmosphere – 37.5 % ------------------- 6
This gives a total of 16 items.
iv. Assign the specific type of question you would like to ask depending on what
skill or cognitive learning you would like to emphasize.
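The allocation step above can be sketched as a small computation. The topic names and weights come from the weather example; the code itself is only illustrative. With less convenient percentages, the rounded counts may not sum to the intended total and would need manual adjustment.

```python
# Topic weights from the worked example (proportion of instruction per topic).
topic_weights = {
    "weather station": 0.25,
    "elements of the weather": 0.375,
    "the atmosphere": 0.375,
}
total_items = 16

# Number of items per topic = weight x total number of items.
allocation = {topic: round(w * total_items) for topic, w in topic_weights.items()}
print(allocation)
# {'weather station': 4, 'elements of the weather': 6, 'the atmosphere': 6}
```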
Columns 1 and 2 on the left hand side list the specific content areas taught in the unit. For
example, the content of a math unit could be whole numbers and decimals, addition and
subtraction, numerical form, and expanded form. These are the areas of content taught.
Column three of the table is a summary of the number of questions for each content area
and is completed after the questions have been written or determined in advance based on
classroom instruction devoted to each content area.
Column four is the percent of questions devoted to each content area. This is calculated
by taking the number of questions per area and dividing by the number of questions on the entire
test.
Columns 5-10 are based on Bloom’s Taxonomy. The question number is listed in the
column and row that best describes the content and level of critical thinking required to answer
the question.
Review Questions
1. What is a table of specifications?
2. Why should teachers construct a table of specifications?
3. Outline the steps in constructing a table of specifications
CHAPTER 6
CONSTRUCTION OF TEST ITEMS
Objectives
At the end of this topic, you should be able to:
Discuss the guidelines when writing different kinds of test items
6.0 Introduction
For effective writing of test items, certain guidelines are necessary. The following section
examines the guidelines when writing different test items.
Selected response (objective) assessment items are very efficient – once the items are created,
you can assess and score a great deal of content rather quickly. Note that the term objective refers
to the fact that each question has a right and wrong answer and that they can be impartially
scored. In fact, the scoring can be automated if you have access to an optical scanner for scoring
paper tests or a computer for computerized tests. However, the construction of these “objective”
items might well include subjective input by the teacher/creator.
a) Multiple Choice
Multiple choice questions consist of a stem (question or statement) with several answer
choices (the correct answer plus distractors). Important points in constructing these tests are:
All answer choices should be plausible and homogeneous.
Example
What does peaked mean?
A. was sharp
B. was at its height
C. was mountainous
D. was rising
Non-Example
What does peaked mean?
A. was pale
B. was at its height
C. was hot
D. was beautiful
D. The canyon was formed from rocks that came from other places.
b) Matching
Matching items consist of two lists of words, phrases, or images (often referred to as stems and
responses). Students review the list of stems and match each with a word, phrase, or image from
the list of responses.
Important points:
Alternatives should be short, homogeneous and arranged in logical order.
Example
Match the following equations (place the corresponding letter in the blank).
As a general rule, the stems should be longer and the responses should be shorter.
Example
Match the description of the flag to its country.
_____ Red background with white cross A. Austria
_____ Green background with large red circle B. Germany
_____ Two red stripes and one white stripe (in center) C. Denmark
_____ Red stripe (top); white stripe; green stripe (bottom) D. Bangladesh
E. Hungary
c) True/False items
True/false questions can appear to be easier to write; however, it is difficult to write effective
true/false questions. Also, the reliability of T/F questions is not generally very high because of the
high possibility of guessing. In most cases, T/F questions are not recommended.
Important points:
Statements should be completely true or completely false.
Example
The metric system is used consistently in Canada.
True
False
Use simple, easy-to-follow statements.
Example
Botany is a branch of the biological sciences that embraces the study of plants and plant life.
True
False
Avoid using negatives -- especially double negatives.
There is nothing illegal about staying home from school.
True
False
Avoid absolutes such as "always; never."
Example
The news and information posted on the CNN website is usually accurate.
True
False
Non-Example
The news and information posted on the CNN website is always accurate.
True
False
d) Fill-in-the-Blank Items
The simplest forms of constructed response questions are fill-in-the-blank or short answer
questions. For example, the question may take one of the following forms:
1. Who was the 16th president of the United States?
2. The 16th president of the United States was ___________________.
These assessments are relatively easy to construct, yet they have the potential to test recall,
rather than simply recognition. They also control for guessing, which can be a major factor,
especially for T/F or multiple choice questions.
When creating short answer items, make sure the question is clear and there is a single
correct answer. Here are a few guidelines, along with examples and non-examples.
Ask a direct question that has a definitive answer.
Example
Ms. Joyce has dinner with three of her friends. The four friends decide to split the cost equally.
The bill comes to $32.80, and the women plan to leave a 15% tip. How much should Ms. Joyce
pay for her share of the dinner? __________________
Non-Example
Ms. Joyce has dinner with three of her friends. The four friends decide to split the cost equally.
The bill comes to $32.80, and the women plan to leave a small tip. How much should Ms. Joyce
pay for her share of the dinner? ____________________
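For reference, the arithmetic behind the example can be checked in a few lines of Python:

```python
# Bill of $32.80 split equally among four friends, with a 15% tip added.
bill = 32.80
tip_rate = 0.15

total_with_tip = bill * (1 + tip_rate)  # 37.72
share = total_with_tip / 4              # each friend's portion
print(f"${share:.2f}")                  # $9.43
```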
If using fill-in-the-blank, use only one blank per item.
Example
Salt consists primarily of sodium and _____________.
Non-Example
_____________ consists primarily of ____________ and _____________.
If using fill-in-the-blank, place the blank near the end of the sentence.
Example
A ball is dropped from a height of 20 meters above the ground. As the ball falls, it increases
in speed. The kinetic and potential energies of the ball will be equal at __________ meters.
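A quick energy-conservation check confirms the intended answer: because the ball's total mechanical energy is fixed, its kinetic and potential energies are equal at exactly half the drop height.

```python
# For a ball dropped from rest at height h0, PE = m*g*h and
# KE = m*g*(h0 - h) by conservation of energy, so KE == PE when
# h = h0 / 2, independent of the ball's mass.
h0 = 20.0            # drop height in meters
h_equal = h0 / 2     # height at which KE equals PE
print(h_equal)       # 10.0
```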
Although constructed response assessments can more easily demand higher levels of
thinking, they are more difficult to score.
Essay questions are a more complex version of constructed response assessments. With essay
questions, there is one general question or proposition, and the student is asked to respond in
writing. This type of assessment is very powerful -- it allows the students to express themselves
and demonstrate their reasoning related to a topic. Essay questions often demand the use of
higher level thinking skills, such as analysis, synthesis, and evaluation.
Essay questions may appear to be easier to write than multiple choice and other question types, but
writing effective essay questions requires a great deal of thought and planning. If an essay
question is vague, it will be much more difficult for the students to answer and much more
difficult for the instructor to score. Well-written essay questions clearly specify the task and the features the response should address.
Example
Using details and information from the article (America’s Saltiest Sea: Great Salt Lake),
write an essay describing the Great Salt Lake. Include:
its history
its interesting features
why it is a landmark
Non-Example
Using details and information from the article (America’s Saltiest Sea: Great Salt
Lake), summarize the main points of the article.
Essay questions are used both as formative assessments (in classrooms) and summative
assessments (on standardized tests).
There are two major categories of essay questions -- short response (also referred to as restricted
or brief) and extended response.
a) Short Response
Short response questions are more focused and constrained than extended response questions.
For example, a short response might ask a student to "write an example," "list three reasons," or
"compare and contrast two techniques." The short response items on the Florida assessment
(FCAT) are designed to take about 5 minutes to complete and the student is allowed up to 8 lines
for each answer. The short responses are scored using a 2-point scoring rubric. A complete and
correct answer is worth 2 points. A partial answer is worth 1 point.
Sample Short Response Question
(form 3 Reading)
How are the scrub jay and the mockingbird different? Support your answer with details
and information from the article.
b) Extended Response
Extended responses can be much longer and more complex than short responses, but students should
be encouraged to remain focused and organized. Students have 14 lines for each answer to an
extended response item, and they are advised to spend approximately 10-15 minutes completing
each item. The extended responses are scored using a 4-point scoring rubric. A complete and
correct answer is worth 4 points. A partial answer is worth 1, 2, or 3 points.
Sample Extended Response Question
(form 1 Science)
Robert is designing a demonstration to display at his school’s science fair. He will show how
changing the position of a fulcrum on a lever changes the amount of force needed to lift an
object. To do this, Robert will use a piece of wood for a lever and a block of wood to act as a
fulcrum. He plans to move the fulcrum to different places on the lever to see how its
placement affects the force needed to lift an object.
Part A Identify at least two other actions that would make Robert’s demonstration better.
36
Part B Explain why each action would improve the demonstration.
Review Questions
1. Outline the guidelines for constructing multiple choice questions.
2. What are the characteristics of well-constructed matching items?
3. State the important points to observe when constructing true/false test items.
4. Identify two types of essay test items.
5. Outline the features of well-written essay questions.
CHAPTER 7:
PREPARATION OF MARKING SCHEME
Objectives
At the end of this topic, you should be able to:
Prepare marking schemes for different kinds of tests
Explain the meaning and purpose of moderation
Discuss how to moderate examinations/tests
7.0 Introduction
A marking scheme is a set of criteria used in assessing student learning.
Do not penalize the same error repeatedly. If an error is made early but carried through the
answer, you should only penalize it once if the rest of the response is sound.
Review the marking scheme after the exam. Once the exam has been written, read a
few answers and review your key. You may sometimes find that students have interpreted
your question in a way that is different from what you had intended. Students may come
up with excellent answers that may be slightly outside of what was asked. Consider
giving these students partial marks.
When marking, make notes on exams. These notes should make it clear why you gave
a particular mark. If exams are returned to the students, your notes will help them
understand their mistakes and correct them. They will also help you should students want
to review their exam long after it has been given, or if they appeal their grade.
Although essay questions are powerful assessment tools, they can be difficult to score. With
essays, there isn't a single, correct answer and it is almost impossible to use an automatic scantron
or computer-based system. In order to minimize the subjectivity and bias that may occur in the
assessment, teachers should prepare a list of criteria prior to scoring the essays. Consider, for
example, the following question and scoring criteria:
Organization -- up to 3 points for essay organization (e.g., introduction, well
expressed points, conclusion)
Brevity -- up to 1 point for appropriate brevity (i.e., no extraneous or "filler"
information)
No penalty for spelling, punctuation, or grammatical errors.
By outlining the criteria for assessment, the students know precisely how they will be assessed
and where they should concentrate their efforts. In addition, the instructor can provide
feedback that is less biased and more consistent. Additional techniques for scoring constructed
response items include:
Do not look at the student's name when you grade the essay.
Outline an exemplary response before reviewing student responses.
Scan through the responses and look for major discrepancies in the answers -- this
might indicate that the question was not clear.
If there are multiple questions, score Question #1 for all students, then Question #2, etc.
Use a scoring rubric that provides specific areas of feedback for the students
7.4 What is Moderation?
Moderation is a set of processes designed and implemented by the assessors/evaluators to:
Provide system-wide comparability of grades and scores derived from internal-based
assessment
Form the basis for valid and reliable assessment in schools
Maintain the quality of assessment and the credibility, validity and acceptability
of certificates.
Moderation is necessary for producing valid, credible and publicly acceptable certificates
in an assessment system.
Moderation provides for comparability of standards across the classes and schools.
Suppose an easy internal exam is moderated against a harder public exam. The top students
would retain their high scores, but students with an average performance would have their
scores pulled down.
This is because the internal exam is an easy one, and in order to make it comparable, we’re
stretching their marks to the same range. As a result, the good performers would continue
getting a top score. But poor performers who’ve gotten a better score than they would have in a
public exam lose out.
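One common approach to statistical moderation is linear scaling: internal marks are shifted and stretched so that their mean and spread match those of the external examination. The sketch below uses invented marks purely for illustration:

```python
from statistics import mean, pstdev

# Invented marks from an easy internal exam, and target statistics
# (mean and standard deviation) taken from the public examination.
internal = [93, 88, 75, 62, 55]
ext_mean, ext_sd = 60, 15

int_mean, int_sd = mean(internal), pstdev(internal)

# Shift and stretch each mark so the moderated set has the external
# exam's mean and standard deviation.
moderated = [ext_mean + (m - int_mean) * ext_sd / int_sd for m in internal]

for raw, mod in zip(internal, moderated):
    print(raw, "->", round(mod, 1))
```

The rank order is preserved: the top internal scorers stay on top, while average internal marks are pulled down toward the public-exam mean, which is exactly the effect described above.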
Review Questions
1. What is a marking scheme?
2. Why prepare a marking scheme?
3. Explain the guidelines for preparing marking schemes.
4. What is test moderation?
5. Outline the process of test moderation
CHAPTER 8
TEST ITEM ANALYSIS
Objectives
At the end of this topic, you should be able to:
1. Discuss how to determine the Difficulty index of a test
2. Discuss how to determine the Discrimination index of a test
8.0 Introduction
After you create your objective assessment items and give your test, how can you be sure that
the items are appropriate -- not too difficult and not too easy? How will you know if the test
effectively differentiates between students who do well on the overall test and those who do not?
An item analysis is a valuable, yet relatively easy, procedure that teachers can use to answer both
of these questions.
For example, suppose 30 students answered two multiple choice items and their answer
choices were distributed as follows:
Question    A     B     C     D
#1          0     3     24*   3
#2          12*   13    3     2
* Denotes correct answer.
For Question #1, we can see that A was not a very good distractor -- no one selected that answer.
We can also compute the difficulty of the item by dividing the number of students who choose
the correct answer (24) by the number of total students (30). Using this formula, the difficulty of
Question #1 (referred to as p) is equal to 24/30 or .80. A rough "rule-of-thumb" is that if the item
difficulty is more than .75, it is an easy item; if the difficulty is below .25, it is a difficult item.
Given these parameters, this item would be regarded as easy -- most (80%) of the students got
it correct. In contrast, Question #2 is much harder (12/30 = .40). In fact, on Question #2,
more students selected an incorrect answer (B) than selected the correct answer (A). This item
should be carefully analyzed to ensure that B is an appropriate distractor.
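The difficulty computation above can be sketched directly in Python, using the counts from the table:

```python
# Answer-choice counts for the two items (30 students per item).
counts = {
    "Q1": {"A": 0, "B": 3, "C": 24, "D": 3},
    "Q2": {"A": 12, "B": 13, "C": 3, "D": 2},
}
correct = {"Q1": "C", "Q2": "A"}

# Difficulty index p = (students choosing the correct answer) / (total students).
p_values = {}
for item, choices in counts.items():
    p_values[item] = choices[correct[item]] / sum(choices.values())
    print(item, "p =", round(p_values[item], 2))  # Q1 p = 0.8, Q2 p = 0.4
```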
An item's discrimination index (between 0 and 1) is positive when the students who did well
on the overall test chose the
correct answer for a specific item more often than the students who had a lower overall score.
If, however, you find that more of the low-performing students got a specific item correct, then
the item has a negative discrimination index (between -1 and 0). Let's look at an example.
Table 2 displays the results of ten questions on a quiz. Note that the students are arranged with
the top overall scorers at the top of the table.
Student    Total Score (%)    Q1    Q2    Q3
Asif       90                 1     0     1
Sam        90                 1     0     1
Jill       80                 0     0     1
Charlie    80                 1     0     1
Sonya      70                 1     0     1
Ruben      60                 1     0     0
Clay       60                 1     0     1
Kelley     50                 1     1     0
Justin     50                 1     1     0
Tonya      40                 0     1     0
"1" indicates the answer was correct; "0" indicates it was incorrect.
Follow these steps to determine the Difficulty Index and the Discrimination Index.
After the students are arranged with the highest overall scores at the top, count the number of
students in the upper and lower group who got each item correct. For Question #1, there were
4 students in the top half who got it correct, and 4 students in the bottom half.
Determine the Difficulty Index by dividing the number who got it correct by the total
number of students. For Question #1, this would be 8/10 or p=.80.
Determine the Discrimination Index by subtracting the number of students in the
lower group who got the item correct from the number of students in the upper group
who got the item correct. Then, divide by the number of students in each group (in this
case, there are five in each group). For Question #1, that means you would subtract 4
from 4, and divide by 5, which results in a Discrimination Index of 0.
The answers for Questions 1-3 are provided in the table below.
Question      Upper group correct    Lower group correct    Difficulty    Discrimination
Question 1    4                      4                      .80           0
Question 2    0                      3                      .30           -0.6
Question 3    5                      1                      .60           0.8
Now that we have the table filled in, what does it mean? We can see that Question #2 had a
difficulty index of .30 (meaning it was quite difficult), and it also had a negative discrimination
index of -0.6 (meaning that the low-performing students were more likely to get this item
correct). This question should be carefully analyzed, and probably deleted or changed. Our "best"
overall question is Question 3, which had a moderate difficulty level (.60), and discriminated
extremely well (0.8).
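The whole procedure -- ranking students, splitting the class into halves, and computing both indices -- can be sketched as follows, using the quiz data tabulated above:

```python
# Each student: (name, overall score %, [Q1, Q2, Q3] with 1 = correct, 0 = incorrect).
students = [
    ("Asif", 90, [1, 0, 1]), ("Sam", 90, [1, 0, 1]),
    ("Jill", 80, [0, 0, 1]), ("Charlie", 80, [1, 0, 1]),
    ("Sonya", 70, [1, 0, 1]), ("Ruben", 60, [1, 0, 0]),
    ("Clay", 60, [1, 0, 1]), ("Kelley", 50, [1, 1, 0]),
    ("Justin", 50, [1, 1, 0]), ("Tonya", 40, [0, 1, 0]),
]

# Sort by overall score and split into upper and lower halves.
ranked = sorted(students, key=lambda s: s[1], reverse=True)
half = len(ranked) // 2
upper, lower = ranked[:half], ranked[half:]

results = {}
for q in range(3):
    upper_correct = sum(s[2][q] for s in upper)
    lower_correct = sum(s[2][q] for s in lower)
    difficulty = (upper_correct + lower_correct) / len(ranked)
    discrimination = (upper_correct - lower_correct) / half
    results[f"Q{q + 1}"] = (difficulty, discrimination)
    print(f"Q{q + 1}: p = {difficulty:.2f}, D = {discrimination:+.2f}")
```

Running this reproduces the table: Question 2 is flagged immediately by its negative discrimination, while Question 3 shows moderate difficulty and strong discrimination.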
Another consideration for an item analysis is the cognitive level that is being assessed. For example,
you might categorize the questions based on Bloom's taxonomy (perhaps grouping questions that
address Level I and those that address Level II). In this manner, you would be able to determine if
the difficulty index and discrimination index of those groups of questions are appropriate. For
example, you might note that the majority of the questions that demand higher levels of thinking
skills are too difficult or do not discriminate well. You could then concentrate on improving those
questions and focus your instructional strategies on higher-level skills.
Review Questions
1. What is test item analysis?
2. Why is item analysis important?
3. Differentiate between the difficulty index and the discrimination index.
4. Explain what is meant by positive discrimination and negative discrimination.
CHAPTER 9
INTERPRETING TEST RESULTS
Objectives
At the end of this topic, you should be able to:
i. Discuss the scoring and grading of tests
ii. Discuss the analysis of examination results
iii. Explain the reporting of test results
9.0 Introduction
The process of evaluation does not end at analysis but must go on to the interpretation of the
analysed results. This section examines this process.
Grading comparisons
Some kind of comparison is being made when grades are assigned. For example, an instructor
may compare a student's performance to that of his or her classmates, to standards of excellence
(i.e., pre-determined objectives, contracts, professional standards) or to combinations of each.
Four common comparisons used to determine college and university grades and the major
advantages and disadvantages of each are discussed in the following section.
Relative to Other Students . . .
By comparing a student's overall course performance with that of some relevant group of
students, the instructor assigns a grade to show the student's level of achievement or standing
within that group. An "A" might not represent excellence in attainment of knowledge and skill if
the reference group as a whole is somewhat inept. All students enrolled in a course during a given
semester or all students enrolled in a course since its inception are examples of possible
comparison groups. The nature of the reference group used is the key to interpreting grades based
on comparisons with other students.
Advantages:
2. The system is a common one that many faculty members are familiar with. Given
additional information about the students, instructor, or college department, grades from
the system can be interpreted easily.
Disadvantages:
1. No matter how outstanding the reference group of students is, some will receive low
grades; no matter how low the overall achievement in the reference group, some students
will receive high grades. Grades are difficult to interpret without additional information
about the overall quality of the group.
2. Grading standards in a course tend to fluctuate with the quality of each class of students.
Standards are raised by the performance of a bright class and lowered by the performance of
a less able group of students. Often a student's grade depends on who was in the class.
3. There is usually a need to develop course "norms" which account for more than a single
class performance. Students of an instructor who is new to the course may be at a
particular disadvantage since the reference group will necessarily be small and very
possibly atypical compared with future classes.
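A norm-referenced scheme can be sketched as grading on a curve: each mark is converted to a z-score relative to the class, and letter grades are assigned by standing within the group. The scores and z-score cutoffs below are invented for illustration:

```python
from statistics import mean, pstdev

scores = [52, 61, 67, 70, 74, 78, 83, 90]  # one class's final marks
mu, sd = mean(scores), pstdev(scores)

def curved_grade(score):
    """Assign a letter grade from the score's standing within this class."""
    z = (score - mu) / sd
    if z >= 1.0:
        return "A"
    if z >= 0.3:
        return "B"
    if z >= -0.3:
        return "C"
    if z >= -1.0:
        return "D"
    return "F"

for s in scores:
    print(s, curved_grade(s))
```

Note the drawback described above: the same raw score earns a different grade in a stronger or weaker class, because the class mean and standard deviation shift with the reference group.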
Relative to Absolute Standards . . .
Grades may be obtained by comparing a student's performance with specified absolute standards
rather than with such relative standards as the work of other students. In this grading method,
the instructor is interested in indicating how much of a set of tasks or ideas a student knows,
rather than how many other students have mastered more or less of that domain. A "C" in an
introductory statistics class might indicate that the student has minimal knowledge of descriptive
and inferential statistics. A much higher achievement level would be required for an "A." Note
that students' grades depend on their level of content mastery; thus the levels of performance of
their classmates have no bearing on the final course grade. There are no quotas in each grade
category. It is possible in a given class that all students could receive an "A" or a "B."
Advantages:
1. Course goals and standards must necessarily be defined clearly and communicated to
the students.
2. Most students, if they work hard enough and receive adequate instruction, can obtain
high grades. The focus is on achieving course goals, not on competing for a grade.
3. Final course grades reflect achievement of course goals. The grade indicates "what"
a student knows rather than how well he or she has performed relative to the
reference group.
4. Students do not jeopardize their own grade if they help another student with course work.
Disadvantages:
1. It is difficult and time consuming to determine what course standards should be for each
possible course grade issued.
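Criterion-referenced grading can be sketched as a set of fixed cutoffs tied to content mastery, independent of how classmates perform. The percentage cutoffs below are invented for illustration; because there are no quotas, every student in a class could earn an "A":

```python
# Fixed mastery cutoffs: grade depends only on the student's own percentage.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def absolute_grade(percent):
    for cutoff, grade in CUTOFFS:
        if percent >= cutoff:
            return grade
    return "F"

print([absolute_grade(p) for p in (95, 84, 72, 58)])  # ['A', 'B', 'C', 'F']
```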
Relative to Improvement . . .
Students' grades may be based on the knowledge and skill they possess at the end of a course
compared to their level of achievement at the beginning of the course. Large gains are assigned
high grades and small gains are represented by low grades. Students who enter a course with
some pre-course knowledge are obviously penalized; they have less to gain from a course than
does a relatively naive student. The posttest-pretest gain score is more error-laden, from a
measurement perspective, than either of the scores from which it is derived. Though growth is
certainly important when assessing the impact of instruction, it is less useful as a basis for
determining course grades than end-of-course competence. The value of grades which reflect
growth in a college-level course is probably minimal.
Relative to Ability . . .
Course grades might represent the amount students learned in a course relative to how much they
could be expected to learn as predicted from their measured academic ability. Students with high
ability scores (e.g., scores on the Scholastic Aptitude Test or American College Test) would be
expected to achieve higher final examination scores than those with lower ability scores. When grades
are based on comparisons with predicted ability, an "overachiever" and an "underachiever" may
receive the same grade in a particular course, yet their levels of competence with respect to the course
content may be vastly different. The first student may not be prepared to take a more advanced
course, but the second student may be. A course grade may, in part, reflect the amount of effort the
instructor believes a student has put into a course. The high ability students who can
satisfy course requirements with minimal effort are penalized for their apparent "lack" of
effort. Since the letter grade alone does not communicate such information, the value of
ability-based grading does not warrant its use.
A single course grade should represent only one of the several grading comparisons noted
above. To expect a course grade to reflect more than one of these comparisons is too much of a
communication burden. Instructors who wish to communicate more than relative group standing,
or subject matter competence or level of effort, must find additional ways to provide such
information to each student. Suggestions for doing so are noted near the end of Section V of this
booklet.
1. Grades should conform to the practice in the institution in which the grading occurs.
2. Grading components should yield accurate information. Carefully written tests and/or
graded assignments (homework papers, projects) are keys to accurate grading.
3. Grading plans should be communicated to the class at the beginning of each semester. By
stating the grading procedures at the beginning of a course, the instructor is essentially
making a "contract" with the class about how each student is going to be evaluated. The
contract should provide the students with a clear understanding of the instructor's
expectations so that the students can structure their work efforts. Students should be
informed about: which course activities will be considered in their final grade; the
importance or weight of exams, quizzes, homework sets, papers and projects; and which
topics are more important than others.
4. Grading plans stated at the beginning of the course should not be changed without
thoughtful consideration and a complete explanation to the students. Two common
complaints found on students' post-course evaluations are that grading procedures stated
at the beginning of the course were either inconsistently followed or were changed
without explanation or even advance notice. One could look at the situation of altering
or inconsistently following the grading plan as being analogous to playing a game
wherein the rules arbitrarily change, sometimes without the players' knowledge.
5. The number of components or elements used to assign course grades should be large
enough to ensure high accuracy in grading. From a decision-making point of view, the
more pieces of information available to the decision-maker, the more confidence one can
have that the decision will be accurate and appropriate. This same principle applies to the
process of assigning grades. If only a final exam score is used to assign a course grade, the
adequacy of the grade will depend on how well the test covered all the relevant aspects of
course content and how typically the student performed on one specific day during a 2-3
hour period.
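Combining several graded components into a final mark is a weighted average. The sketch below uses the assessment split quoted at the start of this course outline (70% examination, 30% continuous assessment); the component scores themselves are invented:

```python
# Weights from the course assessment scheme; scores are hypothetical.
weights = {"final_exam": 0.70, "continuous_assessment": 0.30}
scores = {"final_exam": 64, "continuous_assessment": 82}

# Final mark = sum of (weight * component score).
final_mark = sum(weights[k] * scores[k] for k in weights)
print(round(final_mark, 1))  # 69.4
```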
CHAPTER 10
TYPES AND CATEGORIES OF EVALUATION
Objective
At the end of this topic, you should be able to:
Identify and explain the different categories of evaluation
10.0 Introduction
Definitions of educational evaluation in general and course evaluation in particular abound in the
literature, but only a few will be highlighted. One of the earliest and most common definitions of
evaluation is that it is “the process of determining to what extent the educational objectives are
actually being realized” (Tyler, 1949, p. 69). Evaluation has been conceived either as the
assessment of the merit and worth of educational programmes (Guba and Lincoln, 1981;
Glatthorn, 1987; and Scriven, 1991), or as the acquisition and analysis of information on a given
educational programme for the purpose of decision making (Nevo, 1986; Shiundu and Omulando,
1992; and Teachers Proficiency Course Training Manual, 2007).
Evaluation is a vital concept in any education system. In fact, the success or failure of any
programme in education may be attributed nearly entirely to the quality and quantity of evaluation
done at the beginning of, and during the implementation of the programme. Yet evaluation
remains one of the least developed aspects of formal education in general and curriculum in
particular. This is most probably due to the often narrow and simplistic conceptions of course
evaluation among most stakeholders in education, including some professional educators. It is,
therefore, very important to conceive course evaluation in its broad sense.
Summative evaluation is carried out at the end of a programme and
tests whether the stated objectives of the programme have been achieved. The terminal
examinations such as Kenya Certificate of Primary Education (KCPE) and the Kenya Certificate
of Secondary Education (KCSE) examinations should not be confused with summative evaluation,
but they contribute significantly towards this type of evaluation (Marsh & Willis, 2007;
Teachers Proficiency Course, 2007). From a broader perspective, summative evaluation includes
the evaluation of the teacher’s performance in using the curriculum, the infrastructure, the
learning/teaching resources, time allocation, administrative support, the cost of the programme,
and the impact of the programme. The findings of summative evaluation may lead to curriculum
continuity, enhancement, or change (Shiundu & Omulando, 1992).
Review Questions
1. What is evaluation?
2. Explain what is meant by diagnostic evaluation.
3. Differentiate between summative and formative evaluation.