PURPOSE OF THE COURSE
To help the teacher trainee gain proper skills in constructing tests and interpreting the test
result for high quality teaching
COURSE CONTENT
Test construction; importance of validity and reliability of tests, different ways of ascertaining
validity and reliability of tests. Construction of test items; guidelines in the making of
valid test items, testing for validity and reliability of test items, preparation of marking
schemes; scoring of tests, moderation of tests. Test item analysis; difficulty index,
discrimination index.
Interpreting test results; scoring and grading, analyzing examination results, reporting
test results. Types and categories of evaluation; formative evaluation, Summative
evaluation, placement evaluation, diagnostic evaluation.
Instructional Materials and Equipment
Projector
Course Assessment
Examination 70%
Continuous Assessments (Exercises and Tests) 30%
Total 100%
Objectives
At the end of this topic, you should be able to:
Define terms used in educational measurement and evaluation
Explain the importance of measurement and evaluation in education
Identify and explain approaches used in educational measurement and evaluation
1.0 Introduction
Educational measurement and evaluation is the study of methods, approaches and strategies used to
measure, assess and evaluate in the educational setting. Evaluation has been conceived either as the
assessment of the merit and worth of educational programmes (Guba and Lincoln, 1981; Glatthorn,
1987; and Scriven, 1991), or as the acquisition and analysis of information on a given educational
programme for the purpose of decision making (Nevo, 1986; Shiundu & Omulando, 1992). The course
involves the study of tests, test construction and item construction as well as statistical procedures used
to analyze tests and test results.
Test: A method to determine a student's ability to complete certain tasks or demonstrate mastery
of a skill or knowledge of content. Some types would be multiple choice tests, or a weekly
spelling test. While it is commonly used interchangeably with assessment, or even evaluation, it
can be distinguished by the fact that a test is one form of an assessment. A test or assessment
yields information relative to an objective or goal. In that sense, we test or assess to determine
whether or not an objective or goal has been obtained.
knowledge in a variety of ways, but there is always a leap, an inference that we make about
what a person does in relation to what it signifies about what he knows.
Evaluation: Procedures used to determine whether the subject (i.e. the student) meets preset
criteria, such as qualifying for special education services. This uses assessment (remember that
an assessment may be a test) to make a determination of qualification in accordance with
predetermined criteria. Evaluation is perhaps the most complex and least understood of the
terms. Inherent in the idea of evaluation is "value." When we evaluate, what we are doing is
engaging in some process that is designed to provide information that will help us make a
judgment about a given situation. Generally, any evaluation process requires information about
the situation in question. When we evaluate, we are saying that the process will yield
information regarding the worthiness, appropriateness, goodness, validity, legality, etc., of
something for which a reliable measurement or assessment has been made.
We evaluate every day. Teachers, in particular, are constantly evaluating students, and such
evaluations are usually done in the context of comparisons between what was intended
(learning, progress, behavior) and what was obtained.
When used in a learning objective, the definition of evaluate is: To classify objects, situations,
people, conditions, etc., according to defined criteria of quality. Indication of quality must be
given in the defined criteria of each class category.
Measurement refers to the process by which the attributes or dimensions of some physical
object are determined. One exception seems to be in the use of the word measure in
determining the IQ of a person, attitudes or preferences.
However, when we measure, we generally use some standard instrument to determine how big,
tall, heavy, voluminous, hot, cold, fast, or straight something actually is. Standard instruments
refer to instruments such as rulers, scales, thermometers, pressure gauges, etc. We measure to
obtain information about what is. Such information may or may not be useful, depending on
the accuracy of the instruments we use, and our skill at using them.
To sum up, we measure distance, we assess learning, and we evaluate results in terms of
some set of criteria. These three terms are certainly connected, but it is useful to think of them
as separate but connected ideas and processes.
More specifically, assessment is a method the teacher uses to make decisions on learners’
progress. It is an essential process in teaching and learning as it enables the teacher to evaluate
the level and extent of learners’ achievement of the set objectives.
ii. Determines how much knowledge the learners have grasped
iii. Establishes how the learners have mastered skills taught and acquired attitudes
iv. Detects the difficulties and challenges learners are encountering, which forms the basis
for remedial teaching
v. Checks the effectiveness of the use of resources and methods of instruction
vi. Provides basis for learner promotion and reward.
vii. Provide information to school administration, parents and stakeholders for necessary
action.
viii. Motivates and directs learning
ix. Provides feedback to students on their performance
x. Provides feedback on instruction and/or the curriculum
Good assessment can help students become more effective self-directed learners (Angelo
and Cross, 1993). Well-designed assessment strategies also play a critical role in educational
decision-making and are a vital component of ongoing quality improvement processes at the
lesson, course and/or curriculum level.
Summative assessment is used primarily to make decisions for grading or determine readiness
for progression. Typically summative assessment occurs at the end of an educational activity and
is designed to judge the learner’s overall performance. In addition to providing the basis for
grade assignment, summative assessment is used to communicate students’ abilities to external
stakeholders, e.g., administrators and employers.
Formal assessment occurs when students are aware that the task that they are doing is for
assessment purposes, e.g., a written examination or OSCE. Most formal assessments also
are summative in nature and thus tend to have greater motivational impact and are associated
with increased stress. Given their role in decision-making, formal assessments should be
held to higher standards of reliability and validity than informal assessments.
Final (or terminal) assessment is that which takes place only at the end of a learning activity. It
is most appropriate when learning can only be assessed as a complete whole rather than as
constituent parts. Typically, final assessment is used for summative decision-making. Obviously,
due to its timing, final assessment cannot be used for formative purposes.
A convergent assessment has only one correct response (per item). Objective test items are the
best example and demonstrate the value of this approach in assessing knowledge. Obviously,
convergent assessments are easier to evaluate or score than divergent assessments.
Unfortunately, this “ease of use” often leads to widespread application of this approach
even when it runs contrary to good assessment practices.
CHAPTER 2
INSTRUCTIONAL OBJECTIVES
Objectives
At the end of this topic, you should be able to:
i. Formulate instructional objectives
ii. Explain the importance of Bloom's taxonomy of educational objectives
2.0 Introduction
Instructional objectives are statements of what is to be achieved at the end of the instructional
process. They are, therefore, the subject of assessment and evaluation. This chapter discusses the
importance of instructional objectives, and their formulation.
OR
Without using a calculator, calculate the average of a list of numbers.
Specify application criteria - identify any desired levels of speed, accuracy, quality, quantity, etc.
For example: Given a calculator, calculate averages from a list of numbers correctly, all the time.
OR
Given a spreadsheet package, compute variances from a list of numbers rounded to the second
decimal point.
Review each learning outcome to be sure it is complete, clear and concise.
Comprehension
Comprehension entails the understanding of information so as to be able to translate,
perceive and interpret instruction and problems. Learners should be able to classify, cite,
convert, describe, discuss, explain, give examples, paraphrase, restate in own words,
summarize, understand, distinguish and rewrite.
Application
Application is using previously learnt information in new situations to solve problems. Those
experiencing this level of learning should show the capacity to apply, change, compute, modify,
predict, prepare, relate, solve, show, use and produce.
Analysis
Analysis refers to the ability to break down informational materials into their component parts
examine them and understand the organizational structure. It may involve identifying motives or
causes, making inferences or finding evidence to support generalizations. At this level, learners
should be able to break down, correlate, discriminate, differentiate, distinguish, focus, illustrate,
infer, limit, outline, point out, prioritize, recognize, separate, subdivide, select and compare.
Synthesis
Synthesis refers to the building of structures or patterns from various kinds of elements.
Learners at this level can put parts together to form a whole that has a new meaning or structure.
Key words in use for this level of learning include categorize, combine, compose, create and design.
Evaluation
Evaluation refers to the process of making judgments about information, its value and quality.
Learning at this level includes appraising, comparing and contrasting, defending, judging,
interpreting, justifying, discriminating and evaluating.
Valuing
This describes the value or worth that a learner attaches to a particular object,
phenomenon or behaviour.
Key terms include complete, join, demonstrate, differentiate, explain, form, initiate,
write, justify, propose, report, share, study and work.
Perception
This is the ability to use the senses to guide motor activity. Individuals experiencing learning at
this level should be able to choose, describe, detect, differentiate, distinguish, identify, isolate,
select.
Set
This indicates the readiness to act. At this level, individuals should be able to display, explain,
move, proceed, restate and volunteer.
Guided response
This refers to the early stages of learning a complex skill that include imitation, and trial and
error. Achievement at this level is attained by practicing. Learning is demonstrated by copying,
tracing, following, reacting and reproducing.
Mechanism
This refers to the intermediate stage of learning a complex skill. Learnt responses should
become habitual, confident and proficient. Learners can assemble, construct, dismantle, fasten,
fix, grind, heat, measure, mend, sketch, organize and calibrate.
Adaptation
Individuals experiencing learning at this level have well-developed skills and are able to make
modifications to fit special requirements. Learners at this level adapt, alter, change, rearrange
and reorganize.
Origination
This involves creativity. The learner can create new movement patterns to suit different
situations. Individuals at this level should demonstrate the ability to arrange, build, combine,
compose, construct, design, initiate and make.
CHAPTER 3
TEST CONSTRUCTION
Objectives
At the end of this topic, you should be able to:
i. Define basic terms
ii. Explain the importance of validity and reliability of tests
iii. Discuss different ways of ascertaining validity and reliability of tests
3.0 Introduction
Test construction is the process of building a test. For a test to be deemed good, its
reliability and validity must be determined. This chapter discusses test validity and reliability.
3.2 Types of Reliability
1. Test-retest reliability is the degree to which scores are consistent over time. Test-retest
reliability is a measure of reliability obtained by administering the same test twice over a period
of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in
order to evaluate the test for stability over time. Example: A test designed to assess student
learning in psychology could be given to a group of students twice, with the second
administration perhaps coming a week after the first. The obtained correlation coefficient
would indicate the stability of the scores.
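The correlation step described above can be sketched in code. This is only an illustration: the Pearson formula is standard, but the student scores below are invented for the example.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores of five students on the same psychology test,
# administered twice, one week apart.
time1 = [78, 65, 90, 52, 70]
time2 = [80, 63, 88, 55, 72]

print(round(pearson(time1, time2), 2))  # a value near 1 indicates stable scores
```

A coefficient close to 1 indicates stability over time; a low coefficient suggests the scores are not consistent between administrations.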
2. Parallel forms reliability/ Equivalent-Forms or Alternate-Forms Reliability:
Two tests that are identical in every way except for the actual items included. Used when
it is likely that test takers will recall responses made during the first session and when alternate
forms are available. Correlate the two scores. The obtained coefficient is called the coefficient of
stability or coefficient of equivalence. Problem: Difficulty of constructing two forms that are
essentially equivalent.
It is a measure of reliability obtained by administering different versions of an assessment tool
(both versions must contain items that probe the same construct, skill, knowledge base, etc.) to
the same group of individuals. The scores from the two versions can then be correlated in order
to evaluate the consistency of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment, you
might create a large set of items that all pertain to critical thinking and then randomly split
the questions up into two sets, which would represent the parallel forms. Both of the above
require two administrations.
3. Inter-rater reliability is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-rater reliability is useful
because human observers will not necessarily interpret answers the same way; raters may
disagree as to how well certain responses or material demonstrate knowledge of the construct
or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating the degree
to which art portfolios meet certain standards. Inter-rater reliability is especially useful when
judgments can be considered relatively subjective. Thus, the use of this type of reliability would
probably be more likely when evaluating artwork as opposed to math problems.
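One way the degree of agreement between two raters might be quantified is simple percent agreement together with Cohen's kappa, which corrects for agreement expected by chance. The sketch below is illustrative; the two judges and their portfolio ratings are invented.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of items on which two raters give the same rating."""
    return sum(1 for a, b in zip(r1, r2) if a == b) / len(r1)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Probability the raters would agree by chance, from their rating frequencies.
    expected = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings by two judges of ten art portfolios against a pass/fail standard.
judge1 = ["pass", "pass", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]
judge2 = ["pass", "fail", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]

print(percent_agreement(judge1, judge2))        # 0.9
print(round(cohens_kappa(judge1, judge2), 2))   # 0.78
```

Kappa is lower than raw agreement because some of the raters' matching decisions would be expected even if they rated at random.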
4. Internal consistency reliability. It is determining how all items on the test relate to all other
items. It is a measure of reliability used to evaluate the degree to which different test items that
probe the same construct produce similar results.
A. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained
by taking all of the items on a test that probe the same construct (e.g., reading comprehension),
determining the correlation coefficient for each pair of items, and finally taking the average of
all of these correlation coefficients. This final step yields the average inter-item correlation.
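The steps just described (pair up the items, correlate each pair, average the coefficients) can be sketched as follows. The pearson helper and the item scores are illustrative assumptions, not part of the original text.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def average_inter_item_correlation(items):
    """items: one list of student scores per test item, all probing the same construct."""
    pairs = list(combinations(items, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores of six students on three reading-comprehension items.
item_scores = [
    [4, 3, 5, 2, 4, 3],
    [5, 3, 4, 2, 5, 3],
    [4, 2, 5, 3, 4, 2],
]
print(round(average_inter_item_correlation(item_scores), 2))
```

A high average suggests the items are measuring the same construct consistently.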
B. Split-half reliability is another subtype of internal consistency reliability. The process of
obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to
probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The
entire test is administered to a group of individuals, the total score for each “set” is computed,
and finally the split-half reliability is obtained by determining the correlation between the two
total “set” scores. Requires only one administration. Especially appropriate when the test is very
long. The most commonly used method to split the test into two is using the odd-even strategy.
Since longer tests tend to be more reliable, and since split-half reliability represents the reliability
of a test only half as long as the actual test, a correction formula must be applied to the
coefficient: the Spearman-Brown prophecy formula. Split-half reliability is a form of internal
consistency reliability.
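The whole procedure, including the Spearman-Brown correction, can be sketched as follows. The odd-even split is the strategy named above; the item-response data are invented for the example.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_brown(r_half):
    """Spearman-Brown prophecy formula: corrects a half-test correlation
    up to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(scores):
    """scores: one list of item scores (1 = correct, 0 = wrong) per student.
    Splits the items odd-even, totals each half, correlates the two half
    totals, then applies the Spearman-Brown correction."""
    odd_totals = [sum(s[0::2]) for s in scores]
    even_totals = [sum(s[1::2]) for s in scores]
    return spearman_brown(pearson(odd_totals, even_totals))

# Hypothetical responses of five students to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
]
print(round(split_half_reliability(responses), 2))
```

Note how the correction raises the half-test correlation: for instance, a half-test correlation of 0.6 corrects to 2(0.6)/(1 + 0.6) = 0.75 for the full test.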
3.3 Validity
Validity refers to how well a test measures what it is purported to measure or the extent to which
a test measures what it is supposed to measure.
Why is it necessary?
While reliability is necessary, it alone is not sufficient. For a test to be valid, it must also be
reliable, yet a reliable test is not necessarily valid. For example, if your scale is off by 5 lbs, it
reads your weight every day with an excess of 5 lbs. The scale is reliable because it consistently
reports the same weight every day, but it is not valid because it adds 5 lbs to your true weight. It
is not a valid measure of your weight.
3.4 Types of Validity
1. Content Validity:
When we want to find out if the entire content of the behavior/construct/area is
represented in the test we compare the test task with the content of the behavior. This is a logical
method, not an empirical one. Example, if we want to test knowledge on American Geography it
is not fair to have most questions limited to the geography of New England.
2. Face Validity:
Basically face validity refers to the degree to which a test appears to measure what it
purports to measure. Face Validity ascertains that the measure appears to be assessing the
intended construct under study. The stakeholders can easily assess face validity. Although this is
not a very “scientific” type of validity, it may be an essential component in enlisting motivation
of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the
ability, they may become disengaged with the task.
Example: If a measure of art appreciation is created all of the items should be related to the
different components and types of art. If the questions are regarding historical time periods, with
no reference to any artistic movement, stakeholders may not be motivated to give their best effort
or invest in this measure because they do not believe it is a true assessment of art appreciation.
3. Construct Validity.
Construct validity is the degree to which a test measures an intended hypothetical construct. It
is used to ensure that the measure actually measures what it is intended to measure (i.e. the
construct), and not other variables. Using a panel of “experts” familiar with the construct is a
way in which this type of validity can be assessed. The experts can examine the items and decide
what each specific item is intended to measure. Students can be involved in this process to obtain
their feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. The questions are written with complicated wording and phrasing.
This can cause the test to inadvertently become a test of reading comprehension, rather than a
test of women’s studies. It is important that the measure is actually assessing the intended
construct, rather than an extraneous factor.
4. Criterion-Related Validity
When you are expecting a future performance based on the scores obtained currently by the
measure, correlate the scores obtained with the performance. The later performance is called the
criterion and the current score is the prediction. This is an empirical check on the value of the
test – a criterion-oriented or predictive validation. It is used to predict future or current
performance - it correlates test results with another criterion of interest.
Example: Suppose a physics program designed a measure to assess cumulative student learning
throughout the major. The new measure could be correlated with a standardized measure of
ability in this discipline, such as an ETS field test or the GRE subject test. The higher the
correlation between the established measure and new measure, the more faith stakeholders
can have in the new assessment tool.
5. Formative Validity
When applied to outcomes assessment it is used to assess how well a measure is able to
provide information to help improve the program under study. Example: When designing a
rubric for history, one could assess students' knowledge across the discipline. If the measure can
provide information that students are lacking knowledge in a certain area, for instance the Civil
Rights Movement, then that assessment tool is providing meaningful information that can be
used to improve the course or program requirements.
6. Concurrent Validity:
Concurrent validity is the degree to which the scores on a test are related to the scores on
another, already established, test administered at the same time, or to some other valid criterion
available at the same time. Example: a new simple test is to be used in place of an old,
cumbersome one which is considered useful; measurements are obtained on both at the same
time. Logically, predictive and concurrent validation are the same; the term concurrent
validation is used to indicate that no time elapses between measures.
7. Sampling Validity (similar to content validity)
Ensures that the measure covers the broad range of areas within the concept under study. Not
everything can be covered, so items need to be sampled from all of the domains. This may need
to be completed using a panel of “experts” to ensure that the content area is adequately sampled.
Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an individual
personally feels are the most important or relevant areas).
Example: When designing an assessment of learning in the theatre department, it would not be
sufficient to only cover issues related to acting. Other areas of theatre such as lighting, sound,
functions of stage managers should all be included. The assessment should reflect the content
area in its entirety.
CHAPTER 4
TYPES OF TESTS
Objectives
At the end of this topic, you should be able to:
1. Discuss different types of tests
2. Explain Intelligence tests
4.0 Introduction
In chapter one a test was defined as a method to determine a student's ability to complete
certain tasks or demonstrate mastery of a skill or knowledge of content. Some types would be
multiple choice tests, or a weekly spelling test. This chapter discusses the types of tests as well
as the qualities of a good test.
4.1 Types of Tests
Oral questions
These are questions that the teacher asks on a continuous basis to assess the learners'
progress during a lesson.
Quizzes
These are short-answer questions which the teacher uses to determine the level of
mastery of specific content.
Observation
Some teaching and learning activities can be best assessed through observation. These include
learners’ participation in debates and discussions and group work. The teacher can use check
lists and observation schedules as shown in the table below.
Name           Activity     Behavior to be observed      Observation made
Tabitha Neema  Group work   Preparation                  Displayed adequate preparation in terms of content
                            Willingness to contribute    Contributed adequately
                            Attitude                     Positive attitude
                            Expression                   Not audible and orderly
The information gained from such observation would be the basis for personal academic
counseling and guidance
Projects
The teacher should assign projects to students as individuals or as groups. When assigning
projects, the learners need to be given enough information with regard to the scope of the project
and the mode of reporting the findings as well as the source of materials. Although different
projects have a different focus, when assessing, generally look out for:
i. Neatness
ii. Relevance
iii. Accuracy
iv. Completeness
Assignments
Assignments are tasks or responsibilities given by the teacher to learners to perform. The teacher
then grades the performance. A number of topics in geography give opportunity to assign
learners tasks that can be evaluated. These assignments include:
reading selected texts and reference materials
making guided notes
writing reports on specific topics or field visits
writing essays
evaluation is coming up. Exams can be great motivators.
To add variety to student learning. Exams are a form of learning activity. They
can enable students to see the material from a different perspective. They also
provide feedback that students can then use to improve their understanding.
To identify faults and correct them. Exams enable both students and instructors to
identify which areas of the material taught are not being understood properly. This
allows students to seek help, and instructors to address areas that may need more
attention, thus enabling student progression and improvement.
To obtain feedback. You can use exams to evaluate your own teaching. Students’
performance on the exam will pinpoint areas where you should spend more time or
change your current approach.
To provide statistics for the course or institution. Institutions often want information
on how students are doing. How many are passing and failing, and what is the average
achievement in class? Exams can provide this information.
To accredit qualified students. Certain professions demand that students demonstrate the
acquisition of certain skills or knowledge. An exam can provide such proof – for
example,.
ensuring standards of progression are met
should be able to identify the characteristics of a satisfactory answer and understand the
relative importance of those characteristics. This can be achieved in many ways; you
can provide feedback on assignments, describe your expectations in class, or post model
solutions on a course website.
Timely. Spread exams out over the semester. Giving two exams one week apart doesn’t
give students adequate time to receive and respond to the feedback provided by the first
exam. When possible, plan the exams to fit logically within the flow of the course
material. It might be helpful to place tests at the end of important learning units rather than
simply give a midterm halfway through the semester.
Review Questions
1. What is a test?
2. Why are tests given?
3. Identify and discuss any 4 types of tests
4. Analyze the characteristics of a good test
CHAPTER 5
PREPARATION OF TABLE OF SPECIFICATIONS
Objectives
At the end of this topic, you should be able to:
i. Define basic terms
ii. Explain the Importance of table of specification
iii. Describe the Preparation of table of specification
5.0 Introduction
The Table of Specifications is a blueprint for the preparation of an exam. It serves as the "map"
or guide to assigning the appropriate number of items to topics included in the course or subject.
Essentially, a table of specifications is a chart that breaks down the topics that will be on a
test and the number of test questions or the percentage of weight each section will have on the
final test grade.
This type of table is mainly used by teachers to help break down their testing outline on a
specific subject. Some teachers use this particular table as their teaching guideline. For many
teachers, a table of specification is both part of the process of test building and a product of
the test building process.
iii. Decide on the number of items that you would like the test to have. Let's say
you wanted a 16-item test; the number of items per topic would then be:
a. weather station - 25 % ------------------------4
b. elements of the weather – 37.5 % ---------- 6
c. the atmosphere – 37.5 % ------------------- 6
This gives a total of 16 items.
iv. Assign the specific type of question you would like to ask depending on what
skill or cognitive learning you would like to emphasize.
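The allocation step above can be sketched as a small computation. The topic names and weights come from the weather example; the code itself is only illustrative. With less convenient percentages, the rounded counts may not sum to the intended total and would need manual adjustment.

```python
# Topic weights from the worked example (proportion of instruction per topic).
topic_weights = {
    "weather station": 0.25,
    "elements of the weather": 0.375,
    "the atmosphere": 0.375,
}
total_items = 16

# Number of items per topic = weight x total number of items.
allocation = {topic: round(w * total_items) for topic, w in topic_weights.items()}
print(allocation)
# {'weather station': 4, 'elements of the weather': 6, 'the atmosphere': 6}
```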
Columns 1 and 2 on the left hand side list the specific content areas taught in the unit. For
example, the content of a math unit could be whole numbers and decimals, addition and
subtraction, numerical form, and expanded form. These are the areas of content taught.
Column three of the table is a summary of the number of questions for each content area
and is completed after the questions have been written or determined in advance based on
classroom instruction devoted to each content area.
Column four is the percent of questions devoted to each content area. This is calculated
by taking the number of questions per area and dividing by the number of questions on the entire
test.
Columns 5-10 are based on Bloom’s Taxonomy. The question number is listed in the
column and row that best describes the content and level of critical thinking required to answer
the question.
Review Questions
1. What is a table of specifications?
2. Why should teachers construct a table of specifications?
3. Outline the steps in constructing a table of specifications
CHAPTER 6
CONSTRUCTION OF TEST ITEMS
Objectives
At the end of this topic, you should be able to:
Discuss the guidelines when writing different kinds of test items
6.0 Introduction
For effective writing of test items, certain guidelines are necessary. The following section
examines the guidelines when writing different test items.
Selected response (objective) assessment items are very efficient – once the items are created,
you can assess and score a great deal of content rather quickly. Note that the term objective refers
to the fact that each question has a right and wrong answer and that they can be impartially
scored. In fact, the scoring can be automated if you have access to an optical scanner for scoring
paper tests or a computer for computerized tests. However, the construction of these “objective”
items might well include subjective input by the teacher/creator.
a) Multiple Choice
Multiple choice questions consist of a stem (question or statement) with several answer
choices (the correct answer plus distractors). Important points in constructing these tests are:
All answer choices should be plausible and homogeneous.
Example
What does peaked mean?
A. was sharp
B. was at its height
C. was mountainous
D. was rising
Non-Example
What does peaked mean?
A. was pale
B. was at its height
C. was hot
D. was beautiful
D. The canyon was formed from rocks that came from other places.
b) Matching
Matching items consist of two lists of words, phrases, or images (often referred to as stems and
responses). Students review the list of stems and match each with a word, phrase, or image from
the list of responses.
Important points:
Alternatives should be short, homogeneous and arranged in logical order.
Example
Match the following equations (place the corresponding letter in the blank).
As a general rule, the stems should be longer and the responses should be shorter.
Example
Match the description of the flag to its country.
_____ Red background with white cross A. Austria
_____ Green background with large red circle B. Germany
_____ Two red stripes and one white stripe (in center) C. Denmark
_____ Red stripe (top); white stripe; green stripe (bottom) D. Bangladesh
E. Hungary
c) True/False items
True/false questions can appear to be easier to write; however, it is difficult to write effective
true/false questions. Also, the reliability of T/F questions is not generally very high because of the
high possibility of guessing. In most cases, T/F questions are not recommended.
Important points:
Statements should be completely true or completely false.
Example
The metric system is used consistently in Canada.
True
False
Use simple, easy-to-follow statements.
Example
Botany is a branch of the biological sciences that embraces the study of plants and plant life.
True
False
Avoid using negatives -- especially double negatives.
There is nothing illegal about staying home from school.
True
False
Avoid absolutes such as "always; never."
Example
The news and information posted on the CNN website is usually accurate.
True
False
Non-Example
The news and information posted on the CNN website is always accurate.
True
False
d) Fill-in-the-Blank Items
The simplest forms of constructed response questions are fill-in-the-blank or short answer
questions. For example, the question may take one of the following forms:
1. Who was the 16th president of the United States?
2. The 16th president of the United States was ___________________.
These assessments are relatively easy to construct, yet they have the potential to test recall,
rather than simply recognition. They also control for guessing, which can be a major factor,
especially for T/F or multiple choice questions.
When creating short answer items, make sure the question is clear and there is a single
correct answer. Here are a few guidelines, along with examples and non-examples.
Ask a direct question that has a definitive answer.
Example
Ms. Joyce has dinner with three of her friends. The four friends decide to split the cost equally.
The bill comes to $32.80, and the women plan to leave a 15% tip. How much should Ms. Joyce
pay for her share of the dinner? __________________
Non-Example
Ms. Joyce has dinner with three of her friends. The four friends decide to split the cost equally.
The bill comes to $32.80, and the women plan to leave a small tip. How much should Ms. Joyce
pay for her share of the dinner? ____________________
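For reference, the arithmetic behind the example can be checked in a few lines of Python:

```python
# Bill of $32.80 split equally among four friends, with a 15% tip added.
bill = 32.80
tip_rate = 0.15

total_with_tip = bill * (1 + tip_rate)  # 37.72
share = total_with_tip / 4              # each friend's portion
print(f"${share:.2f}")                  # $9.43
```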
If using fill-in-the-blank, use only one blank per item.
Example
Salt consists primarily of sodium and _____________.
Non-Example
_____________ consists primarily of ____________ and _____________.
If using fill-in-the-blank, place the blank near the end of the sentence.
Example
A ball is dropped from a height of 20 meters above the ground. As the ball falls, it increases
in speed. The kinetic and potential energies of the ball will be equal at __________ meters.
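A quick energy-conservation check confirms the intended answer: because the ball's total mechanical energy is fixed, its kinetic and potential energies are equal at exactly half the drop height.

```python
# For a ball dropped from rest at height h0, PE = m*g*h and
# KE = m*g*(h0 - h) by conservation of energy, so KE == PE when
# h = h0 / 2, independent of the ball's mass.
h0 = 20.0            # drop height in meters
h_equal = h0 / 2     # height at which KE equals PE
print(h_equal)       # 10.0
```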
Although constructed response assessments can more easily demand higher levels of
thinking, they are more difficult to score.
Essay questions are a more complex version of constructed response assessments. With essay
questions, there is one general question or proposition, and the student is asked to respond in
writing. This type of assessment is very powerful -- it allows the students to express themselves
and demonstrate their reasoning related to a topic. Essay questions often demand the use of
higher level thinking skills, such as analysis, synthesis, and evaluation.
Essay questions may appear to be easier to write than multiple choice and other question types, but
writing effective essay questions requires a great deal of thought and planning. If an essay
question is vague, it will be much more difficult for the students to answer and much more
difficult for the instructor to score. Well-written essay questions clearly specify the task and the features the response should address.
Example
Using details and information from the article (America’s Saltiest Sea: Great Salt Lake),
write an essay describing the Great Salt Lake. Include:
its history
its interesting features
why it is a landmark
Non-Example
Using details and information from the article (America’s Saltiest Sea: Great Salt
Lake), summarize the main points of the article.
Essay questions are used both as formative assessments (in classrooms) and summative
assessments (on standardized tests).
There are two major categories of essay questions -- short response (also referred to as restricted
or brief) and extended response.
a) Short Response
Short response questions are more focused and constrained than extended response questions.
For example, a short response might ask a student to "write an example," "list three reasons," or
"compare and contrast two techniques." The short response items on the Florida assessment
(FCAT) are designed to take about 5 minutes to complete and the student is allowed up to 8 lines
for each answer. The short responses are scored using a 2-point scoring rubric. A complete and
correct answer is worth 2 points. A partial answer is worth 1 point.
Sample Short Response Question
(form 3 Reading)
How are the scrub jay and the mockingbird different? Support your answer with details
and information from the article.
b) Extended Response
Extended responses can be much longer and more complex than short responses, but students should
be encouraged to remain focused and organized. Students have 14 lines for each answer to an
extended response item, and they are advised to spend approximately 10-15 minutes completing
each item. The extended responses are scored using a 4-point scoring rubric. A complete and
correct answer is worth 4 points. A partial answer is worth 1, 2, or 3 points.
Sample Extended Response Question
(form 1 Science)
Robert is designing a demonstration to display at his school’s science fair. He will show how
changing the position of a fulcrum on a lever changes the amount of force needed to lift an
object. To do this, Robert will use a piece of wood for a lever and a block of wood to act as a
fulcrum. He plans to move the fulcrum to different places on the lever to see how its
placement affects the force needed to lift an object.
Part A Identify at least two other actions that would make Robert’s demonstration better.
36
Part B Explain why each action would improve the demonstration.
Review Questions
1. Outline the guidelines for constructing multiple choice questions.
2. What are the characteristics of well-constructed matching items?
3. State the important points to observe when constructing true/false test items.
4. Identify two types of essay test items.
5. Outline the features of well-written essay questions.
CHAPTER 7:
PREPARATION OF MARKING SCHEME
Objectives
At the end of this topic, you should be able to:
Prepare marking schemes for different kinds of tests
Explain the meaning and purpose of moderation
Discuss how to moderate examinations/tests
7.0 Introduction
A marking scheme is a set of criteria used in assessing student learning.
Do not penalize the same error repeatedly. If an error is made early but carried through the
answer, you should only penalize it once if the rest of the response is sound.
Review the marking scheme after the exam. Once the exam has been written, read a
few answers and review your key. You may sometimes find that students have interpreted
your question in a way that is different from what you had intended. Students may come
up with excellent answers that may be slightly outside of what was asked. Consider
giving these students partial marks.
When marking, make notes on exams. These notes should make it clear why you gave
a particular mark. If exams are returned to the students, your notes will help them
understand their mistakes and correct them. They will also help you should students want
to review their exam long after it has been given, or if they appeal their grade.
Although essay questions are powerful assessment tools, they can be difficult to score. With
essays, there isn't a single, correct answer and it is almost impossible to use an automatic scantron
or computer-based system. In order to minimize the subjectivity and bias that may occur in the
assessment, teachers should prepare a list of criteria prior to scoring the essays. Consider, for
example, the following question and scoring criteria:
Organization -- up to 3 points for essay organization (e.g., introduction, well
expressed points, conclusion)
Brevity -- up to 1 point for appropriate brevity (i.e., no extraneous or "filler"
information)
No penalty for spelling, punctuation, or grammatical errors.
By outlining the criteria for assessment, the students know precisely how they will be assessed
and where they should concentrate their efforts. In addition, the instructor can provide
feedback that is less biased and more consistent. Additional techniques for scoring constructed
response items include:
Do not look at the student's name when you grade the essay.
Outline an exemplary response before reviewing student responses.
Scan through the responses and look for major discrepancies in the answers -- this
might indicate that the question was not clear.
If there are multiple questions, score Question #1 for all students, then Question #2, etc.
Use a scoring rubric that provides specific areas of feedback for the students
7.4 What is Moderation?
Moderation is a set of processes designed and implemented by the assessors/evaluators to:
Provide system-wide comparability of grades and scores derived from internal-based
assessment
Form the basis for valid and reliable assessment in schools
Maintain the quality of assessment and the credibility, validity and acceptability
of certificates.
Moderation is necessary for producing valid, credible and publicly acceptable certificates
in an assessment system.
Moderation provides for comparability of standards across the classes and schools.
Suppose an easy internal exam is moderated against a harder public exam. The top students
would retain their high scores, but students with an average performance would have their
scores pulled down.
This is because the internal exam is an easy one, and in order to make it comparable, we’re
stretching their marks to the same range. As a result, the good performers would continue
getting a top score. But poor performers who’ve gotten a better score than they would have in a
public exam lose out.
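One common approach to statistical moderation is linear scaling: internal marks are shifted and stretched so that their mean and spread match those of the external examination. The sketch below uses invented marks purely for illustration:

```python
from statistics import mean, pstdev

# Invented marks from an easy internal exam, and target statistics
# (mean and standard deviation) taken from the public examination.
internal = [93, 88, 75, 62, 55]
ext_mean, ext_sd = 60, 15

int_mean, int_sd = mean(internal), pstdev(internal)

# Shift and stretch each mark so the moderated set has the external
# exam's mean and standard deviation.
moderated = [ext_mean + (m - int_mean) * ext_sd / int_sd for m in internal]

for raw, mod in zip(internal, moderated):
    print(raw, "->", round(mod, 1))
```

The rank order is preserved: the top internal scorers stay on top, while average internal marks are pulled down toward the public-exam mean, which is exactly the effect described above.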
Review Questions
1. What is a marking scheme?
2. Why prepare a marking scheme?
3. Explain the guidelines for preparing marking schemes.
4. What is test moderation?
5. Outline the process of test moderation
CHAPTER 8
TEST ITEM ANALYSIS
Objectives
At the end of this topic, you should be able to:
1. Discuss how to determine the Difficulty index of a test
2. Discuss how to determine the Discrimination index of a test
8.0 Introduction
After you create your objective assessment items and give your test, how can you be sure that
the items are appropriate -- not too difficult and not too easy? How will you know if the test
effectively differentiates between students who do well on the overall test and those who do not?
An item analysis is a valuable, yet relatively easy, procedure that teachers can use to answer both
of these questions.
For example, suppose 30 students answered two multiple choice items and their answer
choices were distributed as follows:
Question    A     B     C     D
#1          0     3     24*   3
#2          12*   13    3     2
* Denotes correct answer.
For Question #1, we can see that A was not a very good distractor -- no one selected that answer.
We can also compute the difficulty of the item by dividing the number of students who choose
the correct answer (24) by the number of total students (30). Using this formula, the difficulty of
Question #1 (referred to as p) is equal to 24/30 or .80. A rough "rule-of-thumb" is that if the item
difficulty is more than .75, it is an easy item; if the difficulty is below .25, it is a difficult item.
Given these parameters, this item would be regarded as easy -- most (80%) of the students got
it correct. In contrast, Question #2 is much harder (12/30 = .40). In fact, on Question #2,
more students selected an incorrect answer (B) than selected the correct answer (A). This item
should be carefully analyzed to ensure that B is an appropriate distractor.
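The difficulty computation above can be sketched directly in Python, using the counts from the table:

```python
# Answer-choice counts for the two items (30 students per item).
counts = {
    "Q1": {"A": 0, "B": 3, "C": 24, "D": 3},
    "Q2": {"A": 12, "B": 13, "C": 3, "D": 2},
}
correct = {"Q1": "C", "Q2": "A"}

# Difficulty index p = (students choosing the correct answer) / (total students).
p_values = {}
for item, choices in counts.items():
    p_values[item] = choices[correct[item]] / sum(choices.values())
    print(item, "p =", round(p_values[item], 2))  # Q1 p = 0.8, Q2 p = 0.4
```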
An item's discrimination index (between 0 and 1) is positive when the students who did well
on the overall test chose the
correct answer for a specific item more often than the students who had a lower overall score.
If, however, you find that more of the low-performing students got a specific item correct, then
the item has a negative discrimination index (between -1 and 0). Let's look at an example.
Table 2 displays the results of ten questions on a quiz. Note that the students are arranged with
the top overall scorers at the top of the table.
Student    Total Score (%)    Q1    Q2    Q3
Asif       90                 1     0     1
Sam        90                 1     0     1
Jill       80                 0     0     1
Charlie    80                 1     0     1
Sonya      70                 1     0     1
Ruben      60                 1     0     0
Clay       60                 1     0     1
Kelley     50                 1     1     0
Justin     50                 1     1     0
Tonya      40                 0     1     0
"1" indicates the answer was correct; "0" indicates it was incorrect.
Follow these steps to determine the Difficulty Index and the Discrimination Index.
After the students are arranged with the highest overall scores at the top, count the number of
students in the upper and lower group who got each item correct. For Question #1, there were
4 students in the top half who got it correct, and 4 students in the bottom half.
Determine the Difficulty Index by dividing the number who got it correct by the total
number of students. For Question #1, this would be 8/10 or p=.80.
Determine the Discrimination Index by subtracting the number of students in the
lower group who got the item correct from the number of students in the upper group
who got the item correct. Then, divide by the number of students in each group (in this
case, there are five in each group). For Question #1, that means you would subtract 4
from 4, and divide by 5, which results in a Discrimination Index of 0.
The answers for Questions 1-3 are provided in the table below.
Question      Upper group correct    Lower group correct    Difficulty    Discrimination
Question 1    4                      4                      .80           0
Question 2    0                      3                      .30           -0.6
Question 3    5                      1                      .60           0.8
Now that we have the table filled in, what does it mean? We can see that Question #2 had a
difficulty index of .30 (meaning it was quite difficult), and it also had a negative discrimination
index of -0.6 (meaning that the low-performing students were more likely to get this item
correct). This question should be carefully analyzed, and probably deleted or changed. Our "best"
overall question is Question 3, which had a moderate difficulty level (.60), and discriminated
extremely well (0.8).
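The whole procedure -- ranking students, splitting the class into halves, and computing both indices -- can be sketched as follows, using the quiz data tabulated above:

```python
# Each student: (name, overall score %, [Q1, Q2, Q3] with 1 = correct, 0 = incorrect).
students = [
    ("Asif", 90, [1, 0, 1]), ("Sam", 90, [1, 0, 1]),
    ("Jill", 80, [0, 0, 1]), ("Charlie", 80, [1, 0, 1]),
    ("Sonya", 70, [1, 0, 1]), ("Ruben", 60, [1, 0, 0]),
    ("Clay", 60, [1, 0, 1]), ("Kelley", 50, [1, 1, 0]),
    ("Justin", 50, [1, 1, 0]), ("Tonya", 40, [0, 1, 0]),
]

# Sort by overall score and split into upper and lower halves.
ranked = sorted(students, key=lambda s: s[1], reverse=True)
half = len(ranked) // 2
upper, lower = ranked[:half], ranked[half:]

results = {}
for q in range(3):
    upper_correct = sum(s[2][q] for s in upper)
    lower_correct = sum(s[2][q] for s in lower)
    difficulty = (upper_correct + lower_correct) / len(ranked)
    discrimination = (upper_correct - lower_correct) / half
    results[f"Q{q + 1}"] = (difficulty, discrimination)
    print(f"Q{q + 1}: p = {difficulty:.2f}, D = {discrimination:+.2f}")
```

Running this reproduces the table: Question 2 is flagged immediately by its negative discrimination, while Question 3 shows moderate difficulty and strong discrimination.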
Another consideration for an item analysis is the cognitive level that is being assessed. For example,
you might categorize the questions based on Bloom's taxonomy (perhaps grouping questions that
address Level I and those that address Level II). In this manner, you would be able to determine if
the difficulty index and discrimination index of those groups of questions are appropriate. For
example, you might note that the majority of the questions that demand higher levels of thinking
skills are too difficult or do not discriminate well. You could then concentrate on improving those
questions and focus your instructional strategies on higher-level skills.
Review Questions
1. What is test item analysis?
2. Why is item analysis important?
3. Differentiate between the difficulty index and the discrimination index.
4. Explain what is meant by positive discrimination and negative discrimination.
CHAPTER 9
INTERPRETING TEST RESULTS
Objectives
At the end of this topic, you should be able to:
i. Discuss the scoring and grading of tests
ii. Discuss the analysis of examination results
iii. Explain the reporting of test results
9.0 Introduction
The process of evaluation does not end at analysis but must go on to the interpretation of the
analysed results. This section examines this process.
Grading comparisons
Some kind of comparison is being made when grades are assigned. For example, an instructor
may compare a student's performance to that of his or her classmates, to standards of excellence
(i.e., pre-determined objectives, contracts, professional standards) or to combinations of each.
Four common comparisons used to determine college and university grades and the major
advantages and disadvantages of each are discussed in the following section.
Relative to Other Students . . .
By comparing a student's overall course performance with that of some relevant group of
students, the instructor assigns a grade to show the student's level of achievement or standing
within that group. An "A" might not represent excellence in attainment of knowledge and skill if
the reference group as a whole is somewhat inept. All students enrolled in a course during a given
semester or all students enrolled in a course since its inception are examples of possible
comparison groups. The nature of the reference group used is the key to interpreting grades based
on comparisons with other students.
Advantages:
2. The system is a common one that many faculty members are familiar with. Given
additional information about the students, instructor, or college department, grades from
the system can be interpreted easily.
Disadvantages:
1. No matter how outstanding the reference group of students is, some will receive low
grades; no matter how low the overall achievement in the reference group, some students
will receive high grades. Grades are difficult to interpret without additional information
about the overall quality of the group.
2. Grading standards in a course tend to fluctuate with the quality of each class of students.
Standards are raised by the performance of a bright class and lowered by the performance of
a less able group of students. Often a student's grade depends on who was in the class.
3. There is usually a need to develop course "norms" which account for more than a single
class performance. Students of an instructor who is new to the course may be at a
particular disadvantage since the reference group will necessarily be small and very
possibly atypical compared with future classes.
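A norm-referenced scheme can be sketched as grading on a curve: each mark is converted to a z-score relative to the class, and letter grades are assigned by standing within the group. The scores and z-score cutoffs below are invented for illustration:

```python
from statistics import mean, pstdev

scores = [52, 61, 67, 70, 74, 78, 83, 90]  # one class's final marks
mu, sd = mean(scores), pstdev(scores)

def curved_grade(score):
    """Assign a letter grade from the score's standing within this class."""
    z = (score - mu) / sd
    if z >= 1.0:
        return "A"
    if z >= 0.3:
        return "B"
    if z >= -0.3:
        return "C"
    if z >= -1.0:
        return "D"
    return "F"

for s in scores:
    print(s, curved_grade(s))
```

Note the drawback described above: the same raw score earns a different grade in a stronger or weaker class, because the class mean and standard deviation shift with the reference group.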
Relative to Absolute Standards . . .
Grades may be obtained by comparing a student's performance with specified absolute standards
rather than with such relative standards as the work of other students. In this grading method,
the instructor is interested in indicating how much of a set of tasks or ideas a student knows,
rather than how many other students have mastered more or less of that domain. A "C" in an
introductory statistics class might indicate that the student has minimal knowledge of descriptive
and inferential statistics. A much higher achievement level would be required for an "A." Note
that students' grades depend on their level of content mastery; thus the levels of performance of
their classmates have no bearing on the final course grade. There are no quotas in each grade
category. It is possible in a given class that all students could receive an "A" or a "B."
Advantages:
1. Course goals and standards must necessarily be defined clearly and communicated to
the students.
2. Most students, if they work hard enough and receive adequate instruction, can obtain
high grades. The focus is on achieving course goals, not on competing for a grade.
3. Final course grades reflect achievement of course goals. The grade indicates "what"
a student knows rather than how well he or she has performed relative to the
reference group.
4. Students do not jeopardize their own grade if they help another student with course work.
Disadvantages:
1. It is difficult and time consuming to determine what course standards should be for each
possible course grade issued.
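Criterion-referenced grading can be sketched as a set of fixed cutoffs tied to content mastery, independent of how classmates perform. The percentage cutoffs below are invented for illustration; because there are no quotas, every student in a class could earn an "A":

```python
# Fixed mastery cutoffs: grade depends only on the student's own percentage.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def absolute_grade(percent):
    for cutoff, grade in CUTOFFS:
        if percent >= cutoff:
            return grade
    return "F"

print([absolute_grade(p) for p in (95, 84, 72, 58)])  # ['A', 'B', 'C', 'F']
```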
Relative to Improvement . . .
Students' grades may be based on the knowledge and skill they possess at the end of a course
compared to their level of achievement at the beginning of the course. Large gains are assigned
high grades and small gains are represented by low grades. Students who enter a course with
some pre-course knowledge are obviously penalized; they have less to gain from a course than
does a relatively naive student. The posttest-pretest gain score is more error-laden, from a
measurement perspective, than either of the scores from which it is derived. Though growth is
certainly important when assessing the impact of instruction, it is less useful as a basis for
determining course grades than end-of-course competence. The value of grades which reflect
growth in a college-level course is probably minimal.
Relative to Ability . . .
Course grades might represent the amount students learned in a course relative to how much they
could be expected to learn as predicted from their measured academic ability. Students with high
ability scores (e.g., scores on the Scholastic Aptitude Test or American College Test) would be
expected to achieve higher final examination scores than those with lower ability scores. When grades
are based on comparisons with predicted ability, an "overachiever" and an "underachiever" may
receive the same grade in a particular course, yet their levels of competence with respect to the course
content may be vastly different. The first student may not be prepared to take a more advanced
course, but the second student may be. A course grade may, in part, reflect the amount of effort the
instructor believes a student has put into a course. The high ability students who can
satisfy course requirements with minimal effort are penalized for their apparent "lack" of
effort. Since the letter grade alone does not communicate such information, the value of
ability-based grading does not warrant its use.
A single course grade should represent only one of the several grading comparisons noted
above. To expect a course grade to reflect more than one of these comparisons is too much of a
communication burden. Instructors who wish to communicate more than relative group standing,
or subject matter competence or level of effort, must find additional ways to provide such
information to each student. Suggestions for doing so are noted near the end of Section V of this
booklet.
1. Grades should conform to the practice in the institution in which the grading occurs.
2. Grading components should yield accurate information. Carefully written tests and/or
graded assignments (homework papers, projects) are keys to accurate grading.
3. Grading plans should be communicated to the class at the beginning of each semester. By
stating the grading procedures at the beginning of a course, the instructor is essentially
making a "contract" with the class about how each student is going to be evaluated. The
contract should provide the students with a clear understanding of the instructor's
expectations so that the students can structure their work efforts. Students should be
informed about: which course activities will be considered in their final grade; the
importance or weight of exams, quizzes, homework sets, papers and projects; and which
topics are more important than others.
4. Grading plans stated at the beginning of the course should not be changed without
thoughtful consideration and a complete explanation to the students. Two common
complaints found on students' post-course evaluations are that grading procedures stated
at the beginning of the course were either inconsistently followed or were changed
without explanation or even advance notice. One could look at the situation of altering
or inconsistently following the grading plan as being analogous to playing a game
wherein the rules arbitrarily change, sometimes without the players' knowledge.
5. The number of components or elements used to assign course grades should be large
enough to ensure high accuracy in grading. From a decision-making point of view, the
more pieces of information available to the decision-maker, the more confidence one can
have that the decision will be accurate and appropriate. This same principle applies to the
process of assigning grades. If only a final exam score is used to assign a course grade, the
adequacy of the grade will depend on how well the test covered all the relevant aspects of
course content and how typically the student performed on one specific day during a 2-3
hour period.
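Combining several graded components into a final mark is a weighted average. The sketch below uses the assessment split quoted at the start of this course outline (70% examination, 30% continuous assessment); the component scores themselves are invented:

```python
# Weights from the course assessment scheme; scores are hypothetical.
weights = {"final_exam": 0.70, "continuous_assessment": 0.30}
scores = {"final_exam": 64, "continuous_assessment": 82}

# Final mark = sum of (weight * component score).
final_mark = sum(weights[k] * scores[k] for k in weights)
print(round(final_mark, 1))  # 69.4
```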
CHAPTER 10
TYPES AND CATEGORIES OF EVALUATION
Objective
At the end of this topic, you should be able to:
Identify and explain the different categories of evaluation
10.0 Introduction
Definitions of educational evaluation in general and course evaluation in particular abound in the
literature, but only a few will be highlighted. One of the earliest and most common definitions of
evaluation is that it is “the process of determining to what extent the educational objectives are
actually being realized” (Tyler, 1949, p. 69). Evaluation has been conceived either as the
assessment of the merit and worth of educational programmes (Guba and Lincoln, 1981;
Glatthorn, 1987; and Scriven, 1991), or as the acquisition and analysis of information on a given
educational programme for the purpose of decision making (Nevo, 1986; Shiundu and Omulando,
1992; and Teachers Proficiency Course Training Manual, 2007).
Evaluation is a vital concept in any education system. In fact, the success or failure of any
programme in education may be attributed nearly entirely to the quality and quantity of evaluation
done at the beginning of, and during the implementation of the programme. Yet evaluation
remains one of the least developed aspects of formal education in general and curriculum in
particular. This is most probably due to the often narrow and simplistic conceptions of course
evaluation among most stakeholders in education, including some professional educators. It is,
therefore, very important to conceive course evaluation in its broad sense.
Summative evaluation is carried out at the end of a programme and
tests whether the stated objectives of the programme have been achieved. The terminal
examinations such as Kenya Certificate of Primary Education (KCPE) and the Kenya Certificate
of Secondary Education (KCSE) examinations should not be confused with summative evaluation,
but they contribute significantly towards this type of evaluation (Marsh & Willis, 2007;
Teachers Proficiency Course, 2007). From a broader perspective, summative evaluation includes
the evaluation of the teacher’s performance in using the curriculum, the infrastructure, the
learning/teaching resources, time allocation, administrative support, the cost of the programme,
and the impact of the programme. The findings of summative evaluation may lead to curriculum
continuity, enhancement, or change (Shiundu & Omulando, 1992).
Review Questions
1. What is evaluation?
2. Explain what is meant by diagnostic evaluation.
3. Differentiate between summative and formative evaluation.