Evaluation External Notes
Evaluation is the process of determining to what extent the educational objectives are
being realised. – Ralph Tyler
It is a continuous process.
Purpose of Evaluation
Purpose of Evaluation in Nursing Education
Principles of evaluation
1. Determining and clarifying what is to be evaluated always has priority in the evaluation
process.
4. Proper use of evaluation techniques requires an awareness of both their limitations and strengths.
Characteristics of Evaluation
Functions of Evaluation
Types of Evaluation
Formative evaluation is continuous, diagnostic and focused on both what students are doing well
and areas where they need to improve (Carnegie Mellon, n.d.). As the goal of formative
evaluation is to improve future performance, a mark or grade is not usually included (Gaberson,
Oermann & Shellenbarger, 2015; Marsh et al., 2005). Formative evaluations, sometimes referred
to as mid-term evaluations, should precede final or summative evaluation.
Summative evaluation summarizes how students have or have not achieved the outcomes and
competencies stipulated in course objectives (Carnegie Mellon, n.d.), and includes a mark or
grade. Summative evaluation can be completed at mid-term or at end of term. Both formative and
summative evaluation consider context. They can include measurement and assessment methods
noted previously as well as staff observations, written work, presentations and a variety of other
measures.
Diagnostic Evaluation:
• This type of evaluation is concerned with finding out the reasons for students' persistent or recurring learning difficulties that cannot be resolved by standard corrective measures or formative evaluation.
• The aim of diagnostic evaluation is to find out the causes of learning problems and to plan remedial actions.
• Observational techniques or specially prepared diagnostic techniques can be used to diagnose the problem.
TEST:
Test is defined as a series of questions on the basis of which information is sought.
Objectivity:
A test must have the trait of objectivity; that is, it must be free from subjective
elements so that there is complete interpersonal agreement among experts regarding
the meaning of items and the scoring of the test. Objectivity here refers to two
aspects of the test, i.e.:
Objectivity of Items:
By objectivity of items is meant that the items should be phrased in such a manner
that they are interpreted in exactly the same way by all those who are taking the test.
For ensuring objectivity of items, items must have uniformity of order of
presentation (either ascending or descending).
Objectivity of Scoring:
By objectivity of scoring is meant that the scoring method of the test should be a
standard one so that complete uniformity can be maintained when the test is scored
by different experts at different times.
Reliability:
A test must also be reliable. Reliability is the "self-correlation of the test." It shows the
extent to which the results obtained are consistent when the test is administered once
or more than once on the same sample with a reasonable gap. Consistency in results
obtained in a single administration is the index of internal consistency of the test, and
consistency in results obtained upon testing and retesting is the index of temporal
consistency. Reliability thus includes both internal consistency and temporal
consistency. A test, to be called sound, must be reliable, because reliability indicates
the extent to which the scores obtained in the test are free from such internal defects
of standardization as are likely to produce errors of measurement.
Validity:
Validity is another prerequisite for a test to be sound. Validity indicates the extent to
which the test measures what it intends to measure, when compared with some
outside independent criterion. In other words, it is the correlation of the test with some
outside criterion. The criterion should be an independent one and should be regarded as
the best index of the trait or ability being measured by the test. Generally, the validity of a
test depends upon its reliability, because a test which yields inconsistent results
(poor reliability) is ordinarily not expected to correlate with an outside
independent criterion.
Norms:
A test must also be guided by certain norms. Norms refer to the “average
performance of the representative sample on a given test.” There are four common
types of norms;
Age norm
Grade norm
Percentile norms
Standard score norms.
Depending upon the purpose and use, a test constructor prepares any of the above
norms for his test. Norms help in the interpretation of scores. In the absence of norms,
no meaning can be attached to the score obtained on the test.
Practicability:
A test must also be practicable from the point of view of the time taken in its
completion, its length, scoring, etc. In other words, the test should not be too lengthy,
and the scoring method must be neither difficult nor one which can only be carried out
by a highly specialized person.
TEST CONSTRUCTION
Principles of Test Construction
9. Make the test valid and reliable:
– A test is reliable when it produces dependable, consistent, and accurate scores.
– A test is valid when it measures what it purports to measure.
– Tests which are written clearly and unambiguously are more reliable.
– Tests with more items are more reliable than tests with fewer items.
– Tests which are well planned, cover wide objectives, and are well executed are more valid.
“GENERAL STEPS OF TEST CONSTRUCTION”
1. Planning
2. Writing items for the test.
3. Preliminary administration of the test.
4. Reliability of the final test.
5. Validity of the final test.
6. Preparation of norms for the final test.
7. Preparation of manual.
Each stage is briefly described below.
1. PLANNING:
The first step in test construction is careful planning. At this stage, the author
has to spell out the broad and specific objectives of the test in clear terms, that is,
the purpose or purposes for which the test will be used. The author also has to keep
in mind the following points:
• What will be the appropriate age range, educational level and cultural background of the examinees who would find it desirable to take the test?
• What will be the content of the test? Is this content coverage different from that of the existing tests developed for the same or similar purposes? Is it culture-specific?
• What would be the nature of the items? That is, decide whether the test will be multiple choice, true-false, inventive response, or in some other form.
• What would be the type of instructions, i.e., written or to be delivered orally?
• Will the test be administered individually or in groups? Will the test be designed or modified for computer administration?
• The test constructor must decide on the probable length and the time allowed for completion of the test.
• Is there any potential harm to the examinees resulting from the administration of this test? Are there any safeguards built into the recommended testing procedure to prevent any sort of harm to anyone involved in the use of this test?
• How will the scores be interpreted? Will the scores of an examinee be compared to others in a criterion group, or will they be used to assess mastery of a specific content area?
2. WRITING ITEMS FOR THE TEST:
Item:
A single question or task that is not often broken down into any smaller units. (Bean,
1953:15)
The second step in test construction is the preparation of the items of the test. Item
writing starts with the planning done earlier. If the test constructor decides to prepare an
essay test, then essay items are written down. However, if he decides to construct
an objective test, he writes down objective items such as the alternative-response
item, matching item, multiple-choice item, completion item, short-answer item,
pictorial item, etc. Depending upon the purpose, he decides to write any of
these objective types of items.
The item writer must have a thorough knowledge and complete mastery of the
subject matter. In other words, he must be fully acquainted with all the facts,
principles, misconceptions, and fallacies in a particular field so that he may be
able to write good and appropriate items.
The item writer must be fully aware of those persons for whom the test is
meant. He must also be aware of the intelligence level of those persons so that
he may manipulate the difficulty level of the items for proper adjustment with
their ability level. He must also be able to avoid irrelevant clues to correct
responses.
The item writer must be familiar with different types of items along with their
advantages and disadvantages. He must also be aware of the characteristics of
good items and the common probable errors in writing items.
The item writer must have a large vocabulary. He must know the different
meanings of a word so that confusion in writing the items may be avoided. He
must be able to convey the meaning of the items in the simplest possible
language.
Always give due importance to the difficulty level of test items (easy, average,
difficult).
Before starting to write the items, divide the whole unit/ variable into sub-
units/ components and decide the weightage given to each sub-unit/
component.
Follow all these rules and write down the test items.
Arrangement of Items:
After the items have been written down, they are reviewed by some experts or by the
item writer himself and then arranged in the order in which they are to appear in the
final test. Generally, items are arranged in increasing order of difficulty, and those
having the same form (say alternative-response, matching, multiple-choice, etc.) and
dealing with the same content are placed together.
3. PRELIMINARY ADMINISTRATION:
Before proceeding toward administration, the test should be reviewed by at least three
experts. When the test items have been written down and modified in the light of the
suggestions and criticisms given by the experts, the test is said to be ready for
experimental try-out.
The Pre-Try-Out:
The first administration of the test is called the pre-try-out. The sample size for the
pre-try-out should be about 400. The main purposes of the pre-try-out of any
psychological or educational test are as follows:
ITEM ANALYSIS
Item analysis is a statistical technique which is used for selecting and rejecting the items of
a test on the basis of their difficulty value and discriminating power.
2. Take 27% of answer sheets having highest scores and mark them as Higher group (H).
3. Take 27% of answer sheets having lowest scores and mark them as Lower group (L)
4. Count the number of Right answers in the Higher group for each question and mark it as
(RH).
5. Count the number of Right answers in the Lower group for each question and mark it as
(RL).
Then calculate the item difficulty or Difficulty Index (DI) and the Discriminating Power (DP):

DI = (RH + RL) / (NH + NL)

DP = (RH - RL) / NH
or
DP = (RH - RL) / NL

where RH is the number of right answers in the higher group for a particular question,
RL is the number of right answers in the lower group for that question, and NH and NL
are the numbers of answer sheets in the higher and lower groups respectively.

After item analysis, items with a difficulty index between 0.25 and 0.75 and a
discriminating power greater than 0.4 are selected for the final test.
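The index computations and the selection rule above can be sketched in Python (the counts are hypothetical; NH and NL are the sizes of the two 27% groups):

```python
def item_analysis(rh, rl, nh, nl):
    """Compute the Difficulty Index (DI) and Discriminating Power (DP)
    for one item from the 27% higher (H) and lower (L) scoring groups.

    rh, rl -- number of right answers in the higher / lower group
    nh, nl -- number of examinees in the higher / lower group
    """
    di = (rh + rl) / (nh + nl)   # proportion answering correctly
    dp = (rh - rl) / nh          # NH = NL when equal 27% groups are used
    return di, dp

def keep_item(di, dp):
    """Selection rule from the text: 0.25 <= DI <= 0.75 and DP > 0.4."""
    return 0.25 <= di <= 0.75 and dp > 0.4

# Example: 100 examinees per group; 60 correct in H, 20 correct in L.
di, dp = item_analysis(60, 20, 100, 100)
print(di, dp)   # DI = 0.4, DP = 0.4 -> rejected (DP is not above 0.4)
```

The rule keeps items of moderate difficulty that clearly separate high scorers from low scorers.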
Based on the pre-try-out and item analysis, the author prepares the final test.
4. ESTABLISHING RELIABILITY OF THE FINAL TEST
Test-retest
Alternate form
Split-half method
Test-Retest Method:
It is the oldest and most commonly used method of testing reliability. The test-retest
method assesses the external consistency of a test: it measures the stability of a test
over time. Examples of appropriate tests include questionnaires and psychometric tests.
A typical assessment involves giving participants the same test on two separate
occasions, keeping everything from start to end the same in both administrations. The
results of the first test are then correlated with the results of the second test. If the
same or similar results are obtained, external reliability is established.
The timing of the retest is important. If the interval is too brief, participants may
recall information from the first test, which could bias the results. Alternatively, if the
interval is too long, it is feasible that the participants could have changed in some
important way, which could also bias the results.
The utility and worth of a psychological test decrease with time, so the test should be
revised and updated. When tests are not revised, systematic error may arise.
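As a sketch, the test-retest coefficient is simply the Pearson correlation between the two administrations; the scores below are hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of five examinees on the same test,
# administered twice with a reasonable gap between sittings.
first  = [12, 15, 11, 18, 14]
second = [13, 14, 12, 17, 15]

r = pearson_r(first, second)
print(round(r, 2))   # 0.95 -> high temporal consistency (stability)
```

A coefficient close to +1 indicates that the ranking of examinees is stable across the two administrations.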
Alternate Form:
In the alternate-form method, two equivalent forms of the test are administered to the same
group of examinees. An individual is given one form of the test, and after a period of
time the person is given a different version of the same test. The two forms of the test
are then correlated to yield a coefficient of equivalence.
Split-Half Method:
The split-half method assesses the internal consistency of a test. It measures the
extent to which all parts of the test contribute equally to what is being measured. The
test is typically split into odd-numbered and even-numbered items. The reason is that
when we make a test we usually arrange the items in order of increasing difficulty; if
we put items 1-10 in one half and items 11-20 in the other half, then all the easy
items would go into one group and all the difficult items into the second group.
When we split the test we should also split it within the same format/theme, e.g.,
multiple-choice items with multiple-choice items, or fill-in-the-blanks with
fill-in-the-blanks.
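A minimal sketch of the odd-even split follows. The 0/1 score matrix is made up, and the final step uses the standard Spearman-Brown prophecy formula (not named in the text) to correct the half-test correlation up to full test length:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_reliability(item_scores):
    """item_scores: one row per examinee, one 0/1 column per item.
    Split into odd- and even-numbered items (not first half vs second
    half, since items are ordered by difficulty), correlate the two
    half-scores, then step up to full length with Spearman-Brown."""
    odd  = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)                 # Spearman-Brown

# Hypothetical responses: 5 examinees x 6 items, 1 = correct.
data = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(data), 2))
```

The step-up correction is needed because correlating two half-length tests underestimates the reliability of the full-length test.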
5. ESTABLISHING VALIDITY OF THE FINAL TEST
Validity refers to the extent to which a test measures what it claims to measure.
If a test is reliable, it is not necessarily valid; but if a test is valid, it must be reliable.
Types of Validity:
Face validity
Construct validity
Criterion related validity
Face Validity
Face validity is determined by a review of the items and not through the use of
statistical analysis. Face validity is not investigated through formal procedures.
Instead anyone who looks over the test, including examinees, may develop an
informal opinion as to whether or not the test is measuring what it is supposed to
measure. While it is clearly of some value to have the test appear to be valid, face
validity alone is insufficient for establishing that the test is measuring what it claims
to measure.
Construct Validity:
It implies using the construct correctly (concepts, ideas, notions). Construct validity
seeks agreement between a theoretical concept and a specific measuring device or
procedure.
For example, a test of intelligence nowadays must include measures of multiple
intelligences, rather than just logical-mathematical and linguistic ability measures.
Criterion-Related Validity:
Criterion-related validity is used to demonstrate the accuracy of a measure or procedure
by comparing it with another measure or procedure which has already been demonstrated
to be valid.
6. ESTABLISHING NORMS:
Types of norms:
Age norms
Grade norms
Percentile norms
Standard scores norms
Not all of these types of norms are suited to every type of test. Keeping in view the
purpose and type of the test, the test constructor develops a suitable norm for the test.
Age Norm
Age norms indicate the average performance of different samples of test takers who
were at various ages at the time the test was administered.
The child of any chronological age whose performance on a valid test of intellectual
ability indicated that he or she had intellectual ability similar to that of the average
child of some other age was said to have the mental age of the norm group in which
his or her test score fell.
The reasoning here was that, irrespective of chronological age, children with the same
mental age could be expected to read the same level of material, solve the same kinds
of math problems, and reason with a similar level of judgment. But some have
complained that the concept of mental age is too broad: although a 6-year-old might,
for example, perform intellectually like a 12-year-old, the 6-year-old might not be very
similar at all to the average 12-year-old socially, psychologically and otherwise.
Grade Norms:
Grade norms are designed to indicate the average test performance of test takers in a
given school grade; they are developed by administering the test to representative
samples of children over a range of consecutive grade levels.
Like age norms, grade norms have widespread application with children of
elementary school age. The thought here is that children learn and develop at varying
rates but in ways that are in some respects predictable.
One drawback of grade norms is that they are useful only with respect to years and
months of schooling completed. They have little or no applicability to children who
are not yet in school or who are out of school.
Percentile Norms:
The percentile system is a ranking of test scores that indicates the proportion of scores
falling below a given score. A percentile is an expression of the percentage of people
whose score on a test or measure falls below a particular raw score. A more familiar
description of test performance, the concept of percentage correct, must be
distinguished from the concept of a percentile.
Because percentiles are easily calculated, they are a popular way of organizing test
data and are adaptable to a wide range of tests.
For example, marks obtained in a paper out of 100 are applicable only in a specific
area, but when they are converted into a GPA they become a standard score.
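The percentile rank described above can be sketched as follows; the norm-group scores are made up, and the definition used here counts only scores strictly below the given raw score (some texts also count half of the ties):

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in the norm group that fall below
    the given raw score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# Hypothetical norm group of 20 raw scores.
norms = [35, 41, 44, 47, 50, 52, 55, 57, 58, 60,
         61, 63, 65, 68, 70, 72, 75, 78, 82, 90]

print(percentile_rank(61, norms))   # 50.0 -> half the group scored lower
```

Note the contrast with percentage correct: a raw score of 61 out of 100 is 61% correct, but its percentile rank depends entirely on how the norm group performed.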
7. PREPARATION OF MANUAL AND REPRODUCTION OF THE TEST:
The last step in test construction is the preparation of a manual for the test. In the
manual, the test constructor reports the psychometric properties of the test, the norms
and the references. This gives a clear indication of the procedures of test
administration, the scoring methods and the time limits, if any, of the test. It also
includes instructions as well as details of the arrangement of the material, that is,
whether items have been arranged in random order or in some other order. The test
constructor finally orders the printing of the test and the manual.
MARKING vs GRADING
The system of examination in which marks are awarded to students for their achievement
in various subjects, with students generally evaluated on a 101-point scale (0 to 100), is
known as the marking system of examination.
Grading
Grading, whether with a numerical value, letter grade or pass/fail designation, indicates the
degree of accomplishment achieved by a learner. Differentiating between norm-referenced
grading and criterion-referenced grading is important. Norm-referenced grading evaluates
student performance in comparison to other students in a group or program, determining
whether the performance is better than, worse than or equivalent to that of other students
(Gaberson, Oermann, & Shellenbarger, 2015). Criterion-referenced grading evaluates student
performance in relation to predetermined criteria and does not consider the performance of
other students (Gaberson, Oermann, & Shellenbarger, 2015).
A learner's grade in norm-referenced grading reflects accomplishment in relation to others in
the group. Only a select few can earn top grades, most will receive mid-level grades, and at
least some will receive failing grades. Norm-referenced grading is based on the symmetrical
statistical model of a bell or normal distribution curve.
Types of Grading
1. Direct Grading
2. Indirect Grading
The two types of indirect grading are:
Absolute Grading
In this type, marks are initially awarded on a 101-point scale (0 to 100) and then these
marks are converted into grades.
In this system, it is possible for all of your students to pass and even for all of them to get As.
If all of your students score a 90 or above on the test you have just given, then all of your
students will get an A on this test.
A = 90-100
B = 80-89
C = 70-79
D = 60-69
F = 0-59
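The fixed cut-offs above map directly to a small function; this is a sketch using exactly the scale given in the text:

```python
def absolute_grade(marks):
    """Convert a mark on the 101-point scale (0-100) to a letter grade
    using fixed cut-offs: every student who clears a cut-off gets that
    grade, regardless of how the rest of the class performed."""
    if marks >= 90:
        return "A"
    if marks >= 80:
        return "B"
    if marks >= 70:
        return "C"
    if marks >= 60:
        return "D"
    return "F"

print([absolute_grade(m) for m in (95, 84, 72, 65, 40)])
# ['A', 'B', 'C', 'D', 'F']
```

Because the cut-offs are absolute, a class where everyone scores 90 or above yields all A's.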
Relative Grading
The other kind of grading system is called relative grading. In this system, grades are given
based on the student's score compared to the others in the class. Relative grading allows for
the teacher to interpret the results of an assessment and determine grades based on student
performance. One example of this is grading “on the curve.” In this approach, the grades of
an assessment are forced to fit a “bell curve” no matter what the actual distribution is.
As such, even if the entire class scored between 90 and 100% on an exam, relative
grading would still create a balanced distribution of grades. Whether this is fair or not is
another discussion.
Some teachers will divide the class grades by quartiles with a spread from A-D. Others will
use the highest grade achieved by an individual student as the A grade and mark other
students based on the performance of the best student.
There are times when institutions would set the policy for relative grading. For example, in a
graduate school, you may see the following grading scale.
A = top 60%
B = next 30%
C = next 10%
D, F = Should never happen
The philosophy behind this is that in graduate school all the students are excellent, so the
grades should be better. Earning a “C” is the same as earning an “F.” Earning a “D” or “F”
often leads to removal from the program.
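The graduate-school policy above (top 60% A, next 30% B, remaining 10% C) can be sketched as rank-based assignment; the scores and the band fractions are illustrative, and ties at a band boundary would be split arbitrarily in this simple version:

```python
def relative_grades(scores, bands=(("A", 0.60), ("B", 0.30), ("C", 0.10))):
    """Assign grades by rank within the class: the top 60% of scores
    get an A, the next 30% a B, and the remaining 10% a C.
    Returns {student_index: grade}."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades = {}
    start = 0
    for letter, fraction in bands:
        count = round(fraction * len(scores))
        for i in order[start:start + count]:
            grades[i] = letter
        start += count
    for i in order[start:]:          # any leftover from rounding
        grades[i] = bands[-1][0]     # falls into the last band
    return grades

# Ten students; even if all scored 90-100, the bands would still
# be filled in the same 60/30/10 proportions.
scores = [91, 88, 95, 70, 84, 60, 99, 77, 85, 93]
g = relative_grades(scores)
print(g[6], g[5])   # highest scorer gets A, lowest gets C
```

This makes the fairness question concrete: the student scoring 60 receives the same grade whether the rest of the class scored in the 90s or the 60s.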
Advantages of Grading System
QUESTION BANK
A question bank is a planned library of test items designed to fulfill certain
predetermined purposes. The question bank makes available statistically sound
questions of known technical worth, along with model question papers, and thus
facilitates the selection of proper questions for a well-designed question paper. A
question bank should be prepared with utmost care so as to cover the entire prescribed
text; it should be exhaustive and cover the entire content with different types of questions.
DEFINITION
• “An item bank is defined as an organized collection of test items that can be
accessed for test development.” - Rudner.
• “An item bank or question bank is a collection of test items organized, classified
and catalogued in order to facilitate the construction of a variety of achievement and
other types of mental tests.” - B. H. Choppin.
PURPOSES
• A pool of test items can be used for formative and summative evaluation of the
student’s performance
NEED
• Before a test, teachers generally do not get adequate time to prepare the questions.
Naturally, in the absence of adequate time, they prepare them haphazardly.
• If there is an item or question bank, the material from it can be used by any teacher
and by any school.
PRINCIPLES
• Use a variety of testing methods
• While framing questions, it has to be ensured that they are unambiguous, simple in
language, and as brief as possible.
• Each question should evaluate some specific content area or learning outcome.
• The difficulty level of the items should be appropriate to the group of learners being tested.
• All the objective items should be grouped in one section, while the short-answer
and essay-type items should be in another section.
• Within the section of objective items, those having the same format, e.g., Yes-No
type, True-False type, multiple-choice type, etc., should be grouped together.
STEPS
1. Planning:
What types of questions make up the bank depends entirely on the total frame of
reference envisaged at the planning stage; the scope of the question bank is determined
by taking such decisions. It should be decided whether only written examinations, oral
examinations, practical examinations, or all of these are to be stored in the bank.
2. Preparation of the blueprint:
The preparation of a blueprint is essential in developing a question bank. If the items
are not relevant to the objectives of the program for which the question pool is being
developed, the result is a hodge-podge of questions. Therefore, blueprints help to
generate a quality question pool.
3. Writing of questions:
b) Ready-made questions may be lifted from old question papers, from standardized
tests and from the review exercises of good textbooks. Specifications of the questions
must be given, and each question should be accompanied by an answer key, besides
indicating the objectives and content area.
c) In the case of new questions, these may be invited from experienced teachers,
examiners and paper setters on the basis of some honorarium. Specifications of the
questions must be given, and questions should be accompanied by an answer key,
besides indicating the objectives and content area.
d) Get the questions prepared by practicing teachers invited to a workshop for the
purpose. This get-together helps to discuss the questions face-to-face and obtain
quality questions.
e) In such a workshop, subject areas and objectives may be allotted to participants
according to their competencies.
4. Validation:
Screening of questions:
a) After the questions are written, the question sheets are passed on to other members
of the group for their comments. These comments are passed on to the author of the
question, who, in consultation with two or three participants, finalizes the questions.
Individual questions are written on the blackboard by the author, followed by
discussion by the participants. Though this is a time-consuming process, it is
educationally more potent: not only is the quality of the questions improved, but it
also provides good training to the participants in framing good questions.
b) Second-level screening is done with the help of three subject experts who are
conversant with the techniques of test construction. Such a group may consist of a
subject specialist who would pass judgment on the authenticity of the subject matter.
Another person is a teacher associated with that class who helps to judge the
suitability of the question for a particular grade level. A third person may be an
evaluation expert who helps to improve the format of the question in the light of the
objectives to be tested.
6. Developing a system for maintaining confidentiality
PURPOSES
• The task is clear in each item and the person attempting an item will know what is
expected. The task in an item is understood in the same way by all candidates.
• The items are set within and based on the objectives and course contents outlined
in the syllabus.
• The questions are well distributed in the different parts of the syllabus course
contents/cover syllabus adequately.
• The items are a fair assessment of candidates at a particular level, and if they are
not, they should be tempered to the level of the candidates; this, among other things,
is what moderation ensures.
• The items are technically correct and accurate, offering the best way of testing the
concepts, principles or knowledge they are intended to test. They should not contain
clues to the correct answers.
• The items are original and not just copied from the text books or past examination
papers.
• Different types of questions selected from a question bank may be used for
pre-testing, development, review and revision of a lesson.
• In the preparation of textual material, a question pool can be utilized for preparing
review exercises in textbooks.
• The preparation of teaching units or resource units also involves the use of
evaluation materials, which may be picked up from the question bank.
• For evaluating pupils' progress, the question bank can be used most efficiently.
Individual questions can be stored and grouped for use in topic or unit testing in
periodical tests.
• Individual questions, unit tests and question papers can be profitably used by the
examining agencies by making the question bank available for paper setters.
• When question banks are established in institutions, students can use them for self
evaluation in their spare time.
• When questions on all topics of the prescribed syllabus are available pupils can
review the lessons. Even teachers can make use of such data for quick revision.