
ADMINISTRATION, SCORING AND REPORTING

Introduction
Administering a test is usually the simplest phase of the testing process. There
are, however, some common problems associated with test administration that
may affect test scores. Careful planning can help the teacher avoid or
minimize such difficulties. When giving tests, it is important that everything
possible be done to obtain valid results. Cheating, poor testing conditions, and
test anxiety, as well as errors in test-scoring procedures, contribute to invalid test
results. Many of these factors can be controlled by practicing good test
administration procedures. Practicing these procedures will prove less time-consuming
and less troublesome than dealing with problems that result from poor procedures.

After administering a test, the teacher's responsibility is to score it or arrange to
have it scored. The teacher then interprets the results and uses these interpretations
to make grading, selection, placement, or other decisions. To interpret test scores
accurately, however, the teacher needs to analyze the performance of the test as a
whole and of the individual test items, and to use these data to draw valid
inferences about student performance. This information also helps faculty prepare
for post-test discussions with students about the exam.

Administering A Test
Test administration plays a vital role in enhancing the reliability of test scores. A test
should be administered in a congenial environment, strictly as per the
planned instructions, and should assure uniform conditions for all the people
tested.

Suggestions to administer the test-

 Long announcements should not be made before or during the test.
 Instructions should be given in writing.
 The test administrator should not respond to the individual problems of
the examinees.

The steps to be followed in the administration of group tests are:

a) Motivate the students to do their best


b) Follow the directions closely.
c) Keep time accurately.
d) Record any significant events that might influence test scores
e) Collect the test materials promptly.

The guiding principle in administering an achievement test is that all students
must be given a fair chance to demonstrate their achievement of the learning
outcomes being measured. This means providing a physical and psychological environment
conducive to their best efforts and controlling factors that might interfere with
valid measurement. Students will not perform at their best if they are tense and
anxious during testing. They should be reassured that the time limits are
adequate to allow them to complete the test. This, of course, assumes that the
test will be used to improve learning and that the time limits are in fact adequate. The
things to avoid while administering a test are:

 Do not talk unnecessarily before the test.
 Keep interruptions to a minimum during the test.
 Avoid giving hints to pupils who ask.

Administering Exams
How an exam is administered can affect student performance as much as how the
exam was written. Below is a list of general principles to consider when
designing and administering examinations.

1. Give complete instructions, including how much each section of the
examination counts and the amount of time to spend on each section. This helps
students to allocate their efforts wisely.
2. State specifically what aids (e.g., calculators, notebooks) students are
allowed to use in the examination room.
3. Use assignments and homework to provide preparation for taking the
exams. For example, if the assignments consist entirely of essay questions, it would
be inappropriate for the examination to consist of 200 multiple-choice
questions.
4. Practice taking the completed test yourself. You should expect
students to take about four times the amount of time it takes you to complete
the test.
5. For final examinations, structure the test to cover the scope of the entire
course. The examination should be comprehensive enough to test
adequately the student's learning of the course material. Use a variety of
different types of questions on the examination (e.g., multiple-choice,
essay, etc.) because some topics are covered more effectively with certain
types of questions. Group questions of the same type together when
possible.
6. Tell the students what types of questions will be on the test (i.e., essay,
multiple-choice, etc.) prior to the examination. Allow students to see
past (retired) exams. For essay exams, explain how students will
be evaluated (if appropriate).
7. Provide students with a list of review questions or topics covered on the
exam along with an indication of the relative emphasis on each topic.
8. Give detailed study suggestions.
9. Indicate how much the examination will count toward determining the
final grade.

Importance Of Test Administration
Consistency

 Standardized tests are designed to be administered under consistent


procedures so that the test taking experience is as similar as possible across
examinees.
 This similar experience increases the fairness of the test, as well as making
examinees' scores more directly comparable.
 Typical guidelines related to the test administration locations state that all
the sites should be comfortable, and should have good lighting, ventilation
and handicap accessibility.
 Interruptions and distractions, such as excessive noise, should be prevented.
The time limits that have been established should be adhered to for all test
administrations.

Test security

 Test security consists of methods designed to prevent cheating, as well as to
protect the test items and content from being exposed to future test-takers.
Test administration procedures related to test security may begin as early as
the registration procedure. Many exam programs restrict examinees from
registering for a test unless they meet certain eligibility criteria.
 When examinees arrive at the test site, additional provisions for test
security include verifying each examinee's identification and restricting the
materials (such as photographic or communication devices) that an examinee
is allowed to bring into the test administration. If the exam program uses
multiple parallel test forms, these may be distributed in a spiral fashion in
order to prevent one examinee from being able to copy from another (form
A is distributed to the first examinee, form B to the second examinee, form
A to the third examinee, etc.).
 The test proctors should also remain attentive throughout the test
administration to prevent cheating and other security breaches. When testing
is complete, all test-related materials should be carefully collected from the
examinees before they depart.

Summary-

 The use of orderly, standardized test administration procedures is beneficial


to examinees. In particular, administration procedures designed to promote
consistent conditions for all examinees increase the exam program‘s fairness.
Test administration procedures related to security protect the integrity of the
test items. In both of these cases, the standardization of test administration
procedures prevents some examinees from being unfairly advantaged over
other examinees.

How many questions should I give?

It is important to allow your students enough time to complete the exam


comfortably and reasonably. Inevitably this will mean you must make some
choices about which questions you will ask.

o One minute per objective-type question
o Two minutes for a short answer requiring one sentence
o Five to ten minutes for a longer short answer
o Ten minutes for a problem that would take you two minutes to answer
o Fifteen minutes for a short, focused essay
o Thirty minutes for an essay of more than one to two pages

You should add ten minutes or so to allow for the distribution and collection of
the exam.
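As a rough planning aid, these per-item time allowances can be totaled in a short script. The sketch below is illustrative only; the question mix and the estimate_exam_minutes helper are hypothetical, and the per-item minutes simply restate the guidelines above.

# Rough estimate of total exam time from the per-item guidelines above.
# The question mix used here is hypothetical; adjust it to your own exam.

MINUTES_PER_ITEM = {
    "objective": 1,          # one minute per objective-type question
    "one_sentence": 2,       # short answer requiring one sentence
    "long_short_answer": 7,  # five to ten minutes (midpoint used)
    "problem": 10,           # problem that takes the instructor ~2 minutes
    "short_essay": 15,       # short, focused essay
    "long_essay": 30,        # essay of more than one to two pages
}

ADMIN_OVERHEAD = 10  # distribution and collection of the exam


def estimate_exam_minutes(question_counts: dict) -> int:
    """Return an estimated exam length in minutes for a given question mix."""
    working_time = sum(MINUTES_PER_ITEM[kind] * count
                       for kind, count in question_counts.items())
    return working_time + ADMIN_OVERHEAD


if __name__ == "__main__":
    mix = {"objective": 20, "one_sentence": 5, "short_essay": 2}  # example mix
    print(estimate_exam_minutes(mix), "minutes")  # 20 + 10 + 30 + 10 = 70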

Administering tests-

There are several things you should keep in mind to make the experience run as
smoothly as possible-

 Have extra copies of the test on hand, in case you have miscounted or in
the event of some other problem.

 Minimize interruptions during the exam by reading the directions briefly at
the start and refraining from commenting during the exam unless you
discover a problem.
 Periodically write the time remaining on the board.
 Be alert for cheating but do not hover over the students and cause a
distraction.

There are also some steps that you can take to reduce the anxiety that students
will inevitably feel leading up to and during an exam. Consider the following-

 Have old exams on file in the department office for students to review.
 Give students practice exams prior to the real test.
 Explain in advance of the test day, the exam format and rules, and explain
how this fits with your philosophy of testing.
 Give students tips on how to study for and take the exam- this is not a test
of their test taking ability, but rather of their knowledge, so help them learn
to take tests.
 Have extra office hours and a review session before the test.
 Arrive at the exam site early, and be there yourself (rather than sending a
proxy) to communicate the importance of the event.

Recommendations for improving Test scores:

1) When a test is announced well in advance, do not wait until the day before
to begin studying; spaced practice is more effective than massed practice.
2) Ask the instructor for old copies of the examination to practice with.
3) Ask other students what kinds of tests the instructor usually gives.
4) Don't turn study sessions into social occasions; isolated studying is usually
more effective.
5) Don't be too comfortable when studying; lying down is a physical cue for
your body to sleep.
6) Study for the type of test which was announced.
7) If you do not know the type (style) of test, study for a free-recall exam.
8) Ask yourself questions about the subject material, read for detail, and recite the
material just prior to the test.
9) Try to form the material you are studying into test questions.
10) Read the test directions carefully before beginning the exam. Ask the administrator if
anything is unclear or some details are not included.
11) For an essay test, think about the question and mentally formulate your answer before
you begin writing.
12) Pace yourself while taking the test. Do not try to be the first person finished. Allow
enough time to review your answers at the end of the session.
13) If you can rule out one wrong answer choice, guess even if there is a penalty
for wrong answers.
14) Skip more difficult items and return to them later, particularly if there are a
lot of questions.
15) When time permits, review your answers. Don't be overly eager to hand in
your test paper before all the available time has elapsed.

Scoring The Test


The principles of evaluation should be followed in scoring the test. This enhances
the objectivity and reliability of the test.

Reliability - The degree of accuracy and consistency with which a test
measures what it seeks to measure, i.e., the degree of
consistency among test scores. A test score is called reliable when we have
reasons for believing it to be stable and trustworthy.

Objectivity - A test is objective when the scorer's personal judgment does not
affect the scoring. It eliminates the fixed opinions or judgments of the person who
scores it. Objectivity is the extent to which independent and competent examiners agree on
what constitutes a good answer for each of the elements of a measuring
instrument.

Selection-type items

 Prepare stencils when useful.
 When using a stencil with holes, make sure that students have marked only one
alternative.
 When a response is wrong, put a red mark through the correct answer.
 Apply the correction-for-guessing formula only when a test is speeded (see the
sketch after this list).
 Weight all items the same (doing otherwise seldom makes a difference and
only confuses scoring).
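The conventional correction-for-guessing formula subtracts a fraction of the wrong answers from the number right: corrected score = R - W/(k - 1), where R is the number right, W the number wrong (omits excluded), and k the number of options per item. A minimal sketch of that formula follows; the function name and example numbers are illustrative, not part of any particular scoring package.

# Correction for guessing: corrected = right - wrong / (k - 1)
# Omitted items are not counted as wrong. Function name is illustrative.

def corrected_score(num_right: int, num_wrong: int, options_per_item: int) -> float:
    """Score corrected for blind guessing on k-option selection items."""
    if options_per_item < 2:
        raise ValueError("Items must have at least two options.")
    return num_right - num_wrong / (options_per_item - 1)

# Example: 40 right, 12 wrong, 3 omitted on a 4-option test
print(corrected_score(40, 12, 4))  # 40 - 12/3 = 36.0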

Supply-type items

Use your carefully developed rubrics.

Utilizing Rubrics as Assessment Tools


What is a rubric?

A rubric is a scoring and instructional tool used to assess student performance
using a task-specific range or set of criteria. It measures student performance
against this pre-determined set of criteria for the task, with levels of
performance (i.e. from poor to excellent) for each criterion. Most rubrics are
designed as a one- or two-page document formatted with a table or grid that
outlines the learning criteria for a specific lesson, assignment or project.
Rubrics can be created in a variety of forms and levels of complexity, but
they all:
 Focus on measuring a stated objective (performance, behavior, or quality)
 Use a range to rate performance
 Contain specific performance characteristics arranged in levels indicating
the degree to which a standard has been met.

Two major types of rubrics:

A holistic rubric involves one global, holistic rating with a single score for an
entire product or performance based on an overall impression. These are
useful for summative assessment where an overall performance rating is
needed, for example, portfolios.

A holistic rubric requires the teacher to score the overall process or product as
a whole, without judging the component parts separately.

An analytical rubric divides a product or performance into essential traits


that are judged separately. Analytical rubrics are usually more useful for day
to day classroom use since they provide more detailed and precise feedback to
the student.

With an analytical rubric, the teacher scores separate, individual parts of the
product or performance first, then sums the individual scores to obtain a total
score.

Assessing student learning
Rubrics provide instructors with an effective means of learning-centered
feedback and evaluation of student work. As instructional tools, rubrics enable
students to gauge the strengths and weaknesses of their work and learning. As
assessment tools, rubrics enable faculty to provide detailed and informative
evaluations of students' work.

Advantages of using rubrics:


 They allow assessment to be more objective and consistent.
 They clarify the instructor's criteria in specific terms.
 They clearly show students how their work will be evaluated and what
is expected.
 They promote awareness of the criteria to use when students assess peer
performance.
 They provide benchmarks against which to measure progress.
 They reduce the amount of time teachers spend evaluating student work
by allowing them to simply circle an item in the rubric.
 They increase students' sense of responsibility for their own work.
Steps of creating rubrics:
1. Define your assignment or project.
This is the task you are asking your students to perform.
2. Decide on a scale of performance.
These can be a level for each grade (A-F) or three levels (outstanding,
acceptable, not acceptable; Great job, Okay, What happened?). These are
listed at the top of the grid.
3. Identify the criteria of the task.
These are the observable and measurable characteristics of the task. They
are listed in the left-hand column. They can be weighted to convey the relative
importance of each.
4. Describe the performance for each criterion.
These descriptors indicate what performance looks like at each level.
They offer specific feedback. Use samples of student work to help you
determine quality work. A minimal scoring sketch for a weighted analytic
rubric follows these steps.
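To illustrate how an analytic rubric with weighted criteria produces a total score, here is a small, self-contained sketch. The criteria, weights, and level labels are invented for illustration and are not taken from any particular rubric in this chapter.

# Weighted analytic rubric: each criterion is rated on a level scale,
# multiplied by its weight, and the weighted ratings are summed.
# Criteria, weights, and ratings below are purely illustrative.

LEVELS = {"not acceptable": 1, "acceptable": 2, "outstanding": 3}

rubric = {                 # criterion -> weight (relative importance)
    "content accuracy": 3,
    "organization": 2,
    "mechanics": 1,
}

student_ratings = {        # criterion -> level assigned by the teacher
    "content accuracy": "outstanding",
    "organization": "acceptable",
    "mechanics": "acceptable",
}


def analytic_score(ratings: dict) -> int:
    """Sum of weight * level points across all criteria."""
    return sum(rubric[c] * LEVELS[ratings[c]] for c in rubric)


max_score = sum(w * max(LEVELS.values()) for w in rubric.values())
print(analytic_score(student_ratings), "out of", max_score)  # 15 out of 18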

Suggestions for use:


 Hand out the rubric with the assignment. Return the rubric with the
performance descriptors circled.
 Have students develop their own rubrics for a project.
 Have students use the rubric for self-assessment or peer assessment.

Methods Of Scoring In Standardized Test
Different tests use different methods of scoring based on different needs. The
three main categories of test scores are summarized below:

1. Raw Scores
2. Criterion-referenced Scores
3. Norm-referenced Scores (how most standardized tests are scored)

Raw score
 How the score is determined: By counting the number (or calculating a percentage) of correct responses or points earned.
 Uses: Often used in teacher-developed assessment instruments.
 Potential drawbacks: Scores may be difficult to interpret without knowledge of how performance relates to either a specific criterion or a norm group.

Criterion-referenced score
 How the score is determined: By comparing performance to one or more criteria or standards for success.
 Uses: Useful when determining whether specific instructional objectives have been achieved. Also useful when determining if basic skills that are prerequisites for other tasks have been learned.
 Potential drawbacks: Criteria for assessing mastery of complex skills may be difficult to identify.

Age or Grade Equivalent (norm-referenced)
 How the score is determined: By equating a student's performance to the average performance of students at a particular age or grade level.
 Uses: Useful when explaining norm-referenced test performance to people unfamiliar with standard scores.
 Potential drawbacks: Scores are frequently misinterpreted, especially by parents. Scores may be inappropriately used as a standard that all students must meet. Scores are often inapplicable when achievement at the secondary level or higher is being assessed. They do not give a typical range of performance for students at that age or grade.

Percentile Rank (norm-referenced)
 How the score is determined: By determining the percentage of students at the same age or grade level who obtained lower scores.
 Uses: Useful when explaining norm-referenced test performance to people unfamiliar with standard scores.
 Potential drawbacks: Scores overestimate differences near the mean and underestimate differences at the extremes.

Standard Score (norm-referenced)
 How the score is determined: By determining how far the performance is from the mean (for the age or grade level) in standard deviation units.
 Uses: Useful when describing a student's standing within the norm group.
 Potential drawbacks: Scores are not easily understood by people without some knowledge of statistics.

Standard Score

Definition

 Standard score—how far above or below average a student scored


 Distance is calculated in standard deviation (SD) units (a standard
deviation is a measure of spread or variability)
 The mean and standard deviation are for a particular norm group.
Standard Scores are by far the most complicated of the five types of
scores so they deserve a more in-depth look. When looking at the normal
distribution, a line is drawn from the highest point on the curve to the x-
axis. This point is the mean score. A standard deviation's worth is counted
out on each side of the mean and those points are marked. Another
standard deviation is counted out and two more points are marked. When
the normal distribution is divided up this way, you will always get the
same percentage of students scoring in each part. About 68% will score
within one standard deviation of the mean (34% in each direction). As
you move further from the mean, fewer and fewer students will perform at
these scores. A standard score simply tells us where a student scores in
relation to this normal distribution in standard deviation units.

Advantages

Based on the "normal curve," which means that

1. Scores are distributed symmetrically around the mean (average)


2. Each SD represents a fixed (but different) percentage of cases
3. Almost everyone is included between –3.0 and 3.0 SDs of the mean
4. The SD allows conversion of very different kinds of raw scores to a
common scale that has (a) equal units and (b) can be readily interpreted in
terms of the normal curve
5. When we can assume that scores follow a normal curve (classroom tests
usually don‘t but standardized tests do), we can translate standard scores
into percentiles—very useful.

Types of Standard Score

All Standard Scores

 Share a common logic


 Can be translated into each other

Z-Score

 Simplest
 The one on which all others are based
 Formula: z = (X - M)/SD, where X is the person's score, M is the group's average,
and SD is the group's spread (standard deviation of scores)
 z is negative for scores that are below average, so z-scores are usually converted
into some other system that has all positive numbers

T- Score

First a z-score is computed. Then a T-score with a mean of 50 and a standard


deviation of 10 is applied. T-scores are whole numbers and are never negative.

 Normally distributed standard scores


 M=50, SD=10
 Can be obtained from z scores: T = 50 + 10(z)
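A quick numerical sketch of the two formulas above (z = (X - M)/SD and T = 50 + 10z). The norm-group mean, SD, and raw score used in the example are made up for illustration.

# z-score and T-score from a raw score, given the norm group's mean and SD.
# The example mean/SD/raw-score values are illustrative only.

def z_score(raw: float, mean: float, sd: float) -> float:
    """z = (X - M) / SD"""
    return (raw - mean) / sd


def t_score(raw: float, mean: float, sd: float) -> int:
    """T = 50 + 10z, rounded to a whole number (T-scores are reported as whole numbers)."""
    return round(50 + 10 * z_score(raw, mean, sd))


print(z_score(65, 50, 10))   # 1.5 (one and a half SDs above the mean)
print(t_score(65, 50, 10))   # 65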

Normalized Standard Scores

 Starts with scores that you want to make conform to the normal curve
 Get percentile ranks for each score
 Transform percentiles into z scores using a conversion table (I handed one
out in class)
 Then transform into any other standard score you want (e.g., T-score, IQ
equivalents)
 Hope that your assumption was right, namely, that the scores really do
naturally follow a normal curve. If they don‘t, your interpretations (say, of
equal units) may be somewhat mistaken

Stanines

 Very simple type of normalized standard score


 Ranges from 1-9 (the "standard nines")
 Each stanine from 2-8 covers ½ SD
 Stanine 5 = percentiles 40-59 (the middle 20 percent)
 A difference of 2 stanines usually signals a real difference
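Because stanines are defined directly from percentile bands (stanine 5 covers percentiles 40-59, and so on), they are easy to compute from a percentile rank. The cutoffs in this sketch are the conventional stanine boundaries (4, 11, 23, 40, 60, 77, 89, 96); treat the helper name as illustrative.

# Convert a percentile rank (1-99) to a stanine (1-9) using the
# conventional percentile cutoffs; e.g., percentiles 40-59 -> stanine 5.

STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]  # upper bounds (exclusive) for stanines 1-8

def percentile_to_stanine(percentile: float) -> int:
    for stanine, cutoff in enumerate(STANINE_CUTOFFS, start=1):
        if percentile < cutoff:
            return stanine
    return 9

print(percentile_to_stanine(45))  # 5
print(percentile_to_stanine(97))  # 9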

Strengths

1. Easily explained to students and parents

2. Normalized, so can compare different tests

3. Can add stanines to get a composite score

4. Easily recorded (only one column)

Limitations

1. Like all standard scores, cannot record growth

2. Crude, but prevents over interpretation

IQ Scores-

Tests that measure intelligence have a mean of 100 and (for the most part) a
standard deviation of 15. Most people will score between 85 and 115. Someone
who scores below 70 is typically considered to have an intellectual disability.

Normal-Curve Equivalents (NCE)

 Normally distributed standard scores


 M=50
 SD=21.06
 Results in scores that go from 1-99
 Like percentiles, except that they have equal units (this means that they make
fewer distinctions in the middle of the curve and more at the extremes)
Standard Age Scores (SAS)

 Normally distributed standard scores


 Put into an IQ metric, where
 M=100
 SD=15 (Wechsler IQ Test) or SD=16 (Stanford-Binet IQ Test)

Converting among Standard Scores

Easy Convertibility

 All are different ways of saying the same thing


 All represent equal units at different ranges of scores
 All can be averaged (among themselves)
 Can easily convert one into the other
 Figure 19.2 on p 494 shows how they line up with each other
 But interpretable only when scores are actually normally distributed
(standardized tests usually are)
 Downside—not as easily understood by students and parents as are
percentiles

Using Standard Scores to Examine Profiles

Uses

 You can compare a student‘s scores on different tests and subtests when
you convert all the scores to the same type of standard score
 But all the tests must use the same norm group
 Plotting profiles can show their relative strengths and weaknesses
 Should be plotted as confidence bands to illustrate the margin of error
 Interpret scores as different only when their bands do not overlap
 Sometimes plotted separately by male and female (say, on vocational
interest tests), but this is a controversial practice

 Tests sometimes come with tabular or narrative reports of profiles

Using Standard Scores to Examine Mastery of Skill Types

 Some standardized tests try to provide some criterion-referenced


information by providing scores on specific sets of skills (see Figure 19.4
on p. 498)
 Be very cautious with these—use them as clues only, because each skill
area typically has very few items

Cautions in Interpreting Standardized Test Scores

Scores should be interpreted

1. With clear knowledge about what the test measures. Don‘t rely on titles;
examine the content (breadth, etc.)
2. In light of other factors (aptitudes, educational experiences, cultural
background, health, motivation, etc.) that may have affected test
performance
3. According to the type of decision being made (high or low for what?)
4. As a band of scores rather than a specific value. Always subtract and add 1
SEM from the score to get a range, to avoid overinterpretation
5. In light of all your evidence. Look for corroborating or conflicting
evidence
6. Never rely on a single score to make a big decision

Marking Versus Grading
Giving marks and grades in response to students' work is part of teachers'
routine work. Marking refers to assigning marks or points to a student's
performance against a marking scheme set for a test or an assignment. More often
than not, marking and scoring are regarded as part of the normal practice of
"grading".

Brookhart (2004) defines grading as "scoring or rating of individual assignments".
Grading attaches meaning to the score, telling us whether the
expectations have been exceeded, met or not met.

In relation to marking and grading of assessments, the University of Greenwich


makes the following helpful points:

1. Assessment is a matter of judgment, not simply computation.


2. Marks and grades are not absolute values, but symbols used by examiners
to communicate their judgment of a student‘s work.
3. Marks and grades provide data for decisions about student‘s fulfillment of
learning outcomes.

Marking and Grading Criteria:

Higher education institutions normally use an institution-wide grading scale
for undergraduate programmes, whereas postgraduate programmes tend to be
graded on a pass/fail basis or pass/fail/distinction basis. Grading scales tend to
incorporate both percentage grading and literal grading, the latter meaning letters
such as A, B, C, etc. The grading scale used at the University of Greenwich is shown below:

Mark on 0-100 scale    Comments
70+                    Work of exceptional quality
60-69                  Work of very good quality
50-59                  Work of good quality
40-49                  Work of satisfactory standard
30-39                  Compensatable fail
0-29                   Fail

Undergraduate grading scales are likely to be similar in other higher education
institutions. It is interesting to compare this scale with the percentage equivalents for
the classes of honours degree.

0-30% Fail
35-39% Pass degree
40-49% Third class honours
50-59% Lower second class honours
60-69% Upper second class honours
70% or more First class honours

Marks or grades are assigned to student‘s essays to indicate the degree of


achievement they have attained and there are two systems for assigning grades.

Absolute grading gives the student marks for her essay answer, depending on
how well the essay has met the assessment criteria, and is usually expressed as a
percentage or letter, e.g. 60% or B.

Relative grading tells the student how his essay answer was rated in relation to other
students doing the same test, by indicating whether he was average, above
average, or below average. Relative grading usually uses a literal scale such as
A, B, C, D and F. Some teachers would argue that two grades are the best way of
marking, so that students are given either a pass or fail grade.

This gets over the problem of deciding what constitutes an A or a C grade, but it
does reduce the information conveyed by a particular grade, since no
discrimination is made between students who pass with a very high level of
achievement and those who barely pass at all.

Common methods of Grading

a. Letter Grades: there is great flexibility in the number of grades that
can be adopted, i.e. 3-11. However, 3-point scales may not
differentiate well between students of different abilities, and 11-point
scales make too fine distinctions and can introduce arbitrariness.
The most common scales are the 7-point and 5-point scales.
Example of a 7-point grading scale:

O - Outstanding
A - Very good
B - Good
C - Average
D - Below average
E - Poor
F - Very poor
Example of a 5-point grading scale:
A+ - Excellent
A - Good
B - Average
C - Satisfactory
D - Fail

STRENGTHS-

 Easy to use.
 Easy to interpret theoretically
 Provide a concise summary

Limitations

 Meaning of grades may vary widely.


 Do not describe strengths/ weaknesses of students.

2. Number/Percentage Grades - (5, 3, 2, 1, 0) or (98%, 80%, 60%, etc.)

It is the same as letter grades; the only difference is that numbers or
percentages are used instead of letters.

Strengths

 Easy to use
 Easy to interpret theoretically
 Provide a concise summary
 May be combined with letter grades
 More continuous than letter grades.

Limitations-

 Meaning of grades may vary widely.


 Do not describe strengths/weaknesses of students.
 Meaning may need to be explained or interpreted.

Two-category grades (pass-fail): these are good for courses that require mastery
learning.

Limitations
 Less reliable.
 Does not contain enough information about the student's achievement.
 Provides no indication of the level of learning.

Checklists and rating scales - they are more detailed, and because they are so
detailed they are cumbersome for teachers to prepare.

Strengths

 Present detailed lists of student‘s achievements.


 Can be combined with letter grades.
 Good for clinical evaluation

Limitations

 May become too detailed to easily comprehend.


 Difficult for record keeping.

Uses of grading

1. Describe unambiguously the worth, merit or value of the work


accomplished. Grades are intended to communicate the achievement of
students.
2. Grades motivate students to learn
3. Provide information to students for self-evaluation and for analysis of strengths
and weaknesses.
4. Grades communicate performance levels to others.
5. Grades help in selecting people for rewards.
6. Communicate the teacher's judgment of the student's progress.

Analytical method of marking (marking scheme)

When using absolute grading against specific criteria, it is useful to use the analytic
method of marking. In this method, a marking scheme is prepared in advance and
marks are allocated to the specific points of content in the marking specification.

However, it is often difficult to decide how many marks should be given to a
particular aspect; the relative importance of each aspect should be reflected in the
allocation. This method has the advantage that it can be more reliable, provided
the marker is conscientious, and it will bring to light any errors in the writing of
the question before the test is administered.

Global method of marking (structured impressionistic marking)

The global method is also termed structured impressionistic marking, and is best
used with relative grading. This method still requires a marking specification, but
in this case it serves only as a standard of comparison. The grades used are not
usually percentages but a scale, such as "excellent/good/average/below
average/unsatisfactory". Scales can be devised according to preference, but it is
important to select examples of answers that serve as standards for each of the
points on the scale. The teacher then reads each answer through very quickly and
puts it in the appropriate pile, depending on whether it gives the impression of
excellent, good, etc. The process is then repeated, and it is much more effective if a
colleague is asked to do the second reading. This method is much faster than the
analytical one and can be quite effective for large numbers of questions.

Uses of marking

Marking has two distinct stakeholders, the students and the tutor. Both should
use marking as a means of raising achievement and attainment.

From tutor‘s perspective marking should:

 Check student understanding.
 Direct future lesson planning and teaching.
 Monitor progress through the collection of marks.
 Help to assess student progress and attainment.
 Set work at appropriate levels.
 Maintain clear objectives about what and how you teach.
 Inform students and parents formatively and summatively.

From the student's perspective, marking should and could help them:

 Identify carelessness.
 Proof-read, i.e. by making them check their work for spelling,
punctuation, etc.
 Draft work - students can become actively involved in improving their
own work.
 Identify areas of weakness and strength.
 Identify areas that lack understanding and knowledge.
 Become more motivated and place more value on their work.

Scoring Essay Questions
 Prepare an outline of the expected answer in advance.
 Use the scoring method which is most appropriate.
o Point method: each answer is compared to the ideal answer in the
scoring key and a given number of points assigned in terms of the
adequacy of the answer.
o Rating method: where the rating method is used, it is desirable to
make separate ratings for each characteristic evaluated. That is,
answers should be rated separately for organization,
comprehensiveness, relevance of ideas, and the like.
 Decide on provisions for handling factors which are irrelevant to the
learning outcomes being measured.
o Legibility of handwriting, spelling, sentence structure,
punctuation and neatness: special efforts should be made to keep
such factors from influencing our judgment.
 Evaluate all answers to one question before going on to the next
question.
o The halo effect is less likely to form when the answers for a
given pupil are not evaluated in continuous sequence.
 Evaluate the answers without looking at the pupil's name.
 If especially important decisions are to be based on the results, obtain
two or more independent ratings.

Methods in scoring essay tests

 It is critical that the teacher prepare, in advance, a detailed ideal answer.
 Student papers should be scored anonymously, and all answers to a
given item should be scored one at a time, rather than grading each paper in total separately.

Distractors in scoring essay tests

 Handwriting style
 Grammar
 Knowledge of the students
 Neatness

Two ways of scoring essay test

1. Holistic scoring - in this type, a total score is assigned to each essay item
based on the teacher's general impression or overall assessment.
2. Analytic scoring - in this type, the essay is scored in terms of each
component.

Disadvantages in scoring essay tests

Carryover effect

The carryover effect occurs when the teacher develops an impression of the quality of the
answer from one item and carries it over to the next response. If the student
answers one item well, the teacher may be influenced to score subsequent
responses at a similarly high level; the same situation may occur with a poor
response.

Halo effect-

There may be a tendency in evaluating essay items to be influenced by a general
impression of the student or feelings about the student, either positive or
negative, that creates a halo effect when judging the quality of the answers. For
instance, the teacher may hold favorable opinions about the student from class or
clinical practice and believe that this learner has made significant improvement
in the course, which in turn might influence the scoring of the responses.

Scoring Guidelines

These are the descriptions of scoring criteria that the trained readers will follow
to determine the score (1–6) for your essay. Papers at each level exhibit all or
most of the characteristics described at each score point.

Score = 6

Essays within this score range demonstrate effective skill in responding to the
task.

The essay shows a clear understanding of the task. The essay takes a position on
the issue and may offer a critical context for discussion. The essay addresses
complexity by examining different perspectives on the issue, or by evaluating the
implications and/or complications of the issue, or by fully responding to
counterarguments to the writer's position. Development of ideas is ample,
specific, and logical. Most ideas are fully elaborated. A clear focus on the
specific issue in the prompt is maintained. The organization of the essay is clear:
the organization may be somewhat predictable or it may grow from the writer's
purpose. Ideas are logically sequenced. Most transitions reflect the writer's logic
and are usually integrated into the essay. The introduction and conclusion are
effective, clear, and well developed. The essay shows a good command of
language. Sentences are varied and word choice is varied and precise. There are
few, if any, errors to distract the reader.

Score = 5
Essays within this score range demonstrate competent skill in responding to the
task.

The essay shows a clear understanding of the task. The essay takes a position on
the issue and may offer a broad context for discussion. The essay shows
recognition of complexity by partially evaluating the implications and/or
complications of the issue, or by responding to counterarguments to the writer's
position. Development of ideas is specific and logical. Most ideas are elaborated,
with clear movement between general statements and specific reasons, examples,
and details. Focus on the specific issue in the prompt is maintained. The

organization of the essay is clear, although it may be predictable. Ideas are
logically sequenced, although simple and obvious transitions may be used. The
introduction and conclusion are clear and generally well developed. Language is
competent. Sentences are somewhat varied and word choice is sometimes varied
and precise. There may be a few errors, but they are rarely distracting.

Score = 4
Essays within this score range demonstrate adequate skill in responding to the
task.

The essay shows an understanding of the task. The essay takes a position on the
issue and may offer some context for discussion. The essay may show some
recognition of complexity by providing some response to counterarguments to
the writer's position. Development of ideas is adequate, with some movement
between general statements and specific reasons, examples, and details. Focus on
the specific issue in the prompt is maintained throughout most of the essay. The
organization of the essay is apparent but predictable. Some evidence of logical
sequencing of ideas is apparent, although most transitions are simple and
obvious. The introduction and conclusion are clear and somewhat developed.
Language is adequate, with some sentence variety and appropriate word choice.
There may be some distracting errors, but they do not impede understanding.

Score = 3

Essays within this score range demonstrate some developing skill in responding
to the task.

The essay shows some understanding of the task. The essay takes a position on
the issue but does not offer a context for discussion. The essay may acknowledge

a counterargument to the writer's position, but its development is brief or unclear.
Development of ideas is limited and may be repetitious, with little, if any,
movement between general statements and specific reasons, examples, and
details. Focus on the general topic is maintained, but focus on the specific issue
in the prompt may not be maintained. The organization of the essay is simple.
Ideas are logically grouped within parts of the essay, but there is little or no
evidence of logical sequencing of ideas. Transitions, if used, are simple and
obvious. An introduction and conclusion are clearly discernible but
underdeveloped. Language shows a basic control. Sentences show a little variety
and word choice is appropriate. Errors may be distracting and may occasionally
impede understanding.

Score = 2
Essays within this score range demonstrate inconsistent or weak skill in
responding to the task.

The essay shows a weak understanding of the task. The essay may not take a
position on the issue, or the essay may take a position but fail to convey reasons
to support that position, or the essay may take a position but fail to maintain a
stance. There is little or no recognition of a counterargument to the writer's
position. The essay is thinly developed. If examples are given, they are general
and may not be clearly relevant. The essay may include extensive repetition of
the writer's ideas or of ideas in the prompt. Focus on the general topic is
maintained, but focus on the specific issue in the prompt may not be maintained.
There is some indication of an organizational structure, and some logical
grouping of ideas within parts of the essay is apparent. Transitions, if used, are
simple and obvious, and they may be inappropriate or misleading. An
introduction and conclusion are discernible but minimal. Sentence structure and
word choice are usually simple. Errors may be frequently distracting and may
sometimes impede understanding.

Score = 1
Essays within this score range show little or no skill in responding to the task.

The essay shows little or no understanding of the task. If the essay takes a
position, it fails to convey reasons to support that position. The essay is
minimally developed. The essay may include excessive repetition of the writer's
ideas or of ideas in the prompt. Focus on the general topic is usually maintained,
but focus on the specific issue in the prompt may not be maintained. There is
little or no evidence of an organizational structure or of the logical grouping of
ideas. Transitions are rarely used. If present, an introduction and conclusion are
minimal. Sentence structure and word choice are simple. Errors may be
frequently distracting and may significantly impede understanding.

No Score
Blank, Off-Topic, Illegible, Not in English, or Void.

Guidelines in scoring Essay test to avoid subjectivity

 Decide what factors constitute a good answer before administering an
essay question.
 Explain these factors in the item itself.
 Read all the answers to a single essay question before reading other
questions.
 Reread essay answers a second time after the initial scoring.

Scoring Objective Items
The following are the methods of scoring objective items:

 Scoring key
 Strip key
 Scoring stencil

 Scoring key: if the pupil's answers are recorded on the test paper
itself, a scoring key is usually obtained by marking the correct answers on
a blank copy of the test.
 The scoring procedure is then simply a matter of comparing the
columns of answers on this master copy with the columns of answers
on each pupil's paper.
 Strip key: a strip key, which consists merely of a strip of paper on
which the columns of answers are recorded, may also be used.
 Scoring stencil: where separate answer sheets are used, a scoring
stencil is most convenient. This is a blank answer sheet with holes
punched where the correct answers should appear.
 One of the most important advantages of objective-type tests is the ease and
accuracy of scoring. The best way to score objective tests is with a test
scanner. This technology can speed up scoring and minimize scoring
errors.
 When using a test scanner, a scoring key is prepared on a machine-
scorable answer sheet and is read by the scanner first. After the
scanner reads the scoring key, the student responses are read and stored
on the hard disk of an attached computer.
 A separate program is used to score the student responses by
comparing each response to the correct answer on the answer key.
When this process is complete, each student's score, along with item
analysis information, is printed.
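The scoring step the scanner software performs is essentially a comparison of each response string against the key. The sketch below shows that comparison in a few lines; the student names and answer strings are invented for illustration, not output from any real scanning package.

# Score each student's responses against the answer key by counting matches.
# Answer strings and names are illustrative only.

answer_key = "BCADB"

responses = {
    "Student 01": "BCADB",   # all correct
    "Student 02": "BCACA",   # three correct
    "Student 03": "ACADB",   # four correct
}

def raw_score(response: str, key: str) -> int:
    """Number of responses that match the key, position by position."""
    return sum(1 for given, correct in zip(response, key) if given == correct)

for student, resp in responses.items():
    print(student, raw_score(resp, answer_key), "out of", len(answer_key))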

Item Analysis

The procedure used to judge the quality of an item is called item analysis.

Item analysis is a post-administration examination of a test. The quality of a
test depends upon the individual items of the test, so it is usually desirable to
evaluate the effectiveness of its items. It provides information concerning how well
each item in the test functions and tells us about the quality of each item.
One primary goal of item analysis is to help improve the test by revising or
discarding ineffective items. Another important function is to ascertain what test
takers do and do not know.

Item Analysis describes the statistical analyses, which allow measurement of


the effectiveness of individual test items. An understanding of the factors which
govern effectiveness (and a means of measuring them) can enable us to create
more effective test questions and also regulate and standardize existing tests. Item
analysis helps to find out how difficult a test item is. Similarly, it also helps to
show how well the item discriminates between high and low scorers on the test.
Item analysis further helps to detect specific technical flaws and thus provides
further information for improving test items.

To ascertain whether the questions/items do their job effectively, a detailed test
and item analysis has to be done before a meaningful and scientific inference
about the test can be made in terms of its validity, reliability, objectivity and
usability.

A systematic analysis aims at finding the performance of a group.

 The central tendency of the marks obtained by them, e.g. normal/average;
positive or negative skewness; high or low values.
 The variability, characterized by the standard deviation (SD), indicates the
nature of the spread of marks; the greater the spread, the greater the
value of the standard deviation.
 The coefficient of reliability for the test indicates the degree of consistency
with which the test has measured the students' abilities. A high value of
this means that the test is reliable and produces virtually repeatable
scores for the students.
 Item analysis is useful in making meaningful interpretations and value
judgments about students' performance.
 A teacher or paper setter comes to know whether the items had the right
level of difficulty and whether there was discrimination between more able
and less able students.
Item analysis defines and maintains standards of performance and ensures
comparability of standards. It also helps us:
o To understand the behavior of items,
o To become better item writers and more scientific, professional and
competent teachers.

Item analysis is a process of examining class-wide performance on individual test


items. There are three common types of item analysis which provide teachers
with three different types of information:

 Difficulty Index - Teachers produce a difficulty index for a test item by
calculating the proportion of students in the class who got the item correct.
(The name of this index is counter-intuitive, as one actually gets a measure
of how easy the item is, not the difficulty of the item.) The larger the
proportion, the more students have learned the content measured by
the item.

 Discrimination Index - The discrimination index is a basic measure of the


validity of an item. It is a measure of an item's ability to discriminate
between those who scored high on the total test and those who scored low.
Though there are several steps in its calculation, once computed, this index
can be interpreted as an indication of the extent to which overall
knowledge of the content area or mastery of the skills is related to the
response on an item. Perhaps the most crucial validity standard for a test
item is that whether a student got an item correct or not is due to their level
of knowledge or ability and not due to something else such as chance or
test bias.

 Analysis of Response Options - In addition to examining the performance


of an entire test item, teachers are often interested in examining the
performance of individual distractors (incorrect answer options) on
multiple-choice items. By calculating the proportion of students who chose
each answer option, teachers can identify which distractors are "working"
and appear attractive to students who do not know the correct answer, and
which distractors are simply taking up space and not being chosen by
many students. To eliminate blind guessing which results in a correct
answer purely by chance (which hurts the validity of a test item), teachers
want as many plausible distractors as is feasible. Analyses of response
options allow teachers to fine tune and improve items they may wish to use
again with future classes.
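A sketch of that response-option analysis: count how often each option was chosen and report the proportions, so rarely chosen distractors stand out. The item responses used here are invented for illustration.

# Proportion of students choosing each response option for one item.
# A distractor chosen by almost nobody is "taking up space" and may need revision.
from collections import Counter

item_responses = list("ABBDBCBBAB")  # illustrative responses of 10 students
correct_option = "B"

counts = Counter(item_responses)
n = len(item_responses)

for option in sorted(set(item_responses) | {correct_option}):
    marker = " (correct)" if option == correct_option else ""
    print(f"Option {option}: {counts[option] / n:.0%}{marker}")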

Steps involved in Item Analysis

 For each item count the number of students in each group who answered
the item correctly. For alternate response type of items, count the number
of students in each group who choose each alternative.
 Award a score to each student.
A practical, simple and rapid method is to perforate on your answer sheet
the boxes corresponding to the correct answers; by placing the perforated sheet
on the student's answer sheet, the raw score can be found almost
automatically.


Ranking in order of merit and identifying high and low groups.


 Arrange the answer sheets from the highest score to the lowest score.
 Make two groups, i.e., highest scores in one group and lowest scores in the other
group, or top and bottom halves.

Calculation of the difficulty index of a question

For each item, compute the percentage of students who got the item correct; this is
called the 'item difficulty index'.

1. D = (R/N) * 100
R: number of pupils who answered the item correctly.
N: total number of pupils who tried the item.

The higher the difficulty index, the easier the item. The difficulty
level/facility level of a test is an index of how easy or difficult the test is;
it is the ratio of the average score of a sample of subjects on the test to the
maximum possible score on the test. It is usually expressed as a percentage.
2. Difficulty level = (average score on the test / maximum possible score) * 100
3. Difficulty index = ((H + L)/N) * 100
H: number of correct answers in the high group.
L: number of correct answers in the low group.
N: total number of students in both groups.
4. Find the facility value of objective tests first.
5. Facility value = (number of students answering the question correctly /
number of students who have taken the test) * 100

If the facility value is 70 and above, those are easy questions; if it is below
70, the questions are difficult ones.
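A small sketch of the difficulty (facility) index D = (R/N) * 100 from the formulas above. The response data are invented; only the formula itself comes from the text.

# Item difficulty index: percentage of examinees answering the item correctly.
# The list of per-student correctness flags is illustrative data.

def difficulty_index(correct_flags: list) -> float:
    """D = (R / N) * 100, where R = number correct and N = number who attempted."""
    n = len(correct_flags)
    if n == 0:
        raise ValueError("No responses to analyze.")
    return 100 * sum(correct_flags) / n

# 1 = answered correctly, 0 = answered incorrectly (8 of 10 correct here)
flags = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(difficulty_index(flags))  # 80.0 -> an easy item (facility value >= 70)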

Estimating the Discrimination Index (DI)

The discriminating power (validity index) of an item refers to the degree to which
a given item discriminates among students who differ sharply in the functions
measured by the test as a whole.

Formula 1

DI = (RU - RL) / (N/2)

RU = number of correct responses from the upper group.
RL = number of correct responses from the lower group.
N = total number of pupils who tried the item.

Questions with a high discrimination value are needed for selection purposes.

Formula 2

DI = (No. of HAQ - No. of LAQ) / No. of HAG

No. of HAQ: number of students in the high-ability group answering the question
correctly.
No. of LAQ: number of students in the low-ability group answering the question
correctly.
No. of HAG: number of students in the high-ability group.

 Positive discrimination: if an item is answered correctly by superiors
(upper group) but not answered correctly by inferiors (lower group),
the item possesses positive discrimination.

 Negative discrimination: if an item is answered correctly by inferiors (lower
group) but not answered correctly by superiors (upper group), the item
possesses negative discrimination.

 Zero discrimination: if an item is answered correctly by the same number of
superior and inferior examinees, the item cannot
discriminate between superior and inferior examinees; thus, the
discrimination power of the item is zero.
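Formula 1 above can be applied directly once the papers are split into upper and lower halves. The sketch below does this for an invented set of correctness flags; a positive result indicates positive discrimination, a negative result negative discrimination, and zero means the item does not discriminate.

# Discrimination index using Formula 1: DI = (RU - RL) / (N/2),
# where the group is split into upper and lower halves by total test score.
# The upper/lower correctness flags below are illustrative data.

def discrimination_index(upper_correct: list, lower_correct: list) -> float:
    """DI for one item, given per-student correctness (1/0) in each half."""
    half_size = len(upper_correct)          # assumes equal-sized halves
    if half_size == 0 or half_size != len(lower_correct):
        raise ValueError("Upper and lower groups must be equal-sized and non-empty.")
    ru = sum(upper_correct)                 # correct responses in the upper group
    rl = sum(lower_correct)                 # correct responses in the lower group
    return (ru - rl) / half_size

upper = [1, 1, 1, 1, 0]   # 4 of 5 high scorers got the item right
lower = [1, 0, 0, 1, 0]   # 2 of 5 low scorers got the item right
print(discrimination_index(upper, lower))  # (4 - 2) / 5 = 0.4 -> positive discrimination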

 Item analysis is a general term that refers to the specific methods used in
education to evaluate test items, typically for the purpose of test
construction and revision.
 Regarded as one of the most important aspects of test construction and
increasingly receiving attention, it is an approach incorporated into item
response theory (IRT), which serves as an alternative to classical
measurement theory (CMT) or classical test theory (CTT). Classical
measurement theory considers a score to be the direct result of a person's
true score plus error.
 It is this error that is of interest as previous measurement theories have
been unable to specify its source. However, item response theory uses item
analysis to differentiate between types of error in order to gain a clearer
understanding of any existing deficiencies.
 Particular attention is given to individual test items, item characteristics,
probability of answering items correctly, overall ability of the test taker,
and degrees or levels of knowledge being assessed.

The Purpose Of Item Analysis

 There must be a match between what is taught and what is assessed.


However, there must also be an effort to test for more complex levels of
understanding, with care taken to avoid over-sampling items that assess
only basic levels of knowledge.
 Tests that are too difficult (and have an insufficient floor) tend to lead to
frustration and deflated scores, whereas tests that are too easy (and
have an insufficient ceiling) facilitate a decline in motivation and lead to
inflated scores.

 Tests can be improved by maintaining and developing a pool of valid items
from which future tests can be drawn and that cover a reasonable span of
difficulty levels.
 Item analysis helps improve test items and identify unfair or biased items.
Results should be used to refine test item wording. In addition, closer
examination of items will also reveal which questions were most difficult,
perhaps indicating a concept that needs to be taught more thoroughly.
 If a particular distracter (that is, an incorrect answer choice) is the most
often chosen answer, and especially if that distracter positively correlates
with a high total score, the item must be examined more closely for
correctness. This situation also provides an opportunity to identify and
examine common misconceptions among students about a particular
concept.
 In general, once test items have been created, the value of these items can
be systematically assessed using several methods representative of item
analysis:
a) a test item's level of difficulty,
b) an item's capacity to discriminate, and
c) the item characteristic curve.

Difficulty is assessed by examining the number of persons correctly


endorsing the answer. Discrimination can be examined by comparing the
number of persons getting a particular item correct with the total test score.
Finally, the item characteristic curve can be used to plot the likelihood of
answering correctly with the level of success on the test.

Using Item Analysis Results

 It helps to judge the worth or quality of a test.
 Aids in subsequent test revisions.
 Leads to increased skill in test construction.
 Provides diagnostic value and helps in planning future learning activities.
 Provides a basis for discussing test results.
 For making decisions about the promotion of students to the next higher
grade.
 To bring about improvement in teaching methods and techniques.

Item Difficulty

Perhaps "item difficulty" should have been named "item easiness"; it expresses
the proportion or percentage of students who answered the item correctly. Item
difficulty can range from 0.0 (none of the students answered the item correctly)
to 1.0 (all of the students answered the item correctly). Experts recommend that
the average level of difficulty for a four-option multiple choice test should be
between 60% and 80%; an average level of difficulty within this range can be
obtained, of course, when the difficulty of individual items falls outside of this
range. If an item has a low difficulty value, say, less than .25, there are several
possible causes: the item may have been miskeyed; the item may be too
challenging relative to the overall level of ability of the class; the item may be
ambiguous or not written clearly; there may be more than one correct answer.
Further insight into the cause of a low difficulty value can often be gained by
examining the percentage of students who chose each response option. For
example, when a high percentage of students chose a single option other than the
one that is keyed as correct, it is advisable to check whether a mistake was made
on the answer key.

Item Statistics
Item statistics are used to assess the performance of individual test items on the
assumption that the overall quality of a test derives from the quality of its items.

Item Number.
This is the question number taken from the student answer sheet. Up to 150 items
can be scored on the Standard Answer Sheet (purple).

Mean and S.D.


The mean is the "average" student response to an item. It is computed by adding
up the number of points earned by all students for the item, and dividing that total
by the number of students.
The standard deviation, or S.D., is a measure of the dispersion of student scores
on that item; that is, it indicates how "spread out" the responses were. The item
standard deviation is most meaningful when comparing items which have more
than one correct alternative and when scale scoring is used. For this reason it is
not typically used to evaluate classroom tests.

Item Difficulty.
For items with one correct alternative worth a single point, the item difficulty is
simply the percentage of students who answer an item correctly. In this case, it is
also equal to the item mean. The item difficulty index ranges from 0 to 100; the

higher the value, the easier the question. When an alternative is worth other than
a single point, or when there is more than one correct alternative per question, the
item difficulty is the average score on that item divided by the highest number of
points for any one alternative.

Item difficulty is relevant for determining whether students have learned the
concept being tested. It also plays an important role in the ability of an item to
discriminate between students who know the tested material and students who do
not. The item will have low discrimination if it is so difficult that almost
everyone gets it wrong or guesses, or so easy that almost everyone gets it right.

To maximize item discrimination, desirable difficulty levels are slightly higher
than midway between chance and perfect scores for the item. (The chance score for
five-option questions, for example, is .20 because one-fifth of the students
responding to the question could be expected to choose the correct option by
guessing.) Ideal difficulty levels for multiple-choice items in terms of
discrimination potential are:

Format                                        Ideal Difficulty
Five-response multiple-choice                        70
Four-response multiple-choice                        74
Three-response multiple-choice                       77
True-false (two-response multiple-choice)            85

ScorePak® classifies item difficulty as "easy" if the index is 85% or above;
"moderate" if it is between 51% and 84%; and "hard" if it is 50% or below.
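
A minimal sketch of this index, using hypothetical item scores and alternative weights, is shown below; it scales the item mean by the highest-weighted alternative and applies the difficulty labels given above.

```python
# Minimal sketch of the item difficulty index described above, for an item whose
# alternatives may carry different point values. Scores and weights are hypothetical.

def item_difficulty(points_earned, max_points_per_alternative):
    """Average score on the item divided by the highest number of points
    available for any one alternative, scaled to 0-100."""
    mean_score = sum(points_earned) / len(points_earned)
    return 100 * mean_score / max_points_per_alternative

# e.g. alternatives worth 0, 1 or 2 points, with 2 the maximum
scores = [2, 2, 1, 0, 2, 1, 2, 0, 2, 2]
index = item_difficulty(scores, max_points_per_alternative=2)

label = "easy" if index >= 85 else "moderate" if index >= 51 else "hard"
print(f"Difficulty index: {index:.0f} ({label})")  # 70 (moderate)
```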

Item Discrimination
Item discrimination refers to the ability of an item to differentiate among students
on the basis of how well they know the material being tested. Various hand
calculation procedures have traditionally been used to compare item responses to
total test scores using high and low scoring groups of students. Computerized
analyses provide more accurate assessment of the discrimination power of items
because they take into account responses of all students rather than just high and
low scoring groups.

The item discrimination index is the correlation between student responses to a
particular item and total scores on all other items on the test.
This index is the equivalent of a point-biserial coefficient in this application. It
provides an
estimate of the degree to which an individual item is measuring the same thing as
the rest of the items.
Because the discrimination index reflects the degree to which an item and the test
as a whole are measuring a unitary ability or attribute, values of the coefficient
will tend to be lower for tests measuring a wide range of content areas than for
more homogeneous tests.
Item discrimination indices must always be interpreted in the context of the type
of test which is being analyzed.
Items with low discrimination indices are often ambiguously worded and should
be examined.
Items with negative indices should be examined to determine why a negative
value was obtained.
For example, a negative value may indicate that the item was miskeyed, so that
students who
knew the material tended to choose an unkeyed, but correct, response option.
Tests with high internal consistency consist of items with mostly positive
relationships with total
test score. In practice, values of the discrimination index will seldom exceed .50
because of the
differing shapes of item and total score distributions. ScorePak® classifies item
discrimination as "good" if the index is above .30; "fair" if it is between .10
and .30; and "poor" if it is below .10.
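
The sketch below (Python, simulated right/wrong data) illustrates this kind of index by correlating each student's response to one item with the total score on all other items and then applying the classification thresholds above; it is a generic point-biserial calculation, not the output of any particular scoring package.

```python
import numpy as np

# Minimal sketch with simulated 0/1 data: the discrimination index as the
# correlation (point-biserial) between one item and the total score on all
# OTHER items, per the description above.
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
responses = (ability + rng.normal(size=(500, 40)) > 0).astype(int)

def discrimination_index(responses, item):
    rest_score = responses.sum(axis=1) - responses[:, item]  # total minus this item
    return np.corrcoef(responses[:, item], rest_score)[0, 1]

d = discrimination_index(responses, item=0)
label = "good" if d > 0.30 else "fair" if d >= 0.10 else "poor"
print(f"Discrimination index: {d:.2f} ({label})")
```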

Alternate Weight.
This column shows the number of points given for each response alternative.
For most tests, there will be one correct answer which will be given one point,
but ScorePak®
allows multiple correct alternatives, each of which may be assigned a different
weight.

Means.
The mean total test score (minus that item) is shown for students who selected
each of
the possible response alternatives. This information should be looked at in
conjunction with the
discrimination index; higher total test scores should be obtained by students
choosing the correct,
or most highly weighted alternative. Incorrect alternatives with relatively high
means should be
examined to determine why "better" students chose that particular alternative.
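
To make this concrete, the sketch below (Python, with hypothetical option choices and scores) computes the mean rest-of-test score for the students choosing each alternative of a single item; the names and numbers are illustrative only.

```python
import numpy as np

# Minimal sketch (hypothetical data): mean rest-of-test score for the students
# choosing each alternative of one item.
chosen = np.array(["A", "B", "B", "C", "B", "A", "D", "B", "C", "B"])
rest_scores = np.array([21, 34, 30, 25, 36, 19, 22, 33, 27, 35])  # total minus this item

for option in sorted(set(chosen)):
    mean_rest = rest_scores[chosen == option].mean()
    print(f"Option {option}: mean rest-of-test score {mean_rest:.1f}")
# If an incorrect option shows a high mean, better-prepared students are being
# drawn to it, and the item wording or answer key deserves a closer look.
```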

Frequencies and Distribution.
The number and percentage of students who choose each
alternative are reported. The bar graph on the right shows the percentage
choosing each
response. Frequently chosen wrong alternatives may indicate common
misconceptions among
the students.

Difficulty and Discrimination Distributions


At the end of the Item Analysis report, test items are listed according to their
degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor).
These distributions provide a quick overview of the test, and can be used to
identify items which are not performing well and which can perhaps be improved
or discarded.

Test Statistics
Two statistics are provided to evaluate the performance of the test as a whole.

Reliability Coefficient.
The reliability of a test refers to the extent to which the test is likely to
produce consistent scores. The particular reliability coefficient reflects three
characteristics of the test:

1. The intercorrelations among the items -- the greater the relative number of
positive relationships, and the stronger those relationships are, the greater the
reliability. Item discrimination indices and the test's reliability coefficient are
related in this regard.
2. The length of the test -- a test with more items will have a higher reliability, all
other things
being equal.

3. The content of the test -- generally, the more diverse the subject matter tested
and the testing
techniques used, the lower the reliability.

Reliability coefficients theoretically range in value from zero (no reliability)
to 1.00 (perfect reliability). In practice, their approximate range is from .50 to
.90 for about 95% of classroom tests.

High reliability means that the questions of a test tended to "pull together."
Students who
answered a given question correctly were more likely to answer other questions
correctly. If a
parallel test were developed by using similar items, the relative scores of students
would show
little change.
Low reliability means that the questions tended to be unrelated to each other in
terms of who
answered them correctly. The resulting test scores reflect peculiarities of the
items or the testing situation more than students' knowledge of the subject matter.
As with many statistics, it is dangerous to interpret the magnitude of a reliability
coefficient out of context. High reliability should be demanded in situations in
which a single test score is used to make major decisions, such as professional
licensure examinations. Because classroom
examinations are typically combined with other scores to determine grades, the
standards for a
single test need not be as stringent. The following general guidelines can be used
to interpret
reliability coefficients for classroom exams:

Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 - .90        Very good for a classroom test.
.70 - .80        Good for a classroom test; in the range of most classroom tests.
                 There are probably a few items which could be improved.
.60 - .70        Somewhat low. This test needs to be supplemented by other measures
                 (e.g., more tests) to determine grades. There are probably some
                 items which could be improved.
.50 - .60        Suggests need for revision of the test, unless it is quite short
                 (ten or fewer items). The test definitely needs to be supplemented
                 by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily
                 to the course grade, and it needs revision.

The measure of reliability used here is Cronbach's alpha (coefficient alpha). This
is the general form of the more commonly reported KR-20 and can be applied to
tests composed of items with different numbers of points given for different
response alternatives. When coefficient alpha is applied to tests in which each
item has only one correct answer and all correct answers are worth the same number
of points, the resulting coefficient is identical to KR-20.
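
A minimal sketch of coefficient alpha on simulated right/wrong data is given below; the formula is the standard one, and the simulated response matrix is only for illustration.

```python
import numpy as np

# Minimal sketch: Cronbach's alpha (coefficient alpha), which reduces to KR-20
# when every item is scored right (1) or wrong (0). The responses are simulated
# from a common "ability" so that the items correlate, as on a real test.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 40))
responses = (ability + rng.normal(size=(200, 40)) > difficulty).astype(int)

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Coefficient alpha: {cronbach_alpha(responses):.2f}")
```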

Standard Error of Measurement.


The standard error of measurement is directly related to the reliability of the test.
It is an index of the amount of variability in an individual student's performance
due to random measurement error. If it were possible to administer an infinite
number of parallel tests, a student's score would be expected to change from one
administration to the next due to a number of factors. For each student, the scores
would form a "normal" (bell-shaped) distribution. The mean of the distribution is
assumed to be the student's "true score," and reflects what he or she "really"
knows about the subject. The standard deviation of the distribution is called the
standard error of measurement and reflects the amount of change in the student's
score which could be expected from one test administration to another.
Whereas the reliability of a test always varies between 0.00 and 1.00, the
standard error of
measurement is expressed in the same scale as the test scores. For example,
multiplying all test
scores by a constant will multiply the standard error of measurement by that same
constant, but
will leave the reliability coefficient unchanged.
A general rule of thumb to predict the amount of change which can be expected
in individual test
scores is to multiply the standard error of measurement by 1.5. Only rarely would
one expect a
student's score to increase or decrease by more than that amount between two
such similar
tests. The smaller the standard error of measurement, the more accurate the
measurement
provided by the test.
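
The usual formula, SEM = SD × sqrt(1 − reliability), is not spelled out above but follows from this description; the sketch below applies it, along with the 1.5 × SEM rule of thumb, to purely illustrative numbers.

```python
import numpy as np

# Minimal sketch: standard error of measurement from the total-score standard
# deviation and the reliability coefficient. The scores and the reliability
# value are hypothetical.
total_scores = np.array([32, 35, 28, 40, 37, 30, 33, 36, 29, 38])
reliability = 0.80

sem = total_scores.std(ddof=1) * np.sqrt(1 - reliability)
print(f"Standard error of measurement: {sem:.2f}")

# Rule of thumb from the text: an individual score would rarely change by more
# than about 1.5 * SEM between two similar administrations of the test.
observed = 33
print(f"Expected band for an observed score of {observed}: "
      f"{observed - 1.5 * sem:.1f} to {observed + 1.5 * sem:.1f}")
```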

A CAUTION in Interpreting Item Analysis Results


Each of the various item statistics provides information which can be used to
improve individual test items and to increase the quality of the test as a whole.
Such statistics must always be interpreted in the context of the type of test
given and the individuals being tested. Several cautions apply:
1. Item analysis data are not synonymous with item validity. An external criterion
is required to accurately judge the validity of test items. By using the internal
criterion of total test score, item analyses reflect internal consistency of items
rather than validity.
2. The discrimination index is not always a measure of item quality. There is a
variety of reasons an item may have low discriminating power:
(a) extremely difficult or easy items will have low ability to discriminate but
such items are often needed to adequately sample course content and objectives;
(b) An item may show low discrimination if the test measures many different
content areas and
cognitive skills. For example, if the majority of the test measures "knowledge of
facts," then an item assessing "ability to apply principles" may have a low
correlation with total test score, yet both types of items are needed to measure
attainment of course objectives.
3. Item analysis data are tentative. Such data are influenced by the type and
number of students being tested, instructional procedures employed, and chance
errors. If repeated use of items is possible, statistics should be recorded for each
administration of each item.

Summary

In the light of the above discussion, we have covered administering a test,
various suggestions for administering tests, the importance of test
administration, and recommendations for improving test scores. We learnt about
scoring methods, various standard scores, and marking and grading criteria and
their types. We discussed scoring essay tests and objective tests, and took a
detailed look at item analysis, item difficulty, and its uses.

Conclusion
To conclude, sound knowledge of good test administration practice and of the
various methods of scoring a test helps to improve student performance and the
teacher's evaluation skills.

Bibliography

 B. Sankaranarayan (2008), Learning and Teaching Nursing, 2nd edition, Brainfill
Publishers, pp. 232-233.
 K. P. Neeraja (2003), Textbook of Nursing Education, 1st edition, Gopson Paper
Ltd, Noida, pp. 413-425.
 Francis M. Quinn (2000), Principles and Practice of Nurse Education, 4th
edition, Nelson Thornes Ltd, pp. 210-214.
 Marilyn H. Oermann (2009), Evaluation and Testing in Nursing Education, 3rd
edition, Springer Publishing, pp. 122-126.
