ADMINISTRATION: Reporting and Scoring
Introduction
Administering a test is usually one of the simplest phases of the testing process. There are, however, some common problems associated with test administration that may affect test scores. Careful planning can help the teacher avoid or minimize such difficulties. When giving tests, everything possible should be done to obtain valid results. Cheating, poor testing conditions, and test anxiety, as well as errors in test scoring procedures, all contribute to invalid test results. Many of these factors can be controlled by practicing good test administration procedures. Practicing these procedures will prove less time-consuming and less troublesome than dealing with the problems that result from poor procedures.
Administering a Test
Test administration plays a vital role in enhancing the reliability of test scores. A test should be administered in a congenial environment, strictly as per the planned instructions, and should ensure uniformity of conditions for all the people tested.
Administering Exams
How an exam is administered can affect student performance as much as how the exam was written. Below is a list of general principles to consider when designing and administering examinations.
Importance Of Test Administration
Consistency
Test security
Summary-
You should add ten minutes or so to allow for the distribution and collection of
the exam.
Administering tests-
There are several things you should keep in mind to make the experience run as
smoothly as possible-
Have extra copies of the test on hand, in case you have miscounted or in
the event of some other problem.
Minimize interruptions during the exam by reading the directions briefly at
the start and refraining from commenting during the exam unless you
discover a problem.
Periodically write the time remaining on the board.
Be alert for cheating but do not hover over the students and cause a
distraction.
There are also some steps that you can take to reduce the anxiety that students
will inevitably feel leading up to and during an exam. Consider the following-
Have old exams on file in the department office for students to review.
Give students practice exams prior to the real test.
Explain in advance of the test day, the exam format and rules, and explain
how this fits with your philosophy of testing.
Give students tips on how to study for and take the exam- this is not a test of their test-taking ability, but rather of their knowledge, so help them learn to take tests.
Have extra office hours and a review session before the test.
Arrive at the exam site early, and be there yourself (rather than sending a proxy) to communicate the importance of the event.
The following tips can be passed on to students:
1) When a test is announced well in advance, do not wait until the day before to begin studying; spaced practice is more effective than massed practice.
2) Ask the instructor for old copies of the examination to practice with.
3) Ask other students what kinds of tests the instructor usually gives.
4) Don't turn study sessions into social occasions; isolated studying is usually more effective.
5) Don't be too comfortable when studying; lying down is a physical cue for your body to sleep.
6) Study for the type of test which was announced.
7) If you do not know the type (style) of test, study for a free-recall exam.
8) Ask yourself questions about the subject material, read for detail, and recite the material just prior to the test.
9) Try to form the material you are studying into test questions.
10) Read the test directions carefully before beginning the exam. Ask the administrator if anything is unclear or some details are not included.
11) If it is an essay test, think about the question and mentally formulate an answer before you begin writing.
12) Pace yourself while taking the test. Do not try to be the first person finished. Allow enough time to review your answers at the end of the session.
13) If you can rule out one wrong answer choice, guess, even if there is a penalty for wrong answers.
14) Skip more difficult items and return to them later, particularly if there are a lot of questions.
15) When time permits, review your answers. Don't be overly eager to hand in your test paper before all the available time has elapsed.
Objectivity- A test is objective when the scorer's personal judgment does not affect the scoring. Objectivity eliminates the fixed opinions or judgments of the person who scores the test. It is the extent to which independent and competent examiners agree on what constitutes a good answer for each of the elements of a measuring instrument.
Selection-type items
Supply-type items
A holistic rubric involves one global, holistic rating with a single score for an
entire product or performance based on an overall impression. These are
useful for summative assessment where an overall performance rating is
needed, for example, portfolios.
A holistic rubric requires the teacher to score the overall process or product as
a whole, without judging the component parts separately.
Assessing student learning
Rubrics provide instructors with an effective means of learning-centered
feedback and evaluation of student work. As instructional tools, rubrics enable
students to gauge the strengths and weaknesses of their work and learning. As assessment tools, rubrics enable faculty to provide detailed and informative evaluations of students' work.
Methods Of Scoring In Standardized Tests
Different tests use different methods of scoring based on different needs. The three main categories of test scores are:
1. Raw Scores
2. Criterion-referenced Scores
3. Norm-referenced Scores (how most standardized tests are scored)
Limitations of grade-equivalent scores: the grade level is inappropriately used as a standard that all students must meet; such scores are often inapplicable when achievement at the secondary level or higher is being assessed; and they do not give a typical range of performance for students at that age or grade.
Standard Score
A standard score expresses how far a pupil's raw score lies from the norm group's mean, in standard deviation units, which makes scores from different tests comparable.
Types of Standard Score
Z-Score
The simplest standard score, and the one on which all others are based.
Formula: z = (X - M)/SD, where X is the person's score, M is the group's average, and SD is the group's spread (the standard deviation of the scores).
z is negative for scores that are below average, so z-scores are usually converted into some other system that has all positive numbers.
T-Score
Start with the scores that you want to make conform to the normal curve.
Get percentile ranks for each score.
Transform the percentiles into z-scores using a conversion table.
Then transform into any other standard score you want (e.g., T-score, IQ equivalents).
Hope that your assumption was right, namely, that the scores really do naturally follow a normal curve. If they don't, your interpretations (say, of equal units) may be somewhat mistaken.
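To make these conversions concrete, here is a minimal Python sketch on a hypothetical set of raw scores. NormalDist.inv_cdf (standard library, Python 3.8+) stands in for the percentile-to-z conversion table, and T = 50 + 10z is the usual linear T-score rescaling; both choices are illustrative assumptions, not prescribed by the text.

```python
from statistics import NormalDist, mean, stdev

raw_scores = [52, 61, 47, 70, 58, 65, 49, 55]   # hypothetical group of scores

m = mean(raw_scores)        # M: the group's average
sd = stdev(raw_scores)      # SD: the group's spread (standard deviation)

# z = (X - M) / SD -- negative for below-average scores
z_scores = [(x - m) / sd for x in raw_scores]

# A common linear rescaling removes the negatives: T = 50 + 10z
t_scores = [50 + 10 * z for z in z_scores]

# The normalized route described above: percentile rank -> z -> T-score.
ranked = sorted(raw_scores)
n = len(ranked)
for i, x in enumerate(ranked):
    pct = (i + 0.5) / n                # midpoint percentile rank
    z = NormalDist().inv_cdf(pct)      # inverse normal curve = conversion table
    print(f"raw={x}  z={z:+.2f}  T={50 + 10 * z:.1f}")
```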
Stanines
Stanines are standard scores on a nine-point scale (1-9) with a mean of 5 and a standard deviation of approximately 2.
IQ Scores-
Tests that measure intelligence have a mean of 100 and (for the most part) a standard deviation of 15. Most people will score between 85 and 115. Someone who scores below 70 is typically classified as having an intellectual disability.
Easy Convertibility
Uses
You can compare a student's scores on different tests and subtests when you convert all the scores to the same type of standard score, but all the tests must use the same norm group.
Plotting profiles can show a student's relative strengths and weaknesses. Profiles should be plotted as confidence bands to illustrate the margin of error, and scores should be interpreted as different only when their bands do not overlap.
Profiles are sometimes plotted separately by male and female (say, on vocational interest tests), but this is a controversial practice.
Tests sometimes come with tabular or narrative reports of profiles.
Test scores should be interpreted:
1. With clear knowledge about what the test measures. Don't rely on titles; examine the content (breadth, etc.).
2. In light of other factors (aptitudes, educational experiences, cultural
background, health, motivation, etc.) that may have affected test
performance
3. According to the type of decision being made (high or low for what?)
4. As a band of scores rather than a specific value. Always subtract and add 1 SEM from the score to get a range, to avoid overinterpretation (see the sketch after this list).
5. In light of all your evidence. Look for corroborating or conflicting
evidence
6. Never rely on a single score to make a big decision
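The band-of-scores rule in point 4, together with the band-overlap rule from the profile discussion above, can be sketched in a few lines of Python; the scores and SEM value below are hypothetical.

```python
# Report each score as score +/- 1 SEM, and treat two scores as different
# only when their bands do not overlap. All values are hypothetical.

def band(score, sem):
    """The range obtained by subtracting and adding 1 SEM."""
    return (score - sem, score + sem)

def clearly_different(a, b, sem):
    """True only when the two confidence bands do not overlap."""
    low_a, high_a = band(a, sem)
    low_b, high_b = band(b, sem)
    return high_a < low_b or high_b < low_a

verbal, quantitative, sem = 74, 68, 4        # hypothetical subtest scores
print(band(verbal, sem))                     # (70, 78)
print(band(quantitative, sem))               # (64, 72)
print(clearly_different(verbal, quantitative, sem))  # False: bands overlap
```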
Marking Versus Grading
Giving marks and grades in response to students' work is part of teachers' routine work. Marking refers to assigning marks or points to a student's performance against a marking scheme set for a test or an assignment. More often than not, marking and scoring are regarded as part of the normal practice of "grading".
Mark (0-100 scale) : Comment
70+ : Work of exceptional quality
60-69 : Work of very good quality
50-59 : Work of good quality
40-49 : Work of satisfactory standard
30-39 : Compensatable fail
0-29 : Fail

Degree classification:
0-30% : Fail
35-39% : Pass degree
40-49% : Third class honours
50-59% : Lower second class honours
60-69% : Upper second class honours
70% or more : First class honours
Absolute grading gives the student marks for her essay answer depending on how well the essay has met the assessment criteria, and is usually expressed as a percentage or letter, e.g. 60% or B.
Relative grading tells the student how his essay answer rated in relation to other students taking the same test, by indicating whether he was average, above average, or below average. Relative grading usually uses a letter scale such as A, B, C, D and F. Some teachers would argue that two grades are the best way of marking, so that students are given either a pass or a fail grade.
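The contrast between the two approaches can be sketched as follows; the absolute cutoffs echo the 0-100 mark table earlier in this section, the relative grader simply reports a position against the group, and all scores and group data are hypothetical.

```python
from statistics import mean, stdev

def absolute_grade(mark):
    """Absolute grading: the mark alone determines the grade."""
    if mark >= 70: return "A"   # exceptional quality
    if mark >= 60: return "B"   # very good
    if mark >= 50: return "C"   # good
    if mark >= 40: return "D"   # satisfactory
    return "F"

def relative_grade(mark, group):
    """Relative grading: position relative to others taking the same test."""
    m, sd = mean(group), stdev(group)
    if mark > m + sd:
        return "above average"
    if mark < m - sd:
        return "below average"
    return "average"

group = [45, 52, 58, 61, 66, 71, 49, 55]     # hypothetical class marks
print(absolute_grade(66))                    # B, regardless of the group
print(relative_grade(66, group))             # depends entirely on the group
```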
This gets over the problem of deciding what constitutes an A or a C grade, but it does reduce the information conveyed by a particular grade, since no discrimination is made between students who pass with a very high level of achievement and those who barely pass.
Example of a 7-point grading scale-
O- Outstanding
A- Very good
B- Good
C- Average
D- Below average
E- Poor
F- Very poor
Example of a 5-point grading scale-
A+- Excellent
A- Good
B- Average
C- Satisfactory
D- Fail
Strengths-
Easy to use.
Easy to interpret theoretically.
Provide a concise summary.
Limitations-
Strengths-
Easy to use.
Easy to interpret theoretically.
Provide a concise summary.
May be combined with letter grades.
More continuous than letter grades.
Limitations-
Less reliable.
Does not contain enough information about students' achievement.
Provides no indication of the level of learning.
Checklists and rating scales- these are more detailed, and because they are so detailed they are cumbersome for teachers to prepare.
Uses of grading
When using absolute grading against specific criteria, it is useful to use the analytic method of marking. In this method, a marking scheme is prepared in advance and marks are allocated to the specific points of content in the marking specification.
The global method is also termed structured impressionistic marking, and is best used with relative grading. This method still requires a marking specification, but in this case it serves only as a standard of comparison. The grades used are not usually percentages but a scale, such as "excellent/good/average/below average/unsatisfactory". Scales can be devised according to preference, but it is important to select examples of answers that serve as standards for each of the points on the scale. The teacher then reads each answer through very quickly and puts it in the appropriate pile, depending on whether it gives the impression of excellent, good, etc. The process is then repeated, and it is much more effective if a colleague is asked to do the second reading. This method is much faster than the analytical one and can be quite effective for large numbers of questions.
Uses of marking
Marking has two distinct stakeholders, the students and the tutor. Both should use marking as a means of raising achievement and attainment. Marking helps students to:
Identify carelessness.
Proof-read, i.e. by making them check their work for spelling, punctuation, etc.
Improve draft work- students can become actively involved in improving their own work.
Identify areas of weakness and strength.
Identify areas that lack understanding and knowledge.
Become more motivated and attach more value to their work.
Scoring Essay Questions
Prepare an outline of the expected answer in advance.
Use the scoring method which is most appropriate.
o Point method: each answer is compared to the ideal answer in the scoring key, and a given number of points is assigned in terms of the adequacy of the answer.
o Rating method: where the rating method is used, it is desirable to make separate ratings for each characteristic evaluated; that is, answers should be rated separately for organization, comprehensiveness, relevance of ideas, and the like.
Decide on provisions for handling factors which are irrelevant to the learning outcomes being measured.
o Legibility of handwriting, spelling, sentence structure, punctuation and neatness: special efforts should be made to keep such factors from influencing our judgment.
Evaluate all answers to one question before going on to the next question.
o The halo effect is less likely to form when the answers of a given pupil are not evaluated in continuous sequence.
Evaluate the answers without looking at the pupil's name.
If especially important decisions are to be based on the results, obtain
two or more independent ratings.
Factors irrelevant to what is being measured include-
Handwriting style
Grammar
Knowledge of the students
Neatness
1. Holistic scoring- in this type, a total score is assigned to each essay item based on the teacher's general impression or overall assessment.
2. Analytic scoring- in this type, the essay is scored in terms of each component.
Carryover effect- the teacher develops an impression of the quality of the answer from one item and carries it over to the next response. If the student answers one item well, the teacher may be influenced to score subsequent responses at a similarly high level; the same situation may occur with a poor response.
Halo effect- the scorer's general impression of the pupil influences the scoring of individual answers.
Scoring Guidelines
These are the descriptions of scoring criteria that the trained readers will follow
to determine the score (1–6) for your essay. Papers at each level exhibit all or
most of the characteristics described at each score point.
Score = 6
Essays within this score range demonstrate effective skill in responding to the
task.
The essay shows a clear understanding of the task. The essay takes a position on
the issue and may offer a critical context for discussion. The essay addresses
complexity by examining different perspectives on the issue, or by evaluating the
implications and/or complications of the issue, or by fully responding to
counterarguments to the writer's position. Development of ideas is ample,
specific, and logical. Most ideas are fully elaborated. A clear focus on the
specific issue in the prompt is maintained. The organization of the essay is clear:
the organization may be somewhat predictable or it may grow from the writer's
purpose. Ideas are logically sequenced. Most transitions reflect the writer's logic
and are usually integrated into the essay. The introduction and conclusion are
effective, clear, and well developed. The essay shows a good command of
language. Sentences are varied and word choice is varied and precise. There are
few, if any, errors to distract the reader.
Score = 5
Essays within this score range demonstrate competent skill in responding to the
task.
The essay shows a clear understanding of the task. The essay takes a position on
the issue and may offer a broad context for discussion. The essay shows
recognition of complexity by partially evaluating the implications and/or
complications of the issue, or by responding to counterarguments to the writer's
position. Development of ideas is specific and logical. Most ideas are elaborated,
with clear movement between general statements and specific reasons, examples,
and details. Focus on the specific issue in the prompt is maintained. The
organization of the essay is clear, although it may be predictable. Ideas are
logically sequenced, although simple and obvious transitions may be used. The
introduction and conclusion are clear and generally well developed. Language is
competent. Sentences are somewhat varied and word choice is sometimes varied
and precise. There may be a few errors, but they are rarely distracting.
Score = 4
Essays within this score range demonstrate adequate skill in responding to the
task.
The essay shows an understanding of the task. The essay takes a position on the
issue and may offer some context for discussion. The essay may show some
recognition of complexity by providing some response to counterarguments to
the writer's position. Development of ideas is adequate, with some movement
between general statements and specific reasons, examples, and details. Focus on
the specific issue in the prompt is maintained throughout most of the essay. The
organization of the essay is apparent but predictable. Some evidence of logical
sequencing of ideas is apparent, although most transitions are simple and
obvious. The introduction and conclusion are clear and somewhat developed.
Language is adequate, with some sentence variety and appropriate word choice.
There may be some distracting errors, but they do not impede understanding.
Score = 3
Essays within this score range demonstrate some developing skill in responding
to the task.
The essay shows some understanding of the task. The essay takes a position on
the issue but does not offer a context for discussion. The essay may acknowledge
a counterargument to the writer's position, but its development is brief or unclear.
Development of ideas is limited and may be repetitious, with little, if any,
movement between general statements and specific reasons, examples, and
details. Focus on the general topic is maintained, but focus on the specific issue
in the prompt may not be maintained. The organization of the essay is simple.
Ideas are logically grouped within parts of the essay, but there is little or no
evidence of logical sequencing of ideas. Transitions, if used, are simple and
obvious. An introduction and conclusion are clearly discernible but
underdeveloped. Language shows a basic control. Sentences show a little variety
and word choice is appropriate. Errors may be distracting and may occasionally
impede understanding.
Score = 2
Essays within this score range demonstrate inconsistent or weak skill in
responding to the task.
The essay shows a weak understanding of the task. The essay may not take a
position on the issue, or the essay may take a position but fail to convey reasons
to support that position, or the essay may take a position but fail to maintain a
stance. There is little or no recognition of a counterargument to the writer's
position. The essay is thinly developed. If examples are given, they are general
and may not be clearly relevant. The essay may include extensive repetition of
the writer's ideas or of ideas in the prompt. Focus on the general topic is
maintained, but focus on the specific issue in the prompt may not be maintained.
There is some indication of an organizational structure, and some logical
grouping of ideas within parts of the essay is apparent. Transitions, if used, are
simple and obvious, and they may be inappropriate or misleading. An
introduction and conclusion are discernible but minimal. Sentence structure and
word choice are usually simple. Errors may be frequently distracting and may
sometimes impede understanding.
Score = 1
Essays within this score range show little or no skill in responding to the task.
The essay shows little or no understanding of the task. If the essay takes a
position, it fails to convey reasons to support that position. The essay is
minimally developed. The essay may include excessive repetition of the writer's
ideas or of ideas in the prompt. Focus on the general topic is usually maintained,
but focus on the specific issue in the prompt may not be maintained. There is
little or no evidence of an organizational structure or of the logical grouping of
ideas. Transitions are rarely used. If present, an introduction and conclusion are
minimal. Sentence structure and word choice are simple. Errors may be
frequently distracting and may significantly impede understanding.
No Score
Blank, Off-Topic, Illegible, Not in English, or Void.
Scoring Objective Items
The following are methods of scoring objective items:
Scoring key
Strip key
Scoring stencil
Scoring key: if the pupils' answers are recorded on the test paper itself, a scoring key is usually made by marking the correct answers on a blank copy of the test. The scoring procedure is then simply a matter of comparing the columns of answers on this master copy with the columns of answers on each pupil's paper.
Strip key: a strip key, which consists merely of a strip of paper on which the columns of answers are recorded, may also be used.
Scoring stencil: where separate answer sheets are used, a scoring stencil is most convenient. This is a blank answer sheet with holes punched where the correct answers should appear.
One of the most important advantages of the objective-type test is ease and accuracy of scoring. The best way to score objective tests is with a test scanner; this technology can speed up scoring and minimize scoring errors.
When using a test scanner, a scoring key is prepared on a machine-scorable answer sheet and is read by the scanner first. After the scanner reads the scoring key, the student responses are read and stored on the hard disk of an attached computer.
A separate program is used to score the student responses by comparing each response to the correct answer on the answer key. When this process is complete, each student's score is printed along with item analysis information.
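A minimal sketch of that comparison step, assuming the simple case of one correct answer worth one point per item; the key and responses below are hypothetical.

```python
# Compare each student's responses to the scoring key, item by item,
# exactly as a stencil or scanner program would. Data are hypothetical.

scoring_key = ["B", "D", "A", "C", "B"]

students = {
    "pupil_1": ["B", "D", "A", "C", "B"],    # all correct
    "pupil_2": ["B", "A", "A", "C", "D"],    # two wrong
}

def score(responses, key):
    """One point for every response that matches the key."""
    return sum(r == k for r, k in zip(responses, key))

for name, responses in students.items():
    print(name, score(responses, scoring_key), "out of", len(scoring_key))
```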
Item Analysis
The procedure used to judge the quality of an item is called item analysis. Its purpose is to ascertain whether the questions/items do their job effectively. A detailed test and item analysis has to be done before a meaningful and scientific inference about the test can be made in terms of its validity, reliability, objectivity and usability.
Such an analysis reveals, for the group tested:
The central tendency of the marks obtained, e.g. normal/average, positive or negative skewness, high or low value.
The variability, characterized by the standard deviation (SD), which indicates the nature of the spread of marks: the greater the spread, the greater the value of the standard deviation.
The coefficient of reliability for the test, indicating the degree of consistency with which the test has measured the students' abilities. A high value means that the test is reliable and produces virtually repeatable scores for the students.
Item analysis is useful in making meaningful interpretations and value judgments about students' performance.
A teacher or paper setter comes to know whether the items had the right
level of difficulty and whether there was discrimination between more able
and less able students.
Item analysis defines and maintains standards of performance and ensures comparability of standards. It also helps:
o To understand the behavior of items,
o To become better item writers and scientific, professional, competent teachers.
Steps involved in Item Analysis
For each item, count the number of students in each group who answered the item correctly. For alternate-response items, count the number of students in each group who chose each alternative.
Award a score to each student.
A practical, simple and rapid method is to perforate, on a blank answer sheet, the boxes corresponding to the correct answers; placing the perforated sheet over a student's answer sheet, the raw score can be found almost automatically.
For each item, compute the percentage of students who got the item correct; this is called the 'item difficulty index'.
1. D = (R / N) × 100
R: number of pupils who answered the item correctly.
N: total number of pupils who tried the item.
The higher the difficulty index, the easier the item. The difficulty level (facility level) of a test is an index of how easy or difficult the test is: it is the ratio of the average score of a sample of subjects on the test to the maximum possible score on the test, usually expressed as a percentage.
2. Difficulty level = (average score on the test / maximum possible score) × 100
3. Difficulty index = ((H + L) / N) × 100
H: number of correct answers in the high group.
L: number of correct answers in the low group.
N: total number of students in both groups.
4. Find out the facility value of objective tests first.
5. Facility value = (number of students answering the question correctly / number of students who took the test) × 100
If the facility value is 70 or above, those are easy questions; if it is below 70, the questions are difficult ones.
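A short sketch of these computations in Python; every count below is hypothetical.

```python
# Item difficulty / facility value: D = (R / N) * 100, where a higher
# value means an easier item. All counts are hypothetical.

N = 40                                         # pupils who tried the items
correct_counts = {"item_1": 34, "item_2": 18, "item_3": 27}

for item, R in correct_counts.items():
    facility = (R / N) * 100
    label = "easy" if facility >= 70 else "difficult"
    print(f"{item}: facility value = {facility:.0f}% ({label})")

# Difficulty level of the whole test: average score / maximum possible score.
average_score, max_possible = 29.5, 50         # hypothetical
print(f"test difficulty level = {average_score / max_possible * 100:.0f}%")
```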
The discriminating power (validity index) of an item refers to the degree to which
a given item discriminates among students who differ sharply in the functions
measured by the test as a whole.
Formula-1
DI = (RU - RL) / (N/2)
RU: number of pupils in the upper group who answered the item correctly.
RL: number of pupils in the lower group who answered the item correctly.
N: total number of pupils who tried the item.
Formula-2
No. of HAQ: number of students in the high-ability group answering the question correctly.
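Formula 1 translates directly into code; the group counts below are hypothetical, with upper and lower groups of equal size.

```python
# DI = (RU - RL) / (N/2): positive when the upper group outperforms
# the lower group on the item. All counts are hypothetical.

def discrimination_index(ru, rl, n):
    """ru/rl: correct answers in upper/lower group; n: pupils in both groups."""
    return (ru - rl) / (n / 2)

RU, RL, N = 18, 7, 40        # hypothetical: 20 pupils per group
print(discrimination_index(RU, RL, N))   # 0.55: favours the stronger pupils
```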
Item analysis is a general term that refers to the specific methods used in
education to evaluate test items, typically for the purpose of test
construction and revision.
Regarded as one of the most important aspects of test construction and
increasingly receiving attention, it is an approach incorporated into item
response theory (IRT), which serves as an alternative to classical
measurement theory (CMT) or classical test theory (CTT). Classical
measurement theory considers a score to be the direct result of a person's
true score plus error.
It is this error that is of interest as previous measurement theories have
been unable to specify its source. However, item response theory uses item
analysis to differentiate between types of error in order to gain a clearer
understanding of any existing deficiencies.
Particular attention is given to individual test items, item characteristics,
probability of answering items correctly, overall ability of the test taker,
and degrees or levels of knowledge being assessed.
Tests can be improved by maintaining and developing a pool of valid items
from which future tests can be drawn and that cover a reasonable span of
difficulty levels.
Item analysis helps improve test items and identify unfair or biased items.
Results should be used to refine test item wording. In addition, closer
examination of items will also reveal which questions were most difficult,
perhaps indicating a concept that needs to be taught more thoroughly.
If a particular distracter (that is, an incorrect answer choice) is the most
often chosen answer, and especially if that distracter positively correlates
with a high total score, the item must be examined more closely for
correctness. This situation also provides an opportunity to identify and
examine common misconceptions among students about a particular
concept.
In general, once test items have been created, the value of these items can be systematically assessed using several methods representative of item analysis:
a) a test item's level of difficulty,
b) an item's capacity to discriminate, and
c) the item characteristic curve (a short sketch of which follows).
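As a small illustration of method (c), the following computes a one-parameter (Rasch-type) item characteristic curve, one common way such curves are modeled; the difficulty parameter and ability values are hypothetical.

```python
import math

def icc(theta, b):
    """P(correct) for ability theta and item difficulty b (1PL model)."""
    return 1 / (1 + math.exp(-(theta - b)))

b = 0.5                       # hypothetical item difficulty, in logits
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}  P(correct)={icc(theta, b):.2f}")
```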
Item Difficulty
Perhaps "item difficulty" should have been named "item easiness": it expresses the proportion or percentage of students who answered the item correctly. Item difficulty can range from 0.0 (none of the students answered the item correctly) to 1.0 (all of the students answered the item correctly). Experts recommend that the average level of difficulty for a four-option multiple-choice test should be between 60% and 80%; an average level of difficulty within this range can be obtained, of course, even when the difficulty of individual items falls outside of it. If an item has a low difficulty value, say less than .25, there are several possible causes: the item may have been miskeyed; the item may be too challenging relative to the overall level of ability of the class; the item may be ambiguous or not written clearly; or there may be more than one correct answer.
Further insight into the cause of a low difficulty value can often be gained by
examining the percentage of students who chose each response option. For
example, when a high percentage of students chose a single option other than the
one that is keyed as correct, it is advisable to check whether a mistake was made
on the answer key.
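That check amounts to tabulating option frequencies for the item; a sketch with hypothetical responses follows.

```python
from collections import Counter

keyed_answer = "C"                                        # hypothetical key
responses = ["B", "B", "C", "B", "B", "A", "B", "D", "B", "B"]

counts = Counter(responses)
for option in "ABCD":
    share = counts[option] / len(responses) * 100
    flag = "  <- keyed answer" if option == keyed_answer else ""
    print(f"{option}: {share:.0f}%{flag}")

# 70% chose B although C is keyed: worth re-checking the answer key.
```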
Item Statistics
Item statistics are used to assess the performance of individual test items on the
assumption that the overall quality of a test derives from the quality of its items.
Item Number.
This is the question number taken from the student answer sheet. Up to 150 items
can be scored on the Standard Answer Sheet (purple).
Item Difficulty.
For items with one correct alternative worth a single point, the item difficulty is
simply the percentage of students who answer an item correctly. In this case, it is
also equal to the item mean. The item difficulty index ranges from 0 to 100; the
higher the value, the easier the question. When an alternative is worth other than
a single point, or when there is more than one correct alternative per question, the
item difficulty is the average score on that item divided by the highest number of
points for any one alternative.
Item difficulty is relevant for determining whether students have learned the concept being tested. It also plays an important role in the ability of an item to discriminate between students who know the material and those who do not. The item will have low discrimination if it is so difficult that almost everyone gets it wrong or guesses, or so easy that almost everyone gets it right.
Ideal average difficulty levels by item type:
Five-response multiple-choice : 70
Four-response multiple-choice : 74
Three-response multiple-choice : 77
True-false (two-response multiple-choice) : 85
ScorePak® classifies item difficulty as "easy" if the index is 85% or above, "moderate" if it is between 51 and 84%, and "hard" if it is 50% or below.
Item Discrimination
Item discrimination refers to the ability of an item to differentiate among students
on the basis of how well they know the material being tested. Various hand
calculation procedures have traditionally been used to compare item responses to
total test scores using high and low scoring groups of students. Computerized
analyses provide more accurate assessment of the discrimination power of items
because they take into account responses of all students rather than just high and
low scoring groups.
The item discrimination index is computed by comparing student responses to a particular item with total scores on all other items on the test. This index is the equivalent of a point-biserial coefficient in this application. It provides an estimate of the degree to which an individual item is measuring the same thing as the rest of the items.
Because the discrimination index reflects the degree to which an item and the test
as a whole are measuring a unitary ability or attribute, values of the coefficient
will tend to be lower for tests measuring a wide range of content areas than for
more homogeneous tests.
Item discrimination indices must always be interpreted in the context of the type
of test which is being analyzed.
Items with low discrimination indices are often ambiguously worded and should
be examined.
Items with negative indices should be examined to determine why a negative
value was obtained.
For example, a negative value may indicate that the item was miskeyed, so that
students who
knew the material tended to choose an unkeyed, but correct, response option.
Tests with high internal consistency consist of items with mostly positive relationships with total test score. In practice, values of the discrimination index will seldom exceed .50 because of the differing shapes of item and total score distributions. ScorePak® classifies item discrimination as "good" if the index is above .30, "fair" if it is between .10 and .30, and "poor" if it is below .10.
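A sketch of the computation: correlate 0/1 responses on one item with total scores on all other items. A plain Pearson correlation is used here (the point-biserial is the special case of it for a 0/1 variable), and the response matrix is hypothetical.

```python
from statistics import mean, stdev

matrix = [              # students x items, 1 = correct, 0 = incorrect
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

item = 0
item_scores = [row[item] for row in matrix]
rest_scores = [sum(row) - row[item] for row in matrix]   # total minus this item
print(f"discrimination (item {item}): {pearson(item_scores, rest_scores):+.2f}")
```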
Alternate Weight.
This column shows the number of points given for each response alternative. For most tests, there will be one correct answer which will be given one point, but ScorePak® allows multiple correct alternatives, each of which may be assigned a different weight.
Means.
The mean total test score (minus that item) is shown for students who selected each of the possible response alternatives. This information should be looked at in conjunction with the discrimination index; higher total test scores should be obtained by students choosing the correct, or most highly weighted, alternative. Incorrect alternatives with relatively high means should be examined to determine why "better" students chose that particular alternative.
Frequencies and Distribution.
The number and percentage of students who choose each alternative are reported. The bar graph on the right of the report shows the percentage choosing each response. Frequently chosen wrong alternatives may indicate common misconceptions among the students.
Test Statistics
Two statistics are provided to evaluate the performance of the test as a whole.
Reliability Coefficient.
The reliability of a test refers to the extent to which the test is likely to
produce consistent scores. The particular reliability coefficient reflects three
characteristics of the test:
1. The intercorrelations among the items -- the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test's reliability coefficient are related in this regard.
2. The length of the test -- a test with more items will have a higher reliability, all other things being equal.
3. The content of the test -- generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.
High reliability means that the questions of a test tended to "pull together." Students who answered a given question correctly were more likely to answer other questions correctly. If a parallel test were developed by using similar items, the relative scores of students would show little change.
Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect peculiarities of the items or the testing situation more than students' knowledge of the subject matter.
As with many statistics, it is dangerous to interpret the magnitude of a reliability coefficient out of context. High reliability should be demanded in situations in which a single test score is used to make major decisions, such as professional licensure examinations. Because classroom examinations are typically combined with other scores to determine grades, the standards for a single test need not be as stringent. The following general guidelines can be used to interpret reliability coefficients for classroom exams:
Reliability : Interpretation
.90 and above : Excellent reliability; at the level of the best standardized tests.
.80 - .90 : Very good for a classroom test.
.70 - .80 : Good for a classroom test; in the range of most classroom tests. There are probably a few items which could be improved.
.60 - .70 : Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 - .60 : Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below : Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
Coefficient Alpha.
Coefficient alpha is the measure of reliability used here; it is the general form of the more commonly reported KR-20 and can be applied to tests composed of items with different numbers of points given for different response alternatives. When coefficient alpha is applied to tests in which each item has only one correct answer and all correct answers are worth the same number of points, the resulting coefficient is identical to KR-20.
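Coefficient alpha is straightforward to compute by hand; a sketch on a small hypothetical 0/1 response matrix follows. With dichotomous items worth one point each, this calculation coincides with KR-20.

```python
from statistics import pvariance

matrix = [              # students x items, 1 = correct (hypothetical)
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]

k = len(matrix[0])                                   # number of items
item_vars = [pvariance([row[i] for row in matrix]) for i in range(k)]
total_var = pvariance([sum(row) for row in matrix])  # variance of total scores

# alpha = k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"coefficient alpha = {alpha:.2f}")   # low here: tiny illustrative data
```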
Summary
Conclusion
From the above discussion, I conclude that proper knowledge of good test administration practices and of the various methods of scoring helps to improve students' performance and the teacher's evaluation skill.