
Assessment in Learning 1

Table of Specifications

The table of specifications (TOS) is a tool used to ensure that a test or assessment measures the content
and thinking skills that it intends to measure. Thus, when used appropriately, it can provide content
validity and construct (i.e., response process) validity evidence. A TOS may be used for large-scale test
construction, classroom-level assessments by teachers, and psychometric scale development. It is a
foundational tool in designing tests or measures for research and educational purposes. The primary
purpose of a TOS is to ensure alignment between the items or elements of an assessment and the
content, skills, or constructs that the assessment intends to assess.
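As a minimal sketch of this idea (the content areas, cognitive levels, and item counts below are hypothetical), a two-way TOS can be represented as a grid of content areas by cognitive levels, with each cell holding the number of items planned for that combination:

```python
# Hypothetical two-way table of specifications: rows are content areas,
# columns are cognitive levels, and each cell is the planned item count.
tos = {
    "Fractions":   {"Remembering": 4, "Applying": 3, "Analyzing": 1},
    "Decimals":    {"Remembering": 3, "Applying": 3, "Analyzing": 2},
    "Percentages": {"Remembering": 2, "Applying": 4, "Analyzing": 3},
}

# Row totals show how the planned test weights each content area.
for content, levels in tos.items():
    print(content, sum(levels.values()))

print("Total items:", sum(sum(levels.values()) for levels in tos.values()))
```

Totalling the rows and columns shows at a glance how the planned test weights each content area and each level of thinking, which is exactly the alignment check the TOS exists to provide.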

TEST CONSTRUCTION

Characteristics of a Good Test

• Validity – the extent to which the test measures what it intends to measure
• Reliability – the consistency with which a test measures what it is supposed to measure
• Usability – the test can be administered with ease, clarity, and uniformity
• Scorability – the test is easy to score
• Interpretability – test results can be properly interpreted and serve as a sound basis for making
educational decisions
• Economy – the test can be reused without compromising its validity and reliability

VALIDITY

There are four main types of validity:

1. Construct validity: Does the test measure the concept that it’s intended to measure?
2. Content validity: Is the test fully representative of what it aims to measure?
3. Face validity: Does the content of the test appear to be suitable to its aims?
4. Criterion validity: Do the results correspond to a different test of the same thing?

RELIABILITY

There are four main types of reliability. Each can be estimated by comparing different sets of results
produced by the same method.

Type of reliability Measures the consistency of…

Test-retest --------------------------- The same test over time.

Interrater --------------------------- The same test conducted by different people.

Parallel forms --------------------------- Different versions of a test which are designed to be equivalent.

Internal consistency --------------------- The individual items of a test.

MEASURES OF CENTRAL TENDENCY

What are the measures of central tendency?

A measure of central tendency (also referred to as a measure of centre or central location) is a summary
measure that attempts to describe a whole set of data with a single value that represents the middle or
centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each of these
measures describes a different indication of the typical or central value in the distribution.

What is the mode?

The mode is the most commonly occurring value in a distribution.

Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.
Age    Frequency
54     3
55     1
56     1
57     2
58     2
60     2

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
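The same result can be checked in code; this sketch applies Python's standard-library `mode()` and `multimode()` to the retirement-age data above:

```python
from statistics import mode, multimode

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# mode() returns the single most common value; multimode() returns every
# value tied for the highest frequency (useful for bimodal data).
print(mode(ages))       # -> 54
print(multimode(ages))  # -> [54]
```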

Advantage of the mode:

The mode has an advantage over the median and the mean as it can be found for both numerical and
categorical (non-numerical) data.

Limitations of the mode:

There are some limitations to using the mode. In some distributions, the mode may not reflect the centre
of the distribution very well. When the distribution of retirement age is ordered from lowest to highest
value, it is easy to see that the centre of the distribution is 57 years, but the mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

It is also possible for there to be more than one mode in the same distribution (bimodal or multimodal
data). The presence of more than one mode limits the mode's ability to describe the centre or typical
value of the distribution, because a single value to describe the centre cannot be identified.

In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e.
if all values are different).

In cases such as these, it may be better to consider using the median or mean, or to group the data into
appropriate intervals and find the modal class.

What is the median?

The median is the middle value in a distribution when the values are arranged in ascending or descending
order.

The median divides the distribution in half (there are 50% of observations on either side of the median
value). In a distribution with an odd number of observations, the median value is the middle value.

Looking at the retirement age distribution (which has 11 observations), the median is the middle value,
which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the mean of the two
middle values. In the following distribution, the two middle values are 56 and 57, therefore the median
equals 56.5 years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
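Both cases can be verified with Python's standard-library `median()`, which picks the middle value for an odd number of observations and averages the two middle values for an even number:

```python
from statistics import median

odd_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
even_ages = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# With 11 observations the median is the 6th value; with 12 it is the
# mean of the 6th and 7th values.
print(median(odd_ages))   # -> 57
print(median(even_ages))  # -> 56.5
```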

Advantage of the median:

The median is less affected by outliers and skewed data than the mean, and is usually the preferred
measure of central tendency when the distribution is not symmetrical.

Limitation of the median:

The median cannot be identified for nominal categorical data, as such data cannot be logically ordered.

What is the mean?

The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.

Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623)
and dividing by the number of observations (11) which equals 56.6 years.
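The same calculation in code, using Python's standard-library `mean()`:

```python
from statistics import mean

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# The mean is the sum of all values divided by the number of observations.
print(sum(ages))             # -> 623
print(round(mean(ages), 1))  # -> 56.6
```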

Advantage of the mean:

The mean can be used for both continuous and discrete numeric data.

Limitations of the mean:

The mean cannot be calculated for categorical data, as the values cannot be summed.

Because the mean includes every value in the distribution, it is influenced by outliers and skewed
distributions.

Guidelines for Constructing Effective Test Items

1. Multiple Choice Questions

Many users regard the multiple-choice item as the most flexible and probably the most effective of the
objective item types. A standard multiple-choice test item consists of two basic parts:

1) Problem (stem) and


2) List of suggested solutions (alternatives).

The stem may be in the form of either a question or an incomplete statement, and the list of
alternatives contains one correct or best alternative (answer) and a number of incorrect or inferior
alternatives (distractors).

The purpose of the distractors is to appear as plausible solutions to the problem for those students who
have not achieved the objective being measured by the test item. Conversely, the distractors must
appear as implausible solutions for those students who have achieved the objective. Only the answer
should appear reasonable to these students.

Multiple-choice items are flexible in measuring all levels of cognitive skills. They permit a wide sampling of
content and objectives, provide highly reliable test scores, reduce the guessing factor compared with
true-false items, and can be machine-scored quickly and accurately. On the other hand, multiple-choice
items are difficult and time-consuming to construct and depend on students' reading skills and the
instructor's writing ability. The simplicity of writing low-level knowledge items leads instructors to neglect
writing items that test higher-level thinking. These questions may also encourage guessing (though less
than true-false items).

Guidelines for Writing Multiple Choice Questions

1. Design each item to measure an important learning outcome; present a single, clearly formulated
problem in the stem of the item; put the alternatives at the end of the question, not in the middle; and
put as much of the wording as possible in the stem.

Poor: Skinner developed programmed instruction in _____.

a. 1953

b. 1954 (correct)

c. 1955

d. 1956

Better: Skinner developed programmed instruction in _____.

a. 1930s

b. 1940s
c. 1950s (correct)

d. 1970s

2. All options should be homogeneous and reasonable, punctuation should be consistent, and all options
should be grammatically consistent with the stem.

Poor: What is a claw hammer?

a. a woodworking tool (correct)

b. a musical instrument

c. a gardening tool

d. a shoe repair tool

Better: What is a claw hammer?

a. a woodworking tool (correct)

b. a metalworking tool

c. an auto body tool

d. a sheet metal tool

3. Reduce the length of the alternatives by moving as many words as possible to the stem. The
justification is that additional words in the alternatives have to be read four or five times.

Poor: The mean is _____.

a. a measure of the average (correct).

b. a measure of the midpoint

c. a measure of the most popular score

d. a measure of the dispersion of scores

Better: The mean is a measure of the _____.

a. average (correct)

b. midpoint

c. most popular score


d. dispersion of scores

4. Construct the stem so that it conveys a complete thought, and avoid negatively worded items (e.g.,
"Which of the following is not…?"), textbook wording, and unnecessary words.

Poor: Objectives are _____.

a. used for planning instruction (correct)

b. written in behavioral form only

c. the last step in the instructional design process

d. used in the cognitive but not affective domain

Better: The main function of instructional objectives is _____.

a. planning instruction (correct)

b. comparing teachers

c. selecting students with exceptional abilities

d. assigning students to academic programs.

5. Do not make the correct answer stand out because of its phrasing or length. Avoid overusing "always"
and "never" in the alternatives, and avoid overusing "all of the above" and "none of the above". When "all
of the above" is used, students can eliminate it simply by knowing that one answer is false; conversely,
they will know to select it if any two answers are true.

Poor: A narrow strip of land bordered on both sides by water is called an _____.

a. isthmus (correct)

b. peninsula

c. bayou

d. continent

(Note: Do you see why option (a) would be the best guess, given the phrasing?)

Better: A narrow strip of land bordered on both sides by water is called a(n) _____.

2. True-False Items
True-false items typically present a declarative statement that the student must mark as either true
or false. Instructors generally use true-false items to measure the recall of factual knowledge such as
names, events, dates, definitions, etc. However, this format has the potential to measure higher levels
of cognitive ability, such as comprehension of significant ideas and their application in solving problems.

They are relatively easy to write and can be answered quickly by students. Students can answer 50 true-
false items in the time it takes to answer 30 multiple-choice items. They provide the widest sampling of
content per unit of time.

On the other hand, the problem of guessing is a major weakness. Students have a fifty-percent chance of
correctly answering an item without any knowledge of the content. Items are often ambiguous because
of the difficulty of writing statements that are unequivocally true or false.

Rules for Writing True/False Items:

1. Be certain that the statement is completely true or false.

Poor: A good instructional objective will identify a performance standard. (True/False) (Note: The
correct answer here is technically false, but the statement is debatable. While a performance
standard is a feature of some "good" objectives, it is not necessary for an objective to be good.)

Better: A performance standard of an objective should be stated in measurable terms. (True/False).


(Note: The answer here is clearly true.)

2. Convey only one thought or idea in a true/false statement and avoid verbal clues (specific
determiners like “always”) that indicate the answer.

Poor: Bloom’s cognitive taxonomy of objectives includes six levels of objectives, the lowest being
knowledge. (True/False)

Better: Bloom’s cognitive taxonomy includes six levels of objectives. (True/False)

Knowledge is the lowest-level objective in Bloom’s cognitive taxonomy. (True/False)


3. Do not copy sentences directly from textbooks or other written materials, and keep the word length
of true statements about the same as that of false statements. Require learners to write a short
explanation of why false answers are incorrect. This discourages students from cramming or
memorizing.

Poor: Abstract thinking is intelligence. (True/False)

Better: According to Garrett, abstract thinking is intelligence. (True/False)

3. Completion Items

Completion items provide a wide sampling of content and minimize guessing compared with multiple-choice
and true-false items. However, they can rarely be written to measure more than simple recall of
information; they are more time-consuming to score than other objective types; and they are difficult to
write so that there is only one correct answer and no irrelevant clues.

Rules for Writing Completion Items:

1. Start with a direct question, switch to an incomplete statement, and place the blank at the end of
the statement.

Poor: What is another name for cone-bearing trees? (Coniferous trees)

Better: Cone-bearing trees are also called______. (Coniferous trees)

2. Leave only one blank, and make sure it relates to the main point of the statement. Provide two blanks
only if the answer consists of two consecutive words.

Poor: The ___ is the ratio of the ___ to the___.

Better: The sine is the ratio of the ___ to the ___. (opposite side, hypotenuse)

3. Make the blanks uniform in length and avoid giving irrelevant clues to the correct answer.

Poor: The first president of the United States was _____. (Two words)

(Note: The desired answer is George Washington, but students may write “from Virginia”, “a general”,
and other creative expressions.)

Better: Give the first and last name of the first president of the United States: _____.
4. Matching Items

A matching exercise typically consists of a list of questions or problems to be answered along with a list
of responses. The examinee is required to make an association between each question and a response. A
large amount of material can be condensed into little space, and students have substantially fewer
chances of guessing correct associations than on multiple-choice and true/false tests. However, matching
tests cannot effectively test higher-order intellectual skills.

Rules for Writing Matching Items

1. Use homogeneous material in each list of a matching exercise. Mixing events and dates with events
and names of persons, for example, makes the exercise two separate sets of questions and gives
students a better chance to guess the correct response.

2. Put the problems or the stems (typically longer than the responses) in a numbered column at the
left and the response choices in a lettered column at the right. Always include more responses than
questions. If the lists are the same length, the last choice may be determined by elimination rather than
knowledge.

3. Arrange the list of responses in alphabetical or numerical order if possible in order to save reading
time. All the response choices must be likely, but make sure that there is only one correct choice for
each stem or numbered question.

Poor:

Column A Column B

1) Discovered radium a) Lal Bahadur Shastri

2) Discovered America b) Bell

3) First P.M. of India c) Vasco Da Gama

4) Invented Telephone d) Marconi

e) Columbus

f) Madame Curie

Here, Column B consists of explorers, scientists, and politicians, so the list is very heterogeneous.
Good:

Column A             Column B

1) Eye               a) Digestion
2) Tongue            b) Hearing
3) Stomach           c) Breathing
4) Lung              d) Smelling
5) Ear               e) Seeing
                     f) Tasting
                     g) Chewing

5. Short Answer Type Items

Short-answer questions should be restrictive enough to evaluate whether the correct answer is given.
Allow only a small amount of answer space to discourage the shotgun approach. These tests can cover a
large amount of content within a given time period. However, these items are limited to testing lower-level
cognitive objectives, such as the recall of facts, and scoring may not be as straightforward as anticipated.

Poor: What is the area of a rectangle whose length is 6 m and breadth is 75 cm?

Better: What is the area of a rectangle whose length is 6 m and breadth is 75 cm? Express your answer in
sq. m.

6. Constructing Essay Type Items

“A test item which requires a response composed by the examinee, usually in the form of one or more
sentences, of a nature that no single response or pattern of responses can be listed as correct, and the
accuracy and quality of which can be judged subjectively only by one skilled or informed in the subject.”

The difference between short-answer and essay questions is more than just in the length of response
required. On essay questions there is more emphasis on the organization and integration of the
material, such as when marshaling arguments to support a point of view or method. Essay questions
can be used to measure attainment of a variety of objectives. Stecklein (1955) has listed 14 types of
abilities that can be measured by essay items:

1. Comparisons between two or more things

2. The development and defense of an opinion

3. Questions of cause and effect

4. Explanations of meanings
5. Summarizing of information in a designated area

6. Analysis

7. Knowledge of relationships

8. Illustrations of rules, principles, procedures, and applications

9. Applications of rules, laws, and principles to new situations

10. Criticisms of the adequacy, relevance, or correctness of a concept, idea, or information

11. Formulation of new questions and problems

12. Reorganization of facts

13. Discriminations between objects, concepts, or events

14. Inferential thinking.

All these involve the higher-level skills described in Bloom's Taxonomy, so essay questions provide an
effective way of assessing complex learning outcomes.

Essay questions are especially useful when a paper-and-pencil test must assess students' ability to make
judgments that are well thought through and justifiable. Essay questions require students to
demonstrate their reasoning and thinking skills, which gives teachers the opportunity to detect problems
students may have with their reasoning processes. When educators detect problems in students'
thinking, they can help them overcome those problems.

Rules for Constructing Essay Questions:

1. Ask questions that are relatively specific and focused and which will elicit relatively brief responses.

Poor: Describe the role of instructional objectives in education. Discuss Bloom’s contribution to the
evaluation of instruction.

Better: Describe and differentiate between behavioral (Mager) and cognitive (Gronlund) objectives with
regard to their (1) format and (2) relative advantages and disadvantages for specifying instructional
intentions.

2. If you are using many essay questions in a test, ensure reasonable coverage of the course objectives.
Follow the test specifications in writing prompts. Questions should cover the subject areas as well as the
complexity of behaviors cited in the test blueprint. Pitch the questions at the students’ level.

Poor: What are the major advantages and limitations of essay questions?

Better: Given their advantages and limitations, should an essay question be used to assess students’
abilities to create a solution to a problem? In answering this question, provide brief explanations of the
major advantages and limitations of essay questions. Clearly state whether you think an essay question
should be used and explain the reasoning for your judgment.
The poor example assesses recall of factual knowledge, whereas the better example requires more of
students: it requires them not only to recall facts but also to make an evaluative judgment and to explain
the reasoning for that judgment. The better example therefore demands more complicated thinking than
the poor one.

3. Formulate questions that present a clear task to perform; indicate the point value for each question;
provide ample time for answering; and use words which themselves give directions, e.g., define,
illustrate, outline, select, classify, summarize, etc.

Poor: Discuss the analytical method of teaching Mathematics.

Better: Discuss the analytical method of teaching Mathematics, giving its characteristics, merits, demerits,
and practicability. Give illustrations.

The construction of clear, unambiguous essay questions that call forth the desired responses is a much
more difficult task than is commonly presumed.

Rules for Scoring Essay Type Tests

As we noted earlier, one of the major limitations of the essay test is the subjectivity of the scoring. That
is, the feeling of the scorers is likely to enter into the judgments they make concerning the quality of the
answers. This may be a personal bias toward the writer of the essay, toward certain areas of content or
styles of writing, or toward shortcomings in such extraneous areas as legibility, spelling, and grammar.
These biases, of course, distort the results of a measure of achievement and tend to lower their
reliability.

The following rules are designed to minimize the subjectivity of the scoring and to provide as uniform a
standard of scoring from one student to another as possible. These rules will be most effective, of
course, when the questions have been carefully prepared in accordance with the rules for construction.

Evaluate answers to essay questions in terms of the learning outcomes being measured. The essay test,
like the objective test, is used to obtain evidence concerning the extent to which clearly defined learning
outcomes have been achieved. Thus, the desired student performance specified in these outcomes
should serve as a guide both for constructing the questions and for evaluating the answers.

If a question is designed to measure “The Ability to Explain Cause-Effect Relations,” for example, the
answer should be evaluated in terms of how adequately the student explains the particular cause-effect
relations presented in the question.

All other factors, such as interesting but extraneous factual information, style of writing, and errors in
spelling and grammar, should be ignored (to the extent possible) during the evaluation. In some cases,
separate scores may be given for spelling or writing ability, but these should not be allowed to
contaminate the scores that represent the degree of achievement of the intended learning outcomes.

Score restricted-response answers by the point method, using a model answer as a guide. Scoring with
the aid of a previously prepared scoring key is possible with the restricted-response item because of the
limitations placed on the answer. The procedure involves writing a model answer to each question and
determining the number of points to be assigned to it and to the parts within it. The distribution of
points within an answer, of course, takes into account all score able units indicated in the learning
outcomes being measured. For example, points may be assigned to the relevance of the examples used
and to the organization of the answer, as well as to the content of the answer, if these are legitimate
aspects of the learning outcome. As indicated earlier, it is usually desirable to make clear to the student
at the time of testing the bases on which each answer will be judged (content, organization, and so on).
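As a minimal sketch of the point method (the question components and weights below are hypothetical), a scoring key can list each scorable unit of the model answer with its point value, and a student's score is the sum of the points for the units their answer satisfies:

```python
# Hypothetical point-method scoring key for one restricted-response
# question worth 10 points; components and weights are illustrative only.
scoring_key = {
    "identifies the cause correctly": 3,
    "explains the resulting effect": 3,
    "gives a relevant example": 2,
    "organizes the answer logically": 2,
}

# The scorer records which components a student's answer satisfied.
units_satisfied = ["identifies the cause correctly", "gives a relevant example"]

score = sum(scoring_key[unit] for unit in units_satisfied)
print(score, "/", sum(scoring_key.values()))  # -> 5 / 10
```

Making the key explicit in advance is what lets different scorers, or the same scorer on different days, assign the same points to the same answer.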

Grade extended-response answers by the rating method, using defined criteria as a guide. Extended-
response items allow so much freedom in answering that the preparation of a model answer is
frequently impossible. Thus, the test maker usually grades each answer by judging its quality in terms of
a previously determined set of criteria, rather than scoring it point by point with a scoring key. The
criteria for judging the quality of an answer are determined by the nature of the question and thus by
the learning outcomes being measured.

Evaluate all of the students’ answers to one question before proceeding to the next question. Scoring or
grading essay tests question by question, rather than student by student, makes it possible to maintain a
uniform standard for judging the answers to each question. This procedure also helps offset the halo
effect in grading. When all of the answers on one paper are read together, the grader’s impression of
the paper as a whole is apt to influence the grades he assigns to the individual answers. Grading
question by question, of course, prevents the formation of this overall impression of a student’s paper.
It is more appropriate to judge each answer on its own merits when it is read and compared with other
answers to the same question than when it is read and compared with other answers by the same
student.

Evaluate answers to essay questions without knowing the identity of the writer. This is another attempt
to control personal bias during scoring. Answers to essay questions should be evaluated in terms of what
is written, not in terms of what is known about the writers from other contacts with them. The best way
to prevent our prior knowledge from biasing our judgment is to evaluate each answer without knowing
the identity of the writer. This can be done by having the students write their names on the back of the
paper or by using code numbers in place of names.

Whenever possible, have two or more persons grade each answer. The best way to check on the
reliability of the scoring of essay answers is to obtain two or more independent judgments. Although
this may not be a feasible practice for routine classroom testing, it might be done periodically with a
fellow teacher (one who is equally competent in the area). Obtaining two or more independent ratings
becomes especially vital where the results are to be used for important and irreversible decisions, such
as in the selection of students for further training or for special awards.

Be on the alert for bluffing. Some students who do not know the answer may write a well-organized,
coherent essay that contains material irrelevant to the question. Decide how to treat irrelevant or
inaccurate information in students' answers; credit should not be given for irrelevant material. Giving
such credit is unfair to other students who may also have preferred to write on another topic but
instead answered the required question.

Write comments on the students’ answers. Teacher comments make essay tests a good learning
experience for students. They also serve to refresh your memory of your evaluation should the student
question the grade.
