
How to create a Table of Specifications (TOS) in 5 Easy Steps

A table of specifications (TOS) is a document that helps you plan out your exam. It
makes your test creation process more methodical and organized. Creating a solid
table of specifications increases the likelihood that your test will be valid and
reliable.

So how do you create a table of specifications?


Step 1- Determine the coverage of your exam
The first rule in making exams, and therefore in making a table of specifications, is
to make sure the coverage of your exam is something that you have satisfactorily
taught in class. Select the topics that you wish to test in the exam. You may not be
able to cover all of these topics, as that might create a test that is too long and
not realistic for your students in the given time. So select only the most important
topics.
Step 2- Determine your testing objectives for each topic area
In this step, you will need to be familiar with Bloom’s taxonomy of thinking skills.
Bloom identified a hierarchy of learning objectives, from the lower thinking skills of
knowledge and comprehension to the higher thinking skills of synthesis and
evaluation.
Bloom’s Taxonomy has six categories (from lowest to highest level):
- (1) Knowledge, (2) Comprehension, (3) Application, (4) Analysis, (5) Synthesis, and
(6) Evaluation
So for each content area that you wish to test, you will have to determine how you will
test each area. Will you test simply their recall of knowledge? Or will you be testing
their comprehension of the matter? Or perhaps you will be challenging them to
analyze and compare and contrast something. Again, this would depend on your
instructional objectives in the classroom. Did you teach them lower thinking skills or
did you challenge them by making them think critically?

Your objectives per topic area should use very specific verbs that state how you
intend to test the students under Bloom’s taxonomy. For example, for the second
level, Comprehension, suitable verbs are “explain” or, in the context of understanding
a story, “retell.” For the Analysis level, suitable verbs are “analyze” or “show the
relationships.” It is important that your table of specifications reflect your
instructional procedures during the semester. If your coverage of a topic dwelt mostly
on knowledge and comprehension of the material, then you cannot test students at the
higher levels of Bloom’s hierarchy. Thus it is crucial that you set a balanced set of
objectives throughout the semester, depending on the nature of your students.
Step 3- Determine the duration for each content area
The next step in making the table of specifications is to write down how long you
spent teaching each topic. This is important because it determines how many points
you should devote to each topic. Logically, the longer you spent teaching a topic, the
more questions should be devoted to that area.
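As a minimal sketch of this step, the snippet below converts teaching hours into percentage weights. The function name `weights_from_hours` and the sample hours are illustrative assumptions, not part of the original guide.

```python
def weights_from_hours(hours_by_topic):
    """Turn teaching hours per topic into percentage weights that sum to 100."""
    total_hours = sum(hours_by_topic.values())
    if total_hours <= 0:
        raise ValueError("total teaching hours must be positive")
    return {topic: 100 * hours / total_hours
            for topic, hours in hours_by_topic.items()}


# Hypothetical teaching log (hours spent per topic); not from the source.
hours = {"Topic A": 2, "Topic B": 3, "Topic C": 5}
for topic, weight in weights_from_hours(hours).items():
    print(f"{topic}: {weight:.0f}% of the test points")
```

The weights can then be adjusted by hand if a topic is more important than its teaching time suggests.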
Step 4- Determine the test types for each objective
Now that you have created your table of specifications by aligning your objectives
to Bloom’s taxonomy, it’s time to determine the test types that will accomplish your
testing objectives. For example, knowledge questions can be handled easily through
multiple-choice or matching-type items. If you want to test evaluation or synthesis
of a topic, you will want to write essay-type questions, or perhaps ask the students
to create diagrams and explain them in their analysis. The important thing is that
the test type should reflect your testing objective.
Step 5- Polish your table of specifications
After your initial draft of the table of specifications, it’s time to polish it. Make sure
that your table of specifications covers the important topics that you wish to test.
The number of items on your test should fit the time allotted for the test. Ask your
academic coordinator to comment on your table of specifications; they will be able to
give good feedback on how you can improve or modify it.
After their approval, it’s time to put your blueprint into action by creating your exam.
It is best to use a spreadsheet such as Microsoft Excel so that you can easily modify
your table of specifications if you need to make corrections.

Table of Specification (TOS)

What is a TOS?


A table of specifications is a tool used by teachers to design a test or exam. The goal
of the table is to organize the material covered by comparing the number of questions
devoted to each topic. Essentially, a table of specifications is a chart that breaks
down the topics that will be on a test and the number of test questions, or percentage
of weight, that each section will have in the final test grade. This kind of chart is
usually split into two charts, and each subtopic is numbered under the main topics
being covered by the test. This type of table is mainly used by teachers to help
break down their testing outline on a specific subject.
When planning a test, you have to consider the following:
1. Identify the test objectives.
2. Decide on the stage of objective test to be presented.
3. Prepare a table of specifications (TOS).
4. Construct the draft items.
5. Try out and validate the items.
The test must cover Bloom’s taxonomy of objectives in order to be comprehensive.


THE FOUR COLUMNS OF TOS:

1. Level of objective tested
2. Statement of the objective
3. Item numbers where the objective is being tested
4. Number of items and percentage for that particular objective

So why do we need a TOS? The TOS is an instrument that is consistent with the
student-centered approach: it provides a study and examination guide for students,
and it provides a plan of action that is consistent with the institution’s academic
goals.

Some teachers also use a table of specifications as their teaching guideline by
breaking the table into subjects, the teacher’s main points, how much time should be
spent on each point, and what assignment or project can be done to help the student
learn the subject.

To prepare a multiple choice exam or test you have to know the percentages of the
topics depending on their importance to the subject and the hours spent in their
discussion.

Let’s say you are preparing an exam for the prelim period, for your subject in Human
Physiology in medical school; here are steps you can adapt.

1. Assign the percentage per topic based on the course requirement:

Intro to human physiology - 10%
The human body - 15%
The muscular system - 25%
The skeletal system - 25%
The cardiovascular system - 25%
TOTAL = 100%

N.B. You can adjust the percentages according to your syllabus or academic
requirements.

2. Decide how many items the test should have. Let’s say you have decided on 150
items for your Prelim exam. At 1 minute per question, the time allotted should be at
least 2.5 hours (150 minutes), and longer if some items are problems that need
3 minutes each.
3. Present your data in a table of specifications for clarity.

TOPIC                        NO. OF ITEMS    PERCENTAGE

Intro to Physiology                          10
The Human Body                               15
The Muscular System                          25
The Skeletal System                          25
The Cardiovascular System                    25
TOTAL                                        100%

4. Solve for the number of items for each topic by multiplying the decimal equivalent
of the percentage by the total number of items.

Intro to Physiology = 0.10 (10%) X 150 = 15 items
The Human Body = 0.15 (15%) X 150 = 22.50 items
The Muscular System = 0.25 (25%) X 150 = 37.50 items
The Skeletal System = 0.25 (25%) X 150 = 37.50 items
The Cardiovascular System = 0.25 (25%) X 150 = 37.50 items

This gives a total of 150 items. Since there are no 0.5 questions, round each result
down and then decide which topics receive the leftover items (here, 2 items).
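The rounding described above can be sketched with a largest-remainder allocation: round every topic down, then hand the leftover items to the topics with the largest fractional parts. The function name `allocate_items` is hypothetical, and the sketch assumes whole-number percentages that sum to 100.

```python
def allocate_items(percentages, total_items):
    """Split total_items across topics by integer percentages (largest remainder)."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts, remainders = {}, {}
    for topic, pct in percentages.items():
        # Integer arithmetic avoids floating-point surprises such as 22.499999.
        counts[topic], remainders[topic] = divmod(pct * total_items, 100)
    leftover = total_items - sum(counts.values())
    # Topics with the largest remainders get one extra item each.
    # Ties keep the listing order, which is an arbitrary choice.
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for topic in ranked[:leftover]:
        counts[topic] += 1
    return counts


weights = {
    "Intro to Physiology": 10,
    "The Human Body": 15,
    "The Muscular System": 25,
    "The Skeletal System": 25,
    "The Cardiovascular System": 25,
}
for topic, n in allocate_items(weights, 150).items():
    print(f"{topic}: {n} items")
```

Four topics tie at a remainder of 0.5, so which two of them receive the extra item is the instructor's call; the code simply picks the first two in listing order.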

Let’s say you have assigned the final items as follows:

Intro to Physiology = 15 items
The Human Body = 23 items (rounded up from 22.50)
The Muscular System = 37 items (rounded down from 37.50)
The Skeletal System = 37 items (rounded down from 37.50)
The Cardiovascular System = 38 items (rounded up from 37.50)

You come up with this table:

TOPIC                        NO. OF ITEMS    PERCENTAGE

Intro to Physiology          15              10
The Human Body               23              15
The Muscular System          37              25
The Skeletal System          37              25
The Cardiovascular System    38              25
TOTAL                        150             100%

5. This is the simplest form of a table of specifications. You may want to be more
specific and prepare a more detailed table that assigns easy, average, and difficult
questions. Average questions should make up the large majority of the exam, with
smaller shares of easy and difficult questions; the sample table here uses about 74%
average, 15% easy, and 11% difficult. This is a recommendation only; the final
decision depends on the subject itself and the learning ability of your students.

TOPIC                        Easy qsns.  Average qsns.  Difficult qsns.  NO. OF ITEMS  PERCENTAGE

Intro to Physiology          3           10             2                15            10
The Human Body               4           16             3                23            15
The Muscular System          5           28             4                37            25
The Skeletal System          5           28             4                37            25
The Cardiovascular System    5           29             4                38            25
TOTAL                        22          111            17               150           100%
6. You should be able to determine which questions are easy, average, and
difficult based on an item analysis that you have done on previous exams. This is an
analysis of which questions were answered easily and correctly and which ones were
difficult for the students. Software for item analysis may be available from your
school, or you could build up your own analysis over the semesters that you teach
the subject.

Another Sample Table of Specifications

Clinical Chemistry 1 subject – Prelim Exams

Topic                        Identification  Multiple Choice  Problem Solving  Number of Items  Percentage

Intro to Clinical Chemistry  2               10               0                12               10
Laboratory Mathematics       2               14               20               36               30
Carbohydrates                3               33               0                36               30
Lipids                       3               33               0                36               30
Total No. of Items           10              90               20               120              100%

DECIDE THE NUMBER OF ITEMS FOR YOUR EXAM, BASED ON THE HOURS AVAILABLE.

1. Assign the percentage according to the importance of the topic to your
subject, or refer to the weight required for the topic by your school or
accrediting institution.
2. Decide on the total number of items for the exam depending on the number of
hours assigned.
3. Allow at least 1 minute for easy questions and 3-5 minutes for difficult
questions. For case analyses, you may want to increase the time.
4. Based on your total items, get the number of items for each topic simply
by multiplying the total number of items by the percentage. Below is the
computation for this Table of Specifications.
How to solve the number of items for your Table of Specifications.

Introduction to Clinical Chemistry = 120 X 0.10 (10%) = 12 items
Laboratory Mathematics = 120 X 0.30 (30%) = 36 items
Carbohydrates = 120 X 0.30 (30%) = 36 items
Lipids = 120 X 0.30 (30%) = 36 items

Total number of items = 120

Assign now the specific type of test for the items. As the instructor, you would know
what type of test could effectively test the knowledge of your students with the
different topics. Your Table of Specifications should reflect which topics are vital to
your course.

In this example, the Introduction to Clinical Chemistry would not use problem solving
but only multiple choice and identification. You can compose 10 items for multiple
choice and 2 items for identification.

What is the importance of the TOS?

- Helps teachers frame the decision-making process of test construction and improves
the validity of teachers’ evaluations based on the tests constructed.
- Helps teachers identify the types of items they need to include on their tests.
- Identifies the achievement domains being measured and ensures that a fair and
representative sample of questions appears on the test.
- Provides the teacher with evidence that a test has content validity, that is, that
it covers what should be covered.

How to Do an Item Analysis
Item analysis is a process which examines student responses to individual test
items (questions) in order to assess the quality of those items and of the test as a
whole. Item analysis is especially valuable in improving items which will be used
again in later tests, but it can also be used to eliminate ambiguous or misleading
items in a single test administration. In addition, item analysis is valuable for
increasing instructors’ skills in test construction, and identifying specific areas of
course content which need greater emphasis or clarity.

Item Statistics
Item statistics are used to assess the performance of individual test items on
the assumption that the overall quality of a test derives from the quality of its
items.
The item analysis report provides the following item information:

Item Number
This is the question number taken from the student answer sheet.

Mean and Standard Deviation
The mean is the “average” student response to an item. It is computed by
adding up the number of points earned by all students on the item, and
dividing that total by the number of students.

The standard deviation, or S.D., is a measure of the dispersion of student
scores on that item. That is, it indicates how “spread out” the responses were.
The item standard deviation is most meaningful when comparing items which
have more than one correct alternative and when scale scoring is used. For
this reason it is not typically used to evaluate classroom tests.
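The two statistics can be sketched as follows. The helper name `item_mean_sd` and the sample scores are illustrative, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which one the report uses.

```python
from statistics import mean, pstdev


def item_mean_sd(points_earned):
    """Return (mean, population standard deviation) of the points on one item."""
    return mean(points_earned), pstdev(points_earned)


# Hypothetical points earned by five students on a one-point item.
item_points = [1, 0, 1, 1, 0]
m, sd = item_mean_sd(item_points)
print(f"mean = {m:.2f}, S.D. = {sd:.2f}")
```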

Item Difficulty
For items with one correct alternative worth a single point, the item difficulty is
simply the percentage of students who answer an item correctly. In this case,
it is also equal to the item mean. The item difficulty index ranges from 0 to
100; the higher the value, the easier the question. When an alternative is
worth other than a single point, or when there is more than one correct
alternative per question, the item difficulty is the average score on that item
divided by the highest number of points for any one alternative. Item difficulty
is relevant for determining whether students have learned the concept being
tested. It also plays an important role in the ability of an item to discriminate
between students who know the tested material and those who do not. The
item will have low discrimination if it is so difficult that almost everyone gets it
wrong or guesses, or so easy that almost everyone gets it right.
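A small sketch of the difficulty index as defined above: the average score on the item divided by the highest number of points for any one alternative, expressed on a 0 to 100 scale. The function name `item_difficulty` and the sample data are assumptions for illustration.

```python
def item_difficulty(points_earned, max_points=1):
    """Item difficulty on a 0-100 scale; higher values mean an easier item."""
    if not points_earned:
        raise ValueError("need at least one student response")
    return 100 * sum(points_earned) / (len(points_earned) * max_points)


# One-point item: 7 of 10 hypothetical students answered correctly.
print(item_difficulty([1, 1, 0, 1, 0, 1, 1, 1, 0, 1]))

# Weighted item where the best alternative is worth 2 points.
print(item_difficulty([2, 1, 0, 2], max_points=2))
```

For a one-point item this reduces to the percentage of students who answered correctly.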
To maximize item discrimination, desirable difficulty levels are slightly higher
than midway between chance and perfect scores for the item. (The chance
score for five-option questions, for example, is 20 because one-fifth of the
students responding to the question could be expected to choose the correct
option by guessing.) Ideal difficulty levels for multiple-choice items in terms of
discrimination potential are:

Format                                       Ideal Difficulty

Five-response multiple-choice                70
Four-response multiple-choice                74
Three-response multiple-choice               77
True-false (two-response multiple-choice)    85

(From Lord, F.M. “The Relationship of the Reliability of Multiple-Choice Test to
the Distribution of Item Difficulties,” Psychometrika, 1952, 18, 181-194.)
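To see what “slightly higher than midway between chance and perfect scores” means in numbers, the sketch below computes the chance score and the midpoint for each format and prints them next to the ideal values from the table. The function names are hypothetical; only the ideal values come from the source.

```python
# Ideal difficulty values as listed in the table (number of options -> ideal).
LORD_IDEAL = {5: 70, 4: 74, 3: 77, 2: 85}


def chance_score(n_options):
    """Expected percent correct from pure guessing."""
    return 100 / n_options


def midpoint(n_options):
    """Midway between the chance score and a perfect score of 100."""
    return (chance_score(n_options) + 100) / 2


for n, ideal in LORD_IDEAL.items():
    print(f"{n} options: chance {chance_score(n):.1f}, "
          f"midpoint {midpoint(n):.1f}, ideal {ideal}")
```

In every format the listed ideal difficulty sits above the midpoint, which is consistent with the rule stated in the text.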

Item Discrimination
Item discrimination refers to the ability of an item to differentiate among
students on the basis of how well they know the material being tested. Various
hand calculation procedures have traditionally been used to compare item
responses to total test scores using high and low scoring groups of students.
Computerized analyses provide more accurate assessment of the
discrimination power of items because they take into account responses of all
students rather than just high and low scoring groups.
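One traditional hand-calculation procedure compares the proportion answering correctly in a high-scoring group with the proportion in a low-scoring group. The sketch below is one such version; the function name and the 27% group size are assumptions (27% is a common convention, not something the text specifies).

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-group minus lower-group proportion correct for one item."""
    n = len(total_scores)
    if n != len(item_correct) or n < 2:
        raise ValueError("need matching lists with at least two students")
    group_size = max(1, round(n * fraction))
    # Student indices ordered from lowest to highest total test score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical data for ten students: total scores and 1/0 on one item.
totals = [95, 90, 85, 80, 70, 65, 60, 50, 45, 40]
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
print(discrimination_index(item, totals))
```

The index ranges from -1 to 1; values near zero or below suggest the item does not separate stronger from weaker students.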

Alternate Weight
This column shows the number of points given for each response alternative.
For most tests, there will be one correct answer which will be given one point.

Means
The mean total test score (minus that item) is shown for students who
selected each of the possible response alternatives. This information should
be looked at in conjunction with the discrimination index; higher total test
scores should be obtained by students choosing the correct, or most highly
weighted alternative. Incorrect alternatives with relatively high means should
be examined to determine why “better” students chose that particular
alternative.
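The “Means” column can be reproduced with a few lines: group students by the alternative they chose and average their total score with the item's own points removed. The function name and the sample responses are hypothetical.

```python
from collections import defaultdict


def mean_rest_score_by_choice(choices, item_points, total_scores):
    """Mean total score (minus this item) for each response alternative."""
    groups = defaultdict(list)
    for choice, points, total in zip(choices, item_points, total_scores):
        groups[choice].append(total - points)
    return {choice: sum(v) / len(v) for choice, v in sorted(groups.items())}


# Hypothetical item where "A" is the correct answer (worth one point).
choices = ["A", "B", "A", "C", "B", "A"]
points = [1, 0, 1, 0, 0, 1]
totals = [18, 12, 16, 9, 14, 17]
print(mean_rest_score_by_choice(choices, points, totals))
```

In this made-up example the correct alternative has the highest mean, which is the pattern a well-behaved item should show.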

Frequencies and Distribution
The number and percentage of students who choose each alternative are
reported. The bar graph on the right shows the percentage choosing each
response; each “#” represents approximately 2.5%. Frequently chosen wrong
alternatives may indicate common misconceptions among the students.
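A rough imitation of this part of the report, assuming one “#” per 2.5% as described. The function name `frequency_report` and the exact line layout are illustrative and do not reproduce the actual report format.

```python
from collections import Counter


def frequency_report(choices, pct_per_mark=2.5):
    """Return one text line per alternative: count, percentage, and a '#' bar."""
    counts = Counter(choices)
    n = len(choices)
    lines = []
    for alt in sorted(counts):
        pct = 100 * counts[alt] / n
        bar = "#" * round(pct / pct_per_mark)
        lines.append(f"{alt} {counts[alt]:3d} {pct:5.1f}% {bar}")
    return lines


# Hypothetical responses from 20 students to one item.
responses = ["A"] * 10 + ["B"] * 6 + ["C"] * 4
for line in frequency_report(responses):
    print(line)
```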

Difficulty and Discrimination Distributions
At the end of the Item Analysis report, test items are listed according to their
degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor).
These distributions provide a quick overview of the test, and can be used to
identify items which are not performing well and which can perhaps be
improved or discarded.

Test Statistics
Two statistics are provided to evaluate the performance of the test as a whole.

Reliability Coefficient
The reliability of a test refers to the extent to which the test is likely to produce
consistent scores. The particular reliability coefficient computed by ScorePak®
reflects three characteristics of the test:

- Intercorrelations among the items: the greater the relative number of
positive relationships, and the stronger those relationships are, the
greater the reliability. Item discrimination indices and the test’s reliability
coefficient are related in this regard.
- Test length: a test with more items will have a higher reliability, all
other things being equal.
- Test content: generally, the more diverse the subject matter tested
and the testing techniques used, the lower the reliability.

Reliability coefficients theoretically range in value from zero (no reliability) to
1.00 (perfect reliability). In practice, their approximate range is from .50 to .90
for about 95% of the classroom tests scored by ScorePak®. High reliability
means that the questions of a test tended to “pull together.” Students who
answered a given question correctly were more likely to answer other
questions correctly. If a parallel test were developed by using similar items,
the relative scores of students would show little change. Low reliability means
that the questions tended to be unrelated to each other in terms of who
answered them correctly. The resulting test scores reflect peculiarities of the
items or the testing situation more than students’ knowledge of the subject
matter.
As with many statistics, it is dangerous to interpret the magnitude of a
reliability coefficient out of context. High reliability should be demanded in
situations in which a single test score is used to make major decisions, such
as professional licensure examinations. Because classroom examinations are
typically combined with other scores to determine grades, the standards for a
single test need not be as stringent. The following general guidelines can be
used to interpret reliability coefficients for classroom exams:

Reliability      Interpretation

.90 and above    Excellent reliability; at the level of the best standardized tests.

.80 – .90        Very good for a classroom test.

.70 – .80        Good for a classroom test; in the range of most classroom tests.
                 There are probably a few items which could be improved.

.60 – .70        Somewhat low. This test needs to be supplemented by other
                 measures (e.g., more tests) to determine grades. There are
                 probably some items which could be improved.

.50 – .60        Suggests need for revision of test, unless it is quite short (ten
                 or fewer items). The test definitely needs to be supplemented
                 by other measures (e.g., more tests) for grading.

.50 or below     Questionable reliability. This test should not contribute heavily
                 to the course grade, and it needs revision.
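The guidelines can be turned into a simple lookup. Because the bands in the table share their boundary values, this sketch assumes that each band includes its lower bound (so exactly .50 falls in the .50 – .60 band); that convention and the function name are assumptions.

```python
def interpret_reliability(r):
    """Map a reliability coefficient to the guideline label for classroom exams."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    bands = [
        (0.90, "Excellent reliability"),
        (0.80, "Very good for a classroom test"),
        (0.70, "Good for a classroom test"),
        (0.60, "Somewhat low"),
        (0.50, "Suggests need for revision of test"),
    ]
    for lower_bound, label in bands:
        if r >= lower_bound:  # assumed convention: lower bound is inclusive
            return label
    return "Questionable reliability"


for value in (0.93, 0.85, 0.72, 0.45):
    print(value, "->", interpret_reliability(value))
```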

The measure of reliability used by ScorePak® is Cronbach’s Alpha. This is the
general form of the more commonly reported KR-20 and can be applied to
tests composed of items with different numbers of points given for different
response alternatives. When coefficient alpha is applied to tests in which each
item has only one correct answer and all correct answers are worth the same
number of points, the resulting coefficient is identical to KR-20.
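The relationship between the two coefficients can be checked numerically. The sketch below uses the textbook formulas for coefficient alpha and KR-20 with population variances; it is not ScorePak® code, and the function names and the small score matrix are made up for illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha; scores is a list of rows (students) of item points."""
    k = len(scores[0])
    total_var = pvariance([sum(row) for row in scores])
    if k < 2 or total_var == 0:
        raise ValueError("need at least two items and some score variation")
    item_vars = [pvariance(column) for column in zip(*scores)]
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def kr20(scores):
    """KR-20 for items scored 1 (correct) or 0 (incorrect)."""
    k, n = len(scores[0]), len(scores)
    total_var = pvariance([sum(row) for row in scores])
    p = [sum(column) / n for column in zip(*scores)]  # proportion correct per item
    pq_sum = sum(pi * (1 - pi) for pi in p)
    return k / (k - 1) * (1 - pq_sum / total_var)


# Hypothetical 0/1 scores: five students (rows) by four items (columns).
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3), round(kr20(data), 3))
```

For right/wrong items the variance of an item equals p(1 - p), which is why the two results agree.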

(Further discussion of test reliability can be found in J. C. Nunnally,
Psychometric Theory. New York: McGraw-Hill, 1967, pp. 172-235, see
especially formulas 6-26, p. 196.)

Standard Error of Measurement
The standard error of measurement is directly related to the reliability of the
test. It is an index of the amount of variability in an individual student’s
performance due to random measurement error. If it were possible to
administer an infinite number of parallel tests, a student’s score would be
expected to change from one administration to the next due to a number of
factors. For each student, the scores would form a “normal” (bell-shaped)
distribution. The mean of the distribution is assumed to be the student’s “true
score,” and reflects what he or she “really” knows about the subject. The
standard deviation of the distribution is called the standard error of
measurement and reflects the amount of change in the student’s score which
could be expected from one test administration to another.
Whereas the reliability of a test always varies between 0.00 and 1.00, the
standard error of measurement is expressed in the same scale as the test
scores. For example, multiplying all test scores by a constant will multiply the
standard error of measurement by that same constant, but will leave the
reliability coefficient unchanged.
A general rule of thumb to predict the amount of change which can be
expected in individual test scores is to multiply the standard error of
measurement by 1.5. Only rarely would one expect a student’s score to
increase or decrease by more than that amount between two such similar
tests. The smaller the standard error of measurement, the more accurate the
measurement provided by the test.
(Further discussion of the standard error of measurement can be found in J.
C. Nunnally, Psychometric Theory. New York: McGraw-Hill, 1967, pp.172-235,
see especially formulas 6-34, p. 201.)
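The text does not state the formula, so the sketch below assumes the standard classical-test-theory expression, SEM = SD × sqrt(1 − reliability), and then applies the 1.5 rule of thumb from the paragraph above. The function names and the sample numbers are illustrative.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """SEM in test-score units, using SD * sqrt(1 - reliability) (assumed formula)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1 - reliability)


def expected_change(sem, multiplier=1.5):
    """Rule-of-thumb bound on score change between two similar tests."""
    return multiplier * sem


# Hypothetical test: score S.D. of 10 points and reliability of 0.84.
sem = standard_error_of_measurement(10, 0.84)
print(f"SEM = {sem:.1f} points, expected change within {expected_change(sem):.1f}")

# Doubling every score doubles the S.D. and the SEM; the reliability is unchanged.
print(f"SEM after doubling scores = {standard_error_of_measurement(20, 0.84):.1f}")
```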
A Caution in Interpreting Item Analysis Results
Each of the various item statistics provided by ScorePak® provides
information which can be used to improve individual test items and to increase
the quality of the test as a whole. Such statistics must always be interpreted in
the context of the type of test given and the individuals being tested. W. A.
Mehrens and I. J. Lehmann provide the following set of cautions in using item
analysis results (Measurement and Evaluation in Education and Psychology.
New York: Holt, Rinehart and Winston, 1973, 333-334):

- Item analysis data are not synonymous with item validity. An external
criterion is required to accurately judge the validity of test items. By
using the internal criterion of total test score, item analyses reflect
internal consistency of items rather than validity.
- The discrimination index is not always a measure of item quality. There
is a variety of reasons an item may have low discriminating power: (a)
extremely difficult or easy items will have low ability to discriminate but
such items are often needed to adequately sample course content and
objectives; (b) an item may show low discrimination if the test measures
many different content areas and cognitive skills. For example, if the
majority of the test measures “knowledge of facts,” then an item
assessing “ability to apply principles” may have a low correlation with
total test score, yet both types of items are needed to measure
attainment of course objectives.
- Item analysis data are tentative. Such data are influenced by the type
and number of students being tested, instructional procedures
employed, and chance errors. If repeated use of items is possible,
statistics should be recorded for each administration of each item.

1. Raw scores are those scores which are computed by scoring answer sheets
against a ScorePak® Key Sheet. Raw score names are EXAM1 through
EXAM9, QUIZ1 through QUIZ9, MIDTRM1 through MIDTRM3, and FINAL.
ScorePak® cannot analyze scores taken from the bonus section of student
answer sheets or computed from other scores, because such scores are not
derived from individual items which can be accessed by ScorePak®.
Furthermore, separate analyses must be requested for different versions of
the same exam.
2. A correlation is a statistic which indexes the degree of linear relationship
between two variables. If the value of one variable is related to the value of
another, they are said to be “correlated.” In positive relationships, the value of
one variable tends to be high when the value of the other is high, and low
when the other is low. In negative relationships, the value of one variable
tends to be high when the other is low, and vice versa. The possible values of
correlation coefficients range from -1.00 to 1.00. The strength of the
relationship is shown by the absolute value of the coefficient (that is, how
large the number is whether it is positive or negative). The sign indicates the
direction of the relationship (whether positive or negative).
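A correlation coefficient of this kind can be computed directly from its definition. The sketch below implements the usual Pearson formula by hand; the function name `pearson_r` and the sample data are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a perfectly positive and a perfectly negative relationship.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
print(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))
```

Applied to an item's scores and the students' total scores on the remaining items, the same function gives a discrimination measure that uses every student rather than only the high and low groups.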
