
Writing and Evaluating Test Items

 Define clearly what you want to measure.
 Generate an item pool.
 Avoid exceptionally long items.
 Keep the level of reading difficulty
appropriate for those who will complete the
test.
 Avoid double-barreled items that convey more than one idea at the same time.
 Consider mixing positively and negatively worded items.
Item Format

 Form, plan, structure, arrangement, and layout of individual test items.
 I. Selected Response Format
 Requires test takers to select a response from a
set of alternative responses.
 II. Completion Items
 Requires test takers to supply a response that completes a given stimulus.
▪ Essay items – examinees respond to a question by writing a composition; used to determine the depth of knowledge of the respondent.
Dichotomous Format

• An item format that offers two alternatives for each item. Usually a point is given for the selection of one of the alternatives.
• Ex: True or false
• Advantages – simplicity, ease of administration, quick scoring, absolute judgment (no neutral response).
• Disadvantages – more items are needed for adequate reliability; there is a 50% chance of guessing the correct answer; examinees can memorize responses.

Polytomous/Polychotomous Format

• Each item has more than two alternatives. A point is given for the selection of the correct alternative and none for any other choice.
• Ex: Multiple choice
• Question – stem
• Correct choice – keyed response
• Incorrect choices – "distractors"
• Ineffective distractors may hurt the reliability of the test because they are time-consuming to read and can limit the number of good items that can be included in a test.
• "Cute" distractors, which are less likely to be chosen, may affect the reliability of the test because test takers may guess from the remaining options.
 Corrected Score = R – W / (n – 1)

 R = right responses
 W = wrong responses
 n = number of choices for each item
 *Omitted responses are NOT part of the computation.
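As an illustration, here is a minimal Python sketch of the correction-for-guessing formula above; the function name and the example counts are hypothetical.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Correction for guessing: R - W / (n - 1).

    Only right and wrong answers enter the computation;
    omitted responses are excluded before calling this.
    """
    return right - wrong / (n_choices - 1)

# e.g. 40 right and 10 wrong on a 5-choice test: 40 - 10/4 = 37.5
print(corrected_score(right=40, wrong=10, n_choices=5))  # 37.5
```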
Likert Format

• Requires the respondent to indicate the degree of agreement with a particular attitudinal statement.
• Often suggested as a superior item format.
• Can be subjected to factor analysis.
• Ex: "I am afraid of heights."
  Strongly disagree, disagree, neutral, agree, strongly agree
• Likert scales commonly use a 5-point format; 4- and 6-point variants omit the neutral point.
• Scoring requires that any negatively worded items be reverse scored; the responses are then summed.
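A minimal Python sketch of this scoring rule, assuming a 5-point scale where a reversed item is scored as (scale maximum + 1) − response; the item values are hypothetical.

```python
def score_likert(responses, reverse_items, scale_max=5):
    """Sum Likert responses after reverse-scoring negatively
    worded items (reversed value = scale_max + 1 - response)."""
    return sum((scale_max + 1 - r) if i in reverse_items else r
               for i, r in enumerate(responses))

# Five items on a 5-point scale; items 1 and 3 (0-based) are
# negatively worded: 4 + (6-2) + 5 + (6-1) + 3 = 21
print(score_likert([4, 2, 5, 1, 3], reverse_items={1, 3}))  # 21
```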

Category Format

• A format wherein respondents are asked to rate a construct from 1 to 10, with 1 as the lowest and 10 as the highest.
• Ex: On a scale of 1 to 10, with 1 as the lowest and 10 as the highest, rate the attractiveness of your boyfriend.

Checklist and Q-sort

• In a checklist, a subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself. Adjective checklists can be used for describing oneself or someone else.
• Q-sort requires respondents to sort a group of statements into 9 piles. Statements that are not descriptive of the individual are placed in pile 1, and statements that fit the individual are placed in pile 9.

Guttman Scale

• Items are arranged from weaker to stronger expressions of the attitude, belief, or feeling being measured.
• All respondents who agree with A also agree with B, C, and D; all those who agree with C also agree with D but not with A and B.
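This cumulative logic can be checked programmatically. Below is a small sketch (not from the source) that tests whether a response pattern is a perfect Guttman pattern when items are ordered from weakest to strongest.

```python
def is_perfect_guttman(responses):
    """`responses` holds 1 (agree) / 0 (disagree) for items ordered
    from weakest to strongest. In a perfect cumulative pattern,
    agreement never reappears after the first disagreement."""
    return all(a >= b for a, b in zip(responses, responses[1:]))

print(is_perfect_guttman([1, 1, 1, 0]))  # True: agrees with the weaker items only
print(is_perfect_guttman([1, 0, 1, 0]))  # False: violates the cumulative order
```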
Thurstone Scale

 Described by Thurstone.
 A scale wherein both + and – items are present.
 All responses are summed in order to transform the scale into an interval scale.
 Uses direct estimation scaling.

 DIRECT ESTIMATION SCALING
▪ Transformation of the scale into other scales is possible because the mean is computable.
 INDIRECT ESTIMATION SCALING
▪ Cannot be transformed into other scales because the mean is not available.
Computerized Adaptive Testing

 Also called computer-assisted testing.
 An interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker's performance on previous items.
 Item bank – a relatively large and easily accessible collection of test questions.
 Item branching – the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
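Item branching can be sketched as a simple rule: move to a harder item after a correct response and to an easier one after an error. Real adaptive tests select items with IRT models; the step size and difficulty scale below are purely illustrative assumptions.

```python
def next_difficulty(current_p, last_correct, step=0.1):
    """Pick the target difficulty (proportion-correct p) of the next
    item: lower p means a harder item. Illustrative rule only; real
    CAT systems use IRT-based item selection from the item bank."""
    target = current_p - step if last_correct else current_p + step
    return min(max(target, 0.0), 1.0)   # keep p within [0, 1]

print(next_difficulty(0.50, last_correct=True))   # 0.4 -> harder item
print(next_difficulty(0.50, last_correct=False))  # 0.6 -> easier item
```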
Scoring Models

 1. Cumulative model – the higher the score on the test, the higher the test taker is on the ability, trait, or other characteristic being measured.
 2. Class scoring/category scoring – test taker responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is similar in some way. Most useful for diagnostic tests, where examinees earn a point if the response is related to the disorder.
 3. Ipsative scoring – compares a test taker's score on one scale within a test to another scale within that same test. The choice of one option is at the expense of the other, as in the forced-choice pair below (a comparison of the cumulative and ipsative views follows the example).
▪ a. I usually finish a project on time.
▪ b. I prefer to be in command in a group.
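A small sketch contrasting the cumulative and ipsative views of the same response data; the scale names and the share-of-total comparison are illustrative assumptions, not a prescribed procedure.

```python
# 1 = response credited to the scale, 0 = not credited (hypothetical data)
scales = {"conscientiousness": [1, 1, 0, 1], "dominance": [1, 0, 0, 0]}

# Cumulative model: a higher raw sum means more of the trait.
raw = {name: sum(items) for name, items in scales.items()}

# Ipsative view: each scale is compared with the other scales of the
# same test taker, here as a share of that person's total endorsements.
total = sum(raw.values())
ipsative = {name: score / total for name, score in raw.items()}

print(raw)       # {'conscientiousness': 3, 'dominance': 1}
print(ipsative)  # {'conscientiousness': 0.75, 'dominance': 0.25}
```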
Item Analysis

 A general term for a set of methods used to evaluate test items; one of the most important aspects of test construction.
 Item Difficulty
 Item Reliability
 Item Validity
 Item Discriminability
 Items for Criterion-Referenced Tests
 *Distractor Analysis

Item Difficulty

 For a test that measures achievement or ability, item difficulty is defined by the proportion of people who get a particular item correct.
 For example, if 84% of the test takers answered item number 1 correctly, then the item has a difficulty index of .84.
 This definition, however, indicates the easiness of the item rather than its difficulty.
 It is also suggested that achievement tests use multiple-choice questions, because a four-option item leaves only a .25 chance of guessing the correct response.
 Difficulty indices should generally range from 0.30 to 0.70.
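A minimal sketch of the computation, reproducing the .84 example above with hypothetical response data.

```python
def difficulty_index(responses):
    """Proportion answering the item correctly (1 = correct,
    0 = incorrect). Higher values indicate easier items."""
    return sum(responses) / len(responses)

# 84 of 100 examinees answered item 1 correctly -> p = .84
item1 = [1] * 84 + [0] * 16
print(difficulty_index(item1))  # 0.84
```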
Optimal Item Difficulty

 Suggests the best difficulty for an item based on the number of response options.
 OID = (chance performance + 1) / 2
 Chance performance – the performance expected from guessing alone; computed by dividing 1 by the number of alternatives for the item.
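A short sketch of the OID formula: the optimal difficulty is halfway between chance performance and a perfect pass rate of 1.0.

```python
def optimal_item_difficulty(n_choices):
    """OID = (chance performance + 1) / 2, where chance
    performance for an n-choice item is 1 / n_choices."""
    chance = 1 / n_choices
    return (chance + 1) / 2

print(optimal_item_difficulty(4))  # 0.625 for 4-option multiple choice
print(optimal_item_difficulty(2))  # 0.75 for true/false items
```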
Item Difficulty Index
 The value that describes the item difficulty for an ability test.

Item Endorsement Index
 The value that describes the percentage of individuals who endorsed an item on a personality test.
 Items in an ability test are arranged in order of increasing difficulty.
 This arrangement helps motivate examinees while taking the test.
 Giveaway items – presented near the beginning of the test to spur motivation and lessen test anxiety.
Item Reliability Index

 Indicates the internal consistency of a test: the higher the index, the higher the internal consistency.
 Item reliability index = (SD of the item) × (item-total correlation)
 Factor analysis can also be used to determine which items load more strongly on the whole test.
Item Validity Index

 Provides an indication of the degree to which a test is measuring what it purports to measure: the higher the item-validity index, the higher the criterion-related validity of the test.
 Item validity index = (SD of the item) × (correlation of item and criterion)
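Both indices share the same form, item SD × a correlation, differing only in what the item is correlated with. A minimal sketch follows (Python 3.10+ for statistics.correlation); the scores are hypothetical, and the population SD is used here, though some texts use the sample SD.

```python
import statistics

def item_index(item_scores, other_scores):
    """Item SD times the Pearson correlation of the item with
    `other_scores`: pass total test scores for the item-reliability
    index, or an external criterion for the item-validity index."""
    sd = statistics.pstdev(item_scores)                    # item SD
    r = statistics.correlation(item_scores, other_scores)  # Python 3.10+
    return sd * r

item = [1, 0, 1, 1, 0, 1]            # 1 = correct, 0 = incorrect
totals = [48, 30, 45, 50, 28, 41]    # total test scores
print(round(item_index(item, totals), 3))  # item-reliability index
```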
Item Discriminability

 How well an item performs in relation to some criterion; that is, the degree of association between performance on an item and performance on the whole test.
 Indicates how adequately an item separates high scorers from low scorers on the entire test.
 A discrimination index of about 0.30 is commonly used as the lower limit for acceptable items.
 The higher the d, the more high scorers answer the item correctly.
 Methods:
 Extreme Group Method
 Point-Biserial Correlation
Extreme Group Method

 This method compares people who have done well with those who have done poorly on a test.
 For example, you might find the students with test scores in the top third and those in the bottom third of the class, then find the proportion of people in each group who got each item correct. The difference between the two proportions for an item is its discriminability index.
 Choose the top 25%–33% of the population as the upper group and the same proportion as the lower group (the upper and lower groups should be identical in size).
 Ex: the top 27% of the population as the upper group; the bottom 27% of the population as the lower group.
 Find the proportion of students in each group who answered each item correctly.
 Subtract the proportion for the low group from the proportion for the high group to get the discrimination index.
Item   Proportion correct,     Proportion correct,        Discrimination
       top third of class      bottom third of class      index

1      .89                     .34                        .55
2      .76                     .36                        .40
3      .97                     .45                        .52
4      .98                     .95                        .03
5      .56                     .74                        -.18
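The table values can be reproduced with a one-line computation; this sketch simply re-encodes the proportions above.

```python
def discrimination_index(p_upper, p_lower):
    """Extreme group method: d = p(upper group) - p(lower group)."""
    return p_upper - p_lower

# (item, p for top third, p for bottom third), from the table above
for item, p_hi, p_lo in [(1, .89, .34), (2, .76, .36), (3, .97, .45),
                         (4, .98, .95), (5, .56, .74)]:
    print(item, round(discrimination_index(p_hi, p_lo), 2))
# 1 0.55 / 2 0.4 / 3 0.52 / 4 0.03 / 5 -0.18
```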

Point-Biserial Correlation

 Used for correlating dichotomous and continuous data.
 Indicates whether those who got an item correct also tend to have high total scores.
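SciPy provides this correlation directly; a minimal sketch with hypothetical data (assumes SciPy is installed).

```python
from scipy import stats

item = [1, 0, 1, 1, 0, 1, 0, 1]              # 1 = item correct
totals = [52, 31, 47, 50, 29, 44, 35, 49]    # total test scores
r, p_value = stats.pointbiserialr(item, totals)
print(round(r, 2))  # positive r: passing the item goes with higher totals
```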
 A graphic representation of item difficulty and discrimination.
 Usually plots total scores on the x-axis and p and d on the y-axis.
 A frequency polygon is created after the test is given to two groups: one that has been exposed to the learning unit and one that has not.
 Antimode – the score with the lowest frequency.
 Used in the determination of the cut score (passing score) for a criterion-referenced test.
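A small sketch of locating the antimode in a combined score distribution; the scores below are hypothetical and chosen so the low-frequency valley falls between the two groups' clusters.

```python
from collections import Counter

def antimode(scores):
    """Score with the lowest frequency; a candidate cut score
    between the instructed and uninstructed groups."""
    counts = Counter(scores)
    return min(counts, key=counts.get)

# Uninstructed group clusters around 4-6, instructed around 9-11.
scores = [4, 4, 5, 5, 5, 6, 6, 7, 9, 9, 10, 10, 10, 11, 11]
print(antimode(scores))  # 7: the valley between the two clusters
```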
Item Fairness
 The degree to which an item is biased.
 Biased test items – items that favor one particular group of examinees.
 Bias can be tested using inferential statistics across groups.

Qualitative Item Analysis

 Involves exploring issues through verbal means, such as interviews and group discussions conducted with test takers and other relevant parties.
Think-Aloud Administration
 Allows test takers (during standardization) to speak their minds while taking the test.
 Used to shed light on the test taker's thought processes during the administration of the test.

Expert Panels
 Guide researchers/test developers in conducting a sensitivity review (especially on cultural issues).
 Sensitivity review – a study of test items, typically to examine test bias and the presence of offensive language or stereotypes.
Test Development Process

Test conceptualization → Test construction → Test tryout → Item analysis → Test revision
 Test development – an umbrella term for the whole process of creating a test.
 Test conceptualization – an early stage of test development wherein the idea for a particular test is conceived.
 Test construction – writing test items as well as formatting items, setting scoring rules, and otherwise designing and building a test.
 Test tryout – administration of a test to a representative sample of test takers under conditions that simulate the conditions under which the final version of the test will be administered.
 Item analysis – entails procedures, usually statistical, designed to explore how individual test items work, both as compared to other items in the test and in the context of the whole test.
 Test revision – action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of assessment.
Test Conceptualization

 The stage wherein the following are determined: construct, goal, user, taker, administration, format, response, benefits, costs, and interpretation.
 Determination of whether the test will be norm-referenced or criterion-referenced.
 Also called a pilot study or pilot research.
 May take the form of interviews to determine appropriate items for the test.
 May entail literature reviews, experimentation, or any other efforts the researcher undertakes to determine which items should be included in the test.
Test Construction

 Writing test items as well as formatting items, setting scoring rules, and otherwise designing and building a test.
 Scaling – the process of setting rules for assigning numbers in measurement; designing a measuring device for the trait/ability being measured. Manifested through the item format (dichotomous, polytomous, Likert, category).
 Item pool – the reservoir from which items will or will not be drawn for the final version of the test. It usually contains 2 times the intended number of items in the final form (3 times is advised for inexperienced test developers). The final test items should cover all domains of the test.
 Determination of the scoring model: cumulative, categorical (class), or ipsative.
 Creation of the final format of the test.

Test Tryout

 Administration of a test to a representative sample of test takers under conditions that simulate the conditions under which the final version of the test will be administered.
 Issues:
 Determination of the target population.
 Determination of the number of participants for the test tryout (the number of items multiplied by 10).
 The test tryout should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered.
Item Analysis

 Entails procedures, usually statistical, designed to explore how individual test items work, both as compared to other items in the test and in the context of the whole test.
 Determination of the following:
 Reliability
 Validity
 Item difficulty
 Item discrimination
 Balancing of the weaknesses and strengths of the test and of individual items.
Test Revision

 After all necessary items have been revised based on the analyses of reliability, validity, item difficulty, and item discrimination, the test is tried out again to recalibrate its psychometric properties.
 The second tryout is done after the test has been revised to acceptable levels of reliability, validity, and item indices.
