
Psychological Testing

Ninth Edition

Chapter 6

Writing and Evaluating Test Items

© 2019 Cengage. All rights reserved.


Scale for political ideology

• https://www.politicalcompass.org/test



Item Writing

• Six guidelines for writing test items (DeVellis, 2016)


1. Define clearly what you wish to measure
2. Generate a pool of items
3. Avoid items that are exceptionally long
4. Be aware of the reading level of those taking the scale and the reading level of the items
5. Avoid items that convey two or more ideas at the same time
6. Consider using questions that mix positive and negative wording
• It is also necessary to be aware of diversity issues



Item Formats—The Dichotomous Format
• This approach offers two choices for each question
– Examples include yes/no or true/false types of questions
– Appears on educational as well as personality tests
• Advantages
– Simplicity
– Often requires absolute judgment
• Disadvantages
– Can promote memorization without understanding
– Many situations are not truly dichotomous
– There is a 50% chance of getting any item right even if the material is not known



The Polytomous Format (1 of 3)

• Similar to the dichotomous format, but has more than two options
• The most common example is the multiple-choice test, where there is one right answer and several wrong answers
– Incorrect answers are called distractors
• Issues to consider
– How many distractors to use?
  ▪ What does psychometric analysis say?
– What is the impact of poor distractors? (see the sketch below)
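
These questions are usually answered empirically. A minimal sketch of a basic distractor analysis, comparing how often high and low scorers choose each option; the data, option labels, and the top/bottom 27% cutoff convention are illustrative assumptions, not taken from the chapter:

```python
from collections import Counter

def distractor_analysis(choices, totals, frac=0.27):
    """Tabulate which option each examinee chose, separately for the
    highest- and lowest-scoring groups. A good distractor attracts low
    scorers; an option nobody chooses adds nothing to the item."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    n = max(1, int(len(totals) * frac))
    low = Counter(choices[i] for i in order[:n])
    high = Counter(choices[i] for i in order[-n:])
    return {"low scorers": dict(low), "high scorers": dict(high)}

# Illustrative data: option "A" is keyed correct.
choices = ["A", "C", "B", "A", "D", "A", "B", "A"]
totals = [9, 3, 5, 8, 2, 7, 4, 9]
print(distractor_analysis(choices, totals))
# {'low scorers': {'D': 1, 'C': 1}, 'high scorers': {'A': 2}}
```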



The Polytomous Format (2 of 3)
Table 6.1 Common Problems in Multiple-Choice Item Writing

• Unfocused stem: The stem should include the information necessary to answer the question. Test takers should not need to read the options to figure out what question is being asked.
• Negative stem: Whenever possible, the stem should exclude negative terms such as "not" and "except."
• Window dressing: Information in the stem that is irrelevant to the question or concept being assessed is considered "window dressing" and should be avoided.
• Unequal option length: The correct answer and the distractors should be about the same length.



The Polytomous Format (3 of 3)

• Negative options: Whenever possible, response options should exclude negatives such as "not."
• Clues to the correct answer: Test writers sometimes inadvertently provide clues by using vague terms such as might, may, and can. Particularly in the social sciences, where certainty is rare, vague terms may signal that the option is correct.
• Heterogeneous options: The correct option and all of the distractors should be in the same general category.

Adapted from DiSantis et al. (2015).



The Polytomous Format—Guessing

• On a test item with a limited number of response options, a certain number of items can be answered correctly through simple guessing
• A formula corrects for this guessing effect; omitted items (those left without a response) are excluded from the correction (see the sketch below)
• Whether or not guessing is a good idea depends on
whether incorrect answers carry greater penalties than
simply getting no credit
– What factors influence whether people guess on test
items?
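
The slide does not reproduce the formula, but the standard correction for guessing is the number right minus the number wrong divided by one less than the number of options. A minimal sketch, with illustrative names and data:

```python
def corrected_score(num_right, num_wrong, num_options):
    """Standard correction for guessing: each wrong answer costs
    1/(k - 1) of a point, the expected gain from blind guessing on a
    k-option item. Omitted items are excluded, so they are never
    counted as right or wrong."""
    return num_right - num_wrong / (num_options - 1)

# 60 right, 24 wrong, 16 omitted on a 100-item, 4-option test:
print(corrected_score(60, 24, 4))  # 60 - 24/3 = 52.0
```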



The Likert Format

• Pronounced “Lick-ert,” not “Like-ert”


• Offers a continuum of responses that allows for measurement of attitudes on various topics
• Is open to factor analysis, so groups of items that go together can be identified
• Is a familiar approach that is easy to use (a scoring sketch follows below)
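
As a concrete illustration of scoring, here is a minimal sketch of how a five-point Likert scale might be scored when some items are negatively worded (recall guideline 6 above); the responses and keying are invented for illustration:

```python
# 5-point Likert responses: 1 = strongly disagree ... 5 = strongly agree.
# Negatively worded items are reverse-keyed so a high total always means
# more of the attitude being measured.
responses = [5, 2, 4, 1, 3]                 # one respondent, five items
reverse_keyed = [False, True, False, True, False]

scored = [(6 - r) if rev else r             # 6 - r flips a 1-to-5 response
          for r, rev in zip(responses, reverse_keyed)]
print(scored, sum(scored))                  # [5, 4, 4, 5, 3] 21
```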



The Likert Format
TABLE 6.2 Examples of Likert Scale Items

Following is a list of statements. Please indicate how strongly you agree or disagree by circling your answer to the right of the statement.

Five-choice format with a neutral point. Response options for each statement: Strongly disagree / Somewhat disagree / Neither agree nor disagree / Somewhat agree / Strongly agree

• Some politicians can be trusted
• I am confident that I will achieve my life goals
• I am comfortable talking to my parents about personal problems



The Category Format

• Similar to a Likert format, but has a greater number of choices
• “On a scale of one to ten….”
• Can be controversial for a number of reasons
– People’s ratings can be affected by a number of factors
that can threaten the validity of their responses
– Context can change the way people respond to items
– Clearly defined endpoints can help overcome such
issues
– What is the optimal range? 1 to 10? 1 to 7? Why?
• Visual analogue scales ask respondents to mark a point on a line between two defined endpoints



Checklists and Q-Sorts
• Adjective checklists provide a list of terms, and the individual selects those most characteristic of himself or herself
– Can also be used to describe someone else
• Q-sorts provide a list of adjectives that must be sorted into nine piles according to their similarity to the target person



Other Possibilities

• Checklists have fallen out of favor
• Forced-choice and Likert formats remain the most popular
• Table 6.3 shows what different texts on test writing have found
– An especially important piece of advice is to avoid "all of the above" and "none of the above" options, though many test writers still use them
• Attending to these factors is essential to writing good questions and enhances the overall quality of the resulting scale



Item Analysis—Item Difficulty (1 of 2)

• Item analysis is a general term for a set of methods used to evaluate test items
• Item difficulty is an essential evaluation component
– Asks what percentage of people got an item correct
– What factors go into determining a reasonable difficulty level?
  ▪ How many answer options are there?



Item Analysis—Item Difficulty (2 of 2)
  ▪ What would we expect chance to produce?
  ▪ What difficulty level discriminates those who know the answer from those who do not?
• A difficulty of .30 to .70 is usually optimal for providing information about differences between individuals (see the sketch below)
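
A minimal sketch of both calculations: item difficulty as the proportion answering correctly, and the common rule of thumb that optimal difficulty sits halfway between chance performance and a perfect score (the data are invented):

```python
def item_difficulty(item_scores):
    """Item difficulty = proportion of test takers who got the item right.
    `item_scores` is a list of 0/1 values for one item."""
    return sum(item_scores) / len(item_scores)

def optimal_difficulty(num_options):
    """Halfway between chance success (1/k) and a perfect 1.0."""
    chance = 1.0 / num_options
    return (1.0 + chance) / 2

print(item_difficulty([1, 0, 1, 1, 0, 1, 1, 0]))  # 0.625
print(optimal_difficulty(4))                      # 0.625 for 4-option items
```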



Item Discriminability

• Determines whether people who have done well on a particular item have also done well on the entire test
• The extreme group method
– Compares those who did well on the overall test to those who did poorly
– Yields a discrimination index (see the sketch below)
• The Point Biserial method
– Correlation between a dichotomous and a continuous
variable (individual item versus overall test score)
– Is less useful on tests with fewer items
– Point biserial correlations closer to 1.0 indicate better
questions
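
A minimal sketch of both methods, assuming 0/1 item scores and the common top/bottom 27% convention for the extreme groups (data invented):

```python
from statistics import mean, pstdev

def discrimination_index(item, totals, frac=0.27):
    """Extreme group method: proportion correct among the top scorers
    minus the proportion correct among the bottom scorers."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    n = max(1, int(len(totals) * frac))
    return (mean(item[i] for i in order[-n:]) -
            mean(item[i] for i in order[:n]))

def point_biserial(item, totals):
    """Correlation between a dichotomous item score and the continuous
    total score: r_pb = (M1 - M0) / s_total * sqrt(p * (1 - p))."""
    p = mean(item)
    m1 = mean(t for t, x in zip(totals, item) if x == 1)
    m0 = mean(t for t, x in zip(totals, item) if x == 0)
    return (m1 - m0) / pstdev(totals) * (p * (1 - p)) ** 0.5

item = [1, 1, 0, 1, 0, 0, 1, 0]
totals = [9, 8, 4, 7, 5, 3, 8, 2]
print(discrimination_index(item, totals))      # 1.0
print(round(point_biserial(item, totals), 2))  # 0.92, close to 1.0 is good
```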



Pictures of Item Characteristics—Drawing
the Item Characteristic Curve
• An item characteristic curve graphs total test score on
the x-axis and the proportion of respondents who got
an item right on the y-axis
• Often uses categories on the x-axis rather than all data
points
• Can employ the point biserial or discrimination index
• Visuals can indicate strong or weak questions (a tabulation sketch follows below)
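
A minimal sketch of tabulating the points for such a curve, grouping total scores into bands and computing the proportion who passed the item in each band; the band width and data are invented:

```python
from collections import defaultdict

def icc_points(item, totals, band_width=2):
    """Group examinees into total-score bands (the x-axis categories)
    and compute the proportion passing the item in each band (y-axis)."""
    bands = defaultdict(list)
    for score, total in zip(item, totals):
        bands[total // band_width].append(score)
    return {b * band_width: sum(s) / len(s) for b, s in sorted(bands.items())}

item = [0, 0, 1, 0, 1, 1, 1, 1]
totals = [2, 3, 4, 5, 6, 7, 8, 9]
print(icc_points(item, totals))  # {2: 0.0, 4: 0.5, 6: 1.0, 8: 1.0}
# A curve that rises with total score suggests a discriminating item.
```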



Item Response Theory

• Item response theory evaluates test quality by examining the chances of getting an item right or wrong (a model sketch follows below)
• Considers each item individually and takes into account the ability of the test taker
• Advantage: focuses not so much on the number of right or wrong answers as on the difficulty level of the questions a person can answer correctly
• Different approaches to such item analysis
• Very amenable to computer administration
• Figure 6.9 shows the different approaches to question
selection, along with the benefits of each
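
The chapter describes IRT conceptually; one widely used model (not specified on this slide) is the two-parameter logistic (2PL), sketched below with illustrative parameter values:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability that an examinee of ability `theta`
    answers correctly, given item discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An average examinee (theta = 0) facing an average-difficulty item
# (b = 0) has a 50% chance; higher ability raises the probability.
print(p_correct(0.0, a=1.5, b=0.0))  # 0.5
print(p_correct(1.0, a=1.5, b=0.0))  # ~0.82
```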
Linking Uncommon Measures

• How do we connect tests that do not use the same items?
• One example is the SAT, which uses different test items each time it is administered, yet produces the same scoring system and is often treated as if each version were equivalent (parallel forms, split-half)
• Statistical methods to create appropriate comparisons can be effective, but not perfect (see the sketch below)
– Not as easy as simply converting Fahrenheit to Celsius
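
One simple example of such a method is linear (mean-sigma) equating, which matches the means and standard deviations of two forms. A minimal sketch under the assumption of equivalent groups, with invented data:

```python
from statistics import mean, pstdev

def linear_equate(score_x, form_x_scores, form_y_scores):
    """Map a form X score onto the form Y scale so that it keeps the
    same standardized position (mean-sigma linear equating)."""
    mx, sx = mean(form_x_scores), pstdev(form_x_scores)
    my, sy = mean(form_y_scores), pstdev(form_y_scores)
    return my + sy * (score_x - mx) / sx

form_x = [40, 45, 50, 55, 60]  # scores on one form (illustrative)
form_y = [48, 52, 56, 60, 64]  # scores on an easier parallel form
print(linear_equate(50, form_x, form_y))  # 56.0: same relative standing
```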



Items for Criterion-Referenced
Tests (1 of 3)
• Remember that this kind of test compares a
performance to a well-defined criterion for learning
• In educational testing, the specific topics to be learned in a given class may serve as that criterion



Items for Criterion-Referenced
Tests (2 of 3)
• Such items require several steps:
– Specify the objectives for the assessment and what the
learning program attempts to achieve
– Develop items toward that goal
– Administer to two groups of students: those who have
and those who have not been exposed to the learning
being assessed
– Compare the scores of the two groups to see if the items discriminate between them (see the sketch below)



Limitations of Item Analysis

• Test analysis can tell us about the quality of a test, but it doesn't help students learn
• The motives of the students may differ from the purposes of a given test or test item
• Purposes of tests are varied, and may emphasize
ranking students over identifying weaknesses or gaps
in knowledge
• If teachers feel they need to “teach to the test,” the
outcomes of a test may be misleading and indicate
more mastery than actually exists

