
Psychological Testing

Ninth Edition

Chapter 6

Writing and Evaluating Test Items

© 2019 Cengage. All rights reserved.


Scale for political ideology

• https://www.politicalcompass.org/test



Item Writing

• Six guidelines for writing test items (DeVellis, 2016)


1. Define clearly what you wish to measure
2. Generate a pool of items
3. Avoid items that are exceptionally long
4. Be aware of the reading level of those taking the scale and the reading level of the items
5. Avoid items that convey two or more ideas at the same time
6. Consider using questions that mix positive and negative wording
• It is also necessary to be aware of diversity issues



Item Formats—The Dichotomous Format
• This approach offers two choices for each question
– Examples include yes/no or true/false types of questions
– Appears on educational as well as personality tests
• Advantages
– Simplicity
– Often requires absolute judgment
• Disadvantages
– Can promote memorization without understanding
– Many situations are not truly dichotomous
– There is a 50% chance of getting any item right even if the material is not known



The Polytomous Format (1 of 3)

• Similar to the dichotomous format, but has more than two options
• The most common example is the multiple-choice test, where there is one right answer and several wrong answers
– Incorrect answers are called distractors
• Issues to consider
– How many distractors to use?
  ▪ What does psychometric analysis say?
– What is the impact of poor distractors? (see the sketch below)
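
These questions are usually answered empirically. A minimal sketch of a basic distractor analysis, comparing how often high and low scorers choose each option; the data, option labels, and the top/bottom 27% cutoff convention are illustrative assumptions, not taken from the chapter:

```python
from collections import Counter

def distractor_analysis(choices, totals, frac=0.27):
    """Tabulate which option each examinee chose, separately for the
    highest- and lowest-scoring groups. A good distractor attracts low
    scorers; an option nobody chooses adds nothing to the item."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    n = max(1, int(len(totals) * frac))
    low = Counter(choices[i] for i in order[:n])
    high = Counter(choices[i] for i in order[-n:])
    return {"low scorers": dict(low), "high scorers": dict(high)}

# Illustrative data: option "A" is keyed correct.
choices = ["A", "C", "B", "A", "D", "A", "B", "A"]
totals = [9, 3, 5, 8, 2, 7, 4, 9]
print(distractor_analysis(choices, totals))
# {'low scorers': {'D': 1, 'C': 1}, 'high scorers': {'A': 2}}
```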



The Polytomous Format (2 of 3)
Table 6.1 Common Problems in Multiple-Choice Item Writing

• Unfocused stem: The stem should include the information necessary to answer the question. Test takers should not need to read the options to figure out what question is being asked.
• Negative stem: Whenever possible, the stem should exclude negative terms such as "not" and "except."
• Window dressing: Information in the stem that is irrelevant to the question or concept being assessed is considered "window dressing" and should be avoided.
• Unequal option length: The correct answer and the distractors should be about the same length.



The Polytomous Format (3 of 3)

• Negative options: Whenever possible, response options should exclude negatives such as "not."
• Clues to the correct answer: Test writers sometimes inadvertently provide clues by using vague terms such as might, may, and can. Particularly in the social sciences, where certainty is rare, vague terms may signal that the option is correct.
• Heterogeneous options: The correct option and all of the distractors should be in the same general category.

Adapted from DiSantis et al. (2015).



The Polytomous Format—Guessing

• On a test item with a limited number of response options, a certain number of items can be answered correctly through simple guessing
• A formula corrects for this guessing effect; omitted items (those left without a response) are excluded from the correction (see the sketch below)
• Whether or not guessing is a good idea depends on
whether incorrect answers carry greater penalties than
simply getting no credit
– What factors influence whether people guess on test
items?
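
The slide does not reproduce the formula, but the standard correction for guessing is the number right minus the number wrong divided by one less than the number of options. A minimal sketch, with illustrative names and data:

```python
def corrected_score(num_right, num_wrong, num_options):
    """Standard correction for guessing: each wrong answer costs
    1/(k - 1) of a point, the expected gain from blind guessing on a
    k-option item. Omitted items are excluded, so they are never
    counted as right or wrong."""
    return num_right - num_wrong / (num_options - 1)

# 60 right, 24 wrong, 16 omitted on a 100-item, 4-option test:
print(corrected_score(60, 24, 4))  # 60 - 24/3 = 52.0
```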



The Likert Format

• Pronounced “Lick-ert,” not “Like-ert”


• Offers a continuum of responses that allows for measurement of attitudes on various topics
• Is open to factor analysis, so groups of items that go together can be identified
• Is a familiar approach that is easy to use (a scoring sketch follows below)
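
As a concrete illustration of scoring, here is a minimal sketch of how a five-point Likert scale might be scored when some items are negatively worded (recall guideline 6 above); the responses and keying are invented for illustration:

```python
# 5-point Likert responses: 1 = strongly disagree ... 5 = strongly agree.
# Negatively worded items are reverse-keyed so a high total always means
# more of the attitude being measured.
responses = [5, 2, 4, 1, 3]                 # one respondent, five items
reverse_keyed = [False, True, False, True, False]

scored = [(6 - r) if rev else r             # 6 - r flips a 1-to-5 response
          for r, rev in zip(responses, reverse_keyed)]
print(scored, sum(scored))                  # [5, 4, 4, 5, 3] 21
```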



The Likert Format
TABLE 6.2 Examples of Likert Scale Items

Following is a list of statements. Please indicate how strongly you agree or disagree by circling your answer to the right of the statement.

Five-choice format with a neutral point. Response options for each statement: Strongly disagree / Somewhat disagree / Neither agree nor disagree / Somewhat agree / Strongly agree

• Some politicians can be trusted
• I am confident that I will achieve my life goals
• I am comfortable talking to my parents about personal problems



The Category Format

• Similar to a Likert format, but has a greater number of choices
• “On a scale of one to ten….”
• Can be controversial for a number of reasons
– People’s ratings can be affected by a number of factors
that can threaten the validity of their responses
– Context can change the way people respond to items
– Clearly defined endpoints can help overcome such
issues
– What is the optimal range? 1 to 10? 1 to 7? Why?
• Visual analogue scales ask respondents to mark a point on a line between two defined endpoints



Checklists and Q-Sorts
• Adjective checklists provide a list of terms, and the individual selects those most characteristic of himself or herself
– Can also be used to describe someone else
• Q-sorts provide a list of adjectives that must be sorted into nine piles according to their similarity to the target person



Other Possibilities

• Checklists have fallen out of favor
• Forced-choice and Likert formats remain the most popular
• Table 6.3 shows what different texts on test writing have found
– An especially important piece of advice is to avoid "all of the above" and "none of the above" options, though many test writers still use them
• Attending to these factors is essential to writing good questions and enhances the overall quality of the resulting scale



Item Analysis—Item Difficulty (1 of 2)

• Item analysis is a general term for a set of methods used to evaluate test items
• Item difficulty is an essential evaluation component
– Asks what percentage of people got an item correct
– What factors go into determining a reasonable difficulty level?
  ▪ How many answer options are there?



Item Analysis—Item Difficulty (2 of 2)
  ▪ What would we expect chance to produce?
  ▪ What difficulty level discriminates those who know the answer from those who do not?
• A difficulty of .30 to .70 is usually optimal for providing information about differences between individuals (see the sketch below)
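
A minimal sketch of both calculations: item difficulty as the proportion answering correctly, and the common rule of thumb that optimal difficulty sits halfway between chance performance and a perfect score (the data are invented):

```python
def item_difficulty(item_scores):
    """Item difficulty = proportion of test takers who got the item right.
    `item_scores` is a list of 0/1 values for one item."""
    return sum(item_scores) / len(item_scores)

def optimal_difficulty(num_options):
    """Halfway between chance success (1/k) and a perfect 1.0."""
    chance = 1.0 / num_options
    return (1.0 + chance) / 2

print(item_difficulty([1, 0, 1, 1, 0, 1, 1, 0]))  # 0.625
print(optimal_difficulty(4))                      # 0.625 for 4-option items
```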



Item Discriminability

• Determines whether people who have done well on a particular item have also done well on the entire test
• The extreme group method
– Compares those who did well on the overall test to those who did poorly
– Yields a discrimination index (see the sketch below)
• The Point Biserial method
– Correlation between a dichotomous and a continuous
variable (individual item versus overall test score)
– Is less useful on tests with fewer items
– Point biserial correlations closer to 1.0 indicate better
questions
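
A minimal sketch of both methods, assuming 0/1 item scores and the common top/bottom 27% convention for the extreme groups (data invented):

```python
from statistics import mean, pstdev

def discrimination_index(item, totals, frac=0.27):
    """Extreme group method: proportion correct among the top scorers
    minus the proportion correct among the bottom scorers."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    n = max(1, int(len(totals) * frac))
    return (mean(item[i] for i in order[-n:]) -
            mean(item[i] for i in order[:n]))

def point_biserial(item, totals):
    """Correlation between a dichotomous item score and the continuous
    total score: r_pb = (M1 - M0) / s_total * sqrt(p * (1 - p))."""
    p = mean(item)
    m1 = mean(t for t, x in zip(totals, item) if x == 1)
    m0 = mean(t for t, x in zip(totals, item) if x == 0)
    return (m1 - m0) / pstdev(totals) * (p * (1 - p)) ** 0.5

item = [1, 1, 0, 1, 0, 0, 1, 0]
totals = [9, 8, 4, 7, 5, 3, 8, 2]
print(discrimination_index(item, totals))      # 1.0
print(round(point_biserial(item, totals), 2))  # 0.92, close to 1.0 is good
```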



Pictures of Item Characteristics—Drawing
the Item Characteristic Curve
• An item characteristic curve graphs total test score on
the x-axis and the proportion of respondents who got
an item right on the y-axis
• Often uses categories on the x-axis rather than all data
points
• Can employ the point biserial or discrimination index
• Visuals can indicate strong or weak questions (a tabulation sketch follows below)
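
A minimal sketch of tabulating the points for such a curve, grouping total scores into bands and computing the proportion who passed the item in each band; the band width and data are invented:

```python
from collections import defaultdict

def icc_points(item, totals, band_width=2):
    """Group examinees into total-score bands (the x-axis categories)
    and compute the proportion passing the item in each band (y-axis)."""
    bands = defaultdict(list)
    for score, total in zip(item, totals):
        bands[total // band_width].append(score)
    return {b * band_width: sum(s) / len(s) for b, s in sorted(bands.items())}

item = [0, 0, 1, 0, 1, 1, 1, 1]
totals = [2, 3, 4, 5, 6, 7, 8, 9]
print(icc_points(item, totals))  # {2: 0.0, 4: 0.5, 6: 1.0, 8: 1.0}
# A curve that rises with total score suggests a discriminating item.
```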



Item Response Theory

• Item response theory evaluates test quality by examining the chances of getting an item right or wrong (a model sketch follows below)
• Considers each item individually and takes into account the ability of the test taker
• Advantage: focuses not so much on the number of right or wrong answers as on the difficulty level of the questions a person can answer correctly
• Different approaches to such item analysis
• Very amenable to computer administration
• Figure 6.9 shows the different approaches to question
selection, along with the benefits of each
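
The chapter describes IRT conceptually; one widely used model (not specified on this slide) is the two-parameter logistic (2PL), sketched below with illustrative parameter values:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability that an examinee of ability `theta`
    answers correctly, given item discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An average examinee (theta = 0) facing an average-difficulty item
# (b = 0) has a 50% chance; higher ability raises the probability.
print(p_correct(0.0, a=1.5, b=0.0))  # 0.5
print(p_correct(1.0, a=1.5, b=0.0))  # ~0.82
```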
Linking Uncommon Measures

• How do we connect tests that do not use the same items?
• One example is the SAT, which uses different test items each time it is administered, yet produces the same scoring system and is often treated as if each version were equivalent (parallel forms, split-half)
• Statistical methods to create appropriate comparisons can be effective, but not perfect (see the sketch below)
– Not as easy as simply converting Fahrenheit to Celsius
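
One simple example of such a method is linear (mean-sigma) equating, which matches the means and standard deviations of two forms. A minimal sketch under the assumption of equivalent groups, with invented data:

```python
from statistics import mean, pstdev

def linear_equate(score_x, form_x_scores, form_y_scores):
    """Map a form X score onto the form Y scale so that it keeps the
    same standardized position (mean-sigma linear equating)."""
    mx, sx = mean(form_x_scores), pstdev(form_x_scores)
    my, sy = mean(form_y_scores), pstdev(form_y_scores)
    return my + sy * (score_x - mx) / sx

form_x = [40, 45, 50, 55, 60]  # scores on one form (illustrative)
form_y = [48, 52, 56, 60, 64]  # scores on an easier parallel form
print(linear_equate(50, form_x, form_y))  # 56.0: same relative standing
```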



Items for Criterion-Referenced
Tests (1 of 3)
• Remember that this kind of test compares a
performance to a well-defined criterion for learning
• In educational testing, the specific topics to be learned in a given class may serve as that criterion



Items for Criterion-Referenced
Tests (2 of 3)
• Such items require several steps:
– Specify the objectives for the assessment and what the
learning program attempts to achieve
– Develop items toward that goal
– Administer to two groups of students: those who have
and those who have not been exposed to the learning
being assessed
– Compare the scores of the two groups to see if the items discriminate between them (see the sketch below)



Limitations of Item Analysis

• Test analysis can tell us about the quality of a test, but it doesn't help students learn
• The motives of the students may differ from the purposes of a given test or test item
• Purposes of tests are varied, and may emphasize
ranking students over identifying weaknesses or gaps
in knowledge
• If teachers feel they need to “teach to the test,” the
outcomes of a test may be misleading and indicate
more mastery than actually exists

