Professional Documents
Culture Documents
TEST DEVELOPMENT
TEST CONSTRUCTION
Issues such as item writing, scale construction, response sets, & selection of test format.

NORMING & STANDARDIZING TESTS
In standardizing a test for use with particular target populations, norms are developed & considerable research is undertaken to establish estimates of the test's reliability & validity.
TEST PUBLICATION & REVISION
Tests need to be revised to keep them contemporary and current.
TEST CONSTRUCTION
SELECTING ITEM TYPES
1. Multiple-choice items.
2. Constructed responses: essay questions, short-answer questions, or questions that require a concrete demonstration of skill.
Constructed response tests measure a wider & richer array
of human abilities & skills than those measured by multiple
choice tests.
Multiple-choice tests are often assumed to measure only narrow facets of performance, and it is often assumed that allowing examinees to respond in a less structured way reveals a broader range of abilities. In practice, however:
MULTIPLE-CHOICE ITEMS ARE OFTEN A BETTER CHOICE THAN CONSTRUCTED-RESPONSE ITEMS.
It is very difficult to develop reliable methods of scoring
constructed-response items.
Method of scoring constructed response item is likely to be
expensive and slow.
MC tests are relatively easy to develop and score.
ITEM WRITING
First step in the test construction process: generate an item pool.
Issues:
- Item length
- The wording of items
- Academic status
- Sexist, racist, or offensive language
ITEM CONTENT
Test developers often use a particular theoretical system to guide the writing of items for their test: they translate the ideas of that theory into test items.
Drawbacks:
I. Such items tend to be quite transparent, their intent quite apparent.
II. Examinees often respond to such obvious items as they would like to appear, rather than as they actually perceive themselves.
SCALE CONSTRUCTION
Each item on a psychological test represents an observation of some particular behavior or trait.
By grouping similar items together, test developers can generate multiple observations of the same behavior or trait.
3 scaling methods:
1. Rational Scales
Advantage: Test developers can then draw on theory to
make predictions about behavior - the scales have been
designed to measure the theory’s concepts.
Shortcoming: Usefulness of such scales is tied to the
validity of the theory on which they are based.
2. Empirical Scales
Empirically derived scales rely on criterion keying for their
validity.
Well-defined criterion groups are administered a set of possible test items.
Items that statistically differentiate the groups are included on the test.
Example: a group of adult male schizophrenics diagnosed by clinical interview vs. a group of normal, emotionally well-adjusted adult males.
Both groups would be given the same set of items; those that clearly separated the schizophrenics from the normal group would be retained.
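The criterion-keying procedure described above can be sketched in a few lines of Python. This is a minimal sketch assuming dichotomous (True/False) items; the response data and the 0.30 endorsement-rate cutoff are invented for illustration:

```python
# Hypothetical sketch of criterion keying: keep items whose endorsement
# rates differ sharply between a criterion group and a comparison group.
# All data and the 0.30 cutoff are made up for illustration.

def endorsement_rate(responses, item):
    """Fraction of examinees in a group who answered True to an item."""
    return sum(r[item] for r in responses) / len(responses)

def criterion_key(criterion_group, comparison_group, n_items, cutoff=0.30):
    """Return indices of items that separate the two groups."""
    keyed = []
    for item in range(n_items):
        diff = abs(endorsement_rate(criterion_group, item)
                   - endorsement_rate(comparison_group, item))
        if diff >= cutoff:          # item differentiates the groups
            keyed.append(item)
    return keyed

# Toy data: each row is one examinee's True/False answers to 4 items.
schizophrenic = [[1, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 1]]
well_adjusted = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]

print(criterion_key(schizophrenic, well_adjusted, 4))  # items 0 and 3
```

An actual test developer would use a proper significance test and cross-validate the keyed items on fresh samples, since some items will separate the groups by chance.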
Factor analysis can also be used for the item-selection process.
A drawback of factor analysis: it is often very difficult to capture the essence of the scales formed by such item groupings, or their relationship to clinically useful concepts.
Different methods of factor analysis produce variability in the item groupings used to construct the scales.
o The strength of this method: it yields instruments with sound psychometric properties.
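The idea behind factor-analytic item grouping, collecting items whose responses hang together statistically, can be caricatured with a simple correlation-based clustering. This is only a toy sketch, not real factor analysis (which extracts latent factors); the data and the 0.8 threshold are invented:

```python
# Toy caricature of factor-analytic scale building: group items whose
# responses correlate highly, since they presumably tap the same trait.
# Real factor analysis extracts latent factors; this greedy clustering
# and all numbers below are invented for illustration.

def mean(xs):
    return sum(xs) / len(xs)

def corr(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def group_items(item_responses, threshold=0.8):
    """Greedily cluster items whose pairwise correlation >= threshold."""
    groups = []
    for i, resp in enumerate(item_responses):
        for g in groups:
            if all(corr(resp, item_responses[j]) >= threshold for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

# Rows are items, columns are examinees; items 0 and 1 move together,
# item 2 runs in the opposite direction.
items = [[1, 2, 3, 4, 5],
         [2, 3, 4, 5, 6],
         [5, 4, 3, 2, 1]]
print(group_items(items))  # [[0, 1], [2]]
```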
3. Rational-Empirical Scales
Combine aspects of both the rational and the empirical
approaches to scale construction in the test development
process.
Example: Personality Research Form
RESPONSE SETS
• To measure various psychological traits, test developers often employ self-report inventories, which ask examinees to share information about themselves.
• Some information is too personal or embarrassing to report.
• Some examinees may feel coerced into completing an
inventory and then decide not to complete it accurately.
• Some find the questions stated in such vague terms that they are unsure how to respond in order to present an accurate reflection of their thoughts, feelings, attitudes, or behaviors.
• All of these factors add unwanted error variance to test scores and
confound the interpretation process.
• Better psychological tests have built-in methods for detecting these
sources of extraneous variance.
The sources of extraneous variance:
1. Social Desirability
• When examinees respond to test items, they do not attend
as much to the trait being measured as to the social
acceptability of the statement.
• Edwards (1957, 1970) documented the effects of this response tendency on measures of personality.
• Strategies to control or eliminate the effects of social desirability
responding on personality tests.
1. To pair items of equal desirability ratings in an ipsative format;
2. To select neutral items that are neither strongly positive nor negative on the social desirability dimension; or
3. To statistically adjust test scores to eliminate any such effects.
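The third strategy, statistically adjusting scores, might look like the following sketch, which partials a social-desirability (SD) scale out of trait scores with ordinary least squares. The scores are invented for illustration:

```python
# Hypothetical sketch of removing social desirability (SD) from trait
# scores by regressing trait on SD and keeping the residual (plus the
# trait mean). All scores below are made up.

def mean(xs):
    return sum(xs) / len(xs)

def adjust_for_sd(trait_scores, sd_scores):
    """Return trait scores with the SD component regressed out."""
    mt, ms = mean(trait_scores), mean(sd_scores)
    # Least-squares slope of the regression of trait on SD.
    num = sum((s - ms) * (t - mt) for s, t in zip(sd_scores, trait_scores))
    den = sum((s - ms) ** 2 for s in sd_scores)
    b = num / den
    # Subtracting b * (SD deviation) leaves the SD-free score.
    return [t - b * (s - ms) for t, s in zip(trait_scores, sd_scores)]

trait = [10, 12, 14, 16]
sd = [2, 4, 6, 8]
print(adjust_for_sd(trait, sd))  # [13.0, 13.0, 13.0, 13.0]
```

In this toy data the trait scores are perfectly SD-driven, so the adjustment removes all the variance; in practice only the SD-related component is removed.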
2. Random responding
• Results when examinees fail to attend to the content of test items
and respond randomly.
• Unmotivated or incapable of completing the task
accurately.
• When examinees do not want to be evaluated or when they are incapable of validly completing the task.
• Unable to read the questions or too disturbed to attend to the task.
• Strategy to control the effects of random responding: tests designed to identify random response patterns include items that tend to be universally true or false for everyone.
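A detection scale of the kind just described can be sketched directly: count the answers that contradict items virtually everyone endorses the same way. The item key and the cutoff of 3 are invented for illustration:

```python
# Hypothetical sketch of a validity check for random responding: count
# disagreements with items that are virtually true (or false) for
# everyone. The key and the cutoff of 3 are made up for illustration.

# Keyed answers to "universal" items, e.g. "I sometimes eat food" -> True.
UNIVERSAL_KEY = [True, True, False, True, False, True]

def infrequency_score(answers, key=UNIVERSAL_KEY):
    """Number of universal items answered contrary to the key."""
    return sum(a != k for a, k in zip(answers, key))

def looks_random(answers, cutoff=3):
    """Flag a protocol whose infrequency score reaches the cutoff."""
    return infrequency_score(answers) >= cutoff

careful = [True, True, False, True, False, True]      # matches the key
random_like = [False, True, True, True, True, False]  # 4 mismatches

print(looks_random(careful), looks_random(random_like))  # False True
```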
3. Dissimulation
• Refers to completing an inventory in such a manner as to appear overly healthy (faking good) or overly disturbed (faking bad).
• This response set is most operative in situations in which testing is
being done as part of a selection process for jobs, promotions,
awards or for decision making.
ITEM TYPES: RESPONSE FORMAT / ALTERNATIVES
1. True-false response format (e.g., Personality Research Form)
2. Multiple-choice answer format (e.g., Tennessee Self-Concept Scale: 5-point Likert scale)
3. Projective tests: require examinees to give unstructured, free responses to various test stimuli (e.g., the Rorschach and the Rotter Incomplete Sentence Blank)
4. Examinees are required to produce certain kinds of products as part of the testing process (e.g., achievement & aptitude testing)
The most popular format in objective testing has involved the use of multiple-choice items.
MC advantages:
1. Test item scoring is easy & rapid
2. The error in the measurement process associated
with guessing is reduced as the number of
alternatives in the item is increased.
3. No examiner judgment is involved in the scoring of MC items.
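Advantage 2 can be made concrete with a little arithmetic. A blind guess on a k-alternative item succeeds with probability 1/k, and the classic correction-for-guessing formula penalizes wrong answers by W/(k - 1); the numbers below are invented:

```python
# Small illustration (not from the text) of why more alternatives
# reduce guessing error: chance success on one item is 1/k, and the
# classic correction-for-guessing formula subtracts W / (k - 1).

def chance_score(n_items, k):
    """Expected number correct from pure guessing on n_items items."""
    return n_items / k

def corrected_score(right, wrong, k):
    """Number right, corrected for guessing with k alternatives."""
    return right - wrong / (k - 1)

for k in (2, 4, 5):
    print(k, chance_score(100, k))   # 50.0, 25.0, 20.0

print(corrected_score(60, 40, 5))    # 60 - 40/4 = 50.0
```

Doubling the alternatives from 2 to 4 halves the expected chance score, which is the sense in which guessing error shrinks as alternatives are added.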
Disadvantages:
It is often difficult to write alternative statements for items that will function as good distracters; they may contain transparent cues that signal the correct response.
Free-response format:
Advantages: results often convey a richness & depth of thought/reflection that would not otherwise be detectable using a structured response format.
Disadvantages: scoring problems.