THE PROCESS OF TEST DEVELOPMENT
TEST DEVELOPMENT
TEST CONSTRUCTION
 Issues such as item writing, scale construction, response sets, & selection of test format.
NORMING & STANDARDIZING TEST
 The test is standardized for use with particular target populations, norms are developed, & considerable research is undertaken to establish estimates of the test's reliability & validity.
TEST PUBLICATION & REVISION
 Tests need to be revised to keep them contemporary and current.
TEST CONSTRUCTION
SELECTING ITEM TYPES
1. Multiple-choice items.
2. Constructed responses: essay questions, short-answer questions, or questions that require a concrete demonstration of skill.
 Constructed-response tests measure a wider & richer array of human abilities & skills than those measured by multiple-choice tests.
 Multiple-choice tests are often assumed to measure narrow facets of performance; constructed-response formats, by allowing examinees to respond in a less structured way, are assumed to tap a broader range of abilities.
MULTIPLE-CHOICE ITEMS ARE OFTEN A BETTER CHOICE THAN CONSTRUCTED-RESPONSE ITEMS:
 It is very difficult to develop reliable methods of scoring constructed-response items.
 Methods of scoring constructed-response items are likely to be expensive and slow.
 MC tests are relatively easy to develop and score.
ITEM WRITING
 First step in the test-construction process: generate an item pool.
 Issues to consider:
- Item length
- The wording of items
- Academic status
- Sexist, racist, or otherwise offensive language
ITEM CONTENT
 Test developers may use a particular theoretical system to guide them in writing items for their test – translating the ideas of that theory into test items.
 Drawbacks:
I. Such items tend to be quite transparent, and their intent quite apparent.
II. Examinees often respond to such obvious items as they would like to appear, rather than as they actually perceive themselves.
SCALE CONSTRUCTION
 Each item on a psychological test represents an observation of some particular behavior or trait.
 By grouping similar items together, test developers can generate multiple observations of the same behavior or trait.
 3 scaling methods:
1. Rational Scales
Items are grouped into scales on the basis of a theory of the trait being measured.
Advantage: Test developers can then draw on the theory to make predictions about behavior, since the scales have been designed to measure the theory's concepts.
Shortcoming: The usefulness of such scales is tied to the validity of the theory on which they are based.
2. Empirical Scales
Empirically derived scales rely on criterion keying for their validity.
Well-defined criterion groups are administered a set of possible test items; items that statistically differentiate the groups are included on the test (a minimal sketch of this keying step follows this list).
Example: a group of adult male schizophrenics diagnosed by clinical interview vs. a group of normal, emotionally well-adjusted adult males.
Both groups would be given the same set of items; those that clearly separated the schizophrenic group from the normal group would be retained.
Factor analysis can also be used for the item-selection process.
A drawback of factor analysis: it is often very difficult to capture the essence of the scales formed by such item groupings, or their relationship to clinically useful concepts.
Different methods of factor analysis also produce variability in the item groupings used to construct the scales.
o The strength of this method: it yields instruments with sound psychometric properties.
3. Rational-Empirical Scales
Combine aspects of both the rational and the empirical approaches to scale construction in the test-development process.
Example: the Personality Research Form.
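
As a concrete illustration of the criterion-keying step above, here is a minimal sketch in Python. The response data, group sizes, and z-score threshold are all hypothetical; real keying studies use far larger samples and cross-validation.

import math

def criterion_key(group_a, group_b, z_crit=1.96):
    """group_a, group_b: lists of 0/1 response vectors, one per respondent.
    Returns indices of items that statistically differentiate the groups."""
    n_a, n_b = len(group_a), len(group_b)
    keyed = []
    for i in range(len(group_a[0])):
        s_a = sum(r[i] for r in group_a)      # endorsements in group A
        s_b = sum(r[i] for r in group_b)      # endorsements in group B
        p_a, p_b = s_a / n_a, s_b / n_b
        pooled = (s_a + s_b) / (n_a + n_b)    # pooled endorsement rate
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        # Two-proportion z-test: keep the item only if the groups differ reliably.
        if se > 0 and abs(p_a - p_b) / se >= z_crit:
            keyed.append(i)
    return keyed

# Hypothetical illustration: 4 respondents per group, 3 candidate items.
clinical = [[1, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 1]]
control  = [[0, 0, 1], [0, 1, 1], [0, 0, 0], [0, 1, 1]]
print(criterion_key(clinical, control))  # -> [0]: only item 0 separates the groups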
RESPONSE SETS
• To measure various psychological traits, test developers often employ self-report inventories – asking examinees to share information about themselves.
• Some information is too personal or embarrassing to report.
• Some examinees may feel coerced into completing an inventory and then decide not to complete it accurately.
• Some find the questions stated in such vague terms that they are unsure how to respond in order to present an accurate reflection of their thoughts, feelings, attitudes, or behaviors.
• All of these factors add unwanted error variance to test scores and
confound the interpretation process.
• Better psychological tests have built-in methods for detecting these
sources of extraneous variance.
The sources of extraneous variance:
1. Social Desirability
• When examinees respond to test items, they attend not so much to the trait being measured as to the social acceptability of the statement.
• Edwards (1957, 1970) documented the effects of this response tendency on measures of personality.
• Strategies to control or eliminate the effects of socially desirable responding on personality tests:
1. Pair items of equal desirability ratings in an ipsative format.
2. Select neutral items that are neither strongly positive nor strongly negative on the social desirability dimension.
3. Statistically adjust test scores to eliminate any such effects.
2. Random Responding
• Results when examinees fail to attend to the content of test items and respond randomly.
• Occurs when examinees are unmotivated or incapable of completing the task accurately – e.g., when they do not want to be evaluated, are unable to read the questions, or are too disturbed to attend to the task.
• Strategy to control random responding: tests designed to identify random response patterns include items that tend to be universally true or false for everyone (a minimal sketch of such a check follows this list).
3. Dissimulation
• Refers to completing an inventory in such a manner as to appear overly healthy (faking good) or overly disturbed (faking bad).
• This response set is most operative in situations in which testing is done as part of a selection process for jobs, promotions, or awards, or for other decision making.
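
To make the random-responding check concrete, here is a minimal sketch in Python of an infrequency-style validity scale. The item numbers, keyed answers, and cutoff below are hypothetical.

# Items that virtually everyone answers the same way; a protocol that
# contradicts several of them was probably completed without reading.
INFREQUENCY_ITEMS = {3: True, 17: False, 42: True}  # item index -> near-universal answer
CUTOFF = 2  # flag protocols deviating on 2 or more such items

def flag_random_responding(responses):
    """responses: dict mapping item index -> True/False answer given."""
    deviations = sum(
        1 for item, expected in INFREQUENCY_ITEMS.items()
        if item in responses and responses[item] != expected
    )
    return deviations >= CUTOFF

# A random responder contradicts each infrequency item about half the time,
# so deviations accumulate and the protocol is flagged as possibly invalid.
print(flag_random_responding({3: False, 17: True, 42: True}))  # -> True (flagged)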
ITEM TYPES: RESPONSE FORMAT/ALTERNATIVES
1. True-false response format (e.g., the Personality Research Form).
2. Multiple-choice answer format (e.g., the Tennessee Self-Concept Scale – a 5-point Likert scale).
3. Projective tests – require examinees to give unstructured, free responses to various test stimuli (e.g., the Rorschach and the Rotter Incomplete Sentences Blank).
4. Product formats – examinees are required to produce certain kinds of products as part of the testing process (e.g., achievement & aptitude testing).
 The most popular format in objective testing has been the multiple-choice item.
 MC advantages:
1. Test-item scoring is easy & rapid.
2. The error in the measurement process associated with guessing is reduced as the number of alternatives in the item is increased (see the worked example at the end of this section).
3. No examiner judgment is involved in the scoring of MC items.
 MC disadvantages:
It is often difficult to write alternative statements that will function as good distracters – they may contain transparent cues that signal the correct response.
 Free-response format:
 Advantage: results often convey a richness & depth of thought/reflection that would not otherwise be detectable with a structured response format.
 Disadvantage: scoring problems.
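
To illustrate MC advantage 2: under blind guessing, the expected chance score on n items with k alternatives is n/k, so it shrinks as k grows, and the classic correction-for-guessing formula R - W/(k - 1) adjusts an observed score for it. A small worked example in Python, with illustrative numbers only:

def expected_guess_score(n_items, k_alternatives):
    """Expected number correct if an examinee guesses blindly on every item."""
    return n_items / k_alternatives

def corrected_score(right, wrong, k_alternatives):
    """Classic correction for guessing: R - W / (k - 1)."""
    return right - wrong / (k_alternatives - 1)

for k in (2, 3, 4, 5):
    print(f"{k} alternatives: chance score on 100 items = {expected_guess_score(100, k):.1f}")
# 2 alternatives -> 50.0, 3 -> 33.3, 4 -> 25.0, 5 -> 20.0

# An examinee with 60 right and 40 wrong on 100 four-alternative items:
print(corrected_score(60, 40, 4))  # -> 46.67: 60 - 40/3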
NORMING & STANDARDIZING TEST
Importance: standardizing testing procedures eliminates additional sources of error from the measurement process.
NORMING:
 Selecting & testing appropriate norm groups for psychological tests are difficult & very expensive tasks.
 Several different norm groups are usually established for a test so that test users can choose the most appropriate comparison group for their particular purposes.
 Ideally, a good comparison group provides a representative sample of the population in question – neither too specific & limited, nor too broad & inapplicable, for meaningful score interpretation (a minimal sketch of norm-referenced interpretation follows).
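
As a sketch of what such norm-referenced interpretation looks like in practice, the Python snippet below converts a raw score to a percentile rank within a comparison group. The norm-group scores are hypothetical.

from bisect import bisect_left, bisect_right

NORM_GROUP = sorted([12, 15, 15, 18, 20, 21, 21, 23, 25, 28])  # hypothetical raw scores

def percentile_rank(raw_score, norms=NORM_GROUP):
    """Percentage of the norm group scoring below raw_score,
    counting half of any ties (the conventional percentile rank)."""
    below = bisect_left(norms, raw_score)
    ties = bisect_right(norms, raw_score) - below
    return 100 * (below + 0.5 * ties) / len(norms)

print(percentile_rank(21))  # -> 60.0: an examinee scoring 21 outranks ~60% of the group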
1. Defining the Target Population
A primary concern in the development of norms is the composition of the normative group.
• Example: most tests of mental ability are designed to assess an individual's ability relative to that of others in the general population.
• The appropriate normative group would consist of a random sample of people representative of all educational levels.
• If a more restricted norm group were used (e.g., people with college degrees), it would significantly change the interpretability of test results.
• A critical first step in the development and interpretation of norms is to ensure that the target group on which they are based is relevant and appropriate.
2. Selecting the Sample
• The key in the sampling process is to obtain samples that are a representative cross-section of the target population.
• Example: the target population for norm development for an achievement-testing program consists of beginning- and end-of-year students in grades 7, 8, and 9.
• A variety of sampling techniques could be used to obtain a representative sample.
• The method most likely to be used when developing large-scale norms is a variation of cluster sampling (a minimal sketch follows this list):
1. Identify regions of the country that should be represented; further subdivisions may be made into rural and urban areas.
2. On the basis of the overall plan for how many students are needed, determine how many school systems within the various regions and rural-urban areas would be needed to obtain the desired number of participants.
3. Select the systems at random and convince as many as possible that it would be to their benefit to participate in the norming project.
4. Within each system, try to sample as many of the grade 7, 8, and 9 students as possible.
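
Here is a minimal sketch in Python of the multi-stage cluster sampling just described: every region is represented, school systems are chosen at random within each region, and students are sampled within each system. The region names, system names, and counts are all hypothetical, and step 2 (deciding how many systems are needed) is folded into the systems_per_region parameter.

import random

random.seed(0)  # reproducible illustration

# region -> school system -> (hypothetical) grade 7-9 student IDs
population = {
    "Northeast": {"SysA": list(range(300)), "SysB": list(range(250))},
    "South":     {"SysC": list(range(400)), "SysD": list(range(150))},
    "Midwest":   {"SysE": list(range(350)), "SysF": list(range(200))},
}

def cluster_sample(pop, systems_per_region=1, students_per_system=50):
    sample = []
    for region, systems in pop.items():                   # step 1: represent every region
        for sys_name in random.sample(list(systems), systems_per_region):  # step 3
            students = systems[sys_name]
            n = min(students_per_system, len(students))   # step 4: sample within system
            sample.extend((region, sys_name, s) for s in random.sample(students, n))
    return sample

norm_sample = cluster_sample(population)
print(len(norm_sample))  # -> 150 students across three regions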
STANDARDIZING: The goal of standardization is to prevent as many extraneous variables as possible from affecting test performance.
• These include such things as test-taking instructions, time limits, scoring procedures, and guidelines for score interpretation.
• Other factors are not so intuitively obvious – e.g., the physical testing conditions, the examinee's health, and the time of testing.
• The practical solution is to standardize the test-administration process.
• Standardization involves keeping testing conditions as invariant as possible from one administration to another.
TEST PUBLICATION & REVISION
Writing the Manual
• In preparing the manual, the test developer:
• outlines the purpose of the test,
• specifies directions for test administration and scoring, and
• describes in detail each step in the test's development.
• Information about the reliability and validity of the test needs to be included, so that users can evaluate its psychometric characteristics and its applicability to various assessment situations.
• Detailed information about the norming process should also be given – how the norm groups were selected and how the testing was done – along with a thorough description of the samples employed (e.g., the number tested, age, race, sex, and geographical location).
Revising the Test
• When a test needs revision depends on the timelessness of its content.
• More frequent revisions are necessary when item content becomes dated.
• Another factor that influences the timing of test revisions is their popularity: widely used tests generate new data, and new data often suggest item-content modifications, changes in test-administration procedures, or alterations in test scoring.