
Test Development

Test development is an umbrella term for everything that goes into the process of creating a test.

Test development has 5 stages:

1. Test conceptualization: the idea of the test

2. Test construction: writing test items, formatting items, setting scoring rules etc.

3. Test tryout: the test is administered to a representative sample of testtakers under conditions similar to those under which the final test will be administered.

4. Item analysis: testtakers’ performance on the test as a whole and on each item is analyzed using statistical tools.

5. Test revision: modifying a test’s content or format to improve the test’s effectiveness.

Test Conceptualization
Test conceptualization starts with the thoughts of a test developer. The trigger for these
thoughts could be anything that compels him or her to try to develop a new test.

Preliminary questions: regardless of the trigger, a test developer has to answer a number of
questions before starting to develop a test. For example:

- What is the test designed to measure?

- What is the objective of the test? (why is it being developed, what is the goal)

- Is there a need for this test?

- Who will use this test?


- Who will take this test?

- What content will the test cover?

- How will the test be administered?

- What is the ideal format of the test?

Norm-referenced vs. criterion-referenced tests – item development issues: A norm-referenced test compares individuals in a group with one another, whereas a criterion-referenced test measures an individual’s performance against a set criterion. Because the two kinds of test are conceptually different, their items also need to be designed differently.

- A good item on a norm-referenced test is an item that high scorers on the test tend to answer correctly and low scorers tend to answer incorrectly.

- However, that is not what makes an item good or acceptable from a criterion-oriented
perspective. Ideally, each item on a criterion-oriented test addresses the issue of
whether the testtaker—a would-be physician, engineer, piano student, or
whoever—has met certain criteria. In this context, being “first in the class” doesn’t
matter.

- Development of criterion-referenced instruments derives from a conceptualization of the knowledge or skills to be mastered. For purposes of assessment, the required cognitive or motor skills may be broken down into component parts.

- Development of a criterion-referenced test or assessment procedure may entail exploratory work with at least two groups of testtakers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered such knowledge or skill.

Pilot work: the preliminary research surrounding the creation of a prototype of the test. It
may include the following tools:

- Open-ended interviews with research subjects

- Physiological monitoring of subjects (e.g., heart rate) when exposed to certain stimuli

- Literature review

- Experimentation

Test construction
Scaling
The process of setting rules for assigning numbers in measurement. In this process, a
measuring device is designed and calibrated, and numbers are assigned to different amounts
of the trait, attribute, or characteristic being measured.
Types of scales
Scales are instruments used to measure something. In the context of psychological testing,
this something being measured can be a trait, a state, or an ability.

Scaling methods: A testtaker is presumed to have more or less of the characteristic being
measured as a function of the test score. The higher or lower the score, the more or less of
the characteristic the testtaker presumably possesses. But how do we assign these high or
low numbers? Answer: scaling the test items.

● Rating scale: A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.

- Rating scales can be used to record judgments of oneself, others, experiences, or objects, and they can take several forms.

- Rating scales give ordinal-level data.

- Rating scales can be unidimensional (only one dimension is presumed to underlie the ratings) or multidimensional (multiple dimensions guide the testtaker’s response).

● Summative scale: final score is obtained by summing (adding up) the ratings across
all the items.

● Likert scale: A type of summative scale that is usually used to scale attitudes. Each item presents alternative responses on an agree-disagree continuum from which the testtaker chooses, for example: strongly agree, agree, undecided, disagree, strongly disagree. (A summative scoring sketch appears after this list.)
● Method of paired comparisons: Testtakers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare.

- They must select one of the stimuli according to some rule; for example, the rule that they agree more with one statement than the other, or the rule that they find one stimulus more appealing than the other.

● Sorting scales: Another way that ordinal information may be developed and scaled.
It involves comparative scaling and categorical scaling:

- Comparative scaling entails judgments of a stimulus in comparison with every other stimulus on the scale, e.g., giving a list of 30 items and asking the testtaker to rank the justifiability of the items from 1 to 30.

- In categorical scaling, stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. E.g., the testtaker is given 30 items and asked to sort them into three categories: (i) behavior never justified, (ii) behavior sometimes justified, and (iii) behavior always justified.

● Guttman scale: items range sequentially from weaker to stronger expressions of the
attitude, belief, or feeling being measured.

- A feature of Guttman scales is that all respondents who agree with the
stronger statements of the attitude will also agree with milder statements.

- The resulting data are then analyzed by means of scalogram analysis, an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker’s responses.

- The objective for the developer of a measure of attitudes is to obtain an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions.
● Thurstone scaling: used to obtain data that are presumed to be interval in nature.

- A large number of statements reflecting positive and negative attitudes toward, for example, suicide are collected.

- Judges evaluate and give ratings to each statement from 1 (never justified) to
9 (always justified).

- A mean and standard deviation of the ratings for each statement are calculated. A low standard deviation is indicative of a good item because it shows that judges agreed about the meaning of the item with respect to its reflection of attitudes toward suicide.

- Several criteria are used to decide which items are included in the test and whether the test is ready for administration (see book). A computational sketch of Thurstone-style screening follows this list.
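
The Python sketch below is illustrative only: it shows a summative (Likert-type) score computed as the sum of item ratings, and a Thurstone-style screening of statements by judge agreement. The ratings, statements, and the standard-deviation threshold of 2.0 are invented for demonstration; they are not taken from the text.

```python
# Minimal sketch (made-up data): summative scoring of Likert-type items
# and Thurstone-style screening of statements by judge agreement.
from statistics import mean, stdev

# --- Summative (Likert) scoring: one testtaker's ratings on five items,
# each on a 1 (strongly disagree) to 5 (strongly agree) continuum.
ratings = [4, 5, 3, 4, 2]
total_score = sum(ratings)          # final score = sum of ratings across items
print("Summative score:", total_score)

# --- Thurstone-style screening: each statement is rated by a panel of judges
# from 1 (never justified) to 9 (always justified). A low standard deviation
# means the judges agreed on the statement's meaning, so it is a better item.
judge_ratings = {
    "Statement A": [8, 8, 9, 8, 7],   # judges agree -> keep
    "Statement B": [2, 9, 5, 1, 8],   # judges disagree -> discard
}
for statement, scores in judge_ratings.items():
    m, sd = mean(scores), stdev(scores)
    verdict = "keep" if sd < 2.0 else "discard"   # threshold chosen for illustration
    print(f"{statement}: scale value (mean) = {m:.2f}, SD = {sd:.2f} -> {verdict}")
```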

Writing Items
The test developer has to answer three questions when writing items:

a. What range of content should the items cover?

b. What will be the item format?

c. How many items should be written in total and for each content area? (It is usually
advisable that the first draft contain approximately twice the number of items that the
final version of the test will contain.)

Item pool: a reservoir or well from which items will or will not be drawn for the final version
of the test.

Item format
The variables such as the form, plan, structure, arrangement, and layout of individual test
items are collectively called item format. It has two main types:

1. Selected-response – the testtaker has to select a response from given options. Selected-response items come in three types:

- Multiple-choice (a multiple-choice item containing only two possible responses is called a binary-choice item, e.g., yes/no, agree/disagree, right/wrong)

- True-false (this is an example of binary-choice)

- Matching

2. Constructed-response – the testtaker has to supply or create the correct answer. It also has three types:

- Completion item

- Short-answer item

- Essay item (Advantages: requires more skill than a true-false item … Disadvantages: covers a limited area of content, subjectivity in scoring, and inter-scorer differences)

Writing items for computer administration: Computer software can store items in an item bank and can individualize testing through a technique called item branching.

● Item bank: a relatively large and easily accessible collection of test questions

● Computerized adaptive testing (CAT): a computerized testing method in which items presented to the testtaker are based in part on the testtaker’s performance on previous items. CAT tends to reduce floor and ceiling effects (see the sketch after this list).

- Floor effect: the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured.

- Ceiling effect: the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured.
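
Below is a minimal, hypothetical sketch of item branching. It simply selects the unused item whose difficulty is closest to a running ability estimate and nudges the estimate after each response; real CAT systems typically rely on item response theory, so this only conveys the general idea. All item names, difficulty values, step sizes, and responses are made up.

```python
# Simplified item-branching loop for computerized adaptive testing (CAT).
# Hypothetical item bank: item id -> difficulty on a 0-1 scale.
item_bank = {"q1": 0.2, "q2": 0.4, "q3": 0.5, "q4": 0.7, "q5": 0.9}

def next_item(ability, administered):
    """Select the unadministered item whose difficulty best matches ability."""
    candidates = {i: d for i, d in item_bank.items() if i not in administered}
    return min(candidates, key=lambda i: abs(candidates[i] - ability))

ability = 0.5                 # start at a middling ability estimate
administered = []
simulated_answers = {"q3": True, "q4": True, "q5": False}   # made-up responses

for _ in range(3):
    item = next_item(ability, administered)
    administered.append(item)
    correct = simulated_answers.get(item, False)
    # Step the ability estimate toward harder or easier items.
    ability += 0.15 if correct else -0.15
    print(f"{item}: {'correct' if correct else 'incorrect'} -> ability ~ {ability:.2f}")
```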

Scoring Items (Test Scoring)

Various test scoring models have been developed, such as the cumulative model, class scoring, and ipsative scoring (a brief sketch of cumulative and ipsative scoring follows this section).

● Cumulative model: Typically, the rule in a cumulatively scored test is that the
higher the score on the test, the higher the testtaker is on the ability, trait, or other
characteristic that the test purports to measure. For each testtaker response to
targeted items made in a particular way, the testtaker earns cumulative credit with
regard to a particular construct.

● Class scoring (also known as category scoring): The testtaker’s responses earn credit toward
placement in a particular class or category with other testtakers whose pattern of
responses is presumably similar in some way.

● Ipsative scoring: A typical objective in ipsative scoring is comparing a testtaker’s score on one scale within a test with that same testtaker’s score on another scale within the same test. The Edwards Personal Preference Schedule (EPPS) uses ipsative scoring; it measures the relative strength of different psychological needs within the same person.

- For example, “John’s need for achievement is higher than his need for
affiliation.” It would not be appropriate to draw inter-individual comparisons on
the basis of an ipsatively scored test. It would be inappropriate, for example,
to compare two testtakers with a statement like “John’s need for achievement
is higher than Jane’s need for achievement.”
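
A short illustrative sketch contrasting cumulative and ipsative scoring follows. The response vector and the scale scores for “John” are hypothetical; the point is that a cumulative score sums credited responses toward one construct, while an ipsative comparison stays within the same testtaker.

```python
# Minimal sketch contrasting cumulative and ipsative scoring (made-up data).

# Cumulative scoring: every keyed response earns credit toward one construct,
# so a higher total means more of the measured trait.
responses = [1, 0, 1, 1, 1, 0, 1]            # 1 = credited ("keyed") response
cumulative_score = sum(responses)
print("Cumulative score:", cumulative_score)

# Ipsative scoring: scales are compared within the same person, so the
# meaningful statement is "need X is stronger than need Y for this testtaker",
# not a comparison between two different testtakers.
john = {"achievement": 18, "affiliation": 11, "autonomy": 14}   # hypothetical scale scores
strongest = max(john, key=john.get)
print("John's relatively strongest need:", strongest)
if john["achievement"] > john["affiliation"]:
    print("John's need for achievement is higher than his need for affiliation.")
```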
Test Tryout
The test should be tried out on people who are similar in critical respects to the people for whom the test was designed. The number of people on whom the test is tried out is also important: a general rule is that there should be no fewer than 5, and preferably as many as 10, testtakers for each item on the test (so, for example, a 30-item test would ideally be tried out on 150 to 300 people). Moreover, the tryout should be executed under conditions as similar as possible to those under which the final test will be administered.

What is a good item: A good test item is reliable and valid. A test developer identifies good
items through item analysis – different types of statistical scrutiny that the test data can
potentially undergo.

Item Analysis
Various procedures are used by test developers in an attempt to select the best items from a pool of tryout items. The criteria for the best items may differ according to the test developer’s objectives. (A short computational sketch of the indices described below follows this list.)

● Item-Difficulty Index (p): It is obtained by calculating the proportion of the total number of testtakers who answered the item correctly.

- The value of an item-difficulty index can theoretically range from 0 (no one got the item right) to 1 (everyone got the item right).

- The larger the item-difficulty index, the easier the item. Because p refers to
the percent of people passing an item, the higher the p for an item, the easier
the item.

- The statistic referred to as an item-difficulty index in the context of achievement testing may be an item-endorsement index in other contexts, such as personality testing.

- An index of the difficulty of the average test item for a particular test can be
calculated by averaging the item-difficulty indices for all the test’s items. This
is accomplished by summing the item-difficulty indices for all test items and
dividing by the total number of items on the test.

- For maximum discrimination among the abilities of the testtakers, the optimal
average item difficulty is approximately .5, with individual items on the test
ranging in difficulty from about .3 to .8.

- The possible effect of guessing must be taken into account when considering items of the selected-response variety. With this type of item, the optimal average item difficulty is usually the midpoint between 1.00 and the probability of answering correctly by random guessing; for example, for an item with four answer options the chance of a correct random guess is .25, so the optimal average difficulty is (1.00 + .25)/2 = .625. (see book)

● Item-Reliability Index: It provides an indication of the internal consistency of a test. The higher the index, the greater the internal consistency.

- This index is equal to the product of the item-score standard deviation (s) and
the correlation (r) between the item score and the total test score.
● Item-Validity Index: Provides an indication of the degree to which a test is
measuring what it purports to measure.

- The higher the item-validity index, the greater the test’s criterion-related
validity.

- The correlation between the score on item 1 and the score on the criterion measure (denoted r₁C) is multiplied by item 1’s item-score standard deviation (s₁), and the product is the item-validity index (s₁r₁C).

● Item-Discrimination Index (d): It indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test.

- It measures the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly.

- It compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.

- The higher the value of d, the greater the number of high scorers answering the item correctly.

- The highest possible value of d is +1.00 (all members of the upper region answered the item correctly, and all members of the lower region answered it incorrectly).

- When the value of d is 0, the item is not discriminating: the same proportion of the members of the upper and lower regions pass the item.

- The lowest possible value of d is -1.00 (a test developer’s nightmare: all members of the upper region answered the item incorrectly, and all members of the lower region answered it correctly).

NOTE: See book for better understanding of Item analysis.
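
The sketch below works through the four indices on a tiny made-up response matrix (0 = incorrect, 1 = correct) with a hypothetical criterion score for each testtaker. The “upper” and “lower” groups are taken here as the top and bottom thirds of total scores; other splits (e.g., top and bottom 27%) are also commonly used.

```python
# Minimal item-analysis sketch on invented data: rows = testtakers, columns = items.
from statistics import mean, pstdev
from math import sqrt

scores = [
    [1, 1, 1, 0],   # testtaker 1
    [1, 1, 0, 0],   # testtaker 2
    [1, 0, 1, 0],   # testtaker 3
    [1, 0, 0, 0],   # testtaker 4
    [0, 1, 0, 0],   # testtaker 5
    [0, 0, 0, 1],   # testtaker 6
]
criterion = [9, 8, 7, 5, 4, 2]               # hypothetical external criterion scores
totals = [sum(row) for row in scores]         # total test score per testtaker
n_items = len(scores[0])

def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

for j in range(n_items):
    item = [row[j] for row in scores]

    # Item-difficulty index p: proportion of testtakers answering the item correctly.
    p = mean(item)

    # Item-reliability index: item-score SD times item-total correlation.
    s = pstdev(item)
    reliability_index = s * pearson_r(item, totals)

    # Item-validity index: item-score SD times item-criterion correlation.
    validity_index = s * pearson_r(item, criterion)

    # Item-discrimination index d: proportion correct in the upper third of
    # total scorers minus proportion correct in the lower third.
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    lower, upper = order[:2], order[-2:]
    d = mean(item[i] for i in upper) - mean(item[i] for i in lower)

    print(f"item {j + 1}: p={p:.2f}  rel={reliability_index:.2f}  "
          f"val={validity_index:.2f}  d={d:.2f}")

# Average item difficulty for the whole test: mean of the item p values.
print("average item difficulty:",
      round(mean(mean(row[j] for row in scores) for j in range(n_items)), 2))
```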

Test Revision
● Cross-validation: the revalidation of a test on a sample of testtakers other than
those on whom test performance was originally found to be a valid predictor of some
criterion.

● Validity shrinkage: The items selected for the final version of the test (in part
because of their high correlations with a criterion measure) will have smaller item
validities when administered to a second sample of testtakers. This is so because of
the operation of chance. The decrease in item validities that inevitably occurs after
cross-validation of findings is referred to as validity shrinkage.

● Co-validation: A test validation process conducted on two or more tests using the
same sample of testtakers. When co-validation is used in conjunction with the
creation of norms or the revision of existing norms, this process may also be referred
to as co-norming.
