
5. Measuring and Reliability

Tests can be classified according to 3 criteria → content, administration, and scoring.

Reliable measures are dependable, consistent, and relatively free from unsystematic errors of
measurement.
Measurement = the assignment of numerals to objects or events according to rules → "How much?". The definition says nothing about the quality of the measurement procedure.

Psychological measurement: individual differences in psychological traits.

Scales of measurement: qualitative & quantitative.


 

Scale    | Operations                                                 | Formula
Nominal* | Equality                                                   | (a = b) or (a ≠ b), but not both
Ordinal* | Equality; Ranking (transitivity)                           | If [(a > b) and (b > c)], then (a > c); if [(a = b) and (b = c)], then (a = c)
Interval | Equality; Ranking; Equal-sized units (additivity)          | (d – a) = (c – a) + (d – c); X′ = a + bX, where X′ = transformed score, a & b = constants, X = original score
Ratio    | Equality; Ranking; Equal-sized units; True (absolute) zero | –
Psychological measures are mostly nominal- or ordinal-level scales*. Intelligence, aptitude (talent), and personality scales are ordinal-level measures (ranks rather than amounts). Yet, we can often
assume an equal-interval scale.

Physical measurements are evaluated in terms of the degree to which they satisfy the
requirements of order, equality, and addition.

Most important purpose of psychological measures is decision making.


In personnel selection → accept or reject;
In placement → which alternative course of action to pursue;
In diagnosis → which remedial treatment is called for;
In hypothesis testing → the accuracy of the theoretical formulation;
In evaluation → what score to assign to an individual or procedure.

HR specialists are confronted with the tasks of selecting and using psychological measurement
procedures, interpreting results, and communicating the results to others.
Test = any psychological measurement instrument, technique or procedure. Testing is systematic
in 3 areas: content, administration, and scoring.

Steps for selecting and creating new tests/measures:

1. Determining a measure’s purpose


2. Defining the attribute
3. Developing a measure plan
4. Writing items
5. Conducting a pilot study and a traditional item analysis (see the first sketch after this list)
- distractor analysis (evaluate multiple-choice items in terms of the frequency with which each incorrect choice is selected)
- item difficulty (evaluate how difficult it is to answer each item correctly)
- item discrimination (evaluate whether the response to a particular item is related to responses on the other items included in the measure).
6. Conducting an item analysis using Item Response Theory (IRT)
IRT explains how individual differences on a particular attribute affect the behavior of an individual when he or she is responding to an item. This specific relationship between the latent construct and the response to each item can be assessed graphically through an item-characteristic curve (Figure 6-1, p. 117; see the second sketch after this list). IRT can also be used to assess bias at the item level, because it allows a researcher to determine whether a given item is more difficult for examinees from one group than for examinees from another when they all have the same ability.
7. Selecting items
8. Determining reliability and gathering evidence for validity
9. Revising and updating items.
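
A minimal sketch of the traditional item statistics from step 5, assuming a small 0/1-scored response matrix (rows = examinees, columns = items); the data and variable names are illustrative only, not taken from the text.

```python
# Hypothetical item analysis: item difficulty and item discrimination.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])  # 5 examinees x 4 items, 1 = correct, 0 = incorrect

# Item difficulty: proportion of examinees answering each item correctly
# (a higher value means an easier item).
difficulty = responses.mean(axis=0)

# Item discrimination: correlation between each item and the total score on
# the remaining items (corrected item-total correlation).
total = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

print("difficulty:", difficulty)
print("discrimination:", discrimination)
```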
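For step 6, a hedged sketch of an item-characteristic curve under a two-parameter logistic IRT model; the discrimination (a) and difficulty (b) values are made up for illustration, not estimates from real data.

```python
# Hypothetical 2PL item-characteristic curve: probability of a correct
# response as a function of ability (the latent construct).
import numpy as np

def icc(theta, a, b):
    """Probability of a correct response given ability theta (2PL model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)      # range of ability levels
print(icc(theta, a=1.2, b=0.0))    # item of average difficulty
print(icc(theta, a=1.2, b=1.0))    # harder item: curve shifted to the right
```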

Methods of classifying tests:

CONTENT
* Task:       - Verbal
              - Non-verbal
              - Performance
* Process:    - Cognitive (tests)
              - Affective (inventories)

ADMINISTRATION
* Efficiency: - Individual
              - Group
* Time:       - Speed ('number checking')
              - Power (ample time to attempt all items)

STANDARDIZATION
* Standardized
* Non-standardized

SCORING
* Objective
* Nonobjective

In addition to content, administration, standardization, and scoring, several additional factors need to be considered in selecting a test →

* Cost
Direct costs → price of software or test booklets, answer sheets, etc.
Indirect costs → time to prepare the test materials, interviewer time, etc.
* Interpretation
Thorough awareness of the strengths and limitations of the measurement procedure, the background of the examinee, the situation, and the consequences for the examinee.
* Face validity
Whether the measurement procedure looks like it is measuring the trait in question.
 

Reliability and validity information should be gathered not only for newly created measures but also for any measure before it is put to use. Reliability is important because it makes a single measurement count, presenting the 'truest' picture of one's abilities or personal characteristics. Reliability = freedom from unsystematic errors of measurement. Errors reduce the reliability, and therefore the generalizability, of a person's score from a single measurement.
The correlation/reliability coefficient is a particularly appropriate measure of such agreement. It serves 2 purposes:
1) To estimate the precision of a particular procedure as a measuring instrument;
2) To estimate the consistency of performance on the procedure by the examinees.
! Purpose 2 includes purpose 1 → it is possible to have unreliable performance on a reliable test, but reliable performance on an unreliable test is impossible.
The reliability coefficient may be interpreted directly as the percentage of total variance attributable to different sources (coefficient of determination, r²).
X = T + e → X = observed (raw) score, T = true score (measurement-error-free), e = error.
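
Assuming the standard classical-test-theory assumption that true scores and errors are uncorrelated, the link between X = T + e and the reliability coefficient can be written as:

```latex
\sigma_X^2 = \sigma_T^2 + \sigma_e^2
\qquad\Rightarrow\qquad
r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_e^2}{\sigma_X^2}
```

So a reliability of .90 means 90% of the observed-score variance is attributable to true-score differences and 10% to unsystematic error.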
 

* Test-retest
Coefficient of stability. Error sources: administration (e.g., lighting, loud noises) or personal (e.g., mood).
TEST/FORM A --------- RETEST/FORM A (TIME > 0)
 

* Parallel (Alternate) Forms
Coefficient of equivalence.
* Random assignment → creating a large pool of items with the only requirement being that they tap the same domain;
* Incident-isomorphism → changing surface characteristics of items that do not determine item difficulty, while leaving unchanged the structural features that do determine difficulty;
* Item-isomorphism → creating pairs of items.
FORM A --------- FORM B (TIME = 0)
 

* Internal Consistency (multiple items measuring one characteristic)
* Kuder-Richardson reliability estimates; coefficient alpha.
* Split-half reliability estimates: select the items randomly for the two halves (see the sketch below).
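
A hedged sketch of both internal-consistency estimates on the same illustrative data; the item values, the random split, and the Spearman-Brown step-up of the half-length correlation are assumptions for the example, not prescriptions from the text.

```python
# Hypothetical internal-consistency estimates for a small response matrix
# (rows = examinees, columns = items).
import numpy as np

scores = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
], dtype=float)

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total).
k = scores.shape[1]
item_var = scores.var(axis=0, ddof=1).sum()
total_var = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_var / total_var)

# Split-half: correlate two randomly selected halves, then step the half-test
# correlation up to full test length (Spearman-Brown).
rng = np.random.default_rng(0)
order = rng.permutation(k)
half1 = scores[:, order[: k // 2]].sum(axis=1)
half2 = scores[:, order[k // 2 :]].sum(axis=1)
r_half = np.corrcoef(half1, half2)[0, 1]
r_full = (2 * r_half) / (1 + r_half)

print("coefficient alpha:", round(alpha, 3))
print("split-half (Spearman-Brown):", round(r_full, 3))
```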
 

* Stability and Equivalence
Coefficient of stability and equivalence. Combination of the test-retest and equivalence methods.
FORM A -------------------- FORM B (TIME > 0)
3 types of errors → random-response errors, specific-factor errors, transient errors.
- The coefficient of equivalence assesses the magnitude of measurement error produced by specific-factor and random-response error, but not transient error;
- The coefficient of stability and equivalence assesses the impact of all three types of errors.
 

* Interrater Reliability
Can be estimated using 3 methods:
1. Interrater agreement → % of rater agreement and Cohen's kappa (see the sketch below)
2. Interclass correlation → when 2 raters are rating multiple objects/individuals
3. Intraclass correlation → how much of the differences among raters is due to differences in individuals and how much is due to errors of measurement.
Interrater reliability is not a 'real' reliability coefficient, because it provides no information about the measurement procedure itself.
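
A minimal sketch of method 1 (interrater agreement), assuming two raters classifying the same candidates into made-up categories; the data are illustrative only.

```python
# Hypothetical interrater agreement: percent agreement and Cohen's kappa.
import numpy as np

rater_a = np.array(["hire", "hire", "reject", "hire", "reject", "hire"])
rater_b = np.array(["hire", "reject", "reject", "hire", "reject", "hire"])

# Percentage of rater agreement.
p_observed = np.mean(rater_a == rater_b)

# Cohen's kappa: observed agreement corrected for chance agreement.
categories = np.unique(np.concatenate([rater_a, rater_b]))
p_expected = sum(
    np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories
)
kappa = (p_observed - p_expected) / (1 - p_expected)

print("percent agreement:", round(p_observed, 3))
print("Cohen's kappa:", round(kappa, 3))
```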

If a procedure is to be used to compare one individual with another, reliability should be above .90.
While the accuracy of measurement may remain unchanged, the size of the reliability estimate will vary with the range of individual differences in the group → as the variability of the scores increases (decreases), the correlation between them also increases (decreases).
The sample must be large and representative (Figure 6-10, p. 131).
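
A small simulation (illustrative numbers only) of the point above: the measurement procedure itself is unchanged, but restricting the variability of the group lowers the observed reliability estimate.

```python
# Hypothetical range-restriction demo: the same test, two groups differing
# only in how variable their scores are.
import numpy as np

rng = np.random.default_rng(42)
true = rng.normal(50, 10, size=5000)               # true scores, full-range group
form_a = true + rng.normal(0, 5, size=true.size)   # two measurements of the same
form_b = true + rng.normal(0, 5, size=true.size)   # people, each with random error

full_r = np.corrcoef(form_a, form_b)[0, 1]

# Restrict the range: keep only people scoring above average on form A.
mask = form_a > 50
restricted_r = np.corrcoef(form_a[mask], form_b[mask])[0, 1]

print("full-range reliability estimate:", round(full_r, 2))   # roughly .80 here
print("restricted-range estimate:", round(restricted_r, 2))   # noticeably lower
```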

Standard error of measurement = a statistic expressed in test-score (standard deviation) units, but derived directly from the reliability coefficient. Useful because it enables us to talk about an individual's true and error scores (Figure 6-11, p. 131).
Useful in 3 ways, to determine whether:
1. the measures describing individuals differ significantly;
2. an individual measure is significantly different from some hypothetical true score;
3. a test discriminates differently in different groups.
A final advantage is that it forces us to think of test scores not as exact points, but rather as bands or ranges of scores.
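
A hedged sketch of the standard error of measurement and the score band it implies, assuming an observed-score SD of 10 and a reliability of .90 (illustrative values only).

```python
# Hypothetical SEM calculation: SEM = SD * sqrt(1 - reliability).
import math

sd_x = 10.0    # standard deviation of observed test scores
r_xx = 0.90    # reliability coefficient

sem = sd_x * math.sqrt(1 - r_xx)   # about 3.16 score units

# An approximate 95% band around an observed score of 48: the score is better
# read as a range than as an exact point.
observed = 48
band = (observed - 1.96 * sem, observed + 1.96 * sem)
print("SEM:", round(sem, 2))
print("95% band:", tuple(round(x, 1) for x in band))
```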

Scale coarseness: regardless of whether the scale includes one or multiple items, information is lost due to scale coarseness, and two individuals with true scores of 4.4 and 3.6 will appear to have an identical score of 4.0. Scales built from Likert-type and ordinal items are coarse. In contrast to the effects of measurement error, the error caused by scale coarseness is systematic and the same for each item. Effect → the relationship between constructs appears weaker than it actually is. Solutions →
1. Use a continuous graphic-rating scale instead of Likert-type scales.
2. Use a statistical correction procedure after data are collected.
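
A tiny illustration of the coarseness effect using the true scores from the text (4.4 and 3.6); the rounding step stands in for forcing responses onto a coarse 5-point Likert-type scale.

```python
# Two people with different true standings end up with identical coarse scores.
true_scores = [4.4, 3.6]
likert_scores = [round(t) for t in true_scores]   # both become 4
print(likert_scores)   # [4, 4] -> the difference between the two people is lost
```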

Generalizability theory defines the reliability of a test score as the precision with which that score, or sample, represents a more generalized universe value of the score.
An examinee’s universe score is defined as the expected value of his or her observed
scores over all admissible observations.
The use of generalizability theory involves conducting two types of research studies: a
generalizability (G) study and a decision (D) study. A test has not one generalizability
coefficient, but many. The application of generalizability theory revealed that subordinate
ratings were of significantly better quality when made for developmental rather than
administrative purposes, but the same was not true for peer ratings.

A raw score ("48") is meaningless because psychological measurement is relative rather than absolute → raw scores need to be compared with a norm group:
- percentile ranks (ordinal)
- standard scores/z-scores (interval) → disadvantage = decimals and negative numbers, so transform to T-scores (Figure 6-12, p. 138).
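
A minimal sketch of the raw score → z-score → T-score conversion, assuming a hypothetical norm-group mean of 60 and SD of 8.

```python
# Hypothetical norm-referenced conversion: z-score, then T-score (mean 50, SD 10)
# so that decimals and negative values disappear.
raw = 48
norm_mean, norm_sd = 60.0, 8.0

z = (raw - norm_mean) / norm_sd   # z = -1.5: 1.5 SDs below the norm group
t_score = 50 + 10 * z             # T = 35: same standing, friendlier metric
print(z, t_score)
```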
