History and Theory of Reliability—Conceptualization of Error
• Measurement in psychology is more difficult than physical measurement of qualities such as height or length
• The difficulty comes from measuring complex traits that often cannot be directly observed but must be inferred, e.g., intelligence, motivation, self-concept, alexithymia, extraversion
  – Psychologists work with "rubber yardsticks"
  – Part of their task is to ask how much rubber is in their measurement tool
  – Measuring error is an important part of many sciences, not just psychology
• 1896: Karl Pearson develops the product moment correlation
• 1904: Charles Spearman publishes an article about reliability theory
• 1904: E. L. Thorndike publishes An Introduction to the Theory of Mental and Social Measurements
• Other experts developed different measures of reliability in the years that followed
Basics of Test Score Theory (1 of 3)
• Error reflects the difference between a true score and an observed score on some trait or quality
  – Observed score = True score + Error: X = T + E
• Classical test theory assumes that errors of measurement are random
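The X = T + E model above can be illustrated with a short simulation. This is a hypothetical sketch (the true score of 100 and error spread of 5 are assumed values, not from the text): one person is "retested" many times, each observed score being the true score plus random error.

```python
import random
import statistics

random.seed(0)
true_score = 100  # T: the person's (unobservable) true score
# X = T + E, with E drawn from a random error distribution
observed = [true_score + random.gauss(0, 5) for _ in range(10_000)]

# Because errors are random, they cancel out over many testings:
# the mean of the observed scores estimates the true score.
print(round(statistics.mean(observed), 1))
```

This is exactly the logic behind Figure 4.1: the center of the distribution of repeated observed scores estimates T, and the spread reflects E.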
Basics of Test Score Theory (2 of 3)
FIGURE 4.1 Distribution of observed scores for repeated testing of the same person. The mean of the distribution is the estimated true score, and the dispersion represents the distribution of random errors.
Basics of Test Score Theory (3 of 3)
• The notion of a "rubber yardstick" explains how distributions may be more or less varied (Figure 4.2)
• Standard error of measurement
FIGURE 4.2 Three distributions of observed scores. The far-left distribution reflects the greatest error, and the far-right distribution reflects the least.
• Using a limited number of measurements to describe a larger, more complex construct creates problems
• Measurements are taken from a smaller number of instances (a sample), and then a reliability estimate is conducted
  – How much error was created by not measuring all instances?
  – Compare sample variance to population variance
• Estimates are made about the scores of instances that are not directly measured
  – Different random samples can create different error
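The sample-versus-population variance comparison above can be made concrete. In this small sketch (the scores are hypothetical), the sample variance divides by N − 1 rather than N, which compensates for the error introduced by measuring only a sample of instances rather than the whole domain.

```python
import statistics

scores = [4, 7, 6, 5, 8, 6, 7, 5]  # hypothetical scores from a sample

pop_var = statistics.pvariance(scores)   # divides by N (treats data as the whole population)
samp_var = statistics.variance(scores)   # divides by N - 1 (estimates population variance from a sample)

# The sample estimate is slightly larger, reflecting sampling uncertainty.
print(pop_var, round(samp_var, 3))
```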
• Measures must have demonstrated reliability before they can be used to make employment or educational decisions
• Reliability estimates are correlations but can also be thought of as mathematical ratios
• Sources of error
  – Situational factors or characteristics of the test
• Test–retest reliability
• Parallel forms reliability
• Internal consistency reliability
• Compares scores on two different measures of the same quality
• Also called the equivalent forms method
• If the two forms are given on the same day, differences in scores can only reflect random error or differences between the two tests
  – A rigorous assessment of reliability
  – Generally underutilized
• The same test is administered to the same person at different points in time
  – Only useful when assessing traits that do not change over time (e.g., intelligence quotient)
  – Would not be useful for qualities that are known to change
• Involves calculating the correlation between the two measurements
• Carryover effects
  – Only an issue when changes over time are random
  – Include practice effects
• The interval between measurements must be considered
Split-Half Method
• A single test is split into two halves (e.g., odd versus even items), and the halves are correlated
  – Because each half contains only half as many items as the full test, the half-test correlation is an underestimate of true reliability
• For the sake of an example, assume the correlation (r) between the "odd" half and "even" half of the psychopathy scale is .360
• The Spearman–Brown formula corrects for the half length and increases the estimate of reliability
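The half-length correction can be sketched directly. The Spearman–Brown formula for estimating full-test reliability from a half-test correlation is 2r / (1 + r); applying it to the example value of .360:

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between two half-tests."""
    return (2 * r_half) / (1 + r_half)

corrected = spearman_brown(0.360)
print(round(corrected, 3))  # 0.529
```

Note that the corrected estimate (.529) is substantially higher than the raw half-test correlation (.360), which is the point of the correction.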
• Kuder and Richardson (1937) developed methods for estimating reliability with a single test administration
• The KR20 formula calculates reliability when items are scored dichotomously (e.g., right or wrong)
• Refer to your text for a thorough description of the KR20 formula and the details associated with its calculation
• Produces reliability estimates similar to those of the split-half method but is considered more useful than that approach
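As the slide notes, the text's appendix is the authoritative treatment; the following is only a hedged sketch of the standard KR-20 computation, (k / (k − 1)) × (1 − Σpq / σ²), applied to a hypothetical matrix of dichotomous item scores (rows are examinees, columns are items scored 1 = correct, 0 = incorrect).

```python
def kr20(item_scores):
    """KR-20 reliability for dichotomously scored items."""
    n_people = len(item_scores)
    n_items = len(item_scores[0])

    # Variance of total test scores (population form).
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum of p*q across items: p = proportion passing, q = 1 - p.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in item_scores) / n_people
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

data = [  # hypothetical item responses for five examinees on four items
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3))  # 0.8
```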
• Coefficient alpha (or Cronbach's alpha) assesses reliability for tests in which there is no single "correct" answer to every question (a limitation of the KR20)
  – Similar to the KR20 in other regards
• Like other measures of internal consistency, assesses whether the different items on a test all measure the same ability or trait
  – If they do not, internal consistency will be low
  – Factor analysis is more useful when such situations occur: it can divide the items into subgroups, and each factor may be internally consistent
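Coefficient alpha generalizes the KR-20 logic to items that need not be dichotomous, replacing Σpq with the sum of the item variances: α = (k / (k − 1)) × (1 − Σσᵢ² / σ²). A minimal sketch with hypothetical Likert-style responses (rows are examinees, columns are items):

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha for a matrix of item scores (people x items)."""
    n_items = len(item_scores[0])
    item_vars = [
        statistics.pvariance([person[j] for person in item_scores])
        for j in range(n_items)
    ]
    total_var = statistics.pvariance([sum(person) for person in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

data = [  # hypothetical responses of four people to three items
    [3, 4, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 5],
]
print(round(cronbach_alpha(data), 3))  # 0.9
```

When the items all track the same trait, item scores rise and fall together, total-score variance is large relative to the summed item variances, and alpha is high, as here.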
• Difference scores occur when one value is subtracted from another
• In a difference score, error makes up a larger portion of the score than in either original measure: the shared true-score component is partially subtracted out, while the errors of both measures remain, so the reliability of a difference score is lower
• Difference scores are most aptly calculated after converting each score to a Z score
• Because of their lower reliability, difference scores should not be used for interpreting patterns of scores
Reliability in Behavioral Observation Studies
• Observing behaviors is complicated
  – Samples are taken, since all behaviors cannot be monitored
• Errors occur when observations do not reflect true occurrences
• Interrater reliability may be used, in which a correlation between the observations of two different individuals is calculated
  – Can be achieved in a variety of ways
• Kappa statistic
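The kappa statistic adjusts the raters' raw agreement for the agreement expected by chance: κ = (pₒ − pₑ) / (1 − pₑ). A minimal sketch with hypothetical categorical judgments from two observers (the rating labels are invented for illustration):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed proportion of agreement.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category proportions.
    p_expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)

a = ["on-task", "on-task", "off-task", "on-task", "off-task", "on-task"]
b = ["on-task", "off-task", "off-task", "on-task", "off-task", "on-task"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Here the raters agree on 5 of 6 observations (83%), but chance alone would produce 50% agreement with these marginals, so kappa is a more modest .667.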
Connecting Sources of Error with Reliability Assessment Method
TABLE 4.2 Sources of Measurement Error and Methods of Reliability Assessment
• Time sampling
  – Example: same test given at two points in time
  – Method: test–retest
  – How assessed: correlation between scores obtained on the two occasions
• Item sampling
  – Example: different items used to assess the same attribute
  – Method: alternate forms or parallel forms
  – How assessed: correlation between equivalent forms of the test that have different items
• Internal consistency
  – Example: consistency of items within the same test
  – Methods: (1) split-half, (2) KR20, (3) alpha
  – How assessed: (1) corrected correlation between two halves of the test, (2) see Appendix 4.2, (3) see Appendix 4.1
• Observer differences
  – Example: different observers recording
  – Method: kappa statistic
  – How assessed: see Fleiss (1981)
Standard Errors of Measurement and the Rubber Yardstick
• The standard error of measurement allows us to estimate the degree to which a test provides inaccurate readings
  – An assessment of how much "rubber" is in the yardstick
  – Tests are not all equally accurate, nor are they all equally inaccurate
• A higher standard error of measurement means less certainty about the accuracy of a given measurement of an attribute
• Used to create confidence intervals—a range of likely scores within which the true score is believed to fall
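The standard error of measurement is computed from the test's standard deviation and reliability, SEM = SD × √(1 − r), and a 95% confidence interval is the observed score ± 1.96 × SEM. A sketch with hypothetical values (an IQ-style scale with SD = 15, reliability .90, and an observed score of 110):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: how much 'rubber' is in the yardstick."""
    return sd * math.sqrt(1 - reliability)

sd, reliability, observed = 15, 0.90, 110
error = sem(sd, reliability)                 # 15 * sqrt(0.10), about 4.74
low = observed - 1.96 * error
high = observed + 1.96 * error
print(round(error, 2), (round(low, 1), round(high, 1)))
```

Note how reliability drives the interval: at r = .90 the 95% interval spans roughly 19 points, while a perfectly reliable test (r = 1) would have SEM = 0 and no uncertainty at all.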
• The acceptable level of reliability depends largely on what is being measured
• .70 to .80 is seen as "good enough" for basic research
• Researchers often look for reliability of .90 or even .95
  – Some argue that such high levels are less valuable, as they indicate that all items on a test measure the same thing, and thus the test could be shortened
  – But these levels are more necessary in clinical settings, where test results influence treatment decisions
• There are several approaches a researcher can use to deal with low reliability
• Increase the number of items
  – Particularly effective in the domain sampling model
  – The Spearman–Brown formula can be used to calculate the number of items needed
• Factor and item analysis
  – Discriminability analysis
• Correction for attenuation
  – Estimating what a correlation would have been if the measurements had not included error
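Two of the remedies above have simple closed forms, sketched here with hypothetical numbers. The general Spearman–Brown prophecy formula, kr / (1 + (k − 1)r), predicts reliability when a test is lengthened by a factor k; the correction for attenuation, r₁₂ / √(r₁₁ r₂₂), estimates what the correlation between two measures would be if neither contained measurement error.

```python
import math

def lengthened_reliability(r, k):
    """Spearman-Brown prophecy: reliability of a test lengthened by factor k."""
    return (k * r) / (1 + (k - 1) * r)

def correct_for_attenuation(r12, r11, r22):
    """Correlation between two measures corrected for their unreliability."""
    return r12 / math.sqrt(r11 * r22)

# Doubling a test with reliability .60 (hypothetical value):
print(round(lengthened_reliability(0.60, 2), 3))       # 0.75
# Observed correlation .40 between tests with reliabilities .70 and .80:
print(round(correct_for_attenuation(0.40, 0.70, 0.80), 3))
```

The prophecy formula can also be solved for k to find how many items are needed to reach a target reliability, which is the use the slide describes.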