
Psychological Testing

Ninth Edition

Chapter 3
Reliability

© 2019 Cengage. All rights reserved.


Consider for a moment…



Reliability vs. Validity



History and Theory of Reliability—
Conceptualization of Error
• Measurement in psychology is more difficult than
physical measurement of attributes such as height or length
• The difficulty comes from measuring complex traits that
often cannot be directly observed but must be inferred,
e.g., intelligence, motivation, self-concept, alexithymia,
extraversion
– Psychologists measure with “rubber yardsticks”
– Part of their task is to ask how much rubber is in the
measurement tool
– Measuring error is an important part of many sciences,
not just psychology



Early Studies

• 1733: Abraham De Moivre—sampling error
• 1896: Karl Pearson—product moment correlation
• 1904: Charles Spearman—publishes article about
reliability theory
• 1904: E. L. Thorndike—publishes An Introduction to the
Theory of Mental and Social Measurements
• Other experts developed different measures of
reliability in the years that followed



Basics of Test Score Theory (1 of 3)

• RECALL our earlier discussion:
• Error reflects the difference between a true score and
an observed score on some trait or quality
– Observed score = True score + Error
X = T + E
• Classical test theory assumes that errors of
measurement are random
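
A minimal sketch of this model (hypothetical numbers, not from the text): simulate repeated testing of one person whose true score T is fixed while the random error E varies.

```python
import random

# Classical test theory: X = T + E, with E random around zero
true_score = 100                             # T: fixed for this person
observed = [true_score + random.gauss(0, 5)  # E ~ Normal(0, 5)
            for _ in range(1000)]            # 1,000 repeated testings

print(sum(observed) / len(observed))  # mean approaches T as errors cancel
```

The mean of the simulated distribution estimates the true score, exactly as Figure 4.1 on the next slide describes.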



Basics of Test Score Theory (2 of 3)
FIGURE 4.1 Distribution of observed scores for
repeated testing of the same person. The mean of
the distribution is the estimated true score, and the
dispersion represents the distribution of random
errors.



Basics of Test Score Theory (3 of 3)
• The notion of a “rubber yardstick” explains how
distributions may be more or less varied (Figure 4.2)
• Standard error of measurement
FIGURE 4.2 Three distributions of observed scores. The far-
left distribution reflects the greatest error, and the far-right
distribution reflects the least.



The Domain Sampling Model

• Using a limited number of measurements to describe a
larger, more complex construct creates problems
• Measurements are taken from a smaller number of
instances (a sample), and a reliability estimate is then
computed
– How much error was created by not measuring all
instances?
– Compare sample variance to population variance
• Estimates are made about the scores of instances that
are not directly measured
– Different random samples can create different error



Models of Reliability—Sources of Error

• Measures must have demonstrated reliability before
they can be used to make employment or educational
decisions
• Reliability estimates are correlations but can also be
thought of as mathematical ratios—the ratio of true-score
variance to observed-score variance
• Sources of error
– Situational factors or characteristics of the test
• Test–retest reliability
• Parallel forms reliability
• Internal consistency reliability



Item Sampling—Parallel Forms Method

• Compares scores on two different measures of the
same quality
• Also called the equivalent forms method
• If the two forms are given on the same day, differences
in scores can only reflect random error or differences
between the two tests
– A rigorous assessment of reliability
– Generally underutilized



Time Sampling—The Test–Retest Method

• The same test is administered to the same person at
different points in time
– Only useful when assessing traits that do not change over
time (e.g., intelligence quotient)
– Would not be useful for qualities that are known to change
• Involves calculating the correlation between the two
measurements (see the sketch below)
• Carryover effects
– Only an issue when the changes over time are random;
a carryover effect that boosts everyone equally leaves
the correlation unchanged
– Include practice effects
• The interval between measurements must be considered
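
A minimal sketch of the calculation (hypothetical scores): the test–retest coefficient is just the Pearson correlation between the two administrations.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for six people tested on two occasions
time1 = [12, 15, 11, 18, 14, 16]
time2 = [13, 14, 12, 19, 15, 17]

print(round(correlation(time1, time2), 3))  # high r = stable scores
```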



Split-Half Method

• One test is split into two equal halves
• Each half is compared to the other
– Can be split randomly, first/second halves, or odd/even




Split-Half Method

• Because each half uses fewer items, the half-test
correlation is an underestimate of the true reliability
of the full-length test
• For the sake of an example, assume the correlation (r)
between the “odd” half and the “even” half of the
psychopathy scale is .360
• The Spearman–Brown formula corrects for half length
and increases the estimate of reliability:
corrected r = 2r / (1 + r)
• In our example, corrected r = 2(.360) / (1 + .360)
= 0.72/1.360 = .529
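
A minimal sketch of the correction (the function name is mine):

```python
def spearman_brown(half_r: float) -> float:
    """Correct a split-half correlation to full-test length: 2r / (1 + r)."""
    return 2 * half_r / (1 + half_r)

print(round(spearman_brown(0.360), 3))  # 0.529, matching the example above
```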



KR20 Formula

• Kuder and Richardson (1937)—methods for estimating
reliability with a single test administration
• Calculates reliability when items are scored
dichotomously (e.g., right/wrong)
• Refer to your text for a thorough description of the KR20
formula and the details associated with its calculation
• Produces reliability estimates comparable to the split-half
method but is more valuable because it does not depend
on any single, arbitrary split of the items
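
The standard formula is KR20 = [k/(k − 1)] × (1 − Σpq / σ²), where k is the number of items, p and q are the proportions passing and failing each item, and σ² is the variance of total scores. A minimal sketch with hypothetical data (check it against the formula in your text):

```python
def kr20(item_scores: list[list[int]]) -> float:
    """KR20 for dichotomous (0/1) items; rows = people, columns = items."""
    k = len(item_scores[0])                 # number of items
    n = len(item_scores)                    # number of people
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_scores) / n  # proportion passing
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical right/wrong responses: five people, four items
data = [[1, 1, 0, 1],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 0]]
print(round(kr20(data), 3))  # about 0.519 for these made-up data
```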





Coefficient Alpha

• Useful for tests on which there is no single “correct”
answer to every question (a limitation of the KR20)
• Coefficient alpha (or Cronbach’s alpha) assesses
reliability in such cases
– Similar to the KR20 in other regards
• Like other measures of reliability, it assesses whether
different items on a test all measure the same ability or
trait (see the sketch below)
– If they do not, internal consistency will be low
– Factor analysis is more useful when such situations
occur, though each factor may be internally consistent
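
Coefficient alpha generalizes the KR20 by replacing Σpq with the sum of the item variances: α = [k/(k − 1)] × (1 − Σσ²ᵢ / σ²). A minimal sketch with hypothetical rating-scale data:

```python
def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Coefficient alpha; rows = people, columns = items (any numeric scale)."""
    k = len(item_scores[0])

    def variance(values: list[float]) -> float:
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_vars = [variance([person[i] for person in item_scores])
                 for i in range(k)]
    total_var = variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-5 ratings: five people, three items
ratings = [[4, 5, 4],
           [2, 3, 2],
           [5, 5, 4],
           [3, 3, 3],
           [1, 2, 2]]
print(round(cronbach_alpha(ratings), 3))  # about 0.96 for these made-up data
```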





Reliability of a Difference Score

• Difference scores occur when one value is subtracted
from another
• In a difference score, error makes up a larger proportion
of the score than it does in either of the two scores it is
computed from, so the reliability of a difference score
is lower (see the sketch below)
• Difference scores are most aptly calculated after
converting each score to a Z score
• Because of their lower reliability, difference scores
should not be used for interpreting patterns
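
For standardized scores, the usual formula is r = [½(r₁₁ + r₂₂) − r₁₂] / (1 − r₁₂), where r₁₁ and r₂₂ are the reliabilities of the two measures and r₁₂ is the correlation between them. A minimal sketch:

```python
def difference_score_reliability(r11: float, r22: float, r12: float) -> float:
    """Reliability of the difference between two standardized (Z) scores."""
    return ((r11 + r22) / 2 - r12) / (1 - r12)

# Two highly reliable tests (.90 each) that correlate .70 with each other
print(round(difference_score_reliability(0.90, 0.90, 0.70), 3))  # 0.667
```

Even with two very reliable tests, the difference score is noticeably less reliable, which is why patterns of differences should not be over-interpreted.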



Reliability in Behavioral Observation
Studies
• Observing behaviors is complicated
– Samples are taken, since all behaviors cannot be
monitored
• Errors occur when observations do not reflect true
occurrences
• Interrater reliability may be used: the correlation
between the observations of two different observers
is calculated
– Can be achieved in a variety of ways
• Kappa statistic (see the sketch below)
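
Cohen’s kappa corrects the raw agreement rate for agreement expected by chance: κ = (pₒ − pₑ) / (1 − pₑ). A minimal sketch for two observers rating the same cases (hypothetical data; Table 4.2 cites Fleiss, 1981, for fuller treatment):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)
    return (p_observed - p_expected) / (1 - p_expected)

a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
b = ["on-task", "off-task", "on-task", "off-task", "off-task", "on-task"]
print(round(cohens_kappa(a, b), 3))  # 0.667 for these made-up observations
```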



Connecting Sources of Error with
Reliability Assessment Method
TABLE 4.2 Sources of Measurement Error and Methods of Reliability Assessment

Source of error | Example | Method | How assessed
Time sampling | Same test given at two points in time | Test–retest | Correlation between scores obtained on the two occasions
Item sampling | Different items used to assess the same attribute | Alternate forms or parallel forms | Correlation between equivalent forms of the test that have different items
Internal consistency | Consistency of items within the same test | 1. Split-half; 2. KR20; 3. Alpha | 1. Corrected correlation between two halves of the test; 2. See Appendix 4.2; 3. See Appendix 4.1
Observer differences | Different observers recording | Kappa statistic | See Fleiss (1981)



Standard Errors of Measurement and the
Rubber Yardstick
• The standard error of measurement allows us to estimate
the degree to which a test provides inaccurate readings
– An assessment of how much “rubber” is in the yardstick
– Tests are not all equally accurate, nor are they all
equally inaccurate
• A higher standard error of measurement means less
certainty about the accuracy of a given measurement
of an attribute
• Used to create confidence intervals—a range of likely
scores within which the correct value is believed to fall
(see the sketch below)
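
The usual computation is SEM = S × √(1 − r), where S is the standard deviation of test scores and r is the reliability coefficient; a 95% confidence interval is then X ± 1.96 × SEM. A minimal sketch with hypothetical values:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical IQ-style test: SD = 15, reliability = .89
error = sem(15, 0.89)                       # about 4.97 points of "rubber"
observed = 106
low, high = observed - 1.96 * error, observed + 1.96 * error
print(f"95% CI: {low:.1f} to {high:.1f}")   # roughly 96.2 to 115.8
```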



How Reliable Is Reliable? (1 of 2)

• The acceptable level of reliability depends largely on
what is being measured
• .70 to .80 is seen as “good enough” for basic research
• Researchers often look for reliability of .90 or even .95
– Some argue that such high levels are less valuable,
as they indicate that all items on a test measure the
same thing, and thus the test could be shortened
– But these levels are necessary in clinical settings,
where scores influence treatment decisions



What to Do about Low Reliability

• There are several approaches a researcher can use to
deal with low reliability
• Increase the number of items
– Particularly effective in the domain sampling model
– The Spearman–Brown prophecy formula can be used to
calculate the number of items needed (see the sketch
after this list)
• Factor and item analysis
– Discriminability analysis
• Correction for attenuation
– Estimating what a correlation would have been if the
measurements had not included error
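
Two of these fixes have standard formulas. The prophecy form of Spearman–Brown gives the factor n by which a test must be lengthened to reach a desired reliability, n = r_d(1 − r_o) / [r_o(1 − r_d)], and the correction for attenuation is r̂ = r₁₂ / √(r₁₁ × r₂₂). A minimal sketch with hypothetical values:

```python
import math

def lengthening_factor(r_observed: float, r_desired: float) -> float:
    """How many times longer a test must be to reach r_desired."""
    return r_desired * (1 - r_observed) / (r_observed * (1 - r_desired))

def correct_for_attenuation(r12: float, r11: float, r22: float) -> float:
    """Estimated correlation between two measures if both were error-free."""
    return r12 / math.sqrt(r11 * r22)

# A 20-item test with reliability .70: how many items to reach .90?
print(round(20 * lengthening_factor(0.70, 0.90), 1))        # about 77.1

# Observed r = .30 between tests with reliabilities .70 and .80
print(round(correct_for_attenuation(0.30, 0.70, 0.80), 3))  # about 0.401
```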

