
CHAPTER 5

RELIABILITY
“Don’t judge each day by the harvest you
reap but by the seeds that you plant”
-Robert Louis Stevenson
LEARNING OBJECTIVES

When you have completed this chapter, you should be able to:
• Describe the role of measurement error in scientific studies of
behavior
• Know that reliability is the ratio of true variability to observed
variability and explain what this tells us about a test with a
reliability of .30, .60, or .90
• Describe how test-retest reliability is assessed
• Explain the difference between test-retest reliability estimates
and split-half reliability estimates
• Discuss how the split-half method underestimates the
reliability of a short test and what can be done to correct
this problem
• Know the easiest way to find average reliability
• Define coefficient alpha and tell how it differs from other
methods of estimating reliability
• Discuss how high a reliability coefficient must be before you
would be willing to say the test is “reliable enough”
• Explain what can be done to increase the reliability of a test
• Tell how the reliability of behavioral observations is assessed
INTRODUCTION

• Discrepancies between true ability and measurement of ability constitute errors of measurement.
• Error implies that there will always be some inaccuracy in
our measurements.
• Our task is to find the magnitude of such errors and to
develop ways to minimize them.
• This chapter discusses the conceptualization and assessment
of measurement error.
• Tests that are relatively free of measurement error are
deemed to be reliable.
HISTORY AND THEORY OF RELIABILITY
Conceptualization of Error

• Psychological tests usually pursue complex traits, such as intelligence or aggressiveness, that cannot be observed directly and must be estimated.
Spearman’s Early Studies

• In 1733, Abraham De Moivre introduced the basic notion of sampling error (Stanley, 1971); and in 1896, Karl Pearson developed the product moment correlation.
• A contemporary of Pearson, Spearman actually worked
out most of the basics of contemporary reliability theory
and published his work in a 1904 article entitled “The
Proof and Measurement of Association between Two
Things”
• Spearman published his work in the American
Journal of Psychology
• The article came to the attention of measurement
pioneer Edward L. Thorndike
• Since 1904, many developments on both sides of
the Atlantic Ocean have led to further refinements
in the assessment of reliability.
• Most important among these is a 1937 article by
Kuder and Richardson, in which several new
reliability coefficients were introduced.
• Later, Cronbach and his colleagues (Cronbach,
1989, 1995) made a major advance by developing
methods for evaluating many sources of error in
behavioral research
Basics of Test Score Theory
CLASSICAL TEST SCORE THEORY

• assumes that each person has a true score that would be obtained if there were no errors in measurement
• However, because measuring instruments are
imperfect, the score observed for each person almost
always differs from the person’s true ability or
characteristic
X = T + E, where X is the observed score, T is the true score, and E is the error of measurement.
• Basic Sampling Theory tells us that the distribution of
random errors is bell-shaped.
• Classical test theory (CTT) assumes that the true score for an individual will not change with repeated applications of the same test.
• Theoretically, the standard deviation of the distribution
of errors for each person tells us about the magnitude
of measurement error
• Standard error of measurement (σmeas) - an index of how much, on average, an observed score varies from the true score.
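A minimal Python sketch (not from the textbook) of the X = T + E model: repeated testing of one person is simulated around a fixed true score, and the standard deviation of the simulated errors recovers the standard error of measurement. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 100.0   # T: the person's fixed true score
sem = 5.0            # assumed standard error of measurement

# X = T + E, with E drawn from a bell-shaped (normal) error distribution,
# as basic sampling theory suggests.
errors = rng.normal(loc=0.0, scale=sem, size=10_000)
observed = true_score + errors

print(round(observed.mean(), 2))  # near 100: errors average out
print(round(observed.std(), 2))   # near 5: the SEM is recovered
```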
THE DOMAIN SAMPLING MODEL

• This model considers the problems created by using a limited number of items to represent a larger and more complicated construct.
• This model conceptualizes reliability as the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score.
• The measurement error considered in the domain sampling model is the error introduced by using a sample of items rather than the entire domain.
• Reliability can be estimated from the correlation of
the observed test score with the true score.
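A simulation sketch (illustrative values only) of reliability as the ratio of true variability to observed variability; it also shows that this ratio equals the squared correlation between observed and true scores, which is why reliability can be estimated from that correlation.

```python
import numpy as np

rng = np.random.default_rng(42)

# True scores for many people, plus random measurement error: X = T + E.
true_scores = rng.normal(50, 10, size=100_000)
observed = true_scores + rng.normal(0, 5, size=100_000)

reliability = true_scores.var() / observed.var()  # true / observed variance
r_xt = np.corrcoef(observed, true_scores)[0, 1]   # correlation of X with T

print(round(reliability, 3), round(r_xt ** 2, 3))  # nearly identical values
```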
ITEM RESPONSE THEORY

• With IRT, the computer is used to focus on the range of item difficulty that helps assess an individual’s ability level.
• Difficulties with applications of IRT:
i. The method requires a bank of items that have
been systematically evaluated for level of difficulty
ii. Considerable effort must go into test development
iii. Complex computer software is required
Sources of Error Variance
• Test construction
– item sampling or content sampling, terms that refer to variation
among items within a test as well as to variation among items
between tests.
• Test administration
– Sources of error variance that occur during test administration
may influence the test taker’s attention or motivation
• Test taker variables
• Examiner-related variables
• Test scoring and interpretation
• Other sources of error
TYPES OF RELIABILITY
I. The Test-Retest Reliability

• is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test
• appropriate when evaluating the reliability of a test that
purports to measure something that is relatively stable
over time
• Just administer the same test on two well-specified
occasions and then find the correlation between scores
from the two administrations using the appropriate
statistical tools.
• When the interval between testing is greater than six
months, the estimate of test-retest reliability is often
referred to as the coefficient of stability.
• Carryover effect occurs when the first testing session
influences scores from the second session.
• Random carryover effects occur when the changes
are not predictable from earlier scores or when
something affects some but not all test takers.
• Practice effects - some skills improve with practice
• Because of these problems, the time interval between
testing sessions must be selected and evaluated
carefully.
• When you find a test-retest correlation in a test
manual, you should pay careful attention to the interval
between the two testing sessions.
• One of the problems with classical test theory is that it assumes that behavioral dispositions are constant over time; in CTT, any variation in scores across repeated testing is treated as error.
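A sketch of the test-retest procedure described above, assuming scores from the same six people on two occasions (data invented for illustration):

```python
import numpy as np

# Scores for the same six people on two administrations of the same test.
time1 = np.array([12, 15, 9, 20, 14, 17])
time2 = np.array([13, 14, 10, 19, 15, 18])

# The test-retest reliability estimate is the Pearson correlation
# between the two administrations.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 3))
```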
II. Parallel Forms & Alternate-Forms Reliability Estimates

• The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
• Parallel forms reliability/equivalent forms reliability
compares two equivalent forms of a test that
measure the same attribute.
• Obtaining estimates of alternate-forms reliability
and parallel-forms reliability is similar in two ways
to obtaining an estimate of test-retest reliability:
i. Two test administrations with the same group
are required
ii. Test scores may be affected by factors such as
motivation, fatigue, or intervening events such
as practice, learning, or therapy
III. Internal consistency Reliability

Kuder-Richardson 20 (KR20)


• The formula for calculating the reliability of a test in which the items are dichotomous (scored right or wrong)
• Developed by G. Frederic Kuder and M. W. Richardson (1937)
• The formula is: KR20 = [k / (k − 1)] × [1 − (Σpq / s²)], where k is the number of items, p is the proportion of test takers passing each item, q = 1 − p, and s² is the variance of the total test scores
Coefficient Alpha or Cronbach Alpha
• Developed by Cronbach (1951)
• Typically ranges in value from 0 to 1
• Cronbach developed a formula that estimates the
internal consistency of tests in which the items are
not scored as 0 or 1 (right or wrong)
• The formula for coefficient alpha is: α = [k / (k − 1)] × [1 − (Σs_i² / s²)], where k is the number of items, s_i² is the variance of each item, and s² is the variance of the total test scores
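A sketch of the coefficient alpha computation, using an invented persons-by-items score matrix; with dichotomous (0/1) items the same calculation reduces to KR20, because each item variance is then pq.

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """items: 2-D array with rows = persons and columns = items.

    alpha = [k / (k - 1)] * [1 - (sum of item variances / total-score variance)].
    For 0/1 items each item variance equals p*q, so this also computes KR20.
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five people answering four dichotomous items (illustrative data).
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(round(coefficient_alpha(scores), 3))  # 0.8 for this matrix
```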
IV. Split-Half Reliability

• correlating two pairs of scores obtained from equivalent halves of a single test administered once.
• Ways to split a test
1. Randomly assign items to one or the other half of the
test
2. Assign odd-numbered items to one half of the test
and even-numbered items to the other half - odd-
even reliability
• The Spearman-Brown formula allows a test
developer or user to estimate internal consistency
reliability from a correlation of two halves of a test.
• The Spearman-Brown (rSB) formula is: rSB = 2r / (1 + r), where r is the correlation between the two halves of the test
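A sketch of odd-even split-half reliability with the Spearman-Brown correction, assuming the half-test totals have already been computed (data invented):

```python
import numpy as np

# Total scores on the odd- and even-numbered halves for eight people.
odd_half = np.array([4, 3, 5, 2, 4, 1, 5, 3])
even_half = np.array([5, 3, 4, 2, 3, 2, 5, 4])

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # correlation between halves
r_sb = 2 * r_half / (1 + r_half)                 # Spearman-Brown corrected estimate

# The corrected value is higher because the half-test correlation
# understates the reliability of the full-length test.
print(round(r_half, 3), round(r_sb, 3))
```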
V. Inter-scorer Reliability

• is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
• also referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability
• Kappa Statistic
Reliability in Behavioral Observation Studies
• The kappa statistic is the best method for assessing
the level of agreement among several observers.
• The kappa statistic was introduced by J. Cohen
(1960) as a measure of agreement between two
judges who each rate a set of objects using
nominal scales.
• Fleiss (1971) extended the method to consider the agreement between any number of observers.
• Kappa indicates the actual agreement as a
proportion of the potential agreement following
correction for chance agreement.
• Values of kappa may vary between 1 (perfect
agreement) and −1 (less agreement than can be
expected on the basis of chance alone).
• A value greater than .75 generally indicates
“excellent” agreement, a value between .40
and .75 indicates “fair to good” (“satisfactory”)
agreement, and a value less than .40 indicates
“poor” agreement (Fleiss, 1981).
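A from-scratch sketch of Cohen’s kappa for two raters assigning nominal categories (ratings invented): it computes actual agreement, chance agreement from the raters’ marginal proportions, and their kappa ratio.

```python
import numpy as np

# Two raters classifying the same ten cases into categories 0, 1, or 2.
rater1 = np.array([0, 1, 2, 1, 0, 2, 1, 1, 0, 2])
rater2 = np.array([0, 1, 2, 0, 0, 2, 1, 2, 0, 2])

p_observed = np.mean(rater1 == rater2)  # proportion of actual agreement

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over all categories.
categories = np.unique(np.concatenate([rater1, rater2]))
p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 3))  # about .71: "fair to good" by the Fleiss guidelines
```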
Using and Interpreting a Coefficient of Reliability
• The purpose of the Reliability Coefficient
• The Nature of the Test
– Homogeneity versus heterogeneity of test items
– Dynamic versus static characteristics
• Alternatives to the True Score Model
– Generalizability theory - Developed by Lee J. Cronbach
(1970) and his colleagues (Cronbach et al., 1972)
– based on the idea that a person’s test scores vary from
testing to testing because of variables in the testing situation.
– universe score
Reliability and Individual Scores

The Standard Error of Measurement


• Provides a measure of the precision of an observed test score.
• The relationship between the SEM and the reliability of a test is inverse: the higher the reliability, the smaller the SEM.
• Is the tool used to estimate or infer the extent to which an
observed score deviates from a true score.
• the standard error of measurement is most frequently
used in the interpretation of individual test scores
• Useful in establishing a confidence interval: a
range or band of test scores that is likely to contain
the true score.
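A sketch of both uses: computing the SEM from a test’s standard deviation and reliability (SEM = SD × √(1 − reliability)), then placing a 95% confidence interval around one observed score. All values are illustrative.

```python
import math

sd = 15.0           # standard deviation of the test's scores (illustrative)
reliability = 0.90  # reliability coefficient (illustrative)

# Higher reliability -> smaller SEM (the inverse relationship noted above).
sem = sd * math.sqrt(1 - reliability)

observed = 110.0    # one person's observed score
z = 1.96            # z value for a 95% confidence interval

low, high = observed - z * sem, observed + z * sem
print(round(sem, 2))                  # about 4.74
print(round(low, 1), round(high, 1))  # about 100.7 to 119.3
```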
How Reliable Is Reliable?
• .70 to .80 – good enough for most purposes in basic research
• greater than .95 – desirable for tests used to make important decisions about individuals
What to Do About Low Reliability
• Increase the Number of Items
– domain sampling model
– Spearman-Brown Prophecy Formula: estimated r = nr / [1 + (n − 1)r], where r is the current reliability and n is the factor by which the test is lengthened (see the sketch after this list)
• Factor and Item Analysis
– discriminability analysis
• Correction for Attenuation
– a way to estimate what the correlation between two measures would be if it were not weakened (attenuated) by measurement error
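A sketch of the two corrections above: the Spearman-Brown prophecy formula for a test lengthened by a factor n, and the correction for attenuation given the reliabilities of both measures (all values invented):

```python
import math

def prophecy(r: float, n: float) -> float:
    """Estimated reliability if the test is lengthened by a factor of n."""
    return n * r / (1 + (n - 1) * r)

def correct_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between two measures if both were error-free."""
    return r_xy / math.sqrt(r_xx * r_yy)

print(round(prophecy(0.60, 3), 3))                      # 0.818: tripling the length
print(round(correct_attenuation(0.40, 0.70, 0.80), 3))  # 0.535
```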
References:
• Cohen, R. J., & Swerdlik, M. E. (2009). Psychological Testing and Assessment: An Introduction to Tests and Measurement (7th ed.). USA: McGraw-Hill.
• Kaplan, R. M., & Saccuzzo, D. P. (2009). Psychological Testing: Principles, Applications, and Issues (7th ed.). USA: Wadsworth Cengage Learning.
