

Chapter 4 – Reliability

1. Observed Scores and True Scores


2. Error
3. How We Deal with Sources of Error:
A. Domain sampling – test items
B. Time sampling – test occasions
C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

Chapter 4 - Reliability

• Measurement of human ability and knowledge is challenging because:
 ability is not directly observable – we infer ability from behavior
 all behaviors are influenced by many variables, only a few of which matter to us

Observed Scores

O=T+e

O = Observed score
T = True score
e = error

Reliability – the basics

1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

Reliability – the basics

• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will be the person’s true score.
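
To make this concrete, here is a minimal simulation sketch (hypothetical numbers, Python) of O = T + e: each testing occasion adds random error to a fixed true score, and the mean of many observed scores converges on the true score.

    import random

    # Hypothetical illustration of O = T + e: one person's true score is fixed,
    # each testing occasion adds random error, and the mean of many observed
    # scores converges on the true score.
    true_score = 75.0                   # T: assumed constant for this person
    n_tests = 10_000                    # number of hypothetical repeated testings

    observed = [true_score + random.gauss(0, 5) for _ in range(n_tests)]  # O = T + e
    print(sum(observed) / n_tests)      # close to 75: positive and negative errors cancel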

Reliability – the basics

• Example: to measure Sarah’s spelling ability for English words.
• We can’t ask her to spell every word in the OED, so…
• Ask Sarah to spell a subset of English words.
• Her % correct estimates her true English spelling skill.
• But which words should be in our subset?

Estimating Sarah’s spelling ability…

• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics.

Estimating Sarah’s spelling ability…

• Sarah’s observed score varies as the difficulty of the random sets of words varies.
• But presumably her true score (her actual spelling ability) remains constant.

Reliability – the basics

• Other things can produce error in our measurement.
• E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested…
• This would lead to different scores on the two days.

Estimating Sarah’s spelling ability…

• Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?

Reliability – the basics

• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.

How do we deal with sources of error?

• Error due to test items – domain sampling error
• Error due to testing occasions – time sampling error
• Error due to testing multiple traits – internal consistency error

Domain Sampling error

• A knowledge base or skill set containing many items is to be tested.
 E.g., the chemical properties of foods.
• We can’t test the entire set of items.
 So we select a sample of items.
 That produces domain sampling error, as in Sarah’s spelling test.

Domain Sampling error

• There is a “domain” of knowledge to be tested.
• A person’s score may vary depending upon what is included or excluded from the test.

Domain Sampling error

• Smaller sets of items may not test the entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.

Domain Sampling error

• Parallel Forms Reliability:
• choose 2 different sets of test items.
• these 2 sets give you “parallel forms” of the test.
• Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error.
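
As a minimal sketch (hypothetical scores, Python), the parallel-forms check is just the correlation between the same people’s scores on the two forms:

    from statistics import correlation  # Python 3.10+

    # Hypothetical scores for the same six people on two parallel forms.
    form_a = [72, 85, 90, 64, 78, 88]
    form_b = [70, 83, 93, 60, 80, 85]

    # Pearson correlation across people; a low value would point to domain
    # sampling error (the two forms do not sample the domain equivalently).
    print(correlation(form_a, form_b))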

Time Sampling error

• Test-retest Reliability
 person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Give the same test repeatedly & check correlations among scores.
• High correlations indicate stability – less influence of bad or good days.

Time Sampling error

• Test-retest approach is only useful for traits – characteristics that don’t change over time.
• Not all low test-retest correlations imply a weak test.
• Sometimes, the characteristic being measured varies with time (as in learning).
• The interval over which the correlation is measured matters.
• E.g., for young children, use a very short period (< 1 month, in general).
• In general, the interval should not be > 6 months.

Time Sampling error

• Test-retest approach advantage: easy to evaluate, using correlation.
• Disadvantage: carryover & practice effects.
• Carryover: the first testing session influences scores on the next session.
• Practice: when the carryover effect involves learning.

Internal Consistency error

• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts?
 No – because the two ‘skills’ are unrelated.

Internal Consistency Approach

• A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves.
 But how should we divide the test in two to check that correlation?

Internal Consistency error

• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.

Split-half Reliability

• After testing, divide the test items into halves A & B that are scored separately.
• Check for correlation of results for A with results for B.
• Various ways of dividing the test into two – randomly, first half vs. second half, odd-even…

Split-half Reliability – a problem

• Each half-test is smaller than the whole.
• Smaller tests have lower reliability (domain sampling error).
• So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test.

Split-half Reliability – a problem

• We correct the reliability estimate using the Spearman-Brown formula:

re = 2rc / (1 + rc)

where re = estimated reliability for the whole test, and rc = computed reliability (the correlation between scores on the two halves A and B).
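
For example, a half-test correlation of rc = 0.70 corrects to re = 1.40 / 1.70 ≈ 0.82 for the full-length test. A minimal sketch (hypothetical odd/even half scores, Python) of the split-half correlation and the Spearman-Brown correction:

    from statistics import correlation  # Python 3.10+

    # Hypothetical per-person totals on the odd-numbered and even-numbered items.
    odd_half  = [11, 14, 9, 16, 12, 15]
    even_half = [10, 15, 8, 17, 13, 14]

    rc = correlation(odd_half, even_half)   # correlation between the two halves
    re = (2 * rc) / (1 + rc)                # Spearman-Brown corrected estimate
    print(round(rc, 3), round(re, 3))       # re > rc: the full test is more reliable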

Kuder-Richardson 20

• Kuder & Richardson • KR-20 avoids


(1937): an internal- problems associated
consistency measure with splitting by
that doesn’t require simultaneously
arbitrary splitting of considering all
test into 2 halves. possible ways of
splitting a test into 2
halves.
30

Kuder-Richardson 20

• The formula 1. a measure of all the


contains two basic variance in the
terms: whole set of test
results.
31

Kuder-Richardson 20

• The formula 2. “item variance” –


contains two basic when items measure
terms: the same trait, they
co-vary (same
people get them
right or wrong). More
co-variance = less
“item variance”
32
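
A minimal sketch (hypothetical 1/0 item data, Python) of the calculation, using the standard form KR-20 = (k / (k - 1)) × (1 - Σpq / total-score variance), where k is the number of items and p and q are the proportions of people passing and failing each item:

    from statistics import pvariance

    # Hypothetical right/wrong (1/0) responses: 5 people x 4 items.
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]

    k = len(responses[0])                              # number of items
    totals = [sum(person) for person in responses]     # total score per person
    total_var = pvariance(totals)                      # variance of the whole set of results

    # Item variance term: sum of p*q over items (p = proportion correct, q = 1 - p).
    pq_sum = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / len(responses)
        pq_sum += p * (1 - p)

    kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
    print(round(kr20, 3))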

Internal Consistency – Cronbach’s α

• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally useful measure of internal consistency than KR-20.
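
A parallel sketch (hypothetical multi-point item data, Python) for Cronbach’s α, using the standard form α = (k / (k - 1)) × (1 - Σ item variances / total-score variance); with 1/0 items this reduces to KR-20:

    from statistics import pvariance

    # Hypothetical responses: 5 people x 4 items, each item scored 1-5.
    responses = [
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 5, 4, 5],
        [2, 3, 2, 2],
        [4, 4, 5, 4],
    ]

    k = len(responses[0])                               # number of items
    totals = [sum(person) for person in responses]      # total score per person
    item_vars = [pvariance([person[i] for person in responses]) for i in range(k)]

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))
    print(round(alpha, 3))
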
Review: How do we deal with sources of error?

Approach         Measures                             Issues
Test-Retest      Stability of scores                  Carryover
Parallel Forms   Equivalence & stability              Effort
Split-half       Equivalence & internal consistency   Shortened test
KR-20 & α        Equivalence & internal consistency   Difficult to calculate

Reliability in Observational Studies

• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error due to:
 observer failures
 inter-observer differences

Reliability in Observational Studies

• Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences using:
 Inter-rater reliability
 Kappa statistic

Reliability in Observational Studies

• Inter-rater reliability: % agreement between 2 or more observers.
 problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
 this means that % agreement may over-estimate inter-rater reliability.

Reliability in Observational Studies

• Kappa Statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement, after correction for chance.
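
A minimal sketch (hypothetical ratings, Python) of Cohen’s kappa, κ = (po - pe) / (1 - pe), where po is the observed proportion of agreement and pe is the agreement expected by chance:

    from collections import Counter

    # Hypothetical yes/no codings of the same 10 behaviors by two observers.
    rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
    rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]

    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement

    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in ("yes", "no"))

    kappa = (p_o - p_e) / (1 - p_e)
    print(round(kappa, 3))   # 1 = perfect agreement, 0 = chance-level agreement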

Using Reliability Information

• Standard error of measurement (SEM): estimates the extent to which an observed test score misrepresents the true score.
• SEM = S√(1 – r), where S is the standard deviation of the test scores and r is the reliability of the test.

Standard Error of Measurement

• We use the SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score.
• We have confidence that the true score falls in this interval.
• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
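
For example (hypothetical numbers): a test with standard deviation S = 10 and reliability r = 0.84 has SEM = 10 × √(1 – 0.84) = 10 × 0.4 = 4. For an observed score of 100, the 95% confidence interval is 100 ± 1.96 × 4, i.e., roughly 92 to 108.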

Standard Error of Measurement

• A simple way to think of the SEM:
• Suppose we gave one student the same test over and over.
• Suppose, too, that no learning took place between tests and the student did not memorize questions.
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.

What to do about low reliability

• Increase the number of items.
• To find how many items you need, use the Spearman-Brown formula.
• Using more items may introduce new sources of error such as fatigue and boredom.
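
A sketch of this use (hypothetical numbers, Python), assuming the general (“prophecy”) form of Spearman-Brown, rn = n·r / (1 + (n - 1)·r), where n is the factor by which the test is lengthened; solving for n tells us how much longer the test must be to reach a target reliability:

    # Solve the general Spearman-Brown ("prophecy") formula for the lengthening
    # factor n needed to move reliability from r_current to r_target.
    def lengthening_factor(r_current: float, r_target: float) -> float:
        return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

    # Hypothetical example: a 20-item test with reliability 0.60, target 0.80.
    n = lengthening_factor(0.60, 0.80)
    print(n)                 # about 2.67 -> roughly 2.67 x 20 = 54 items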

What to do about low reliability

• Discriminability analysis:
• Find correlations between each item and the whole test.
• Delete items with low correlations.
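
A minimal sketch (hypothetical 1/0 item data, Python) of the idea: correlate each item with the whole-test score and flag items with low correlations as candidates for deletion:

    from statistics import correlation  # Python 3.10+

    # Hypothetical right/wrong (1/0) responses: 6 people x 3 items.
    responses = [
        [1, 1, 0],
        [1, 0, 1],
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 0],
        [0, 0, 0],
    ]

    totals = [sum(person) for person in responses]      # whole-test score per person
    for i in range(len(responses[0])):
        item_scores = [person[i] for person in responses]
        r = correlation(item_scores, totals)             # item-total correlation
        print(f"item {i + 1}: r = {r:.2f}")              # low r -> candidate for deletion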
