

Chapter 4 – Reliability

1. Observed Scores and True Scores


2. Error
3. How We Deal with Sources of Error:
A. Domain sampling – test items
B. Time sampling – test occasions
C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

Chapter 4 - Reliability

• Measurement of human ability and knowledge is challenging because:
 ability is not directly observable – we infer ability from behavior
 all behaviors are influenced by many variables, only a few of which matter to us

Observed Scores

O=T+e

O = Observed score
T = True score
e = error

Reliability – the basics

1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

Reliability – the basics

• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will be the person’s true score.
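
To make this concrete, here is a minimal simulation sketch (hypothetical numbers, Python) of O = T + e: each testing occasion adds random error to a fixed true score, and the mean of many observed scores converges on the true score.

    import random

    # Hypothetical illustration of O = T + e: one person's true score is fixed,
    # each testing occasion adds random error, and the mean of many observed
    # scores converges on the true score.
    true_score = 75.0                   # T: assumed constant for this person
    n_tests = 10_000                    # number of hypothetical repeated testings

    observed = [true_score + random.gauss(0, 5) for _ in range(n_tests)]  # O = T + e
    print(sum(observed) / n_tests)      # close to 75: positive and negative errors cancel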

Reliability – the basics

• Example: to measure Sarah’s spelling ability for English words.
• We can’t ask her to spell every word in the OED, so…
• Ask Sarah to spell a subset of English words.
• Her % correct estimates her true English spelling skill.
• But which words should be in our subset?

Estimating Sarah’s spelling ability…

• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics.

Estimating Sarah’s spelling ability…

• Sarah’s observed score varies as the difficulty of the random sets of words varies.
• But presumably her true score (her actual spelling ability) remains constant.

Reliability – the basics

• Other things can produce error in our measurement.
• E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested…
• This would lead to different scores on the two days.

Estimating Sarah’s spelling ability…

• Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?

Reliability – the basics

• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.

How do we deal with sources of error?

• Error due to test items – domain sampling error
• Error due to testing occasions – time sampling error
• Error due to testing multiple traits – internal consistency error

Domain Sampling error

• A knowledge base or skill set containing many items is to be tested.
 E.g., the chemical properties of foods.
• We can’t test the entire set of items.
 So we select a sample of items.
 That produces domain sampling error, as in Sarah’s spelling test.

Domain Sampling error

• There is a “domain” of knowledge to be tested.
• A person’s score may vary depending upon what is included or excluded from the test.

Domain Sampling error

• Smaller sets of items may not test the entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.

Domain Sampling error

• Parallel Forms Reliability:
• choose 2 different sets of test items.
• these 2 sets give you “parallel forms” of the test.
• Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error.
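
As a minimal sketch (hypothetical scores, Python), the parallel-forms check is just the correlation between the same people’s scores on the two forms:

    from statistics import correlation  # Python 3.10+

    # Hypothetical scores for the same six people on two parallel forms.
    form_a = [72, 85, 90, 64, 78, 88]
    form_b = [70, 83, 93, 60, 80, 85]

    # Pearson correlation across people; a low value would point to domain
    # sampling error (the two forms do not sample the domain equivalently).
    print(correlation(form_a, form_b))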

Time Sampling error

• Test-retest Reliability
 person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Give the same test repeatedly & check correlations among scores.
• High correlations indicate stability – less influence of bad or good days.

Time Sampling error

• Test-retest approach is only useful for traits – characteristics that don’t change over time.
• Not all low test-retest correlations imply a weak test.
• Sometimes, the characteristic being measured varies with time (as in learning).
• The interval over which the correlation is measured matters.
• E.g., for young children, use a very short period (< 1 month, in general).
• In general, the interval should not be > 6 months.

Time Sampling error

• Test-retest approach advantage: easy to evaluate, using correlation.
• Disadvantage: carryover & practice effects.
• Carryover: the first testing session influences scores on the next session.
• Practice: when the carryover effect involves learning.

Internal Consistency error

• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts?
 No – because the two ‘skills’ are unrelated.

Internal Consistency Approach

• A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves.
 But how should we divide the test in two to check that correlation?

Internal Consistency error

• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.

Split-half Reliability

• After testing, divide the test items into halves A & B that are scored separately.
• Check for correlation of results for A with results for B.
• Various ways of dividing the test into two – randomly, first half vs. second half, odd-even…

Split-half Reliability – a problem

• Each half-test is smaller than the whole.
• Smaller tests have lower reliability (domain sampling error).
• So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test.

Split-half Reliability – a problem

• We correct the reliability estimate using the Spearman-Brown formula:

re = 2rc / (1 + rc)

where re = estimated reliability for the whole test, and rc = computed reliability (the correlation between scores on the two halves A and B).
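
For example, a half-test correlation of rc = 0.70 corrects to re = 1.40 / 1.70 ≈ 0.82 for the full-length test. A minimal sketch (hypothetical odd/even half scores, Python) of the split-half correlation and the Spearman-Brown correction:

    from statistics import correlation  # Python 3.10+

    # Hypothetical per-person totals on the odd-numbered and even-numbered items.
    odd_half  = [11, 14, 9, 16, 12, 15]
    even_half = [10, 15, 8, 17, 13, 14]

    rc = correlation(odd_half, even_half)   # correlation between the two halves
    re = (2 * rc) / (1 + rc)                # Spearman-Brown corrected estimate
    print(round(rc, 3), round(re, 3))       # re > rc: the full test is more reliable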

Kuder-Richardson 20

• Kuder & Richardson • KR-20 avoids


(1937): an internal- problems associated
consistency measure with splitting by
that doesn’t require simultaneously
arbitrary splitting of considering all
test into 2 halves. possible ways of
splitting a test into 2
halves.
30

Kuder-Richardson 20

• The formula 1. a measure of all the


contains two basic variance in the
terms: whole set of test
results.
31

Kuder-Richardson 20

• The formula 2. “item variance” –


contains two basic when items measure
terms: the same trait, they
co-vary (same
people get them
right or wrong). More
co-variance = less
“item variance”
32
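
A minimal sketch (hypothetical 1/0 item data, Python) of the calculation, using the standard form KR-20 = (k / (k - 1)) × (1 - Σpq / total-score variance), where k is the number of items and p and q are the proportions of people passing and failing each item:

    from statistics import pvariance

    # Hypothetical right/wrong (1/0) responses: 5 people x 4 items.
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]

    k = len(responses[0])                              # number of items
    totals = [sum(person) for person in responses]     # total score per person
    total_var = pvariance(totals)                      # variance of the whole set of results

    # Item variance term: sum of p*q over items (p = proportion correct, q = 1 - p).
    pq_sum = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / len(responses)
        pq_sum += p * (1 - p)

    kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
    print(round(kr20, 3))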

Internal Consistency – Cronbach’s α

• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally useful measure of internal consistency than KR-20.
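
A parallel sketch (hypothetical multi-point item data, Python) for Cronbach’s α, using the standard form α = (k / (k - 1)) × (1 - Σ item variances / total-score variance); with 1/0 items this reduces to KR-20:

    from statistics import pvariance

    # Hypothetical responses: 5 people x 4 items, each item scored 1-5.
    responses = [
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 5, 4, 5],
        [2, 3, 2, 2],
        [4, 4, 5, 4],
    ]

    k = len(responses[0])                               # number of items
    totals = [sum(person) for person in responses]      # total score per person
    item_vars = [pvariance([person[i] for person in responses]) for i in range(k)]

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))
    print(round(alpha, 3))
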
Review: How do we deal with sources of error?

Approach         Measures                             Issues
Test-Retest      Stability of scores                  Carryover
Parallel Forms   Equivalence & stability              Effort
Split-half       Equivalence & internal consistency   Shortened test
KR-20 & α        Equivalence & internal consistency   Difficult to calculate

Reliability in Observational Studies

• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error due to:
 observer failures
 inter-observer differences

Reliability in Observational Studies

• Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences using:
 Inter-rater reliability
 Kappa statistic

Reliability in Observational Studies

• Inter-rater reliability: % agreement between 2 or more observers.
 problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
 this means that % agreement may over-estimate inter-rater reliability.

Reliability in Observational Studies

• Kappa Statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement, after correction for chance.
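
A minimal sketch (hypothetical ratings, Python) of Cohen’s kappa, κ = (po - pe) / (1 - pe), where po is the observed proportion of agreement and pe is the agreement expected by chance:

    from collections import Counter

    # Hypothetical yes/no codings of the same 10 behaviors by two observers.
    rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
    rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]

    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement

    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in ("yes", "no"))

    kappa = (p_o - p_e) / (1 - p_e)
    print(round(kappa, 3))   # 1 = perfect agreement, 0 = chance-level agreement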

Using Reliability Information

• Standard error of measurement (SEM): estimates the extent to which an observed test score misrepresents the true score.
• SEM = S√(1 – r), where S is the standard deviation of the test scores and r is the reliability of the test.

Standard Error of Measurement

• We use the SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score.
• We have confidence that the true score falls in this interval.
• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
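
For example (hypothetical numbers): a test with standard deviation S = 10 and reliability r = 0.84 has SEM = 10 × √(1 – 0.84) = 10 × 0.4 = 4. For an observed score of 100, the 95% confidence interval is 100 ± 1.96 × 4, i.e., roughly 92 to 108.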

Standard Error of Measurement

• A simple way to think of the SEM:
• Suppose we gave one student the same test over and over.
• Suppose, too, that no learning took place between tests and the student did not memorize questions.
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.

What to do about low reliability

• Increase the number of items.
• To find how many items you need, use the Spearman-Brown formula.
• Using more items may introduce new sources of error such as fatigue and boredom.
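
A sketch of this use (hypothetical numbers, Python), assuming the general (“prophecy”) form of Spearman-Brown, rn = n·r / (1 + (n - 1)·r), where n is the factor by which the test is lengthened; solving for n tells us how much longer the test must be to reach a target reliability:

    # Solve the general Spearman-Brown ("prophecy") formula for the lengthening
    # factor n needed to move reliability from r_current to r_target.
    def lengthening_factor(r_current: float, r_target: float) -> float:
        return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

    # Hypothetical example: a 20-item test with reliability 0.60, target 0.80.
    n = lengthening_factor(0.60, 0.80)
    print(n)                 # about 2.67 -> roughly 2.67 x 20 = 54 items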

What to do about low reliability

• Discriminability analysis:
• Find correlations between each item and the whole test.
• Delete items with low correlations.
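
A minimal sketch (hypothetical 1/0 item data, Python) of the idea: correlate each item with the whole-test score and flag items with low correlations as candidates for deletion:

    from statistics import correlation  # Python 3.10+

    # Hypothetical right/wrong (1/0) responses: 6 people x 3 items.
    responses = [
        [1, 1, 0],
        [1, 0, 1],
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 0],
        [0, 0, 0],
    ]

    totals = [sum(person) for person in responses]      # whole-test score per person
    for i in range(len(responses[0])):
        item_scores = [person[i] for person in responses]
        r = correlation(item_scores, totals)             # item-total correlation
        print(f"item {i + 1}: r = {r:.2f}")              # low r -> candidate for deletion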
