
Measurement Error

Whatever measurement we might make with regard to some psychological construct, we do so with some amount of error
Any observed score for an individual is their true score with error added in

There are different types of error, but here we are concerned with a measure's inability to capture the true response for an individual
Observed Score = True Score + Error of Measurement

Reliability refers to a measure's ability to capture an individual's true score, i.e., to distinguish accurately one person from another
While a reliable measure will be consistent, consistency can actually be seen as a by-product of reliability; in a case of perfect consistency (everyone scores the same and gets the same score repeatedly), reliability coefficients could not even be calculated
No variance/covariance to yield a correlation
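As a small sketch of the classical model above (hypothetical numbers, using numpy): the correlation between two parallel administrations of a test estimates reliability, i.e., the proportion of observed-score variance that is true-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_score = rng.normal(100, 15, n)          # each person's true score
# two parallel administrations: same true scores, independent errors
form_a = true_score + rng.normal(0, 5, n)
form_b = true_score + rng.normal(0, 5, n)

# reliability = true-score variance / observed-score variance
theoretical = 15**2 / (15**2 + 5**2)         # 0.90 with these numbers
estimated = np.corrcoef(form_a, form_b)[0, 1]
print(theoretical, round(estimated, 2))
```

Note that if everyone had the identical score on both forms, the variances in the denominator of the correlation would be zero and the coefficient would be undefined, which is exactly the point about perfect consistency above.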

The error in our analyses is due to individual differences, but also to the measure not being perfectly reliable

Criteria of reliability
Test components (internal consistency)

Test-retest reliability
Consistency of measurement for individuals over time
Scores should be similar, e.g., today and six months from now
If the administrations are too close in time, the correlation between scores is due to memory of item responses rather than the true score being captured
Chance covariation
Any two variables will always have a non-zero sample correlation
Reliability is not constant across subsets of a population
General-population IQ scores: good reliability
IQ scores for college students only: less reliable
Restriction of range, fewer individual differences
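A quick simulation (hypothetical numbers, using numpy) makes the restriction-of-range point concrete: the same test shows a noticeably lower test-retest correlation in a subsample selected for high scores:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_iq = rng.normal(100, 15, n)
test1 = true_iq + rng.normal(0, 6, n)
test2 = true_iq + rng.normal(0, 6, n)

r_full = np.corrcoef(test1, test2)[0, 1]

# "college" subsample: keep only people scoring above 115 on the first test
college = test1 > 115
r_restricted = np.corrcoef(test1[college], test2[college])[0, 1]
print(round(r_full, 2), round(r_restricted, 2))  # restricted r is clearly lower
```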

Internal Consistency
We can get a sort of average correlation among items to assess the reliability of some measure
As one would most likely intuitively assume, having more measures of something is better than few
Having more items which correlate with one another will increase the test's reliability
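As a sketch of that last point (Cronbach's alpha computed from simulated item data; all numbers are hypothetical), adding items that share a common trait raises the internal-consistency estimate:

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_persons, n_items) array of item scores."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(2)
n = 2000
trait = rng.normal(0, 1, n)

def make_test(k):
    # each item = trait + independent noise; items correlate via the trait
    return trait[:, None] + rng.normal(0, 1.5, (n, k))

a5, a20 = cronbach_alpha(make_test(5)), cronbach_alpha(make_test(20))
print(round(a5, 2), round(a20, 2))  # alpha rises as correlated items are added
```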

What's good reliability?

While we have conventions, it really depends
As mentioned, the reliability of a measure may be different for different groups of people
What we may need to do is compare reliability to those measures which are already in place and deemed good, as well as get interval estimates to provide an assessment of the uncertainty in our reliability estimate
Note also that reliability estimates are upwardly biased, and so are a bit optimistic
Also, many of our techniques do not take into account the reliability of our measures, and poor reliability can result in lower statistical power, i.e., an increase in the Type II error rate (though technically, increasing reliability can in some situations also lower power)
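To sketch why poor reliability costs power: by the classical attenuation relation, unreliable measures shrink the observed correlation, which inflates the sample size needed to detect it. The sample-size figures below are rough approximations via Fisher's z transformation, and all numbers are hypothetical:

```python
import numpy as np

def attenuated_r(r_true, rel_x, rel_y):
    # classical attenuation relation: the observed correlation shrinks
    # as either measure's reliability falls
    return r_true * np.sqrt(rel_x * rel_y)

def n_for_80_power(r):
    # rough n for 80% power, two-tailed alpha = .05, via Fisher's z
    # (1.96 and 0.8416 are the standard normal quantiles involved)
    return int(np.ceil(((1.96 + 0.8416) / np.arctanh(r)) ** 2 + 3))

r_true = 0.40
r_obs = attenuated_r(r_true, 0.70, 0.70)   # 0.28 with these reliabilities
print(n_for_80_power(r_true), n_for_80_power(r_obs))  # required n roughly doubles
```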


Replication and Reliability

While reliability implies replicability, assessing reliability does not provide a probability of replication
Note also that statistical significance is not a measure of reliability or replicability
Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons
Practical concerns, lack of publishing outlets, etc.
Furthermore, knowing our estimates are themselves biased and variable, we might even think that in many cases we would not expect consistent research findings
In psychology, many people spend a lot of time debating back and forth about the merits of some theory, citing cases where it did or did not replicate
However, the lack of replication could be due to low power, low reliability, problem data, or incorrectly carrying out the experiment
In other words, the finding didn't repeat because of methodology, not because the theory was wrong

Factors affecting the utility of replication

You can't step in the same river twice!
Later replications do not provide as much information; however, they can contribute greatly to the overall assessment of an effect
There is no perfect replication (different people involved, the time it takes to conduct, etc.)
Doing an exact replication gives us more confidence in the original finding (should it hold), but may not offer much in the way of generalization
Example: doing a gender-difference study at UNT over and over. Does it work for non-college folk? People outside of Texas?

Factors affecting the utility of replication

By whom
It is well known that those with a vested interest in some idea tend to find confirming evidence more than those who don't
Replications by others are still being done by those with an interest in that research topic, and so may have a precorrelation inherent in their attempt
Direct: correlation of attributes of the persons involved
Indirect: correlation of the data to be obtained
The gist: we can't have truly independent replication attempts, but must strive to minimize bias
The more independent replication attempts are, the more informative they will be

Validity refers to the question of whether our measurements are actually hitting on the construct we think they are
While we can obtain specific statistics for reliability (even of different types), validity is more of a global assessment based on the evidence
We can have reliable measurements that are not valid
Classic example: the scale which is consistent and able to distinguish one person from the next, but is actually off by 5 pounds
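The biased-scale example can be simulated directly (hypothetical numbers, using numpy): two readings correlate almost perfectly, so the scale is highly reliable, yet every reading is systematically wrong:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
true_weight = rng.normal(160, 20, n)
# a reliable-but-invalid scale: very consistent, but reads 5 pounds heavy
reading1 = true_weight + 5 + rng.normal(0, 1, n)
reading2 = true_weight + 5 + rng.normal(0, 1, n)

reliability = np.corrcoef(reading1, reading2)[0, 1]  # near 1: very reliable
bias = reading1.mean() - true_weight.mean()          # about +5: not valid
print(round(reliability, 3), round(bias, 1))
```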

Validity Criteria in Psychological Testing

Content validity
Criterion validity

Construct-related validity

Content validity
Items represent the kinds of material (or content areas) they are supposed to represent
Are the questions worth a flip, in the sense that they cover all domains of a given construct?
E.g., job satisfaction = salary, relationship w/ boss, relationship w/ coworkers


Validity Criteria in Psychological Testing

Criterion validity
The degree to which the measure correlates with relevant criterion variables
Does some new personality measure correlate with the Big 5?
Criterion in the present (concurrent)
Measure of ADHD and current scholastic behavioral problems
Criterion in the future (predictive)
SAT and college GPA

Validity Criteria in Psychological Testing

Construct-related validity
How much is it an actual measure of the construct of interest?
Correlates well with other measures of the construct (convergent)
A depression scale correlates well with other depression scales
Is distinguished from related but distinct constructs (discriminant)
Depression scale != stress scale

Validity Criteria in Experimentation

Statistical conclusion validity
Is there a relationship (covariation) between X and Y?
Correlation is our starting point (i.e., correlation isn't causation, but it is where the case for causation starts)
Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation

Internal validity
Has the study been conducted so as to rule out other effects which were not of interest?
Poor instruments, experimenter bias

External validity
Will the relationship be seen in other settings?

Construct validity
Same concerns as before
Ex.: Is reaction time an appropriate measure of learning?

Reliability and validity are key concerns in psychological research
Part of the problem in psychology is the lack of reliable measures of the things we are interested in
Assuming that they are valid to begin with, we must always press for more reliable measures if we are to progress scientifically
This means letting go of supposed standards when they are no longer as useful, and looking for ways to improve current ones