
Measurement Error

Whatever measurement we make of a psychological construct, we make it with some amount of error
Any observed score for an individual is their true score with error added in
There are different types of error, but here we are concerned with a measure's inability to capture the true response for an individual
Observed Score = True score + Error of measurement
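
A minimal simulation sketch of the equation above. The means, standard deviations, and sample size are made-up values for illustration, not figures from these slides. It also uses the standard classical test theory definition of reliability as the share of observed-score variance that comes from true scores, which the slides do not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_people = 10_000
# Hypothetical true scores and measurement errors (illustrative values only).
true_score = rng.normal(loc=100, scale=15, size=n_people)
error = rng.normal(loc=0, scale=5, size=n_people)

# Observed Score = True score + Error of measurement
observed = true_score + error

var_true = true_score.var(ddof=1)
var_observed = observed.var(ddof=1)

# Classical test theory: reliability = true-score variance / observed variance.
reliability = var_true / var_observed
print(f"estimated reliability: {reliability:.3f}")
```

With these assumed values the population reliability is 225 / (225 + 25) = 0.9, and the sample estimate should land close to that.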

Reliability
Reliability refers to a measure's ability to capture an individual's true score, i.e. to distinguish accurately one person from another
While a reliable measure will be consistent, consistency can
actually be seen as a by-product of reliability, and in a case
where we had perfect consistency (everyone scores the
same and gets the same score repeatedly), reliability
coefficients could not be calculated
No variance/covariance to give a correlation

The error in our analyses is due to individual differences, but also to the measure not being perfectly reliable

Reliability
Criteria of reliability
Test-retest
Test components (internal consistency)

Test-retest reliability
Consistency of measurement for individuals over time
Individuals score similarly, e.g. today and 6 months from now

Issues
Memory
If the two administrations are too close in time, the correlation between scores may reflect memory of item responses rather than the true score being captured
Chance covariation
In a sample, any two variables will almost always show some non-zero correlation by chance alone
Reliability is not constant across subsets of a population
IQ scores in the general population: good reliability
IQ scores for college students: less reliable
Restriction of range means fewer individual differences
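
The restriction-of-range point can be sketched with a small simulation. Everything here is hypothetical: the score distributions, the error size, and the cutoff of 115 standing in for a selected group such as college students. The same test, with the same amount of error, gives a lower test-retest correlation when the people measured differ less from one another.

```python
import numpy as np

rng = np.random.default_rng(1)

n_people = 20_000
true_score = rng.normal(100, 15, n_people)          # hypothetical IQ-like true scores
test = true_score + rng.normal(0, 5, n_people)      # first administration
retest = true_score + rng.normal(0, 5, n_people)    # second administration, same error size


def retest_correlation(mask):
    """Pearson correlation between test and retest for the selected people."""
    return np.corrcoef(test[mask], retest[mask])[0, 1]


everyone = np.ones(n_people, dtype=bool)
selected = true_score > 115   # hypothetical cutoff: a restricted, high-scoring subgroup

r_full = retest_correlation(everyone)
r_restricted = retest_correlation(selected)
print(f"full range: {r_full:.2f}, restricted range: {r_restricted:.2f}")
```

The measurement error is identical in both groups; only the spread of true scores changes, and that is enough to pull the coefficient down.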

Internal Consistency
We can get a sort of average correlation among items to assess the reliability of some measure
As one would most likely intuitively assume,
having more measures of something is better
than few
It is the case that having more items which correlate with one another will increase the test's reliability
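
One common coefficient of this kind is Cronbach's alpha. The slides do not name a specific statistic, so treating alpha as the intended one is an assumption. The sketch below computes alpha from an item-score matrix and compares a hypothetical 3-item test with a 10-item test built from items of the same quality; `cronbach_alpha` and `simulate_items` are illustrative helper names, and the data are simulated.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for an array of shape (n_people, n_items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


rng = np.random.default_rng(2)
n_people = 5_000
trait = rng.normal(0, 1, n_people)   # hypothetical underlying construct


def simulate_items(n_items):
    """Each item is the trait plus independent noise (assumed equal quality)."""
    noise = rng.normal(0, 1, (n_people, n_items))
    return trait[:, None] + noise


alpha_short = cronbach_alpha(simulate_items(3))
alpha_long = cronbach_alpha(simulate_items(10))
print(f"3 items: {alpha_short:.2f}, 10 items: {alpha_long:.2f}")
```

Under these assumptions the items correlate about 0.5 with one another, and the longer test should show the higher alpha.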

What's good reliability?

While we have conventions, it really depends on the context
As mentioned, the reliability of a measure may be different for different groups of people


What we may need to do is compare reliability to those
measures which are in place and deemed good as well
as get interval estimates to provide an assessment of
the uncertainty in our reliability estimate
Note also that reliability estimates are biased upwardly
and so are a bit optimistic
Also, many of our techniques do not take into account
the reliability of our measures, and poor reliability can
result in lower statistical power i.e. an increase in type II
error
Though technically increasing reliability can potentially also lower power
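
For the interval estimates mentioned above, one possible approach is a percentile bootstrap around a test-retest correlation. This is a sketch under assumptions: the slides do not say which interval method they have in mind, and the data, sample size, and number of resamples are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n_people = 150
true_score = rng.normal(50, 10, n_people)          # hypothetical true scores
test = true_score + rng.normal(0, 5, n_people)     # first administration
retest = true_score + rng.normal(0, 5, n_people)   # second administration


def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]


r_hat = pearson_r(test, retest)   # point estimate of test-retest reliability

n_boot = 2_000
boot_r = np.empty(n_boot)
for b in range(n_boot):
    # Resample people (pairs of scores) with replacement.
    idx = rng.integers(0, n_people, size=n_people)
    boot_r[b] = pearson_r(test[idx], retest[idx])

lower, upper = np.percentile(boot_r, [2.5, 97.5])
print(f"reliability estimate {r_hat:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the point estimate shows how much the reliability coefficient itself could vary from sample to sample.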

Replication and Reliability


While reliability implies replicability, assessing reliability does not provide a probability of replication
Note also that statistical significance is not a measure of reliability or replicability

Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons
Practical concerns, lack of publishing outlets etc.
Furthermore, knowing our estimates are biased and variable themselves, we might even think that in many cases we would not expect consistent research findings
In psychology, many people spend a lot of time debating back and
forth about the merits of some theory, citing cases where it did or
did not replicate
However the lack of replication could be due to low power, low
reliability, problem data, incorrectly carrying out the experiment
etc.
In other words, the finding didn't replicate because of methodology, not because the theory was wrong

Factors affecting the utility of replications
You can't step in the same river twice!
Heraclitus

When
Later replications do not provide as much information; however, they can contribute greatly to the overall assessment of an effect

Meta-analysis

How
There is no perfect replication (different people involved, time it takes to conduct etc.)


Doing exact replication gives us more confidence in the
original finding (should it hold), but may not offer much in
the way of generalization

Example: doing a gender difference study at UNT over and over. Does it work for non-college folk? People outside of Texas?

Factors affecting the utility of replications
By whom
It is well known that those with a vested interest in some idea tend to find confirming evidence more than those who don't
Replications by others are still being done by those with
an interest in that research topic and so may have a
precorrelation inherent in their attempt

Direct: correlation of attributes of persons involved


Indirect: correlation of data to be obtained

Gist: we can't have truly independent replication attempts, but must strive to minimize bias


The more independent replication attempts are,
the more informative they will be

Validity
Validity refers to the question of whether our measurements are actually hitting on the construct we think they are
While we can obtain specific statistics for
reliability (even different types), validity is more of
a global assessment based on the evidence
available
We can have reliable measurements that are
invalid
Classic example: a scale that is consistent and able to distinguish one person from the next, but is actually off by 5 pounds
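
The scale example can be sketched numerically. The weights, the small amount of random error, and the number of people are hypothetical; only the 5-pound offset comes from the slide. Two weighings agree almost perfectly with each other (reliable), yet every reading is about 5 pounds away from the true weight (invalid as a measure of actual weight).

```python
import numpy as np

rng = np.random.default_rng(4)

n_people = 200
true_weight = rng.normal(160, 25, n_people)   # hypothetical true weights in pounds
offset = 5.0                                  # the scale reads 5 pounds heavy

# Two weighings on the same miscalibrated scale, each with a little random error.
weighing_1 = true_weight + offset + rng.normal(0, 0.5, n_people)
weighing_2 = true_weight + offset + rng.normal(0, 0.5, n_people)

consistency = np.corrcoef(weighing_1, weighing_2)[0, 1]   # test-retest correlation
bias = np.mean(weighing_1 - true_weight)                  # average distance from the truth

print(f"test-retest correlation: {consistency:.3f}")
print(f"average error in pounds: {bias:.2f}")
```

A correlation is blind to a constant offset, which is why a reliability coefficient alone cannot tell us whether the measurements are valid.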

Validity Criteria in Psychological Testing


Content validity
Criterion validity
Concurrent
Predictive

Construct-related validity
Convergent
Discriminant

Content validity
Items represent the kinds of material (or content areas) they are supposed to represent
Are the questions any good, in the sense that they cover all domains of a given construct?
E.g. job satisfaction = salary, relationship w/ boss, relationship w/ coworkers etc.

Validity Criteria in Psychological Testing


Criterion validity
The degree to which the measure correlates with various outcomes
Does some new personality measure correlate with the Big 5?

Concurrent
Criterion is in the present

Measure of ADHD and current scholastic behavioral problems

Predictive
Criterion in the future

SAT and college GPA

Validity Criteria in Psychological Testing


Construct-related validity
To what extent is it an actual measure of the construct of interest?

Convergent
Correlates well with other measures of the construct

Depression scale correlates well with other depression scales

Discriminant
Is distinguished from related but distinct constructs

Depression scale != stress scale

Validity Criteria in Experimentation


Statistical conclusion validity
Is there a relationship between X and Y that could support a causal claim?
Correlation is our starting point (i.e. correlation isn't causation, but it is where a causal argument begins)
Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation

Internal validity
Has the study been conducted so as to rule out other effects which were controllable?
Poor instruments, experimenter bias

External validity
Will the relationship be seen in other settings?

Construct validity
Same concerns as before
Ex. Is reaction time an appropriate measure of learning?

Summary
Reliability and validity are key concerns in psychological research
Part of the problem in psychology is the lack of reliable measures of the things we are interested in
Assuming that they are valid to begin with, we
must always press for more reliable measures if
we are to progress scientifically
This means letting go of supposed standards when they are no longer as useful and looking for ways to improve current ones