
Validity and Reliability

Lecture 2 & 3
Think in terms of 'the purpose of tests' and the 'consistency' with which the purpose is fulfilled/met

[Figure: four target diagrams illustrating the four combinations of validity and reliability]
 Neither Valid nor Reliable
 Reliable but not Valid
 Fairly Valid but not very Reliable
 Valid & Reliable
Validity
 Depends on the PURPOSE
 E.g. a ruler may be a valid measuring device
for length, but isn’t very valid for measuring
volume
 Measuring what 'it' is supposed to measure
 Matter of degree (how valid?)
 Specific to a particular purpose!
 Must be inferred from evidence; cannot be directly measured
 Learning outcomes
1. Content coverage (relevance?)
2. Level & type of student engagement (cognitive, affective, psychomotor) – appropriate?
Reliability
 Consistency in the type of result a test yields
– Time & space
– participants
 Not a perfectly similar result, but 'very close to' being similar
 When someone says you are a ‘reliable’
person, what do they really mean?
 Are you a reliable person? 
What do you think…?
 Forced-choice assessment forms are high in reliability,
but weak in validity (true/false)
 Performance-based assessment forms are high in both
validity and reliability (true/false)
 A test item is said to be unreliable when most students
answered the item wrongly (true/false)
 When a test contains items that do not represent the content covered during instruction, it is known as an unreliable test (true/false)
 Test items that do not successfully measure the intended learning outcomes (objectives) are invalid items (true/false)
 Assessment that does not represent student learning well enough is definitely invalid and unreliable (true/false)
 A valid test can sometimes be unreliable (true/false)
– If a test is valid, it is reliable! (by-product)
Indicators of quality

 Validity
 Reliability
 Utility
 Fairness

Question: how are they all inter-related?


Two methods of estimating reliability
 Test/Retest
Test/retest is the more conservative method to estimate reliability. Simply put, the idea behind test/retest is that you should get the same score on test 1 as you do on test 2. The three main components of this method are as follows:
1) implement your measurement instrument at two separate times for each subject;
2) compute the correlation between the two separate measurements; and
3) assume there is no change in the underlying condition (or trait you are trying to measure) between test 1 and test 2.
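A minimal computational sketch of this procedure, using hypothetical scores for the same subjects at two administrations (Python/NumPy is assumed here, not any particular statistics package):

```python
# Test/retest reliability: correlate the same subjects' scores from two
# administrations of the same instrument (hypothetical data).
import numpy as np

test1 = np.array([12, 18, 15, 20, 9, 14, 17])   # scores at time 1
test2 = np.array([13, 17, 15, 19, 10, 15, 16])  # scores at time 2 (same subjects)

# The Pearson correlation between the two administrations serves as the
# test/retest reliability estimate.
r = np.corrcoef(test1, test2)[0, 1]
print(f"Test/retest reliability estimate: {r:.2f}")
```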
Two methods of estimating reliability
 Internal Consistency
Internal consistency estimates reliability by grouping questions in a questionnaire that measure the same concept. For example, you could write two sets of three questions that measure the same concept (say class participation) and, after collecting the responses, run a correlation between those two groups of three questions to determine if your instrument is reliably measuring that concept.
 The primary difference between test/retest and internal consistency estimates of reliability is that test/retest involves two administrations of the measurement instrument, whereas the internal consistency method involves only one administration of that instrument.
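A minimal sketch of the item-grouping idea above, assuming two hypothetical sets of three questions rated by the same respondents in a single administration:

```python
# Internal consistency: correlate respondents' totals on two groups of items
# written to measure the same concept (hypothetical 5-point ratings).
import numpy as np

# Rows = respondents, columns = the three items in each set.
set_a = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 4], [1, 2, 2]])
set_b = np.array([[5, 4, 4], [2, 2, 3], [4, 5, 5], [3, 4, 3], [2, 1, 2]])

# Sum each respondent's ratings within a set, then correlate the two totals;
# a high correlation suggests the items measure the same concept consistently.
r = np.corrcoef(set_a.sum(axis=1), set_b.sum(axis=1))[0, 1]
print(f"Internal consistency estimate (two item groups): {r:.2f}")
```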
Validity

Cook and Campbell (1979) define it as the "best available approximation to the truth or falsity of a given inference, proposition or conclusion."
In short, were we right?
Let's look at a simple example. Say we are studying the effect of strict attendance policies on class participation. We see that class participation did increase after the policy was established. Each type of validity would highlight a different aspect of the relationship between our treatment (strict attendance policy) and our observed outcome (increased class participation).
Types of Validity

1. Conclusion validity
2. Internal validity
3. Construct validity
4. External validity
Conclusion validity

Asks: is there a relationship between the program and the observed outcome? Or, in our example, is there a connection between the attendance policy and the increased participation we saw?
Internal Validity

Asks: if there is a relationship between the program and the outcome we saw, is it a causal relationship? For example, did the attendance policy cause class participation to increase?
Threats To Internal Validity

There are three main types of threats to internal validity:
– single group threats
– multiple group threats
– social interaction threats

Single Group Threats apply when you are studying a single group receiving a program or treatment. Thus, all of these threats can be greatly reduced by adding to your study a control group that is comparable to your program group.
Threats To Internal Validity – Single Group
 A History Threat occurs when a historical event affects your program group such that it causes the outcome you observe (rather than your treatment being the cause). In our earlier example, this would mean that the stricter attendance policy did not cause the increase in class participation; rather, the expulsion from school of several students due to low participation affected your program group such that they increased their participation.
Threats To Internal Validity – Single Group
 A Maturation Threat to internal validity occurs when standard events over the course of time cause your outcome. For example, if, by chance, the students who participated in your study on class participation all "grew up" naturally and realized that class participation increased their learning (how likely is that?) – that could be the cause of your increased participation, not the stricter attendance policy.
Threats To Internal Validity – Single Group
 A Testing Threat to internal validity is simply when the act of taking a pre-test affects how that group does on the post-test. For example, if in your study of class participation you measured class participation prior to implementing your new attendance policy, and students became forewarned that there was about to be an emphasis on participation, they may increase it simply as a result of involvement in the pretest measure – and thus, your outcome could be a result of a testing threat, not your treatment.
Threats To Internal Validity – Single Group
 An Instrumentation Threat to internal validity could occur if the effect of increased participation is actually due to the way in which the pretest was implemented.
Threats To Internal Validity – Single Group
 A Mortality Threat to internal validity occurs when subjects drop out of your study, and this leads to an inflated measure of your effect. For example, if as a result of a stricter attendance policy most students drop out of a class, leaving only the more serious students (those who would participate at a high level naturally) – this could mean your effect is overestimated and suffering from a mortality threat.
Multiple Group Threats
involve the comparability of the two groups in your study, and whether any factor other than your treatment causes the outcome. They also mirror the single group threats to internal validity.
 A Selection-History threat occurs when an event occurring between the pre and post test affects the two groups differently.
 A Selection-Maturation threat occurs when there are different rates of growth between the two groups between the pre and post test.
 A Selection-Testing threat is the result of the different effect from taking tests between the two groups.
 A Selection-Instrumentation threat occurs when the test implementation affects the groups differently between the pre and post test.
 A Selection-Mortality threat occurs when there are different rates of dropout between the groups, which leads to you detecting an effect that may not actually occur.
 Finally, a Selection-Regression threat occurs when the two groups regress towards the mean at different rates.
Construct validity
Asks whether there is a relationship between how I operationalized my concepts in this study and the actual causal relationship I'm trying to study.
In our example, did our treatment (attendance policy) reflect the construct of attendance, and did our measured outcome (increased class participation) reflect the construct of participation?
External validity

Refers to our ability to generalize the results of our study to other settings.
In our example, could we generalize our results to other classrooms? Other schools?
Types of validity measures

 Face validity
 Construct validity
 Content validity
 Criterion validity
1. Predictive
2. Concurrent
Face Validity
 Does it appear to measure what it is supposed to
measure?

 Example: Let's say you are interested in measuring 'Propensity towards violence and aggression'. By simply looking at the following items, state which ones qualify to measure the variable of interest:
– Have you been arrested?
– Have you been involved in physical fighting?
– Do you get angry easily?
– Do you sleep with your socks on?
– Is it hard to control your anger?
– Do you enjoy playing sports?
Construct Validity
 Does the test measure the 'human' CHARACTERISTIC(s) it is supposed to?
 Examples of constructs or 'human' characteristics:
– Mathematical reasoning
– Verbal reasoning
– Musical ability
– Spatial ability
– Mechanical aptitude
– Motivation
Content Validity

 Does the DV measure the empirical, operationalized variable being studied?
 Does the exam adequately cover the range of material?
Criterion Validity
 The degree to which content on a test (predictor) correlates with performance on relevant criterion measures (concrete criterion in the "real" world?)
 If they do correlate highly, it means that the test (predictor) is a valid one!
 E.g. if you taught skills relating to 'public speaking' and had students do a test on it, the test can be validated by looking at how it relates to actual performance (public speaking) of students inside or outside of the classroom
Two Types of Criterion Validity

 Concurrent Criterion Validity = how well performance on a test estimates current performance on some valued measure (criterion). (E.g. a test of dictionary skills can estimate students' current skills in the actual use of a dictionary – observation.)

 Predictive Criterion Validity = how well performance on a test predicts future performance on some valued measure (criterion). (E.g. a reading readiness test might be used to predict students' achievement in reading.)
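A minimal sketch of the correlation behind both types, using hypothetical test scores and criterion ratings for the same students; whether the coefficient counts as concurrent or predictive depends only on when the criterion measure is taken:

```python
# Criterion validity: correlate test scores (predictor) with performance on a
# valued criterion measure (hypothetical public-speaking example).
import numpy as np

test_scores = np.array([55, 72, 80, 64, 90, 47, 68])   # predictor: test results
criterion = np.array([6, 7, 9, 6, 9, 4, 7])            # criterion: observed speaking performance

# A high correlation supports criterion validity; a criterion measured now
# gives concurrent validity, one measured later gives predictive validity.
r = np.corrcoef(test_scores, criterion)[0, 1]
print(f"Criterion validity coefficient: {r:.2f}")
```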
Reliability
 Measure of consistency of test results
from one administration of the test to the
next

 Generalizability – consistency (interwoven concepts) – if a test item is reliable, it can be correlated with other items to collectively measure a construct or content mastery
Measuring Reliability
 Test – retest
Give the same test twice to the same group with
any time interval between tests
 Equivalent forms (similar in content, difficulty level, arrangement, type of assessment, etc.)
Give two forms of the test to the same group in close succession
 Split-half
Test has two equivalent halves. Give test once,
score two equivalent halves (odd items vs. even
items)
 Cronbach Alpha (SPSS)
Inter-item consistency – one test – one administration (see the sketch after this list)
 Inter-rater Consistency (subjective scoring)
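The slide names SPSS, but the single-administration estimates can be sketched directly. Below is a hypothetical example (Python/NumPy, invented 0/1 item responses) computing a split-half coefficient from odd vs. even items, with the Spearman–Brown correction, and Cronbach's alpha:

```python
# Split-half and Cronbach's alpha reliability from ONE administration
# (hypothetical 0/1 item responses: rows = students, columns = items).
import numpy as np

scores = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
], dtype=float)

# --- Split-half: score the odd items and the even items separately ---
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
# Spearman-Brown correction estimates full-length reliability from the half-test correlation.
r_split = 2 * r_half / (1 + r_half)

# --- Cronbach's alpha: inter-item consistency ---
k = scores.shape[1]                              # number of items
item_var = scores.var(axis=0, ddof=1)            # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)

print(f"Split-half reliability (Spearman-Brown): {r_split:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
```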
