
PSYCHOMETRIC CHARACTERISTICS OF TESTS – RELIABILITY AND VALIDITY

VALIDITY: TYPES AND EXAMPLES


Validity
 Validity is the extent to which a test measures what it is supposed to measure.
 The question of validity is raised in the context of three points: the form of the test, the purpose of the test, and the population for whom it is intended. Therefore, we cannot ask the general question “Is this a valid test?”.
 The question to ask is “How valid is this test for the decision that I need to make?” or “How valid is the interpretation I propose for the test?”
Definition
Validity has been described as ‘the agreement between a test score or measure and the quality it is believed to measure’ (Kaplan and Saccuzzo, 2001).

In other words, validity concerns the gap between what a test actually measures and what it is intended to measure.


Validity
 Depends on the PURPOSE
 E.g. a ruler may be a valid measuring device for length, but is not very valid for measuring volume
 Measuring what ‘it’ is supposed to measure
 Matter of degree (how valid?)
 Specific to a particular purpose!
 Must be inferred from evidence; cannot be directly measured
Validity (actually studying the variables that we wish to study)
 Types of validity
1. Face validity
2. Construct validity
3. Content validity
4. Criterion validity
   a) Predictive validity
   b) Concurrent validity
Face Validity
Face validity refers to the degree to which a test ‘appears’ to measure what it purports to measure.

Face validity means simply that the test appears to be valid. It is judged using common-sense rules, for example that a mathematical test should include some numerical elements.
Face Validity
Face validity is the mere appearance that a measure has validity.

We often say a test has face validity if the items seem to be reasonably related to the perceived purpose of the test.
Face Validity
For example, a scale to measure anxiety might include items such as
“My stomach gets upset when I think about taking tests.”
and
“My heart starts pounding fast whenever I think about all of the things I need to get done.”


Face Validity
On the basis of positive responses to these items, can we
conclude that the person is anxious?
Remember, validity requires evidence in order to justify conclusions.
In this case we can only conclude that the person answers
these two items in a particular way. If we want to conclude
that a person has a particular problem with anxiety, then we
need systematic evidence that shows how responses to these
items relate to the psychological condition of anxiety.
Face validity is really not validity at all because it does not
offer evidence to support conclusions drawn from test scores.
Face Validity
Face validity is not totally unimportant. In many settings,
it is crucial to have a test that “looks like” it is valid.
These appearances can help motivate test takers because
they can see that the test is relevant.

For example, suppose you developed a test to screen applicants for a training program in accounting. Items that ask about balance sheets and ledgers might make applicants more motivated than items about fuel consumption. However, both types of items might be testing the same arithmetic reasoning skill.
A test that does not have face validity may be rejected by test-takers (if they have that option) and also by people who are choosing which test to use from amongst a set of options.
Construct Validity
Construct validity is the degree to which a test measures an intended hypothetical construct.

Psychologists often assess or measure abstract attributes or constructs. The process of validating the interpretations about that construct, as indicated by the test score, is construct validation.

Underlying many tests is a construct or theory that is being assessed. For example, there are a number of constructs for describing intelligence (spatial ability, verbal reasoning, etc.), which a test will individually assess.

If the construct is not valid, then the test on which it is based will not be valid. For example, there have been historical constructs holding that intelligence is based on the size and shape of the skull.
Construct Validity
Do my dependent variables actually measure the hypothetical construct that I want to test?

Does my IQ test really measure IQ, and nothing else?

Do my procedures actually measure learning (without being influenced by motivation)?

Does my personality test really measure personality traits without including fatigue?


 In other words, we need to establish the construct validity of measures of intelligence by showing that scores on tests that theoretically measure intelligence do, indeed, predict performance on other tasks that require intelligence.
 If we were measuring altruism, we would expect that people who stop to help stranded motorists are also the people who donate to charities, while those who do not stop also do not donate.
Comparing Face Validity with Construct Validity
Face validity: the consensus that a measure represents a particular concept – the face value of the measure.

Construct validity: the accuracy with which a measure represents the particular concept, without the influence of additional factors. Construct validity implies that other operational definitions of the same construct will yield correlated results.
Convergent vs. Divergent Validity
In order to demonstrate construct validity, we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ.

The former is convergent validity and the latter is divergent validity.
Convergent vs. Divergent Validity

Convergent Validity
When a measure correlates well with other tests believed to measure the same construct, convergent evidence for validity is obtained.

This sort of evidence shows that measures of the same construct converge, or narrow in, on the same thing.

Divergent Validity
Divergent validity is also called discriminant validity.
 To demonstrate divergent validity, a test should have low correlations with measures of unrelated constructs, or evidence for what the test does not measure.
 Divergent evidence for validity indicates that the measure does not represent a construct other than the one for which it was devised.
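The logic above can be sketched numerically. The snippet below is a minimal Python illustration with simulated data (the anxiety and vocabulary measures are invented, not taken from the slides): a new anxiety scale should correlate strongly with an established anxiety measure (convergent evidence) and only weakly with a vocabulary test (divergent evidence).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical sample of respondents

# Simulated construct-validation study (illustrative data only).
true_anxiety = rng.normal(50, 10, n)

new_scale = true_anxiety + rng.normal(0, 4, n)          # new anxiety measure
established_scale = true_anxiety + rng.normal(0, 4, n)  # existing anxiety measure (same construct)
vocabulary_test = rng.normal(100, 15, n)                # unrelated construct

r_convergent = np.corrcoef(new_scale, established_scale)[0, 1]  # expected to be high
r_divergent = np.corrcoef(new_scale, vocabulary_test)[0, 1]     # expected to be near zero

print(f"Convergent evidence: r = {r_convergent:.2f}")
print(f"Divergent evidence:  r = {r_divergent:.2f}")
```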
Content Validity
A test has content validity if it sufficiently covers the area that it is intended to cover. This is particularly important in ability or attainment tests that assess skills or knowledge in a particular domain.

Traditionally, content validity has been of greatest concern in educational testing.


Content Validity
Content under-representation occurs when important areas are missed.

Construct-irrelevant variation occurs when irrelevant factors contaminate the test.

For example, if we want to test knowledge of Pakistan's geography, it is not fair to have most questions limited to the geography of England.
Content Validity
Example:
 Do the questions on an exam accurately reflect what you have learned in the course, or were the exam questions sampled from only a sub-section of the material? A test to measure your knowledge of mathematics should not be limited to addition problems, nor should it include questions about Urdu literature. It should cover the entire range of math problems appropriate to what you are trying to measure.
Content Validity
 Like face validity, content validity is determined by experts who
can judge the representativeness of the items on the test. When
studying hypothetical constructs, both face and content validity
may be limited in value because they may fail in either of two
ways:

(a) A measure may look valid when it does not, in fact, measure the
underlying construct, and

(b) A measure that does not appear to be valid may actually be an excellent indicator of the underlying construct.
Criterion Validity
 Criterion validity evidence tells us just how well a test corresponds with a particular criterion.
 Such evidence is provided by high correlations between a test and a well-defined criterion measure.
 A criterion is the standard against which the test is compared.
 For example, a test might be used to predict which engaged couples will have successful marriages and which ones will get divorced. Marital success is the criterion, but it cannot be known at the time the couples take the premarital test.
Criterion Validity
 A powerful indicator of the validity of a measure is its ability to accurately predict performance on other, independent outcome measures (referred to as criterion measures).
 The extent to which your SAT score predicts your college GPA is an indication of the SAT’s criterion validity.
 There are two approaches to criterion validity:
Concurrent validity and
Predictive validity
Predictive Validity
 The forecasting function of tests is actually a type or form of criterion validity evidence known as predictive validity.
 For example, the SAT provides predictive validity evidence as a college admissions test if it accurately forecasts how well high school students will do in their college studies.
 The SAT is the predictor variable (test), and the college GPA is the criterion.
 The purpose of the test is to predict the likelihood of succeeding on the criterion – that is, achieving a high GPA in college. A valid test for this purpose would greatly help college admissions committees because they would have some idea about which students would most likely succeed.
Concurrent Validity
 Concurrent evidence for validity comes from assessments of the simultaneous relationship between the test and the criterion – such as between a learning disability test and school performance.
 Here the measure (test) and criterion measure are taken at the same time because the test is designed to explain why the person is now having difficulty in school.
 Concurrent validity applies when the test and the criterion can be measured at the same time.


Two Types of Criterion Validity
 Concurrent Criterion Validity = how well performance on a test estimates current performance on some valued measure (criterion)
 Predictive Criterion Validity = how well performance on a test predicts future performance on some valued measure (criterion)
 Both are only possible IF the predictors are VALID


Concurrent vs. Predictive Validity
In concurrent validity, the SAT scores and criterion measures (high school GPA) are obtained at roughly the same time (concurrently).
 If the SAT shows high concurrent validity, it will be highly correlated with GPA obtained at the same time the SAT is taken.
 In contrast, the SAT’s predictive validity would be high if your SAT score accurately predicted your college GPA, which is obtained long after taking the SAT.
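A minimal sketch of how these two coefficients might be computed, using invented SAT and GPA figures (none of the numbers are real data): the concurrent coefficient correlates the test with a criterion available at the same time, while the predictive coefficient correlates it with a criterion obtained later.

```python
import numpy as np

# Invented scores for 10 students (illustration only, not real SAT/GPA data).
sat_scores = np.array([1050, 1200, 1390, 980, 1310, 1130, 1480, 1020, 1250, 1160])

high_school_gpa = np.array([2.9, 3.2, 3.7, 2.6, 3.5, 3.0, 3.9, 2.8, 3.4, 3.1])  # criterion available now
college_gpa = np.array([2.7, 3.1, 3.6, 2.5, 3.4, 2.9, 3.8, 2.6, 3.3, 3.0])      # criterion known only later

# Concurrent validity: test and criterion measured at (roughly) the same time.
r_concurrent = np.corrcoef(sat_scores, high_school_gpa)[0, 1]

# Predictive validity: test scores correlated with a criterion obtained in the future.
r_predictive = np.corrcoef(sat_scores, college_gpa)[0, 1]

print(f"Concurrent validity coefficient: r = {r_concurrent:.2f}")
print(f"Predictive validity coefficient: r = {r_predictive:.2f}")
```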
RELIABILITY

TYPES
Reliability: The consistency of a measurement procedure
 Reliability refers to the consistency of scores obtained by the same persons when they are re-examined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions.
 Inconsistency in scores is also referred to as ERROR OF MEASUREMENT.

If a measurement device or procedure consistently assigns the same score to individuals or objects with equal values, the device is considered reliable.
The typical procedures for determining reliability include:
1. Comparing the scores from repeated testing of the same participants with the same test [test-retest reliability], or comparing scores from alternate forms of the test [alternate forms reliability]
2. Comparing scores from different parts of the test [split-half reliability], such as comparing the scores from even versus odd questions
3. Comparing scores assigned by different researchers who have observed the same event [scorer reliability, measured as inter-rater reliability]
The correlation coefficient
Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient.

A correlation coefficient (r) expresses the degree of correspondence or relationship between two sets of scores.
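Because every reliability estimate discussed below reduces to such a coefficient, a brief worked sketch may help. The two score sets here are invented; the calculation follows the standard Pearson formula and is checked against NumPy's built-in routine.

```python
import numpy as np

# Two independently derived sets of scores for the same six people (invented numbers).
scores_a = np.array([10.0, 14.0, 9.0, 16.0, 12.0, 11.0])
scores_b = np.array([11.0, 15.0, 8.0, 17.0, 11.0, 12.0])

# Pearson correlation from its definition:
# r = sum of cross-products of deviations / sqrt(sum of squared deviations of a * of b)
dev_a = scores_a - scores_a.mean()
dev_b = scores_b - scores_b.mean()
r_manual = (dev_a * dev_b).sum() / np.sqrt((dev_a ** 2).sum() * (dev_b ** 2).sum())

# The same value via NumPy's built-in routine.
r_numpy = np.corrcoef(scores_a, scores_b)[0, 1]

print(f"r (by hand)  = {r_manual:.3f}")
print(f"r (corrcoef) = {r_numpy:.3f}")   # identical value, between -1 and +1
```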
Types of reliability

1. Test-retest reliability

2. Alternate Forms Reliability

3. Split Half Reliability

4. Inter-rater reliability
Test-retest reliability
 Measure the same people twice with the same instrument. Reliable measures should produce very similar scores.
 The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion.
 The reliability coefficient (r) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test.
 Examples: IQ tests typically show high test-retest reliability. The reliability of a bathroom scale can be tested by recording your weight 2-3 times within a day or two.
Test-retest reliability
Test-retest reliability estimates are used to evaluate the error associated with administering a test at two different times.

This type of analysis is of value only when we measure ‘traits’ or characteristics that do not change over time.
For instance, we usually assume that an intelligence test
measures a consistent general ability. As such, if an IQ test
administered at two points in time produces different scores, then
we might conclude that the lack of correspondence is the result
of random measurement error. Usually we do not assume that a
person got more or less intelligent in the time between tests.
Test-retest reliability
Tests that measure constantly changing characteristics are not appropriate for test-retest evaluation.
 For example, the value of the Rorschach Inkblot Test seems to lie in telling the clinician how the client is functioning at a particular time.
 Test-retest reliability is relatively easy to evaluate. Just administer the same test on two well-specified occasions and then find the correlation between the scores from the two administrations.
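A minimal Python sketch of that procedure, using invented scores for eight people tested twice:

```python
import numpy as np

# Invented IQ-type scores for the same eight people tested twice, one month apart.
first_administration = np.array([102, 96, 118, 87, 110, 125, 93, 105])
second_administration = np.array([104, 95, 116, 90, 108, 127, 91, 103])

# The test-retest reliability coefficient is simply the correlation
# between the two administrations of the same test.
r_test_retest = np.corrcoef(first_administration, second_administration)[0, 1]

print(f"Test-retest reliability: r = {r_test_retest:.2f}")
```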
Alternate Forms Reliability
 Alternate forms reliability is also known as parallel forms reliability.
 Test-retest procedures may not be useful when participants may be able to recall their previous responses and simply repeat them upon retesting. In cases where administering the exact same test will not be a good test of reliability, we may use alternate forms.
 Practice effect or learning
 Carryover effect
Alternate Forms Reliability
As the name implies, two or more versions of the test are constructed that are equivalent in content and level of difficulty.

Professors use this technique to create makeup or replacement exams because students may already know the questions from the earlier exam.
Alternate Forms Reliability
 It compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
 When two forms of the test are available, one can compare performance on one form versus the other.
 Sometimes the two forms are administered to the same group of people on the same day.
 The order of administration is usually counterbalanced to avoid practice effects.
Alternate Forms Reliability
 The correlation between the scores obtained on the two forms represents the reliability coefficient of the test.
 The method of parallel forms provides one of the most rigorous assessments of reliability commonly in use.
 Such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms).
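A minimal Python sketch of the procedure described above, using invented scores on two hypothetical forms; the counterbalanced assignment of administration order is included only to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented scores on two equivalent forms for 12 examinees.
form_a = np.array([23, 31, 27, 19, 35, 28, 22, 30, 26, 33, 21, 29])
form_b = np.array([25, 30, 26, 20, 34, 29, 21, 31, 27, 32, 22, 28])

# Counterbalance the order of administration: half the group takes
# Form A first, the other half takes Form B first (random assignment).
order = rng.permutation(len(form_a))
takes_a_first = sorted(order[: len(order) // 2].tolist())
takes_b_first = sorted(order[len(order) // 2:].tolist())
print(f"Take Form A first: examinees {takes_a_first}")
print(f"Take Form B first: examinees {takes_b_first}")

# The alternate-forms reliability coefficient is the correlation
# between scores on the two forms.
r_alternate_forms = np.corrcoef(form_a, form_b)[0, 1]
print(f"Alternate-forms reliability: r = {r_alternate_forms:.2f}")
```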
Alternate Forms Reliability
Unfortunately, the use of parallel forms occurs in practice less often than is desirable.

Often test developers find it burdensome to develop two forms of the same test, and practical constraints make it difficult to retest the same group of individuals.

Instead, many test developers prefer to base their estimate of reliability on a single form of a test.


Split Half Reliability
A measure of consistency in which a test is split in two and the scores for each half of the test are compared with one another.

The test is given and divided into halves that are scored separately; the scores on one half of the test are then compared with the scores on the remaining half to assess reliability (Kaplan & Saccuzzo, 2001).
Split Half Reliability
 The two halves of the test can be created in a variety of ways. If the test is long, the best method is to divide the items randomly into two halves.
 A more convenient method is to compare scores on the first half of the test with scores on the second half. Although convenient, this method can cause problems when items on the second half of the test are more difficult than items on the first half. If the items get progressively more difficult, then you might be better advised to use an odd-even system, whereby one sub-score is obtained for the odd-numbered items in the test and another for the even-numbered items.
Split Half Reliability
To estimate the reliability of the test, you could find the correlation between the two halves.

This type of reliability coefficient is also called a coefficient of internal consistency.

It is apparent that split-half reliability provides a measure of consistency with regard to content sampling.
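A minimal Python sketch of the odd-even approach with a simulated (invented) item-response matrix. The Spearman-Brown correction at the end is a standard extension for estimating full-length reliability; it is not discussed in the slides and is included here only as a labeled extra.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented item-by-person score matrix: 30 respondents x 20 items,
# each item scored 0/1 (illustration only).
ability = rng.normal(0, 1, 30)
items = (ability[:, None] + rng.normal(0, 1, (30, 20)) > 0).astype(int)

# Odd-even split: one sub-score from the odd-numbered items,
# another from the even-numbered items.
odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

# Split-half reliability: the correlation between the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]
print(f"Correlation between halves: r = {r_half:.2f}")

# Extra (not in the slides): because each half is only half as long as the
# full test, the Spearman-Brown formula is often applied to estimate the
# reliability of the full-length test.
r_full = (2 * r_half) / (1 + r_half)
print(f"Spearman-Brown corrected estimate: {r_full:.2f}")
```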


Inter-item reliability
Inter-item reliability: the degree to which different items measuring the same variable attain consistent results. Scores on different items designed to measure the same construct should be highly correlated. It also goes by the name internal consistency.
Inter-item reliability
Math tests often ask you to solve several examples of the same type of problem. Your scores on these questions will normally represent your ability to solve this type of problem, and the test would have high inter-item reliability.
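The slides do not name a specific internal-consistency coefficient; Cronbach's alpha is one commonly used index, so the sketch below (with invented responses) should be read as an illustrative assumption rather than the method the slides prescribe.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                          # number of items
    item_variances = item_scores.var(axis=0, ddof=1)  # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented responses: 6 people answering 4 items that target the same construct.
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")  # high values indicate consistent items
```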
Split half reliability
When it is impractical or inadvisable to administer two tests to the same participants, it is possible to assess the reliability of some measurement procedures by examining their internal consistency. This type of reliability assessment is useful with tests that contain a series of items intended to measure the same attribute.
Inter-rater reliability
Inter-rater reliability is also called inter-rater agreement or scorer reliability.

When observers must use their own judgment to interpret the events they observe (including live or videotaped behaviors and written answers to open-ended interview questions), scorer reliability must be measured.

Have different observers take measurements of the same responses; the agreement between their measurements is called inter-rater reliability. Their results can be compared statistically and represent the scorer’s reliability.
Inter-rater reliability
 Tests of creativity and projective tests of personality leave a good deal to the judgment of the scorer.
 Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two sets of scores thus obtained are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability.
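A minimal sketch of that calculation, using invented ratings from two hypothetical examiners:

```python
import numpy as np

# Invented ratings: two examiners independently score the same ten essays (0-20 scale).
examiner_1 = np.array([14, 9, 17, 12, 15, 8, 19, 11, 13, 16])
examiner_2 = np.array([13, 10, 18, 12, 14, 9, 18, 12, 13, 15])

# Scorer (inter-rater) reliability: the correlation between the two
# examiners' independent scores for the same test papers.
r_inter_rater = np.corrcoef(examiner_1, examiner_2)[0, 1]

print(f"Inter-rater reliability: r = {r_inter_rater:.2f}")
```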
