
PSYCHOMETRIC CHARACTERISTICS OF TESTS – RELIABILITY AND VALIDITY

VALIDITY: TYPES AND EXAMPLES


Validity
 Validity is the extent to which a test measures what it is supposed to measure.
 The question of validity is raised in the context of three points: the form of the test, the purpose of the test, and the population for whom it is intended. Therefore, we cannot ask the general question “Is this a valid test?”.
 The question to ask is “How valid is this test for the decision that I need to make?” or “How valid is the interpretation I propose for the test?”
Definition
Validity has been described as ‘the agreement between a test score or measure and the quality it is believed to measure’ (Kaplan and Saccuzzo, 2001).

In other words, validity concerns the gap between what a test actually measures and what it is intended to measure.


Validity
 Depends on the PURPOSE
 E.g. a ruler may be a valid measuring device for length, but is not very valid for measuring volume
 Measuring what ‘it’ is supposed to measure
 Matter of degree (how valid?)
 Specific to a particular purpose!
 Must be inferred from evidence; cannot be directly measured
Validity (actually studying the variables that we wish to study)
 Types of validity
1. Face validity
2. Construct validity
3. Content validity
4. Criterion validity
   a) Predictive validity
   b) Concurrent validity
Face Validity
Face validity refers to the degree to which a test ‘appears’ to measure what it purports to measure.

Face validity means simply that the test appears to be valid. It is judged using common-sense rules, for example that a mathematical test should include some numerical elements.
Face Validity
Face validity is the mere appearance that a measure has validity.

We often say a test has face validity if the items seem to be reasonably related to the perceived purpose of the test.
Face Validity
For example, a scale to measure anxiety might include items such as
“My stomach gets upset when I think about taking tests.”
and
“My heart starts pounding fast whenever I think about all of the things I need to get done.”


Face Validity
On the basis of positive responses to these items, can we
conclude that the person is anxious?
Remember, validity requires evidence in order to justify conclusions.
In this case we can only conclude that the person answers
these two items in a particular way. If we want to conclude
that a person has a particular problem with anxiety, then we
need systematic evidence that shows how responses to these
items relate to the psychological condition of anxiety.
Face validity is really not validity at all because it does not
offer evidence to support conclusions drawn from test scores.
Face Validity
Face validity is not totally unimportant. In many settings,
it is crucial to have a test that “looks like” it is valid.
These appearances can help motivate test takers because
they can see that the test is relevant.

For example, suppose you developed a test to screen applicants for a training program in accounting. Items that ask about balance sheets and ledgers might make applicants more motivated than items about fuel consumption. However, both types of items might be testing the same arithmetic reasoning skill.
A test that does not have face validity may be rejected by test-takers (if they have that option) and also by people who are choosing which test to use from amongst a set of options.
Construct Validity
Construct validity is the degree to which a test measures an intended hypothetical construct.

Psychologists often assess or measure abstract attributes or constructs. The process of validating the interpretations about that construct, as indicated by the test score, is construct validation.

Underlying many tests is a construct or theory that is being assessed. For example, there are a number of constructs for describing intelligence (spatial ability, verbal reasoning, etc.), which a test will individually assess.

If the construct is not valid, then the test on which it is based will not be valid. For example, there have been historical constructs holding that intelligence is based on the size and shape of the skull.
Construct Validity
Do my dependent variables actually measure the hypothetical construct that I want to test?

Does my IQ test really measure IQ, and nothing else?

Do my procedures actually measure learning (without being influenced by motivation)?

Does my personality test really measure personality traits without including fatigue?


 In other words, we need to establish the construct validity of measures of intelligence by showing that scores on tests that theoretically measure intelligence do, indeed, predict performance on other tasks that require intelligence.
 If we were measuring altruism, we would expect that people who stop to help stranded motorists are also the people who donate to charities, while those who do not stop also do not donate.
Comparing Face Validity with Construct Validity
Face validity: the consensus that a measure represents a particular concept – the face value of the measure.

Construct validity: the accuracy with which a measure represents the particular concept, without the influence of additional factors. Construct validity implies that other operational definitions of the same construct will yield correlated results.
Convergent vs. Divergent Validity
In order to demonstrate construct validity, we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ.

The former is convergent validity and the latter is divergent validity.
Convergent vs. Divergent Validity

Convergent Validity
When a measure correlates well with other tests believed to measure the same construct, convergent evidence for validity is obtained.

This sort of evidence shows that measures of the same construct converge, or narrow in, on the same thing.

Divergent Validity
Divergent validity is also called discriminant validity.
 To demonstrate divergent validity, a test should have low correlations with measures of unrelated constructs, or evidence for what the test does not measure.
 Divergent evidence for validity indicates that the measure does not represent a construct other than the one for which it was devised.
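The logic above can be sketched numerically. The snippet below is a minimal Python illustration with simulated data (the anxiety and vocabulary measures are invented, not taken from the slides): a new anxiety scale should correlate strongly with an established anxiety measure (convergent evidence) and only weakly with a vocabulary test (divergent evidence).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical sample of respondents

# Simulated construct-validation study (illustrative data only).
true_anxiety = rng.normal(50, 10, n)

new_scale = true_anxiety + rng.normal(0, 4, n)          # new anxiety measure
established_scale = true_anxiety + rng.normal(0, 4, n)  # existing anxiety measure (same construct)
vocabulary_test = rng.normal(100, 15, n)                # unrelated construct

r_convergent = np.corrcoef(new_scale, established_scale)[0, 1]  # expected to be high
r_divergent = np.corrcoef(new_scale, vocabulary_test)[0, 1]     # expected to be near zero

print(f"Convergent evidence: r = {r_convergent:.2f}")
print(f"Divergent evidence:  r = {r_divergent:.2f}")
```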
Content Validity
A test has content validity if it sufficiently covers the area that it is intended to cover. This is particularly important in ability or attainment tests that assess skills or knowledge in a particular domain.

Traditionally, content validity has been of greatest concern in educational testing.


Content Validity
Content under-representation occurs when important areas are missed.

Construct-irrelevant variation occurs when irrelevant factors contaminate the test.

For example, if we want to test knowledge of Pakistan's geography, it is not fair to have most questions limited to the geography of England.
Content Validity
Example:
 Do the questions on an exam accurately reflect what you have learned in the course, or were the exam questions sampled from only a sub-section of the material? A test to measure your knowledge of mathematics should not be limited to addition problems, nor should it include questions about Urdu literature. It should cover the entire range of math problems appropriate to what you are trying to measure.
Content Validity
 Like face validity, content validity is determined by experts who
can judge the representativeness of the items on the test. When
studying hypothetical constructs, both face and content validity
may be limited in value because they may fail in either of two
ways:

(a) A measure may look valid when it does not, in fact, measure the
underlying construct, and

(b) A measure that does not appear to be valid may actually be an excellent indicator of the underlying construct.
Criterion Validity
 Criterion validity evidence tells us just how well a test corresponds with a particular criterion.
 Such evidence is provided by high correlations between a test and a well-defined criterion measure.
 A criterion is the standard against which the test is compared.
 For example, a test might be used to predict which engaged couples will have successful marriages and which ones will get divorced. Marital success is the criterion, but it cannot be known at the time the couples take the premarital test.
Criterion Validity
 A powerful indicator of the validity of a measure is its ability to accurately predict performance on other, independent outcome measures (referred to as criterion measures).
 The extent to which your SAT score predicts your college GPA is an indication of the SAT’s criterion validity.
 There are two approaches to criterion validity:
Concurrent validity and
Predictive validity
Predictive Validity
 The forecasting function of tests is actually a type or form of criterion validity evidence known as predictive validity.
 For example, the SAT provides predictive validity evidence as a college admissions test if it accurately forecasts how well high school students will do in their college studies.
 The SAT is the predictor variable (test), and the college GPA is the criterion.
 The purpose of the test is to predict the likelihood of succeeding on the criterion – that is, achieving a high GPA in college. A valid test for this purpose would greatly help college admissions committees because they would have some idea about which students would most likely succeed.
Concurrent Validity
 Concurrent evidence for validity comes from assessments of the simultaneous relationship between the test and the criterion – such as between a learning disability test and school performance.
 Here the measure (test) and criterion measure are taken at the same time because the test is designed to explain why the person is now having difficulty in school.
 Concurrent validity applies when the test and the criterion can be measured at the same time.


Two Types of Criterion Validity
 Concurrent Criterion Validity = how well performance on a test estimates current performance on some valued measure (criterion)
 Predictive Criterion Validity = how well performance on a test predicts future performance on some valued measure (criterion)
 Both are only possible IF the predictors are VALID


Concurrent vs. Predictive Validity
In concurrent validity, the SAT scores and criterion measures (high school GPA) are obtained at roughly the same time (concurrently).
 If the SAT shows high concurrent validity, it will be highly correlated with GPA obtained at the same time the SAT is taken.
 In contrast, the SAT’s predictive validity would be high if your SAT score accurately predicted your college GPA, which is obtained long after taking the SAT.
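A minimal sketch of how these two coefficients might be computed, using invented SAT and GPA figures (none of the numbers are real data): the concurrent coefficient correlates the test with a criterion available at the same time, while the predictive coefficient correlates it with a criterion obtained later.

```python
import numpy as np

# Invented scores for 10 students (illustration only, not real SAT/GPA data).
sat_scores = np.array([1050, 1200, 1390, 980, 1310, 1130, 1480, 1020, 1250, 1160])

high_school_gpa = np.array([2.9, 3.2, 3.7, 2.6, 3.5, 3.0, 3.9, 2.8, 3.4, 3.1])  # criterion available now
college_gpa = np.array([2.7, 3.1, 3.6, 2.5, 3.4, 2.9, 3.8, 2.6, 3.3, 3.0])      # criterion known only later

# Concurrent validity: test and criterion measured at (roughly) the same time.
r_concurrent = np.corrcoef(sat_scores, high_school_gpa)[0, 1]

# Predictive validity: test scores correlated with a criterion obtained in the future.
r_predictive = np.corrcoef(sat_scores, college_gpa)[0, 1]

print(f"Concurrent validity coefficient: r = {r_concurrent:.2f}")
print(f"Predictive validity coefficient: r = {r_predictive:.2f}")
```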
RELIABILITY

TYPES
Reliability: The consistency of a measurement procedure
 Reliability refers to the consistency of scores obtained by the same persons when they are re-examined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions.
 Inconsistency in scores is also referred to as ERROR OF MEASUREMENT.

If a measurement device or procedure consistently assigns the same score to individuals or objects with equal values, the device is considered reliable.
The typical procedures for determining reliability include:
1. Comparing the scores from repeated testing of the same participants with the same test [test-retest reliability], or comparing scores from alternate forms of the test [alternate forms reliability]
2. Comparing scores from different parts of the test [split-half reliability], such as comparing the scores from even versus odd questions
3. Comparing scores assigned by different researchers who have observed the same event [scorer reliability, measured as inter-rater reliability]
The correlation coefficient
Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient.

A correlation coefficient (r) expresses the degree of correspondence or relationship between two sets of scores.
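Because every reliability estimate discussed below reduces to such a coefficient, a brief worked sketch may help. The two score sets here are invented; the calculation follows the standard Pearson formula and is checked against NumPy's built-in routine.

```python
import numpy as np

# Two independently derived sets of scores for the same six people (invented numbers).
scores_a = np.array([10.0, 14.0, 9.0, 16.0, 12.0, 11.0])
scores_b = np.array([11.0, 15.0, 8.0, 17.0, 11.0, 12.0])

# Pearson correlation from its definition:
# r = sum of cross-products of deviations / sqrt(sum of squared deviations of a * of b)
dev_a = scores_a - scores_a.mean()
dev_b = scores_b - scores_b.mean()
r_manual = (dev_a * dev_b).sum() / np.sqrt((dev_a ** 2).sum() * (dev_b ** 2).sum())

# The same value via NumPy's built-in routine.
r_numpy = np.corrcoef(scores_a, scores_b)[0, 1]

print(f"r (by hand)  = {r_manual:.3f}")
print(f"r (corrcoef) = {r_numpy:.3f}")   # identical value, between -1 and +1
```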
Types of reliability

1. Test-retest reliability

2. Alternate Forms Reliability

3. Split Half Reliability

4. Inter-rater reliability
Test-retest reliability
 Measure the same people twice with the same instrument. Reliable measures should produce very similar scores.
 The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion.
 The reliability coefficient (r) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test.
 Examples: IQ tests typically show high test-retest reliability. The reliability of a bathroom scale can be tested by recording your weight 2-3 times within a day or two.
Test-retest reliability
Test-retest reliability estimates are used to evaluate the error associated with administering a test at two different times.

This type of analysis is of value only when we measure ‘traits’ or characteristics that do not change over time.
For instance, we usually assume that an intelligence test
measures a consistent general ability. As such, if an IQ test
administered at two points in time produces different scores, then
we might conclude that the lack of correspondence is the result
of random measurement error. Usually we do not assume that a
person got more or less intelligent in the time between tests.
Test-retest reliability
Tests that measure constantly changing characteristics are not appropriate for test-retest evaluation.
 For example, the value of the Rorschach Inkblot Test seems to lie in telling the clinician how the client is functioning at a particular time.
 Test-retest reliability is relatively easy to evaluate. Just administer the same test on two well-specified occasions and then find the correlation between the scores from the two administrations.
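A minimal Python sketch of that procedure, using invented scores for eight people tested twice:

```python
import numpy as np

# Invented IQ-type scores for the same eight people tested twice, one month apart.
first_administration = np.array([102, 96, 118, 87, 110, 125, 93, 105])
second_administration = np.array([104, 95, 116, 90, 108, 127, 91, 103])

# The test-retest reliability coefficient is simply the correlation
# between the two administrations of the same test.
r_test_retest = np.corrcoef(first_administration, second_administration)[0, 1]

print(f"Test-retest reliability: r = {r_test_retest:.2f}")
```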
Alternate Forms Reliability
 Alternate forms reliability is also known as parallel forms reliability.
 Test-retest procedures may not be useful when participants may be able to recall their previous responses and simply repeat them upon retesting. In cases where administering the exact same test will not be a good test of reliability, we may use alternate forms.
 Practice effect or learning
 Carryover effect
Alternate Forms Reliability
As the name implies, two or more versions of the test are constructed that are equivalent in content and level of difficulty.

Professors use this technique to create makeup or replacement exams because students may already know the questions from the earlier exam.
Alternate Forms Reliability
 It compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
 When two forms of the test are available, one can compare performance on one form versus the other.
 Sometimes the two forms are administered to the same group of people on the same day.
 The order of administration is usually counterbalanced to avoid practice effects.
Alternate Forms Reliability
 The correlation between the scores obtained on the two forms represents the reliability coefficient of the test.
 The method of parallel forms provides one of the most rigorous assessments of reliability commonly in use.
 Such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms).
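A minimal Python sketch of the procedure described above, using invented scores on two hypothetical forms; the counterbalanced assignment of administration order is included only to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented scores on two equivalent forms for 12 examinees.
form_a = np.array([23, 31, 27, 19, 35, 28, 22, 30, 26, 33, 21, 29])
form_b = np.array([25, 30, 26, 20, 34, 29, 21, 31, 27, 32, 22, 28])

# Counterbalance the order of administration: half the group takes
# Form A first, the other half takes Form B first (random assignment).
order = rng.permutation(len(form_a))
takes_a_first = sorted(order[: len(order) // 2].tolist())
takes_b_first = sorted(order[len(order) // 2:].tolist())
print(f"Take Form A first: examinees {takes_a_first}")
print(f"Take Form B first: examinees {takes_b_first}")

# The alternate-forms reliability coefficient is the correlation
# between scores on the two forms.
r_alternate_forms = np.corrcoef(form_a, form_b)[0, 1]
print(f"Alternate-forms reliability: r = {r_alternate_forms:.2f}")
```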
Alternate Forms Reliability
Unfortunately, the use of parallel forms occurs in practice less often than is desirable.

Often test developers find it burdensome to develop two forms of the same test, and practical constraints make it difficult to retest the same group of individuals.

Instead, many test developers prefer to base their estimate of reliability on a single form of a test.


Split Half Reliability
A measure of consistency in which a test is split in two and the scores for each half of the test are compared with one another.

The test is given and divided into halves that are scored separately; the scores on one half of the test are then compared with the scores on the remaining half to assess reliability (Kaplan & Saccuzzo, 2001).
Split Half Reliability
 The two halves of the test can be created in a variety of ways. If the test is long, the best method is to divide the items randomly into two halves.
 A more convenient method is to compare scores on the first half of the test with scores on the second half. Although convenient, this method can cause problems when items on the second half of the test are more difficult than items on the first half. If the items get progressively more difficult, then you might be better advised to use an odd-even system, whereby one sub-score is obtained for the odd-numbered items in the test and another for the even-numbered items.
Split Half Reliability
To estimate the reliability of the test, you could find the correlation between the two halves.

This type of reliability coefficient is also called a coefficient of internal consistency.

It is apparent that split-half reliability provides a measure of consistency with regard to content sampling.
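A minimal Python sketch of the odd-even approach with a simulated (invented) item-response matrix. The Spearman-Brown correction at the end is a standard extension for estimating full-length reliability; it is not discussed in the slides and is included here only as a labeled extra.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented item-by-person score matrix: 30 respondents x 20 items,
# each item scored 0/1 (illustration only).
ability = rng.normal(0, 1, 30)
items = (ability[:, None] + rng.normal(0, 1, (30, 20)) > 0).astype(int)

# Odd-even split: one sub-score from the odd-numbered items,
# another from the even-numbered items.
odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

# Split-half reliability: the correlation between the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]
print(f"Correlation between halves: r = {r_half:.2f}")

# Extra (not in the slides): because each half is only half as long as the
# full test, the Spearman-Brown formula is often applied to estimate the
# reliability of the full-length test.
r_full = (2 * r_half) / (1 + r_half)
print(f"Spearman-Brown corrected estimate: {r_full:.2f}")
```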


Inter-item reliability
Inter-item reliability: the degree to which different items measuring the same variable attain consistent results. Scores on different items designed to measure the same construct should be highly correlated. It also goes by the name internal consistency.
Inter-item reliability
Math tests often ask you to solve several examples of the same type of problem. Your scores on these questions will normally represent your ability to solve this type of problem, and the test would have high inter-item reliability.
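The slides do not name a specific internal-consistency coefficient; Cronbach's alpha is one commonly used index, so the sketch below (with invented responses) should be read as an illustrative assumption rather than the method the slides prescribe.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                          # number of items
    item_variances = item_scores.var(axis=0, ddof=1)  # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented responses: 6 people answering 4 items that target the same construct.
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")  # high values indicate consistent items
```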
Split half reliability
When it is impractical or inadvisable to administer two tests to the same participants, it is possible to assess the reliability of some measurement procedures by examining their internal consistency. This type of reliability assessment is useful with tests that contain a series of items intended to measure the same attribute.
Inter-rater reliability
Inter-rater reliability is also called inter-rater agreement or scorer reliability.

When observers must use their own judgment to interpret the events they observe (including live or videotaped behaviors and written answers to open-ended interview questions), scorer reliability must be measured.

Have different observers take measurements of the same responses; the agreement between their measurements is called inter-rater reliability. Their results can be compared statistically and represent the scorer’s reliability.
Inter-rater reliability
 Tests of creativity and projective tests of personality leave a good deal to the judgment of the scorer.
 Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two sets of scores thus obtained are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability.
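A minimal sketch of that calculation, using invented ratings from two hypothetical examiners:

```python
import numpy as np

# Invented ratings: two examiners independently score the same ten essays (0-20 scale).
examiner_1 = np.array([14, 9, 17, 12, 15, 8, 19, 11, 13, 16])
examiner_2 = np.array([13, 10, 18, 12, 14, 9, 18, 12, 13, 15])

# Scorer (inter-rater) reliability: the correlation between the two
# examiners' independent scores for the same test papers.
r_inter_rater = np.corrcoef(examiner_1, examiner_2)[0, 1]

print(f"Inter-rater reliability: r = {r_inter_rater:.2f}")
```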
