
 Reliability – refers to the consistency of scores obtained by the same person when re-examined with the same test on different occasions, with different sets of equivalent items, or under other variable examining conditions.
 This mainly refers to the attribute of consistency in measurement.
 Charles Spearman – Key individual in the theories of
reliability.
 Measurement error is common in all fields of
science.
 Tests that are relatively free of measurement
error are considered to be reliable while tests
that contain relatively large measurement error
are considered to be unreliable.
 Classical Test Score Theory – this assumes
that each person has a true score that would
be obtained if there were no errors in
measurement.

 Measuring instruments are imperfect; therefore, the observed score for each person almost always differs from the person's true ability or characteristic.
 Measurement error – the difference between the observed score and the true score.
 X = E + T, where X = observed score, E = error, and T = true score.

 Standard error of measurement (SEM) – the standard deviation of the distribution of errors for each repeated application of the same test on an individual.
 Inversely related to reliability.
 SEM = SD√(1 – r)
 Range or band of test scores that is likely to contain the true score:
 M – 1.96(SEM) to M + 1.96(SEM)
 M = mean of test scores from the test taker. (A minimal computational sketch is given below.)
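A minimal sketch of how the SEM and the 95% band might be computed in Python, assuming a hypothetical test with SD = 15, reliability r = .91, and an obtained score of 100 (all numbers are made up for illustration):

```python
import math

sd = 15.0    # hypothetical standard deviation of the test scores
r = 0.91     # hypothetical reliability coefficient
m = 100.0    # hypothetical obtained score (M) for this test taker

sem = sd * math.sqrt(1 - r)      # SEM = SD * sqrt(1 - r)
lower = m - 1.96 * sem           # lower bound of the 95% band
upper = m + 1.96 * sem           # upper bound of the 95% band

print(f"SEM = {sem:.2f}")                       # SEM = 4.50
print(f"95% band: {lower:.2f} to {upper:.2f}")  # 91.18 to 108.82
```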
 Error (E) can either be positive or negative. If E is positive, the obtained score (X) will be higher than the true score (T); if E is negative, then X will be lower than T.

 Although it is impossible to eliminate all measurement error, test developers do strive to minimize this psychometric nuisance through careful attention to the sources of measurement error.

 It is important to stress that the true score is never known.
 Measurement error refers to all factors associated with the process of measuring a variable, other than the variable being measured.
 Random error – source of error in measuring a
targeted variable caused by unpredictable
fluctuations and inconsistencies of other variables in
the measurement process.
 Systematic error – source of error in measuring a
variable that is typically constant or proportionate to
what is presumed to be the true value of the variable
being measured.
 Factors that contribute to consistency:
 These consist entirely of those stable attributes of the individual which the examiner is trying to measure.

 Factors that contribute to inconsistency:
 These include characteristics of the individual, test, or situation, which have nothing to do with the attribute being measured, but which nonetheless affect test scores.
 Domain Sampling Model
 There is a problem in the use of limited number of
items to represent a larger and more complicated
construct.

 A sample of items is utilized instead of the infinite pool of items of the construct.
 The greater the number of items, the higher the reliability.
 Generalizability theory
 does not assume that a person has a “true” score on
intelligence, or that error is basically of one kind
 argues that different conditions may result in
different scores, and that error may reflect a variety of
sources
 it also allows for the evaluation of the interaction
effects from different types of error sources. Thus, it is
a more thorough procedure for identifying the error
variance component that may enter into scores.
A. Item selection
 One source of measurement error is the instrument itself. A test developer must settle upon a finite number of items from a potentially infinite pool of test questions.
 Which items should be included? How should they be worded?
 Although psychometricians strive to obtain representative test items, the particular set of questions chosen for a test might not be equally fair to all persons.
B. Test Administration
 General environmental conditions may exert an
untoward influence on the accuracy of measurement,
such as uncomfortable room temperature, dim
lighting, and excessive noise.

 Momentary fluctuations in anxiety, motivation, attention, and fatigue level of the test taker may also introduce sources of measurement error.
 The examiner may also contribute to measurement error in the process of test administration.
C. Test Scoring
 Whenever a psychological test uses a format other than machine-scored multiple-choice items, some degree of judgment is required to assign points to answers.
 Most tests have well-defined criteria for answers to each question. These guidelines help minimize the impact of subjective judgment in scoring.
 Tests that yield consistent and reliable scores have reliability coefficients near 1.0; conversely, tests that reflect a large amount of measurement error produce inconsistent and unreliable scores, and their reliability coefficients are close to 0.
 With the help of a computer, the item
difficulty is calibrated to the mental ability of
the test taker.

 If you get several easy items correct, the computer will then move to more difficult items.
 If you get several difficult items wrong, the computer moves back to average items. (A minimal sketch of this adaptive logic is given below.)
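A minimal, purely illustrative sketch of this adaptive logic (not the algorithm of any particular testing program); the difficulty range, step size, and responses are all hypothetical:

```python
# Illustrative computer-adaptive logic: step up to harder items after a
# correct answer, step down to easier items after a wrong answer.
def next_difficulty(current: int, was_correct: bool) -> int:
    step = 1 if was_correct else -1
    return max(1, min(5, current + step))   # difficulty kept between 1 and 5

difficulty = 3                       # start at average difficulty
responses = [True, True, False]      # hypothetical answers from a test taker
for was_correct in responses:
    difficulty = next_difficulty(difficulty, was_correct)
print(difficulty)                    # 3 -> 4 -> 5 -> 4
```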
 A correlation coefficient (r) expresses the direction and magnitude of the linear relationship between two sets of scores obtained from the same persons.
 It can take on values ranging from -1.00 to +1.00.
 When two measures have a positive (+) correlation, high/low scores on Y are associated with high/low scores on X.
 When two measures have a negative (-) correlation, high scores on Y are associated with low scores on X, and vice versa.
 Correlations of +1.00 are extremely rare in psychological research and usually signify a trivial finding. (A brief computational sketch follows.)
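A minimal sketch of computing r with the standard Pearson formula; the two score sets are made up for illustration:

```python
x = [10, 12, 9, 15, 11]   # hypothetical scores on measure X
y = [11, 14, 8, 16, 12]   # hypothetical scores on measure Y (same persons)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
var_x = sum((a - mx) ** 2 for a in x)
var_y = sum((b - my) ** 2 for b in y)

r = cov / (var_x * var_y) ** 0.5
print(round(r, 2))   # about 0.95: high X scores go with high Y scores
```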
Interpretive labels for the size of a correlation:
+1.00 PERFECT POSITIVE CORRELATION
+0.75 to +1.00 Very high positive correlation
+0.50 to +0.75 High positive correlation
+0.25 to +0.50 Moderately small positive correlation
0.00 to +0.25 Very small positive correlation
0.00 NO CORRELATION
0.00 to -0.25 Very small negative correlation
-0.25 to -0.50 Moderately small negative correlation
-0.50 to -0.75 High negative correlation
-0.75 to -1.00 Very high negative correlation
-1.00 PERFECT NEGATIVE CORRELATION
 It is established by comparing the scores obtained
from two successive measurements of the same
individuals and calculating a correlation between the
two sets of scores.

 It is also known as time-sampling reliability since it measures the error associated with administering a test at two different times.
 This is used only when we measure traits or characteristics that do not change over time (e.g., IQ).
 Example: You took an IQ test today and you will take
it again after exactly a year. If your scores are almost
the same (e.g. 105 and 107), then the measure has a
good test-retest reliability.

 Error variance – corresponds to the random fluctuations of performance from one test session to the other.
 Clearly, this type of reliability is only applicable to stable traits.
 Carryover effect – occurs when the first
testing session influences the results of the
second session and this can affect the test-
retest reliability of a psychological measure.

 Practice effect – a type of carryover effect wherein the scores on the second test administration are higher than they were on the first.
 If the results of the first and second administration have a low correlation, it might mean that:
 The test has poor reliability
 A major change had occurred in the subjects between the first and second administration.
 A combination of low reliability and major change has occurred.
 Sometimes, a poor test-retest correlation does not mean that the test is unreliable. It might mean that the variable under study has changed.
 Retest interval too short = carryover effects
 Retest interval too long = various factors might happen in between the two testing sessions.
 It is established when at least two different versions
of the test yield almost the same scores.

 It is also known as item-sampling reliability or alternate-forms reliability since it compares two equivalent forms of a test that measure the same attribute to make sure that the items indeed assess a specific characteristic.
 The correlation between the scores obtained on the two forms represents the reliability coefficient of the test.
 Examples:
▪ The Purdue Non-Language Test (PNLT) has Forms A and B, and both yield nearly identical scores for the test taker.
▪ The SRA Verbal Form has parallel Forms A and B, and both yield almost identical scores for the test taker.

 The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.
 Tests should contain the same number of items and
the items should be expressed in the same form and
should cover the same type of content. The range
and level of difficulty of the items should also be
equal. Instructions, time limits, illustrative
examples, format and all other aspects of the test
must likewise be checked for equivalence.
 One of the most rigorous and burdensome
assessments of reliability since test
developers have to create two forms of the
same test.

 Practical constraints make it difficult to retest the same group of individuals.
 Used when tests are administered once.
 Suggests that there is consistency among items
within the test.

 This model of reliability measures the internal consistency of the test, which is the degree to which each test item measures the same construct. It is simply the intercorrelations among the items.
 If all items on a test measure the same construct, then it has good internal consistency.
 It is obtained by splitting the items on a questionnaire
or test in half, computing a separate score for each
half, and then calculating the degree of consistency
between the two scores for a group of participants.

 The test can be divided according to the odd and even numbers of the items (odd-even system).
 Since each half is only half as long as the full test, the correlation between the two halves underestimates the reliability of the whole test; thus, the Spearman-Brown formula should be used to correct the correlation of the test.
 Spearman-Brown Formula
 A statistic that allows a test developer to estimate what the correlation between the two halves would have been if each half had been the length of the whole test and the halves had equal variances. (A minimal computational sketch is given below.)
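A minimal sketch of an odd-even split followed by the Spearman-Brown correction, using a small made-up matrix of 0/1 item responses (rows = test takers, columns = items):

```python
scores = [            # hypothetical responses: 4 test takers x 6 items
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

odd_totals = [sum(row[0::2]) for row in scores]    # items 1, 3, 5
even_totals = [sum(row[1::2]) for row in scores]   # items 2, 4, 6

r_half = pearson(odd_totals, even_totals)   # correlation between the two halves
r_full = (2 * r_half) / (1 + r_half)        # Spearman-Brown corrected reliability
print(round(r_half, 2), round(r_full, 2))
```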
 Cronbach's coefficient alpha
 Also called coefficient alpha or Cronbach's alpha
 Used when the two halves of the test have unequal variances.
 Provides the lowest (most conservative) estimate of reliability.
 Equivalent to the average of all possible split-half coefficients.
 Used when items are not in a right-or-wrong format (e.g., Likert scales). (A minimal computational sketch is given below.)
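A minimal sketch of coefficient alpha, alpha = [k / (k - 1)] × [1 - (sum of item variances / variance of total scores)], computed on a made-up matrix of Likert-type responses:

```python
scores = [           # hypothetical responses: 4 test takers x 3 items (1-5 scale)
    [4, 5, 4],
    [3, 3, 4],
    [2, 2, 1],
    [5, 4, 5],
]

def variance(values):                      # population variance
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

k = len(scores[0])                                           # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])           # variance of totals

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```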
 Kuder-Richardson 20 (KR20) Formula
 The statistic used for calculating the reliability of a test in which the items are dichotomous, i.e., scored as 0 or 1.
 Used for tests with a right-or-wrong format. (A minimal computational sketch is given below.)
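A minimal sketch of KR20, KR20 = [k / (k - 1)] × [1 - (sum of p·q / variance of total scores)], where p is the proportion passing an item and q = 1 - p; the responses are made up:

```python
scores = [            # hypothetical right/wrong responses: 5 test takers x 4 items
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

n, k = len(scores), len(scores[0])
p = [sum(row[i] for row in scores) / n for i in range(k)]   # proportion passing
pq_sum = sum(pi * (1 - pi) for pi in p)

totals = [sum(row) for row in scores]
mean_total = sum(totals) / n
total_var = sum((t - mean_total) ** 2 for t in totals) / n  # population variance

kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
print(round(kr20, 2))
```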
 It is the degree of agreement between two observers
who simultaneously record measurements of the
behaviors.

 Examples:
▪ Two psychologists observe the aggressive behavior of
elementary school children. If their individual records of the
construct are almost the same, then the measure has a
good inter-rater reliability.

▪ Two parents evaluated the ADHD symptoms of their child. If they both yield identical ratings, then the measure has good inter-rater reliability.
 This uses the kappa statistic in order to assess the level of agreement among raters on a nominal scale. (A minimal computational sketch is given below.)
▪ Cohen's Kappa – used to assess the agreement between 2 raters
▪ Fleiss' Kappa – used to assess the agreement among 3 or more raters
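A minimal sketch of Cohen's kappa for two raters using nominal categories, kappa = (observed agreement - chance agreement) / (1 - chance agreement); the ratings are made up:

```python
from collections import Counter

# Hypothetical nominal ratings ("A" = aggressive, "N" = not aggressive)
# given by two raters to the same 10 children.
rater1 = ["A", "A", "N", "N", "A", "N", "A", "N", "N", "A"]
rater2 = ["A", "N", "N", "N", "A", "N", "A", "N", "A", "A"]

n = len(rater1)
p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement from each rater's marginal proportions.
c1, c2 = Counter(rater1), Counter(rater2)
p_expected = sum((c1[c] / n) * (c2[c] / n) for c in set(rater1) | set(rater2))

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))   # 0.6 for these made-up ratings
```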
Method                                No. of Forms  No. of Sessions  Sources of Error Variance
Test-Retest                           1             2                Changes over time
Alternate-Forms (Immediate)           2             1                Item sampling
Alternate-Forms (Delayed)             2             2                Item sampling; changes over time
Split-Half (Spearman-Brown)           1             1                Item sampling; nature of split
Coefficient Alpha & Kuder-Richardson  1             1                Item sampling; test heterogeneity
Inter-Rater                           1             1                Scorer differences

 For tests that have two forms, use parallel-forms reliability.
 For tests that are designed to be administered to an individual more than once, use test-retest reliability.
 For tests with factorial purity, use Cronbach's coefficient alpha.
 For tests with items carefully ordered according to difficulty, use split-half reliability.
 For tests which involve some degree of subjective scoring, use inter-rater reliability.
 For tests which involve dichotomous items or forced-choice items, use KR20.
 Dynamic Characteristics – characteristics that change over time or across situations.
 Internal consistency is of best use.
 Static Characteristics – characteristics that would not vary across time.
 Test-retest and parallel forms are of best use.
 Unidimensional tests are expected to have higher internal consistency than multidimensional tests.
 Test-retest, alternate forms – "the more the better"
 Internal consistency
 Basic Research = .70 to .80
 Clinical Setting = .95 or above
 Increase the number of items.

 Use factor analysis and item analysis.

 Use the correction for attenuation formula.
 Used to estimate what the correlation between two variables would be if neither measure were affected by measurement error. (A short worked sketch is given below.)
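A short worked sketch of the correction for attenuation, which divides the observed test-criterion correlation by the square root of the product of the two reliabilities; all numbers are hypothetical:

```python
import math

r_xy = 0.40   # hypothetical observed correlation between test and criterion
r_xx = 0.80   # hypothetical reliability of the test
r_yy = 0.50   # hypothetical reliability of the criterion

# Estimated correlation if both measures were free of measurement error.
r_corrected = r_xy / math.sqrt(r_xx * r_yy)
print(round(r_corrected, 2))   # 0.40 / sqrt(0.80 * 0.50) = 0.63
```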
 It refers to the degree to which the measurement procedure measures the variable that it claims to measure (strength and usefulness).
 Gives evidence for inferences made about a test score.
 Basically, it is the agreement between a test score or measure and the characteristic it is believed to measure.
 The process of gathering and evaluating
evidence about validity.
 Local validation studies – applied when tests are altered in some way, such as format, language, or content.
 Face validity – is the simplest and least scientific
form of validity and it is demonstrated when the
face value or superficial appearance of a
measurement measures what it is supposed to
measure.
 Item seems to be reasonably related to the
perceived purpose of the test.
 Often used to motivate test takers because they can
see that the test is relevant.
 An IQ test containing items which measure memory, mathematical ability, verbal reasoning, and abstract reasoning has good face validity.
 An IQ test containing items which measure depression and anxiety has bad face validity.
 A self-esteem rating scale which has items like “I know I can do what other people can do.” and “I usually feel that I would fail on a task.” has good face validity.
 Inkblot tests have low face validity because test takers question whether the test really measures personality.
 Content
 Criterion Related
 Concurrent
 Predictive
 Construct
 Convergent
 Discriminant/Divergent
 It is concerned with the extent to which the test is
representative of a defined body of content
consisting of topics and processes.

 Content validation is not done by statistical analysis but by the inspection of items. A panel of experts can review the test items and rate them in terms of how closely they match the objective or domain specification.
 This considers the adequacy of representation of the
conceptual domain the test is designed to cover.

 If the test items adequately represent the domain of possible items for a variable, then the test has adequate content validity.
 Determination of content validity is often made by expert judgment.
 Educational Content Valid Test – syllabus is
covered in the test; usually follows the table of
specification of the test.
 Table of specification – a blueprint of the test in terms
of number of items per difficulty, topic importance, or
taxonomy.
 Employment Content Valid Test – appropriate
job related skills are included in the test. Reflects
the job specification of the test.
 Clinical Content Valid Test – symptoms of the disorder are all covered in the test. Reflects the diagnostic criteria for the disorder.
 Construct underrepresentation
 Failure to capture important components of a construct
(e.g. An English test which only contains vocabulary items
but no grammar items will have a poor content validity.)

 Construct-irrelevant variance
 Happens when scores are influenced by factors irrelevant
to the construct (e.g. test anxiety, reading speed, reading
comprehension, illness)
 Quantification of Content Validity
 Lawshe (1975) proposed a structured and systematic way of establishing the content validity of a test.
 He developed the content validity ratio (CVR) formula. (A minimal computational sketch is given below.)
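A minimal sketch of Lawshe's CVR for a single item, CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating the item "essential" and N is the total number of panelists; the panel below is hypothetical:

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel: 8 of 10 experts rate the item "essential".
print(content_validity_ratio(8, 10))   # 0.6
```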
 Tells how well a test corresponds with a particular
criterion.
 A judgment of how adequately a test score can be
used to infer an individual’s most probable standing on
some measure of interest.
 Criterion – standard against which a test or a test score is
evaluated.
 A criterion can be a test score, psychiatric diagnosis, training
cost, index of absenteeism, amount of time.
 1. Relevant
 2. Valid and Reliable
 3. Uncontaminated
 Criterion contamination – occurs when the criterion is itself based on predictor measures, so that the predictor contaminates what is supposed to be an independent criterion.
 Verbal Aptitude Test
 Self Esteem Test
 Managerial Skills Test
 Extroversion Test
 Visual-spatial skills test
 Concurrent Validity
 Both the test scores and the criterion
measures are obtained at present.

 Predictive Validity
 Test scores may be obtained at one time and
the criterion measure may be obtained in the
future after an intervening event.
 Construct – An informed scientific idea developed or
hypothesized to describe or explain a behavior; something
built by mental synthesis.
 Unobservable, presupposed traits; something that the researcher
thought to have either high or low correlation with other variables.
 Established through a series of activities in
which a researcher simultaneously defines
some construct and develops
instrumentation to measure it.
 A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
 Required when no criterion or universe of
content is accepted as entirely adequate to
define the quality being measured.
 Assembling evidence about what a test means.
 A series of statistical analyses showing that one variable is a separate, distinct variable.
 A test has good construct validity if there is an existing psychological theory which can support what the test items are measuring.
 Establishing construct validity involves both logical analysis and empirical data.
 Example: In measuring aggression, you have to check all past research and theories to see how the researchers measured that variable/construct.
 Construct validity is like proving a theory
through evidences and statistical analysis.
 1. Test is homogeneous, measuring a single construct.
 2. Test score increases or decreases as a function of age, passage of time, or experimental manipulation.
 3. Pretest-posttest differences
 4. Test scores differ across distinct groups.
 5. Test scores correlate with scores on other tests in accordance with what is predicted.
 How uniform a test is in measuring a single
concept.
 Subtest scores are correlated to the total test score.
 Coefficient alpha may be used as homogeneity
evidence.
 Spearman Rho can be used to correlate an item to
another item.
 Pearson or point-biserial correlation can be used to correlate an item to the total test score (item-total correlation). (A minimal computational sketch is given below.)
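A minimal sketch of an item-total correlation (here a point-biserial, since applying the Pearson formula to a 0/1 item gives the point-biserial); the response matrix is made up, and a corrected version would remove the item from the total first:

```python
scores = [            # hypothetical 0/1 responses: 5 test takers x 4 items
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

item = [row[0] for row in scores]      # responses to item 1
total = [sum(row) for row in scores]   # total test scores

n = len(item)
mi, mt = sum(item) / n, sum(total) / n
cov = sum((a - mi) * (b - mt) for a, b in zip(item, total))
var_i = sum((a - mi) ** 2 for a in item)
var_t = sum((b - mt) ** 2 for b in total)

print(round(cov / (var_i * var_t) ** 0.5, 2))   # item-total correlation for item 1
```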
 Evidence of change with age
 Some variable/construct are expected to change with
age.
 Evidence of pretest posttest
 Difference of scores from pretest and post test of a
defined construct after careful manipulation would
provide validity
 Evidence from distinct group
 Also called a method of contrasted group
 T-test can be used to test the difference of groups.
 Convergent evidence
 Also called convergent validity
 The test is correlated with another measure of the same or a related construct
 Ex. Depression test and Negative Affect Scale

 Divergent/Discriminant Evidence
 Also called divergent/discriminant validity
 A validity coefficient showing little or no relationship between the newly created test and an existing test that measures a different construct.
 Ex. Social Desirability test and Marital Satisfaction test
 This approach refers to a validation strategy
that requires the collection of data on two or
more distinct traits (e.g., anxiety, affiliation,
and dominance) by two or more different
methods (e.g., self-report questionnaires,
behavioral observations, and projective
techniques).
 Can be used to obtain evidence for both
convergent and discriminant validity.
 Exploratory Factor Analysis – estimating or
extracting factors; deciding how many
factors to retain; and rotating factors to an
interpretable orientation
 Looking for factors
 Confirmatory Factor Analysis – researchers
test the degree to which a hypothetical
model fits the actual data.
 Revalidation of the test to a criterion based
on another group different from the original
group from which the test was validated.
 Validity Shrinkage – decrease in validity after
cross validation.
 Co-validation – validation of more than one test
from the same group.
 Co-norming – norming more than one test from
the same group.
 1. Test bias – A factor inherent in a test that
systematically prevents accurate, impartial
measurement.
 Test fairness – the extent to which a test is used in an
impartial, just, and equitable way
 Adverse Impact – the use of the test systematically rejects higher proportions of minority applicants than non-minority applicants
 Differential Validity – the extent to which a test has
different meaning for different people
▪ Ex. A test may be a valid predictor of college success for White students but not for African-American students.
 2. Rating Error – judgment resulting from
intentional and unintentional misuse of a
rating scale
 Rating – a numerical or verbal judgment that
places a person or an attribute along a continuum
identified by a rating scale.
▪ Leniency Error/Generosity Error – an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, or grading.
▪ Severity Error – overly critical scoring.
▪ Central Tendency Error – the rater is reluctant to give ratings at either the positive or negative extreme.
▪ The rater's ratings would tend to cluster in the middle of the continuum.
▪ Halo Effect – the tendency to give a ratee a higher rating than he/she objectively deserves because of the rater's failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior.
▪ Tendency to ascribe positive attributes independently of the observed behavior.
 Reliability and validity are partially related
and partially independent.
 Reliability is a prerequisite for validity,
meaning a measurement cannot be valid
unless it is reliable.
 It is not necessary for a measurement to be
valid for it to be considered reliable.
 r₁₂max = √(r₁₁ × r₂₂)
 r₁₂max = maximum validity
 r₁₁ = reliability of the test
 r₂₂ = reliability of the criterion
 (A short worked example is given below.)
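A short worked example with hypothetical reliabilities:

```python
import math

r11 = 0.81   # hypothetical reliability of the test
r22 = 0.64   # hypothetical reliability of the criterion

r12_max = math.sqrt(r11 * r22)   # maximum possible validity coefficient
print(round(r12_max, 2))         # 0.72
```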
