Explain the relationship of reliability to validity under the parallel test model
Identify common statistical analysis methods for reliability and validity studies
02/28/2024 2
Contents
• Intermethod and intramethod reliability
• Measures of reliability are primarily important for what they reveal about the
validity of a measure
• Emphasized for both design and interpretation of reliability studies
• Choice of analytical technique for a validity or reliability study depends on:
1. The scale on which the exposure is measured:
• Continuous variable
• Nominal categorical (e.g. dichotomous) variable, or
• Ordered categorical variable
• E.g.: for a reliability study in which a continuous exposure measure from proxy
respondents was compared with the same measure from subjects themselves,
• Another version (R4) might be used if both were included but a factor
indicating whether the interview was by index or proxy respondent was to be
adjusted for in the full study.
Reliability
• Used to refer to the reproducibility of a measure
o How consistently a measurement can be repeated on the same subjects
• In most fields of study, the term refers to intramethod reliability
• Less work exists on design and interpretation of intermethod reliability studies!
• Can be assessed in a number of ways
Intramethod reliability
• A measure of the reproducibility of an instrument, either applied:
o In the same manner to same subjects at two or more points in time: test–
retest reliability, or
o By two or more data collectors to the same subjects: inter-rater reliability
• Intermethod reliability studies are sometimes called validity studies or validation studies
• They are also called method comparison studies, or studies of the relative validity of one measure in comparison to another
Internal consistency reliability:
• Can be assessed when the exposure measure for each individual in the parent
epidemiological study is a sum or average of two or more individual items.
• E.g.:
• A disability scale calculated as the sum of multiple questions or
• Serum beta-carotene for each subject, averaged over three blood draws
• Reliability studies help to estimate the influence of different sources of
variation on scores.
A model of reliability and measures of reliability
• Suppose that each person in a population of interest is measured twice,
• Either with one instrument or
• Two instruments that purport to measure the same exposure
• If two instruments are used, X1 will denote the measure of interest, i.e. the
one to be used in the epidemiological study, and X2 the comparison measure.
• For a given subject i, two (continuous) exposure measurements, Xi1 and Xi2,
are obtained.
A simple model that could apply to intermethod or intramethod reliability studies is

Xi1 = Ti + b1 + Ei1
Xi2 = Ti + b2 + Ei2

or, in general, Xij = Ti + bj + Eij

where Ti is subject i's true exposure, bj is the bias of measure j, and Eij is a
random error term with μE1 = μE2 = 0
• In the population, X1, X2, T, E1, and E2 are random variables, each with its
own distribution.
• A reliability study can yield estimates of μX1 , μX2 and the correlation ρX1X2
between the two measures, termed the reliability coefficient.
• Bias and the validity coefficient, two measures of the validity of a continuous
exposure measure, are important in assessing the impact of measurement
error
Measurement of bias in a measure and differential bias
• Reliability studies often cannot provide information on the bias in X1 or X2.
• Only the difference between the biases of X1 and X2 can be observed. Based on
the model, the difference between the population means of the two measures is
equal to the difference between their biases: μX1 − μX2 = b1 − b2
• This difference is often not very informative!
• If a similar degree of bias is present in both measures—for example, if the same
miscalibrated scale is used to weigh each subject twice—the difference between the
means of the two measures can be close to zero even when there is considerable
bias in both measures.
• However, if X2 is an unbiased measure of T (b2 = 0), then μX1 − μX2 = b1, the bias in X1
• Thus, only when the comparison measure X2 is a perfect measure or when X2 can
be assumed to be unbiased (e.g. a well-calibrated scale) can a reliability study
yield information about the bias in X1.
• Differential bias in the exposure measure between cases and controls is a major
concern because it can lead to invalid results in an epidemiological study.
(Differential precision may also be a concern)
• The difference between the biases in X1 between cases and controls (b1D − b1N) can
be measured only if the comparison measure X2 is perfect or unbiased, or if there
is non-differential bias in X2 (b2D − b2N = 0)
• Then, if the simple additive model given above holds for both cases and controls,
b1D − b1N = (μX1D − μX2D) − (μX1N − μX2N)
Class exercise
• To assess differential bias between breast cancer cases and controls in
a retrospective food frequency estimate of dietary fibre intake (X1), a
reliability study was conducted within an existing cohort study. X1
was compared with the prospective (pre-diagnostic) food frequency
estimate of fibre intake (X2) from the cohort study. Cases reported
19.5 grams of fibre prospectively and 20.0 grams retrospectively.
Controls reported 20.5 grams of fibre prospectively and 20.2 grams
retrospectively. What is the differential bias in X1 assuming that there
is reasonable certainty that any bias in X2 is equal for cases and
controls?
Class exercise: solution
Given information:
• Measurement one (X1): retrospective estimate
• Measurement two (X2): prospective estimate
• μX1D = 20.0 grams, μX2D = 19.5 grams (cases)
• μX1N = 20.2 grams, μX2N = 20.5 grams (controls)
Since there is reasonable certainty that any bias in X2 is equal for cases and
controls (b2D − b2N = 0), the differential bias in X1 can be estimated as
(b1D − b1N) − (b2D − b2N) = (μX1D − μX2D) − (μX1N − μX2N)
= (20.0 − 19.5) − (20.2 − 20.5)
= 0.5 − (−0.3)
= 0.8 grams
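The arithmetic above can be checked with a short script (a minimal sketch; the function name and argument order are illustrative):

```python
# Differential bias in X1, assuming the simple additive error model and
# non-differential bias in X2 (b2D = b2N); function name is illustrative.
def differential_bias(mu_x1_cases, mu_x2_cases, mu_x1_controls, mu_x2_controls):
    """Estimate b1D - b1N as (mean X1 - mean X2) in cases minus controls."""
    return (mu_x1_cases - mu_x2_cases) - (mu_x1_controls - mu_x2_controls)

bias = differential_bias(20.0, 19.5, 20.2, 20.5)
print(round(bias, 1))  # 0.8 grams
```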
Relationship of reliability to validity under the parallel test model
• Thus measures of reliability may be used to estimate at least some of the effects of
measurement error in the absence of a measure of bias.
• Assumptions of the parallel test model (PTM):
1. The error variables E1 and E2 are not correlated with the true value T, or with each other
2. E1 and E2 have equal variances
• Summary of PTM:
• Two measures are parallel measures of T if they have equal error variances AND uncorrelated errors!
• This generally includes the assumption that b1 = b2 = 0
• Under the assumptions of parallel tests it can be shown that ρX1X2 = ρ²TX1
• Or equivalently, ρTX1 = √ρX1X2
• Important!
• Because it shows that, if the assumptions are correct, the reliability coefficient (a measure of
the correlation between two imperfect measures) can be used to estimate the correlation
between T and X1 without having a perfect measure of T.
• These expressions apply only when the reliability coefficient is restricted to the
correlation between parallel measures of T.
• However, we use the term reliability coefficient to refer to the correlation ρX1X2 between
measures of the same exposure even when the assumptions of parallel tests do not hold.
• This means that, for a given instrument X1 applied to a given population, the reliability
coefficient will vary with the choice of X2.
• In real reliability studies, the assumptions of parallel tests are often incorrect.
• Two common violations : unequal variances of E1 and E2, and correlated errors.
• Even when these assumptions are violated, the correlation between X1 and X2 can still
provide some information about the validity coefficient of X1.
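The parallel-test result ρX1X2 = ρ²TX1 can be illustrated by simulation; this is a sketch with arbitrary, hypothetical means and variances, not data from any real study:

```python
import numpy as np

# Parallel test model: X1 = T + E1, X2 = T + E2 with uncorrelated errors of
# equal variance. The reliability coefficient should approximate the squared
# validity coefficient. All distribution parameters below are illustrative.
rng = np.random.default_rng(0)
n = 200_000
T = rng.normal(50, 10, n)    # true exposure (variance 100)
E1 = rng.normal(0, 5, n)     # error of measure 1 (variance 25)
E2 = rng.normal(0, 5, n)     # error of measure 2, same variance
X1, X2 = T + E1, T + E2

reliability = np.corrcoef(X1, X2)[0, 1]        # rho_X1X2, theory: 100/125 = 0.8
validity_sq = np.corrcoef(T, X1)[0, 1] ** 2    # rho_TX1 squared, theory: 0.8
print(round(reliability, 2), round(validity_sq, 2))  # both ~0.80
```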
Relationship of reliability to validity under unequal variances of E1 and E2
• This assumption is incorrect for certain reliability studies, particularly for many
intermethod reliability studies.
• First, consider a true validity study where X1, the exposure measure of interest, is
compared with a perfect measure of exposure, termed X2 (X2 = T).
• Then, by definition, ρX1X2 = ρTX1, the validity coefficient of X1
• However, a perfect measure is often not available, and so the exposure measure of interest
X1 is often compared with an imperfect but more precise measure, X2.
→ ρTX2 > ρTX1
• It can be shown that, when X2 is more precise than X1 and the errors in X1 and X2 are
uncorrelated, the reliability coefficient yields an upper and a lower bound for the
validity coefficient of X1:
ρX1X2 ≤ ρTX1 ≤ √ρX1X2
• The lower bound ρX1X2 corresponds to interpreting the study as if X2 were a perfect
measure, and the upper bound √ρX1X2 to interpreting it as if X2 had equal error
variance (parallel tests).
• The more accurate X2 is, the closer the lower bound ρX1X2 is to ρTX1
• In an effort to find a comparison measure X2 with an error uncorrelated with
the error in X1, the chosen comparison measure may be less accurate than X1 (ρTX2 < ρTX1)
• More specifically, X2 may be a less accurate measure than X1 of T, the true exposure that
X1 is attempting to measure.
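As a small numeric illustration of these bounds (the observed reliability coefficient r = 0.5 here is hypothetical):

```python
import math

# Bounds on the validity coefficient when X2 is at least as precise as X1 and
# the errors are uncorrelated: r <= rho_TX1 <= sqrt(r). r = 0.5 is hypothetical.
r = 0.5
lower, upper = r, math.sqrt(r)
print(f"{lower:.3f} <= validity coefficient of X1 <= {upper:.3f}")  # 0.500 ... 0.707
```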
A model of reliability allowing for correlated errors
• Uncorrelated errors: the PTM assumption that is most often violated
• The errors in X1 and X2 are often positively correlated
• Errors are correlated when the same subjects who have large positive errors on the first
measure tend to have positive errors on the second measure
• If X1 and X2 have a similar bias, this alone does not lead to correlated errors,
because a bias adds a constant error for all subjects.
• A model for reliability that makes the correlated errors explicit is:
Eij = Bi + Fij
where
• Error terms Ei1 and Ei2 for a given subject are the sum of two parts:
• Part Bi
• Which repeats itself on each measure of subject i,
• Termed the within-subject bias
• Part Fij
• Which varies between measures
• Termed the random error
• E1 and E2 are correlated because they both include the within-subject bias.
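The effect of a shared within-subject bias on the reliability coefficient can be sketched by simulation (unit variances chosen arbitrarily; all names illustrative):

```python
import numpy as np

# Error model E_ij = B_i + F_ij: the within-subject bias B repeats in both
# measures, so the test-retest correlation overstates the (squared) validity.
rng = np.random.default_rng(1)
n = 200_000
T = rng.normal(0, 1, n)            # true exposure
B = rng.normal(0, 1, n)            # within-subject bias, shared by X1 and X2
F1, F2 = rng.normal(0, 1, (2, n))  # independent random errors
X1, X2 = T + B + F1, T + B + F2

reliability = np.corrcoef(X1, X2)[0, 1]        # theory: 2/3 (inflated)
validity_sq = np.corrcoef(T, X1)[0, 1] ** 2    # theory: 1/3
print(round(reliability, 2), round(validity_sq, 2))  # ~0.67 vs ~0.33
```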
Measurement error in Xij, the jth measure on subject i.
Relationship of reliability to validity under correlated errors
• When a reliability study is conducted in which X1 and X2 have correlated errors,
the reliability coefficient ρX1X2 is artificially high.
• Reliability coefficient only measures (i.e. is only reduced by) the components of
error that are not correlated.
• A reliability study with correlated error does not capture all components of error
because it does not measure the part of the error that is repeated within subjects
(the within-subject bias).
• When errors (E1 & E2) of measures in a reliability study are positively correlated
• Reliability study can only yield an upper limit for the validity coefficient.
• Thus:
• A measure can be reliable (repeatable) even if it has poor validity
• While a low reliability coefficient implies poor validity, a high reliability does not
necessarily imply a high validity coefficient.
• The high reliability may be due instead to the correlated error.
o Certain high-fat foods eaten frequently by a few subjects may have been
omitted from the questionnaire.
- Those subjects would have their fat intake consistently underestimated.
o Time period assessed by instrument (diet in last year) may differ from true
time period of interest (e.g. diet over the last 5 years).
- Then those who have lowered the fat in their diet in recent years will
have their fat intake underestimated on both administrations of the
instrument compared with their true average fat intake.
• Correlated errors commonly occur in intramethod studies, but they could occur
in intermethod studies as well.
Issues in the design of validity and reliability studies
• Most of these issues are also important in interpreting reliability studies carried
out by others.
Purpose and timing of the reliability study
• A validity or reliability study of an instrument should be carried out first when:
o A new instrument is to be developed for an epidemiological study, or
o An existing one is to be applied to a substantially different population
• Reliability studies conducted before the main epidemiological study, or early in its
course, can be used to evaluate but also to improve the instrument.
• An additional use of reliability studies is to estimate the impact of exposure
measurement error on the results of the parent epidemiological study after it has been completed
• Information from a reliability study conducted on a subset of subjects concurrently with the
epidemiological study can yield information about the validity of the exposure measure.
• This information can be used to adjust the observed odds ratio (ORo) for the effects of measurement error.
Choice of comparison measures
• In reviewing reliability studies by others, the following are key issues to be considered:
o Was the comparison method used close to perfect, a more precise measure of
the true exposure than X1, or a less precise measure than X1?
o If two or more measures with correlated errors were used, were the errors likely
to be strongly or weakly correlated?
• The answers to these questions will guide the interpretation of the reliability
study
Selection of subjects for reliability studies
• Ideally, subjects should be a random sample of the population in which the study will be carried out.
Timing and order of measures
• When two periods of testing are well separated, the two measures of
exposure may refer to different time periods.
• Thus some lack of correlation between them may be due to true change in
exposure over time.
• Intermethod reliability studies:
• Instrument to be evaluated is given first
• Comparison measure is usually less prone to error and therefore may be less affected by
recall of the prior measurement.
Analysis of validity and reliability studies
Selecting the appropriate measures of validity or reliability
• Intermethod reliability study or a validity study
• Instruments usually differ
• Variances of the measures may not be equal
• Units of measure may not be the same
• E.g.: A measure of beta-carotene intake may be compared with serum beta-carotene concentration
• Limited to the comparison of only two measures at one time
• The issue of correlated errors does not influence the choice of analytical method
for reliability studies, only the interpretation of the results.
• Statistical tests are less important, for it should almost be a ‘given’ that X1 and X2
are associated beyond chance.
Validity and intermethod reliability studies of continuous measures
• Intermethod reliability studies and validity studies can be analyzed using common
statistical techniques.
• For the analysis of continuous exposure variables, one would report X̄ 1, X̄ 2, and
ρX1X2 estimated by the Pearson correlation coefficient.
• Under the model of additive independent errors, the difference between the biases
of the two measures can be estimated as the difference between the sample means
of X1 and X2: b̂1 − b̂2 = X̄1 − X̄2
• For reliability studies in which cases are compared with controls, if X2 has non-
differential bias then the difference in bias in X1 between cases and controls can be
estimated as: (X̄1D − X̄2D) − (X̄1N − X̄2N)
• The Pearson product-moment correlation and its confidence interval can be used to
estimate ρX1X2 for intermethod or validity studies.
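A minimal analysis sketch along these lines (the six data pairs below are hypothetical, purely for illustration):

```python
import numpy as np

# Report the two sample means, their difference (estimating b1 - b2 under
# additive independent errors), and the Pearson correlation estimating rho_X1X2.
x1 = np.array([20.1, 18.4, 22.3, 19.8, 21.0, 17.9])  # hypothetical measure 1
x2 = np.array([19.5, 18.0, 21.6, 19.9, 20.2, 17.5])  # hypothetical measure 2

bias_difference = x1.mean() - x2.mean()   # estimates b1 - b2
r = np.corrcoef(x1, x2)[0, 1]             # estimates rho_X1X2
print(round(bias_difference, 2), round(r, 2))  # 0.47 0.98
```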
Analysis of validity and intermethod studies of categorical measures
• Several methods can be used to analyze validity or intermethod reliability
studies of categorical exposure variables.
• Misclassification matrix is also appropriate for ordered categorical variables.
• Depending on the distribution of the ordered categorical variable, the Spearman rank
correlation coefficient might be used.
• The effect of measurement error in an ordered categorical variable can also be described
(under certain assumptions) in terms of the validity coefficient of the underlying
continuous variable from which the categorical variable was created.
• This means that the difference in means and the correlation between the two
underlying continuous variables could be appropriate.
• In intramethod studies, the two or more measures compared are to be used
interchangeably as a single exposure measurement in the epidemiological study.
• There is a key difference between intermethod and intramethod reliability.
• In intermethod reliability,
• any systematic difference between X1 and X2 reflects a consistent bias which affects all
subjects in the parent study, and thus does not affect the precision of X1.
• In intramethod reliability,
• a systematic difference between measures contributes to a lack of precision in X because it
affects some subjects but not others.
• E.g: if one interviewer weighs subjects on a correctly calibrated scale and a second rater’s
scale is miscalibrated 2 kg too heavy, this source of error will affect only the subjects
measured by the second rater.
• Thus any consistent difference between study interviewers would increase the
variance of the exposure measure (σ²X) in the full study and decrease the
reliability compared with the use of only one interviewer.
• The mean difference (X̄1 − X̄2) between measures can be used to reflect the
systematic difference between measures, but any consistent difference between X1
and X2 beyond chance also contributes to a lower estimate of ρX, the intraclass
correlation coefficient.
Parameters for qualitative measures
Sensitivity and specificity
• Basic measures of validity for binary categorical variables
• The study value of the exposure or outcome is compared to the “true” value,
measured by a more accurate method.
Sensitivity and specificity
Sensitivity = TP / (TP + FN): the proportion of subjects who truly have the condition (or exposure) who are classified as positive
Specificity = TN / (TN + FP): the proportion of subjects who truly do not have the condition who are classified as negative
Class exercise
A new test for the detection of early optic disc damage in the presence of normal IOP
was devised. It is applied to 200 people expected to have normal IOP. Based on
the reference test, 20 of the 200 people actually have damaged optic discs. The new
test detects 12 cases of damaged discs, of which 8 have damaged discs by the reference
test and 4 do not.
What are the sensitivity and specificity of the new test?
Class exercise: Solution
• Sensitivity = TP / (TP + FN) = 8 / 20 = 40%
• Specificity = TN / (TN + FP) = 176 / 180 ≈ 98%
• NB: the values would vary according to the cut-off level used to separate “diseased”
(or exposed) from “undiseased” (or unexposed) individuals.
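The exercise numbers can be worked through in a few lines (cell counts taken directly from the exercise):

```python
# 200 subjects; 20 have damaged discs by the reference test. The new test is
# positive in 12, of which 8 are true positives and 4 are false positives.
tp, fp = 8, 4
fn = 20 - tp        # diseased subjects missed by the new test -> 12
tn = 180 - fp       # non-diseased subjects correctly negative -> 176

sensitivity = tp / (tp + fn)   # 8 / 20
specificity = tn / (tn + fp)   # 176 / 180
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.3f}")
```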
Percent agreement and its variations
Overall percent agreement/observed proportion of agreement (po)
o Intuitive and easy to calculate
o Simplest agreement index
o Can make agreement look artificially high when much of the agreement consists of
two observers reading negative, or normal, results

Overall percent agreement (Po) = (number of subjects on whom the raters agree) / (total
number of subjects), i.e. (a + d) / n for a 2 × 2 table
Percent positive agreement
• An alternative approach: disregard subjects labelled as negative by both evaluators
• Percent positive agreement = a / (a + b + c) for a 2 × 2 table
Limitations of overall and percent positive agreement
• The extent of agreement between two raters beyond that due to chance alone
cannot be estimated
Kappa statistic and its variations
Cohen’s kappa coefficient
o The most salient and the most widely used in the scientific literature
o For nominal and ordinal scales
o Quantifies the level of agreement between the raters beyond chance
o It corrects the proportion of elements classified in the same category by both raters
for chance agreement:

κ = (Po − Pe) / (1 − Pe)

where Po is the observed proportion of agreement and Pe is the proportion of
agreement expected by chance
• Cohen corrected the observed proportion of agreement for the proportion of
agreement expected by chance and scaled the result to obtain a value between −1 and +1:
o One → agreement is perfect (all observations fall in the diagonal cells of the contingency table)
o Zero → agreement is no better than that expected by chance
o Negative → the observed proportion of agreement is lower than the proportion of
agreement expected by chance
• Different marginal distributions are expected when raters differ in work experience
or background, or when they use different methods
• Cohen’s kappa coefficient does not penalize the level of agreement for
differences in the marginal distributions of the raters.
Exercise: Calculation of kappa

                          Technician 2
                       Positive   Negative
Technician 1 Positive     45          5
             Negative      2         30

Exercise: Calculation of kappa: Solution

                          Technician 2
                       Positive   Negative   Total
Technician 1 Positive     45          5        50
             Negative      2         30        32
             Total        47         35        82

Po = (45 + 30) / (45 + 30 + 5 + 2) = 0.91
Pe = ((47 × 50) / 82 + (35 × 32) / 82) / 82 = 0.52
κ = (0.91 − 0.52) / (1 − 0.52) ≈ 0.81
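The same calculation as a short script (cell counts from the exercise; note that dividing the unrounded Po and Pe gives κ ≈ 0.82, while rounding them to two decimals first gives ≈ 0.81):

```python
# Cohen's kappa for the technician data: a=45 (both positive), b=5, c=2,
# d=30 (both negative), n=82.
a, b, c, d = 45, 5, 2, 30
n = a + b + c + d

po = (a + d) / n                                       # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2    # chance-expected agreement
kappa = (po - pe) / (1 - pe)
print(round(po, 2), round(pe, 2), round(kappa, 2))  # 0.91 0.52 0.82
```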
What is a “high” kappa?
• Some experts have attached the following qualitative terms to kappas:
• 0.0–0.2 → slight
• 0.2–0.4 → fair
• 0.4–0.6 → moderate
• 0.6–0.8 → substantial
• 0.8–1.0 → almost perfect
Extensions of kappa
• All possess the same characteristic
• Account for the occurrence of agreement due to chance
Weighted Kappa
• Some disagreements between raters can be considered more important than
others.
• Reflect the seriousness of disagreement according to the distance between the
categories.
• E.g.: on an ordinal scale, disagreements on two extreme categories are generally
considered more important than on neighboring categories.
Interpretation and limitations of κ and weighted κ
• The value of κ for dichotomous measures or κw can be interpreted in terms of
the attenuation of the OR due to non-differential measurement error.
There are several limitations to the interpretation of κ and κw
• The value of κw varies with the number of exposure categories
• The dependence of κ on prevalence of exposure may be a desirable property
• Because the attenuation of the OR depends on the:
• Exposure prevalence
• Sensitivity and
• Specificity of the measurement
• Values range from −1 to +1, with no clear interpretation except for the values 0 and 1
• Dependent on the prevalence of the trait under study
• Serious limitation when comparing values among studies with varying
prevalence
Sample size for reliability studies
• Required sample size depends on the design and aim of the reliability study.
• For an intermethod reliability study conducted to assess differential bias between
cases and controls,
• The required sample size could be based on the standard sample size formula for a two-sample
comparison of means, where the variable of interest is the within-pair difference (Xi1 − Xi2).
• NB: the null hypothesis to be tested would not be ρX1X2 = 0, for it should be assumed
that X1 and X2 are at least positively correlated.
• Rather, the study should have sufficient power to detect whether ρX1X2 is greater
than some minimum value rL.
• Based on the transformation of ρX1X2 to a standardized normal distribution (Fisher’s
z transformation, z(ρ) = ½ ln[(1 + ρ)/(1 − ρ)]), the required sample size n is:

n = [(Zα/2 + Zβ) / (z(r) − z(rL))]² + 3

• E.g.: for an expected r = 0.6, a lower bound rL = 0.4, 80 per cent power (Zβ = 0.84) and a two-
sided 95% confidence level (Zα/2 = 1.96), the required sample size is 111 subjects
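The worked example can be verified with a short function (a sketch of the Fisher z-based formula; Zβ = 0.84 corresponds to 80% power):

```python
import math

# Sample size for detecting that rho_X1X2 exceeds a lower bound rL, based on
# Fisher's z transformation of the correlation coefficient.
def n_reliability(r_expected, r_lower, z_alpha=1.96, z_beta=0.84):
    z = lambda rho: 0.5 * math.log((1 + rho) / (1 - rho))  # Fisher transform
    return math.ceil(((z_alpha + z_beta) / (z(r_expected) - z(r_lower))) ** 2 + 3)

print(n_reliability(0.6, 0.4))  # 111 subjects
```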
Reference
White E, Armstrong BK, Saracci R. Principles of Exposure Measurement in
Epidemiology: Collecting, Evaluating, and Improving Measures of Disease Risk
Factors. Chapter 4: Exposure measurement error and its effects.
The End