Explain the relationship of reliability to validity under the parallel test model
Identify common statistical analysis methods for reliability and validity studies
02/28/2024 2
Contents
• Intermethod and intramethod reliability
• Measures of reliability are primarily important for what they reveal about the
validity of a measure
• Emphasized for both design and interpretation of reliability studies
• Choice of analytical technique for a validity or reliability study depends on:
1. The scale on which the exposure is measured:
• Continuous variable
• Nominal categorical (e.g. dichotomous) variable, or
• Ordered categorical variable
• E.g.: for a reliability study in which a continuous exposure measure from proxy
respondents was compared with the same measure from subjects themselves,
• Another version (R4) might be used if both were included but a factor
indicating whether the interview was by index or proxy respondent was to be
adjusted for in the full study.
Reliability
• Used to refer to the reproducibility of a measure
o How consistently a measurement can be repeated on the same subjects
• In most fields of study, the term refers to intramethod reliability
• Less work exists on design and interpretation of intermethod reliability studies!
• Can be assessed in a number of ways
Intramethod reliability
• A measure of the reproducibility of an instrument, either applied:
o In the same manner to same subjects at two or more points in time: test–
retest reliability, or
o By two or more data collectors to the same subjects: inter-rater reliability
• Intermethod reliability studies are sometimes called validity studies or validation studies
• They are also called method comparison studies, or studies of the relative validity of one measure in comparison to another
Internal consistency reliability:
• Can be assessed when the exposure measure for each individual in the parent
epidemiological study is a sum or average of two or more individual items.
• E.g.:
• A disability scale calculated as the sum of multiple questions or
• Serum beta-carotene for each subject, averaged over three blood draws
• Reliability studies help to estimate the influence of different sources of
variation on scores.
A model of reliability and measures of reliability
• Suppose that each person in a population of interest is measured twice,
• Either with one instrument or
• Two instruments that purport to measure the same exposure
• If two instruments are used, X1 will denote the measure of interest, i.e. the
one to be used in the epidemiological study, and X2 the comparison measure.
• For a given subject i, two (continuous) exposure measurements, Xi1 and Xi2,
are obtained.
A simple model that could apply to intermethod or intramethod reliability studies is

Xi1 = Ti + b1 + Ei1
Xi2 = Ti + b2 + Ei2

or, in general, Xij = Ti + bj + Eij

where Ti is subject i's true exposure, bj is the bias of measure j, and Eij is a
random error term with μE1 = μE2 = 0
• In the population, X1, X2, T, E1, and E2 are random variables, each with its
own distribution.
• A reliability study can yield estimates of μX1 , μX2 and the correlation ρX1X2
between the two measures, termed the reliability coefficient.
• Bias and the validity coefficient, two measures of the validity of a continuous
exposure measure, are important in assessing the impact of measurement
error
Measurement of bias in a measure and differential bias
• Reliability studies often cannot provide information on the bias in X1 or X2.
• Only the difference between the biases of X1 and X2 can be observed. Based on
the model, the difference between the population means of the two measures is
equal to the difference between their biases: μX1 − μX2 = b1 − b2
• This difference is often not very informative!
• If a similar degree of bias is present in both measures—for example, if the same
miscalibrated scale is used to weigh each subject twice—the difference between the
means of the two measures can be close to zero even when there is considerable
bias in both measures.
• However, if X2 is an unbiased measure of T (b2 = 0), then μX1 − μX2 = b1, the bias in X1
• Thus, only when the comparison measure X2 is a perfect measure or when X2 can
be assumed to be unbiased (e.g. a well-calibrated scale) can a reliability study
yield information about the bias in X1.
• Differential bias in the exposure measure between cases and controls is a major
concern because it can lead to invalid results in an epidemiological study.
(Differential precision may also be a concern)
• The difference between the biases in X1 between cases and controls (b1D − b1N) can
be measured only if the comparison measure X2 is perfect or unbiased, or if there
is non-differential bias in X2 (b2D − b2N = 0)
• Then, if the simple additive model given above holds for both cases and controls,
b1D − b1N = (μX1D − μX2D) − (μX1N − μX2N)
Class exercise
• To assess differential bias between breast cancer cases and controls in
a retrospective food frequency estimate of dietary fibre intake (X1), a
reliability study was conducted within an existing cohort study. X1
was compared with the prospective (pre-diagnostic) food frequency
estimate of fibre intake (X2) from the cohort study. Cases reported
19.5 grams of fibre prospectively and 20.0 grams retrospectively.
Controls reported 20.5 grams of fibre prospectively and 20.2 grams
retrospectively. What is the differential bias in X1 assuming that there
is reasonable certainty that any bias in X2 is equal for cases and
controls?
Class exercise: solution
Given information:
• Measurement one (X1): retrospective estimate
• Measurement two (X2): prospective estimate
• μX1D = 20.0 grams, μX2D = 19.5 grams (cases)
• μX1N = 20.2 grams, μX2N = 20.5 grams (controls)
Since there is reasonable certainty that any bias in X2 is equal for cases and
controls (b2D − b2N = 0), the differential bias in X1 can be estimated as
(b1D − b1N) − (b2D − b2N) = (μX1D − μX2D) − (μX1N − μX2N)
= (20.0 − 19.5) − (20.2 − 20.5)
= 0.5 − (−0.3)
= 0.8 grams
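The arithmetic above can be checked with a short script (a minimal sketch; the function name and argument order are illustrative):

```python
# Differential bias in X1, assuming the simple additive error model and
# non-differential bias in X2 (b2D = b2N); function name is illustrative.
def differential_bias(mu_x1_cases, mu_x2_cases, mu_x1_controls, mu_x2_controls):
    """Estimate b1D - b1N as (mean X1 - mean X2) in cases minus controls."""
    return (mu_x1_cases - mu_x2_cases) - (mu_x1_controls - mu_x2_controls)

bias = differential_bias(20.0, 19.5, 20.2, 20.5)
print(round(bias, 1))  # 0.8 grams
```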
Relationship of reliability to validity under the parallel test model
• Thus measures of reliability may be used to estimate at least some of the effects of
measurement error in the absence of a measure of bias.
• Assumptions of the parallel test model (PTM):
1. The error variables E1 and E2 are not correlated with the true value T, or with each other
2. E1 and E2 have equal variances
• Summary of PTM:
• Two measures are parallel measures of T if they have equal error variances AND uncorrelated errors!
• This generally includes the assumption that b1 = b2 = 0
• Under the assumptions of parallel tests it can be shown that ρX1X2 = ρ²TX1
• Or equivalently, ρTX1 = √ρX1X2
• Important!
• Because it shows that, if the assumptions are correct, the reliability coefficient (a measure of
the correlation between two imperfect measures) can be used to estimate the correlation
between T and X1 without having a perfect measure of T.
• These expressions apply only when the reliability coefficient is restricted to the
correlation between parallel measures of T.
• However, we use the term reliability coefficient to refer to the correlation ρX1X2 between
measures of the same exposure even when the assumptions of parallel tests do not hold.
• This means that, for a given instrument X1 applied to a given population, the reliability
coefficient will vary with the choice of X2.
• In real reliability studies, the assumptions of parallel tests are often incorrect.
• Two common violations : unequal variances of E1 and E2, and correlated errors.
• Even when these assumptions are violated, the correlation between X1 and X2 can still
provide some information about the validity coefficient of X1.
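The parallel-test result ρX1X2 = ρ²TX1 can be illustrated by simulation; this is a sketch with arbitrary, hypothetical means and variances, not data from any real study:

```python
import numpy as np

# Parallel test model: X1 = T + E1, X2 = T + E2 with uncorrelated errors of
# equal variance. The reliability coefficient should approximate the squared
# validity coefficient. All distribution parameters below are illustrative.
rng = np.random.default_rng(0)
n = 200_000
T = rng.normal(50, 10, n)    # true exposure (variance 100)
E1 = rng.normal(0, 5, n)     # error of measure 1 (variance 25)
E2 = rng.normal(0, 5, n)     # error of measure 2, same variance
X1, X2 = T + E1, T + E2

reliability = np.corrcoef(X1, X2)[0, 1]        # rho_X1X2, theory: 100/125 = 0.8
validity_sq = np.corrcoef(T, X1)[0, 1] ** 2    # rho_TX1 squared, theory: 0.8
print(round(reliability, 2), round(validity_sq, 2))  # both ~0.80
```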
Relationship of reliability to validity under unequal variances of E1 and E2
• This assumption is incorrect for certain reliability studies, particularly for many
intermethod reliability studies.
• First, consider a true validity study where X1, the exposure measure of interest, is
compared with a perfect measure of exposure, termed X2 (X2 = T).
• Then, by definition, ρX1X2 = ρTX1, the validity coefficient of X1
• However, a perfect measure is often not available, and so the exposure measure of interest
X1 is often compared with an imperfect but more precise measure, X2.
→ ρTX2 > ρTX1
• It can be shown that, when X2 is more precise than X1 and the errors in X1 and X2 are
uncorrelated, the reliability coefficient yields an upper and a lower bound for the
validity coefficient of X1:
ρX1X2 ≤ ρTX1 ≤ √ρX1X2
• The lower bound ρX1X2 corresponds to interpreting the study as if X2 were a perfect
measure, and the upper bound √ρX1X2 to interpreting it as if X2 had equal error
variance (parallel tests).
• The more accurate X2 is, the closer the lower bound ρX1X2 is to ρTX1
• In an effort to find a comparison measure X2 with an error uncorrelated with
the error in X1, the chosen comparison measure may be less accurate than X1 (ρTX2 < ρTX1)
• More specifically, X2 may be a less accurate measure than X1 of T, the true exposure that
X1 is attempting to measure.
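As a small numeric illustration of these bounds (the observed reliability coefficient r = 0.5 here is hypothetical):

```python
import math

# Bounds on the validity coefficient when X2 is at least as precise as X1 and
# the errors are uncorrelated: r <= rho_TX1 <= sqrt(r). r = 0.5 is hypothetical.
r = 0.5
lower, upper = r, math.sqrt(r)
print(f"{lower:.3f} <= validity coefficient of X1 <= {upper:.3f}")  # 0.500 ... 0.707
```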
A model of reliability allowing for correlated errors
• Uncorrelated errors: the PTM assumption that is most often violated
• The errors in X1 and X2 are often positively correlated
• Errors are correlated when the same subjects who have large positive errors on the first
measure tend to have positive errors on the second measure
• If X1 and X2 have a similar bias, this alone does not lead to correlated errors,
because a bias adds a constant error for all subjects.
• A model for reliability that makes the correlated errors explicit is:
Eij = Bi + Fij
where
• Error terms Ei1 and Ei2 for a given subject are the sum of two parts:
• Part Bi
• Which repeats itself on each measure of subject i,
• Termed the within-subject bias
• Part Fij
• Which varies between measures
• Termed the random error
• E1 and E2 are correlated because they both include the within-subject bias.
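The effect of a shared within-subject bias on the reliability coefficient can be sketched by simulation (unit variances chosen arbitrarily; all names illustrative):

```python
import numpy as np

# Error model E_ij = B_i + F_ij: the within-subject bias B repeats in both
# measures, so the test-retest correlation overstates the (squared) validity.
rng = np.random.default_rng(1)
n = 200_000
T = rng.normal(0, 1, n)            # true exposure
B = rng.normal(0, 1, n)            # within-subject bias, shared by X1 and X2
F1, F2 = rng.normal(0, 1, (2, n))  # independent random errors
X1, X2 = T + B + F1, T + B + F2

reliability = np.corrcoef(X1, X2)[0, 1]        # theory: 2/3 (inflated)
validity_sq = np.corrcoef(T, X1)[0, 1] ** 2    # theory: 1/3
print(round(reliability, 2), round(validity_sq, 2))  # ~0.67 vs ~0.33
```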
Measurement error in Xij, the jth measure on subject i.
Relationship of reliability to validity under correlated errors
• When a reliability study is conducted in which X1 and X2 have correlated errors,
the reliability coefficient ρX1X2 is artificially high.
• Reliability coefficient only measures (i.e. is only reduced by) the components of
error that are not correlated.
• A reliability study with correlated error does not capture all components of error
because it does not measure the part of the error that is repeated within subjects
(the within-subject bias).
• When errors (E1 & E2) of measures in a reliability study are positively correlated
• Reliability study can only yield an upper limit for the validity coefficient.
• Thus:
• A measure can be reliable (repeatable) even if it has poor validity
• While a low reliability coefficient implies poor validity, a high reliability does not
necessarily imply a high validity coefficient.
• The high reliability may be due instead to the correlated error.
o Certain high-fat foods eaten frequently by a few subjects may have been
omitted from the questionnaire.
- Those subjects would have their fat intake consistently underestimated.
o Time period assessed by instrument (diet in last year) may differ from true
time period of interest (e.g. diet over the last 5 years).
- Then those who have lowered the fat in their diet in recent years will
have their fat intake underestimated on both administrations of the
instrument compared with their true average fat intake.
• Correlated errors commonly occur in intramethod studies, but they could occur
in intermethod studies as well.
Issues in the design of validity and reliability studies
• Most of these issues are also important in interpreting reliability studies carried
out by others.
Purpose and timing of the reliability study
• A validity or reliability study of an instrument should be carried out first when:
o A new instrument is to be developed for an epidemiological study, or
o An existing one is to be applied to a substantially different population
• Reliability studies conducted before the main epidemiological study, or early in its
course, can be used to evaluate but also to improve the instrument.
• An additional use of reliability studies is to estimate the impact of exposure
measurement error on the results of the parent epidemiological study after it has been completed
• Information from a reliability study conducted on a subset of subjects concurrently with the
epidemiological study can yield information about the validity of the exposure measure.
• This information can be used to adjust the observed odds ratio (ORo) for the effects of measurement error.
Choice of comparison measures
• In reviewing reliability studies by others, the following are key issues to be considered:
o Was the comparison method used close to perfect, a more precise measure of
the true exposure than X1, or a less precise measure than X1?
o If two or more measures with correlated errors were used, were the errors likely
to be strongly or weakly correlated?
• The answers to these questions will guide the interpretation of the reliability
study
Selection of subjects for reliability studies
• Ideally, subjects should be a random sample of the population in which the study will be carried out.
Timing and order of measures
• When two periods of testing are well separated, the two measures of
exposure may refer to different time periods.
• Thus some lack of correlation between them may be due to true change in
exposure over time.
• Intermethod reliability studies:
• Instrument to be evaluated is given first
• Comparison measure is usually less prone to error and therefore may be less affected by
recall of the prior measurement.
Analysis of validity and reliability studies
Selecting the appropriate measures of validity or reliability
• Intermethod reliability study or a validity study
• Instruments usually differ
• Variances of the measures may not be equal
• Units of measure may not be the same
• E.g.: A measure of beta-carotene intake may be compared with serum beta-carotene concentration
• Limited to the comparison of only two measures at one time
• The issue of correlated errors does not influence the choice of analytical method
for reliability studies, only the interpretation of the results.
• Statistical tests are less important, for it should almost be a ‘given’ that X1 and X2
are associated beyond chance.
Validity and intermethod reliability studies of continuous measures
• Intermethod reliability studies and validity studies can be analyzed using common
statistical techniques.
• For the analysis of continuous exposure variables, one would report X̄ 1, X̄ 2, and
ρX1X2 estimated by the Pearson correlation coefficient.
• Under the model of additive independent errors, the difference between the biases
of the two measures can be estimated as the difference between the sample means
of X1 and X2: b̂1 − b̂2 = X̄1 − X̄2
• For reliability studies in which cases are compared with controls, if X2 has non-
differential bias then the difference in bias in X1 between cases and controls can be
estimated as: (X̄1D − X̄2D) − (X̄1N − X̄2N)
• The Pearson product-moment correlation and its confidence interval can be used to
estimate ρX1X2 for intermethod or validity studies.
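A minimal analysis sketch along these lines (the six data pairs below are hypothetical, purely for illustration):

```python
import numpy as np

# Report the two sample means, their difference (estimating b1 - b2 under
# additive independent errors), and the Pearson correlation estimating rho_X1X2.
x1 = np.array([20.1, 18.4, 22.3, 19.8, 21.0, 17.9])  # hypothetical measure 1
x2 = np.array([19.5, 18.0, 21.6, 19.9, 20.2, 17.5])  # hypothetical measure 2

bias_difference = x1.mean() - x2.mean()   # estimates b1 - b2
r = np.corrcoef(x1, x2)[0, 1]             # estimates rho_X1X2
print(round(bias_difference, 2), round(r, 2))  # 0.47 0.98
```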
Analysis of validity and intermethod studies of categorical measures
• Several methods can be used to analyze validity or intermethod reliability
studies of categorical exposure variables.
• Misclassification matrix is also appropriate for ordered categorical variables.
• Depending on the distribution of the ordered categorical variable, the Spearman rank
correlation coefficient might be used.
• The effect of measurement error in an ordered categorical variable can also be described
(under certain assumptions) in terms of the validity coefficient of the underlying
continuous variable from which the categorical variable was created.
• This means that the difference in means and the correlation between the two
underlying continuous variables could be appropriate.
• In intramethod studies, the two or more measures compared are to be used
interchangeably as a single exposure measurement in the epidemiological study.
• There is a key difference between intermethod and intramethod reliability.
• In intermethod reliability,
• any systematic difference between X1 and X2 reflects a consistent bias which affects all
subjects in the parent study, and thus does not affect the precision of X1.
• In intramethod reliability,
• a systematic difference between measures contributes to a lack of precision in X because it
affects some subjects but not others.
• E.g: if one interviewer weighs subjects on a correctly calibrated scale and a second rater’s
scale is miscalibrated 2 kg too heavy, this source of error will affect only the subjects
measured by the second rater.
• Thus any consistent difference between study interviewers would increase the
variance of the exposure measure (σ²X) in the full study and decrease the
reliability compared with the use of only one interviewer.
• The mean difference (X̄1 − X̄2) between measures can be used to reflect the
systematic difference between measures, but any consistent difference between X1
and X2 beyond chance also contributes to a lower estimate of ρX, the intraclass
correlation coefficient.
Parameters for qualitative measures
Sensitivity and specificity
• Basic measures of validity for binary categorical variables
• The study value of the exposure or outcome is compared to the “true” value,
measured by a more accurate method.
Sensitivity and specificity
Sensitivity = TP / (TP + FN): the proportion of subjects who truly have the condition (or exposure) who are classified as positive
Specificity = TN / (TN + FP): the proportion of subjects who truly do not have the condition who are classified as negative
Class exercise
A new test for the detection of early optic disc damage in the presence of normal IOP
was devised. It is applied to 200 people expected to have normal IOP. Based on
the reference test, 20 of the 200 people actually have damaged optic discs. The new
test detects 12 cases of damaged discs, of which 8 have damaged discs by the reference
test and 4 do not.
What are the sensitivity and specificity of the new test?
Class exercise: Solution
• Sensitivity = TP / (TP + FN) = 8 / 20 = 40%
• Specificity = TN / (TN + FP) = 176 / 180 ≈ 98%
• NB: the values would vary according to the cut-off level used to separate “diseased”
(or exposed) from “undiseased” (or unexposed) individuals.
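The exercise numbers can be worked through in a few lines (cell counts taken directly from the exercise):

```python
# 200 subjects; 20 have damaged discs by the reference test. The new test is
# positive in 12, of which 8 are true positives and 4 are false positives.
tp, fp = 8, 4
fn = 20 - tp        # diseased subjects missed by the new test -> 12
tn = 180 - fp       # non-diseased subjects correctly negative -> 176

sensitivity = tp / (tp + fn)   # 8 / 20
specificity = tn / (tn + fp)   # 176 / 180
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.3f}")
```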
Percent agreement and its variations
Overall percent agreement/observed proportion of agreement (po)
o Intuitive and easy to calculate
o Simplest agreement index
o Can make agreement look artificially high when much of the agreement consists of
two observers reading negative, or normal, results

Overall percent agreement (Po) = (number of subjects on whom the raters agree) / (total
number of subjects), i.e. (a + d) / n for a 2 × 2 table
Percent positive agreement
• An alternative approach: disregard subjects labelled as negative by both evaluators
• Percent positive agreement = a / (a + b + c) for a 2 × 2 table
Limitations of overall and percent positive agreement
• The extent of agreement between two raters beyond that due to chance alone
cannot be estimated
Kappa statistic and its variations
Cohen’s kappa coefficient
o The most salient and the most widely used in the scientific literature
o For nominal and ordinal scales
o Quantifies the level of agreement between the raters beyond chance
o It corrects the proportion of elements classified in the same category by both raters
for chance agreement:

κ = (Po − Pe) / (1 − Pe)

where Po is the observed proportion of agreement and Pe is the proportion of
agreement expected by chance
• Cohen corrected the observed proportion of agreement for the proportion of
agreement expected by chance and scaled the result to obtain a value between −1 and +1:
o One → agreement is perfect (all observations fall in the diagonal cells of the contingency table)
o Zero → agreement is no better than that expected by chance
o Negative → the observed proportion of agreement is lower than the proportion of
agreement expected by chance
• Different marginal distributions are expected when raters differ in work experience
or background, or when they use different methods
• Cohen’s kappa coefficient does not penalize the level of agreement for
differences in the marginal distributions of the raters.
Exercise: Calculation of kappa

                          Technician 2
                       Positive   Negative
Technician 1 Positive     45          5
             Negative      2         30

Exercise: Calculation of kappa: Solution

                          Technician 2
                       Positive   Negative   Total
Technician 1 Positive     45          5        50
             Negative      2         30        32
             Total        47         35        82

Po = (45 + 30) / (45 + 30 + 5 + 2) = 0.91
Pe = ((47 × 50) / 82 + (35 × 32) / 82) / 82 = 0.52
κ = (0.91 − 0.52) / (1 − 0.52) ≈ 0.81
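The same calculation as a short script (cell counts from the exercise; note that dividing the unrounded Po and Pe gives κ ≈ 0.82, while rounding them to two decimals first gives ≈ 0.81):

```python
# Cohen's kappa for the technician data: a=45 (both positive), b=5, c=2,
# d=30 (both negative), n=82.
a, b, c, d = 45, 5, 2, 30
n = a + b + c + d

po = (a + d) / n                                       # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2    # chance-expected agreement
kappa = (po - pe) / (1 - pe)
print(round(po, 2), round(pe, 2), round(kappa, 2))  # 0.91 0.52 0.82
```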
What is a “high” kappa?
• Some experts have attached the following qualitative terms to kappas:
• 0.0–0.2 → slight
• 0.2–0.4 → fair
• 0.4–0.6 → moderate
• 0.6–0.8 → substantial
• 0.8–1.0 → almost perfect
Extensions of kappa
• All possess the same characteristic
• Account for the occurrence of agreement due to chance
Weighted Kappa
• Some disagreements between raters can be considered more important than
others.
• Reflect the seriousness of disagreement according to the distance between the
categories.
• E.g.: on an ordinal scale, disagreements on two extreme categories are generally
considered more important than on neighboring categories.
Interpretation and limitations of κ and weighted κ
• The value of κ for dichotomous measures or κw can be interpreted in terms of
the attenuation of the OR due to non-differential measurement error.
There are several limitations to the interpretation of κ and κw
• The value of κw varies with the number of exposure categories
• The dependence of κ on prevalence of exposure may be a desirable property
• Because the attenuation of the OR depends on the:
• Exposure prevalence
• Sensitivity and
• Specificity of the measurement
• Values range from −1 to +1, with no clear interpretation except for the values 0 and 1
• Dependent on the prevalence of the trait under study
• Serious limitation when comparing values among studies with varying
prevalence
Sample size for reliability studies
• Required sample size depends on the design and aim of the reliability study.
• For an intermethod reliability study conducted to assess differential bias between
cases and controls,
• The required sample size could be based on the standard sample size formula for a two-sample
comparison of means, where the variable of interest is the within-pair difference (Xi1 − Xi2).
• NB: the null hypothesis to be tested would not be ρX1X2 = 0, for it should be assumed
that X1 and X2 are at least positively correlated.
• Rather, the study should have sufficient power to detect whether ρX1X2 is greater
than some minimum value rL.
• Based on the transformation of ρX1X2 to a standardized normal distribution (Fisher’s
z transformation, z(ρ) = ½ ln[(1 + ρ)/(1 − ρ)]), the required sample size n is:

n = [(Zα/2 + Zβ) / (z(r) − z(rL))]² + 3

• E.g.: for an expected r = 0.6, a lower bound rL = 0.4, 80 per cent power (Zβ = 0.84) and a two-
sided 95% confidence level (Zα/2 = 1.96), the required sample size is 111 subjects
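The worked example can be verified with a short function (a sketch of the Fisher z-based formula; Zβ = 0.84 corresponds to 80% power):

```python
import math

# Sample size for detecting that rho_X1X2 exceeds a lower bound rL, based on
# Fisher's z transformation of the correlation coefficient.
def n_reliability(r_expected, r_lower, z_alpha=1.96, z_beta=0.84):
    z = lambda rho: 0.5 * math.log((1 + rho) / (1 - rho))  # Fisher transform
    return math.ceil(((z_alpha + z_beta) / (z(r_expected) - z(r_lower))) ** 2 + 3)

print(n_reliability(0.6, 0.4))  # 111 subjects
```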
Reference
White E, Armstrong BK, Saracci R. Principles of Exposure Measurement in
Epidemiology: Collecting, Evaluating, and Improving Measures of Disease Risk
Factors. Chapter 4: Exposure measurement error and its effects.
The End