- For example, the Stanford-Binet Intelligence Scale, a standardized intelligence (IQ) test used in many countries, is an interval measure. A score of 140 on the Stanford-Binet is higher than a score of 120, which, in turn, is higher than 100. Moreover, the difference between 140 and 120 is presumed to be equivalent to the difference between 120 and 100.

RATIO MEASUREMENT

• Ratio measurement is the highest level. Ratio scales, unlike interval scales, have a rational, meaningful zero and therefore provide information about the absolute magnitude of the attribute.
- The Fahrenheit scale for measuring temperature (interval measurement) has an arbitrary zero point. Zero on the thermometer does not signify the absence of heat; it would not be appropriate to say that 60°F is twice as hot as 30°F.
- Many physical measures, however, are ratio measures with a real zero. A person's weight, for example, is a ratio measure. It is acceptable to say that someone who weighs 200 pounds is twice as heavy as someone who weighs 100 pounds.

Errors Of Measurement

• Situational contaminants - Scores can be affected by the conditions under which they are produced. For example, environmental factors (e.g., temperature, lighting, time of day) can be sources of measurement error.
• Response-set biases - Relatively enduring characteristics of respondents can interfere with accurate measurements.
• Transitory personal factors - Temporary states, such as fatigue, hunger, or mood, can influence people's motivation or ability to cooperate, act naturally, or do their best.
• Administration variations - Alterations in the methods of collecting data from one person to the next can affect obtained scores. For example, if some physiologic measures are taken before a feeding and others are taken after a feeding, then measurement errors can potentially occur.
• Item sampling - Errors can be introduced as a result of the sampling of items used to measure an attribute. For example, a student's score on a 100-item test of research methods will be influenced somewhat by which 100 questions are included.

Reliability Of Measuring Instruments

RELIABILITY

• Reliability is the consistency with which an instrument measures the attribute. If a scale weighed a person at 120 pounds one minute and 150 pounds the next, we would consider it unreliable.
• The less variation an instrument produces in repeated measurements, the higher its reliability.
• Reliability also concerns a measure's accuracy. An instrument is reliable to the extent that its measures reflect true scores, that is, to the extent that measurement errors are absent from obtained scores.
• A reliable instrument maximizes the true-score component and minimizes the error component of an obtained score.
• Three aspects of reliability are of interest to quantitative researchers: stability, internal consistency, and equivalence.

STABILITY

• The stability of an instrument is the extent to which similar results are obtained on two separate occasions.
• The reliability estimate focuses on the instrument's susceptibility to extraneous influences over time, such as participant fatigue.
• Assessments of stability are made through test-retest reliability procedures. Researchers administer the same measure to a sample twice and then compare the scores.
- For example, consider the stability of a self-report scale that measured self-esteem:
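The true-score logic described above (obtained score = true score + error) can be illustrated with a small simulation. The sketch below is illustrative only; the scale mean, spread, sample size, and error sizes are invented for the example. It shows that when the error component is small relative to true-score differences, two administrations of the same instrument correlate strongly:

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two sets of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)

# Hypothetical "true" self-esteem scores for 100 people (invented numbers).
true_scores = [random.gauss(30, 5) for _ in range(100)]

def administer(error_sd):
    """One administration: obtained score = true score + random error."""
    return [t + random.gauss(0, error_sd) for t in true_scores]

# Reliable instrument: error is small relative to true-score spread.
r_reliable = pearson_r(administer(1.0), administer(1.0))

# Unreliable instrument: error swamps much of the true-score variation.
r_unreliable = pearson_r(administer(10.0), administer(10.0))
```

With the small error, the test-retest correlation comes out well above .9; with the large error, it falls far lower, mirroring the principle that less variation across repeated measurements means higher reliability.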
• Because self-esteem is a fairly stable attribute that does not change much from one day to another, we would expect a reliable measure of it to yield consistent scores on two different days.
• As a check on the instrument's stability, we administer the scale twice, 2 weeks apart, to a sample of 10 people.
• The scores on the two tests are not identical but, on the whole, the differences are not large.
• Researchers compute a reliability coefficient, a numeric index that quantifies an instrument's reliability, to objectively determine how small the differences are. Reliability coefficients (designated as r) range from .00 to 1.00. The higher the value, the more reliable (stable) the measuring instrument. An example is shown in Table 14.1.
• Test-retest reliability is relatively easy to compute, but a major problem with this approach is that many traits do change over time, independently of the instrument's stability.
• Attitudes, mood, knowledge, and so forth can be modified by experiences between two measurements. Thus, stability indexes are most appropriate for relatively enduring characteristics, such as temperament.
• Even with such traits, test-retest reliability tends to decline as the interval between the two administrations increases.

INTERNAL CONSISTENCY

• Internal consistency reliability is the most widely used reliability approach among nurse researchers. This approach is the best means of assessing an especially important source of measurement error in psychosocial instruments, the sampling of items.
• Scales and tests that involve summing item scores are almost always evaluated for their internal consistency.
• An instrument may be said to be internally consistent to the extent that its items measure the same trait.
• Internal consistency is usually evaluated by calculating coefficient alpha (Cronbach's alpha). The normal range of values for coefficient alpha is between .00 and +1.00. The higher the reliability coefficient, the more internally consistent the measure.

EQUIVALENCE

• Equivalence, in the context of reliability assessment, primarily concerns the degree to which two or more independent observers or coders agree about the scoring on an instrument.
• With a high level of agreement, the assumption is that measurement errors have been minimized. The degree of error can be assessed through interrater (or interobserver) reliability procedures, which involve having two or more trained observers or coders make simultaneous, independent observations.
• An index of equivalence or agreement is then calculated with these data to evaluate the strength of the relationship between the ratings. When two independent observers score some phenomenon congruently, the scores are likely to be accurate and reliable.

Interpretation Of Reliability Coefficients

• Reliability coefficients are important indicators of an instrument's quality. Unreliable measures reduce statistical power and hence affect statistical conclusion validity.
• If data fail to support a hypothesis, one possibility is that the instruments were unreliable, not necessarily that the expected relationships do not exist. Knowledge about an instrument's reliability thus is critical in interpreting research results, especially if research hypotheses are not supported.
• Various factors affect an instrument's reliability. For example, reliability is related to sample heterogeneity. The more homogeneous the sample (i.e., the more similar the scores), the lower the reliability coefficient will be.
• Reliability estimates vary according to the procedure used to obtain them. Estimates of reliability computed by different procedures are not identical, so it is important to consider which aspect of reliability is most important for the attribute being measured.

VALIDITY

• The second important criterion for evaluating a quantitative instrument is its validity. Validity is the degree to which an instrument measures what it is supposed to measure.
- Suppose we wanted to assess patients' anxiety by measuring the circumference of their wrists. We could obtain highly accurate and precise measurements of wrist circumference, but such measures would not be valid indicators of anxiety.
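The coefficient alpha calculation described under Internal Consistency can be sketched in a few lines. This is a minimal illustration of the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); the item data are invented for the example:

```python
import statistics

def cronbach_alpha(items):
    """Coefficient (Cronbach's) alpha.

    items: a list of k columns, one per scale item, each holding that
    item's scores for the same n respondents.
    """
    k = len(items)
    # Each respondent's total scale score (sum across items).
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(statistics.variance(col) for col in items)
    return (k / (k - 1)) * (1 - sum_item_var / statistics.variance(totals))

# Three invented items that track the same trait closely -> alpha near 1.
consistent = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [2, 2, 3, 4, 5]]

# Two invented items that agree only loosely -> a much lower alpha.
loose = [[1, 2, 3, 4, 5, 6], [3, 1, 5, 2, 6, 4]]

alpha_hi = cronbach_alpha(consistent)
alpha_lo = cronbach_alpha(loose)
```

Items that measure the same trait push alpha toward +1.00; weakly related items pull it down, consistent with the interpretation of alpha as an index of internal consistency.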
• Reliability and validity are not totally independent qualities of an instrument.
- A measuring device that is unreliable cannot possibly be valid.
- An instrument cannot validly measure an attribute if it is erratic and inaccurate.
- An instrument can, however, be reliable without being valid.
• Thus, the high reliability of an instrument provides no evidence of its validity; low reliability of a measure is evidence of low validity.

FACE VALIDITY

• Face validity refers to whether the instrument looks as though it is measuring the appropriate construct, especially to people who will be completing the instrument.
- Johnson and colleagues (2008) developed an instrument to measure cognitive appraisal of health among survivors of stroke. One part of the development process involved assessing the face validity of the items on the scale. Stroke survivors were asked a series of open-ended questions regarding their health appraisal after completing the scale, and then the themes that emerged were compared with the content of scale items to assess the congruence of key constructs.

CONTENT VALIDITY

• … the standard for establishing excellence in a scale's content validity.
- Bu and Wu (2008) developed a scale to measure nurses' attitudes toward patient advocacy. The content validity of their 84-item scale was rated by seven experts (a bioethicist, patient advocacy researchers, and measurement experts). The scale's CVI was calculated to be .85.
• An instrument's content validity is necessarily based on judgment. No totally objective methods exist for ensuring the adequate content coverage of an instrument, but it is increasingly common to use a panel of substantive experts to evaluate the content validity of new instruments.

CRITERION-RELATED VALIDITY

• In criterion-related validity assessments, researchers seek to establish a relationship between scores on an instrument and some external criterion. The instrument, whatever abstract attribute it is measuring, is said to be valid if its scores correspond strongly with scores on the criterion.
• After a criterion is established, validity can be estimated easily.
• Criterion-related validity is helpful in assisting decision makers by giving them some assurance that their decisions will be effective, fair, and, in short, valid.
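A content validity index (CVI) like the one reported for the Bu and Wu scale can be computed as sketched below. This assumes the common approach in which each expert rates each item's relevance on a 4-point scale and an item counts as endorsed at a rating of 3 or 4; the rating data here are invented, not Bu and Wu's:

```python
def item_cvi(ratings):
    """Item-level CVI: the proportion of experts rating the item
    3 or 4 on a 4-point relevance scale."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)

def scale_cvi(rating_matrix):
    """Scale-level CVI (averaging approach): the mean of the item CVIs."""
    item_values = [item_cvi(row) for row in rating_matrix]
    return sum(item_values) / len(item_values)

# Invented ratings: 2 items x 7 experts (rows are items).
ratings = [
    [4, 4, 3, 4, 3, 2, 4],  # endorsed by 6 of 7 experts
    [3, 3, 4, 4, 2, 2, 3],  # endorsed by 5 of 7 experts
]
cvi = scale_cvi(ratings)  # (6/7 + 5/7) / 2
```

The result still rests on expert judgment, as the text notes; the index only summarizes how consistently the panel endorsed the items.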
The difference between predictive and concurrent validity, then, is the difference in the timing of obtaining measurements on a criterion.

• Sensitivity is the measure's ability to identify cases correctly, that is, to screen in those with the condition correctly. A measure's sensitivity is its rate of yielding "true positives."
• Specificity is the measure's ability to identify noncases correctly, that is, to screen out those without the condition.
• Specificity is an instrument's rate of yielding "true negatives."
• To determine an instrument's sensitivity and specificity, researchers need a reliable and valid criterion of "caseness" against which scores on the instrument can be assessed.
- EXAMPLE: Suppose we wanted to evaluate whether adolescents' self-reports about their smoking were accurate, and we asked 100 teenagers aged 13 to 15 whether they had smoked a cigarette in the previous 24 hours.
- The "gold standard" for nicotine consumption is cotinine levels in a body fluid, so let us assume that we did a urinary cotinine assay.
- Sensitivity is the proportion of teenagers who accurately reported that they smoked, or true-positive findings divided by all real-positive findings. In our example, sensitivity was only .50. Specificity is the proportion of teenagers who accurately reported they did not smoke, or true-negative findings divided by all real-negative findings. In our example, specificity is .83. There was considerably less over-reporting of smoking ("faking bad") than under-reporting ("faking good"). (Sensitivity and specificity are sometimes reported as percentages rather than proportions, simply by multiplying the proportions by 100.)

CONSTRUCT VALIDITY

• Construct validity is a key criterion for assessing the quality of a study, and construct validity has most often been linked to measurement issues. The key construct validity questions with regard to measurement are:
- "What is this instrument really measuring?" and "Does it validly measure the abstract concept of interest?"

KNOWN GROUPS

• One approach to construct validation is the known-groups technique. In this procedure, groups that are expected to differ on the target attribute are administered the instrument, and group scores are compared.
- For instance, in validating a measure of fear of the labor experience, the scores of primiparas and multiparas could be contrasted. Women who had never given birth would likely experience more anxiety than women who had already had children; one might question the validity of the instrument if such differences did not emerge.

FACTOR ANALYSIS

Critiquing Data Quality

8. If a diagnostic or screening tool was used, is information provided about its sensitivity and specificity, and were these qualities adequate?
9. Were the research hypotheses supported? If not, might data quality play a role in the failure to confirm the hypotheses?
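The sensitivity and specificity calculations walked through in the smoking example reduce to two ratios over a 2x2 table of self-report versus assay results. The counts below are invented to reproduce rates close to the example's (sensitivity .50, specificity near .83); they are not the chapter's actual table:

```python
def sensitivity(true_pos, false_neg):
    """True positives divided by all real positives (cases)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True negatives divided by all real negatives (noncases)."""
    return true_neg / (true_neg + false_pos)

# Invented counts for 100 teenagers (not the chapter's data):
# 20 had positive cotinine assays, of whom 10 admitted smoking;
# 80 had negative assays, of whom 66 denied smoking.
sens = sensitivity(true_pos=10, false_neg=10)
spec = specificity(true_neg=66, false_pos=14)

# Expressed as percentages: multiply the proportions by 100.
sens_pct, spec_pct = sens * 100, spec * 100
```

With these counts, half of the real smokers denied smoking (sensitivity .50), while most nonsmokers reported accurately (specificity .825), the same under-reporting pattern the example describes.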