research fundamentals  Measurement instruments

Research fundamentals

Validity and reliability of measurement instruments used in research
Carole L. Kimberlin and Almut G. Winterstein

Purpose. Issues related to the validity and reliability of measurement instruments used in research are reviewed.

Summary. Key indicators of the quality of a measuring instrument are the reliability and validity of the measures. The process of developing and validating an instrument is in large part focused on reducing error in the measurement process. Reliability estimates evaluate the stability of measures, internal consistency of measurement instruments, and interrater reliability of instrument scores. Validity is the extent to which the interpretations of the results of a test are warranted, which depends on the particular use the test is intended to serve. The responsiveness of the measure to change is of interest in many of the applications in health care where improvement in outcomes as a result of treatment is a primary goal of research. Several issues may affect the accuracy of data collected, such as those related to self-report and secondary data sources. Self-report of patients or subjects is required for many of the measurements conducted in health care, but self-reports of behavior are particularly subject to problems with social desirability biases. Data that were originally gathered for a different purpose are often used to answer a research question, which can affect the applicability to the study at hand.

Conclusion. In health care and social science research, many of the variables of interest and outcomes that are important are abstract concepts known as theoretical constructs. Using tests or instruments that are valid and reliable to measure such constructs is a crucial component of research quality.

Index terms: Control, quality; Data collection; Errors; Methodology; Research
Am J Health-Syst Pharm. 2008; 65:2276-84

Carole L. Kimberlin, Ph.D., is Professor; and Almut Winterstein, Ph.D., is Associate Professor, Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, University of Florida, Gainesville.
Address correspondence to Dr. Kimberlin at the Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, University of Florida, P.O. Box 100496, Gainesville, FL 32610 (kimberlin@cop.ufl.edu).
The authors have declared no potential conflicts of interest.
Copyright © 2008, American Society of Health-System Pharmacists, Inc. All rights reserved. 1079-2082/08/1201-2276$06.00.
DOI 10.2146/ajhp070364

Measurement is the assigning of numbers to observations in order to quantify phenomena. In health care, many of these phenomena, such as quality of life, patient adherence, morbidity, and drug efficacy, are abstract concepts known as theoretical constructs. Measurement involves the operationalization of these constructs in defined variables and the development and application of instruments or tests to quantify these variables. For example, drug efficacy may be operationalized as the prevention or delay in onset of cardiovascular disease, and the related measurement instrument may ascertain data on the occurrence of cardiac events from patient medical records. This article focuses primarily on psychometric issues in the measurement of patient-reported outcomes. However, similar aspects of measurement quality apply to clinical and economic outcomes. Steps to improve the measures used in pharmacy research are also outlined.

Evaluating the quality of measures

Key indicators of the quality of a measuring instrument are the reliability and validity of the measures. In addition, the responsiveness of the measure to change is of interest in many health care applications where improvement in outcomes as a result of treatment is a primary goal of research. Data sources for measures used in pharmacy and medical care research often involve patient questionnaires or interviews. Measures using patient self-report include quality of life, satisfaction with care, adherence to therapeutic regimens, symptom experience, adverse drug effects, and response to therapy (e.g., pain control, sleep disturbance). In addition, measures can be developed

2276 Am J Health-Syst Pharm—Vol 65 Dec 1, 2008



from patient information available in medical records, including ordered tests or medical examinations, and administrative claims. Some of the measures from these data sources are considered more objective (e.g., laboratory tests) because the reliability and validity of the measures are known, with the error margins and reporting of results meeting generally rigorous standards. However, most data sources involve a greater degree of subjectivity in judgment or other potential sources of error in measurement. In such cases, it is incumbent on the researcher to control for known sources of error and to report the reliability and validity of measurements used.

The Research Fundamentals section comprises a series of articles on important topics in pharmacy research. These include valid research design, appropriate data collection and analysis, application of research findings in practice, and publication of research results. Articles in this series have been solicited and reviewed by guest editors Lee Vermeulen, M.S., and Almut Winterstein, Ph.D.

Reliability. According to classical test theory, any score obtained by a measuring instrument (the observed score) is composed of both the “true” score, which is unknown, and “error” in the measurement process.1 The true score is essentially the score that a person would have received if the measurement were perfectly accurate. The process of developing and validating an instrument is in large part focused on reducing error in the measurement process. There are different means of estimating the reliability of any measure. According to Crocker and Algina,1 the test developer has a responsibility to “identify the sources of measurement error that would be most detrimental to useful score interpretation and design a reliability study that permits such errors to occur so that their effects can be assessed.” Pretesting or pilot testing an instrument allows for the identification of such sources. Refinement of the instrument then focuses on minimizing measurement error.

Reliability estimates are used to evaluate (1) the stability of measures administered at different times to the same individuals or using the same standard (test–retest reliability) or (2) the equivalence of sets of items from the same test (internal consistency) or of different observers scoring a behavior or event using the same instrument (interrater reliability). Reliability coefficients range from 0.00 to 1.00, with higher coefficients indicating higher levels of reliability.

Stability. Stability of measurement, or test–retest reliability, is determined by administering a test at two different points in time to the same individuals and determining the correlation or strength of association of the two sets of scores. The same process may be used when calibrating a medical measurement device, such as a scale. The timing of the second administration is critical when tests are administered repeatedly. Ideally, the interval between administrations should be long enough that values obtained from the second administration will not be affected by the previous measurement (e.g., a subject’s memory of responses to the first administration of a knowledge test, the clinical response to an invasive test procedure) but not so distant that learning or a change in health status could alter the way subjects respond during the second administration.

Internal consistency. Internal consistency gives an estimate of the equivalence of sets of items from the same test (e.g., a set of questions aimed at assessing quality of life or disease severity). The coefficient of internal consistency provides an estimate of the reliability of measurement and is based on the assumption that items measuring the same construct should correlate. Perhaps the most widely used method for estimating internal consistency reliability is Cronbach’s alpha.1-4 Cronbach’s alpha is a function of the average intercorrelations of items and the number of items in the scale. It is used for summated scales such as quality-of-life instruments, activities of daily living scales, and the Mini Mental State Examination. All things being equal, the greater the number of items in a summated scale, the higher Cronbach’s alpha tends to be, with the major gains being in additional items up to approximately 10, when the increase in reliability for each additional item levels off. This is one reason why the use of a single item to measure a construct is not optimal. Having multiple items to measure a construct aids in the determination of the reliability of measurement and, in general, improves the reliability or precision of the measurement.

Interrater reliability. Interrater reliability (also called interobserver agreement) establishes the equivalence of ratings obtained with an instrument when used by different observers. If a measurement process involves judgments or ratings by observers, a reliable measurement will require consistency between different raters. Interrater reliability requires completely independent ratings of the same event by more than one rater. No discussion or collaboration can occur when reliability is being tested. Reliability is determined by the correlation of the scores from two or more independent raters (for ratings on a continuum) or the coefficient of agreement of the judgments of the raters. For categorical variables, Cohen’s5 kappa is commonly used to determine the coefficient of agreement.2 Kappa is used when two raters or observers classify events or observations into categories based on rating criteria. Rather than a simple percent agreement, kappa takes into


account the agreement that could be expected by chance alone.

Often, observational instruments or rating scales are developed to evaluate the behaviors of subjects who are being directly observed. However, any measure that relies on the judgments of raters or reviewers requires evidence that any independent, trained expert would come to the same conclusion. Thus, interrater reliability should be established when data are abstracted from medical charts or when diagnoses or assessments are made for research purposes. Interrater reliability in research such as this depends on developing precise operational definitions of variables being measured as well as having observers well trained to use the instrument. Interrater reliability is optimized when criteria are explicit and raters are trained to apply the criteria. Raters must be trained how to make a decision that an event has occurred or how to determine which point on a scale measuring strength or degree of a phenomenon (e.g., a 3-point scale measuring seriousness of a disease) should be applied. The more that individual judgment is involved in a rating, the more crucial it is that independent observers agree when applying the scoring criteria. Before data gathering begins, training should include multiple cases in which raters respond to simulated situations they will encounter and rate, interrater reliability is calculated, disagreements are clarified, and a criterion level of agreement is met. Interrater reliability should again be verified throughout the study. Even when established observational instruments are being used or criteria are explicit, research that relies on observations or judgments should check reliability, and the study protocol should include procedures to determine the level of observer agreement. In most studies, a percentage of observations (e.g., number of charts reviewed) is randomly selected for scoring by two independent raters rather than requiring that two raters judge all observations. In addition, data to establish the consistency with which the primary rater applies the criteria over time are important for establishing the reliability of the instrument. Rater drift can occur when an individual rater alters the way he or she applies the scoring criteria (i.e., becoming more lenient or stringent) over time. Investigators who build in reliability checks throughout the study as data are collected rather than waiting until the end of data collection can identify instances where interrater reliability has begun to deteriorate, perhaps due to rater drift.

Validity. Validity is often defined as the extent to which an instrument measures what it purports to measure. Validity requires that an instrument is reliable, but an instrument can be reliable without being valid. For example, a scale that is incorrectly calibrated may yield exactly the same, albeit inaccurate, weight values. A multiple-choice test intended to evaluate the counseling skills of pharmacy students may yield reliable scores, but it may actually evaluate drug knowledge rather than the ability to communicate effectively with patients in making a recommendation. While we speak of the validity of a test or instrument, validity is not a property of the test itself. Instead, validity is the extent to which the interpretations of the results of a test are warranted, which depends on the test’s intended use (i.e., measurement of the underlying construct).

Much of the research conducted in health care involves quantifying attributes that cannot be measured directly. Instead, hypothetical or abstract concepts (constructs), such as severity of disease, drug efficacy, drug safety, burden of illness, patient satisfaction, health literacy, quality of life, quality of provider–patient communication, and adherence to medical regimens, are measured. Hypothetical constructs cannot be measured directly and can only be inferred from observations of specified behaviors or phenomena that are thought to be indicators of the presence of the construct.1 Measurement of a construct requires that the conceptual definition be translated into an operational definition. An operational definition of a construct links the conceptual or theoretical definition to more concrete indicators that have numbers applied to signify the “amount” of the construct. The ability to operationally define and quantify a construct is the core of measurement.

To understand how a construct might be operationally defined, consider the example of the efficacy of a new drug product. The ability to improve a patient’s health may be measured by the decrease of certain symptoms, the delay in onset of a certain disease, length of remission, or the prevention of certain clinical complications. Likewise, the theoretical construct of medication adherence may be operationally defined as a one-month recording of number of missed doses as measured by a medication-event monitoring system (MEMS), which includes microprocessors that record the occurrence and time of each opening of a prescription vial. An operational definition of patient satisfaction with health care might be “patient self-reported responses to items on the 18-item short-form version of the Patient Satisfaction Questionnaire (PSQ).”6 An even more precise understanding of the operational definition would involve an examination of the specific items on the PSQ-18 instrument. The importance of a concise operationalization, including data sources and aggregation of information, to measurement validity can be illustrated with a simple outcome, such as onset of diabetes mellitus. A drug’s ability to delay onset could be measured through simple chart review, but diagnosis of diabetes will depend on a patient’s


decision to seek health care and the provider’s ability to recognize symptoms and make the proper diagnosis. Thus, regularly scheduled follow-up visits and the use of explicit screening protocols will likely increase the accuracy of the estimate and yield a more valid result.

In addition, Crocker and Algina1 have pointed to the importance of a theoretical foundation by noting that “constructs cannot be defined only in terms of operational definitions but must also have demonstrated relationships to other constructs or observable phenomena.” New research that gathers information on the constructs measured by a specific instrument, even one that has been widely used in research, contributes to the evidence regarding the construct validity of that test. In this sense, all of the different studies and validation strategies that provide evidence of a test’s validity for making specific inferences about groups of respondents are part of construct validation. Validity evidence is built over time, with validations occurring in a variety of populations. Comprehensive literature reviews on measurement approaches are therefore critical in guiding the selection of measures and measurement instruments.

Construct validity. This type of validity is a judgment based on the accumulation of evidence from numerous studies using a specific measuring instrument. Evaluation of construct validity requires examining the relationship of the measure being evaluated with variables known to be related or theoretically related to the construct measured by the instrument.1,7 For example, a measure of quality of life would be expected to result in lower scores for chronically ill patients than for healthy college students. Correlations that fit the expected pattern contribute evidence of construct validity. All evidence of validity, including content- and criterion-related validity, contributes to the evidence of construct validity.

Content validity. This type of validity addresses how well the items developed to operationalize a construct provide an adequate and representative sample of all the items that might measure the construct of interest. Because there is no statistical test to determine whether a measure adequately covers a content area or adequately represents a construct, content validity usually depends on the judgment of experts in the field.

Criterion-related validity. This type of validity provides evidence about how well scores on the new measure correlate with other measures of the same construct or very similar underlying constructs that theoretically should be related. It is crucial that these criterion measures are valid themselves. With one type of criterion-related validity, predictive validity, the criterion measurement is obtained at some time after the administration of the test, and the ability of the test to accurately predict the criterion is evaluated. For example, the use of surrogate outcomes such as blood pressure and cholesterol levels is based on their predictive validity in projecting the risk of cardiovascular disease, even though some of these associations have been recently questioned. Another type of criterion-related validity is concurrent validity. In establishing concurrent validity, scores on an instrument are correlated with scores on another (criterion) measure of the same construct or a highly related construct that is measured concurrently in the same subjects. Ideally, the criterion measure would be considered to be the gold standard measure of the construct. This strategy of determining the validity of a measure might be seen in a situation in which a new instrument has some advantage over the gold standard measure, such as an increased ease of use or reduced time or expense of administration. These advantages would justify the time and effort involved in the development and validation of a new instrument. An example of such a situation is a researcher developing a self-administered version of an instrument that had been validated for person-to-person interviewer administration. Another example is a clinical researcher wanting to use a brief screening instrument for a condition, such as depression, instead of administering a more extensive measure. Investigators in one study, for example, examined the validity of a single-item question “Do you often feel sad or depressed?” against a more extensive, validated instrument for identifying depression after a stroke.8 The same approach applies to sources of diagnostic data. For example, researchers may want to determine the validity of using administrative claims data to measure a construct represented by a certain clinical event, such as hospitalization for acute myocardial infarction, rather than using chart reviews, which are time-consuming and costly.

Selecting an appropriate and meaningful criterion measure can be a challenge. Often, the ultimate criterion a researcher would like to be able to predict is too distant in time or too costly to measure. The “criterion problem” exists for many of the ultimate criterion measures investigators would like to predict in health care research. For example, a study that aims to evaluate the effect of pharmaceutical care on the “health” of hypertensive patients will likely not have the necessary follow-up time to establish that the intervention results in reduced morbidity or mortality. Instead, a surrogate outcome, such as reduction in blood pressure, is used. Cost of administration of the “best” criterion measures may also be a barrier. For example, an investigator may want to validate a new self-report measure of medication adherence with concurrent measurement using a MEMS cap. However, because MEMS technology is expensive, a less costly measure, such as pill count or refill records, may instead be used to provide evidence of concurrent validity.
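
As a minimal sketch of how concurrent validity evidence might be quantified, the Python example below correlates scores from a new self-report adherence measure with a criterion measure collected at the same time in the same subjects. The article does not prescribe a particular statistic; a Pearson correlation is assumed here, and the function name pearson_r and all data values are hypothetical and chosen only for illustration. The same calculation applied to two administrations of one instrument would give a test–retest estimate.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 subjects")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: percentage of doses taken by 8 subjects, measured
# concurrently by a new self-report item and by pill count (the criterion).
self_report = [95, 80, 100, 60, 90, 70, 85, 100]
pill_count = [90, 72, 98, 55, 80, 75, 80, 96]

validity_coefficient = pearson_r(self_report, pill_count)
print(f"concurrent validity coefficient r = {validity_coefficient:.2f}")
```

A high coefficient in such a sketch would be only one piece of validity evidence, and it is meaningful only to the extent that the criterion measure is itself valid.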

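The reliability coefficients described under Reliability above can also be sketched in code. The Python example below is an illustration under stated assumptions, not the authors' procedure: it uses the standard textbook formulas for Cronbach's alpha (from item and total-score variances), for standardized alpha (from the number of items and the mean inter-item correlation), and for Cohen's kappa (observed agreement corrected for chance agreement). All function names and data values are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Alpha from a list of items, each a list of scores (one per subject)."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Summated scale score for each subject.
    totals = [sum(subject) for subject in zip(*item_scores)]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores must vary")
    item_var_sum = sum(pvariance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


def standardized_alpha(k, mean_r):
    """Alpha from the number of items k and the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical ratings."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two equal-length, non-empty lists of ratings")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical 3-item scale answered by 5 subjects.
items = [[2, 4, 3, 5, 1], [3, 4, 3, 5, 2], [2, 5, 4, 4, 1]]
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")

# With the mean inter-item correlation held fixed, alpha rises with the
# number of items, and the gain per added item shrinks.
for k in (1, 2, 5, 10, 20):
    print(k, round(standardized_alpha(k, 0.3), 2))

# Hypothetical chart-review judgments by two independent raters.
rater_1 = ["event", "event", "none", "none", "event", "none"]
rater_2 = ["event", "none", "none", "none", "event", "none"]
print(f"Cohen's kappa = {cohen_kappa(rater_1, rater_2):.2f}")
```

The loop over item counts is meant to show the pattern noted earlier, in which additional items raise alpha with diminishing returns; the fixed correlation of 0.3 is an arbitrary illustrative value.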

Responsiveness

Responsiveness is the ability of a measure to detect change over time in the construct of interest. For outcome measures intended to evaluate the effects of medical or educational interventions, responsiveness to changes that result from the intervention is required. Reliability is a crucial component of responsiveness. The “noise” that is due to measurement error can mask changes that may, in fact, be attributable to the intervention. For example, using a scale manufactured to weigh trucks will not be helpful when evaluating a new weight-loss drug in humans because the estimates will be too imprecise to identify small changes. The measurement will be valid yet unreliable or imprecise. A new disease-specific quality-of-life instrument that has not demonstrated stability over time when there is no change in health status (which may be an indication of measurement error) may not be able to detect health status changes. Measures that have ceiling effects have a limited ability to assess positive changes that may result from the intervention because there is limited room for subjects to improve their scores. Responsiveness to change can legitimately differ from one population to another, which is why the measure must be appropriate to the subjects being studied. For example, a measure of activities of daily living that includes the ability to dress or wash oneself may be responsive to change among an elderly population of patients undergoing physical therapy or cardiac rehabilitation. However, because of a ceiling effect, it would probably not be sensitive to change among a younger group of newly diagnosed hypertensive patients who have not experienced significant disability due to the disease or to the aging process.

Selecting an existing instrument

Before developing a new test or measure, an investigator should identify existing instruments that measure the construct of interest. Using an existing instrument that has substantial evidence of reliability and validity in a variety of populations is more cost-effective than starting from scratch to develop and validate an instrument.

In selecting an instrument, the following questions should be addressed:

1. Do instruments already exist that measure a construct that is the same as or very similar to the one you wish to measure? Before you begin searching for instruments, you must have a clearly defined construct or concept that you wish to measure, along with an operational definition and some evidence that the construct can be measured as defined. For example, there is agreement that the efficacy of a new blood-pressure-lowering medication is ultimately defined by a reduction in macrovascular events, but what about the efficacy of a palliative agent for cancer patients? A literature search can help identify how other researchers have defined the construct or a closely related construct. The literature search will ideally result in a list of outcomes and instruments that you can evaluate for possible use in your research.
2. How well do the constructs in the instruments you have identified match the construct you have conceptually defined for your study? In evaluating whether there is congruence, do not rely on the title of the measure or on the operational definition of the construct that appears in a research article or the description of variables in a secondary database, such as a medical record or administrative claims database. Real understanding of the measure usually requires an examination of the actual items or questions and the way data were generated or documented. For example, reviewing the actual items used in a questionnaire to evaluate disease-specific quality of life will provide a better understanding of what aspect or conceptualization of quality of life is addressed. Talking with physicians about their progress notes will aid in deciding whether certain patient information can be expected to be documented in a patient chart or what is often omitted.
3. Is the evidence of reliability and validity well established? Has the measure been evaluated using various types of reliability estimates (e.g., both internal consistency and test–retest) and varied strategies for establishing validity (e.g., content and concurrent validity as well as more extensive evidence of construct validity in varied populations)? Has it been validated in a population similar to the one you will be studying?
4. In previous research, was there variability in scores with no floor or ceiling effects? Did previous studies have a large amount of missing data, either on the measure itself or on items within the measure?
5. If the measure is to be used to evaluate health outcomes, effects of interventions, or changes over time, are there studies that establish the instrument’s responsiveness to change in the construct of interest? Obviously, it is important that change in measurement be due to change in the construct rather than to the instability of scores (i.e., lack of reliability of the measure itself). In addition, it would be helpful if there were data on how much change in scores would be required to be considered clinically meaningful.
6. Is the instrument in the public domain? If not, it will be necessary to obtain permission from the author for its use. Even though an instrument is published in the scientific literature, this does not automatically mean that it is in the public domain, and permission from the author and publisher may be required. If it is a copyrighted instrument, you may have to pay a fee to purchase or use the instrument. Some instruments may also require additional fees for scoring.


7. How expensive is it to use the instrument? A mail questionnaire costs less to administer than do telephone or face-to-face interviews. Using electronic data is usually less costly and time-consuming than conducting medical record reviews. However, electronic data may not contain information that is available on patient charts, so a thorough understanding of the limitations of the data available as well as the requirements of measurement for your study is important.
8. If the instrument is administered by an interviewer or if the measure requires use of judges or experts, how much expertise or specific training is required to administer the instrument?
9. Will the instrument be acceptable to subjects? Does the test require invasive procedures? Is the reading level appropriate? Is the respondent’s burden, including complexity of questions and time needed to complete the instrument, unlikely to affect response rates or the quality of responses?

Keep in mind that reliability and validity evidence from established instruments is applicable only if you use the instrument in the same form and follow the same administration procedures as used in the validation study. Modifications of validated instruments may require permission from developers and also require validating the modified instrument as if it were a new instrument.

Researchers may be tempted to conclude that available measures do not meet their needs and that they must develop their own instruments. They may view the measures they want to develop as being so straightforward, such as a few questions measuring patient knowledge or a specific item from a medical chart, that they do not need to conduct a pilot test to determine reliability and validity. Researchers may then go to considerable effort collecting data only to find at the end of the study that subjects do not vary much in their responses to the instrument or that documentation in the chart was inadequate, so the measure was not able to correlate with any other variable of interest. Subjects may misinterpret questions. Responses may be highly skewed. Internal consistency may be so low that item responses cannot reasonably be combined into a single summated score. In other types of studies, a researcher may obtain biased results by incorrectly assuming that diagnostic codes are valid without determining their relationship to other measures that should indicate the presence of the disease. Assuming medical records adequately capture the information needed to construct a measure and that chart reviewers will interpret information uniformly can also threaten the validity of findings. Careful attention to the development of instruments, regardless of how straightforward the measures may seem, along with pilot testing to determine their reliability and validity, is crucial to the conduct of quality research.

Item-response theory

In recent years, Rasch models and item-response theory (IRT) or latent-trait models have provided an alternative framework for understanding measurement and alternative strategies for judging the quality of a measuring instrument. Readers are referred to other resources for more information on Rasch and IRT models.1,9-11

The National Institutes of Health, along with research teams throughout the United States, initiated the development of the Patient-Reported Outcomes Measurement Information System, which will create item banks of patient-reported outcomes validated using modern measurement theory.12 This initiative is building item pools and developing questionnaires that measure key health outcomes related to many chronic diseases, including measures such as fatigue and pain. These items will be available to investigators, and the repository will become a resource for “accurate and efficient measurement of patient-reported symptoms and other health outcomes in clinical practice.”12

Measurements using self-report

For many of the measurements used in health care, researchers rely on the self-report of patients or subjects. With surveys, researchers rely on responses to questions to provide measurements of the constructs of interest. While self-reports of behavior, beliefs, and attitudes are prone to known biases, there are no acceptable alternative means of measurement for many constructs (e.g., level of pain, depression, patient satisfaction with care, quality of life).

Self-reports of behavior such as dietary intake, adherence to medication regimens, and exercise frequency and intensity are particularly subject to problems with social desirability biases. Subjects may provide responses that are socially acceptable or that are in line with the impression they want to create. In addition, self-report questions may elicit an estimation of behavioral frequency rather than the recall and count response desired by the researcher. The use of estimation rather than recall is a function of how information is retrieved from memory, how frequency-response scales are formulated, and other specific aspects of the instrument.13-15 For example, behaviors that occur with high frequency, such as dietary intake or taking a scheduled medication for a chronic condition, are not likely to be specific in memory for a very long period of time. If it is desired that specific events be recalled rather than estimated, the time frame must be of very short duration and in the immediate past. Therefore, asking patients how many doses of a medication they missed in the past month or past year will likely result in an estimate or educated guess, whereas a question about the past 24 hours or past three days may reflect actual recall. Asking

Am J Health-Syst Pharm—Vol 65 Dec 1, 2008 2281



subjects about stressors they encountered in the past 24 hours is likely to lead to recall of minor, daily hassles, whereas a question about stressors in the past year is likely to lead the subjects to interpret the question as being about major life events and answer accordingly. When a list of alternative responses is provided, the response options themselves determine the way subjects interpret the question and the way they respond.

Often, the response choices require subjects to provide their own judgment about frequency using undefined response alternatives (e.g., on an ordinal scale from “seldom” to “frequently”). Such terms can mean very different things to different subjects. One person who reports ingesting a “moderate” amount of alcohol may be referring to two to three alcoholic drinks a day, while someone else may define moderate consumption as two to three drinks a month. When asking questions about frequency of behavior, it is usually best to let the subject fill in the blank on an item with a clearly defined reference period. An example of such a question is “How many doses of (specific medication) have you missed taking completely in the past three days?” The open format requires a specific description of the behavior of interest as well as a specific time frame.

The International Epidemiological Association’s European Questionnaire Group issued a report on problems arising from the questionnaires used to collect information on exposure, outcomes, and confounders.16 The report noted that

    Published results often fail to reproduce the exact wording of key questions used to define exposures or outcomes, nor do they always provide adequate information on how the data collection instruments were developed, or if procedures such as pre-testing, validity checks, or pilot studies were used to ensure accuracy.

Use of self-report or poorly designed measures can result in misclassification bias (error in classifying either exposure status or effect [e.g., disease] in patients or subjects). Patient recall of previous drug exposure, for example, has been shown to be subject to error.17-19

In case–control studies, recall bias is of concern when there are no objective markers of exposure. Individuals with the disease or outcome of interest are more likely to remember relevant exposures than are healthy controls.20 One approach that is recommended to address this recall bias is to have a control group affected by a disease different from that of cases to introduce a similar bias toward recall of exposure.

Use of secondary data
Data originally gathered for a different purpose are often used to answer a research question. These data may have addressed a different research question or may have been gathered for clinical, billing, or legal purposes. Secondary data include pharmacy records, electronic or paper medical records, patient registries, and insurance claims data. The first consideration when deciding whether secondary data can be used is to verify that the data set appropriately measures the variables required to answer the research questions. If the data elements are not present, consideration can be given to whether appropriate proxy measures of variables of interest are available. The use of proxy measures requires careful conceptual analysis of how closely the variables of interest and proxy measures are associated. For example, it seems intuitive that a claims database could be used to identify all patients who suffered a stroke during a certain time period as long as they were eligible for benefits. However, strokes may have been silent and required no medical intervention, patients may have died before medical care could be sought, stroke may have been misdiagnosed, or certain medical services may not have been covered by the insurance company and thus may not appear in the billing database. Understanding how the information represented in the data set was generated, whether and how it was coded, who coded it and for what purpose, and how consistent coding was across sites and at different times in longitudinal data sets or among different coders is important in evaluating the reliability of data. Utilization of diagnostic codes in charge data of clinical encounters has frequently been criticized because the selection of codes is often driven by reimbursement rather than clinical accuracy. Examining prior research that has applied these data sets can help determine what is known about the reliability and validity of data.

Even when original medical charts are used, it must be recognized that this information was not collected for research purposes and that documentation was guided by institutional policy, provider training, and provider preference. Moreover, while retrospective chart review is often used as the gold standard for validation of other measures, chart review is itself vulnerable to problems of unreliability, even though evidence of the reliability of data abstracted from charts is frequently not reported in research articles. A review of research in emergency medicine journals found that, of 244 articles utilizing chart review for data abstraction, interrater reliability was mentioned in 5% and statistically tested in only 0.4% of the articles.21 The authors of the review also reported that additional steps to ensure the reliability and validity of chart review data (e.g., use of a standardized abstraction form, abstractor training, abstractor monitoring, blinding of abstractors to study hypotheses) were not mentioned in the study methods. Blinding to study hypotheses was mentioned in only 3% of studies, even though observer


bias is a recognized source of inaccurate study results. Other research has found that certain types of data elements abstracted from charts do not have adequate levels of interrater reliability.22 Researchers interested in extracting data from medical charts are referred to articles describing procedures that can help ensure the quality of data abstracted.23,24

Use of surrogate measures
The Food and Drug Administration defines a surrogate endpoint of a clinical trial as “a laboratory measurement or a physical sign used as a substitute for a clinically meaningful endpoint that directly measures how a patient feels, functions, or survives. Changes induced by a therapy on a surrogate outcome are expected to reflect changes in a clinically meaningful endpoint.”25

The use of surrogate outcomes to operationally define a construct, such as drug efficacy, has become increasingly popular, as application of these measures is typically faster and less costly. Results are obtained after shorter follow-up periods, and the number of patients and length of time patients have to participate in experiments are reduced. For a surrogate outcome to be valid, it should be in the direct pathophysiologic pathway of a disease, and it should be reasonable to expect that the pharmacologic action of the new drug is mediated through this pathway. If these two conditions are true, the drug effect on the surrogate outcome can be extrapolated toward “true” measures of morbidity or mortality. However, even well-established surrogate outcomes have recently been questioned.26 For example, the Heart and Estrogen/Progestin Replacement Study found that the demonstrated improvement in low-density-lipoprotein (LDL) and high-density-lipoprotein cholesterol levels did not result in an expected improvement on cardiac events.27 Most recently, the Effect of Combination Ezetimibe and High-Dose Simvastatin vs Simvastatin Alone on the Atherosclerotic Process in Subjects with Heterozygous Familial Hypercholesterolemia (ENHANCE) trial found negative results for effects on intima–media thickness, even though the combination of ezetimibe and simvastatin demonstrated improved effects on LDL cholesterol levels as well as C-reactive protein.28 These examples have alerted the research community that surrogate outcomes remain nothing more than substitutes and can only approximate the truth.

There are few surrogate outcomes with superior scientific acceptance of validity than LDL cholesterol, which should caution us about the use and interpretation of research findings. Following this train of thought and recalling the previous discussion of the operationalization of theoretical constructs, it could be argued that all measures only approximate the truth. Invalid or unreliable measures can harm a study to the same extent as a poor study design or inadequate sample size.

Conclusion
In health care and social science research, many of the variables of interest and outcomes that are important are abstract concepts known as theoretical constructs. Using tests or instruments that are valid and reliable to measure such constructs is a crucial component of research quality.

References
1. Crocker L, Algina J. Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich; 1986:1-527.
2. Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994:251.
3. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951; 16:297-334.
4. DeVellis RF. Classical test theory. Med Care. 2006; 44(11, suppl 3):S50-9.
5. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960; 20:37-46.
6. Marshall GN, Hays RD. The Patient Satisfaction Questionnaire Short-Form (PSQ-18). Santa Monica, CA: RAND; 1994.
7. Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull. 1959; 56:81-105.
8. Watkins C, Daniels L, Jack C et al. Accuracy of a single question in screening for depression in a cohort of patients after stroke: comparative study. BMJ. 2001; 17:1159.
9. Bond TG, Fox CM. Applying the Rasch model: fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum; 2001:1-288.
10. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory. Newbury Park, CA: Sage; 1991:1-153.
11. Smith EV, Smith RM. Introduction to Rasch measurement. Maple Grove, MN: JAM; 2004.
12. National Institutes of Health. PROMIS: Patient-Reported Outcomes Measurement Information System. www.nihpromis.org/default.aspx (accessed 2008 Jun 2).
13. Schwarz N. Self-reports: how the questions shape the answers. Am Psychol. 1999; 54:93-105.
14. Schwarz N, Oyserman D. Asking questions about behavior: cognition, communication and questionnaire construction. Am J Eval. 2001; 22:127-60.
15. Sudman S, Bradburn N, Schwarz N. Thinking about answers: the application of cognitive processes to survey methodology. San Francisco: Jossey-Bass; 1996:1-304.
16. Olsen J. Epidemiology deserves better questionnaires. Int J Epidemiol. 1998; 27:935.
17. Beiderbeck AB, Sturkenboom MC, Coebergh JW et al. Misclassification of exposure is high when interview data on drug use are used as a proxy measure of chronic drug use during follow-up. J Clin Epidemiol. 2004; 57:973-7.
18. Ray WA, Thapa PB, Gideon P. Misclassification of current benzodiazepine exposure by use of a single baseline measurement and its effects upon studies of injuries. Pharmacoepidemiol Drug Saf. 2002; 11:663-9.
19. Korthuis PT, Asch S, Mancewicz M et al. Measuring medication: do interviews agree with medical record and pharmacy data? Med Care. 2002; 40:1270-82.
20. Tripepi G, Jager KJ, Dekker FW et al. Bias in clinical research. Kidney Int. 2008; 73:148-53.
21. Gilbert EH, Lowenstein SR, Koziol-McLain J et al. Chart reviews in emergency medicine research: where are the methods? Ann Emerg Med. 1996; 27:305-8.
22. Yawn BP, Wollan P. Interrater reliability: completing the methods description in medical records review studies. Am J Epidemiol. 2005; 161:974-7.
23. Reisch LM, Fosse JS, Beverly K et al. Training, quality assurance, and assessment of medical record abstraction in a multisite study. Am J Epidemiol. 2003; 157:546-51.
24. Gearing RE, Mian IA, Barber J et al. A methodology for conducting retrospective chart review research in child and adolescent psychiatry. J Can Acad Child Adolesc Psychiatry. 2006; 15:126-34.
25. Temple RJ. A regulatory authority’s opinion about surrogate endpoints. In: Nimmo WS, Tucker GT, eds. Clinical measurement in drug evaluation. New York: Wiley; 1995:1-22.
26. D’Agostino RB Jr. The slippery slope of surrogate outcomes. Curr Control Trials Cardiovasc Med. 2000; 1:76-8.
27. Hulley S, Grady D, Bush T et al. Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. JAMA. 1998; 280:605-13.
28. Kastelein JJ, Akdim F, Stroes ES et al. Simvastatin with or without ezetimibe in familial hypercholesterolemia. N Engl J Med. 2008; 358:1431-43. [Erratum, N Engl J Med. 2008; 358:1977.]
