
31 Assessment of Health Outcomes

Dorcas Eleanor Beaton  •  Maarten Boers  •  Peter Tugwell

KEY POINTS

Any single health outcome can give only a partial view of the impact of a disease on a patient.

Core sets are minimal, but not exclusive, domains of outcomes agreed on by professional groups as important to include in studies; they are available for several rheumatologic conditions.

Defining measurement need is key to the choice of the right instrument.

Choosing an instrument follows a step-by-step process: looking for evidence of the practical aspects of using the instruments and of their statistical properties.

If an instrument lacks evidence of a certain property, a study can be conducted to create the evidence, rather than abandoning the instrument.

In an era of rising health care costs, increased choice, and greater provider accountability,1 health outcome measures have become essential tools for researchers, clinicians, and funding bodies. By definition, outcomes refer to “all possible effects of a disease or intervention.”2 Outcomes cover a spectrum of the burden of arthritis on a patient from biomarkers of disease to subjective appraisals of overall well-being. Other chapters refer to some of the most common measures of health and disease encountered in rheumatology, including the Disease Activity Scale (DAS, DAS28),3 the Health Assessment Questionnaire (HAQ),4 and the SF-36 (short form, 36 items).5 Using any one of these health outcome assessments is like looking out a window in a house, with the burden of arthritis the landscape outside. Each window in the house provides a view of the outside world, but it is a specific view defined by the size of the window and the side of the house it is on. Another window may offer a slightly better angle on what you would like to see. Different health outcome assessments can have a degree of overlap in their views, in which case an informed choice needs to be made between them, whereas others can hold quite distinct views. One assessment might be useful to compare the burden of arthritis against the general population, another might be useful to assess differences in the prevalence in different subgroups, and yet another might be useful to measure the specific benefits of an arthritis intervention.

This chapter focuses on describing the different windows clinicians have on the burden of arthritis and how they relate to each other. A framework is provided for ensuring that a selected measure is the right one for a given need. This chapter addresses four questions: (1) What health outcome assessment tools are available generally and specifically for use in rheumatology? (2) How do the different assessment tools relate to one another? (3) How does one characterize what one needs to measure? (4) How does one find a measure that can meet that need?

WHAT HEALTH OUTCOME ASSESSMENT TOOLS ARE AVAILABLE?

In reading the rheumatology literature, and in monitoring the clinical care of patients, certain highly relevant outcomes emerge. Groups of these outcomes that recur together are termed core sets of outcomes.

DISEASE-SPECIFIC MEASURES: THE CORE SETS

Core sets are the minimal, but not exclusive, set of domains to be measured in a study of arthritis. Historically, they follow the Ds of outcome measurement in arthritis: disability, disease activity, damage, discomfort, dissatisfaction, and death.6,7 They are usually recommended by groups such as OMERACT, EULAR, ILAR, or ACR or groups formed around specific diseases, such as ASAS for ankylosing spondylitis or GRAPPA for psoriatic arthritis. All of these groups have a great interest in agreeing on a common set of relevant and psychometrically sound outcomes that would allow them to compare findings across studies and clinical care.

Table 31-1 presents the list of core sets for clinical trials in six types of arthritis7-16; it also shows what each group recommends as additional domains or as needing more research before they become core set members. The first column on the left of Table 31-1 is the core set for longitudinal observational studies in rheumatology by Wolfe and colleagues7 with some minor additions. Wolfe’s core set is broader than the core sets presented in other columns because observational studies are often looking for a broader range of outcomes that are relevant for studies outside of treatment trials. It also is designed to be used across different forms of arthritis. The table also uses this broader list of outcomes as an axis against which the other core sets can be described. The remaining columns show that across different types of arthritis, the core sets have many common elements; most contain or recommend pain, physical function, patient and clinician global assessments, and markers of inflammation. Many also include disease activity indices, which are often an aggregation across other clinical findings (e.g., joint count, acute-phase reactants, global ratings of severity) into a score reflecting the activity of the disease at that point in time. Some core sets contain domains reflecting the unique aspects of the disease (e.g., spinal mobility in ankylosing spondylitis)17 or the unique target of the study using the outcomes (fractures are core outcomes only in osteoporosis studies focused on fracture prevention).18

Table 31-1  Core Sets for Six Rheumatologic Conditions and for Longitudinal Observational Studies*

Clinical Trial Core Sets of Domains by Disease Group


Longitudinal Study Core Rheumatoid Systemic Lupus Ankylosing

Set of Domains7 Arthritis15,16 Osteoporosis11 Osteoarthritis13 Erythematosus14,19 Spondylitis8 Psoriatic Arthritis10,12
Health status/quality of life ✓
Quality of life R (utility) R (BL)(†) R ✓ ✓
Symptoms ✓ Pain R (back)(†) ✓ Pain R (fatigue) ✓Pain ✓ Pain
✓ Fatigue
Physical function ✓ Disability R(†) ✓ R ✓ ✓
Psychosocial R R

Disease process ✓
Aggregate index ✓ (EULAR DAS) ✓ DAI, R = severity Pending ✓ DAS
Biomarkers ✓ Biochemical O
Joint tenderness ✓ 28 or 68 joints ✓ Peripheral (44 joints) ✓

Enthesitis Enthesitis‡
Joint swelling ✓
Joint stiffness O ✓ Spinal stiffness
Global ✓ Spinal mobility
Patient ✓ ✓ ✓ ✓
Physician ✓ R ✓
Acute-phase reactants ✓ O R ✓‡ ✓
Damage ✓
Radiography or imaging ✓ >1 yr ✓ Bone mineral density ✓ >1 yr ✓ Spine and hip§ ✓ Structural
Organ damage R (BL)/ ✓ (†) fractures ✓ Damage index ✓Skin disease
R (BL)/ ✓ (†) Change height
Toxicity effects ✓ R ✓ ✓
Death ✓

Dollar costs [R] R R(†) R ✓

Work disability [R] R
*The left-hand column is the broader set of measures recommended by Wolfe and colleagues7 for longitudinal studies. It serves here as a broader range of outcomes, and an axis for the organization of core outcomes
in the other conditions (columns).
✓, core domain; R, recommended for further research and possible inclusion in core set; O, optional outcome (osteoarthritis category only).
Osteoporosis: Core set depends on focus: BL, bone loss studies; †, studies aiming to reduce fracture rates.
Ankylosing spondylitis: Core set elements vary depending on focus of study: ‡, clinical records and symptom modifying; §, disease modifying only; others = all.
DAI, disease activity index; EULAR DAS, European League Against Rheumatism Disease Activity Scale (revised = DAS-28, 28-joint count).

Table 31-1 focuses on the core domains that should be measured; the next step is deciding on the instrument that is able to capture each domain well, in a reproducible and accurate manner. In some cases, an instrument choice has been suggested (e.g., the HAQ for disability in rheumatoid arthritis). Other times several options are provided. Strand and coworkers19 reviewed six disease activity indices in systemic lupus erythematosus and found they gave comparable results. In some cases, the domains are shared, but the measurement technique varies within or by disease; in rheumatoid arthritis, the DAS28 uses 28 joints,3 and in ankylosing spondylitis, 44 joints are counted.17 Some of the more commonly encountered instruments in arthritis are briefly reviewed.

Health Status/Quality of Life

General Health Status. Generic health outcomes provide information on an aspect of health across many conditions so that, in theory, the burden of low back pain can be compared with that of arthritis or diabetes. This comparison depends on the ability of that measure to capture the burden in a disease group well. Generic measures have the advantages of allowing comparisons across diseases and covering a broader range of health issues, including things that may have been overlooked in a core set (e.g., mental health issues). Often because of their breadth, however, the generic measures tend not to delve as well into the depth of experience in any one disease. Arthritis-related fatigue is not detected well in many generic measures because a generic measure asks about being tired or not sleeping well. Because of this, a generic measure is usually weaker in its ability to detect specific changes and in its sensitivity to different levels of disease activity and should usually be supplemented with disease-specific measures (described previously).20

Two commonly used generic measures are the Sickness Impact Profile (SIP)21 and the SF-36 (ref. 22). The SIP is a 136-item list of illness behaviors that provides a weighted score for the impact of a disease across 12 categories, such as bodily pain, work and role functioning, and dressing,21 which lead to global scores (physical, psychosocial, and overall). The SIP has been shown to measure illness across a wide variety of health conditions.20 The SF-36 is a 36-item questionnaire of which 35 items are used to obtain eight domain scores (including physical functioning, mental health, role functioning, and pain) scored on a 0-to-100 scale (with 100 = better health)22 and two summary scores (mental and physical); the summary scores are norm-based, with a mean of 50 and a standard deviation of 10. The SF-36 and briefer SF-12 are well supported online and through manuals that supply age and disease group distributions of scores.23 Direct comparisons of generic measures have shown differences in scores and health states attributable to the choice of measure.24-26 Studies or clinical results may not be comparable to each other if they are using different health status scales.

Utilities: Value of Health State. Utility scales offer an overall score for the value of a health state, setting death at 0 and full health at 1. The emphasis is not on describing the state, but on assigning a value, worth, or preference to that state.27,28 Utilities are needed for economic appraisals and form the health assessment for cost per quality-adjusted life-year estimations. Utility values can be obtained by direct or indirect methods. Direct methods, such as standard gamble and time tradeoff, involve the respondent working through exercises to elicit the value for his or her own health state against things such as time or more or less favorable health situations.27 Indirect methods capture the state with standardized questions and apply predetermined weights.28 Examples include the EQ-5D, which is five items (three response categories) combined to describe a health state. Similarly, the Health Utilities Index gathers information on six or seven dimensions of health (depending on the version) on response scales with five to six options to define a health state.28 Both these scales then use weights determined in different populations to assign the value to these health states, hence the “indirect” weighting. The absolute value obtained across these different approaches varies.27

Generic measures of health and utility scores are very broad. They often do not perform as well as the more specific measures described in the following sections because they are designed to allow comparisons across populations and need to include items that might not be relevant in arthritis. In some areas, there are measures of quality of life that are designed for a particular disease, such as rheumatoid arthritis or osteoporosis, which offer a measure of the broad concept of quality of life in a disease-specific manner. Such measures would not allow comparisons across diseases. At the time of this writing, these were not yet recommended as core instruments.

Symptoms. Pain is usually measured using a 10-cm visual analog scale or a 0-to-10 numeric rating scale of the intensity of the symptoms.29 This simple measure has been well tested and is easily understood by patients. Fatigue is another important symptom and quite distinct from being “tired.” The ankylosing spondylitis modified core set contains fatigue, and it is a recommended area of further research in rheumatoid arthritis and lupus. Measures are being tested or developed at present. Ankylosing spondylitis is currently using the 10-cm visual analog scale of fatigue from the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI).30

Disability Scales. Physical disability in rheumatoid arthritis and osteoarthritis is often measured using the Health Assessment Questionnaire–Disability Index (HAQ-DI),31 which covers 20 items looking at different aspects of daily functioning. Patients score each item on a 0-to-3 scale, where 3 represents the greatest disability. Scores are obtained for each domain and summed into a total score expressed on the same 0-to-3 scale. Scores are adjusted to a 2/3 if an aid is used to complete a task. More details on the HAQ-DI are widely available in print and on the Internet.

There are other scales asking about physical function, such as the Arthritis Impact Measurement Scale (AIMS; ref. 32) and the AIMS2 (ref. 33), and measures with even more specific foci, such as the Western Ontario McMaster (WOMAC) osteoarthritis index, which is commonly used in hip and knee osteoarthritis,34 and the AUSCAN osteoarthritis index for hand osteoarthritis.35

Disease Process (Activity, Severity)

Core sets often include indices of disease process. Disease process can be divided into activity (inflammatory activity) and severity (overall severity of disease). There are several
disease activity indices, the most commonly known being the DAS (ref. 36) and DAS28 (ref. 3) in rheumatoid arthritis. Using a subset of the core outcomes (i.e., acute-phase reactants, joint counts, global ratings), the EULAR group formulated a weighted index that provides a score of 2 to 10 (DAS) or 0 to 9 (DAS28). From this, cutoffs were established to define high, moderate, and low disease states. The low disease state (DAS28 <2.6) is considered an indicator of remission of arthritis and is touched on again later. More recently, the concept of “stable remission” has been proposed; it is defined as an initial DAS28 of less than 3.2 and a change in DAS over time of less than 1.2 (2 standard errors).37 Disease activity indices track the level of inflammatory activity. There are others, such as the BASDAI30 or the six available in systemic lupus erythematosus.19 When more than one is available, look for direct comparisons such as Strand’s to see if similar information is provided.19,38

Damage Indices

A great deal of work has gone into measures of joint damage in arthritis. van der Heijde and Landewe17 provide a succinct summary of the Sharp, van der Heijde, and Larsen/Scott techniques. Guidelines should be followed closely, focusing on the hands and feet. Changes in radiographic progression of joint damage often are measured by the smallest detectable difference, the boundary between measurement error (day-to-day variability) and change discernible from error.39,40

Disadvantages: Toxicity/Adverse Events

Medical and nonmedical management of many rheumatic conditions carries a risk of toxicity and adverse events,41 many of them unexpected. A comprehensive documentation of a range of adverse events is important in outcome assessments, separate from the treatment benefits.42 An OMERACT group is currently working on standardizing reporting of toxicities in rheumatologic trials.43

Arthritis is associated with increased mortality, and arthritis-specific mortality should be monitored. Death is not specifically mentioned in the disease-specific core sets for clinical trials, but would be important to monitor in observational studies. Attributing death to arthritis is challenging given its dependence on documentation at time of occurrence, which may or may not be linked to underlying arthritis in the coding.

Dollar Costs

The economic burden of arthritis and the cost of care are core outcomes for psoriatic arthritis and recommended for consideration in observational studies and by the rheumatoid arthritis, osteoporosis, and lupus disease groups. Standards for what should be included in a cost analysis are under development.

Disease Self-Management

Lorig’s work in self-management programs has often focused on improving self-efficacy, which has been found to reduce pain and health care use.25 Self-efficacy is a health outcome. Similarly, Tugwell’s “effective consumer” captures the degree to which the patient is effectively managing his or her own health care decisions, interactions with the health care team, and disease monitoring.44 This may be a reasonable target for self-help interventions.

EMERGING DOMAINS

Work Productivity and Disability

With the shift toward more aggressive management of earlier disease, more people with arthritis are working, and outcomes need to shift to track work disability (absenteeism and at-work productivity loss).7 Work is hard to measure because it depends on the job and the organization the individual works in. Absenteeism can mean many things, and measures should articulate how this was operationalized: full days off work, days on insurance payments, or full and part days off work. More challenging still is measuring the difficulty someone is having at work (presenteeism). There are several measures (16 were found in a more recent review45), but there are few direct comparisons of these conceptually diverse instruments. The most commonly used is the work limitations questionnaire (amount of time experiencing difficulty).46 Two scales developed in arthritis are promising: Gignac’s Work Activity Limitations Scale (amount of difficulty experienced)47 and Gilworth’s Work Instability Scale (risk of future work loss).48 Direct comparisons of these measures are under way.

Nonpaid Work

Participation in valued nonpaid roles, such as parenting, volunteer work, or leisure activities, can be an important aspect of the burden of disease.49 Outcome measures reflecting this are needed to capture fully the concept of participation.

Patient-Specific Domains

Patient-specific scales, including the MACTAR or PET in arthritis,50,51 allow the patient to nominate his or her own scale content, within a guided framework. Most patients report three to five items that are particularly salient to them. A surprising number of these scales have been developed.52 Each taps relevant content for that patient, and because of this they also are responsive to change.53 The challenge is in the mathematics and how to analyze the numeric score when the items vary across patients. Analysis that focuses on individual level quantification is likely best.

Satisfaction with Health Outcomes

Satisfaction scales are often linked with the goals of a health care organization and focus on the attributes of the structure and process of care. Instruments developed to look at satisfaction with a specific health end point (e.g., how satisfied are you with the results of your surgery)54 become a health outcome. Satisfaction with outcome is complex, and Hudak and colleagues55 point out the complex balance of experiences and ability to “live with” ongoing limitations that influence a patient’s response.
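
The DAS28 cutoffs quoted under Disease Process above can be illustrated with a minimal Python sketch. It uses only the numbers given in the text (remission indicator below 2.6; “stable remission” as an initial DAS28 below 3.2 with change below 1.2). The function names are hypothetical. Measuring change against the initial score, and using strict inequalities, are assumptions of this sketch rather than part of the published definition.

```python
from typing import List

REMISSION_CUTOFF = 2.6      # DAS28 below this is taken as an indicator of remission
STABLE_START_CUTOFF = 3.2   # initial DAS28 must be below this for "stable remission"
STABLE_MAX_CHANGE = 1.2     # change over time must stay below this (2 standard errors)


def indicates_remission(das28: float) -> bool:
    """True if a single DAS28 score falls below the remission cutoff."""
    return das28 < REMISSION_CUTOFF


def is_stable_remission(scores: List[float]) -> bool:
    """scores: DAS28 values in time order, the first being the initial score.

    Assumption: change is measured as the absolute difference between each
    later score and the initial score.
    """
    if not scores:
        return False
    initial = scores[0]
    if initial >= STABLE_START_CUTOFF:
        return False
    return all(abs(s - initial) < STABLE_MAX_CHANGE for s in scores[1:])


if __name__ == "__main__":
    print(indicates_remission(2.4))
    print(is_stable_remission([3.0, 3.5, 2.8]))
```

A real analysis would follow the cited definition for how change over time is computed; the sketch only shows how the two thresholds combine.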

Sleep

The OMERACT patient group has identified sleep quality as an important domain for outcome assessment. The concept and measurement of it are expected to be addressed at the upcoming OMERACT 9 meeting.

With an array of potential measures, or windows to view the burden of disease, how does one organize them to get an accurate picture of the whole? Conceptual frameworks help one to understand how domains, such as those described previously, theoretically relate to one another when applied in research or clinical practice. They also usually provide an operational definition of each of their domains, which becomes essential when choosing between instruments or deciding how to model the outcome of an intervention and its modifying factors.56,57 Conceptual frameworks facilitate accurate communication and the generation of hypotheses about a disease process and its impact. In the past, the most common conceptual framework was the main (sometimes causal) pathway of a biomedical model: Pathology leads to organ/system changes leads to pain/symptoms leads to functional loss leads to diminished quality of life. This is similar to a Wilson and Cleary model58 and might help in understanding that function and quality of life might relate more closely than sedimentation rate and quality of life, based on their proximity and distance in the model.

Expanding on this type of model are conceptual frameworks such as Nagi59 or Verbrugge’s60 disablement process. In these models, there is a main pathway, with slightly different definitions, but there also are boxes of influence from outside this pathway: the patient’s personal factors and environmental factors. In 2001, the World Health Organization ratified the International Classification of Functioning (ICF),61 which offers an even more expanded framework based on a biopsychosocial understanding of disease. It is receiving wide endorsement in arthritis.56,62 In this model (Fig. 31-1), there are three main elements of burden. A disease can affect an individual by impairment (body part), activity limitations (individual’s ability to do a task), or participation restriction (restrictions in the execution of the individual’s roles in life situations). The boxes in Figure 31-1 are joined by bidirectional arrows because this model suggests that the right-side boxes also could influence the left with secondary problems (e.g., joint contractures or wounds secondary to prolonged bed rest). For the ICF, disablement is not a linear progression; it is a dynamic interaction influenced by several factors.57 The ICF strongly emphasizes the influence of personal factors and environmental factors on all three domains. Several initiatives are under way to examine the fit and usefulness of the ICF framework in various forms of arthritis.57,63,64 Readers are referred to Jette and Keysor65 for an excellent comparison of the ICF and Verbrugge models in the context of arthritis.

[Figure 31-1 boxes: Health condition (disorder/disease); Body function and structure; Activities; Participation; Environmental factors; Personal factors.]

Figure 31-1  The International Classification of Functioning conceptual framework showing the hypothesized relationships between domains of impairment, activity limitations, and participation restrictions.

In addition to providing the conceptual framework, the ICF undertook the task of a classification system listing all possible categories of impact falling under each of the five main headings. Many groups, particularly in arthritis, have reviewed the categories and developed a list of the categories relevant to that form of arthritis. This list suggests which of these categories should be reflected in core set measures in arthritis, offering a form of standard for the content of scales.63

Frameworks define the realm of outcomes that should be considered, and the hypothetical relationships between them. They form the basis for understanding observations, testing hypotheses, or planning and executing an analysis. Shifting frameworks is challenging because different tools may vary in how they define certain aspects of health or disability. Shifts can prompt fruitful rethinking of concepts, however, and how they relate to each other.64

HOW DOES ONE CHARACTERIZE WHAT ONE NEEDS TO MEASURE?

Just as investigators define a research question before embarking on a clinical trial, so should users of health outcomes define a measurement question before choosing an instrument.

Clarity about a measure’s intended purpose helps ensure the right one is selected. Measures can be used in three ways: to describe people at one point in time, to predict a future state, and to measure change over time.66,67 This chapter focuses on the purposes used for health outcome assessment: describing an end point state in a trial, which is descriptive, and evaluating change over time.

An understanding of concepts and definitions is important. It is not good enough to decide to measure physical function, for example; a better outcome would be physical function at the level of disability according to Verbrugge and Jette.60 Similarly, when measuring pain, is intensity more important than frequency? What about the degree to which
pain interferes with daily activities? Questions such as these should be addressed before any instruments or core sets are reviewed. The instrument should meet the need, and not the reverse.

The target population is crucial, but often overlooked. A given instrument may work well in severe osteoarthritis of the hip, but not be sensitive to the early symptoms of the disease. Equally important is to consider whether one wants to measure an individual patient or to describe a group of patients as a whole. The former demands much higher levels of measurement properties (e.g., reliability coefficients >0.90 as opposed to 0.75 to 0.80 being adequate for group descriptions).66

SELECTING THE OUTCOME THAT CAN MEET THE MEASUREMENT NEED

The selection of an outcome measure depends entirely on a clear understanding of measurement need. Often a well-used instrument is chosen, rather than looking for one that matches the concept, population, and purpose. Many guidelines that offer more detail than can be described here are available, particularly for the acceptable levels of reliability and validity.66-72 This section describes a decision-making process for fit between a given instrument and the clinician’s need (Fig. 31-2). This process builds on the work of Law72 and the OMERACT filter69 and highlights key understandings in each area from the published guidelines.

This decision-making process emphasizes three things. First, it begins by stating the measurement need (why, what, and in whom), which reinforces that a candidate measure may meet one need, but not the next. Second, it emphasizes that a lot of the appraisal can be done without statistics. It is done by appraising the questionnaire or instrument itself and knowledge of its administration. Third, an inability to affirm any stage suggests that there is no need to continue. At the later data-based stages, the clinician may choose to run a small study in patients to create the evidence (the “do-it” loops), rather than abandoning an instrument that seems like a good candidate. Given these three key features, the process from left to right is reviewed next.
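
The reliability thresholds quoted above for individual versus group use can be written down as a small check. This is a minimal sketch: the `MeasurementNeed` class, its field names, and the `reliability_adequate` function are hypothetical, and treating the thresholds as inclusive minimums (`>=`) is an assumption.

```python
from dataclasses import dataclass

# Minimum reliability coefficients taken from the text:
# about 0.90 for individual patients, 0.75 adequate for group descriptions.
MIN_RELIABILITY = {"individual": 0.90, "group": 0.75}


@dataclass
class MeasurementNeed:
    concept: str      # what is to be measured
    population: str   # in whom
    purpose: str      # why: "describe" or "evaluate change"
    level: str        # "individual" or "group" (hypothetical field)


def reliability_adequate(need: MeasurementNeed, coefficient: float) -> bool:
    """True if a reported reliability coefficient meets the minimum
    for the intended level of use (inclusive comparison is an assumption)."""
    return coefficient >= MIN_RELIABILITY[need.level]


if __name__ == "__main__":
    need = MeasurementNeed(
        concept="physical function",
        population="early hip osteoarthritis",
        purpose="describe",
        level="group",
    )
    print(reliability_adequate(need, 0.82))
```

Stating the need first, and only then checking a candidate's published coefficient against it, mirrors the order the text recommends: the instrument should meet the need, not the reverse.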

[Figure 31-2 flowchart, box labels as extracted]
Need: concept, population, intended purpose (describe, evaluate change).
Candidate measure:
1. Matches target concept?
2. Feasible to use?
3. Does it measure what it says it does? (truth): face, content, and construct validity.
4. Does it have purpose-specific properties?
   a. One point in time (descriptive): reliability (internal consistency, inter-rater reliability); construct validity (differentiates high/low levels, acts as expected with other indicators).
   b. Change over time (evaluative): test-retest reliability, inter-rater reliability, responsive to change that is similar to target situation.
5. Is it interpretable? (scores/states, change)
A “No” leads to “Choose another tool”; passing every step leads to “Good fit for your needs!”
Legend: Boxes marked with “do it loop” are those where you can create the evidence and continue on. Blue boxes = pre-data evaluation; yellow boxes = data-based evaluation.

Figure 31-2  Algorithm showing decision-making process for the fit of a candidate measure with your measurement need. The left-hand side of the page is done by appraising the instrument and its instructions. The right-hand side requires numeric evidence of the relevant measurement properties. Many instruments are weeded out as a poor fit in steps 1, 2, and 3. The “do-it” loop denotes stages at which you can pause to create evidence if it is missing and not have to abandon the instrument.
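
The decision process in Figure 31-2 can be sketched as a short Python function. This is an illustration under stated assumptions, not the authors' algorithm: the step labels are paraphrased from the figure, the `appraise` function and its answer format are hypothetical, and treating construct validity and every later step as "data-based" steps that allow a do-it loop is an assumption drawn from the text's description of the later stages.

```python
from typing import Dict, Optional

# (step label, is_data_based) in the order the chapter reviews them.
# Which steps count as data-based is an assumption of this sketch.
STEPS = [
    ("1. matches target concept", False),
    ("2. feasible to use", False),
    ("3. face and content validity", False),
    ("3. construct validity", True),
    ("4. purpose-specific properties", True),
    ("5. interpretable", True),
]


def appraise(answers: Dict[str, Optional[bool]]) -> str:
    """Walk the steps in order.

    answers maps a step label to True (affirmed), False (not affirmed),
    or None / missing (no evidence available).
    """
    for label, data_based in STEPS:
        answer = answers.get(label)
        if answer is True:
            continue
        if data_based and answer is None:
            # Evidence is missing, not negative: create it and carry on.
            return f"do-it loop at '{label}': run a small study, then continue"
        return f"choose another tool (stopped at '{label}')"
    return "good fit for your needs"


if __name__ == "__main__":
    answers = {label: True for label, _ in STEPS}
    answers["4. purpose-specific properties"] = None
    print(appraise(answers))
```

Returning at the first step that cannot be affirmed reflects the point that there is no need to continue once a stage fails, while missing evidence at a data-based stage prompts a study instead of abandonment.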

STEP 1: IS THERE A MATCH BETWEEN THE INSTRUMENT’S CONCEPT AND THE MEASUREMENT NEED (CONCEPT, POPULATION, PURPOSE)?

An operational definition of the target concept, the applicable populations (patients or general population), and intended purpose should be articulated by the developer and match your current need.42,68,71,73 If this is not the case, or if it is not a good match, start with another candidate measure because this one would not work.68

Feasibility covers the practical aspects of using this scale in the intended setting.42,69,72,73 Does it take too much time? Are the licensing costs too high? Does it require special equipment? Is it too burdensome for your patients (language, literacy, acceptability of questions)? Is it formatted well on the page, and do the responses make sense given the target and the question? Are the questions phrased in a clear and simple manner? Are the necessary scoring instructions available? An unfavorable answer to any of these questions could direct you to go to another, more feasible, instrument. Feasibility often makes or breaks a decision about a candidate measure.69

Does the instrument measure what it says it will measure? We divide this into three areas: content, face, and construct validity. Content validity appraises the items and domains of a scale. Have the authors covered what McHorney and Tarlov66 call the breadth and the depth of the concept, that is, all the important areas, but also enough depth to capture the range of experience of the patients? Face validity is an appraisal of the general direction of the scale: will it hit the target? Are the response options organized in a logical direction for high and low levels of this attribute? Does the scoring make sense?

The stage of construct validity is the dividing point in the decision process between data-free and data-based appraisal. Up until now, the appraisal is done by looking at the instrument and its manuals. In construct validity, we begin to explore data to see if the numeric scores arising from the instrument make sense. Basic construct validity should be established regardless of the purpose. Sometimes it is de-emphasized in evaluative instruments; however, responsiveness without knowing if the instrument is measuring the target seems misplaced. We place it before the purpose-specific properties to emphasize its need as a basis. Construct validity is generally measured by comparisons with other similar scales or related constructs (i.e., high and low levels of pain and function) to check whether the expected relationship is found, again based on an a priori theory, and so whether the candidate measure behaves according to the theory. These add to the evidence that the instrument is measuring what it is supposed to measure.68 If the evidence is unavailable or is not in the intended population, you have a choice of abandoning the measure or doing a study to create that evidence and then continuing to advance.

STEP 4: PURPOSE-SPECIFIC EVIDENCE IN YOUR POPULATION

After getting a general sense of the construct validity of the instrument, the next step is to address the specific attributes needed for the instrument to function to meet your measurement need (discrimination component of the OMERACT filter).69 More attention is now paid to the intended purpose and the intended population and the properties of consistency in measurement (reliability, that is, obtaining the same score in different settings) and additional validity (cross-sectional again or responsiveness or sensitivity to change).

Descriptive Purpose

Descriptive outcomes often are used to classify individuals as to severity of condition or to identify them by a prognostic group. In health outcome assessment, descriptive instruments are needed to classify individuals as responders or as being in a low disease activity state or remission. An instrument needs to be precise, that is, the observed score is very close to the true score with low error. This is estimated by the internal consistency of a multi-item scale or questionnaire, using the Cronbach alpha coefficient, or Kuder-Richardson 20 if the scale is dichotomous (yes/no). Internal consistency is a feature of a scale with many items measuring the same thing: are the responses similar across items within the instrument? It is not a feature of a scale containing weighted sums of different attributes, such as disease activity measures.37

If more than one person will be gathering the data, inter-rater/observer reliability should be measured and quantified with an intraclass correlation coefficient (ICC) for continuous measures or a weighted kappa for ordered categories.74 There are different types of ICC depending on the model used for the variance estimates; the type of ICC should be named.74 The ICC and weighted kappa measure the comparability of actual numeric scores and are preferred over correlation coefficients that look only for trends and not a direct match in number values. Cutoffs are always challenging, but in general, reliability (including test-retest) should be at minimum 0.75 (refs. 66, 73) for group level analyses, and for describing an individual patient, it should be 0.90 to 0.95 (refs. 66, 73). The internal consistency reliability can be converted back into the scale score by calculating the precision limits: using 95% limits, the true score lies somewhere
see if the numeric scores are behaving in the way they within 1.96 × s[1 − r]1/2 where r = internal consistency and
should if this were a valid measure of the target con- s = standard deviation. This calculation tells us the range
cept. Theoretic situations are set up before analysis, the within which the true score for an individual can be found.
direction and magnitude of the expected relationship are If it is too wide (reliability too low), it is impractical to use
declared, and then the relationship is tested.68,73 Com- that instrument.
parisons also should be made between groups known to Construct validity is revisited for the descriptive instru-
differ (high versus low severity) or with scales where no ment, but with more attention to looking for evidence close

to the intended application. If the goal is to measure high versus low health, the sample should be divided into known groups with high and low health according to another accepted opinion, and then this scale is tested against it. The image of a window can help here in selecting comparators to use for testing. What else gives a bit (or more) of overlap with the target view? How much correspondence is expected between scores? For good construct validity, this a priori hypothesized relationship should be recreated with data, whether that be a strong correlation or no correlation at all. An instrument or measure is never universally valid and requires ongoing testing to improve understanding of the scores in different situations.

Evaluative Purpose

In evaluative measures, the intent of the study is to focus on the amount of change over time. Many clinical trials are doing this and comparing results between treatment and control groups. Interobserver reliability is important if more than one measurer is to be involved. The hallmark of a good evaluative measure relates, however, to time: First, do the scores remain the same when the target concept has not changed over time (test-retest reliability)? Second, when the concept changes, does the score on the instrument/measure change as well?

Test-retest reliability requires two administrations of the measure over a time when no change has occurred. This may be easier said than done sometimes, but the authors should justify their design and how they ensured no change had occurred. Similar to interobserver reliability, the ICC is the preferred statistic for continuous scores, and weighted kappa, its equivalent, is preferred for categorical scores. The cutoffs are the same, and a coefficient can be converted into a “minimal detectable change”75 as $1.96 \times s\,(2[1 - r])^{1/2}$, where s = standard deviation and r = test-retest reliability (ICC).66,75 Ninety-five percent of subjects who are stable have change scores less than this value; a change greater than this is not likely to occur in a stable patient, only in a changing one. It becomes a lower boundary of meaningful change: anything below that could be day-to-day fluctuations in scores.

Responsiveness, the accurate detection of change when it has occurred, is sometimes best thought of as longitudinal construct validity. Similar to construct validity, responsiveness depends on an a priori theoretic relationship, one in which the attribute is changing over time. Often the focus is on the amount of change picked up, rather than the type or amount of change that had occurred. A large change is not useful if we were expecting a small one; rather it suggests noise. The construct embedded in a study of responsiveness should be described carefully and should be a clear match with the intended application (measurement need). If the goal is to detect change in a clinical trial, it is important to assess the instrument’s ability to detect the difference in change between treatment and control groups. If the goal is to detect change in a cohort, it might be more useful to examine change in a single group, perhaps in a treatment of known efficacy (hip replacement) or in subjects who rated themselves as improved on an external anchor (global index of change).

Responsiveness is summarized with statistics of signal (change) over noise (error), such as the standardized response mean (mean change/standard deviation of change), t statistic (mean change/standard error), and effect size (mean change over standard deviation of baseline)74; each can be adapted to quantify the relative change between treatment and control groups.53,76 Deyo and Centor77 also described the correlational approach (correlate change and another indicator of change) and the receiver operating characteristic curve approach (various change scores against external “gold standard” that the person has changed) where the area under the curve is a summary statistic.77 The numeric summaries of responsiveness, such as effect sizes or areas under the curve, should correspond to the type of change expected (a priori theory). A large effect size or area under the curve does not mean an instrument is “responsive.” It should correspond with the change anticipated in the study, small or large. Comparisons of the effect sizes are helpful if different instruments are being compared in the same study, as done by Buchbinder and colleagues53 or by Verhoeven and coworkers,76 who focused on responsiveness in early rheumatoid arthritis. Responsiveness is a highly contextualized property, and the same instrument may not be responsive in another situation (e.g., early versus late disease, osteoarthritis versus rheumatoid arthritis).73

The final step, often deemed the most elusive,78 is the interpretability of the scores.

Benchmarking States

What is the meaning of a score of 2/10 on a pain scale? Is it a good outcome? The meaning of different scores on an outcome assessment is used for classifying subjects at the beginning of a trial and at the end point. To do this, comparisons are made to other known health states: severity indices, ability to work, self-rating as mild.79 Gradually, enough trends might be seen across different scenarios to gain confidence in the meaning of “good” or “mild.”80,81 In rheumatology, we see the emergence of low disease activity states,82,83 patient acceptable symptom states,84 or DAS28-based remission criteria38 as thresholds below which subjects are considered to be in an acceptable state (either tolerable symptoms or disease activity where it does not require medication changes). At this point, these thresholds are being established, and similar to change thresholds, we may find variability in the values38 that needs to be sorted out with methodologic work and application in clinical practice.

Changes in State

The second type of interpretability concerns change scores.

American College of Rheumatology Response Criteria. The American College of Rheumatology took the core set measures and determined that if one observed an X% change in tender joint count and in swollen joint count and in at least three other areas (erythrocyte sedimentation rate or C-reactive protein, physician global, patient global, pain, or physical disability), one had a clinical response, and the individual would be classified as a responder. The percent is usually 20%, but 50% and 70% have been considered. The ACR20 is widely used, catches responses across a wide variety of domains, and discriminates well in clinical

trials76; however, it is currently being revalidated owing to the changing nature of rheumatoid arthritis and its care.85

Minimal Clinically Important Differences and Improvements. Defining the threshold of change above which an individual has had an “important” shift in outcome is what Kirwan78 has described as the “elusive crock of gold at the end of the rainbow.” Nevertheless, important advances have been made. There are many sources of variation in score, including the method used, the baseline severity, and the type of change to be sought.86,87 In 2000, Wells and coworkers88 described nine different methods for deriving minimal clinically important differences from the literature. Some use distributional cutoffs (½ standard deviation, or effect size of 0.2 or 0.5),89 which have been criticized as lacking any meaningful anchor. Other methods depend on some external anchor that important improvement has occurred, but are sometimes challenged by the dependence on that anchor and the perspective of the individual who determines it (patient, physician, third-party payer). Minimal clinically important differences repeatedly have been shown to vary with baseline state90,91 and with improvement versus deterioration.92 Tubach and associates93 changed the term to minimal clinically important improvement and looked only at improvement. Minimal clinically important differences vary depending on the context of measurement. You need to plan on working with a range of values,42,86,87 to make sure the measurement situation is similar to your own (severity, timing, type of intervention), and to build confidence with congruence in minimal clinically important differences from across methods if you can achieve that.

Combined Approaches: Change and State

An attractive, although often overlooked, option is combining change and state. In 1996, EULAR defined clinical response as a change in DAS score of more than 1.2 (change) plus a final DAS score of less than 2.4 (final state).36 Jacobson and colleagues94 did the same in defining response to psychotherapy; change greater than error was used (minimal detectable change mentioned earlier) plus a final “normal” state. Studies from the patient’s perspective have often reflected the same thing.80,81,95 Treatment needs to induce a change, but perhaps it also needs to land patients in a healthy state to make them feel better.

The approaches described previously focus on interpretation at the level of the individual, perhaps for use in clinical practice or in a response-type analysis of a clinical trial or economic appraisal (% responder). Verhoeven and coworkers76 showed that the same instrument may not perform equally well in a responder type of analysis and a group level change.

At each stage of this appraisal, there is an element of judgment. It is likely there will never be perfect evidence across all stages. The user needs to assess the potential risk of accepting less than ideal evidence or abandon the scale. Users also may create the evidence, however, by doing it themselves. An instrument that makes it through this appraisal is likely a good fit with the measurement need. Anything short of that could lead to error and be prone to misinterpretation. By working from left to right, scales that are not targeting the right concept or are impractical to use in the intended setting can be eliminated quickly before reviewing the literature extensively for the measurement properties.

SECTION 6: AREAS OF GROWTH IN HEALTH OUTCOME ASSESSMENT

Item Response Theory

Users of outcomes in arthritis come across item response theory (IRT) and computer adaptive testing (CAT). These terms relate to newer methods of ordering and calibrating items on a scale so that there is equal meaning in score increments across the scale. Most of our outcomes were developed in “classic test theory” (internal consistency, summed scores without weights). There are two schools within IRT: Rasch, which fixes the parameters and assesses if items in a scale fit or do not fit that model, and IRT itself, which fits a model to the data, rather than the reverse. IRT and Rasch are often presented as conflicting schools, but they are both working toward an item calibration that allows more accuracy and precision. In the future, direct comparisons may reveal their similarities and differences in practical ways. The weights are cumbersome to apply for the clinician, but can be easily integrated into a computer-based scoring system for easy data entry and CAT. CAT chooses items based on the previous set of responses and uses the fewest number of items to reach a precise score, skipping easier items if confident the subject can do the harder ones. This streamlined scoring is quite attractive, but there are some limitations. It depends on technology that may not at this time be available in every setting. It also may be influenced by differential item functioning, which means an item might change weight, or order, in certain subgroups and necessitates more complex weights. An example of differential item functioning would be putting on a pullover sweater. It is a hard task if you have shoulder pain, but pretty easy if you have a hand problem, and would require a different weight.

The National Institutes of Health is currently funding PROMIS (Patient Reported Outcomes Measurement Information System) to develop a CAT system (currently based on a two-parameter graded response model IRT) for common chronic diseases, including arthritis.96 Several measures have been pooled into a large database and are being refined and rescaled at the time of this writing. Well-known measures, such as the HAQ, also will be used to allow for cross-calibration with the newer items. All findings will be reported on the PROMIS website, as will access to the scoring algorithms.

Use of Technology in Health Outcomes Assessment

In addition to enabling efforts such as PROMIS to develop CAT systems, information technology has changed many aspects of health outcome assessments. Streamlined, customized assessments can be set up on the Internet or on a stand-alone computer with interfaces such as touch screen, light pens, or “point and click.” Patients can complete the questionnaires at home, at the clinic, on their PDA, or on a tablet. Language and literacy issues can be overcome with talking screens. Scoring becomes instantaneous, and reports can be printed immediately summarizing the scored results in time for the clinical visit.97,98 Comparisons

between touch screens and traditional paper and pencils are promising, and the acceptability by patients with arthritis is good.97-99 New technology means that health outcome assessment can become part of the patient-clinician experience and facilitate the ability of the clinician to monitor the patient’s health.97

Adaptation to an Ongoing Disease

This chapter has focused on the measurement of health states and their interpretation over time. Individuals with chronic diseases adapt to ongoing disease with behavioral strategies or cognitive reframing of their situation.100 In some circles, this is adjustment95; in others, it is response shift.101 The challenge in health outcomes assessment is to tell when a state is changing only because of adaptation and not the intervention. In many situations, we try to induce adaptation, or cognitive reframing, and it can be constructive. It does create a bias in measurement,101 however, and a challenge to the health outcome assessor. Numerous groups are researching how to incorporate adaptation into health outcome assessments.

SUMMARY

There is considerable room for improvement in health outcome assessment in rheumatology, despite the work done to date. A battery of instruments has been developed, many of which exhibit the measurement properties described in this chapter and meet the challenge of a changing arthritis target (less severe, earlier disease), and several more measures are being considered for membership in core sets to capture a comprehensive view of the burden of arthritis. We are on the brink of deciding on the role to be played by IRT and CAT in widespread care settings. Despite progress in assigning a numeric value to a complex health state, however, we are now struggling with the back-translation: what does the numeric score mean in the real patient world? It is not always a simple translation from questionnaire score to clinical meaning.

Health outcome assessment is well advanced in arthritis care, and we should recognize the years of work and commitment of many professional and patient/consumer groups. Advances will continue in the use of technology, the breadth and depth of outcomes, and the quality of measurement to keep pace with the needs of patients, clinicians, and researchers.

Acknowledgments

Dorcas Beaton is supported by a New Investigators Award through the Canadian Institutes of Health Research. Peter Tugwell holds a Canada Research Chair.

The authors would like to thank Ms. Taucha Inrig, Dr. Claire Bombardier, Dr. Fred and Mrs. Janet Krieger, Mr. William Francis, and the OMERACT executive for their help with this manuscript, and Dr. M. Ward, whose chapter in the seventh edition of Kelley’s Textbook of Rheumatology was a helpful guide.

REFERENCES

1. Relman AS: Assessment and accountability: The third revolution in medical care. N Engl J Med 319:1220-1222, 1988.
2. Last JM: A Dictionary of Epidemiology. New York, Oxford University Press, 1988.
3. Prevoo MLL, Van’t Hof MA, et al: Modified disease activity scores that include twenty-eight-joint counts: Development and validation in a prospective longitudinal study of patients with rheumatoid arthritis. Arthritis Rheum 38:44-48, 1995.
4. Fries JF, Spitz PW, Young DY: The dimensions of health outcomes: The Health Assessment Questionnaire, Disability and Pain Scales. J Rheumatol 9:789-793, 1982.
5. Ware JE Jr, Sherbourne CD: The MOS 36-Item Short-Form Health Survey (SF-36), I: Conceptual framework and item selection. Med Care 30:473-483, 1992.
6. Fries JF: The hierarchy of outcome assessment. J Rheumatol 20:546-547, 1993.
7. Wolfe F, Lassere M, van der Heijde D, et al: Preliminary core set of domains and reporting requirements for longitudinal observational studies in rheumatology. J Rheumatol 26:484-489, 1999.
8. van der Heijde D, van der Linden S, Bellamy N, et al: Which domains should be included in a core set for endpoints in ankylosing spondylitis? Introduction to the ankylosing spondylitis module of OMERACT IV. J Rheumatol 26:945-947, 1999.
9. Gladman DD, Mease PJ, Healy P, et al: Outcome measures in psoriatic arthritis (PsA). J Rheumatol 34:1159-1166, 2007.
10. Gladman DD, Mease PJ, Strand V, et al: Consensus on a core set of domains for psoriatic arthritis. OMERACT 8 PsA Module Report. J Rheumatol 34:1167-1170, 2007.
11. Sambrook PN, Cummings SR, Eisman JA, et al: Guidelines of osteoporosis trials (workshop report). J Rheumatol 24:1234-1236, 1997.
12. Gladman DD, Strand V, Mease PJ, et al: OMERACT 7 psoriatic arthritis workshop: Synopsis. Ann Rheum Dis 64:ii-115-ii-116, 2005.
13. Bellamy N, Kirwan J, Boers M, et al: Recommendations for a core set of outcome measures for future phase III clinical trials in knee, hip, and hand osteoarthritis: Consensus development at OMERACT III. J Rheumatol 24:799-802, 1997.
14. Smolen JS, Strand V, Cardiel M, et al: Randomized clinical trials and longitudinal observational studies in systemic lupus erythematosus: Consensus on a preliminary core set of outcome domains. J Rheumatol 26:504-507, 1999.
15. Boers M, Tugwell P, Felson DT, et al: World Health Organization and International League of Associations for Rheumatology core endpoints for symptom modifying antirheumatic drugs in rheumatoid arthritis clinical trials. J Rheumatol 21:86-89, 1994.
16. Felson DT, Anderson JJ, Boers M, et al: The American College of Rheumatology preliminary core set of disease activity measures for rheumatoid arthritis clinical trials. The Committee on Outcome Measures in Rheumatoid Arthritis Clinical Trials. Arthritis Rheum 36:729-740, 1993.
17. van der Heijde D, Landewe R: Selection of a method for scoring radiographs for ankylosing spondylitis clinical trials, by the Assessment in Ankylosing Spondylitis working groups (ASAS) and OMERACT. J Rheumatol 32:2048-2049, 2005.
18. Guidelines of osteoporosis trials (workshop report). J Rheumatol 24:1234-1236, 1997.
19. Strand V, Gladman DD, Isenberg D, et al: Outcome measures to be used in clinical trials in systemic lupus erythematosus. J Rheumatol 26:490-497, 1999.
20. Patrick DL, Deyo RA: Generic and disease-specific measures in assessing health status and quality of life. Med Care 27(Suppl):S217-S232, 1989.
21. Bergner M, Bobbitt RA, Pollard WE, et al: The sickness impact profile: Validation of a health status measure. Med Care 14:57-67, 1976.
22. Ware JE Jr: SF-36 health survey update. Spine 25:3130-3139, 2000.
23. Ware JE Jr, Snow KK, Kosinski M, et al: SF-36 Health Survey Manual and Interpretation Guide. Boston, The Health Institute, 1993.
24. Beaton DE, Bombardier C, Hogg-Johnson SA: Measuring health in injured workers: A cross-sectional comparison of five generic health status instruments in workers with musculoskeletal injuries. Am J Ind Med 29:618-631, 1996.
25. Beaton DE, Hogg-Johnson S, Bombardier C: Evaluating changes in health status: Reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol 50:79-93, 1997.
26. Visser MC, Fletcher AE, Parr G, et al: A comparison of three quality of life instruments in subjects with angina pectoris: The Sickness Impact Profile, the Nottingham Health Profile, and the Quality of Well Being Scale. J Clin Epidemiol 47:157-163, 1994.
27. Revicki DA, Kaplan RM: Relationship between psychometric and utility-based approaches to the measurement of health-related quality of life. Qual Life Res 2:477-487, 1993.
28. Feeny D: Preference-based measures: utility and quality-adjusted life years. In Fayers P, Hays R (eds): Assessing Quality of Life in Clinical Trials, 2nd ed. New York, Oxford University Press, 2005, pp 405-429.

29. Farrar JT, Portenoy RK, Berlin JA, et al: Defining the clinically important difference in pain outcome measures. Pain 88:287-294, 2000.
30. Garrett S, Jenkinson T, Kennedy LG, et al: A new approach to defining disease status in ankylosing spondylitis: The BATH Ankylosing Spondylitis Disease Activity Index. J Rheumatol 21:2286-2291, 1994.
31. Fries JF: The hierarchy of quality-of-life assessment, the Health Assessment Questionnaire (HAQ), and issues mandating development of a toxicity index. Controlled Clinical Trials 12:106S-117S, 1991.
32. Meenan RF, Gertman PM, Mason JH: Measuring health status in arthritis: The Arthritis Impact Measurement Scales. Arthritis Rheum 23:146-152, 1980.
33. Meenan RF, Mason JH, Anderson JJ, et al: Aims2: The content and properties of a revised and expanded arthritis impact measurement scales health status questionnaire. Arthritis Rheum 35:1-10, 1992.
34. Bellamy N, Buchanan WW, Goldsmith CH, et al: Validation study of WOMAC: A health status instrument for measuring clinically-important patient-relevant outcomes following total hip or knee arthroplasty in osteoarthritis. J Orthop Rheum 1:95-108, 1988.
35. Bellamy N, Campbell J, Haraoui B, et al: Clinimetric properties of the AUSCAN osteoarthritis hand index: An evaluation of reliability, validity and responsiveness. Osteoarthritis Cartilage 10:863-869, 2002.
36. Van Gestel AM, Prevoo MLL, Van’t Hof MA, et al: Development and validation of the European League Against Rheumatism response criteria for rheumatoid arthritis. Arthritis Rheum 39:34-40, 1996.
37. Vrijhoef HJM, Diederiks JPM, Spreeuwenberg C, et al: Applying low disease activity criteria using the DAS28 to assess stability in patients with rheumatoid arthritis. Ann Rheum Dis 62:419-422, 2003.
38. Aletaha D, Ward MM, Machold KP, et al: Remission and active disease in rheumatoid arthritis: Defining criteria for disease activity states. Arthritis Rheum 52:2625-2636, 2005.
39. Lassere M, van der Heijde D, Johnson K, et al: Robustness and generalizability of smallest detectable difference in radiological progression. J Rheumatol 28:911-913, 2001.
40. Ravaud P, Giraudeau B, Auleley GR, et al: Assessing smallest detectable change over time in continuous structural outcome measures: Application to radiological change in knee osteoarthritis. J Clin Epidemiol 52:1225-1230, 1999.
41. Lassere M, Johnson K, Van Santen S, et al: Generic patient self-report and investigator report instruments of therapeutic safety and tolerability. J Rheumatol 32:2033-2036, 2005.
42. U.S. Department of Health and Human Services Food and Drug Administration Center for Drug Evaluation and Research (CDER): Guidance for industry: Patient-reported outcome measures: Use in medical product development to support labeling claims: Draft guidance. Available at: Accessed November 3, 2006.
43. Woodworth T, Furst DE, Alten R, et al: Standardizing assessment and reporting of adverse effects in rheumatology clinical trials, II: Rheumatology Common Toxicity Criteria v2.0. J Rheumatol 34:1401-1414, 2007.
44. Kristjansson E, Tugwell PS, Wilson AJ, et al: Development of the effective musculoskeletal consumer scale. J Rheumatol 34:1392-1400, 2007.
45. Escorpizo R, Bombardier C, Boonen A, et al: Worker productivity outcome measures in arthritis. J Rheumatol 34:1372-1380, 2007.
46. Lerner D, Amick BC III, Rogers WH, et al: The work limitations questionnaire. Med Care 39:72-85, 2001.
47. Gignac MAM, Badley EM, Lacaille D, et al: Managing arthritis and employment: Making arthritis-related work changes as a means of adaptation. Arthritis Care Res 51:909-916, 2004.
48. Gilworth G, Chamberlain AM, Harvey A, et al: Development of a work instability scale for rheumatoid arthritis. Arthritis Care Res 49:349-354, 2003.
49. Backman C, Kennedy SM, Chalmers A, et al: Participation in paid and unpaid work by adults with rheumatoid arthritis. J Rheumatol 31:47-57, 2004.
50. Tugwell P, Bombardier C, Buchanan WW, et al: The MACTAR Patient Preference Disability Questionnaire: an individualized functional priority approach for assessing improvement in physical disability in clinical trials in rheumatoid arthritis. J Rheumatol 14:446-451, 1987.
51. Buchbinder R, Bombardier C, Yeung M, et al: Which outcome measures should be used in rheumatoid arthritis clinical trials? Clinical and quality-of-life measures’ responsiveness to treatment in a randomized controlled trial. Arthritis Rheum 38:1568-1580, 1995.
52. O’Boyle CA, Hofer S, Ring L: Individualized quality of life. In Fayers P, Hays R (eds): Assessing Quality of Life in Clinical Trials: Methods and Practice, 2nd ed. New York, Oxford University Press, 2005, pp 225-242.
53. Buchbinder R, Bombardier C, Yeung M, et al: Which outcome measures should be used in rheumatoid arthritis clinical trials? Arthritis Rheum 38:1568-1580, 1995.
54. Solomon DH, Bates DW, Horsky J, et al: Development and validation of a patient satisfaction scale for musculoskeletal care. Arthritis Care Res 12:96-100, 1999.
55. Hudak PL, McKeever PD, Wright JG: Understanding the meaning of satisfaction with treatment outcome. Med Care 42:718-725, 2004.
56. Jette AM, Haley SM: Contemporary measurement technique for rehabilitation outcome assessment. J Rehabil 37:339-345, 2005.
57. Jette AM, Keysor JJ: Disability models: Implications for arthritis exercise and physical activity interventions. Arthritis Care Res 49:114-120, 2003.
58. Wilson IB, Cleary PD: Linking clinical variables with health-related quality of life: A conceptual model of patient outcomes. JAMA 273:59-65, 1995.
59. Nagi SZ: A study in the evaluation of disability and rehabilitation potential. Am J Public Health 54:1568-1579, 1964.
60. Verbrugge LM, Jette AM: The disablement process. Soc Sci Med 38:1-14, 1994.
61. World Health Organization: International Classification of Functioning, Disability and Health. Geneva, World Health Organization, 2001.
62. Arts DGT, Keizer NF, Scheffer G: Defining and improving data quality in medical registries: A literature review, case study and generic framework. J Am Med Inform Assoc 9:600-611, 2002.
63. Stucki G, Boonen A, Tugwell P, et al: The World Health Organisation International Classification of Functioning, Disability and Health (ICF): A conceptual model and interface for the OMERACT process. J Rheumatol 34(3):600-606, 2007.
64. Jette AM: Toward a common language for function, disability and health. Phys Ther 86:726-734, 2006.
65. Jette AM, Keysor JJ: Uses of evidence in disability outcomes and effectiveness research. Milbank Q 80:325-345, 2002.
66. McHorney CA, Tarlov AR: Individual patient monitoring in clinical practice: Are available health status surveys adequate? Qual Life Res 4:293, 1995.
67. Lohr KN, Aaronson NK, Alonso J, et al: Evaluating quality-of-life and health status instruments: Development of scientific review criteria. Clin Therap 18:979-992, 1996.
68. McDowell I, Jenkinson C: Development standards for health measures. J Health Serv Res Policy 1:238-246, 1996.
69. Boers M, Brooks P, Strand V, et al: The OMERACT Filter for outcome measures in rheumatology. J Rheumatol 25:198-199, 1998.
70. Kane RA, Kane RL: Assessing the Elderly: A Practical Guide to measurement. Toronto, Lexington Books, 1981, pp 13-17.
71. Bergner M: Health status measures: An overview and guide for selection. Ann Rev Public Health 8:191-210, 1987.
72. Law M: Measurement in occupational therapy: Scientific criteria for evaluation. Can J Occup Ther 54:133-138, 1987.
73. Scientific Advisory Committee of the Medical Outcomes Trust: Assessing health status and quality of life instruments: Attributes and review criteria. Qual Life Res 11:193-205, 2002.
74. Hays RD, Revicki D: Reliability and validity (including responsiveness). In Fayers P, Hays R (eds): Assessing Quality of Life in Clinical Trials: Methods and Practice, 2nd ed. New York, Oxford University Press, 2005, pp 25-39.
75. Stratford PW, Binkley JM: Applying the results of self-report measures to individual patients: An example using the Roland-Morris Questionnaire. J Orthop Sports Physical Ther 29:232-239, 1999.
76. Verhoeven A, Boers M, van der Linden S: Responsiveness of the core set, response criteria, and utilities in early rheumatoid arthritis. Ann Rheum Dis 59:966-974, 2000.
77. Deyo RA, Centor RM: Assessing the responsiveness of functional scales to clinical change: An analogy to diagnostic test performance. J Chronic Dis 39:897-906, 1986.
78. Kirwan J: Minimum clinically important difference: The crock of gold at the end of the rainbow? J Rheumatol 28:439-444, 2001.
79. Deyo RA, Carter WB: Strategies for improving and expanding the application of health status measures in clinical settings: A researcher-developer viewpoint. Med Care 30(5 Suppl):MS176-MS186, 1992.

80. Tubach F, Dougados M, Falissard B, et al: Feeling good rather than feeling better matters more to patients. Arthritis Care Res 55:526-530, 2006.
81. Beaton DE, Tarasuk V, Katz JN, et al: Are you better? A qualitative study of the meaning of being better. Arthritis Care Res 7:313-320, 2001.
82. Boers M, Anderson JJ, Felson D: Deriving an operational definition of low disease activity state in rheumatoid arthritis. J Rheumatol 30:1112-1114, 2003.
83. Tubach F, Wells GA, Ravaud P, et al: Minimal clinically important difference, low disease activity state and patient acceptable symptom state: Methodological issues. J Rheumatol 32:2025-2029, 2005.
84. Tubach F, Ravaud P, Baron G, et al: Evaluation of clinically relevant states in patient reported outcomes in knee and hip osteoarthritis: The patient acceptable symptom state. Ann Rheum Dis 64:34-37, 2005.
85. Felson DT, Furst DE, Boers M: Rationale and strategies for reevaluating the ACR20. J Rheumatol 34:1184-1187, 2007.
86. Beaton DE, Boers M, Wells GA: Many faces of the minimal clinically important difference (MCID): A literature review and directions for future research. Curr Opin Rheumatol 14:109-114, 2002.
87. Hays RD, Woolley JM: The concept of clinically meaningful difference in health-related quality of life research. PharmacoEconomics 18:419-423, 2000.
88. Wells GA, Beaton DE, Shea B, et al: Minimal clinically important differences: Review of methods. J Rheumatol 28:406-412, 2001.
89. Norman GR, Sloan JA, Wyrwich KW: Interpretation of changes in health-related quality of life: The remarkable universality of half a standard deviation. Med Care 41:582-592, 2003.
90. Salaffi F, Stancati A, Silvestri CA, et al: Minimal clinically important changes in chronic musculoskeletal pain intensity measures on a numerical rating scale. Eur J Pain 8:283-291, 2004.
91. Stucki G, Daltroy L, Katz JN, et al: Interpretation of change scores in ordinal clinical scales and health status measures: The whole may not be equal to the sum of the parts. J Clin Epidemiol 49:711-717, 1996.
92. Angst F, Aeschlimann A, Stucki G: Smallest detectable and minimal clinically important differences of rehabilitation intervention with their implications for required sample sizes using WOMAC and SF-36 quality of life measurement instruments in patients with osteoarthritis of the lower extremities. Arthritis Care Res 45:384-391, 2001.
93. Tubach F, Ravaud P, Baron G, et al: Evaluation of clinically relevant changes in patient reported outcomes in knee and hip osteoarthritis: The minimal clinically important improvement. Ann Rheum Dis 64:29-33, 2005.
94. Jacobson NS, Roberts LJ, Berns SB, et al: Methods for defining and determining the clinical significance of treatment effects: Description, application, alternatives. J Consult Clin Psychol 67:300-307, 1999.
95. Norman G: Hi! How are you? Response shift, implicit theories and differing epistemologies. Qual Life Res 12:249, 2003.
96. National Institutes of Health: Patient Reported Outcome Measurement Information System (PROMIS) network. 2006. Available at:
97. Athale N, Sturley A, Koczen Z, et al: A web-compatible instrument for measuring self-reported disease activity in arthritis. J Rheumatol 31:223-228, 2004.
98. Fransen J, Stucki G, Twisk J, et al: Effectiveness of a measurement feedback system on outcome in rheumatoid arthritis: A controlled clinical trial. Ann Rheum Dis 62:624-629, 2003.
99. Bischoff-Ferrari HF, Vandechend M, Bellamy N, et al: Validation and patient acceptance of a computer touch screen version of the WOMAC 3.1 osteoarthritis index. Ann Rheum Dis 64:80-84, 2004.
100. Shaul MP: From early twinges to mastery: The process of adjustment in living with rheumatoid arthritis. Arthritis Care Res 8:290-297, 1995.
101. Schwartz C, Sprangers M, Fayers P: Response shift: You know it’s there but how do you capture it? Challenges for the next phase of research. In Fayers P, Hays R (eds): Assessing Quality of Life in Clinical Trials: Methods and Practice, 2nd ed. New York, Oxford University Press, 2005, pp 275-290.