Quality of Life Research 11: 193-205, 2002.
© 2002 Kluwer Academic Publishers. Printed in the Netherlands.

Assessing health status and quality-of-life instruments: Attributes and review criteria

Scientific Advisory Committee of the Medical Outcomes Trust¹ (E-mail: klohr@rti.org)

¹ Neil Aaronson, PhD, The Netherlands Cancer Institute, Amsterdam; Jordi Alonso, MD, Institut Municipal d'Investigacio Medica (IMIM-IMAS), Barcelona, Spain; Audrey Burnam, PhD, The RAND Corporation, Santa Monica, CA; Kathleen N. Lohr, PhD, RTI International, Research Triangle Park, NC, and Program on Health Outcomes, University of North Carolina at Chapel Hill; Donald L. Patrick, PhD, MSPH, Department of Health Services, University of Washington, Seattle; Edward Perrin, PhD, Department of Health Services, University of Washington, Seattle; Ruth E.K. Stein, MD, Albert Einstein College of Medicine/Children's Hospital at Montefiore, Bronx, NY.

Accepted in revised form 8 January 2002

Abstract

The field of health status and quality of life (QoL) measurement - as a formal discipline with a cohesive
theoretical framework, accepted methods, and diverse applications - has been evolving for the better part
of 30 years. To identify health status and QoL instruments and review them against rigorous criteria as a
precursor to creating an instrument library for later dissemination, the Medical Outcomes Trust in 1994
created an independently functioning Scientific Advisory Committee (SAC). In the mid-1990s, the SAC
defined a set of attributes and criteria to carry out instrument assessments; 5 years later, it updated and
revised these materials to take account of the expanding theories and technologies upon which such
instruments were being developed. This paper offers the SAC's current conceptualization of eight key
attributes of health status and QoL instruments (i.e., conceptual and measurement model; reliability;
validity; responsiveness; interpretability; respondent and administrative burden; alternate forms; and cul-
tural and language adaptations) and the criteria by which instruments would be reviewed on each of those
attributes. These are suggested guidelines for the field to consider and debate; as measurement techniques
become both more familiar and more sophisticated, we expect that experts will wish to update and refine
these criteria accordingly.

Key words: Health status, Item response theory, Measurement, Quality of life, Reliability, Responsiveness,
Validity

Introduction

The field of health assessment

The field of health status and quality of life (QoL) measurement - as a formal discipline with a cohesive theoretical framework, accepted methods, and diverse applications - has been evolving for the better part of 30 years. It has been characterized by a proliferation of instruments that vary widely in their methods of development, content, breadth of use, and quality. It has grown and matured through the efforts of both individual developers of instruments and teams of researchers, supported at crucial junctures by private philanthropic organizations and public sector (government) agencies, particularly in North America, the United Kingdom, and various European countries. A further important source of support for the field has been the pharmaceutical industry on both sides of the Atlantic.


Adding to this mix of research and application have been many constructive developments: creation of various institutes and centers that develop new instruments, translate and culturally adapt existing instruments, and facilitate research among academicians, clinicians, and health care organizations; emergence of a professional society devoted explicitly to the furtherance of this field (the International Society for Quality of Life Research [ISOQoL]); convening of numerous international colloquia and conventions on methods and issues in assessing health-related quality of life; production of numerous compilations of rating instruments and questionnaires for measuring health status, functioning, and related concepts; and publication of at least one journal whose core content relates to QoL measurement (namely, Quality of Life Research).

Today, the field is broadly international, and its leaders are at the forefront of applying both traditional and modern theories and methods from health, psychology, and related fields to the creation and validation of such instruments. This heterogeneity induces extremely productive and useful debate and methodologic advances, and for experts in the field, this diversity is acceptable and, indeed, welcome.

The Medical Outcomes Trust and its Scientific Advisory Committee (SAC)

This complex mix of nonprofit organizations, academic researchers, public sector agencies, and commercial firms has, of course, pursued no single set of objectives. In 1992, however, the Medical Outcomes Trust was incorporated with the mission of promoting the science and application of outcomes assessment, with a particular emphasis on expanding the availability and use of self- or interviewer-administered questionnaires designed to assess health and the outcomes of health care from the patients' point of view. To accomplish this mission, the trust undertook to identify such instruments, bring them into an instrument library, and disseminate them (together with appropriate users' guides and related materials) to all persons with an interest in and need for them.

In furtherance of this task, in 1994 the trust created a Scientific Advisory Committee (SAC) - an independently operating entity charged with reviewing instruments and assessing their suitability for broad distribution by the trust. The SAC determined that, to discharge its responsibilities, it would need to establish some principles and criteria, as well as procedures, by which it would acquire, review, and make assessments about instruments that came to its attention or were submitted to the trust.

Instrument review criteria

The SAC thus set about to define a set of attributes and criteria to carry out instrument assessments and, after external peer review and revisions, published and disseminated the first set of its 'instrument review criteria' in 1996 (Evaluating quality-of-life and health status assessment instruments: Development of scientific review criteria. Clin Ther 1996; 18(5): 979-992). They were re-published in the Monitor, a trust publication, in March 1997; a subsidiary set of criteria relating solely to the evaluation of translations and cultural adaptations of instruments was published in the Bulletin, a sister trust publication, in July 1997.

Within the SAC approach and criteria, the term instrument refers to the constellation of items contained in questionnaires and interview schedules, along with their instructions to respondents, procedures for administration, scoring, interpretation of results, and other materials found in a users' manual. We use the term attributes to indicate categories of properties or characteristics of instruments that warrant separate, independent consideration in evaluation. Within the attributes, we specify what we denote as criteria, which are commonly understood to be conditions or facts used as a standard by which something can be judged or considered. We view the criteria as prescribing the specific information instrument developers should be prepared to provide about particular aspects of each attribute.

In general, we have used these criteria to review instruments developed in English and cultural and language adaptations based on the English-language version of the given instrument, but they can and have been used to consider instruments developed in other languages as well. We apply them to instruments that measure domains of health status and quality of life (QoL) in both groups and individuals.


Although we believe that the criteria apply to the 'individualized' class of measures (such as the schedule for the evaluation of individual quality of life [SEIQoL]) that do not have standardized items across respondents, we have not yet had experience applying these criteria with such measures.

Revised instrument review criteria

There matters stood for about 2 years, as the SAC applied its original set of instrument review criteria to instruments submitted from the United States, the United Kingdom, Canada, and various European countries as part of the larger trust activities. Increasingly, however, the SAC encountered two problems. One was that developers sometimes found the documents describing the criteria difficult to apply to their particular situation; the other was that the criteria were less applicable to instruments developed in accordance with the principles of modern test theory than to instruments created in line with classical psychometric rules.

Thus, in the course of using the initial criteria set over several years, we determined that they required revision and expansion to address advances in the science of psychometrics and to apply to a broader range of instruments. Quite apart from the fact that more instruments are being developed on principles other than classical test theory, we recognized that being able to apply the same concepts of assessment to other types of instruments, such as screeners or instruments by which consumers might rate their satisfaction with health care and plans, is also desirable.

To address these concerns, we undertook to revise the criteria following the same process as used initially. Specifically, we determined that we would retain the basic structure of the criteria set but expand the definitions and specific criteria to reflect modern test theory principles and methods. We also revamped the presentation of the criteria, primarily to make clear the distinction between the description or definition of a specific attribute (e.g., reliability or respondent burden) and the specific pieces of information that we believe developers should try to provide about that attribute. Before publishing these revised criteria, we solicited outside peer review from six reviewers in the United States, the United Kingdom, and Denmark (see Acknowledgments) and revised the document accordingly.

We have three goals in mind in disseminating these criteria. First, we hope to enhance the appreciation of health outcomes assessment among as wide an audience as possible and to prompt yet more discussion and debate about continuous improvement in this field. Second, we want to provide a template by which others setting out to assess materials or systems (e.g., performance measurement or monitoring systems in the quality-of-care arena) might similarly undertake to state their evaluation criteria clearly and openly. Third, we aim to document the process and criteria used by the SAC within the context of the trust's mission.

Attributes and criteria

Eight attributes have served as the principal foci for SAC instrument review and are the core of this paper. They are:
1. Conceptual and measurement model
2. Reliability
3. Validity
4. Responsiveness
5. Interpretability
6. Respondent and administrative burden
7. Alternative forms
8. Cultural and language adaptations (translations)

Within these attributes, we established specific review criteria that are based on existing standards and evolving practices in the behavioral science and health outcomes fields. These criteria, which are general guidelines, reflect principles and practices of both classical and modern test theory. Table 1 summarizes the attributes and main criteria for each attribute. In general, our review criteria have been designed primarily for health status and QoL profiles; we acknowledge that for various utility or preference measures, yet other attributes and criteria may be appropriate. (At the end of this paper, readers will find a selected bibliography of seminal texts and articles that provide the conceptual and empirical base for these attributes and criteria; we judged this approach to be simpler than trying to document the numerous sources that could be cited for this material within the text itself.)

We review instruments in the context of 11 documented applications:


- assessing the health of general populations at a point in time,
- assessing the health of specific populations at a point in time,
- monitoring the health of general populations over time,
- monitoring the health of specific populations over time,
- evaluating the impact of broad-based or community-level interventions or policies,
- evaluating the efficacy and effectiveness of health care interventions,
- conducting economic evaluations of health interventions,
- using in quality improvement and quality assurance programs in health care delivery systems,
- screening for health conditions,
- diagnosing health conditions, and
- monitoring the health status of individual patients.

Table 1. Attributes and criteria for reviewing instruments*

1. Conceptual and measurement model
The rationale for and description of the concept and the populations that a measure is intended to assess, and the relationship between these concepts.
Review criteria:
- Concept to be measured
- Conceptual and empirical bases for item content and combinations
- Target population involvement in content derivation
- Information on dimensionality and distinctiveness of scales
- Evidence of scale variability
- Intended level of measurement
- Rationale for deriving scale scores

2. Reliability
The degree to which an instrument is free from random error. Internal consistency is the precision of a scale, based on the homogeneity (intercorrelations) of the scale's items at one point in time. Reproducibility is the stability of an instrument over time (test-retest) and inter-rater agreement at one point in time.
Review criteria, internal consistency:
- Methods to collect reliability data
- Reliability estimates and standard errors for all score elements (classical test theory), or the standard error of measurement over the range of the scale and the marginal reliability of each scale (modern IRT)
- Data to calculate reliability coefficients or actual calculations of reliability coefficients
- Above data for each major population of interest, if necessary
Review criteria, reproducibility:
- Methods employed to collect reproducibility data
- Well-argued rationale to support the design of the study and the interval between first and subsequent administrations, supporting the assumption that the population is stable
- Information on test-retest reliability and inter-rater reliability based on intraclass correlation coefficients
- Information on the comparability of the item parameter estimates and on measurement precision over repeated administrations

3. Validity
The degree to which the instrument measures what it purports to measure. Content-related: evidence that the content domain of an instrument is appropriate relative to its intended use. Construct-related: evidence that supports a proposed interpretation of scores based on theoretical implications associated with the constructs being measured. Criterion-related: evidence that shows the extent to which scores of the instrument are related to a criterion measure.
Review criteria:
- Rationale supporting the particular mix of evidence presented for the intended uses
- Clear description of the methods employed to collect validity data
- Composition of the sample used to examine validity (in detail)
- Above data for each major population of interest
- Hypotheses tested and data relating to the tests
- Clear rationale and support for the choice of criterion measures

4. Responsiveness
An instrument's ability to detect change over time.
Review criteria:
- Evidence on the changes in scores of the instrument
- Longitudinal data that compare a group that is expected to change with a group that is expected to remain stable
- Population(s) on which responsiveness has been tested, including the time intervals of assessment, the interventions or measures involved in evaluating change, and the populations assumed to be stable

5. Interpretability
The degree to which one can assign easily understood meaning to an instrument's quantitative scores.
Review criteria:
- Rationale for selection of external criteria or populations for purposes of comparison and interpretability of data
- Information regarding the ways in which data from the instrument should be reported and displayed
- Meaningful 'benchmarks' to facilitate interpretation of the scores

6. Burden
The time, effort, and other demands placed on those to whom the instrument is administered (respondent burden) or on those who administer the instrument (administrative burden).
Review criteria, respondent burden:
- Information on (a) the average and range of time needed to complete the instrument, (b) the reading and comprehension level, and (c) any special requirements or requests made of respondents
- Evidence that the instrument places no undue physical or emotional strain on the respondent
- When or under what circumstances the instrument is not suitable for respondents
Review criteria, administrative burden:
- Information about any resources required for administration of the instrument
- Average time and range of time required of a trained interviewer to administer the instrument in face-to-face interviews, by telephone, or with computer-assisted formats
- Amount of training and level of education or professional expertise and experience needed by administrative staff

7. Alternative modes of administration
These include self-report, interviewer-administered, trained observer rating, computer-assisted interviewer-administered, and performance-based measures.
Review criteria:
- Evidence on reliability, validity, responsiveness, interpretability, and burden for each mode of administration
- Information on the comparability of alternative modes

8. Cultural and language adaptations or translations
These involve two primary steps: (1) assessment of conceptual and linguistic equivalence, and (2) evaluation of measurement properties.
Review criteria:
- Methods to achieve conceptual equivalence
- Methods to achieve linguistic equivalence
- Any significant differences between the original and translated versions
- How inconsistencies were reconciled

* For all review criteria, developers are expected to provide definitions, descriptions, explanations, or empirical information.

An instrument that works well for one purpose or in one setting or population may not do so when applied for another purpose or in another setting or population. The relative importance of the eight attributes may differ depending on the intended uses and applications specified for the instrument. Instruments may, for instance, document the health status or attitudes of individuals at a point in time, distinguish between two or more groups, assess change over time among groups or individuals, predict future status, or some combination of these. Hence, the weight placed on one or another set of criteria may differ according to the purposes claimed for the instrument.

In reviewing instruments, the SAC aimed to be thorough without holding instruments to unrealistically high standards. For example, we accepted some instruments even though their responsiveness to change over time (attribute 4) had not been evaluated at the time of submission. In a case such as this, we would note that the instrument had been approved for group comparisons but that no data were available regarding the instrument's responsiveness. In other cases, developers may provide support for content and construct validity but not criterion validity, because true gold standards are often not available for evaluating the latter.


In yet other cases, reliability may be judged sufficient for comparing groups but not for evaluating individuals. In summary, we matched criteria to the particular uses claimed for the instrument and accepted instruments for specific applications when evaluation of the instrument and its documentation supported those applications.

In the remainder of this paper, we present our definitions of the attributes noted above and then give our current (i.e., now revised) review criteria. The criteria are offered in terms of our view of what instrument developers should 'do' (e.g., describe, provide, or discuss) in documenting the characteristics of their instruments, so the material appears largely in bulleted form. We emphasize here that our definitions and criteria are open to further discussion and evolution within the field of health status assessment, and we hope that experts around the world will be encouraged to engage in a dialogue about these issues in years to come.

Conceptual and measurement model

Definition
A conceptual model is a rationale for and description of the concepts and the populations that a measure is intended to assess and the relationship between those concepts. A measurement model operationalizes the conceptual model and is reflected in an instrument's scale and subscale structure and the procedures followed to create scale and subscale scores. The adequacy of the measurement model can be evaluated by examining evidence that: (1) a scale measures a single conceptual domain or construct; (2) multiple scales measure distinct domains; (3) the scale adequately represents variability in the domain; and (4) the intended level of measurement of the scale (e.g., ordinal, interval, or ratio) and its scoring procedures are justified.

Classical test theory approaches may employ, for example, principal components analysis, factor analyses, and related techniques for evaluating the empirical measurement model underlying an instrument and for examining dimensionality. Methods based on modern test theory may use approaches including confirmatory factor analysis, structural equation modeling, and methods based on item response theory (IRT).

Review criteria
Developers should:
- State what broad concept (or concepts) the instrument is trying to measure - for example, functional status, well-being, health-related quality of life, QoL, satisfaction with health care, or others. In addition, if the instrument is designed to assess multiple domains within a broad concept (e.g., multiple scales assessing several dimensions of health-related quality of life), provide a listing of all domains or dimensions.
- Describe the conceptual and empirical basis for generating the instrument content (e.g., items) and for combining multiple items into a single scale score and/or multiple scale scores.
- State the methods and involvement of the target populations for obtaining the final content of the instrument and for ascertaining the appropriateness of the instrument's content for that population, for example by use of focus groups or pretesting in the target population(s).
- Provide information on the dimensionality and distinctiveness of multiple scales, because both classical and modern test approaches assume appropriate dimensionality (usually unidimensionality) of scales.
- Provide evidence that the scale has adequate variability in a range that is appropriate to its intended use - for example, information on central tendency and dispersion, skewness, ceiling and floor effects, and the pattern of missing data (see the sketch after this list).
- State the intended level of measurement (e.g., ordinal, interval, or ratio scales) with available supportive evidence.
- Describe the rationale and procedures for deriving scale scores from raw scores and for transformations (such as weighting and standardization); for preference-weighted or utility measures, provide a rationale and empirical basis for the weights.
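The variability evidence asked for in the list above is straightforward to assemble. The following is a minimal sketch (hypothetical names and data, assuming numeric scores with NaN marking missing responses), not an SAC-prescribed procedure:

```python
import numpy as np
from scipy import stats

def scale_variability_report(scores, scale_min, scale_max):
    """Descriptive evidence of scale variability: central tendency,
    dispersion, skewness, floor/ceiling effects, and missing data."""
    raw = np.asarray(scores, dtype=float)
    observed = raw[~np.isnan(raw)]
    return {
        "n": observed.size,
        "missing_pct": 100 * np.mean(np.isnan(raw)),
        "mean": observed.mean(),
        "sd": observed.std(ddof=1),
        "median": np.median(observed),
        "skewness": stats.skew(observed, bias=False),
        "floor_pct": 100 * np.mean(observed == scale_min),   # % at worst score
        "ceiling_pct": 100 * np.mean(observed == scale_max), # % at best score
    }

# Hypothetical 0-100 scale with one missing response
example = [100, 95, 100, 80, 65, float("nan"), 100, 50, 90, 100]
print(scale_variability_report(example, scale_min=0, scale_max=100))
```

A large percentage of respondents at the floor or ceiling signals that the scale cannot register further deterioration or improvement, which in turn limits responsiveness (attribute 4).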


Reliability

Definition
The principal definition of test reliability is the degree to which an instrument is free from random error. Classical approaches for examining test reliability include (a) internal consistency reliability, typically using Cronbach's coefficient α, and (b) reproducibility (e.g., test-retest or inter-observer (interviewer) reliability). The first approach requires one administration of the instrument; the latter requires at least two administrations.

In modern test theory applications, the degree of precision of measurement is commonly expressed in terms of error variance, the standard error of measurement (SEM) (the square root of the error variance), or test information (the reciprocal of the error variance). Error variance (or any other measure of precision) takes on different values at different points along the scale.

Internal consistency reliability. In the classical approach, Cronbach's coefficient α provides an estimate of reliability based on all possible split-half correlations for a multi-item scale. For instruments employing dichotomous response choices, an alternative formula, the Kuder-Richardson formula 20 (KR-20), is available. Commonly accepted minimal standards for reliability coefficients are 0.70 for group comparisons and 0.90-0.95 for individual comparisons. Reliability requirements are higher when applying instrument scores to individualized use because confidence intervals for those scores are typically computed from the SEM, where SEM = SD × √(1 − reliability) and SD is the standard deviation of the scale scores. Reliability coefficients lower than 0.90-0.95 yield confidence intervals too wide (e.g., spanning more than one to two thirds of the score distribution) to be useful for monitoring an individual's score.

In the IRT approach, measurement precision is generally evaluated at one or more points on the scale. The scale's precision should be characterized over the measurement range likely to be encountered in actual research. A single value, marginal reliability, can be estimated as an analog to the classical reliability coefficient. This value is most useful for tests in which measurement precision is relatively stable across the scale.

Reproducibility. A second approach to reliability is to judge the reproducibility or stability of an instrument over time (test-retest) and inter-rater agreement at one point in time. In classical applications, the stability of an instrument is often expressed as a single value, but IRT applications describe specific levels of stability for specific levels of the scale. As with internal consistency reliability, minimal standards for reproducibility coefficients are typically 0.70 for group comparisons and 0.90-0.95 for individual measurements over time.

Test-retest reproducibility is the degree to which an instrument yields stable scores over time among respondents who are assumed not to have changed on the domains being assessed. The influence of the first administration on the second may lead test-retest data to overestimate reliability. Conversely, variations in health, learning, reaction, or regression to the mean may yield test-retest data that underestimate reproducibility. Bias and limits-of-agreement statistics can indicate the range within which 95% of retest scores can be expected to lie. Despite these cautions, information on test-retest reproducibility is important for the evaluation of an instrument. For instruments administered by an interviewer, test-retest reproducibility typically refers to agreement among two or more observers.

Review criteria
Internal consistency reliability and test information. Developers should:
- Describe clearly the methods employed to collect reliability data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (e.g., sociodemographics, clinical characteristics if drawn from a patient population); (c) the testing conditions (e.g., where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (e.g., means, SDs, floor and ceiling effects).
- For classical applications, report reliability estimates and SEs for all elements of an instrument, including both the total score and subscale scores, where appropriate.
- For IRT applications, provide a plot showing the SEM over the range of the scale. In addition, the marginal reliability of each scale may be reported where this information is considered useful.
- Where developers have reason to believe that reliability estimates or the SEM may differ substantially for the various populations in which an instrument is to be used, present these data for each major population of interest (e.g., different chronic disease populations, different language or cultural groups).
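To make the classical quantities concrete, here is a minimal sketch of Cronbach's α and the SEM formula given above (SD × √(1 − reliability)); the simulated data and names are hypothetical:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix.
    For dichotomous 0/1 items this equals the KR-20 coefficient."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_variance = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1.0 - item_variances / total_variance)

def standard_error_of_measurement(scale_scores, reliability):
    """SEM = SD * sqrt(1 - reliability), used for individual-level CIs."""
    return np.std(scale_scores, ddof=1) * np.sqrt(1.0 - reliability)

# Hypothetical field-test data: 200 respondents, five 1-5 items
rng = np.random.default_rng(0)
true_level = rng.normal(3.0, 0.8, size=(200, 1))
items = np.clip(np.rint(true_level + rng.normal(0, 0.6, size=(200, 5))), 1, 5)

alpha = cronbach_alpha(items)
sem = standard_error_of_measurement(items.sum(axis=1), alpha)
print(f"alpha = {alpha:.2f} (benchmarks: 0.70 for groups, 0.90-0.95 for individuals)")
print(f"SEM = {sem:.2f} points; 95% CI for a person is score +/- {1.96 * sem:.1f}")
```

The printed confidence interval makes the point in the definition tangible: unless reliability reaches roughly 0.90 or higher, the ±1.96 × SEM band is too wide to track an individual respondent meaningfully.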


Reproducibility. Developers should:
- Describe clearly the methods employed to collect reproducibility data. This description should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (e.g., sociodemographics, clinical characteristics if drawn from a patient population); (c) the testing conditions (e.g., where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (e.g., intraclass correlation coefficient, receiver operating characteristic, the test-retest mean, limits of agreement).
- Provide test-retest reproducibility information as a complement to, not a substitute for, internal consistency.
- Give a well-argued rationale to support the design of the study and the interval between the first and subsequent administrations, supporting the assumption that the population is stable. This can include self-report about perceived change in health over the time interval or other measures of general and specific health or functional status. Information about test and retest scores should include the appropriate central tendency and dispersion measures for both administrations.
- In classical applications for instruments yielding interval-level data, include information on test-retest reliability (reproducibility) and inter-rater reliability based on intraclass correlation coefficients (ICC, the bias statistic or test-retest mean, or limits of agreement); for nominal or ordinal scale values, κ and weighted κ, respectively, are recommended (see the sketch after this list).
- In IRT applications, also provide information on the comparability of the item parameter estimates and on measurement precision over repeated administrations.
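As an illustration of the ICC called for above, here is a minimal sketch of ICC(2,1) in the Shrout-Fleiss taxonomy (two-way random effects, absolute agreement, single measurement), a common choice for test-retest designs; the layout and names are assumptions, not part of the SAC criteria:

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rating
    (Shrout & Fleiss). `scores` is an (n_subjects x k_administrations)
    matrix, e.g. column 0 = test, column 1 = retest."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-administration means
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```

For a test-retest study, `scores` would hold the first and second administrations as two columns; the 0.70 (groups) and 0.90-0.95 (individuals) benchmarks quoted earlier then apply to the resulting coefficient.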
Validity

Definition
The validity of an instrument is defined as the degree to which the instrument measures what it purports to measure. Evidence for the validity of an instrument has commonly been classified in the three ways discussed just below. (We note that validation of a preference-based measure will need to employ constructs relating to preferences per se, not simply descriptive constructs, and these can differ from the criteria set out below for nonutility measures.)

1. Content-related: Evidence that the content domain of an instrument is appropriate relative to its intended use. Methods commonly used to obtain evidence about content-related validity include the use of lay and expert panel (clinician) judgments of the clarity, comprehensiveness, and redundancy of the items and scales of an instrument. Often, the content of newly developed self-report instruments is best elicited from the population being assessed or experiencing the health condition.

2. Construct-related: Evidence that supports a proposed interpretation of scores based on theoretical implications associated with the constructs being measured. Common methods to obtain construct-related validity data include examining the logical relations that should exist with other measures and/or patterns of scores for groups known to differ on relevant variables. Ideally, developers should generate and test hypotheses about specific logical relationships among relevant concepts or constructs.

3. Criterion-related: Evidence that shows the extent to which scores of the instrument are related to a criterion measure. Criterion measures are measures of the target construct that are widely accepted as scaled, valid measures of that construct. In the area of self-reported health status assessment, criterion-related validity is rarely tested because of the absence of widely accepted criterion measures, although exceptions occur, such as testing shorter versions of measures against longer versions. For testing screening instruments, criterion validity is essential to compare the screening measure against a criterion measure of the diagnosis or condition in question, using sensitivity, specificity, and the receiver operating characteristic.
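For screening instruments, the comparison just described reduces to a 2 × 2 table at a chosen cut-point. Below is a minimal sketch with hypothetical names, assuming higher scores indicate the condition; sweeping the cut-point across the score range traces out the receiver operating characteristic curve:

```python
import numpy as np

def screen_accuracy(scores, has_condition, cutoff):
    """Sensitivity and specificity of the rule 'score >= cutoff' against
    a criterion diagnosis (e.g., a clinical gold standard)."""
    scores = np.asarray(scores)
    has_condition = np.asarray(has_condition, dtype=bool)
    flagged = scores >= cutoff
    sensitivity = np.mean(flagged[has_condition])     # true-positive rate
    specificity = np.mean(~flagged[~has_condition])   # true-negative rate
    return sensitivity, specificity
```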
Review criteria
Developers should:
- Explain the rationale that supports the particular mix of evidence presented for the intended uses.


- Provide a clear description of the methods employed to collect validity data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (e.g., sociodemographics, clinical characteristics if drawn from a patient population); (c) the testing conditions (i.e., where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (e.g., means, SDs, floor and ceiling effects).
- Describe the composition of the sample used to examine the validity of a measure in sufficient detail to make clear the populations to which the instrument applies and the selective factors that might reasonably be expected to influence validity, such as gender, age, ethnicity, and language.
- When reasons exist to believe that validity will differ substantially for the various populations in which an instrument is to be used, present the above data for each major population of interest (e.g., different chronic disease populations, different language or cultural groups, different age groups). Because validity testing and use of major instruments are ongoing, we encourage developers to continue to present such data as they accumulate them.
- When presenting construct validity, provide the hypotheses tested and the data relating to those tests (see the sketch after this list).
- When data related to criterion validity are presented, provide a clear rationale and support for the choice of criterion measures.
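To illustrate the known-groups tactic named under construct-related evidence above, the sketch below compares scale scores for two groups hypothesized in advance to differ. The groups, data, and effect size are hypothetical illustrations, not SAC requirements:

```python
import numpy as np
from scipy import stats

# Hypothetical known-groups comparison: patients with a chronic condition
# are hypothesized (in advance) to score lower than healthy controls.
rng = np.random.default_rng(2)
patients = rng.normal(55, 12, 120)   # simulated 0-100 scale scores
controls = rng.normal(70, 12, 150)

t, p = stats.ttest_ind(controls, patients, equal_var=False)  # Welch's t-test
pooled_sd = np.sqrt((controls.var(ddof=1) + patients.var(ddof=1)) / 2)
d = (controls.mean() - patients.mean()) / pooled_sd          # Cohen's d
print(f"t = {t:.1f}, p = {p:.1e}, Cohen's d = {d:.2f}")
```

The hypothesis (direction and rough magnitude of the group difference) should be stated before the data are examined; confirming it supports the proposed interpretation of the scores.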
differencesor changes.We makethis distinctionin
Responsiveness part because, although responsivenessand inter-
pretabilityare related concepts, one focuses on
Definition performancecharacteristicsof the instrumentat
Sometimes referred to as sensitivity to change, hand and the other focuses on the respondents'
responsivenessis viewed as an importantpart of views about the domainsbeing studied.
the longitudinalconstructvalidationprocess. Re-
sponsivenessrefers to an instrument'sability to Review criteria
detect change. The criterionof responsivenessre- Developersshould:
quiresasking whetherthe measurecan detect dif- - For any claim that an instrumentis responsive,
ferencesin outcomes,even if those differencesare provideevidenceon the changesin scoresfound
small. Responsivenesscan be conceptualizedalso in field tests of the instrument.Apart from this
as the ratio of a signal (the real change over time information,changescorescan also be expressed
that has occurred)to the noise (the variabilityin as effect sizes, standardizedresponse means,
scores seen over time that is not associatedwith SEM, or other relativeor adjustedmeasuresof
true changein status). distance between before and after scores. The
Assessmentof responsivenessinvolvesstatistical methods and formulae used to calculate the re-
estimation of an effect size statistic - that is, an sponsivenessstatisticsshould be explained.


- Preferably, cite longitudinal data that compare a group that is expected to change with a group that is expected to remain stable.
- Clearly identify the population(s) on which responsiveness has been tested, including the time intervals of assessment, the interventions or measures involved in evaluating change, and the populations assumed to be stable.
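As one concrete instance of the effect size statistics defined above, the sketch below computes the standardized response mean (mean change divided by the SD of the change scores) for a group expected to change and a group assumed stable; all data and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def standardized_response_mean(before, after):
    """SRM: mean change score divided by the SD of the change scores."""
    change = np.asarray(after) - np.asarray(before)
    return change.mean() / change.std(ddof=1)

# Simulated field test: a treated group expected to improve, a stable group
treated_before = rng.normal(40, 10, 80)
treated_after = treated_before + rng.normal(8, 6, 80)   # true improvement
stable_before = rng.normal(40, 10, 80)
stable_after = stable_before + rng.normal(0, 6, 80)     # no true change

print(f"SRM, treated: {standardized_response_mean(treated_before, treated_after):.2f}")
print(f"SRM, stable:  {standardized_response_mean(stable_before, stable_after):.2f}")
```

A responsive instrument should show a large SRM in the group expected to change and an SRM near zero in the stable group; swapping the denominator (baseline SD, or SD of change in the stable group) yields the other effect size variants the definition mentions.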
Interpretability

Definition
Interpretability is defined as the degree to which one can assign easily understood meaning to an instrument's quantitative scores. Interpretability of a measure is facilitated by information that translates a quantitative score, or a change in scores, into a qualitative category or other external measure that has a more familiar meaning. Interpretability calls for explanation of the rationale for the external measure, the change scores, and the ways that those scores are to be interpreted in relation to the external measure.

Several types of information can aid in the interpretation of scores:
- comparative data on the distributions of scores derived from a variety of defined population groups, including, when possible, a representative sample of the general population;
- results from a large pool of studies that have used the instrument in question and reported findings on it, thus bringing a familiarity with the instrument that will aid interpretation;
- the relationship of scores to clinically recognized conditions, the need for specific treatments, or interventions of known effectiveness;
- the relationship of scores or changes in scores to socially recognized life events (such as the impact of losing a job);
- the relationship of scores or changes in scores to subjective ratings of minimally important changes by persons with the condition, their significant others, or their providers; and
- how well scores predict known relevant events (such as death or the need for institutional care).

Review criteria
Developers should:
- Clearly describe the rationale for the selection of external criteria or populations for purposes of comparison and interpretability of data. As with validity (attribute 3), this should include (a) the rationale for selection of the external criteria or comparison population; (b) methods of sample accrual and sample size; (c) characteristics of the sample; (d) the testing conditions; and (e) descriptive statistics for the instrument under study.
- Provide information regarding the ways in which data from the instrument should be (or have been) reported and displayed in order to facilitate interpretation.
- Cite meaningful 'benchmarks' (comparative or normative data) to facilitate interpretation of the scores.

Burden

Definition
Respondent burden is defined as the time, effort, and other demands placed on those to whom the instrument is administered. Administrative burden is defined as the demands placed on those who administer the instrument.

Review criteria: respondent burden
Developers should:
- Give information on the following properties:
(1) the average and range of time needed to complete the instrument on a self-administered basis or as an interviewer-administered instrument, for all population groups for which the instrument is intended;
(2) the reading and comprehension level needed for all population groups for which the instrument is intended;
(3) any special requirements or requests that might be placed on respondents, such as the need to consult health care records or copy information about medications used; and
(4) the acceptability of the instrument, for example by indicating the level of missing data and refusal rates and the reasons for both.
- For instruments that are not, on the face of it, harmless, and for those that appear to have excessive rates of missing data, provide evidence that the instrument places no undue physical or emotional strain on the respondent (for instance, that it does not include questions that a significant minority of patients finds too upsetting or confrontational).


- Indicate when or under what circumstances the instrument is not suitable for respondents.

Review criteria: administrative burden
Developers should provide information about any resources required for administration of the instrument, such as the need for special or specific computer hardware or software to administer, score, or analyze the instrument.

For interviewer-administered instruments, developers should:
- Document the average time and range of time required of a trained interviewer to administer the instrument in face-to-face interviews, by telephone, or with computer-assisted formats/applications, as appropriate;
- Indicate the amount of training and level of education or professional expertise and experience needed by administrative staff to administer, score, or otherwise use the instrument;
- Indicate the availability of scoring instructions.

Alternative modes of administration

Definition
Alternative modes of administration used for the development and application of instruments can include self-report, interviewer-administered, trained observer rating, computer-assisted self-report, computer-assisted interviewer-administered, and performance-based measures. In addition, alternative modes may include self-administered or interviewer-administered versions of the original source instrument that are to be completed by proxy respondents such as parents, spouses, providers, or other substitute respondents.

Review criteria
Developers should:
- Make available evidence on reliability, validity, responsiveness, interpretability, and burden for each mode of administration;
- Provide information on the comparability of alternative modes; whenever possible, equating studies should be conducted so that scores from alternative modes can be made comparable to each other or to scores from the original instrument.

Cultural and language adaptations or translations

Definition
Many instruments are adapted or translated for applications across regional and national borders and populations. In the MOT and SAC context, cultural and language adaptations have referred to situations in which instruments have been fully adapted from original or source instruments for cultures or languages different from the original. Language adaptation might well be differentiated from translation. As a case in point: an instrument developed in Spanish or English may be adapted for different 'versions' (e.g., country- or region-specific dialects) of those basic languages, whereas an instrument developed in Swedish and translated into French or German would be quite a different matter. In any case, the SAC has held the view that the measurement properties of each cultural and/or language adaptation ought to be judged separately for evidence of reliability, validity, responsiveness, interpretability, and burden.

The cross-cultural adaptation of an instrument involves two primary steps: (1) assessment of conceptual and linguistic equivalence, and (2) evaluation of measurement properties. Conceptual equivalence refers to equivalence in the relevance and meaning of the same concepts being measured in different cultures and/or languages. Linguistic equivalence refers to equivalence of question wording and meaning in the formulation of items, response choices, and all aspects of the instrument and its applications. In all such cases, it is useful if developers provide empirical information on how items work in different cultures and languages.

Review criteria
Developers should:
- Describe the methods used to achieve linguistic equivalence. The commonly recommended steps are (a) at least two forward translations from the source language that yield a pooled forward translation; (b) at least one, and preferably more, backward translations to the source language that result in another pooled translation; (c) a review of translated versions by lay and expert panels, with revisions; and (d) field tests to provide evidence of comparability.


- Provide information about the methods used to achieve conceptual equivalence between or among different versions of the same instrument. For this step, assessment of the content validity of the instrument in each cultural or language group to which the instrument is to be applied is commonly recommended. In a cross-cultural perspective, some items in a given instrument may well function differently in one language than in another. Thus, IRT and confirmatory factor analysis (using structural equation modeling approaches, for example) can be used to evaluate cross-cultural equivalence through examination of differential item functioning (DIF) (see the sketch after these criteria).
- Identify and explain any significant differences between the original and translated versions.
- Explain how inconsistencies were reconciled.
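The conceptual-equivalence bullet above points to IRT and confirmatory factor analysis for examining DIF. As a lighter-weight illustration of the same underlying idea - comparing item responses across language groups among respondents matched on total score - here is a sketch of a Mantel-Haenszel-style DIF screen for a dichotomous item; this is an assumed example, not the SAC's prescribed method:

```python
import numpy as np

def mantel_haenszel_dif(item, group, total):
    """Mantel-Haenszel common odds ratio for endorsing a binary item,
    comparing a reference group (0) with a focal group (1) while
    stratifying on total scale score. An odds ratio near 1 suggests
    little DIF; marked departures flag the item for review."""
    item = np.asarray(item, dtype=bool)
    group = np.asarray(group)
    total = np.asarray(total)
    num = den = 0.0
    for t in np.unique(total):
        m = total == t                          # matched score stratum
        a = np.sum(item[m] & (group[m] == 0))   # reference, endorsed
        b = np.sum(~item[m] & (group[m] == 0))  # reference, not endorsed
        c = np.sum(item[m] & (group[m] == 1))   # focal, endorsed
        d = np.sum(~item[m] & (group[m] == 1))  # focal, not endorsed
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")
```

Applied item by item to each language version, such a screen can nominate candidates for the fuller IRT- or SEM-based analyses the criteria describe.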
Conclusions

International interest in and support for health status and QoL assessment in biomedical and health services research, clinical care, and even health policymaking are expanding rapidly. These developments are occurring in an environment featuring both traditional and emerging methods for measuring the outcomes of health care, and they offer exciting opportunities to expand such applications and to enhance the confidence with which clinicians, investigators, and policymakers can use such instruments. Nonetheless, the field ought not to proceed in too free-wheeling a manner. Thus, we offer these definitions of attributes and criteria for judging instruments in the strong hope that the field will use them as a jumping-off place for debate and discussion about the challenges that lie ahead. These challenges include continued refinement of existing measures and development of measures to cover clear gaps relating to patient populations and disease groups; improving instruments to make them more culturally appropriate and comparable across diverse populations; dealing with the differences and understanding the complementarities of instruments developed with different conceptual frameworks; and enhancing the ways that the results of such instruments can be interpreted in ordinary terms.

Acknowledgments

We extend our deepest appreciation to Alvin Tarlov, MD, founding president of the Medical Outcomes Trust, for unstinting encouragement and support to the Scientific Advisory Committee (SAC) and its efforts to develop and promulgate rigorous criteria for reviewing health status instruments. We also thank Les Lipkind, the MOT Executive Director during the time this article was being prepared, for substantial background and coordinating support to the SAC. Maria Orlando, PhD, at The RAND Corporation, Santa Monica, California, assisted with the development of material encompassing modern test theory.

We are especially grateful to the following individuals who provided comments and suggestions on an earlier version of these criteria: Jacob Bjorner, AMI, Copenhagen, Denmark; John Brazier, PhD, University of Sheffield, Sheffield, United Kingdom; Yen-Pin Chiang, PhD, Agency for Healthcare Research and Quality, Rockville, Maryland; Pennifer Erickson, PhD, College Station, Pennsylvania; Rowan Harwood, MA, MSc, MD, MRCP(UK), London, England; Ronald Hays, PhD, University of California at Los Angeles and The RAND Corporation, Santa Monica, California; and Cathy D. Sherbourne, PhD, The RAND Corporation, Santa Monica, California. This acknowledgment of interest and assistance on the part of reviewers does not necessarily imply endorsement of the SAC final criteria.

This work (Lohr) was supported in part by the Center for Education and Research in Therapeutics at the University of North Carolina (Cooperative Agreement No. U18 HS10397) and the University of North Carolina Program on Health Outcomes.

Further reading

1. Aday LA. Designing and Conducting Health Surveys: A Comprehensive Guide. 2nd edn. San Francisco, Calif.: Jossey-Bass, 1996.
2. American Psychological Association. Standards for Educational and Psychological Testing. Washington, DC: APA, 1985.
3. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307-310.


4. Bjorner JB, Thunedborg K, Kristensen TS, Modvig J, Bech P. The Danish SF-36 health survey: Translation and preliminary validity studies. J Clin Epidemiol 1998; 51: 991-999.
5. Bjorner JB, Kreiner S, Ware JE Jr, Damsgaard MT, Bech P. Differential item functioning in the Danish translation of the SF-36. J Clin Epidemiol 1998; 51: 1189-1202.
6. Bowling A. Measuring Health: A Review of Quality of Life Measurement Scales. 2nd edn. London: Open University Press, 1997.
7. Carmines E. Reliability and Validity Assessment. Newbury Park, Calif.: Sage Publications, 1997.
8. Cronbach LJ. Essentials of Psychological Testing. 4th edn. New York: Harper and Row, 1984.
9. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297-334.
10. DeVellis R. Scale Development: Theory and Applications. Vol. 26, Applied Social Research Methods Series. Newbury Park, Calif.: Sage Publications, 1991.
11. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of Item Response Theory. Newbury Park, Calif.: Sage Publications, 1991.
12. Lohr KN. Health outcomes methodology symposium: Summary and recommendations. Med Care 2000; 38(9 Suppl. II): II194-II208.
13. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley, 1968.
14. McDonald RP. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates, 1999.
15. McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. 2nd edn. New York: Oxford University Press, 1996.
16. McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: Are available health surveys adequate? Qual Life Res 1995; 4: 293-307.
17. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd edn. New York: McGraw-Hill, 1994.
18. Payne SL. The Art of Asking Questions. Princeton, NJ: Princeton University Press, 1951.
19. Patrick DL, Chiang Y-P (eds). Health outcomes methodology symposium. Med Care 2000; 38(9 Suppl. II): II3-II208.
20. Reise SP, Widaman KF, Pugh RH. Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychol Bull 1993; 114: 552-566.
21. Staquet M, Hays R, Fayers P. Quality of Life Assessment in Clinical Trials: Methods and Practice. New York: Oxford University Press, 1998.
22. Streiner DL, Norman GR. Health Measurement Scales. 2nd edn. Oxford: Oxford University Press, 1995.
23. Wainer H, Dorans NJ, Flaugher R, et al. Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates, 1990.
24. Wilkin D, Hallam L, Doggett MA. Measures of Need and Outcome for Primary Health Care. New York: Oxford University Press, 1992.

Author for correspondence: Kathleen N. Lohr, PhD, Chief Scientist, Health, Social, and Economic Research, RTI International, PO Box 12194, 3040 Cornwallis Road, Research Triangle Park, NC 27709-2194, USA
Phone: +1-919-541-6512; Fax: +1-919-541-7384
E-mail: klohr@rti.org
