
Health Policy, 20 (1992) 321-328

© 1992 Elsevier Science Publishers B.V. All rights reserved. 0168-8510/92/$05.00

HPE 00455

A Second Opinion
Health related quality of life measurement -
Euro style

Roy A. Carr-Hill
School of Social and Political Sciences, University of Hull, Hull, United Kingdom

(Accepted 20 October 1991)

Summary

Several teams are attempting to produce generic health related quality of life
measures: none, probably, as ambitious as the EuroQol© group who are ‘developing
a standardised non-disease-specific instrument ... with the capacity to generate
cross-national comparisons’ (EuroQol© Group, EuroQol© - a new facility for the
measurement of health related quality of life, Health Policy, 16 (1990) 199-208).
Unfortunately the instrument is flawed both conceptually and in its construction; it
is unsurprising, therefore, that the response rates they obtain are so abysmal. Apart
from these design faults, the main problem is the quite legitimate refusal of most
normal people (respondents) to rate death on the same scale as health states.

Quality of life measurement; EuroQol©; Cross-national comparison

Whether it be the characteristics of structure, process or of outcome,
everyone wants to measure these days. This poses ‘challenges’ for health
services researchers, but outcome is the greatest challenge of them all. So when
anyone suggests an easy solution to the problem of measuring outcomes, it
sounds too good to be true.
One of the newest on the market is the EuroQol© [1] which comes in the
guise of an easy-to-use thermometer: a simple modest instrument for telling us
what we already very roughly knew; how hot or cold we were. Similarly the
EuroQol© instrument purports to take the ‘temperature’ of your health. The
implicit suggestion, therefore, is that the respondents can assess the health
status of themselves and others as if on a thermometer [2]. Does this make any
sense?

Address for correspondence: Roy A. Carr-Hill, School of Social and Political Sciences, University of
Hull, Hull, HU6 7RX, United Kingdom.

[Fig. 1. Extract from the EuroQol© questionnaire. Boxed descriptions of
composite health states are set beside a vertical thermometer-like scale running
from ‘Best imaginable health state’ to ‘Worst imaginable health state’. Each box
describes one state in six statements, for example: No problems in walking
about; No problems with self-care; Unable to perform main activity (e.g. work,
study, housework); Able to pursue family and leisure activities; Moderate pain
or discomfort; Not anxious or depressed.]

The EuroQol© (see Fig. 1) is intended as a generic instrument for describing
and valuing health related quality of life. It has been promoted on an
international level by the EuroQol© Group. Their principal aim was ‘to test
the feasibility of jointly developing a standardised non-disease-specific
instrument ... with the capacity to generate cross-national comparisons’
[1], based on data capable of being collected via a self-completion
postal questionnaire [1], and which could ‘generate a single index value for any
given health state’ [1].

Health states are described in terms of six distinct dimensions (mobility, self
care, main activity, social relationships, pain and mood) with two categories of
main activity, social relationships and mood and three categories of mobility,
self care and pain. These were selected ‘following a detailed examination of ...
existing health status measures’ [1] although not, apparently, after any piloting,
testing or validation. The EuroQol© Group have claimed [3] that they are
piloting, testing and validating. They may be referring to the material published in
Nord [4], who was concerned with improving the abysmal response rates (see
below); but neither there nor elsewhere are the six ‘dimensions’ which are the
basis of the instrument put in question.
In the ‘standard’ procedure, respondents are asked to assign values to health
states defined by combinations of different ratings on each of the six
dimensions. Given that two or three categories are specified for each of the six
dimensions, there is a theoretical universe of 216 (3 x 3 x 2 x 2 x 3 x 2)
health states. Fourteen living states were selected as ‘likely to occur fairly
frequently in practice [and] representing a wide range of degrees of severity’ [1].
Respondents are then asked to rate these fourteen composite states - together
with Being Dead (although it is not suggested that this also occurs ‘fairly
frequently’) - on a ‘thermometer-like’ scale.
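
The size of this universe follows directly from the category counts. As a minimal sketch, assuming each state is written as a six-digit code in the order mobility, self care, main activity, social relationships, pain, mood (the order the state codes in this paper appear to follow), the states can be enumerated as below. The names `LEVELS` and `all_states` are illustrative only and come from no EuroQol© material.

```python
from itertools import product

# Number of categories per dimension, as given in the text
# (three for mobility, self care and pain; two for the others).
LEVELS = {
    "mobility": 3,
    "self_care": 3,
    "main_activity": 2,
    "social_relationships": 2,
    "pain": 3,
    "mood": 2,
}

def all_states(levels=LEVELS):
    """Return every composite state as a six-digit string such as '112131'."""
    ranges = [range(1, n + 1) for n in levels.values()]
    return ["".join(str(d) for d in combo) for combo in product(*ranges)]

states = all_states()
print(len(states))         # size of the theoretical universe
print("112131" in states)  # a state discussed in the text
```

The fourteen states actually put to respondents are therefore a small, hand-picked subset of what the descriptive system allows.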
The EuroQol© group claim that this is ‘just like a visual analogue scale’ [3].
But in the standard version a visual analogue scale is an uncalibrated
horizontal line rather than a finely calibrated vertical line. Some limited
experimentation has also been reported by Nord [4] in Norway, changing four
of the states and excluding dead but, given a theoretical universe of 216 states
and no empirical justification for the statement that the 14 states were ‘likely to
occur fairly frequently in practice’, this hardly merits the appellation ‘testing’.
No justification is given for basing valuations on these combined or
composite states rather than on explicit weightings of the different
components. This procedure makes it difficult to test for even face validity -
because the composite state defined by the combination has to correspond to a
‘real’ health state [5]. For example, it is not obvious how to describe - let alone
diagnose - someone who reports that they have no problems in walking about
nor with self care, and are unable to perform their main activity but able to
pursue family and leisure activities, yet are in extreme pain or discomfort
(state 112131, one of the 14 fairly frequent states).
Clearly for the purposes of deriving a Q for QALY measurements, a
one-dimensional valuation is essential; but that rather begs the question of what is
happening when one averages and compares ratings of the composite states. It
is unfortunately impossible to conduct any secondary analysis of these data
which are being held privately, but there is plenty of evidence [6-8] that people
weight different aspects of health differently. Yet the way in which these
respondents have weighted the six different dimensions to provide a composite
valuation is apparently not of any interest - it is certainly not discussed. The
aggregation and comparison of ratings resulting from the use of different
weighting procedures by different people is not sensible - think of the fuss
commentators make about comparing exit polls and other public opinion
polls.
Further, the use of scale values presumes that there is a defined interval
between them: and, like many other investigators, the EuroQol© group simply
assume that there would be equal intervals between the scale points [1]. In
other words, the difference between a rating of 100 and 90 (a drop from best
health to not quite so good) is presumed to be the same as a decline from 50 to
40 (from half way between best and worst to not quite so good as that). But
whilst we understand the implications of a change of 10°C on a thermometer,
at least for our heating bills, the meaning of a ten point change here is obscure
and not clarified in this text. Moreover, the presumption by the EuroQol©
group that the observed ratings of health states behave like this is rather
arrogant: after all, they were setting out to find an interval scale, not to declaim
the existence of one by fiat!
It is therefore not surprising that observed responses do not always fit into a
neat pattern, so that respondents often give ‘inconsistent’ ratings where their
rating for one health state ‘should’ be higher than another but is not. In
particular, some pairs of states are adjacent in the sense that their components
are identical except for one category: for example, two of the fourteen states
differ only in the level of pain. In one of the studies carried out under this
umbrella, Kind [9] shows how the proportion of respondents, inconsistent in
this sense, varies in pairs of adjacent states from 5% when the only category
which changes is self care to 32% when the only category which changes is pain.
Yet this is to treat each respondent’s rating as precise, as if it were a measure
on a real thermometer; that is, as if the same person would give the same rating
on any occasion (just as any reliable thermometer should give the same
temperature). But, whilst a person’s temperature is unambiguously higher (or
lower, or the same) today than yesterday, and nearly all of us are able to make
a reasonable guess as to whether we feel more or less healthy today than
yesterday (or we can’t tell the difference), few would claim to be able to make
meta-comparisons of this sort, comparing relative health states on one
hypothetical pair of days with relative health states on another hypothetical
pair of days. In fact, the Swedish study [10] draws attention to the changes in
valuation between the health state 112222 which was reproduced in the
restricted and extended core (on two successive pages of the instrument). ‘Of
208 respondents, only 55 gave the same valuation; 43 changed by 5 points or
less; 39 by 6-10 points; 30 by 11-15 points; and as many as 41 by 16 points or
more’ [10]. The mean value change was nearly 10 points which was highly
significant.
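
The quoted breakdown is easier to judge as proportions. The sketch below uses only the five counts given in the quotation, which sum to 208, the number of usable Swedish replies in Table 1. The band labels and variable names are illustrative.

```python
# Counts quoted from the Swedish study [10] for the repeated state 112222:
# how far each respondent's second valuation moved from the first.
CHANGE_COUNTS = {
    "same valuation": 55,
    "1-5 points": 43,
    "6-10 points": 39,
    "11-15 points": 30,
    "16+ points": 41,
}

total = sum(CHANGE_COUNTS.values())
shares = {band: round(100 * n / total, 1) for band, n in CHANGE_COUNTS.items()}
moved_more_than_10 = CHANGE_COUNTS["11-15 points"] + CHANGE_COUNTS["16+ points"]

print(total)
for band, pct in shares.items():
    print(f"{band:>15}: {pct}%")
print(f"moved by more than 10 points: {100 * moved_more_than_10 / total:.1f}%")
```

On these counts roughly a quarter of respondents repeated their own valuation exactly, and about a third moved by more than ten points on a 100 point scale.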
Nevertheless, the EuroQol© group go further to claim some cross-cultural
validity for their instrument and explicitly compare the results obtained in
different countries to translations of the questionnaire. Four medium size
postal surveys have been carried out in the U.K. (Frome), in The Netherlands
(Bergen op Zoom), in Norway (entire country) and in Sweden (Lund) as per
Table 1.

Table 1
Sample characteristics in four locations

Location                          Sampling frame       Original  Replies   Usable
                                                       mailing   returned  replies

U.K. (Frome)                      GP list              1321      522       310
The Netherlands (Bergen op Zoom)  List of households    200      112        74
Norway (entire country)           Population register   800      245       206
Sweden (Lund)                     Population list      1000      349       208

Table 2
Results of three pilot studies conducted with the EuroQol© instrument in Lund (Sweden,
n = 208/1000), Frome (U.K., n = 310/1321) and Bergen op Zoom (The Netherlands, n = 74/200)

(.. = illegible in this copy)

Health state     Median valuations    Mean valuations      Standard deviation
                 Lund  Frome  BoZ     Lund  Frome  BoZ     Lund  Frome  BoZ

111111           100   99     95      93    95     93      ..    ..     13
111121           ..    ..     ..      ..    ..     ..      ..    ..     ..
111112           ..    ..     ..      ..    ..     ..      ..    ..     ..
111122           ..    ..     ..      ..    ..     ..      ..    ..     ..
112121           ..    ..     ..      ..    ..     ..      ..    ..     ..
112131           ..    ..     ..      ..    ..     ..      ..    ..     ..
112222(a)        ..    ..     ..      ..    ..     ..      ..    ..     ..
112222(b)        39    40     40      38    40     41      19    16     21
112232           ..    ..     ..      ..    ..     ..      ..    ..     ..
212232           ..    ..     ..      ..    ..     ..      ..    ..     ..
222232           ..    ..     ..      ..    ..     ..      ..    ..     ..
232232           ..    ..     ..      ..    ..     ..      ..    ..     ..
322232           ..    ..     ..      ..    ..     ..      ..    ..     ..
332232           ..    ..     ..      ..    ..     ..      ..    ..     ..
being dead(a)    ..    ..     ..      ..    ..     ..      ..    ..     ..
being dead(b)    ..    ..     ..      ..    ..     ..      ..    ..     ..

Source: EuroQol© Group [1].

Initial response rates varied widely from 31 to 57% although the proportion of
unusable responses was remarkably high and stable in three of the studies at
between 34 and 41%. Their ‘final’ samples, therefore, varied between 21 and
37% of the initial samples. The exclusion of ‘being dead’ in the Norwegian
study [4] raised the proportion of usable responses from among those returned
to 84%, but given that their return rate was only 31% the ‘final’ sample was
only 26% of the initial sample. These rates are extraordinarily low when
asking about health, even for a postal self-completion questionnaire.
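
These percentages can be recomputed from the counts in Table 1. The sketch below is only a check on the arithmetic. The function and key names are illustrative, and the figures are those printed in the table.

```python
# (original mailing, replies returned, usable replies) from Table 1
SURVEYS = {
    "U.K. (Frome)": (1321, 522, 310),
    "The Netherlands (Bergen op Zoom)": (200, 112, 74),
    "Norway (entire country)": (800, 245, 206),
    "Sweden (Lund)": (1000, 349, 208),
}

def rates(mailed, returned, usable):
    """Return the three percentages discussed in the text."""
    return {
        "return_rate": 100 * returned / mailed,
        "unusable_share": 100 * (returned - usable) / returned,
        "final_share": 100 * usable / mailed,
    }

for place, counts in SURVEYS.items():
    r = rates(*counts)
    print(f"{place}: returned {r['return_rate']:.1f}%, "
          f"unusable {r['unusable_share']:.1f}% of returns, "
          f"final sample {r['final_share']:.1f}% of mailing")
```

On the Table 1 counts the Dutch return rate works out at 56%, a point below the upper figure quoted above. The other percentages agree with the text once rounded.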
They claim that ‘there seems little danger of selection bias as valuations vary
little with background variables or response times’ [1]; but in both the Swedish
and U.K. studies there were differences by level of education.

Table 3
Differences between median and mean valuations

(.. = illegible in this copy)

Category              Category  Pair of states      Range of differences
                      changes                       median    mean

Mobility              1-2       (112232, 212232)    10-13     10-11
                      2-3       (232232, 332232)    2-6       3-4
Self care             1-2       (212232, 222232)    12-15     12-14
                      2-3       (222232, 232232)    1-5       2-4
                      2-3       (322232, 332232)    1-3       1-3
Main activity         1-2       (111121, 112121)    14-21     14-22
Social relationships  No adjacent pairs of states
Pain                  1-2       (111112, 111122)    2-5       2-5
                      1-2       (111111, 111121)    9-15      10-14
                      2-3       (112121, 112131)    5-15      7-11
                      2-3       (112222, 112232)    4-7       2-4
Mood                  1-2       (111121, 111122)    12-19     ..
                      1-2       (111111, 111112)    20-29     22-28

The only data provided for all four studies are the median and mean ratings.
They argue that the patterns in each country are very similar (see Table 2)
based on the high rank correlations between three of the sets [1-3].
It should be noted, however, that the standard deviations of these ratings
are very high, given that each of the ratings relates to a 0-100 scale. Moreover,
given that the valuations derived from such a method are proposed eventually
to be used in the context of prioritising different health care interventions, the
crucial statistic is the difference between pairs of valuations. Given there are
fourteen different states (including ‘being dead’) there are 91 possible pairs.
The analysis here is restricted to those pairs where only one component
dimension changes and where you might expect greatest consistency/stability
in the differences. The basic data are presented in Table 3.
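
The pair count, and the restriction to pairs in which only one dimension changes, can be sketched as follows. This assumes that ‘adjacent’ means the two six-digit codes differ in exactly one position and by exactly one category. That is one reading of the text, not a stated procedure. The state list is taken from the pairs in Table 3, and the function name is illustrative.

```python
from itertools import combinations
from math import comb

# The thirteen living states that appear in Table 3; 'being dead' is the fourteenth.
LIVING_STATES = [
    "111111", "111121", "111112", "111122", "112121", "112131", "112222",
    "112232", "212232", "222232", "232232", "322232", "332232",
]

DIMENSIONS = ["mobility", "self care", "main activity",
              "social relationships", "pain", "mood"]

def one_step_dimension(a, b):
    """Name the single dimension on which a and b differ by one category, else None."""
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) == 1 and abs(int(a[diffs[0]]) - int(b[diffs[0]])) == 1:
        return DIMENSIONS[diffs[0]]
    return None

total_pairs = comb(len(LIVING_STATES) + 1, 2)  # fourteen states including 'being dead'
adjacent = [(a, b, one_step_dimension(a, b))
            for a, b in combinations(LIVING_STATES, 2)
            if one_step_dimension(a, b) is not None]

print(total_pairs)
for a, b, dim in adjacent:
    print(a, b, dim)
```

Under this reading the sketch finds thirteen one-step pairs among the living states and none for social relationships, which agrees with the ‘No adjacent pairs of states’ entry in Table 3.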
There are two initial observations. First for those dimensions with three
categories, the shift from category 1 to category 2 is associated with a larger
change than the shift from category 2 to category 3 for mobility and self care,
but not for pain. Second, where there are two pairs of states with the same
category shift (for self-care, pain and mood) there are quite large differences
between the changes in valuations. Of course, there was no reason to suppose
that the combinatory algorithm would be additive, but this particular pattern
suggests no systematic model at all.
As the data have not been released, it is impossible to carry out formal tests
of these differences. But although the range of differences between the mean
valuations observed is apparently quite small (and smaller than those between
medians), they are not insignificant. Suppose, for example, that the standard
deviation of the difference between any pair of ratings is as large as 20 (10) - so
that the distribution of differences between a pair of ratings would span a
range of roughly 80 (40) points (on a 100 point scale) with the usual confidence
interval. Then the standard error of the difference between the mean ratings on
samples of size n would be 20/√n (10/√n); in samples of 200 (the average sample
size), this gives a standard error of 1.4 (0.7) and a two-sided confidence interval
of 5.6 (2.8).
It can be seen from the final column that in two of thirteen comparisons, the
range of differences between the means is larger than 5.6 and in five it is larger
than 2.8. Thus a range of 20-29 points in value changes is not ‘rather
convincing evidence in our favour’ [3]; it is a statistically highly significant
difference.
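
The arithmetic behind these thresholds can be reproduced in a few lines. The sketch assumes the standard error is the standard deviation divided by √n, which gives the 1.4 and 0.7 quoted, and reads ‘the usual confidence interval’ as plus or minus two standard errors. The mean ranges are transcribed from the final column of Table 3 as far as they are legible; the first mood row, whose mean range cannot be read, is left out. All names are illustrative.

```python
from math import sqrt

def se_and_ci_width(sd, n, z=2.0):
    """Standard error of a mean difference and the full width of a +/- z*SE interval."""
    se = sd / sqrt(n)
    return se, 2 * z * se

# Ranges (low, high) of differences between mean valuations, from Table 3,
# omitting the one mood row whose mean range is not legible.
MEAN_RANGES = [
    (10, 11), (3, 4),                      # mobility
    (12, 14), (2, 4), (1, 3),              # self care
    (14, 22),                              # main activity
    (2, 5), (10, 14), (7, 11), (2, 4),     # pain
    (22, 28),                              # mood
]

def count_wider(ranges, threshold):
    """Count the ranges whose width (high - low) exceeds the threshold."""
    return sum(1 for low, high in ranges if high - low > threshold)

for sd in (20, 10):
    se, width = se_and_ci_width(sd, 200)
    print(f"sd={sd}: SE={se:.2f}, interval width={width:.2f}")

print(count_wider(MEAN_RANGES, 5.6), count_wider(MEAN_RANGES, 2.8))
```

On the legible rows, two ranges are wider than 5.6 points and five are wider than 2.8, the same counts as reported in the text.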

Conclusions

To sum up, there are no reports of reliability or validity for this instrument;
no account is taken of the different weights people attach to different
components of health; individual and inter-temporal variations are averaged
rather than being seen as of interest in their own right; there are apparently
considerable difficulties in obtaining responses; and yet this is proposed as the
basis for a European health status measuring instrument.
It should be emphasised that, whilst this particular instrument is more than
usually cheeky in claiming cross-cultural validity, and is pretending to be like a
thermometer, it is only an extreme version of all such instruments [11]. For the
Q valuations are to be used in order to make a QALY assessment. Because the
QALY brings together the health quality of survivors and the risk of death, the
Q valuation has to be related to ‘Perfect Health’ (or ‘Absence of Health
Problems’) and to ‘Being Dead’. All of the procedures therefore are based on
asking people to compare ‘Being Dead’ with various states of (ill-) health.
These instruments may be ‘easy-to-use’, but are they really a sensible basis for
management decisions about the allocation of health care resources?

Acknowledgements

We would like to thank the ESRC for supporting this analysis of the
fundamental basis of outcome measurement, Kerry Atkinson for producing
the manuscript, Paul Dixon, David Lewis and Jenny Morris for their very
helpful comments, and, of course, to the EuroQol@ Group, without which this
could not be written.

References

1 EuroQol Group, EuroQol - a new facility for the measurement of health related quality
of life, Health Policy, 16 (1990) 199-208.
2 Sintonen, H., An Approach to Measuring and Valuing Health States, Social Science and
Medicine, 15 (1981) 55-65.
3 Williams et al., EuroQol: ‘Not a Quick Fix’, Health Service Journal, 21 Nov. 1991, p. 29.
4 Nord, E., EuroQol©: Health related quality of life measurement. Valuations of health states by
the general public in Norway, Health Policy, 18 (1991) 25-36.
5 Froberg, D.G. and Kane, R.L., Methodology for Measuring Health State Preferences - I:
Measurement Strategies, J. Clin. Epidemiology, 42 (1989) 345-354.
6 Calnan, M., Health and Illness: The Lay Perspective, Tavistock, London, 1987.
7 Herzlich, C., Health and Illness, Academic Press, London, 1973.
8 Williams, R.G.A., Concepts of health: an Analysis of Lay Logic, Sociology, 17 (1983) 185-205.
9 Kind, P., Measuring Valuations for Health States: A survey of patients in general practice,
Centre for Health Economics Discussion Paper No. 76, University of York, York, 1990.
10 Brooks, R.G., Jendteg, S., Lindgren, B., Persson, U. and Björk, S., EuroQol©: health related
quality of life measurement. Results of the Swedish questionnaire exercise, Health Policy, 18
(1991) 37-48.
11 Carr-Hill, R.A. and Morris, J., Current Practice in Obtaining the Q in QALYs: A Cautionary
Note, British Medical Journal, 1991.
