
The validity of the 12-item General Health Questionnaire in Australia: a comparison between three scoring methods

Susan Donath

Objective: To investigate the specificity and sensitivity of three different scoring methods
of the 12-item General Health Questionnaire (GHQ-12) and hence to determine the best
GHQ-12 threshold score for the detection of mental illness in community settings in
Australia.
Method: Secondary data analysis of the 1997 Australian National Survey of Health and
Wellbeing (n = 10 641), using the Composite International Diagnostic Interview as the gold
standard for diagnosis of mental illness.
Results: The area under the Receiver Operating Characteristic (ROC) curve for the
C-GHQ scoring method was 0.84 (95% CI = 0.83–0.86) compared with the area for the
standard scoring method of 0.78 (95% CI = 0.76–0.80). The best threshold with C-GHQ
was 3/4, with sensitivity 82.9% (95% CI = 80.2–85.5%) and specificity 69.0% (95%
CI = 68.6–69.4%). The best threshold score with the standard scoring method was 0/1, with
sensitivity 75.4% (95% CI = 72.5–78.4%) and specificity 69.9% (95% CI = 69.5–70.3%).
These were also the best thresholds for a subsample of the population who had consulted
a health practitioner in the previous 4 weeks.
Conclusion: In the Australian setting, the C-GHQ scoring method is preferable to the
standard method of scoring the GHQ-12. In Australia the GHQ-12 appears to be a less useful
instrument for detecting mental illness than in many other countries.
Key words: mental disorders, psychiatric states rating scales, ROC curve, Australia,
psychometrics.

Australian and New Zealand Journal of Psychiatry 2001; 35:231–235

Susan Donath, Senior Research Fellow
Turning Point Alcohol and Drug Centre, 54–62 Gertrude Street, Fitzroy, Victoria 3065, Australia. Email: susand@turningpoint.org.au
Received 2 June 2000; revised 31 August 2000; accepted 6 September 2000.

Downloaded from anp.sagepub.com at PENNSYLVANIA STATE UNIV on May 8, 2016

The General Health Questionnaire (GHQ) was developed to detect minor psychiatric illness in the community; it is designed to ‘differentiate psychiatric patients as a class from non-cases as a class’ [1, p.5]. Since its introduction in the 1970s, the GHQ has become one of the principal self-report questionnaires used to measure non-psychotic mental illness in the community and in general practice. Originally a 60-item questionnaire, the GHQ is now also available in 30-, 28-, 20- and 12-item versions [1]. The shorter versions are often preferred as they are quicker to administer.

Respondents to the GHQ rate themselves according to the degree to which they have experienced each symptom over the past few weeks. Each item has four response categories. The standard method of scoring the GHQ is a binary method; symptomatic responses to each item are scored ‘1’ and summed over the items [1]. This can be characterized as the (0-0-1-1) method, as adjacent response categories are collapsed. For the GHQ-12, this method results in a score ranging from 0 to 12.

In addition to the standard scoring method, the GHQ can also be scored using a four-point Likert scoring method where scores of 0–3 are assigned to each item response and then summed across items, giving a score ranging from 0 to 36 for the GHQ-12 (0-1-2-3 method) [1]. Another scoring method, devised by Goodchild and
Duncan-Jones [2], attempts to overcome the assumed low sensitivity to chronic disorders of the standard scoring method. The positive items are scored in the conventional binary method, but the negative items are scored 0-1-1-1, thus assuming that the ‘no more than usual’ answer to negative questions indicates the presence of a chronic problem rather than good health. This is generally referred to as the C-GHQ scoring method. In all three scoring methods, higher scores indicate an increased likelihood of psychological distress.

For each version of the GHQ, an empirically determined threshold score indicates the likelihood of psychiatric illness. There is a trade-off between sensitivity and specificity, with higher thresholds giving higher specificity, but lower sensitivity. The optimal threshold is that which gives the best combination of sensitivity and specificity.

Goldberg et al.’s study of the GHQ-12 in 15 cities around the world found that, for a given threshold value, there were considerable variations in sensitivity and specificity. In some cities, the best GHQ-12 threshold was 1/2, in others it was 2/3, in still others, 3/4, and, in one centre, 6/7 [3]. Previous studies have found similar variation [3]. The reasons for these differences are not clear.

As there is such variation in the optimal threshold, it is important to determine the most appropriate threshold for use in Australia. An Index Medicus/Medline and PsycLIT search revealed only one Australian study which has investigated the threshold value of the GHQ-12 [4]. This was a small study (n = 120) conducted in two general practices in Sydney. There is evidence that the sensitivity (but not the specificity) of the GHQ is on average considerably higher in primary care settings than in community settings [1]. Although several studies have used the GHQ-12 in community settings in Australia [5–11], there appears to be no Australian evidence on the best threshold to use in this situation.

This study uses unpublished confidentialized unit record file (CURF) data from the Australian Bureau of Statistics (ABS) 1997 National Mental Health Survey [12,13] to investigate the sensitivity and specificity, and hence the optimal threshold values, of the GHQ-12 for the three different scoring methods.

Method

Data source

The 1997 National Mental Health Survey was conducted on a representative sample of residents of private dwellings in all States and Territories. The relevant ABS publications provide a detailed description of the sampling method [12,13]. The sample excluded special dwellings such as hospitals, institutions, nursing homes and hostels, and dwellings in remote and sparsely settled parts of Australia. The response rate was 78%, yielding a sample size of 10 641 persons (4705 men, 5936 women) aged 18 and over [14]. The survey was conducted using face-to-face interviews.

Instruments

The survey included the GHQ-12 as a standalone questionnaire, and the Composite International Diagnostic Interview (CIDI), a comprehensive interview which can be used to assess current and lifetime prevalence of mental disorders in adults [13]. The CIDI enables the diagnosis of mental disorders based on either the International Classification of Diseases, 10th revision (ICD-10) [15], or the Diagnostic and Statistical Manual of Mental Disorders, 4th revision (DSM-IV) [16]. To facilitate comparison with Goldberg et al. [3], this study presents results using the diagnoses according to ICD-10. The conditions included were affective disorders (mania; hypomania; mild, moderate and severe depression; bipolar affective disorder; dysthymia), anxiety disorders (panic disorder, agoraphobia, social phobia, generalized anxiety disorder, obsessive–compulsive disorders, post-traumatic stress disorder) and neurasthenia, but not alcohol/drug dependence or harmful use. Respondents were classified as having a mental disorder if they were diagnosed as having one or more conditions during the previous 4 weeks.

Analysis

Data were analysed using SPSS (SPSS, Chicago, IL, USA) and Microsoft Excel.

Using the diagnosis from the CIDI as a gold standard, the sensitivity and specificity of a range of thresholds for each of the scoring methods were calculated. Receiver operating characteristic (ROC) curves were derived for each scoring method. Receiver operating characteristic analysis is a technique which enables comparison of the performance of two or more screening tests or scoring methods. A ROC curve is obtained by plotting sensitivity against the false positive rate for all possible cut-off points of the screening instrument. The area under the curve provides a summary measure of the ability of the instrument or scoring method to discriminate between cases and non-cases. A ROC area equal to 0.5 is obtained when the discriminatory ability of the screening instrument is no better than chance; a value of 1.0 stands for perfect discriminatory ability [1].

Results were obtained both for the entire sample, and also for a ‘clinical’ subsample consisting of persons who had consulted a doctor or other health practitioner for any reason in the previous 4 weeks.

Using the weighting factors and method described by the ABS [13], results were adjusted to ensure that they represented as far as possible the adult Australian population. Confidence intervals for proportions and percentages were estimated using the relative standard errors provided by ABS [13, pp.74–77]. Since the ABS does not provide estimates of standard errors for means, confidence intervals for means were estimated using standard formulae [17] with the weighting factors scaled so that they summed to the sample size.

Results

Based on the CIDI, 7.3% (95% CI = 6.9–7.6%) of the population were diagnosed with a mental illness; 8.9% (95% CI = 8.2–9.5%) of
women were diagnosed with a mental illness compared with 7.3% (95% CI = 6.9–7.6%) of men. Of those people who had consulted a health practitioner in the previous 4 weeks, 11.0% (95% CI = 10.3–11.7%) were diagnosed with a mental illness. In this ‘clinical’ population, 12.5% (95% CI = 11.4–13.5%) of women and 9.2% (95% CI = 8.1–10.3%) of men were diagnosed with a mental illness.

On average, women had higher GHQ-12 scores than men, and the average scores of those in the ‘clinical’ subsample were higher than the average scores for the total sample (Table 1). Using the standard scoring method, 66.6% of the total population (69.2% of men, 64.0% of women) and 58.7% of the ‘clinical’ population (61.7% of men, 56.3% of women) scored zero.

The results in Table 2 and Fig. 1 indicate the trade-off between sensitivity and specificity using different threshold values of the GHQ-12. For a given specificity, the C-GHQ scoring method generally produces the highest sensitivity, followed by the Likert and then the standard scoring method.

The analyses were repeated for males and females separately and for the ‘clinical’ subsample. For a given threshold score, sensitivity and specificity were higher for males than for females with all scoring methods, the differences averaging around 4%. In the ‘clinical’ subsample, for a given threshold score, sensitivity was higher and specificity was lower than in the total sample by 3–4%, for all scoring methods. (Details of sensitivity and specificity for selected threshold scores, standard, Likert and C-GHQ scoring methods, Australia 1997, for all populations are available from the corresponding author on request.)

As indicated in Table 3, for both the total sample and the ‘clinical’ subsample, the areas under the ROC curves were slightly higher for males than for females, but the differences were generally not statistically significant. Comparing the total sample and the ‘clinical’ subsample, there was no difference in the areas under the ROC curves. In all groups, the area under the ROC curve was greater for the C-GHQ scoring method than for the standard scoring method.

Table 1. Mean scores for standard, Likert and C-GHQ methods of scoring the GHQ-12, Australia, 1978 and 1997

Population                              Scoring method   Mean score (95% CI)
                                                         Men                Women               Total
Australia, 1978*                        Standard         0.95 (0.92–0.99)   1.19 (1.15–1.23)    1.07 (1.05–1.11)
Australia, 1997                         Standard         0.83 (0.78–0.88)   1.02 (0.97–1.07)    0.93 (0.89–0.97)
                                        Likert           8.74 (8.63–8.85)   9.21 (9.10–9.32)    8.98 (8.90–9.06)
                                        C-GHQ            2.72 (2.65–2.79)   3.01 (2.95–3.07)    2.87 (2.82–2.91)
Australia, 1997 ‘clinical’ population   Standard         1.18 (1.08–1.28)   1.36 (1.28–1.45)    1.28 (1.22–1.34)
                                        Likert           9.49 (9.29–9.69)   9.88 (9.71–10.04)   9.71 (9.58–9.83)
                                        C-GHQ            3.16 (3.04–3.28)   3.43 (3.33–3.53)    3.31 (3.24–3.39)

*Source: Australian National Health Survey, unpublished unit record data, 1978. GHQ-12, 12-item General Health Questionnaire.

Table 2. Sensitivity and specificity for selected threshold scores for standard, Likert and C-GHQ scoring methods, Australia, 1997

Scoring method   Threshold   Sensitivity % (95% CI)   Specificity % (95% CI)
Standard         0/1         75.4 (72.5–78.4)         69.9 (69.5–70.3)
                 1/2         58.8 (55.7–61.9)         83.8 (83.0–84.5)
                 2/3         48.0 (44.9–51.0)         90.7 (89.9–91.4)
                 3/4         38.6 (35.5–41.7)         94.1 (93.2–94.9)
Likert           8/9         85.1 (82.6–87.7)         57.4 (56.9–57.9)
                 9/10        79.6 (76.8–82.4)         68.7 (68.3–69.1)
                 10/11       72.4 (69.3–75.4)         77.4 (77.1–77.6)
                 11/12       62.7 (59.6–65.8)         84.5 (83.8–85.2)
C-GHQ            2/3         90.2 (88.1–92.3)         54.2 (53.7–54.7)
                 3/4         82.9 (80.2–85.5)         69.0 (68.6–69.4)
                 4/5         71.5 (68.5–74.5)         79.7 (79.6–79.8)
                 5/6         57.7 (54.6–60.8)         88.7 (87.9–89.5)

Discussion

Based on the ROC analysis, the C-GHQ scoring method provides better discrimination between those
with and without a mental illness than either the Likert or the standard scoring methods. This is in contrast to other studies which have found little or no difference between scoring methods [3,18]. With the C-GHQ scoring method, the best trade-off between specificity and sensitivity is given by a threshold of 3/4, both in the total sample and in the ‘clinical’ subsample.

Tennant’s validity study of the GHQ [4] used a disembedded version of the GHQ-12 (that is, the GHQ-60 was the actual questionnaire used in the study, and the GHQ-12 questions were extracted from the longer questionnaire). There is evidence that disembedded versions of the GHQ give different optimal thresholds from those obtained using the corresponding standalone version of the GHQ [19]. In general practice patients in Sydney, Tennant found sensitivity of 0.87 and specificity of 0.91 for a threshold of 1/2 using the standard scoring method. Confidence intervals were not reported, but based on other reported information these can be estimated as being between 0.75 and 0.99 for sensitivity and 0.85 and 0.97 for specificity.

In the present study, with the standard scoring method, the best trade-off between sensitivity and specificity is given by a threshold of 0/1. Australian studies have used thresholds of 1/2 [5–9,20], 2/3 [21] or 3/4 [10,11,22], but the results from this study suggest that the sensitivity of thresholds higher than 0/1 is unacceptably low. Even in the group with the highest sensitivity for a given threshold score (males in the ‘clinical’ subsample), sensitivity using a threshold of 1/2 was only 66.4%.

Using the standard scoring method, most studies have found the optimal threshold to be 1/2 or 2/3 [3,23], although 0/1 has been found in at least one other study [18]. It has been suggested that the mean GHQ score for the whole population of respondents provides a rough guide to the best threshold, so that populations with low average GHQ scores will generally have lower threshold scores [23].

Goldberg et al. found mean scores ranging from 1.09
to 3.66, with a majority of the 15 centres in the study
reporting mean scores above 2 [3]. The Australian mean
scores of 0.93 for the total sample and 1.28 in the ‘clini-
cal’ subsample therefore appear to be low compared with
mean scores found elsewhere. However, these low scores
seem to be characteristic of the Australian population, as
the mean GHQ-12 scores in the 1978 National Health
Survey were very similar (Table 1).
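
The ‘best trade-off between sensitivity and specificity’ referred to above can be made concrete with a simple selection rule. The sketch below is illustrative only: it assumes Youden's J (sensitivity + specificity − 1) as the criterion, which the paper does not state it used, and it compares only the total-sample thresholds listed in Table 2. The names `TABLE_2`, `youden_j` and `best_threshold` are hypothetical.

```python
# Total-sample sensitivity and specificity (%), copied from Table 2.
TABLE_2 = {
    "Standard": {"0/1": (75.4, 69.9), "1/2": (58.8, 83.8),
                 "2/3": (48.0, 90.7), "3/4": (38.6, 94.1)},
    "Likert": {"8/9": (85.1, 57.4), "9/10": (79.6, 68.7),
               "10/11": (72.4, 77.4), "11/12": (62.7, 84.5)},
    "C-GHQ": {"2/3": (90.2, 54.2), "3/4": (82.9, 69.0),
              "4/5": (71.5, 79.7), "5/6": (57.7, 88.7)},
}


def youden_j(sensitivity_pct, specificity_pct):
    """Youden's J on a 0-1 scale (assumed criterion, not stated in the paper)."""
    return (sensitivity_pct + specificity_pct) / 100.0 - 1.0


def best_threshold(rows):
    """Return the threshold label with the largest Youden's J."""
    return max(rows, key=lambda label: youden_j(*rows[label]))


if __name__ == "__main__":
    for method, rows in TABLE_2.items():
        label = best_threshold(rows)
        print(method, label, round(youden_j(*rows[label]), 3))
```

Under this assumed criterion the thresholds selected for the standard and C-GHQ methods are 0/1 and 3/4, the same as those reported in the paper. Other criteria, for example ones that weight missed cases more heavily than false positives, could select differently.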
In general, it appears that the higher the best threshold on the GHQ, the greater the area under the ROC curve, and hence, the greater the discriminatory power of the GHQ [23]. In this study, both the areas under the ROC curves and the sensitivity and specificity of the optimal threshold were lower than in most of the 15 centres studied by Goldberg et al. [3]. Thus, the evidence from this study suggests that in Australia the GHQ-12 is a less useful instrument for detecting mental illness than in many other countries.

Figure 1. Receiver operating characteristic (ROC) curves for the three GHQ-12 scoring methods: standard GHQ scoring, Likert GHQ scoring and C-GHQ scoring.
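
As a rough illustration of how an ROC curve and its area relate to the tabulated figures, the sketch below converts the total-sample sensitivity and specificity pairs from Table 2 into (false positive rate, sensitivity) points and applies the trapezoidal rule. This is an assumption-laden approximation, not the paper's calculation: only four thresholds per method are available here, the corner points (0, 0) and (1, 1) are added by hand, and the function names are hypothetical.

```python
# (sensitivity %, specificity %) pairs for the total sample, from Table 2.
TABLE_2_POINTS = {
    "Standard": [(75.4, 69.9), (58.8, 83.8), (48.0, 90.7), (38.6, 94.1)],
    "Likert": [(85.1, 57.4), (79.6, 68.7), (72.4, 77.4), (62.7, 84.5)],
    "C-GHQ": [(90.2, 54.2), (82.9, 69.0), (71.5, 79.7), (57.7, 88.7)],
}


def roc_points(pairs):
    """Return (false positive rate, sensitivity) points sorted by FPR,
    with the (0, 0) and (1, 1) corners of the ROC square added."""
    points = [(1.0 - spec / 100.0, sens / 100.0) for sens, spec in pairs]
    return sorted([(0.0, 0.0)] + points + [(1.0, 1.0)])


def trapezoid_auc(points):
    """Area under a piecewise-linear curve through the sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    for method, pairs in TABLE_2_POINTS.items():
        print(method, round(trapezoid_auc(roc_points(pairs)), 3))
```

With so few points the trapezoidal areas come out at approximately 0.77 (standard), 0.79 (Likert) and 0.82 (C-GHQ), somewhat below the full-curve areas of 0.78, 0.81 and 0.84 in Table 3, but in the same order.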

Table 3. Areas under ROC curve for different scoring methods

Population                              Scoring method   Area under ROC curve (95% CI)
                                                         Men                Women              Total
Australia, 1997                         Standard         0.80 (0.77–0.83)   0.76 (0.74–0.79)   0.78 (0.76–0.80)
                                        Likert           0.84 (0.81–0.87)   0.79 (0.77–0.81)   0.81 (0.79–0.83)
                                        C-GHQ            0.86 (0.84–0.88)   0.83 (0.81–0.85)   0.84 (0.83–0.86)
Australia, 1997 ‘clinical’ population   Standard         0.80 (0.77–0.84)   0.77 (0.74–0.79)   0.78 (0.76–0.80)
                                        Likert           0.85 (0.82–0.88)   0.80 (0.77–0.81)   0.81 (0.79–0.83)
                                        C-GHQ            0.86 (0.83–0.88)   0.82 (0.80–0.85)   0.84 (0.82–0.85)

ROC, receiver operating characteristic.
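
The three scoring methods compared in Table 3 differ only in how each of the twelve item responses is mapped to a number. The following is a minimal sketch of the rules as described in the introduction (standard 0-0-1-1, Likert 0-1-2-3, and C-GHQ with negative items scored 0-1-1-1). It assumes responses are coded 0–3 from least to most symptomatic. The set `NEGATIVE_ITEMS` is an assumption about which items are negatively worded, since the paper does not list them, and all function names are hypothetical.

```python
N_ITEMS = 12

# ASSUMPTION: 0-based positions of the negatively worded items.
# The paper does not list them; replace with the positions that
# apply to the questionnaire version actually administered.
NEGATIVE_ITEMS = frozenset({1, 4, 5, 8, 9, 10})


def _check(responses):
    """Validate that there are 12 responses, each coded 0, 1, 2 or 3."""
    if len(responses) != N_ITEMS or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("expected 12 responses, each coded 0-3")


def score_standard(responses):
    """Standard (0-0-1-1) scoring: total ranges from 0 to 12."""
    _check(responses)
    return sum(1 for r in responses if r >= 2)


def score_likert(responses):
    """Likert (0-1-2-3) scoring: total ranges from 0 to 36."""
    _check(responses)
    return sum(responses)


def score_cghq(responses, negative_items=NEGATIVE_ITEMS):
    """C-GHQ scoring: positive items 0-0-1-1, negative items 0-1-1-1."""
    _check(responses)
    total = 0
    for index, response in enumerate(responses):
        cutoff = 1 if index in negative_items else 2
        total += 1 if response >= cutoff else 0
    return total


if __name__ == "__main__":
    same_as_usual = [1] * N_ITEMS  # second response category on every item
    print(score_standard(same_as_usual),
          score_likert(same_as_usual),
          score_cghq(same_as_usual))
```

A respondent who picks the second response category on every item scores 0 under the standard method but 6 under C-GHQ (given six negative items), which is the kind of long-standing symptom pattern the C-GHQ method is intended to pick up.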


References

1. Goldberg D, Williams P. A user’s guide to the General Health Questionnaire. Windsor: NFER-Nelson, 1991.
2. Goodchild ME, Duncan-Jones P. Chronicity and the General Health Questionnaire. British Journal of Psychiatry 1985; 146:55–61.
3. Goldberg DP, Gater R, Sartorius N et al. The validity of two versions of the GHQ in the WHO study of mental illness in general health care. Psychological Medicine 1997; 27:191–197.
4. Tennant C. The general health questionnaire: a valid index of psychological impairment in Australian populations. Medical Journal of Australia 1977; 2:392–394.
5. McDonald R, Vechi C, Bowman J, Sanson-Fisher R. Mental health status of a Latin American community in New South Wales. Australian and New Zealand Journal of Psychiatry 1996; 30:457–462.
6. Brown WJ, Alexander J, McDonald B, Mills-Evers T. The health of Filipinas in the Hunter region. Australian and New Zealand Journal of Public Health 1997; 21:214–216.
7. McFarlane AC. Life events and psychiatric disorder: the role of a natural disaster. British Journal of Psychiatry 1987; 151:362–367.
8. Rickwood D, d’Espaignet ET. Psychological distress among older adolescents and young adults in Australia. Australian and New Zealand Journal of Public Health 1996; 20:83–86.
9. Morrell S, Taylor R, Quine S, Kerr C, Western J. A cohort study of unemployment as a cause of psychological disturbance in Australian youth. Social Science and Medicine 1994; 38:1553–1564.
10. Carr VJ, Lewin TJ, Kenardy JA et al. Psychosocial sequelae of the 1989 Newcastle earthquake: III. Role of vulnerability factors in post-disaster morbidity. Psychological Medicine 1997; 27:179–190.
11. Carr VJ, Lewin TJ, Webster RA, Kenardy JA, Hazell PL, Carter GL. Psychosocial sequelae of the 1989 Newcastle earthquake: II. Exposure and morbidity profiles during the first 2 years post-disaster. Psychological Medicine 1997; 27:167–178.
12. Australian Bureau of Statistics. Mental health and wellbeing profile of adults, Australia 1997. ABS Cat. No. 4326.0. Canberra: Australian Government Publishing Service, 1998.
13. Australian Bureau of Statistics. National survey of mental health and wellbeing of adults 1997, users’ guide. ABS Cat. No. 4327.0. Canberra: Australian Government Publishing Service, 1999.
14. Australian Bureau of Statistics. Information paper, mental health and wellbeing of adults 1997, confidentialised unit record file. ABS Cat. No. 4329.0. Canberra: Australian Government Publishing Service, 1998.
15. World Health Organization. The ICD-10 classification of mental and behavioural disorders clinical descriptions and diagnostic guidelines. Geneva: World Health Organization, 1992.
16. American Psychiatric Association. Diagnostic and statistical manual of mental disorders. Washington: American Psychiatric Association, 1984.
17. Armitage P, Berry G. Statistical methods in medical research. Oxford: Blackwell, 1994.
18. Gureje O, Obikoya B. The GHQ-12 as a screening tool in a primary care setting. Social Psychiatry and Psychiatric Epidemiology 1990; 25:276–280.
19. van Hemert AM, den Heijer M, Vorstenbosch M, Bolk JH. Detecting psychiatric disorders in medical practice using the General Health Questionnaire. Why do cut-off scores vary? Psychological Medicine 1995; 25:165–170.
20. Singh B, Lewin T, Raphael B, Johnston P, Walton J. Minor psychiatric morbidity in a casualty population: identification, attempted intervention and six-month follow-up. Australian and New Zealand Journal of Psychiatry 1987; 21:231–240.
21. Harris MF, Silove D, Kehag E et al. Anxiety and depression in general practice patients: prevalence and management. Medical Journal of Australia 1996; 164:526–529.
22. Schattner PL, Coman GJ. The stress of metropolitan general practice. Medical Journal of Australia 1998; 169:133–137.
23. Goldberg DP, Oldehinkel T, Ormel J. Why GHQ threshold varies from one place to another. Psychological Medicine 1998; 28:915–921.
