You are on page 1of 6

Global Assessment of Functioning

A Modified Scale
_______________________
Sally Caldecott-Hazard, PH.D.
Richard C. W. Hall, M.D.

The modified Global Assessment of Functioning (GAF) scale has more detailed criteria
and a more structured scoring system than the original GAF. The two scales were
compared for reliability and validity. Raters who had different training levels assigned
hospital admission and discharge GAF scores from patient charts. Intraclass correlation
coefficients for admission GAF scores were higher for raters who used the modified GAF
(0.81) compared with raters who used the original GAF (0.62). Validity studies showed
a high correlation (0.80) between the two sets of scores. The modified GAF also
correlated well with Zung Depression scores *-0.73). The modified GAF may be
particularly useful when interrater reliability needs to be maximum and/or when persons
with varying skills and employment backgrounds –– and without much GAF training ––
must rate patients. Because of the increased structures, the modified GAF may also be
more resistant to rater bias. (Psychosomatics 1995; 36: _____-______)

Global severity of illness scales are by managed care companies and


important instruments for assessing governmental agencies to determine who
change in psychiatric patients. can and cannot be admitted to hospitals.
Increasingly, such scales are being used
45
The scales are simple to administer values that range from 1, representing
and are more sensitive to differential the sickest patient, to 100, a person with
treatment effects than measures of single no symptoms. The scale is divided into
dimensions of psychopathology. 10 equal intervals, with 10 scores in each
Probably the most often used global interval, and the criteria that define each
assessment instrument is the interviewer- score in each interval are listed. The
rated Global Assessment of Functioning GAF scale has similar criteria and the
(GAF) scale, which is listed in the DSM- same interval design, except that the
III-R as an Axis V diagnostic criterion value range is from 1 to 90 (absent or
test. This scale is very similar to Global minimal symptoms), and there are 9
Assessment Scale (GAS) developed by rather than 10 equal intervals.
Endicott et al, 4 The Endicott scale has One potential problem is with
both these interviewer-rated scales is
4
Received March 3, 1993; revised April 26, that for the scores to be comparable and
1993; accepted May 21, 2993. From the Center thus meaningful across different studies,
for Psychiatry, Florida Hospital, Orlando, interrater reliability in scoring must be
Florida, and Department of Psychiatry,
quite consistent within a study and from
University of Florida, Gainesville. Address
reprint requests to Dr. Caldecott-hazard, Center one study to another. Interrater
for Psychiatry, Florida Hospital, 601 E. Rollings reliability is strongly influenced by two
St., Orlando, FL 32803. factors: 1) the consistency of the raters,
5
Copyright © 1995 The Academy of and 2) the heterogeneity of patient
Psychosomatic Medicine.
illness severity. Endicott et al. tested the

Volume 36 –– Number 2 –– March-April, 1995 1


reliability of the GAS in 5 studies and
reported intraclass correlation Methods
coefficients ranging from 0.61 to 0.91,
with associated standard error of A modified GAF scale was developed by
measurement scores ranging from 5.0 to increasing the structure of the original
8.0 units. Most of Endicott’’s ratings GAF instrument with a greater number
were done by only a few, well-trained of criteria and with additional directions
interviewers. Having consistently for assigning scores. We chose to
trained interviewers should produce a modify the GAF rather than the GAS
greater likelihood of higher interrated because the GAF, as listed in the DSM-
reliability scores and small standard III-R. reflects more current ideas on
errors. Yet even with this bias, two of illness severity rating and is the more
the studies had intraclass correlation frequently used instrument. The criteria
coefficients in the 0.60s, suggesting that and scoring changes that we made in the
the scale might be less reliable than had GAF were tested among a small group
been hoped. In contrast, one of the of staff members who rated patients
reliability studies used 15 raters of from successive drafts of the modified
different backgrounds and training scale. When staff members had different
levels. Although the intraclass rating of a given patient, their reasons
correlation coefficient was high, this was were discussed and changes were made
due primarily to a greater heterogeneity in the wording or use of the criteria or
of illness severity as compared to the scoring directions.
other studies, not to interrater Reliability studies for both the
consistency of scoring. The lack of original and modified GAF scales were
interrater consistency was demonstrated based on ratings of 16 patient intake
by a high standard error of measurement histories and discharge summaries taken
not seen in the other studies. from the patients’’ hospital charts. All of
Although we could not find these patients had diagnoses of major
published reliability studies on the GAF depression with or without comorbid
in the literature, out subjective eating disorders. They had all been
experience at Florida Hospital was that inpatients on the Affective/Eating
the GAF was used by staff members of Disorders unit, and their intake histories
different backgrounds (physicians with were obtained by one of the same two
varying degrees of familiarity with the doctors. These particular 16 patients
scale, nurses, Ph.D. researchers), and were chosen for review because they had
GAF rating from these staff differed the most detailed intake histories and
substantially for the same patient. Thus, discharge summaries available. Thus, a
we hypothesized that the original GAF maximum amount of patient information
might be less reliable than we had was available for evaluation with the
expected. To test this hypothesis and to GAF.
improve interrater reliability, we Two groups of staff from the
developed a modified GAF scale, and psychiatric units at Florida Hospital
we formally tested interrater reliability rated each of the same 16 patient
in the original and modified versions of histories and discharge summaries. All
the GAF. We conducted our study in patients were given a GAF score for the
1992-1993. severity of illness at admission and a

Volume 36 –– Number 2 –– March-April, 1995 2


second GAF score for illness severity at were used for these assessments of
discharge. One group of staff rated the validity. For the modified and original
patients using the original GAF, and the GAF comparison, admission scores were
other group of staff rated the patients obtained from the same 16 patient
using the modified GAF. None of the histories and discharge summaries as in
staff received any training in the use of the reliability tests. For the modified
either GAF, but they were allowed to GAF and Zung comparison and the
read it and to ask questions for modified GAF and self-rating of illness
clarification. This procedure was comparison, data were obtained from
followed to evaluate the consistency of outpatient telephone interviews with 142
ratings by untrained staff; therefore, we patients who and been discharged from
could evaluate the soundness and Florida Hospital 6 months to 1.5 years
reliability of each GAF under these before. These patients all had diagnoses
conditions. of major depression with or without
The staff in the group using the comoid diagnoses of eating disorder.
original GAF consisted of 12 Each patient had been evaluated using
professionals (nurses, physicians, social the modified GAF only, the Zung
workers, psychiatry technicians, and depression test, and a self-illness
clinical Ph.D.’’s) assigned to 2 inpatient severity rating. The self-rated global
treatment units (affective/eating illness scores were on a scale of 1-10,
disorders and psychiatric/medical). The where 1 was sickest and 10 was most
staff in the group using the modified health.
GAF consisted of another group of 12
professionals from other inpatient units RESULTS
(acute general psychiatry, adolescent, or
intensive treatment). Within each of the The Modified GAF
rating groups (original or modified
GAF), the means and standard errors The modified GAF retained the
were calculated for the ratings of each same 1-90 scale with the same 10-point
patient on admission and discharge. intervals as the original GAF. All
Intraclass correlation coefficients (ICC) criteria in the original GAF were
were then calculated separately for the retained and were listed on separate lines
original GAF group on admission and to facilitate quick reading (Table 1).
discharge and for the modified GAF Additional criteria were added to most of
group on admission and discharge. Both the 10-point intervals and directions for
the admission and discharge correlation scoring the patient’’s illness severity
coefficients were compared between the were added to most of the 10-point
groups. intervals, and directions for scoring the
The concurrent validity of the patient’’s illness severity were added at
modified GAF was tested by comparing the end of each 10-point interval. The
admission scored of this instrument with purpose of these additions was to
admission scores of the instrument with decrease the variability in scoring.
admission scores on the original GAF, Usually, the scoring within a 10-point
the Zung depression test, and a interval applied only to the criteria
selfp0rating of global illness severity. within that interval. For example, in the
Pearson Product Moment correlations 81-90 interval, a patient having no

Volume 36 –– Number 2 –– March-April, 1995 3


symptoms or problems received a score meets. It is felt that the slower speed in
of 88-90; a patient having minimal assigning a score from the modified
symptoms or problems received a score GAF is compensated for by the
of 84-87; and a patient having minimal increased consistency of ratings
symptoms and problems rece3ived a attributable.
score of 81-83 (table 1). However, in Interesting, all of the means for
the 21-30, 31-40, and 41-50 scoring the patient’’s admission GAF scores were
intervals, the same 10 criteria were listed also higher in the original GAF group
in each interval, and the score depended than in the modified group. Thus, the
on the number of criteria that a patient modified GAF caused patients to be
met within these 3 scoring intervals rated more sick than the original GAF.
For example, if a patient met 1 of
these criteria, the score was 48-50; if a Concurrent Validity
patient met 2 of the criteria, the score
was 44-47; and if the patient met 3 of the Because all of the mean
criteria, the score was 41-43. However, admission GAF scores for the original
if the patient met 4-6 of the criteria, the group were higher than the scores in the
scores ranged from 31-40. If the patient modified group, we wanted to test the
met 7-10 of the criteria, the scores correlation between the scores of the two
ranged from 21-30 (Table 1). Finally, in GAF’’s and test the correlation of the
the 21-30 scoring interval, a unique set o modified GAF with other psychological
f criteria and scores also existed in assessment tests. The Pearson Product
addition to the criteria and scoring Moment correlation coefficient between
already discussed. These unique criteria the 16 original and 16 modified mean
were listed in the original GAF and were admission scores was 0.80.p<0.0001,df-
deemed to be of sufficient seriousness 14, showing good correlation (Table 2).
that they should not be added to the list Because all of the patients used
of criteria in the 31-40 and 41-50 in these studies were depressed, we also
intervals but rather would warrant the compared modified GAF scores with the
lowest score available in the 21-30 scores from the Zung depression test.
category. Thus, suicidal preoccupation The Pearson Product Moment
and preparation, behavior considerable correlation coefficient was -0.73,
influenced by delusions or l’’,0.001 (negative because a higher
hallucinations, or serious impairment in number represents sickness in the Zung
communications (i.e., sometimes scores and a lower number represents
incoherent or profound stuporous sickness in the GAF) (Table 2).
depression), always elicited a score of Finally, we also correlated
21. modified GAF scores with the scores
The various changes we made in that patients gave themselves to indicate
modifying the GAF made it longer than their severity of illness. The Pearson
the original GAF (4 pages vs. 1). Thus, Product Moment correlation coefficient
it is suggested that when using this new was 0.58, P,0.01 (Table 2).
GAF, the interviewer should question
the patient about each of the criteria, DISCUSSION
then write down answers, and later count
the number of criteria that the patient

Volume 36 –– Number 2 –– March-April, 1995 4


Our findings of an intraclass correlation and has better interrater reliability on
coefficient of 0.62 for admission scores admission scores. Thus, the modified
on the original GAF agreed with GAF is less likely to reflect a bias by a
Endicott et al.’’s report of ICC’’s ranging managed care or governmental agency.
from 0.61 to 0.91. Our ICC of 0.62 was In addition to reliability tests,
significant at P,0.001, thus indicating modified GAF ratings were also
that while the reliability was somewhat correlated with Zung depression tests
low for admissions ratings, it still was and self-rating of illness severity in
perfectly usable. Likewise, the ICC for outpatients. Similar to reliability tests,
discharge ratings from the original GAF these correlations were in the same range
was 0.90, which indicates excellent as the correlations that Endicott et al.
reliability. The value of the modified found between the original GAS and the
GAF (with its admission ICC of 0.81 Mental Status Examination Record
and discharge ICC of 0l95) is for (FEF) in outpatients. The slightly higher
instances when interrater reliability correlation between the modified GAF
needs to be as high as it can be or when and the Zung depression test (-0.71),
multiple persons of varying employment compared with the original GAS and
backgrounds and without much GAF MSER (0.62), probably was because all
training will rate patients. Research is a of out patients were depressed and the
prime example for both uses of the Zung specifically assessed depression.
modified GAF. Usually during research In contrast, the MSER is a global rating
studies, where would also be enough scale like the GAS, and there was
time to read this longer GAF and assign probably greater heterogeneity among
ratings. these patients. However, both of these
Another use for the modified sets of correlations were acceptable,
GAF, compared with the original GAF thereby indicating that the interviewer
or GAS, is in evaluating the need for rated scales provide similar types of
hospital admission. Specifically, information and the original GAS and
Thompson et al., in a review of 9,055 modified GAF each show acceptable
adult intakes, found marked variations in validity. Interestingly, both the self-
the way managed care case managers, rated illness severity test that we
compared with providers, assigned GAS correlated with the modified GAF and
scores generated from the same data. the FEF, correlated by Endicott with the
Thompson and colleagues felt that original GAS, gave scored based on
higher (less sick) scores reflected a need someone other than the interviewer’’s
by managed care companies to limit the judgment, specifically the patient or the
use of all inpatient services rather than patient’’s family. Both of these sets of
their desire to selectively eliminated correlations were fairly low, 0.58 for the
unnecessary hospitalizations. The self-rated scale and modified GAF and -
ability of the managed care industry to 0.52 or -0.45 for the FEF and the
affect the GAS scores in this way is original GAS. While the Zung is also a
attributed to the relatively less-structured self-rated instrument, it’’s questions are
nature of the GAS instrument, leading to more objective than the self-rated global
lower interrater reliability. As we have illness scale or FEF, which may have
shown, the modified GAF is both more accounted for the Zung’’s higher
structured than the original GAF or GAS correlation with the GAF. Still,

Volume 36 –– Number 2 –– March-April, 1995 5


McGlashan and Pfeiffer reported that reliability needs to be maximum (i.e., in
patient self-assessment and physician or research or as a tool to determine need
interviewer assessments of patients may for hospitalization) and/or when multiple
differ significantly. One might also persons of varying skills and
expect the same discrepancy between employment backgrounds and without
family and interviewer assessment of having had much GAF training (i.e. in
patients. Thus, the interviewer vs. self managed care organizations) must rate
or family-rating procedures for patients. In addition, when used to
measuring severity of illness often evaluate the need for hospital admission,
cannot be considered as providing the modified GAF is less likely than the
similar or redundant information. original GAF or GAS to reflect a
The modified GAF is an provider or managed care bias. Thus,
instrument having a higher reliability our modified GAF may be a better and
and similar validity to the original GAF improved patient assessment tool, one
or GAS. The modified GAF may be that can more accurately reflect a
particularly useful when interrater patient’’s true need for hospitalization.

Volume 36 –– Number 2 –– March-April, 1995 6

You might also like