Official journal of the Pacific Rim College of Psychiatrists
Asia-Pacific Psychiatry ISSN 1758-5864

ORIGINAL ARTICLE

Excellent reliability of the Hamilton Depression Rating Scale (HDRS-21) in Indonesia after training
Erita Istriana1 MD, Ade Kurnia1 MD, Annelies Weijers2 MaSci, Teddy Hidayat1 MD, Lucas Pinxten3 MD, Cor de Jong3 MD PhD & Arnt Schellekens3,4 MD PhD

1 Department of Psychiatry, Rumah Sakit Hasan Sadikin, Padjadjaran University Bandung, Bandung, Indonesia
2 Mental Health Service GGz Oost Brabant, Veghel, the Netherlands
3 Nijmegen Institute for Scientist Practitioners in Addiction, Radboud University Nijmegen, Nijmegen, the Netherlands
4 Department of Psychiatry, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands
Keywords
assessment/diagnosis, depression, ethnicity, mood disorders, measurement/psychometrics

Correspondence
Arnt F.A. Schellekens MD PhD, Department of Psychiatry, Radboud University Nijmegen Medical Centre, Reinier Postlaan 10, route 966, 6500HB, Nijmegen, the Netherlands.
Tel: +0031 243611111
Fax: +0031 243610304
Email: arnt.schellekens@gmail.com

Received 12 December 2012
Accepted 27 March 2013

DOI:10.1111/appy.12083

Abstract
Introduction: The Hamilton Depression Rating Scale (HDRS) is the most widely used depression rating scale worldwide. Reliability of HDRS has been reported mainly from Western countries. The current study tested the reliability of HDRS ratings among psychiatric residents in Indonesia, before and after HDRS training. The hypotheses were that: (i) prior to the training reliability of HDRS ratings is poor; and (ii) HDRS training can improve reliability of HDRS ratings to excellent levels. Furthermore, we explored cultural validity at item level.
Methods: Videotaped HDRS interviews were rated by 30 psychiatric residents before and after 1 day of HDRS training. Based on a gold standard rating, percentage correct ratings and deviation from the standard were calculated.
Results: Correct ratings increased from 83% to 99% at item level and from 70% to 100% for the total rating. The average deviation from the gold standard rating improved from 0.07 to 0.02 at item level and from 2.97 to 0.46 for the total rating.
Discussion: HDRS assessment by psychiatric trainees in Indonesia without prior training is unreliable. A short, evidence-based HDRS training improves reliability to near perfect levels. The outlined training program could serve as a template for HDRS trainings. HDRS items that may be less valid for assessment of depression severity in Indonesia are discussed.
Copyright © 2013 Wiley Publishing Asia Pty Ltd

Introduction
The Hamilton Depression Rating Scale (HDRS), introduced in 1960, is the most widely used clinician-based depression rating scale in research and clinical practice (Hamilton, 1960; Hamilton, 1967; Gibbons et al., 1993). The HDRS is increasingly used in non-Western societies (Hanck et al., 1981; Sen and Williams, 1987; Zheng et al., 1988; Hamdi et al., 1997; Leung et al., 1999; Chowdhury et al., 2005; Lee et al., 2005; El et al., 2006; Satthapisit et al., 2007; Ahikari et al., 2008; Pradhan et al., 2008; Ang et al., 2009; Schaal et al., 2009; Agbir et al., 2010; Erfan et al., 2010; Lueboonthavatchai and Thavichachart, 2010). It is generally accepted that training is a prerequisite for using the HDRS. Despite the long and widespread use of the HDRS, published work on the effect of training on reliability of HDRS ratings, particularly from non-Western countries, is virtually non-existent.

Only two studies assessed the effect of training on reliability of HDRS at item level (Muller and Dragicevic, 2003; Tabuse et al., 2007). One paper described the effect of HDRS training in 21 psychiatric novices in Germany (Muller and Dragicevic, 2003). This study showed increased inter-rater reliability after training in a small and heterogeneous sample of psychologists, psychiatric residents, students and pharmacologists. Moreover, the HDRS videos used varied widely in depression severity, reducing comparability between the ratings (Muller and Dragicevic, 2003). A study in Japan failed to show an effect of HDRS training on reliability, probably because 70% of the participants were experienced raters and almost 25% received formal training in HDRS before participating in the study (Tabuse et al., 2007).

The current study assessed the effect of HDRS training on reliability of HDRS ratings in inexperienced, untrained psychiatric residents in Bandung, Indonesia. Our hypotheses were that: (i) prior to the training reliability of HDRS ratings is poor; and (ii) training in HDRS can improve reliability of HDRS ratings to excellent levels. Furthermore, we explored cultural reliability at item level by calculating homogeneity and group bias of all HDRS item ratings.

Methods

Participants
Thirty psychiatric residents of the Department of Psychiatry of Hasan Sadikin Hospital at Padjadjaran University Bandung, Indonesia, took part in 1 day of HDRS training. The mean age of the participants was 34 years (SD: 4.3) and 43% were male. Twenty-seven percent were in the first year of training, 40% in the second, 20% in the third and 13% in the final year of training. None of the participants had received prior training on HDRS, nor did they have experience with standardized HDRS ratings.

Instrument
To assess depression severity we used a previously validated Indonesian version of the 21-item HDRS (HDRS-21; Kusumanto et al., 1980). To ensure that the Dutch trainers and Indonesian participants used similar versions of HDRS, this interview was back-translated into English.

Procedure
For the current evidence- and competency-based study, a whole day of training was developed by two trained and experienced Dutch HDRS raters (one psychologist [A. W.] and one psychiatrist [A. S.]) and two experienced Indonesian senior psychiatrists, trained in HDRS (T. H., H. Z.). The training consisted of lectures on the background of HDRS, sources of unreliability (Kobak et al., 2009) and scoring guidelines (based on the standardized HDRS [GRID-HAM-D] interview guide) (Tabuse et al., 2007; Williams et al., 2008) and HDRS video ratings. We used three videos of HDRS interviews with standardized patients (in Indonesian), played by a professional actor, interviewed by a trained local psychiatrist, with over 25 years of experience in psychiatry (H. Z.). Two videos were used to assess inter-rater reliability, pre- and post-training. One video was used for practicing. After rating the videos, individual item ratings were discussed to identify causes of disagreement and provide feedback regarding the gold standard rating. Finally, participants practiced the HDRS interview in role-plays in small groups (one interviewer, one patient, one observer).

In order to ensure comparability between the two assessment videos, scripts of patients with comparable severity were selected from the Diagnostic and Statistical Manual (DSM) casebook (case 1: “worthless wife”, HDRS = 21; case 2: “cry me a river”, HDRS = 20) (Spitzer et al., 2003). The practice video portrayed a patient with very severe depression (case: “the farmer”), in order to ensure that all items of the HDRS were covered during training. Gold standard ratings were based upon consensus by the training team.

Analysis
Using the HDRS guidelines, reliability was calculated as the percentage of correct item ratings (i.e. within ±1 of the gold standard rating) and the percentage of correct total ratings (i.e. within ±4 of the gold standard rating) (Hamilton, 1960; Hamilton, 1967; Muller and Dragicevic, 2003). The effect of training on the percentage of correct items was calculated using multivariate ANOVA for the total score and using the χ²-test for the item level ratings.

In order to investigate group bias, we analyzed the mean deviation from the gold standard rating at item level and for the total rating, by subtracting the ratings of the participants from the gold standard ratings. This results in negative numbers for an overestimation of severity and positive numbers for an underestimation of severity. In addition, we analyzed group homogeneity of the ratings by calculating the standard deviation (SD) of the difference from the gold standard, with a higher SD indicating less homogeneous group ratings. The effect of training on group bias and homogeneity was analyzed using multivariate ANOVA. Analyses were performed using SPSS ver. 16, with the significance level (α) set at 0.05 throughout.

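The three measures described in the Analysis section (percentage of correct ratings, group bias and group homogeneity) can be illustrated with a minimal Python sketch. The function name, the example ratings and the use of the sample SD (n − 1 denominator) are assumptions made for illustration only; the study itself ran its analyses in SPSS and does not publish raw ratings.

```python
from statistics import mean, stdev


def reliability_summary(ratings, gold, tolerance):
    """Summarize one set of ratings against a gold standard rating.

    ratings   -- one rating per participant (an item score or a total score)
    gold      -- gold standard rating for the same item or total
    tolerance -- largest absolute deviation still counted as correct
                 (1 for item ratings, 4 for total ratings)
    """
    # Deviation is gold minus participant, so overestimation is negative
    # and underestimation is positive.
    deviations = [gold - r for r in ratings]
    n_correct = sum(abs(d) <= tolerance for d in deviations)
    return {
        "percent_correct": 100.0 * n_correct / len(ratings),
        "group_bias": mean(deviations),         # mean deviation
        "heterogeneity_sd": stdev(deviations),  # sample SD (assumed)
    }


# Hypothetical item ratings from five raters; the gold standard is 2.
item = reliability_summary([2, 3, 4, 2, 1], gold=2, tolerance=1)
print(item["percent_correct"])  # 80.0
```

In this made-up example four of the five ratings fall within one point of the gold standard, and the negative group bias (−0.4) indicates that the raters overestimated severity on average.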
Results
The results are depicted in Table 1. Prior to training, the average correct ratings were 83% at item level and 70% for the total score. After the training, the percentage correct ratings improved to 99% at item level and 100% for the total score (F [1, 57] = 216.7, P < 0.001; χ² = 10.6, P = 0.001, respectively). The item ratings that improved significantly were: depressed mood (χ² = 10.6, P = 0.001), insomnia middle (χ² = 10.6, P = 0.001), work and activities (χ² = 10.6, P = 0.001), psychomotor agitation (χ² = 10.6, P = 0.001), psychic anxiety (χ² = 10.6, P = 0.001) and somatic anxiety (χ² = 10.6, P = 0.001).

The average deviation from the gold standard rating (i.e. group bias) improved from 0.07 to 0.02 at item level (F [1, 57] = 5.3, P = 0.026), and from 2.97 to 0.46 for the total rating (F [1, 57] = 8.3, P = 0.006); the spreading (SD) of the difference from the gold standard (i.e. group homogeneity) improved from 1.1 to 0.5 (F [1, 57] = 464.6, P < 0.001). See Table 1 for group bias and homogeneity. Items that showed persistent group bias after the training were: depressed mood (mean deviation = 0.8), hypochondria (mean deviation = -0.6) and insight (mean deviation = -0.7). The most heterogeneous items after the training were: depressed mood (SD = 0.65), work and activities (SD = 0.63), and somatic symptoms (SD = 0.64).

Discussion
In line with our first hypothesis, the reliability of HDRS ratings without prior training was rather poor, with 83% correct ratings at item level and 70% correct total scorings. This is in line with previous results showing 89% correct ratings at item level and 33% correct total ratings (Muller and Dragicevic, 2003). This highlights that the HDRS cannot be reliably administered by untrained raters. On the other hand, acceptable inter-rater reliabilities have been observed among untrained but experienced psychiatrists in the Netherlands (Hooijer et al., 1991). In the Netherlands, psychiatrists are commonly exposed to HDRS assessments during their training; additional HDRS training may therefore be less relevant.

After the HDRS training, the reliability of the HDRS ratings improved significantly, to nearly perfect ratings. This confirms our hypothesis that a training program is sufficient for preparing non-Western psychiatric residents for reliable assessment of depression severity, using the HDRS.
Table 1. Effect of training on reliability of HDRS ratings in Bandung, Indonesia

                                          Percentage of correct ratings     Deviation from gold standard, mean (SD)
HDRS item                                 Before training  After training   Before training  After training
1. Depressed mood                         30%              89%**            1.5 (0.82)       0.8 (0.65)**
2. Guilt                                  97%              100%             0.5 (0.63)       0.3 (0.52)
3. Suicide                                100%             100%             -0.8 (0.57)      0.2 (0.57)**
4. Insomnia early                         100%             100%             0.3 (0.48)       0.0 (0.19)**
5. Insomnia middle                        80%              100%*            -0.8 (0.77)      0.0 (0.00)**
6. Insomnia late                          100%             100%             0.2 (0.41)       0.0 (0.00)*
7. Work and activities                    13%              100%**           2.5 (0.92)       0.4 (0.63)**
8. Psychomotor retardation                100%             100%             1.1 (0.57)       0.1 (0.52)**
9. Psychomotor agitation                  17%              100%**           -1.8 (0.57)      -0.4 (0.62)**
10. Anxiety psychic                       93%              100%*            -0.2 (0.57)      -0.2 (0.61)
11. Anxiety somatic                       67%              96%**            -1.6 (1.1)       -0.2 (0.61)**
12. Loss of appetite                      97%              100%             0.6 (0.56)       0.0 (0.00)**
13. Somatic symptoms                      100%             93%              0.7 (0.65)       0.4 (0.64)
14. Sexual interest                       93%              96%              0.2 (0.57)       0.1 (0.45)
15. Hypochondria                          93%              100%             -0.4 (0.61)      -0.6 (0.50)
16. Loss of weight                        97%              100%             0.2 (0.50)       0.0 (0.00)**
17. Insight                               93%              100%             -0.8 (0.57)      -0.7 (0.46)
18. Diurnal variation                     93%              100%             -0.1 (0.37)      0.0 (0.00)
19. Depersonalization and de-realization  97%              100%             -0.01 (0.37)     0.0 (0.00)
20. Paranoia                              93%              100%             -0.2 (0.51)      0.0 (0.00)
21. Obsessive–compulsive symptoms         100%             100%             0.0 (0.00)       0.0 (0.00)
Average per item                          83%              99%**            0.1 (1.1)        0.0 (0.50)*
Total score                               70%              100%**           3.0 (3.8)        0.5 (2.7)**
*P < 0.05; **P < 0.01. HDRS, Hamilton Depression Rating Scale; NA, not applicable; SD, standard deviation.
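As a rough illustration of how the after-training columns of Table 1 can be screened, the Python sketch below flags items whose mean deviation is still at least 0.5 points in either direction and lists the three items with the largest SD. The cutoff of 0.5 and the choice of a top three are assumptions made for this example; the article does not state the criterion it used. The (mean, SD) pairs are copied from the "After training" deviation column of Table 1.

```python
# After-training deviation from the gold standard as (mean, SD), from Table 1.
after_training = {
    "Depressed mood": (0.8, 0.65),
    "Guilt": (0.3, 0.52),
    "Suicide": (0.2, 0.57),
    "Insomnia early": (0.0, 0.19),
    "Insomnia middle": (0.0, 0.00),
    "Insomnia late": (0.0, 0.00),
    "Work and activities": (0.4, 0.63),
    "Psychomotor retardation": (0.1, 0.52),
    "Psychomotor agitation": (-0.4, 0.62),
    "Anxiety psychic": (-0.2, 0.61),
    "Anxiety somatic": (-0.2, 0.61),
    "Loss of appetite": (0.0, 0.00),
    "Somatic symptoms": (0.4, 0.64),
    "Sexual interest": (0.1, 0.45),
    "Hypochondria": (-0.6, 0.50),
    "Loss of weight": (0.0, 0.00),
    "Insight": (-0.7, 0.46),
    "Diurnal variation": (0.0, 0.00),
    "Depersonalization and de-realization": (0.0, 0.00),
    "Paranoia": (0.0, 0.00),
    "Obsessive-compulsive symptoms": (0.0, 0.00),
}

BIAS_CUTOFF = 0.5  # assumed cutoff; not taken from the article

# Items whose mean deviation is still large in either direction.
biased = [
    name for name, (mean_dev, _sd) in after_training.items()
    if abs(mean_dev) >= BIAS_CUTOFF
]

# The three items with the largest spread around the gold standard.
most_heterogeneous = sorted(
    after_training, key=lambda name: after_training[name][1], reverse=True
)[:3]

print(biased)              # ['Depressed mood', 'Hypochondria', 'Insight']
print(most_heterogeneous)  # ['Depressed mood', 'Somatic symptoms', 'Work and activities']
```

Under these assumed criteria the screen returns the same items the text names as showing persistent group bias and high heterogeneity, with a positive mean for depressed mood (underestimation) and negative means for hypochondria and insight (overestimation).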
After the training, some items showed a persistent group bias (i.e. depressed mood, hypochondria and insight). In addition, the items depressed mood, work and activities, and somatic symptoms showed persistently high heterogeneity after the training. Even though most participants rated these items within the range of the gold standard (Table 1), this group bias and group heterogeneity for certain items is still noteworthy.

One previous study also found poor inter-rater reliability for the work and interest item, suggesting poor reliability of this item in general (Muller and Dragicevic, 2003). It has been suggested that the description of this item is not clear enough, and that it should be revised. Indeed, the item seems to query reduced interest in work and activities on the one hand, and actual reduced level of activity on the other (Muller and Dragicevic, 2003). This might have been confusing when rating this item.

Persistent group bias in our sample on the other items may indicate reduced cultural validity of these items in Indonesia. Presentation of depressive symptoms indeed differs across Western and non-Western countries (Hanck et al., 1981; Sen and Williams, 1987; Ohishi and Kamijima, 2002; Chowdhury et al., 2005; Lueboonthavatchai and Thavichachart, 2010). In Indonesia, and particularly Javanese culture, the sociocultural norm is to control emotions (Kurihara et al., 2000; Breugelmans and Poortinga, 2006; Nichter et al., 2009; Subandi, 2011). The assessment of depressed mood and insight in depression by a direct question, as in the HDRS, may not fit such a culture (Ohishi and Kamijima, 2002; Lueboonthavatchai and Thavichachart, 2010). This cultural difference between Indonesia and Western countries might have affected the ratings by the group, resulting in an overestimation of severity of lack of insight and an underestimation of severity of mood symptoms.

The third item showing persistent group bias was hypochondria. Group members rated the hypochondria item as more severe compared to the gold standard rating. A more hypochondriac presentation of depression has been observed in cultures with high levels of bodily expression (Hanck et al., 1981). Any signs of hypochondriac thoughts in the rated patients may therefore be interpreted as a more severe sign of depression, leading to the observed group bias for the hypochondria item in our study. These findings warrant further studies into differences in presentation of depression in non-Western countries and the need to adjust Western assessment tools such as the HDRS accordingly.

A major strength of the current study is the pre- and post-training design in a homogeneous group of participants (i.e. psychiatric residents). To what extent the current findings also apply to other professionals (e.g. psychologists) remains to be studied. We used a previously translated and validated Indonesian version of the HDRS-21. However, exact details of the validation process were lacking. Because we did not encounter major obstacles in the back-translation of the Indonesian version into English, problems with the validity of the HDRS version used in our study are probably limited. Prior studies showed that the 17-item version may be more reliable (Hooijer et al., 1991; Dozois, 2003; Muller and Dragicevic, 2003; Bech, 2009). Recently, the GRID-HAM-D has been proposed as a gold standard for HDRS assessment, because of its outstanding psychometric properties (Tabuse et al., 2007; Williams et al., 2008). Future studies may preferably use GRID-HAM-D.

In the current study, a professional actor played standardized patients. Such an approach has previously shown face validity (Rosen et al., 2004). Future studies may address whether short HDRS training can also improve the interview technique of the participants, by assessment of the same (standardized) patient by all participants.

Taken together, the results of the current study show that: (i) using the HDRS by psychiatric residents in Indonesia without prior training is unreliable; and (ii) HDRS training improves reliability to near perfect levels. Future studies are needed to address cultural effects on the assessment of depression (severity), as some HDRS items may not be applicable across cultures.

Acknowledgments
We thank Dr Hasan Zeni, who kindly contributed in making the scoring videos. We also thank the actor that participated in making the videos. Furthermore, we thank all participants, the psychiatric residents of Rumah Sakit Hasan Sadikin, Bandung, for their willingness to participate in the training and provide consent for analyses and publication of the training data.

References
Agbir T.M., Audu M.D., Adebowale T.O., Goar S.G. (2010) Depression among medical outpatients with diabetes: a cross-sectional study at Jos University Teaching Hospital, Jos, Nigeria. Ann Afr Med. 9, 5–10.
Ahikari S.R., Pradhan S.N., Sharma S.C., Shrestha B.R., Shrestha S., Tabedar S. (2008) Diagnostic variability and therapeutic efficacy of ECT in Nepalese sample. Kathmandu Univ Med J (KUMJ). 6, 41–48.
Ang Q.Q., Wing Y.K., He Y., et al. (2009) Association between painful physical symptoms and clinical outcomes in East Asian patients with major depressive disorder: a 3-month prospective observational study. Int J Clin Pract. 63, 1041–1049.
Bech P. (2009) Fifty years with the Hamilton scales for anxiety and depression. A tribute to Max Hamilton. Psychother Psychosom. 78, 202–211.
Breugelmans S.M., Poortinga Y.H. (2006) Emotion without a word: shame and guilt among Raramuri Indians and rural Javanese. J Pers Soc Psychol. 91, 1111–1122.
Chowdhury A.N., Sanyal D., Chakraborty A.K., De R., Banerjee S., Weiss M.G. (2005) Community Psychiatry Clinics at Sundarban: a clinical and cultural experience. Indian J Public Health. 49, 227–230.
Dozois D.J. (2003) The psychometric characteristics of the Hamilton Depression Inventory. J Pers Assess. 80, 31–40.
El H.Y., Yaalaoui S., Chihabeddine K., Boukind E., Moussaoui D. (2006) Depression in mothers of burned children. Arch Womens Ment Health. 9, 117–119.
Erfan S., Hashim A.H., Shaheen M., Sabry N. (2010) Effect of comorbid depression on substance use disorders. Subst Abus. 31, 162–169.
Gibbons R.D., Clark D.C., Kupfer D.J. (1993) Exactly what does the Hamilton Depression Rating Scale measure? J Psychiatr Res. 27, 259–273.
Hamdi E., Amin Y., bou-Saleh M.T. (1997) Performance of the Hamilton Depression Rating Scale in depressed patients in the United Arab Emirates. Acta Psychiatr Scand. 96, 416–423.
Hamilton M. (1960) A rating scale for depression. J Neurol Neurosurg Psychiatry. 23, 56–62.
Hamilton M. (1967) Development of a rating scale for primary depressive illness. Br J Soc Clin Psychol. 6, 278–296.
Hanck C., yuso Gutierrez J.L., Ramos Brieva J.A. (1981) Clinical forms of depression in African and Spanish cultural communities. A new comparative study. Acta Psychiatr Belg. 81, 437–443.
Hooijer C., Zitman F.G., Griez E., van Tilburg W., Willemse A., Dinkgreve M.A. (1991) The Hamilton Depression Rating Scale (HDRS); changes in scores as a function of training and version used. J Affect Disord. 22, 21–29.
Kobak K.A., Brown B., Sharp I., et al. (2009) Sources of unreliability in depression ratings. J Clin Psychopharmacol. 29, 82–85.
Kurihara T., Kato M., Tsukahara T., Takano Y., Reverger R. (2000) The low prevalence of high levels of expressed emotion in Bali. Psychiatry Res. 94, 229–238.
Kusumanto S., Iskandar Y., Musaduik T. (1980) Validasi Hamilton Depresi Rating Scale, Dalam Bahasa Indonesia. Majalah Jiwa KSPBI, Jakarta.
Lee M.S., Ham B.J., Kee B.S., et al. (2005) Comparison of efficacy and safety of milnacipran and fluoxetine in Korean patients with major depression. Curr Med Res Opin. 21, 1369–1375.
Leung C.M., Wing Y.K., Kwong P.K., Lo A., Shum K. (1999) Validation of the Chinese-Cantonese version of the hospital anxiety and depression scale and comparison with the Hamilton Rating Scale of Depression. Acta Psychiatr Scand. 100, 456–461.
Lueboonthavatchai P., Thavichachart N. (2010) Universality of interpersonal psychotherapy (IPT) problem areas in Thai depressed patients. BMC Psychiatry. 10, 1–7.
Muller M.J., Dragicevic A. (2003) Standardized rater training for the Hamilton Depression Rating Scale (HAMD-17) in psychiatric novices. J Affect Disord. 77, 65–69.
Nichter M., Padmawati S., Danardono M., Ng N., Prabandari Y., Nichter M. (2009) Reading culture from tobacco advertisements in Indonesia. Tob Control. 18, 98–107.
Ohishi M., Kamijima K. (2002) A comparison of characteristics of depressed patients and efficacy of sertraline and amitriptyline between Japan and the West. J Affect Disord. 70, 165–173.
Pradhan S.N., Adhikary S.R., Sharma S.C. (2008) A prospective study of comorbidity of alcohol and depression. Kathmandu Univ Med J (KUMJ). 6, 340–345.
Rosen J., Mulsant B.H., Bruce M.L., Mittal V., Fox D. (2004) Actors’ portrayals of depression to test interrater reliability in clinical trials. Am J Psychiatry. 161, 1909–1911.
Satthapisit S., Posayaanuwat N., Sasaluksananont C., Kaewpornsawan T., Singhakun S. (2007) The comparison of Montgomery and Asberg Depression Rating Scale (MADRS thai) to diagnostic and statistical manual of mental disorders (DSM) and to Hamilton Rating Scale for Depression (HRSD): validity and reliability. J Med Assoc Thai. 90, 524–531.
Schaal S., Elbert T., Neuner F. (2009) Narrative exposure therapy versus interpersonal psychotherapy. A pilot randomized controlled trial with Rwandan genocide orphans. Psychother Psychosom. 78, 298–306.
Sen B., Williams P. (1987) The extent and nature of depressive phenomena in primary health care. A study in Calcutta, India. Br J Psychiatry. 151, 486–493.
Spitzer R.L., Gibbon M., Skodol A.E., Williams J.B.W., First M.B. (2003) DSM-III Case Book: A Learning Companion to the Diagnostic and Statistical Manual of Mental Disorders (Fourth Revised Edition). American Psychiatric Publishing, Inc, Arlington.
Subandi M.A. (2011) Family expressed emotion in a Javanese cultural context. Cult Med Psychiatry. 35, 331–346.
Tabuse H., Kalali A., Azuma H., et al. (2007) The new GRID Hamilton Rating Scale for Depression demonstrates excellent inter-rater reliability for inexperienced and experienced raters before and after training. Psychiatry Res. 153, 61–67.
Williams J.B., Kobak K.A., Bech P., et al. (2008) The GRID-HAMD: standardization of the Hamilton Depression Rating Scale. Int Clin Psychopharmacol. 23, 120–129.
Zheng Y.P., Zhao J.P., Phillips M., et al. (1988) Validity and reliability of the Chinese Hamilton Depression Rating Scale. Br J Psychiatry. 152, 660–664.
