Primary Therapist Response Modes: Comparison of Six Rating Systems

Robert Elliott, University of Toledo
Clara E. Hill, University of Maryland
William B. Stiles, Miami University
Myrna L. Friedlander, State University of New York at Albany
Alvin R. Mahrer, University of Ottawa, Canada
Frank R. Margison, Central Manchester Health Authority, England

Six therapist response-mode rating systems were compared in order to delineate a set of primary modes that would best summarize the domain of therapist actions. Ratings of seven diverse therapy sessions showed that, in spite of differences in measurement assumptions and rater characteristics, interrater reliabilities generally were similar. When categories in different rating systems were collapsed to the same level of specificity, moderate to strong convergence was found for the six modes rated in all systems: question, information, advisement, reflection, interpretation, and self-disclosure. These modes discriminated among the seven contrasting therapeutic approaches. Each therapist was characterized by a unique pattern of response modes that differed significantly from the others. Researchers interested in assessing therapist in-session behaviors should consider incorporating measures that include these six modes.

An earlier version of this article was presented at the annual meeting of the Society for Psychotherapy Research, Lake Louise, Alberta, Canada, June 1984. The authors thank the six teams of raters, as well as Henya Rachmiel and Mee-Ok Clio, for their assistance in data entry and analysis. This article was written while the first and third authors were Visiting Researchers at the Social and Applied Psychology Unit at the University of Sheffield, Sheffield, United Kingdom.
Correspondence concerning this article should be addressed to Robert Elliott, Department of Psychology, University of Toledo, Toledo, Ohio 43606.

The psychotherapy or counseling process can be divided into four aspects: content (what is said), action (what is done), style (how it is said or done), and quality (how well it is carried out) (Elliott, 1984; Russell & Stiles, 1979). Categories or dimensions of verbal action are referred to by linguists and philosophers as speech acts (Searle, 1969) and by process researchers as response modes (Goodman & Dooley, 1976).

Measures of therapist response modes have gradually accumulated, resulting in some 20 or 30 systems to date (Elliott et al., 1982; Goodman & Dooley, 1976; Hill, 1982; Russell & Stiles, 1979). Some of these systems have been incorporated into skills training packages for beginning counselors (e.g., Ivey & Gluckstern, 1974), whereas others have been used to discriminate between treatments or to predict therapy outcome (e.g., Sloane, Staples, Cristol, Yorkston, & Whipple, 1975).

The multiplicity of rating systems makes it difficult to compare studies or to tell whether differences between studies are due to rating systems or to the interviews rated. Furthermore, to our knowledge, no published studies have compared different response-mode rating systems.

Barriers to the comparison of response-mode systems include (a) the use of different labels for similar categories; (b) the failure to report reliabilities for specific response modes, making it difficult to evaluate specific categories within and across systems; (c) different measurement assumptions and rating procedures (e.g., differences in scoring units); and (d) theoretical biases, resulting in overemphasis on certain verbal behaviors and restricted applicability to therapist behaviors of a particular orientation. Thus, a need exists for comparing response-mode systems on a single sample of therapy sessions.

In this study, six developers of rating systems collaborated to rate a common set of actual therapy sessions. The sessions represented a range of theoretical orientations, a range of client types, and a mixture of initial and later sessions, providing a variety of therapist verbal behaviors.

Our goals were (a) to compare interrater reliabilities, (b) to seek a common set of primary modes, and (c) to assess the discriminant validity of the primary modes by contrasting response use by different therapists.

Method

Sample

Seven therapy sessions were rated: (a) John Paul Brady: Behavioral Treatment of Stuttering (Brady, 1983), an initial session demonstrating deconditioning of stuttering with a young woman; (b) Albert Ellis: John Jones (Ellis, 1983), the 15th session of rational-emotive therapy with a young male homosexual; (c) Clara Hill, the 5th session of a 12-session, [...] college student (Hill, Carter, & O'Farrell, 1983); (d) Robert Hobson (Hobson, 1982), an initial session of conversational therapy (a relationship-dynamic treatment conducted by its originator, a British psychiatrist) with a young woman with interpersonal problems; (e) Ira Progoff: Gregg (Progoff, 1983), Jungian dream analysis with a male client (the session appears to be one of the last from a long-term therapy); (f) Carl Rogers: Miss Munn (Rogers, 1983), the 17th session of a client-centered treatment with a young woman; (g) Faith Tanney, an intake session with a male client with procrastination problems that was conducted by a counseling center therapist with a gestalt-dynamic orientation (Hill, 1978).

Response Mode Systems

The six response mode systems used in the study are summarized below and in Table 1.

1. Hill's Counselor Verbal Response Mode Category System (Hill, 1978) consists of 14 mutually exclusive categories. Response modes are judged for each response unit, which is defined as a grammatical sentence that has been unitized separately (brief phrases such as "mm-hmm" and "yes" are also treated as separate units). Unitized transcripts were rated independently by three trained undergraduates. Final ratings were based on agreement by two of the three judges; three-way disagreements were resolved by discussion.

2. Friedlander's (1982) refinement of Hill's (1978) rating system includes nine mutually exclusive categories, combining several of Hill's categories (e.g., "open question" and "closed question" are combined as "information seeking"). The scoring unit is generally the same as Hill's, except that each unit must minimally contain a verb phrase (i.e., "uh-huh" is not rated and compound predicates are scored separately). Ratings were done from unitized transcripts. Three raters were used. Procedures for final ratings and resolution of disagreements were identical to those of Hill (1978).

3. Stiles' Verbal Response Mode System (Stiles, 1978, 1979) consists of eight mutually exclusive categories. Verbal form (literal meaning) and speaker intention (meaning intended in the situation) are rated separately, and the unit is defined as the independent clause or nonrestrictive dependent clause. Three trained undergraduate students unitized and rated transcripts. A two-out-of-three convention was used for resolving disagreements; three-way disagreements were defined as unclassifiable.

4. Elliott's Response Mode Rating System (Elliott, 1985) consists of 10 nonmutually exclusive dimensions that are rated using 0-3 confidence ratings. The unit is flexible, but in this study Hill's (1978) verbal sentence units were used. Ratings were made from unitized transcripts and tapes. Final ratings were achieved by rescaling confidence ratings to 0-1 scales, then averaging ratings across the four raters (three undergraduates and coauthor Friedlander).

5. The Conversational Therapy Rating System (Goldberg, Hobson, Maguire, Margison, O'Dowd, Osborn, & Moss, 1984; referred to here as the Margison system) was developed to rate the therapist behaviors described in Hobson's Conversational Model of Therapy (Hobson, 1985). The system includes 11 mutually exclusive function categories, rigidly defined by formal cues, with particular emphasis on types of questions and advisement. The final ratings represent a combination of ratings by two judges (a psychology research assistant and coauthor Margison).

6. Mahrer's Taxonomy of Procedures and Operations in Psychotherapy (Mahrer, 1983) contains 35 mutually exclusive categories. The unit is defined as the therapist's speaking turn. Disagreements (less than 50% agreement) are resolved by rerating responses or by labeling responses as unclassifiable. Between eight and 12 raters (graduate students and coauthor Mahrer) were used on each session. Ratings were made from tapes and transcripts.

Results

Reliability Analyses

To compare systems, the phi statistic, a simplification of the product-moment correlation (for pairs of dichotomous variables), was used. The response-mode systems using nominal scale measurement (all except Elliott's) were treated as sets of dichotomous dimensions, each corresponding to separate response-mode categories. Correlations were then calculated between each pair of raters for each category or dimension in each system. The means of these correlations, provided in Table 1, can be interpreted as the reliability of the average rater for each category or dimension within each rating system. The median values (across categories) for five of the seven systems were quite similar (Hill = .61, Friedlander = .59, Elliott = .56, Stiles-Intent = .55, and Mahrer = .52, or .57 if the 12 categories with zero or near zero base-rates are excluded). These small differences probably reflect differences in the training and sophistication of the raters. The median reliability for Stiles-Form ratings was .73, presumably because response form is easier to rate than speaker intent. (The other response-mode systems are primarily intent based.) The typical values for the Margison system were higher yet (Mdn = .88), probably because these categories have been rigidly defined and require highly experienced raters. When median reliability values were grouped by mode, question categories were rated most reliably (.71), followed by advisement (.66), information (.64), self-disclosure (.61), reassurance (.58), interpretation (.56), reflection (.53), confrontation (.48), and other (.37).

Alpha reliability coefficients were also calculated in order to obtain a measure of actual reliability. These were substantially higher than the average rater figures because they were averages of 2-12 ratings. Only 19 of the 97 different categories or dimensions had alpha reliabilities less than .70. All of these involved either zero or very low base-rate categories (13 instances, most in the Mahrer system); other or unclassifiable categories (3 instances); or had alphas between .60 and .70, indicating that the data were still usable (3 instances).

Comparison of Rating Systems on Primary-Response Modes

The final ratings were assembled into a master file, first using Hill's response units and then subdividing to accommodate the smaller units used by Stiles and (occasionally) by Friedlander. If Stiles' raters divided a Hill unit into smaller parts, then the other systems' ratings were duplicated over the same parts of the response. The exception occurred when Stiles or Hill created a unit for something not rated in other systems (e.g., initial "yes," "well," "you know"); in this case, a missing value was entered for the system in which the unit was not rateable. Finally, because there was disagreement among systems as to whether acknowledgments or minimal encouragers (e.g., "uh-huh") and silence should be considered units, these were dropped from the analyses.

In the initial comparisons among systems, the original categories were used. It became clear, however, that differences in "fine-grainedness" of categories were distorting the results. For example, Friedlander's reflection-restatement category is superordinate to Hill's reflection, restatement, and nonverbal referent categories. To compare categories across systems, similar categories were identified by careful checking of category definitions and examples and by initial correlational analyses. We found six primary response modes (question, information, advisement, interpretation, reflection, and self-disclosure) that were measured by all the systems and two modes (reassurance and confrontation) that were common to four of the systems.
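
The reliability computations described under Reliability Analyses can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the phi coefficient is computed from a 2 x 2 table of two raters' present/absent codes for one category, and, because the article does not say how alpha was obtained, the standardized form (equivalent to the Spearman-Brown formula applied to the mean inter-rater correlation) is assumed.

```python
import math


def phi(x, y):
    """Phi coefficient for two equal-length sequences of 0/1 codes.

    x and y are two raters' codes for one category, one entry per unit
    (1 = category scored as present, 0 = absent).
    """
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when either rater never (or always) uses the category,
        # which is the zero base-rate situation mentioned in the text.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


def standardized_alpha(mean_r, k):
    """Reliability of the average of k raters, given the mean inter-rater r.

    Assumption: standardized alpha (Spearman-Brown form); the article does
    not state which alpha formula was used.
    """
    return k * mean_r / (1 + (k - 1) * mean_r)
```

Under these assumptions, a mean inter-rater correlation of .61 averaged over three raters gives `standardized_alpha(0.61, 3)` of about .82, which is consistent with the statement that the alpha coefficients ran substantially higher than the average-rater figures.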

Table 2
Proportions and Item-Total Correlations for Primary Response Modes

Rating system   Question  Information  Advisement  Reflection  Interpretation  Self-disclosure
Proportions
Hill            .14       .24          .02         .12         .27             .01
Friedlander     .11       .37          .05         .17         .15             .01
Stiles-form     .13       .32          .04         .16         .07             .16
Stiles-intent   .15       .15          .06         .18         .22             .07
Elliott         .17       .16          .10         .20         .31             .05
Margison        .13       .27          .07         .19         .19             .29
Mahrer          .18       .18          .07         .08         .38             .02
Item-total correlations
Hill            .82       .58          .50         .58         .62             .32
Friedlander     .76       .51          .73         .61         .49             .34
Stiles-form     .73       .32          .58         .49         .17             .59
Stiles-intent   .84       .62          .73         .70         .45             .57
Elliott         .84       .68          .67         .66         .67             .48
Margison        .73       .37          .46         .35         .26             .52
Mahrer          .49       .51          .33         .47         .52             .39

Note. Categories are treated as separate dimensions within each system. The item-total correlation is the correlation between a category in a system and the proportion for the category for all the remaining systems.
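
The corrected item-total correlation defined in the note to Table 2 can be sketched as follows. This is an illustrative reading of that definition, not the authors' code: the function names and the data layout (one list of 0/1 codes per system, for a single mode) are assumptions, and units with missing values are assumed to have been removed beforehand.

```python
import math
from statistics import mean


def pearson(x, y):
    """Product-moment correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(ratings, system):
    """Correlate one system's codes with the proportion of the other systems.

    ratings: dict mapping a system name to a list of 0/1 codes for one mode,
             one entry per unit (hypothetical layout).
    system:  the system whose convergence with the rest is being assessed.
    """
    others = [name for name in ratings if name != system]
    item = ratings[system]
    # For each unit, the proportion of the remaining systems scoring the mode.
    total = [mean(ratings[name][i] for name in others) for i in range(len(item))]
    return pearson(item, total)
```

Leaving the target system out of the total is what makes the correlation "corrected": a system cannot inflate its own convergence by correlating with itself.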

In all, 138 tests of convergence (correlations between same primary modes across different systems) were performed on 50 measures of primary-response modes.¹ The very large sample size (N = 1,947) and the nonindependence of observations (units within sessions) rendered the usual statistical significance criteria meaningless. However, 117 of these tests (85%) had correlations of at least .20 and 98 (71%) of the correlations exceeded .30. An estimate of discriminant validity was given by comparing these with the 930 correlations of different modes in different systems; only 10 exceeded .20 (1.1%), and none exceeded .30.

Mean intercorrelations showed that convergence across systems for comparable modes was highest for question and reassurance (.61 and .52, respectively), and was moderate for information (.40), reflection (.38), advisement (.35), self-disclosure (.33), confrontation (.31), and interpretation (.28). These convergence figures, interestingly, average only about .20 lower than the reliability estimates for categories within systems.

To provide an index of each response-mode system's convergent validity, each was compared with an index consisting of all other measures of the same mode. These corrected item-total correlations are presented in Table 2 for the six primary modes. (The comparable values for reassurance are Hill, .68; Friedlander, .62; Elliott, .73; and Mahrer, .52. For confrontation, comparable values are Hill, .47; Friedlander, .51; Elliott, .45; and Mahrer, .29.) Hill's system showed particularly good convergence relative to other systems for its measures of question, interpretation, and confrontation, whereas its measure of self-disclosure appeared to differ from that of the other systems. Friedlander's version of the Hill system showed good convergence on measures of advisement and confrontation but not on measures of self-disclosure. Elliott's system converged relatively strongly in its measures of question, information, reflection, and interpretation. Stiles-Intent converged well for question, information, advisement, reflection, and self-disclosure. Stiles-Form converged well only for self-disclosure and diverged somewhat for information and interpretation. The Mahrer and Margison systems appeared to be most divergent from the other systems, generally showing lower item-total correlations.

¹ These data are available from the first author.

Discrimination of Therapists by Response-Mode Use

As a further test of the discriminant validity of the primary response modes, we compared the proportions of response modes used by the seven therapists. We developed a composite response-mode index, calculated by finding the proportion of systems scoring the given mode as present for a given unit. These composite indices were then averaged across the 1,947 units (see Table 3). Finally, dummy variable correlations were used for each mode to compare each therapist with all other therapists. Effect sizes (rs > .10, .20, .30) are also given in Table 3 because conventional significance levels were not useful (for r = .10, p < .001).

Each therapist was characterized by a unique profile of response-mode use: Brady (a behavior therapist) used more information and advisement but was low on reflection, interpretation, and confrontation. In comparison, Progoff also used more information but less advisement and confrontation. Tanney (a gestalt/dynamic therapist) frequently used information, self-disclosure, and question and avoided use of reflection and interpretation. Consistent with his treatment model, Rogers used more reflection than did any other therapist for that mode or, in fact, for any one mode. Hill's relatively high use of interpretation and reflection modes is consistent with her bridging of the relationship and dynamic therapy traditions. Ellis was unique in his high use of both reassurance and confrontation and gave fewer information and self-disclosure responses. Hobson was

Table 3
Mean Composite Ratings by Therapist for Response Modes

Response mode    Total    Brady    Ellis    Hill     Hobson   Progoff  Rogers   Tanney
Question         .15      .20      .17      .06ᵃ     .14      .11      .10      .21ᵃ
Information      .24      .38ᵃ     .14ᵃ     .17ᵃ     .13ᵃ     .31ᵃ     .02ᵃ     .33ᵃ
Advisement       .06      .20ᶜ     .03      .03      .09      .01ᵃ     .00      .06
Reflection       .16      .05ᵃ     .14      .22ᵃ     .18      .13      .71ᶜ     .10ᵃ
Interpretation   .23      .03ᵇ     .28      .38ᵇ     .28      .26      .14      .13ᵇ
Self-disclosure  .09      .07      .01ᵃ     .10      .11      .07      .04      .13ᵃ
Reassurance      .05      .03      .09ᵃ     .02      .03      .07      .02      .04
Confrontation    .05      .00ᵃ     .16ᵃ     .06      .02      .01ᵃ     .00      .05
Other            .03      .03      .04      .01      .03      .03      .00      .03
N                1,947    216      269      357      239      364      65       437

Note. Values in the table are mean composite response ratings (proportion of systems scoring a unit as a given mode, averaged across all units). Dummy variable correlations were used to compare each therapist with all others for each mode (correlating composite response mode ratings with a set of variables coded 1 for each therapist vs. 0 for all other therapists). Effect sizes are given because conventional significance levels were not useful (for r = .10, p < .001).
ᵃ r > .10. ᵇ r > .20. ᶜ r > .30.
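
The composite index and the dummy variable correlations described in the note to Table 3 can be sketched in Python as follows. The sketch rests on assumptions that the article does not settle: the function names and data layout are hypothetical, and a system's missing value for a unit (coded `None` here) is assumed to be left out of that unit's proportion.

```python
import math
from statistics import mean


def pearson(x, y):
    """Product-moment correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / math.sqrt(sxx * syy)


def composite_index(unit_codes):
    """Proportion of systems scoring one mode as present for one unit.

    unit_codes: one 0/1 code per rating system; None marks a system in
    which the unit was not rateable (assumed to be excluded, not counted
    as 0).
    """
    rated = [c for c in unit_codes if c is not None]
    if not rated:
        raise ValueError("unit was not rated by any system")
    return sum(rated) / len(rated)


def therapist_contrast(composites, therapists, target):
    """Correlate composite ratings with a dummy variable for one therapist.

    The dummy variable is 1 for units spoken by `target` and 0 for units
    spoken by any other therapist.
    """
    dummy = [1 if t == target else 0 for t in therapists]
    return pearson(composites, dummy)
```

Averaging `composite_index` over a therapist's units would give one cell of Table 3, and `therapist_contrast` gives the effect size that the table footnotes flag at the .10, .20, and .30 thresholds.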

not characterized by relatively greater use of any response mode, although he did use less information.

Discussion

The present results suggest that a set of fundamental response-mode categories or dimensions underlies a variety of systems with different origins and purposes. We have reported evidence for the convergent and discriminant validity of six primary response modes (question, advisement, information, reflection, interpretation, and self-disclosure) rated in all the systems studied.

Convergence was obtained despite a variety of methodological differences between rating systems, including levels of rater training experience (a few weeks to several years), rater sophistication (advanced undergraduates to seasoned therapy researchers), rating units (clauses to complete responses), thoroughness of definitions and examples (brief descriptions to complete expositions), and measurement structure (nominal scale vs. set of rating dimensions).

The factors that did seem to affect interrater reliability or convergent validity were (a) the category itself, with question being the easiest category to rate reliably; (b) the form versus intention distinction introduced by Goodman and Dooley (1976) and incorporated by Stiles (1978, 1979); and (c) whether or not the system was designed to measure a particular approach to therapy.

At the same time, the results make it clear that the response-mode measures do not converge entirely. The intersystem validity coefficients are still about .20 lower than the typical interrater correlations within systems (i.e., .30-.40 vs. .55-.60). Although the systems measure the same primary modes, they are still defined somewhat differently.

We conclude that there generally is not a best response-mode rating system. Reliability values were comparable and convergent validity varied by category, with no one system clearly proving best on all modes. However, to promote comparability across studies, researchers measuring therapist speech acts should examine the six primary modes and may also want to consider measuring reassurance, confrontation, or acknowledgment. Researchers should select or adjust response-mode systems according to their needs in relation to the features and strengths of particular systems.

At present, the therapist response modes are probably the most widely studied therapist process variables. Response modes are conceptually clear and can be thoroughly specified; they also discriminate between different approaches to therapy. On the other hand, therapist response modes predict outcome or therapist effectiveness only weakly or moderately (Elliott et al., 1982). The response modes measure only one aspect of therapist responses: action. A more complete description of therapist responses requires additional ratings of content, style, quality, or effectiveness (Elliott, 1984; Russell & Stiles, 1979). Finally, therapist response modes also take their meaning and impact from context, in particular, from background characteristics of the client, from the current status of the relationship, from the current important helping tasks, and from the client's immediately prior statement (Elliott, 1984).

References

Brady, J. P. (1983). Behavioral treatment of stuttering (AAP Tape Library Catalogue, Tape No. 38). Salt Lake City, UT: American Academy of Psychotherapists.
Elliott, R. (1984). A discovery-oriented approach to significant events in psychotherapy: Interpersonal process recall and comprehensive process analysis. In L. Rice & L. Greenberg (Eds.), Patterns of change (pp. 249-286). New York: Guilford Press.
Elliott, R. (1985). Helpful and nonhelpful events in brief counseling interviews: An empirical taxonomy. Journal of Counseling Psychology, 32, 307-322.
Elliott, R., Stiles, W. B., Shiffman, S., Barker, C. B., Burstein, B., & Goodman, G. (1982). The empirical analysis of helping communication: Conceptual framework and recent research. In T. A. Wills (Ed.), Basic processes in helping relationships (pp. 333-356). New York: Academic Press.
Ellis, A. (1983). John Jones (AAP Tape Library Catalogue, Tape No. 11). Salt Lake City, UT: American Academy of Psychotherapists.
Friedlander, M. L. (1982). Counseling discourse as a speech event: Revision and extension of the Hill Counselor Verbal Response Category System. Journal of Counseling Psychology, 29, 425-429.
Goldberg, D. P., Hobson, R. F., Maguire, G. P., Margison, F. R., O'Dowd, T., Osborn, M., & Moss, S. (1984). The clarification and assessment of a method of psychotherapy. British Journal of Psychiatry, 144, 567-580.
Goodman, G., & Dooley, D. (1976). A framework for help-intended communication. Psychotherapy: Theory, Research and Practice, 13, 106-117.
Hill, C. E. (1978). The development of a system for classifying counselor responses. Journal of Counseling Psychology, 25, 461-468.
Hill, C. E. (1982). Counseling process research: Philosophical and methodological dilemmas. Counseling Psychologist, 10, 7-19.
Hill, C. E., Carter, J. A., & O'Farrell, M. K. (1983). A case study of the process and outcome of time-limited counseling. Journal of Counseling Psychology, 30, 3-18.
Hobson, R. (1982). Louise R. Unpublished tape, Central Manchester Health Authority, Manchester, England.
Hobson, R. F. (1985). Forms of feeling. London: Tavistock.
Ivey, A. E., & Gluckstern, N. B. (1974). Basic attending skills: Participant manual. North Amherst, MA: Microtraining Associates.
Mahrer, A. R. (1983). Taxonomy of procedures and operations in psychotherapy. Unpublished manuscript, University of Ottawa, Canada.
Progoff, I. (1983). Gregg (AAP Tape Library Catalogue, Tape No. 12). Salt Lake City, UT: American Academy of Psychotherapists.
Rogers, C. R. (1983). Miss Munn (AAP Tape Library Catalogue, Tape No. 5). Salt Lake City, UT: American Academy of Psychotherapists.
Russell, R. L., & Stiles, W. B. (1979). Categories for classifying language in psychotherapy. Psychological Bulletin, 86, 404-419.
Searle, J. R. (1969). Speech acts: An essay in the philosophy of language. Cambridge, England: Cambridge University Press.
Sloane, R. B., Staples, F. R., Cristol, A. H., Yorkston, N. J., & Whipple, K. (1975). Psychotherapy versus behavior therapy. Cambridge, MA: Harvard University Press.
Stiles, W. B. (1978). Verbal response modes and dimensions of interpersonal roles: A method of discourse analysis. Journal of Personality and Social Psychology, 36, 693-703.
Stiles, W. B. (1979). Verbal response modes and psychotherapeutic technique. Psychiatry, 42, 49-62.

Received March 7, 1986
Revision received July 22, 1986