Health service psychology (HSP) graduate programs are shifting from knowledge- to competency-based
assessments of trainees’ psychotherapy skills. This study used Generalizability Theory to test the dependability of observer ratings of trainee clinical competence. A 10-item rating form was developed from a collection of forms used by graduate programs (n = 102)
in counseling and clinical psychology, and a review of the common factors research literature. This form
was then used by 11 licensed psychologists to rate eight graduate trainees while viewing 129, approximately
5-min video clips from their psychotherapy sessions with clients (n = 22) at a graduate program’s training
clinic. Generalizability analyses were used to forecast how the number of raters and clients, and length of
observation time impact the dependability of ratings in various rating designs. Raters were the primary
source of error variance in ratings, with rater main effects (leniency bias) and dyadic effects (rater-target
interactions) contributing 24% and 7% of variance, respectively. Variance due to segments (video clips) was
also substantial, suggesting that therapist performance varies within the same counseling session.
Generalizability coefficients (G) were highest for crossed rating designs and reached maximum levels
(G > .50) after four raters watched each therapist working with three clients and observed 15 min per dyad.
These findings suggest that expert raters show consensus in ratings even without rater training and only
limited direct observation. Future research should investigate the validity of competence ratings as
predictors of outcome.
Keywords: competence assessment, clinical skills, direct observation, observer ratings, generalizability theory
The past 3 decades have witnessed a shift in professional psychology from knowledge- to competency-based assessments of trainee psychotherapy skills. Proponents of this growing “culture of competence” argue that completing graduate coursework and clinical practicum experiences do not adequately ensure that trainees will be effective psychotherapists; students must also demonstrate that they can effectively apply their knowledge to real-world clinical practice (Fouad et al., 2009; Hatcher et al., 2013).
This article was published Online First July 29, 2021.
Molly Kring https://orcid.org/0000-0002-1502-3390
Jessica K. Cozart https://orcid.org/0000-0003-2537-5362
Morgan T. Sinnard https://orcid.org/0000-0001-6820-3320
Emily H. Hamm https://orcid.org/0000-0002-2954-6694
Nickolas D. Frost https://orcid.org/0000-0003-0221-6422
William T. Hoyt https://orcid.org/0000-0003-4324-5676
Molly Kring is now a postdoctoral fellow at Westside Psychotherapy in Madison, Wisconsin.
We have no known conflicts of interest to disclose.
A subset of these findings was presented as a poster at the 2019 annual convention of the American Psychological Association.
Correspondence concerning this article should be addressed to Molly Kring, Department of Counseling Psychology, University of Wisconsin–Madison, 335 Education Building, 1000 Bascom Mall, Madison, WI 53706, United States. Email: molly.kring@gmail.com

These emerging calls for increased accountability and gatekeeping in graduate training programs are reflected in new accreditation and licensing standards. In early 2020, the Examination for Professional Practice in Psychology (EPPP) began rolling out a new skills-based component that includes questions about applied situations that psychologists may face in clinical practice. In 2017, the American Psychological Association (APA) issued new Standards of Accreditation that require graduate programs to base their evaluations of student performance at least in part on direct observation of clinical skills, whether live or via video recordings, and to measure and document students’ competence in areas outlined by the APA (American Psychological Association [APA], 2019). Doctoral training programs have responded to this accountability movement by implementing systematic assessments of clinical competence (Grus et al., 2016). Many programs have developed
EVALUATING PSYCHOTHERAPIST COMPETENCE 223
their own assessment tools and procedures tailored to their unique training goals (Grus et al., 2016), and little evidence is available on the reliability and validity of the prototype Competency Benchmarks rating form (Fouad et al., 2009) or its variants. Our purpose in this study was to conduct an initial generalizability study of observer ratings of trainee clinical competence to inform further research on and implementation of these rating procedures.

The Rise of Observation-Based Competency Assessments

To aid programs in assessing trainee competencies, an APA work group (Fouad et al., 2009) released a Competency Benchmarks rating form that provides an initial model of a standardized measure with behavioral anchors. The APA encouraged graduate programs to adapt the tool to fit their training goals, noting that some programs may seek to highlight certain competency areas over others (e.g., assessment over teaching) and add behavioral anchors specific to their discipline. The most recent iteration of the rating form includes four subcategories in the Relational and Application domains that specifically relate to direct assessment of psychotherapy skills: interpersonal relationships, affective skills, expressive skills, and helping skills (American Psychological Association [APA], 2012).

The rollout of the new APA accreditation requirements and assessment tools has solidified the place of observationally informed competency assessments in professional psychology graduate training programs. A 2016 survey of training directors at APA-accredited programs in health service psychology (HSP) fields indicated that 48% of respondents (N = 150) were incorporating the Competency Benchmarks rating form into their training model and using it to evaluate student progress, and 42% of the programs reported developing their own competencies rating form (Grus et al., 2016). Moreover, 60 of the 150 programs reported using live recordings/performance reviews to assess competencies.

Implementation of Competency Assessments Outpaces Research Support

Research support for the reliability and validity of competency assessments has lagged behind their implementation. To date, no study has evaluated the reliability or validity of the APA Competency Benchmarks rating form, although some have examined its psychometric properties. This is likely due in part to the heterogeneity of forms and assessment methods in use by programs, as well as the significant resources needed to conduct reliability and validity studies (e.g., multiple raters per trainee, access to psychotherapy sessions and client outcome data).

In one of the most rigorous psychometric studies of the Benchmarks to date, Price et al. (2017) used an item response theory approach to analyze 270 competency evaluations of 94 preinternship doctoral trainees drawn from clinical, clinical health, and counseling psychology programs at a large, public university. Fifty-seven practicum supervisors rated their trainees at the end of each semester, so that each supervisor contributed a mean of 4.98 evaluations and each trainee received a mean of 3.03 evaluations (Price et al., 2017). For their evaluations, supervisors used a 52-item practicum evaluation form derived from the Benchmarks Competencies document. The researchers found a unidimensional, one-factor model of competency, leading them to question whether the form could be greatly condensed (Price et al., 2017). While this study’s findings are a promising start to psychometrically testing the APA rating form, it did not examine the extent to which evaluations were based on direct observation, nor did it examine interrater reliability.

Separate research has highlighted the significant role of rater bias in competency ratings. Supervisors are often prone to viewing their own supervisees in a favorable light (halo bias), and to use the upper end of the rating scale (leniency bias; Gonsalvez & Freestone, 2007). A meta-analysis of 79 published generalizability studies in psychological journals found that 37% of the score variance in rating systems was attributable to raters’ differing interpretations of rating scales and targets (Hoyt & Kerns, 1999). Rater bias has been found to be particularly strong in ratings of attributes that required rater inference (e.g., empathy, working alliance) as compared to ratings that directly reflect observed behaviors (e.g., frequency counts; Hoyt, 2002; Hoyt & Kerns, 1999). Taken together, these findings suggest that rater bias is likely a significant issue in clinical competency assessments that warrants further investigation.

Performance-based ratings of trainee competence are now near-universally used in graduate training in psychology, both to evaluate progress in the program for individual students and, in aggregate, to document program effectiveness. It is important that such high-stakes assessments be reliable and valid, but as yet there is little guidance to training programs about optimal design for these rating procedures. For example, it is unclear how many raters, how much direct observation time of a trainee conducting psychotherapy, and how many clients per trainee are needed to provide valid evidence of trainee clinical competence. This study is a preliminary effort to address this need, focusing on dependability (reliability or replicability) of trainee clinical competence ratings.

Generalizability Theory

Generalizability theory (G theory) offers a framework for investigating the dependability (reliability) of psychotherapy competence ratings and the relative influence of multiple possible sources of variance that contribute to these ratings (Cronbach et al., 1972; Hoyt & Melby, 1999). Score dependability is quantified by a generalizability coefficient (or G coefficient) that estimates the proportion of score variance that is attributable to ratee universe scores (analogous to the true score in classical test theory). While classical test theory partitions variance in scores into a true score component and an error component, which encompasses all other variance sources, G theory further deconstructs the variance components and their interactions to determine the most serious sources of error (Shavelson & Webb, 1991). G theory thereby allows researchers to investigate multiple sources of error (facets) simultaneously, and provides detailed guidance for optimizing rating procedures to improve score dependability (Hoyt & Melby, 1999).

In this study, the facets examined were raters, items, clients, sessions, and segments-within-sessions. Given the rater bias research outlined above, it was assumed that competency ratings would vary significantly based on the rater evaluating the trainee. Clients were also identified as a potential source of variance in ratings given that client characteristics have been shown to be an important predictor of psychotherapy outcomes (Lambert, 2013). We also examined whether ratings varied based on the session viewed, or by the segment viewed within a given session (recognizing that direct observation in many settings involves viewing part of a session rather than a full session). Finally, including items as a facet allowed us to estimate variance due to specific-factor error, a source of error that can be estimated when scores are computed as the mean or sum of multiple items (Schmidt et al., 2003).

Generalizability analysis proceeds in two steps. First, data from the generalizability study are analyzed to provide variance estimates representing the contribution of each facet, and interactions among facets, to the variance in observed scores. Second, these variance estimates are used to estimate G coefficients from different rating designs (e.g., different numbers of raters per trainee; different numbers of clients per trainee) to provide guidance to future users of the measurement procedure about how to optimize generalizability of ratings.

224 KRING, COZART, SINNARD, OBY, HAMM, FROST, AND HOYT
This document is copyrighted by the American Psychological Association or one of its allied publishers.

Raters were licensed psychologists who responded to a solicitation by the first author and agreed to participate in the study. The first author invited experienced supervisors who had an affiliation with her PhD program to serve as raters for the study. Raters were unacquainted with the trainee therapists, and compensated $400 for their time. The research team relied on local personal and professional networks to recruit raters because of the significant time commitment required and modest financial compensation. Efforts were made to recruit raters with a range of supervision and clinical experience to mirror the differing experience levels of raters in HSP competence rating processes. Potential raters were informed that their participation in this research study was voluntary, their ratings would be de-identified, and they would be compensated $400 upon completion of the study. Of the 40 raters recruited, 11 agreed to participate (28%).
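The two-step logic described above (estimate variance components, then forecast G for a planned design) can be illustrated in miniature. The following Python sketch, which is our illustration rather than the study's own code (the study used R), simulates a simple crossed persons x raters design, estimates variance components from the expected mean squares of a two-way ANOVA without replication, and converts them into a G coefficient for the mean of several raters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_r = 200, 4  # persons (therapists) and raters, fully crossed

# Simulate ratings: person universe scores, rater leniency, and residual error.
person = rng.normal(0, 1.0, size=(n_p, 1))    # universe-score variance = 1.0
rater = rng.normal(0, 0.7, size=(1, n_r))     # rater main effect (leniency bias)
error = rng.normal(0, 1.0, size=(n_p, n_r))   # person x rater interaction + noise
x = person + rater + error

# Step 1: variance components from expected mean squares.
grand = x.mean()
row_m = x.mean(axis=1, keepdims=True)
col_m = x.mean(axis=0, keepdims=True)
ms_p = n_r * ((row_m - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((col_m - grand) ** 2).sum() / (n_r - 1)
ms_e = ((x - row_m - col_m + grand) ** 2).sum() / ((n_p - 1) * (n_r - 1))
var_p = (ms_p - ms_e) / n_r   # person (universe-score) variance
var_r = (ms_r - ms_e) / n_p   # rater main-effect variance
var_e = ms_e                  # residual (interaction + error)

# Step 2: forecast G for the mean of n_r raters. In a crossed design the rater
# main effect is constant across targets, so it drops out of the denominator.
g_crossed = var_p / (var_p + var_e / n_r)
```

With these simulated values, the recovered person variance should be near 1.0 and G should be near var_p / (var_p + var_e / 4); the same arithmetic, with more facets, underlies the forecasts reported later in this article.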
Table 1
Demographic Characteristics of Participants
psychotherapy sessions. Research team members then sorted all of the extracted items into discrete categories (e.g., interpersonal skills, empathy, working alliance, etc.) and determined the 10 most common and representative items in the forms. Because many programs drew their items and rating scales directly from the APA Benchmarks Competencies Rating Form, we opted to use the same rating scale for our composite form. Raters were specifically asked, “How characteristic of the trainee’s behavior are the following competency item descriptions?,” and rated each item on a 5-point Likert scale, with scale anchors of Not at all/slightly, Somewhat, Moderately, Mostly, and Very.

The research team also created a second rating form based on the empirically supported characteristics of effective therapists. Laska et al. (2014) and Norcross and Lambert’s (2018) meta-analyses on therapist effects, and The Heart and Soul of Change (2010) were used as representative, systematic reviews of the common factors literature for this project. Ten common factors were identified from this literature review and converted into a rating form with the same scale anchors as the first form.

Item Selection

Both 10-item rating forms underwent expert review before being condensed into one 10-item composite measure. The first form was sent to all 102 of the training directors or department chairs who responded to the first recruitment email with their program’s
form(s), and the second to 11 academic scholars familiar with the common factors literature. Twenty-six (25%) reviewers provided feedback on the graduate program rating form, and 10 (91%) on the common factors rating form. Both sets of expert reviewers were asked to rate their confidence that supervisors could reliably rate each item based on video recordings of psychotherapy sessions, as well as their general feedback on the items. Common factors scholars were specifically asked to rank the strength of each item’s empirical support as a predictor of client outcomes, and training directors were alternatively asked to rate the importance of each item as a predictor of therapist competence.

The common factors scholars rated the working alliance and empathy items as having the strongest empirical support, and repairing alliance ruptures and managing countertransference as having the least. On average, they were most confident that collaboration and therapist empathy could be rated by supervisors via video observation, and least confident about rating congruence/genuineness and managing countertransference. Training directors rated working alliance and empathy items as being the most important indicators of therapist competence, and displaying basic helping skills and the ability to deal with conflict and negotiate differences as the least important of the 10 items listed. Training directors on average were most confident that supervisors could reliably rate basic helping skills and empathy based on video observation, and were least confident regarding rating how effectively trainees implement evidence-based interventions and their ability to deal with conflict and negotiate differences. Based on their feedback, the researchers created one final 10-item rating form that consisted of five items from graduate program forms, and five from the common factors research literature.

Procedure

Preparation of Video Clips (Segments)

Within each therapist–client dyad, three videotaped psychotherapy sessions were selected: one from the first third of treatment, one from the second third, and one from the last third of the available sessions. Intake and termination sessions, as well as the first and last 5 min of each session, were excluded as these are less indicative of a therapist’s skill level. Within each of the three sessions, two 5-min (range = 4–6 min) video clips were created. Clips were identified by first using a random number generator to identify two time points at least 10 min apart during the session. For each randomly selected time point, a research team member watched the video from 5 min before to 5 min after the time identified. Within that 10-min block, a 4–6 min clip was identified that had a reasonably defined “beginning” and “end.” To qualify for inclusion, both the therapist and client had to be speaking in the clip for at least 30 s each, and the therapist had to be delivering an intervention beyond supportive listening.

These broad selection criteria ensured that therapist skills could be adequately judged from the clip and that a wide range of clips was eligible. If these criteria were not met in the randomly selected 10-min segment, a new number was randomly generated to identify a different clip. This selection process ensured that raters could understand the context of a given clip and adequately rate the trainees’ competencies. To ensure consistency in selecting clips, research team members attended a 1-hr training on the clip selection process that emphasized the importance of selecting a random clip. Once clips were selected, they were reviewed by the first author to ensure similar selection procedures regarding length and coherence. Two clips were replaced during this review process, one due to sound issues and a second due to difficulty following the conversation.

Rating Procedures

The video clips were then uploaded to a secure and Health Insurance Portability and Accountability Act-compliant file storage website maintained by the university. Over the course of 6–8 weeks, each rater watched the same 129, 5-min video clips of trainees conducting psychotherapy, in the same order. To avoid overwhelming raters and to encourage adherence to a timeline, video clips were released to raters in three batches of 40–45 clips each, and raters were encouraged to take breaks as needed to prevent fatigue. After watching each video clip, raters completed the 10-item rating form as a Qualtrics survey. The study was approved by the university’s Institutional Review Board.

Results

Preliminary Analyses

Before proceeding to the generalizability analyses, we attended to issues of missing data and conducted exploratory analyses to better understand the characteristics of the rating instrument. The 10-item rating instrument was based on competency assessment forms used in 102 doctoral training programs and factors empirically linked to therapist outcomes. Our initial item analysis provided insight into the rating form’s factor structure and guided our treatment of items as a facet (source of error) in the generalizability analyses. R software (version 3.4.1) was used for all data analyses.

Data Cleaning

One of the 11 raters’ scores was dropped from the final analyses because of a significant number of missing ratings, multiple rating forms submitted for several clips, and unusually quick completion of the study. All forms for one video clip were also omitted because of technical difficulties with the video. The final analyses included data from 10 raters, 8 therapists, 22 clients, and 129 video clips. A total of 1,291 rating forms were completed. Nine of the 10 raters completed forms for every clip; one rater missed one clip, and two raters submitted the same form twice. In the latter case, the second form was used in both instances because the first forms were missing responses, suggesting that they were submitted prematurely. Within the rating forms themselves, raters completed nearly every item for every rating form submitted. The most-skipped item was Goal Consensus, which was not rated on 19 rating forms (1.5% of all forms).

Item Analysis

Table 2 provides the overall mean scores and standard deviations for each of the 10 rating form items, each rated on a scale from 1 to 5. Means ranged from 3.43 (attentiveness to emotion) to 4.04 (positive regard) and standard deviations averaged around one point. Thus, there was appreciable variability in ratings both within and between items. Interitem correlations, calculated both between and within
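The random clip-selection procedure described above can be sketched in code. This is our own illustrative Python version, not the team's actual tooling; the function and parameter names are hypothetical, and in practice the 4–6 min clip boundaries were chosen by hand within each 10-min block.

```python
import random

def pick_clip_windows(session_len_min, excluded_edge_min=5, min_gap_min=10, seed=None):
    """Pick two random time points (in minutes), mirroring the procedure above:
    skip the first/last 5 min of the session and require points >= 10 min apart.
    Returns the two 10-min blocks (5 min either side of each point) from which
    a 4-6 min clip with a clear beginning and end would then be cut by hand."""
    rng = random.Random(seed)
    lo, hi = excluded_edge_min, session_len_min - excluded_edge_min
    while True:
        t1, t2 = sorted(rng.uniform(lo, hi) for _ in range(2))
        if t2 - t1 >= min_gap_min:
            return [(t - 5, t + 5) for t in (t1, t2)]

windows = pick_clip_windows(50, seed=1)  # e.g., a 50-min session
```

The rejection loop corresponds to the rule above that a new random number is generated whenever the selected block fails the inclusion criteria.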
Table 3
Within- and Between-Therapist Pearson Correlations Among Rating Form Items
Item 1 2 3 4 5 6 7 8 9 10
1. Alliance — .77** .76** .73** .74** .74** .72** .70** .63** .71*
2. Empathy .91** — .73** .70** .80** .78** .76** .73** .70** .75**
3. Collaboration .95** .79* — .82** .67** .74** .73** .72** .68** .75**
4. Goal consensus .92** .72* .97** — .65** .73** .68** .67** .69** .75**
5. Positive regard .91** .97** .78* .69* — .73** .75** .70** .64** .70**
6. Accurate reflection .94** .93** .89** .88** .85** — .78** .76** .73** .79**
7. Interpersonal skills .98** .93** .90** .84** .93** .94** — .81** .72** .81**
8. Communication .98** .93** .92** .90** .89** .98** .98** — .69** .77**
9. Emotion .88** .97** .80* .76* .90** .96** .89** .93** — .76**
10. Intervention .95** .87** .97** .93** .83** .95** .94** .97** .90** —
Note. Correlations above the diagonal (roman values) are computed within therapist; ns ranged from 120 to 181; those below the diagonal (italicized values)
are computed between therapist; n = 8.
* p < .05. ** p < .01.
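Table 3 reports within-therapist correlations (above the diagonal) and between-therapist correlations (below the diagonal, based on the n = 8 therapist means). As a sketch of how such a split can be computed, assuming the within-therapist values are correlations of deviations from each therapist's mean, here is an illustrative Python version with simulated data (the study itself used R):

```python
import numpy as np

def within_between_corr(scores_a, scores_b, therapist_ids):
    """Split the correlation of two item scores into a within-therapist part
    (deviations from each therapist's mean) and a between-therapist part
    (correlation of the therapist means)."""
    ids = np.asarray(therapist_ids)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    a_dev, b_dev = a.copy(), b.copy()
    means_a, means_b = [], []
    for t in np.unique(ids):
        m = ids == t
        means_a.append(a[m].mean())
        means_b.append(b[m].mean())
        a_dev[m] -= a[m].mean()
        b_dev[m] -= b[m].mean()
    r_within = np.corrcoef(a_dev, b_dev)[0, 1]
    r_between = np.corrcoef(means_a, means_b)[0, 1]
    return r_within, r_between

# Illustrative data: two item scores for ratings nested in 12 therapists, where
# the items share therapist-level "skill" but have independent rating noise.
rng = np.random.default_rng(2)
ids = np.repeat(np.arange(12), 10)
skill = rng.normal(0, 1, 12)[ids]
a = skill + rng.normal(0, 1, 120)
b = skill + rng.normal(0, 1, 120)
rw, rb = within_between_corr(a, b, ids)
```

In this simulation the between-therapist correlation is high (the items share therapist-level variance) while the within-therapist correlation is near zero, which parallels the pattern in Table 3 of larger values below the diagonal than above it.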
consistently perceives Therapist A as highly competent (across clients and observation units) while a different rater consistently perceives Therapist A as lower in competence. Because all raters observe the same segments for each therapist, a large RT variance component indicates that the raters employ different implicit or explicit criteria for translating observed therapist behaviors into competence judgments.

In naturalistic observation of psychotherapy, clients are nested within therapists. That is, normally each client sees one therapist, and each therapist sees multiple clients. In the present study, we included sessions (nested within therapist–client dyads) and segments (nested within sessions) as additional sources of error. Thus, the G study design is multiply nested, with the Raters facet crossed with Segments, which are nested within Sessions nested within Clients nested within Therapists. Nesting is indicated by a “:” in the design notation, R(Seg:S:C:T).

When some facets are nested in a G study, their main effects and interactions will be confounded and cannot be estimated separately. For example, we can conceptually distinguish between a client main effect (C) and a client–therapist interaction (CT). The C component indexes variance due to client (i.e., some clients would elicit more competent performance from their therapist than others, regardless of which therapist they see). The CT component indexes variance due to the interaction of client and therapist (e.g., Therapist A might perform more competently in treating Client X, and less competently with Client Y). Because clients are nested within therapists in the G study, we cannot estimate C and CT variance components separately. Instead, we estimate a single component (C:T) which indexes variance attributable to both of these sources.

Variance Estimates

Variance partitioning for this G study is presented in Table 4. This table provides information about the proportion of variance due to Therapists (4.0%) and shows that both Rater main effects (24.2%) and Rater–Therapist interactions (6.6%) are substantial sources of error in observer-rated competence. Another large source of error is Seg:S:C:T (7.5%), which indicates substantial within-session variability in therapist competence. By contrast, the S:C:T component was quite small (0.6%), reflecting relatively little variation between sessions (with a given client) in therapist performance. Contrary to our expectations, C:T variance was also quite modest (2.0%), reflecting relatively low variability in competence ratings for a given therapist across different clients. As explained earlier, this C:T variance component confounds the client main effect (C) and the client-by-therapist interaction (CT) because of the nested G study design.

Because Raters are crossed with Segments in the G study, Table 4 includes estimates of the Rater interaction with each of the other facets. In addition to the RT interaction (6.6%), indicating differential perceptions of therapists, there was some evidence that raters showed differential perceptions of therapists with specific clients (R(C:T); 2.2%); also of specific sessions for each therapist–client dyad (R(S:C:T); 2.5%); and especially of specific segments within each session (R(Seg:S:C:T); 25.0%). The residual variance component (25.5%) reflects variability due to systematic sources of error not examined in the G study and to random error.

Table 4
Estimated Variance Components for R(Seg:S:C:T) Model

Source          Variance   95% CI           (%)
T               0.040      [0.035, 0.045]    4.0
R               0.242      [0.233, 0.251]   24.2
C:T             0.020      [0.015, 0.025]    2.0
S:C:T           0.004      [0.000, 0.011]    0.5
Seg:S:C:T       0.075      [0.067, 0.083]    7.5
RT              0.066      [0.059, 0.073]    6.6
R(C:T)          0.021      [0.014, 0.030]    2.2
R(S:C:T)        0.026      [0.013, 0.037]    2.5
R(Seg:S:C:T)    0.249      [0.235, 0.265]   25.0
Residual        0.254      [0.246, 0.263]   25.5

Note. T = Therapists; R = Raters; C = Clients (nested within Therapists); S = Sessions (nested within Clients); Seg = Segments (nested within Sessions). 95% CIs were computed by percentile bootstrapping (resampling), with 10,000 replicated samples.

Forecasting G Coefficients for Different Measurement Designs

The variance estimates in Table 4 help us address the question “What sources contribute to variance in test scores?” (Cronbach & Meehl, 1955), which is important for evaluating reliability as well as validity of competence scores (Hoyt et al., 2006). In this section, we derive forecasts of generalizability (G) coefficients under different rating designs to help guide future users of observer-rated competency scores.

The G coefficient, similar to Cronbach’s α, is an intraclass correlation coefficient (ICC). It reflects the proportion of variance in competence scores that is attributable to therapist universe scores. This is the proportion of variance that is expected to replicate over multiple assessments involving different sets of raters, clients, sessions, and segments. Three design considerations for observer-rated competence measures are (a) number of raters, (b) nesting of raters within targets, and (c) amount of observation time. Because rater-based components (especially R, RT, and R(Seg:S:C:T)) account for a substantial proportion of observed score variance (Table 4), the number of raters who evaluate each trainee will be an important determinant of the generalizability of competence scores derived from observer ratings. Just as adding items to a rating scale improves its internal consistency reliability (Cronbach’s α), aggregating scores over multiple raters will improve the interrater reliability of competence assessments.

Although raters were crossed with therapists in the G study (RT), this is not necessarily the case for ratings based on observations during doctoral training. For example, when ratings are provided by supervisors during practicum or internship training, the rating design is typically nested (R:T) rather than crossed (RT). It is important for training programs to understand how this design feature affects dependability of competency ratings.

A third feature of the rating context concerns how much data raters have on which to base their evaluations of competence. Depending on the rating context, evaluators may make competence assessments based on relatively limited data (e.g., observing part of a single session of work with a single client) or a much more extensive work sample (perhaps including direct observation of work with multiple clients and multiple sessions for each client). In general, we expect more reliable and valid ratings when these are
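As a quick sketch (ours, not part of the original analyses), the percentage column in Table 4 can be approximately recovered by dividing each variance component by the total estimated variance:

```python
# Variance component estimates transcribed from Table 4.
components = {
    "T": 0.040, "R": 0.242, "C:T": 0.020, "S:C:T": 0.004,
    "Seg:S:C:T": 0.075, "RT": 0.066, "R(C:T)": 0.021,
    "R(S:C:T)": 0.026, "R(Seg:S:C:T)": 0.249, "Residual": 0.254,
}
total = sum(components.values())
# Each component as a percentage of total estimated variance.
pct = {k: 100 * v / total for k, v in components.items()}
```

Small discrepancies from the table's rounded percentages are expected, since the printed components are themselves rounded to three decimals.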
based on more extensive observation. Using the variance estimates in Table 4, we can quantify how the generalizability coefficient is expected to improve as the numbers of clients, sessions, and observation time per session increase.

Predicting G Coefficients From Variance Estimates. The G coefficient is an ICC that indicates the accuracy of generalization from a person’s observed score (in this case, a psychotherapy competence rating) to their universe score (their “true” psychotherapy competence) (Shavelson & Webb, 1991). The G coefficient can be interpreted as the proportion of observed score variance that is attributable to trainee universe scores. Alternatively and equivalently, G coefficients can be interpreted as the estimated correlation between two sets of competence ratings produced using the same measurement procedure with different sets of raters, clients, sessions, and segments. G coefficients range from 0 to 1, with higher values reflecting greater consistency of ratings across conditions (facets). Determination of the acceptability of G for a given assessment procedure is dependent on the context of measurement, but it is helpful to keep in mind that G coefficients that take into account multiple sources of error will inevitably be smaller than classical reliability coefficients that take into account only a single source of error (e.g., Schmidt et al., 2003).

Forecasts of coefficients are computed based on two features of prospective assessment designs: whether raters are crossed with or nested within therapists, and the number of levels of each facet (Raters, Clients per therapist, Sessions per client, and Segments—or minutes viewed—per session). Because we used 5-min segments in the G study, observation time for each session was specified in 5-min increments. For each assessment design of interest, we computed a G coefficient by dividing the estimated Therapist variance by the expected observed score variance for that design (see Table 4). If we obtain evaluations from four raters per target (nR = 4), taking the average of these four ratings as the competence score, the Raters facet contributes varR/4 = 0.06 to the resulting observed score variance. This reduction in error variance is reflected in a corresponding increase in the G coefficient estimated for the nR = 4 nested design, compared with the nR = 1 design.

When raters are crossed with therapists (RT) in the assessment design, an additional advantage is gained, in that varR is omitted from the denominator altogether. When all therapists are evaluated by the same set of raters, rater leniency or severity biases are constant across targets, so they do not contribute to observed score variance, and varR is not included in the denominator when computing G for the crossed design:

$$G_{\text{crossed}} = \frac{\mathrm{var}_{T}}{\mathrm{var}_{T} + \dfrac{\mathrm{var}_{C:T}}{n_C} + \dfrac{\mathrm{var}_{S:C:T}}{n_C n_S} + \dfrac{\mathrm{var}_{Seg:S:C:T}}{n_C n_S n_{Seg}} + \dfrac{\mathrm{var}_{RT}}{n_R} + \dfrac{\mathrm{var}_{R(C:T)}}{n_R n_C} + \dfrac{\mathrm{var}_{R(S:C:T)}}{n_R n_C n_S} + \dfrac{\mathrm{var}_{R(Seg:S:C:T)}}{n_R n_C n_S n_{Seg}} + \dfrac{\mathrm{var}_{Res}}{n_R n_C n_S n_{Seg}}}$$

Thus, when we are able to cross raters and targets (i.e., have all raters observe all therapists) in the measurement design, there is a substantial reduction in the estimated observed score variance (because R was among the largest sources of systematic variance in Table 4). This decreases the denominator of the estimated G coefficient, which increases the predicted value of G for that design (relative to an equivalent nested design). In the next section, we explore the joint contribution of design (nested vs. crossed), number of raters, and observation time to dependability of competence scores based on direct observation.

Effects of Number of Raters, Rater Nesting, and Observation Time on Predicted G Coefficients. Figure 1 shows the effect, for both nested and crossed rating designs, of varying the number of raters per therapist and the observation time for each session observed. G coefficients in this figure are computed holding number of clients per therapist (nC = 2) and number of sessions observed
variance (universe score variance, or varT) by the estimated total per client (nS = 2) constant.1 Observation time per session varies
observed score variance—computed as a linear combination of all from 5 to 50 min, so total observation time per therapist, for each
components that contribute to observed score variance for the rater, varies from 20 to 200 min.
specified measurement design. The impact of nesting is clear from comparing the gray lines
For nested designs (R:T), all components estimated in the G study (Gnested) with the black lines (Gcrossed) in Figure 1. Even with
contribute to observed score variance, so all appear in the denomi- nR = 4, the level of dependability achieved by an assessment
nator of the formula for Gnested. procedure in which raters are nested within targets barely exceeds
varT
Gnested = varSeg∶S∶C∶T varRSeg∶S∶C∶T varRes ,
varT + var
nR + nC + nC nS +
R varC∶T varS∶C∶T
nC nS nSeg + varnRRT + var
nR nC + nR nC nS + nR nC nS nSeg + nR nC nS nSeg
RC∶T varRS∶C∶T
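The two forecasting formulas above are simple enough to evaluate directly. The sketch below is illustrative only (Python, rather than the R/nlme setup the authors cite): the variance components are placeholder values, except varR = 0.24 and varRT = 0.07, which match the rater (24%) and rater–target (7%) shares reported for this study, and the varR/4 = 0.06 arithmetic from the text is reproduced as a check.

```python
def predicted_g(v, nR, nC, nS, nSeg, crossed=True):
    """Forecast a G coefficient from variance components.

    v maps component names (document notation, e.g., "S:C:T") to estimates.
    With crossed=True, raters are crossed with therapists, so the rater main
    effect varR drops out of the denominator; otherwise raters are nested.
    """
    denom = (v["T"]                                  # universe score variance
             + v["C:T"] / nC                         # clients within therapists
             + v["S:C:T"] / (nC * nS)                # sessions within clients
             + v["Seg:S:C:T"] / (nC * nS * nSeg)     # segments within sessions
             + v["RT"] / nR                          # rater x therapist (dyadic)
             + v["RC:T"] / (nR * nC)
             + v["RS:C:T"] / (nR * nC * nS)
             + (v["RSeg:S:C:T"] + v["Res"]) / (nR * nC * nS * nSeg))
    if not crossed:
        denom += v["R"] / nR  # leniency/severity bias stays in nested designs
    return v["T"] / denom

# Placeholder components; only "R" (0.24) and "RT" (0.07) follow the shares
# reported in this study -- substitute the actual Table 4 estimates.
v = {"T": 0.20, "R": 0.24, "C:T": 0.03, "S:C:T": 0.02, "Seg:S:C:T": 0.10,
     "RT": 0.07, "RC:T": 0.04, "RS:C:T": 0.03, "RSeg:S:C:T": 0.12, "Res": 0.15}

print(v["R"] / 4)  # 0.06 -- averaging over four raters shrinks the rater facet

# Crossed designs forecast a higher G than nested ones at every nR, because
# varR never enters the crossed denominator.
for nR in (1, 2, 4):
    print(nR,
          round(predicted_g(v, nR, nC=2, nS=2, nSeg=3), 3),
          round(predicted_g(v, nR, nC=2, nS=2, nSeg=3, crossed=False), 3))
```

Holding nC = 2 and nS = 2 as in Figure 1, increasing nR (and nSeg) raises the forecast G in both designs, with the crossed forecast always above the nested one.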
therapist effectiveness. The present study, because of its small number of therapists (n = 8), cannot provide robust evidence for validity of competence scores; nonetheless, we consider the nature of validity evidence for competence ratings and describe how this issue can be addressed in the present data set.

The competencies that are the focus of this G study (relational and intervention domains) are theorized to enhance therapist effectiveness, leading to better client outcomes. Thus, we should be interested in whether competence scores predict between-therapist variance in outcomes, technically referred to as therapist effects (Baldwin & Imel, 2013). Two considerations in seeking validity evidence for predictors of therapist effects are (a) how to define effectiveness and (b) how to obtain generalizable estimates of therapist effects. Because generalizable estimates of therapist effects could not be obtained in this small data set, we were not able to conduct predictive validity analyses for the competence scores.

Discussion

The purpose of this study was to provide initial generalizability data on ratings of clinical competence based on direct observation, following procedures that are close to those typically used in clinical and counseling psychology doctoral programs to evaluate trainee competence. Learning what sources contribute to variance in these ratings can assist evaluators to develop reliable (generalizable) rating procedures—and reliability of ratings is a precondition for validity of ratings (Hoyt et al., 2006).

This document is copyrighted by the American Psychological Association or one of its allied publishers.
supervisor(s) over time), and for formative feedback to trainees (see Hoyt et al., 2021, for a further discussion of this issue).

An unexpected finding was the relatively modest contribution of client variance to competence scores. Client factors account for a large proportion of outcome variance (Lambert, 2013), and it seemed reasonable to expect that trainee therapists might be judged more or less competent depending on the characteristics of their clients. Evidence for client-based variance was not negligible, however, and Figure 2 suggests that observation of trainee work with two or more clients is desirable from the standpoint of generalizability of ratings.

A final noteworthy finding was the lack of evidence for item-based variance, and the very high internal consistency of this 10-item rating form. The lack of evidence for specific-factor error (Schmidt et al., 2003) suggests that raters tended not to discriminate among items in evaluating clinical competence. In theory, we can imagine a therapist who has high competence in expressing empathy but demonstrates weak attention to articulating and agreeing on goals (goal consensus). In the current rating context, however, trainees rated high on one item tended to be rated high on all, and scores derived from these ratings may be best interpreted as reflecting global clinical competence. To the extent that greater discernment among these competency subdomains is desired, additional measurement development work, and likely rater training procedures, may be needed. On the other hand, if global clinical competence ratings are what is desired, the evidence from this initial G study suggests that the length of the rating form can be reduced without adversely affecting generalizability of scores.

Limitations and Future Directions

Sample size is a common limitation in G studies. Using a fully crossed rating design (all raters observe all trainees) places practical constraints on the number of trainees (n = 8) and clients (n = 22) that can be included. It is reasonable to be concerned that such a small sample would lead to rather imprecise variance estimates, which would limit the utility of the findings. However, trainees and clients are only two of the variance sources examined here, and the bootstrapped confidence intervals demonstrate that both variance estimates and G coefficients are estimated with a reasonable degree of precision in this study. Although variance components and G coefficients may be reliably estimated with a relatively small sample of targets (trainees), much larger numbers of therapists and clients are needed to establish validity of competence ratings as predictors of therapist effectiveness.

An additional concern is the representativeness of the sample and of the rating procedures. Therapists in this study were all master's or doctoral trainees, all but one of whom were in an initial supervised clinical placement. Range restriction is a risk in this recruitment procedure and can lead to underestimates of therapist variance (T) and therefore also to underestimates of forecast G coefficients. However, the rating scale was based in part on the forms used by doctoral programs for students at this stage of training, so these items are expected to discriminate among competence levels for early-stage trainees. And indeed, there was persuasive evidence of therapist variance, both in the variance estimates and G coefficients. Nonetheless, it is prudent to be cautious about interpretation of findings from a single sample, and follow-up studies will be important to gauge how well the present findings generalize to the broader trainee population.

Raters were experienced supervisors for whom evaluating competence of trainees at various levels was a familiar experience. We did not provide rater training because such training does not appear to be a commonly used strategy when obtaining supervisor ratings of trainee performance. However, a novel feature of the rating procedure in this study was that raters had no acquaintance with trainees, and based their ratings only on direct observation of trainees' clinical work. While direct observation is a critical means for gathering data regarding trainee progress (APA, 2019; Standard II.B.3.d), it is not the only source of data available to evaluators in naturalistic settings. It is therefore unclear how well the variance estimates from the present G study generalize to ratings where there is access to a wider range of data, and future research should examine generalizability of ratings when evaluators are acquainted with trainees outside the clinical sessions observed.

We note that it is difficult to predict how inclusion of trainees' behaviors outside the clinical context will affect reliability and validity of competence ratings. When raters make inferences about target characteristics based on observed behaviors, overlap drives consensus (Kenny, 1991). If each rater relies not just on commonly observed in-session behaviors, but also on trainee behaviors in supervision sessions or the classroom as a basis for competence ratings, this will reduce overlap of observed behaviors and may therefore decrease consensus (generalizability of ratings). On the other hand, increasing the amount of information available to raters should increase accuracy, as long as the information gleaned from observation and interaction outside of sessions is relevant to the trainee's clinical competence. If additional out-of-session information increases rater accuracy, consensus should also increase (Kenny, 2004). In short, it is unclear how reliance on information other than direct observation affects reliability and validity of competence ratings, and how rater access to idiosyncratic personal information about trainees may affect variance partitioning and generalizability of ratings. More research is needed in this area.

A final limitation of the present study concerns the inability to provide evidence of the predictive validity of the competence ratings (lack of therapist variance). Reliability is a precondition for validity, and our findings suggest that adequate reliability of competence scores can be achieved by following sound evaluation procedures. However, it is possible that even a highly reliable assessment procedure may not be valid for its intended use (Messick, 1989). Once rating procedures are refined to yield dependable competence scores, the next step will be to validate these scores as predictors of client outcomes.

Future research on this topic might consider further exploring the validity of the competency ratings in terms of client outcomes, and the impact of rater training on reducing rater variance. High Rater (R) variance is a major concern in observer-rated competence evaluations, especially when fully crossing raters and targets is not a viable option. Rater training has been shown to reduce R variance and thereby improve the dependability of ratings in other contexts, and could be a direction for future research in clinical competency ratings (Roch et al., 2012). In a psychotherapy context, Gonsalvez et al. (2013) have sought to reduce rater bias by standardizing the meaning of each scale anchor. Their Vignette Matching Assessment Tool (VMAT) asks evaluators to read several brief narratives that describe key features of trainees at specified
developmental levels and then select which vignette best describes the trainee's current skills. Examining whether using the VMAT in rater training or as part of a competency rating scale reduces rater bias could help improve the reliability of ratings.

Implications for Assessment of Trainee Competence

For doctoral programs, tentative recommendations based on these initial findings include collecting competence ratings from multiple evaluators and having a rating design as close to fully crossed as possible—that is, having all evaluators observe and provide ratings for all trainees. It is desirable to observe each trainee with more than one client, although in our sample improvements to generalizability taper off after 2–3 clients per trainee. In our crossed rating design, the dependability of ratings began to plateau after approximately 90–120 min of observation time per trainee.

Programs often design competency rating forms to encompass different domains and different behavioral indicators within each domain. Our rating form items focused on the relationship and intervention domains, which are arguably best assessed by direct observation of their application in session. Very high correlations among these items, both within and between therapists, suggest that raters made few distinctions between the conceptually different skills and behaviors referred to in the 10-item scale. Thus, the scale may function well as an assessment of global clinical competence (with very high internal consistency reliability). Programs desiring to obtain separate evaluations of different clinical competency domains may need to alter the measurement procedures represented here. Possible strategies include (a) more elaborate rating forms (with multiple items per rating dimension); (b) use of rater training to standardize the rating scale and provide examples of trainee behaviors for each scale anchor (Gonsalvez et al., 2013); and (c) use of separate, domain-specific performance tasks (e.g., Facilitative Interpersonal Skills tasks; Anderson et al., 2016) to elicit behaviors most relevant to each competency dimension. Generalizability analysis can be an important tool for refining general and specific measurement procedures, and demonstrating reliability and validity of outcome assessments in graduate training in HSP.

References

American Psychological Association. (2012). Revised competency benchmarks in professional psychology. http://www.apa.org/ed/graduate/benchmarks-evaluation-system.aspx
American Psychological Association. (2019). Standards of accreditation for health service psychology and accreditation operating procedures. https://www.apa.org/ed/accreditation/about/policies/standards-of-accreditation.pdf
Anderson, T., Crowley, M. E. J., Himawan, L., Holmberg, J. K., & Uhlin, B. D. (2016). Therapist facilitative interpersonal skills and training status: A randomized clinical trial on alliance and outcome. Psychotherapy Research, 26(5), 511–529. https://doi.org/10.1080/10503307.2015.1049671
Baldwin, S. A., & Imel, Z. E. (2013). Therapist effects: Findings and methods. In M. Lambert (Ed.), Handbook of psychotherapy and behavior change (6th ed., pp. 258–297). Wiley.
Cronbach, L. J., & Furby, L. (1970). How we should measure "change": Or should we? Psychological Bulletin, 74(1), 68–80. https://doi.org/10.1037/h0029382
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for single and multiple observations. Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957
Fouad, N. A., Grus, C. L., Hatcher, R. L., Kaslow, N. J., Hutchings, P. S., Madson, M. B., Collins, F. L., Jr., & Crossman, R. E. (2009). Competency benchmarks: A model for understanding and measuring competence in professional psychology across training levels. Training and Education in Professional Psychology, 3(4, Suppl), S5–S26. https://doi.org/10.1037/a0015832
Gonsalvez, C. J., Bushnell, J., Blackman, R., Deane, F., Bliokas, V., Nicholson-Perry, K., Shires, A., Nasstasia, Y., Allan, C., & Knight, R. (2013). Assessment of psychology competencies in field placements: Standardized vignettes reduce rater bias. Training and Education in Professional Psychology, 7(2), 99–111. https://doi.org/10.1037/a0031617
Gonsalvez, C. J., & Freestone, J. (2007). Field supervisors' assessments of trainee performance: Are they reliable and valid? Australian Psychologist, 42(1), 23–32. https://doi.org/10.1080/00050060600827615
Grus, C. L., Falender, C., Fouad, N. A., & Lavelle, A. K. (2016). A culture of competence: A survey of implementation of competency-based education and assessment. Training and Education in Professional Psychology, 10(4), 198–205. https://doi.org/10.1037/tep0000126
Hatcher, R. L., Fouad, N. A., Grus, C. L., Campbell, L. F., McCutcheon, S. R., & Leahy, K. L. (2013). Competency benchmarks: Practical steps toward a culture of competence. Training and Education in Professional Psychology, 7(2), 84–91. https://doi.org/10.1037/a0029401
Hoyt, W. T. (2002). Bias in participant ratings of psychotherapy process: An initial generalizability study. Journal of Counseling Psychology, 49(1), 35–46. https://doi.org/10.1037/0022-0167.49.1.35
Hoyt, W. T., & Kerns, M. D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403–424. https://doi.org/10.1037/1082-989X.4.4.403
Hoyt, W. T., Kring, M., & Hamm, E. H. (2021). Considerations for dependable assessment of trainee competence: Implications from an initial generalizability study of ratings based on direct observation [Manuscript submitted for publication]. Counseling Psychology, University of Wisconsin—Madison.
Hoyt, W. T., & Melby, J. N. (1999). Dependability of measurement in counseling psychology: An introduction to generalizability theory. The Counseling Psychologist, 27(3), 325–352. https://doi.org/10.1177/0011000099273003
Hoyt, W. T., Warbasse, R. E., & Chu, E. Y. (2006). Construct validity in counseling psychology research. The Counseling Psychologist, 34(6), 769–805. https://doi.org/10.1177/0011000006287389
Kenny, D. A. (1991). A general model of consensus and accuracy in interpersonal perception. Psychological Review, 98(2), 155–163. https://doi.org/10.1037/0033-295X.98.2.155
Kenny, D. A. (2004). PERSON: A general model of interpersonal perception. Personality and Social Psychology Review, 8(3), 265–280. https://doi.org/10.1207/s15327957pspr0803_3
Kenny, D. A. (2019). Essentials of interpersonal perception. Guilford Press.
Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (4th ed., pp. 233–265). McGraw-Hill.
Kopta, S. M., & Lowry, J. L. (2002). Psychometric evaluation of the Behavioral Health Questionnaire-20: A brief instrument for assessing global mental health and the three phases of psychotherapy. Psychotherapy Research, 12(4), 413–426. https://doi.org/10.1093/ptr/12.4.413
Kopta, M., Owen, J., & Budge, S. (2015). Measuring psychotherapy outcomes with the Behavioral Health Measure–20: Efficient and comprehensive. Psychotherapy, 52(4), 442–448. https://doi.org/10.1037/pst0000035
Lambert, M. J. (2013). Outcome in psychotherapy: The past and important advances. Psychotherapy: Theory, Research, & Practice, 50(1), 42–51. https://doi.org/10.1037/a0030682
Laska, K. M., Gurman, A. S., & Wampold, B. E. (2014). Expanding the lens of evidence-based practice in psychotherapy: A common factors perspective. Psychotherapy, 51(4), 467–481. https://doi.org/10.1037/a0034332
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.
Minami, T., Wampold, B. E., Serlin, R. C., Kircher, J. C., & Brown, G. S. (2007). Benchmarks for psychotherapy efficacy in adult major depression. Journal of Consulting and Clinical Psychology, 75(2), 232–243. https://doi.org/10.1037/0022-006X.75.2.232
Norcross, J. C., & Lambert, M. J. (2018). Psychotherapy relationships that work III. Psychotherapy: Theory, Research, & Practice, 55(4), 303–315.
Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., & R Core Team. (2020). nlme: Linear and nonlinear mixed effects models (R package version 3.1-144). https://CRAN.R-project.org/package=nlme
Price, S. D., Callahan, J. L., & Cox, R. J. (2017). Psychometric investigation of competency benchmarks. Training and Education in Professional Psychology, 11(3), 128–139. https://doi.org/10.1037/tep0000133
Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85(2), 370–395. https://doi.org/10.1111/j.2044-8325.2011.02045.x
Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8(2), 206–224. https://doi.org/10.1037/1082-989X.8.2.206
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Sage Publications.
Tao, K. W., Owen, J., Pace, B. T., & Imel, Z. E. (2015). A meta-analysis of multicultural competencies and psychotherapy process and outcome. Journal of Counseling Psychology, 62(3), 337–350. https://doi.org/10.1037/cou0000086

234 KRING, COZART, SINNARD, OBY, HAMM, FROST, AND HOYT

Received January 20, 2021
Revision received June 9, 2021
Accepted June 10, 2021 ▪