Evaluating Psychotherapist Competence: Testing The Generalizability of Clinical Competence Assessments of Graduate Trainees

Journal of Counseling Psychology
© 2021 American Psychological Association 2022, Vol. 69, No. 2, 222–234

ISSN: 0022-0167 https://doi.org/10.1037/cou0000576
Evaluating Psychotherapist Competence: Testing the Generalizability

of Clinical Competence Assessments of Graduate Trainees
Molly Kring, Jessica K. Cozart, Morgan T. Sinnard, Alayna Oby, Emily H. Hamm,
Nickolas D. Frost, and William T. Hoyt
Department of Counseling Psychology, University of Wisconsin–Madison
Health service psychology (HSP) graduate programs are shifting from knowledge- to competency-based
assessments of trainees’ psychotherapy skills. This study used Generalizability Theory to test the
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
dependability of psychotherapy competence assessments based on video observation of trainees.

This document is copyrighted by the American Psychological Association or one of its allied publishers.
A 10-item rating form was developed from a collection of forms used by graduate programs (n = 102)
in counseling and clinical psychology, and a review of the common factors research literature. This form
was then used by 11 licensed psychologists to rate eight graduate trainees while viewing 129, approximately
5-min video clips from their psychotherapy sessions with clients (n = 22) at a graduate program’s training
clinic. Generalizability analyses were used to forecast how the number of raters and clients, and length of
observation time impact the dependability of ratings in various rating designs. Raters were the primary
source of error variance in ratings, with rater main effects (leniency bias) and dyadic effects (rater-target
interactions) contributing 24% and 7% of variance, respectively. Variance due to segments (video clips) was
also substantial, suggesting that therapist performance varies within the same counseling session.
Generalizability coefficients (G) were highest for crossed rating designs and reached maximum levels
(G > .50) after four raters watched each therapist working with three clients and observed 15 min per dyad.
These findings suggest that expert raters show consensus in ratings even without rater training and only
limited direct observation. Future research should investigate the validity of competence ratings as
predictors of outcome.
Public Significance Statement

Ratings of clinical competence are used to determine adequate progress for trainees in HSP and to
document competence for accreditation and licensure bodies. This study examined sources of error in
these ratings to provide guidance on improving assessment procedures. For competence assessments
based on direct observation, we recommend evaluation by multiple raters for each trainee, and
observation times of at least 60 min per trainee.
Keywords: competence assessment, clinical skills, direct observation, observer ratings, generalizability theory
The past 3 decades have witnessed a shift in professional “culture of competence” argue that completing graduate course-
psychology from knowledge- to competency-based assessments work and clinical practicum experiences do not adequately ensure
of trainee psychotherapy skills. Proponents of this growing that trainees will be effective psychotherapists; students must
also demonstrate that they can effectively apply their knowledge
to real-world clinical practice (Fouad et al., 2009; Hatcher
et al., 2013).
This article was published Online First July 29, 2021. These emerging calls for increased accountability and gatekeep-
Molly Kring https://orcid.org/0000-0002-1502-3390 ing in graduate training programs are reflected in new accreditation
Jessica K. Cozart https://orcid.org/0000-0003-2537-5362 and licensing standards. In early 2020, the Examination for Profes-
Morgan T. Sinnard https://orcid.org/0000-0001-6820-3320 sional Practice in Psychology (EPPP) began rolling out a new skills-
Emily H. Hamm https://orcid.org/0000-0002-2954-6694 based component that includes questions about applied situations
Nickolas D. Frost https://orcid.org/0000-0003-0221-6422 that psychologists may face in clinical practice. In 2017, the
William T. Hoyt https://orcid.org/0000-0003-4324-5676 American Psychological Association (APA) issued new Standards
Molly Kring is now a postdoctoral fellow at Westside Psychotherapy in of Accreditation that require graduate programs to base their eva-
Madison, Wisconsin. luations of student performance at least in part on direct observation
We have no known conflicts of interest to disclose.
of clinical skills, whether live or via video recordings, and to
A subset of these findings was presented as a poster at the 2019 annual
convention of the American Psychological Association.
measure and document students’ competence in areas outlined by
Correspondence concerning this article should be addressed to Molly the APA (American Psychological Association [APA], 2019).
Kring, Department of Counseling Psychology, University of Wisconsin– Doctoral training programs have responded to this accountability
Madison, 335 Education Building, 1000 Bascom Mall, Madison, WI movement by implementing systematic assessments of clinical
53706, United States. Email: molly.kring@gmail.com competence (Grus et al., 2016). Many programs have developed
222
EVALUATING PSYCHOTHERAPIST COMPETENCE 223
their own assessment tools and procedures tailored to their unique (Price et al., 2017). While this study’s findings are a promising start to
training goals (Grus et al., 2016), and little evidence is available on psychometrically testing the APA rating form, it did not examine the
the reliability and validity of the prototype Competency Bench- extent to which evaluations were based on direct observation, nor did
marks rating form (Fouad et al., 2009) or its variants. Our purpose in it examine interrater reliability.
this study was to conduct an initial generalizability study of observer Separate research has highlighted the significant role of rater bias
ratings of trainee clinical competence to inform further research on in competency ratings. Supervisors are often prone to viewing their
and implementation of these rating procedures. own supervisees in a favorable light (halo bias), and to use the upper
end of the rating scale (leniency bias; Gonsalvez & Freestone,
2007). A meta-analysis of 79 published generalizability studies
The Rise of Observation-Based Competency Assessments
in psychological journals found that 37% of the score variance in
To aid programs in assessing trainee competencies, an APA work rating systems was attributable to raters’ differing interpretations of
group (Fouad et al., 2009) released a Competency Benchmarks rating scales and targets (Hoyt & Kerns, 1999). Rater bias has been
rating form that provides an initial model of a standardized measure found to be particularly strong in ratings of attributes that required
with behavioral anchors. The APA encouraged graduate programs rater inference (e.g., empathy, working alliance) as compared to
to adapt the tool to fit their training goals, noting that some ratings that directly reflect observed behaviors (e.g., frequency
programs may seek to highlight certain competency areas over counts; Hoyt, 2002; Hoyt & Kerns, 1999). Taken together, these
others (e.g., assessment over teaching) and add behavioral anchors findings suggest that rater bias is likely a significant issue in clinical
specific to their discipline. The most recent iteration of the rating competency assessments that warrants further investigation.
form includes four subcategories in the Relational and Application Performance-based ratings of trainee competence are now near-
domains that specifically relate to direct assessment of psychother- universally used in graduate training in psychology, both to evaluate
apy skills: interpersonal relationships, affective skills, expressive progress in the program for individual students and, in aggregate, to
skills, and helping skills (American Psychological Association document program effectiveness. It is important that such high-
[APA], 2012). stakes assessments be reliable and valid, but as yet there is little
The rollout of the new APA accreditation requirements and guidance to training programs about optimal design for these rating
assessment tools has solidified the place of observationally informed procedures. For example, it is unclear how many raters, how much
competency assessments in professional psychology graduate train- direct observation time of a trainee conducting psychotherapy, and
ing programs. A 2016 survey of training directors at APA-accredited how many clients per trainee are needed to provide valid evidence of
programs in health service psychology (HSP) fields indicated that trainee clinical competence. This study is a preliminary effort to
48% of respondents (N = 150) were incorporating the Competency address this need, focusing on dependability (reliability or replica-
Benchmarks rating form into their training model and using it to bility) of trainee clinical competence ratings.
evaluate student progress, and 42% of the programs reported
developing their own competencies rating form (Grus et al.,
Generalizability Theory
2016). Moreover, 60 of the 150 programs reported using live
recordings/performance reviews to assess competencies. Generalizability theory (G theory) offers a framework for inves-
tigating the dependability (reliability) of psychotherapy competence
Implementation of Competency Assessments ratings and the relative influence of multiple possible sources of
variance that contribute to these ratings (Cronbach et al., 1972;
Outpaces Research Support
Hoyt & Melby, 1999). Score dependability is quantified by a
Research support for the reliability and validity of competency generalizability coefficient (or G coefficient) that estimates the
assessments has lagged behind their implementation. To date, no proportion of score variance that is attributable to ratee universe
study has evaluated the reliability or validity of the APA Compe- scores (analogous to the true score in classical test theory). While
tency Benchmarks rating form, although some have examined its classical test theory partitions variance in scores into a true score
psychometric properties. This is likely due in part to the heteroge- component and an error component, which encompasses all other
neity of forms and assessment methods in use by programs, as well variance sources, G theory further deconstructs the variance
as the significant resources needed to conduct reliability and validity components and their interactions to determine the most serious
studies (e.g., multiple raters per trainee, access to psychotherapy sources of error (Shavelson & Webb, 1991). G theory thereby
sessions and client outcome data). allows researchers to investigate multiple sources of error ( facets)
In one of the most rigorous psychometric studies of the Bench- simultaneously, and provides detailed guidance for optimizing rating
marks to date, Price et al. (2017) used an item response theory procedures to improve score dependability (Hoyt & Melby, 1999).
approach to analyze 270 competency evaluations of 94 preinternship In this study, the facets examined were raters, items, clients,
doctoral trainees drawn from clinical, clinical health, and counseling sessions, and segments-within-sessions. Given the rater bias research
psychology programs at a large, public university. Fifty-seven practi- outlined above, it was assumed that competency ratings would vary
cum supervisors rated their trainees at the end of each semester, so that significantly based on the rater evaluating the trainee. Clients were also
each supervisor contributed a mean of 4.98 evaluations and each identified as a potential source of variance in ratings given that client
trainee received a mean of 3.03 evaluations (Price et al., 2017). For characteristics have been shown to be an important predictor of
their evaluations, supervisors used a 52-item practicum evaluation psychotherapy outcomes (Lambert, 2013). Finally, we examined
form derived from the Benchmarks Competencies document. The whether ratings varied based on the session viewed, or by the
researchers found a unidimensional, one-factor model of competency, segment viewed within a given session (recognizing that direct
leading them to question whether the form could be greatly condensed observation in many settings involves viewing part of a session
224 KRING, COZART, SINNARD, OBY, HAMM, FROST, AND HOYT
rather than a full session). Finally, including items as a facet allowed solicitation by the first author and agreed to participate in the study.
us to estimate variance due to specific-factor error, a source of error The first author invited experienced supervisors who had an affilia-
that can be estimated when scores are computed as the mean or sum of tion with her PhD program to serve as raters for the study. Raters
multiple items (Schmidt et al., 2003). were unacquainted with the trainee therapists, and compensated
Generalizability analysis proceeds in two steps. First, data from $400 for their time. The research team relied on local personal and
the generalizability study are analyzed to provide variance estimates professional networks to recruit raters because of the significant time
representing the contribution of each facet, and interactions among commitment required and modest financial compensation. Efforts
facets, to the variance in observed scores. Second, these variance were made to recruit raters with a range of supervision and clinical
estimates are used to estimate G coefficients from different rating experience to mirror the differing experience levels of raters in HSP
designs (e.g., different numbers of raters per trainee; different competence rating processes. Potential raters were informed that
numbers of clients per trainee) to provide guidance to future users their participation in this research study was voluntary, their ratings
of the measurement procedure about how to optimize generalizabil- would be de-identified, and they would be compensated $400 upon
ity of ratings. completion of the study. Of the 40 raters recruited, 11 agreed to
participate (28%).
Every client of each participating therapist was recruited in person

Method by the first author in the clinic waiting room over a 2-week period. A
Participants total of 32 (94%) recruited clients consented to participate in the
study, with an average of 4 (range = 2–7) clients per therapist. Of
Participants (N = 41) were masters- or doctoral-level trainee thera- these clients, 22 were randomly selected for inclusion in the study
pists (n = 8), their psychotherapy clients (n = 22), and licensed using a random number generator, resulting in 2–3 clients per
psychologists (raters; n = 11) who had supervised at least one trainee therapist. Clients and trainee therapists completed a demographic
during their career. All trainee therapists were enrolled in a graduate form and provided informed consent for the research team to access
program in Counseling Psychology at a large Midwestern university video recordings of their psychotherapy sessions. Clients and trainee
and completing a full-year practicum experience at the departmental therapists received no compensation for participating in the study.
training clinic. The clinic provides counseling to university students
and adult community members on a sliding-scale basis, and all
sessions are videotaped. The majority of trainee therapists identified Measure: Clinical Competency Rating Form
as heterosexual (n = 6, 75%), White (n = 6, 75%), cisgender women Raters assessed trainee therapists’ psychotherapy competencies
(n = 5, 63%), and their ages ranged from 23 to 28 years (M = 26, using a 10-item rating form created by the authors. The form was
SD = 2). Self-reported number of hours providing direct psychother- based on actual rating forms collected from doctoral training pro-
apy ranged from 50 to 450 (M = 244, SD = 134). grams (n = 102) in counseling and clinical psychology, and a review
Among the 22 clients selected for inclusion in the study, 17 (77%) of the research literature on the characteristics of effective therapists.
identified as cisgender women, 15 (68%) as White, and 16 (73%) as The brief measure therefore reflected both clinical expertise and the
straight/heterosexual. The average age among clients was 28 research evidence on critical competencies for psychotherapists.
(SD = 9). Selected clients attended an average of 12 (SD = 7)
psychotherapy sessions, with a range of 3–28 videotaped sessions
available per client for use in the study. Presenting concerns for the
Item Generation
majority of clients included anxiety and depression symptoms, with Email requests were sent to the training directors and/or depart-
a lesser number seeking counseling for grief, Posttraumatic Stress ment chairs of all APA-accredited clinical and counseling psychol-
Disorder symptoms, and relationship concerns. ogy training programs (N = 320) for copies of their forms used to
Before each psychotherapy session, clients completed the Behav- assess psychotherapy competencies of doctoral trainees early in
ioral Health Measure-20 (BHM-20; Kopta & Lowry, 2002), a 20-item their graduate studies. For many programs, this was the form they
measure that assesses well-being, psychological symptoms, and life ask students’ clinical practicum supervisors to complete each
functioning (Kopta et al., 2015). The mean Global Mental Health semester and was based on the APA Benchmarks Competencies
(GMH) score on the BHM-20 at intake for these clients was M = 2.55 rating form. A single follow-up email was sent after 2 weeks of
(SD = 0.55), indicating significant psychological distress. nonresponse. Email requests yielded 102 forms (32% response rate),
The majority of raters identified as heterosexual (n = 8, 73%), the majority from PhD programs (n = 79, 77%) and the remainder
White (n = 10, 91%), cisgender women (n = 7, 64%). Raters from PsyD programs (n = 23, 23%). The majority (69; 68%) of the
ranged in age from 33 to 60 (M = 38, SD = 8), and reported forms were obtained from clinical psychology, 31 (30%) from
supervising between 3 and 40 therapist trainees throughout their counseling psychology, and 2 (2%) from combined clinical and
career (M = 13, SD = 12). Raters obtained doctoral degrees in counseling psychology doctoral programs.
counseling (n = 6, 55%) or clinical (n = 5, 45%) psychology, A four-member research team consisting of doctoral students in
and the majority (n = 6) worked primarily in university counseling Counseling Psychology then created a 10-item composite rating
centers. Participant demographic data are described in Table 1. form based on the forms received. The team members first identified
all of the items in the collected program forms that reflected the
relationship and intervention competencies from the APA Bench-
Recruitment
marks Competencies Rating Form. We focused on these competen-
All trainee therapists (n = 8) completing a first-year practicum cies as they are most relevant to assessing trainees’ clinical skill and
experience at the training clinic were recruited through email could be reasonably assessed by raters based on video observation of
Table 1
Demographic Characteristics of Participants
Trainee therapists (n = 8) Raters (n = 11) Clients (n = 22)

Characteristic M (SD) Range M (SD) Range M (SD) Range
Age (years) 26 (2) 23–28 38 (8) 33–60 28 (9) 19–51

Estimated number of direct psychotherapy 244 (134) 50–450 — — — —
hours (trainees only)
Estimated number of students supervised — — 13 (12) 3–40 — —
(raters only)
Characteristic n (%) n (%) n (%)
Gender
Cisgender woman 5 (63%) 7 (64%) 17 (77%)

Cisgender man 2 (25%) 3 (27%) 4 (18%)

Gender nonconforming — 1 (9%) —
Transman 1 (13%) — —
Transwoman — — 1 (5%)
Race/ethnicity
Asian/Asian American 1 — 2
Latino/a/x — — 6
Middle Eastern/North African 1 — —
Multiracial 1 1 —
Native American/American Indian 1 — —
White 6 10 15
Sexual orientation
Bisexual or pansexual 1 (13%) 1 (9%) 4 (18%)
Gay or lesbian — 1 (9%) 1 (5%)
Queer 1 (13%) 1 (9%) 1 (5%)
Straight/heterosexual 6 (75%) 8 (73%) 16 (73%)
Highest degree completed
BA or BS 5 (63%) — —
MA or MS 3 (38%) — —
PhD — 9 (82%) —
PsyD — 2 (18%) —
Theoretical Orientation (trainees and raters only)
Cognitive behavioral or Acceptance and 4 1 —
Commitment Therapy
Eclectic — 5 —
Feminist — 1 —
Humanistic/Existential 2 1 —
Psychodynamic/Interpersonal 2 4 —
Work setting (raters only)
College counseling center — 6 —
Community mental health clinic — 1 —
Hospital — 2 —
Primary care clinic — 1 —
Private practice — 2 —
Veterans affairs hospital or outpatient clinic — 1 —
Note. Percentages are not listed for items that allowed participants to check all categories that apply to them. These items included race/ethnicity, theoretical
orientation, and work setting.
psychotherapy sessions. Research team members then sorted all of Laska et al. (2014) and Norcross and Lambert’s (2018) meta-
the extracted items into discrete categories (e.g., interpersonal skills, analyses on therapist effects, and The Heart and Soul of Change
empathy, working alliance, etc.) and determined the 10 most (2010) were used as representative, systematic reviews of the
common and representative items in the forms. Because many common factors literature for this project. Ten common factors
programs drew their items and rating scales directly from the were identified from this literature review and converted into a
APA Benchmarks Competencies Rating Form, we opted to use rating form with the same scale anchors as the first form.
the same rating scale for our composite form. Raters were specifi-
cally asked, “How characteristic of the trainee’s behavior are the
Item Selection
following competency item descriptions?,” and rated each item on a
5-point Likert scale, with scale anchors of Not at all/slightly, Both 10-item rating forms underwent expert review before being
Somewhat, Moderately, Mostly, and Very. condensed into one, 10-item composite measure. The first form was
The research team also created a second rating form based on sent to all 102 of the training directors or department chairs who
the empirically supported characteristics of effective therapists. responded to the first recruitment email with their program’s
form(s), and the second to 11 academic scholars familiar with the selection process that emphasized the importance of selecting a
common factors literature. Twenty-six (25%) reviewers provided random clip. Once clips were selected, they were reviewed by the
feedback on the graduate program rating form, and 10 (91%) on the first author to ensure similar selection procedures regarding length
common factors rating form. Both sets of expert reviewers were and coherence. Two clips were replaced during this review process,
asked to rate their confidence that supervisors could reliably rate one due to sound issues and a second due to difficulty following the
each item based on video recordings of psychotherapy sessions, as conversation.
well as their general feedback on the items. Common factors
scholars were specifically asked to rank the strength of each item’s Rating Procedures
empirical support as a predictor of client outcomes, and training
directors were alternatively asked to rate the importance of each item The video clips were then uploaded to a secure and Health
as a predictor of therapist competence. Insurance Portability and Accountability Act-compliant file storage
The common factors scholars rated the working alliance and website maintained by the university. Over the course of 6–8 weeks,
empathy items as having the strongest empirical support, and each rater watched the same 129, 5-min video clips of trainees
repairing alliance ruptures and managing countertransference as conducting psychotherapy, in the same order. To avoid overwhelm-
having the least. On average, they were most confident that collab- ing raters and to encourage adherence to a timeline, video clips were
oration and therapist empathy could be rated by supervisors via released to raters in three batches of 40–45 clips each, and raters
video observation, and least confident about rating congruence/ were encouraged to take breaks as needed to prevent fatigue. After
genuineness and managing countertransference. Training directors watching each video clip, raters completed the 10-item rating form
rated working alliance and empathy items as being the most as a Qualtrics survey. The study was approved by the university’s
important indicators of therapist competence, and displaying basic Institutional Review Board.
helping skills and the ability to deal with conflict and negotiate
differences as the least important of the 10 items listed. Training Results
directors on average were most confident that supervisors could
reliably rate basic helping skills and empathy based on video Preliminary Analyses
observation, and were least confident regarding rating how effec- Before proceeding to the generalizability analyses, we attended to
tively trainees implement evidence-based interventions and their issues of missing data and conducted exploratory analyses to better
ability to deal with conflict and negotiate differences. Based on their understand the characteristics of the rating instrument. The 10-item
feedback, the researchers created one final 10-item rating form that rating instrument was based on competency assessment forms used
consisted of five items from graduate program forms, and five from in 102 doctoral training programs and factors empirically linked
the common factors research literature. to therapist outcomes. Our initial item analysis provided insight
into the rating form’s factor structure and guided our treatment
of items as a facet (source of error) in the generalizability analyses.
Procedure R software (version 3.4.1) was used for all data analyses.
Preparation of Video Clips (Segments)
Data Cleaning
Within each therapist–client dyad, three videotaped psychother-
apy sessions were selected: One from the first third of treatment, one One of the 11 raters’ scores was dropped from the final analyses
from the second third, and one from the last third of the available because of a significant number of missing ratings, multiple rating
sessions. Intake and termination sessions, as well as the first and last forms submitted for several clips, and unusually quick completion of
5 min of each session, were excluded as these are less indicative of a the study. All forms for one video clip were also omitted because of
therapist’s skill level. Within each of the three sessions, two 5-min technical difficulties with the video. The final analyses included data
(range = 4–6 min) video clips were created. Clips were identified from 10 raters, 8 therapists, 22 clients, and 129 video clips. A total of
by first using a random number generator to identify two time points 1,291 rating forms were completed. Nine of the 10 raters completed
at least 10 min apart during the session. For each randomly selected forms for every clip; one rater missed one clip, and two raters
time point a research team member watched the video from 5 min submitted the same form twice. In the latter case, the second form
before to 5 min after the time identified. Within that 10-min block, a was used in both instances because the first forms were missing
4–6 min clip was identified that had a reasonably defined “begin- responses, suggesting that they were submitted prematurely. Within
ning” and “end.” To qualify for inclusion, both the therapist and the rating forms themselves, raters completed nearly every item for
client had to be speaking in the clip for at least 30 s each, and the every rating form submitted. The most-skipped item was Goal Con-
therapist had to be delivering an intervention beyond supportive sensus, which was not rated on 19 rating forms (1.5% of all forms).
listening.
These broad selection criteria ensured that therapist skills could Item Analysis
be adequately judged from the clip and that a wide range of clips was
eligible. If these criteria were not met in the randomly selected Table 2 provides the overall mean scores and standard deviations
10-min segment, a new number was randomly generated to identify for each of the 10 rating form items, each rated on a scale from 1 to 5.
a different clip. This selection process ensured that raters could Means ranged from 3.43 (attentiveness to emotion) to 4.04 (positive
understand the context of a given clip and adequately rate the regard) and standard deviations averaged around one point. Thus,
trainees’ competencies. To ensure consistency in selecting clips, there was appreciable variability in ratings both within and between
research team members attended an 1-hr training on the clip items. Interitem correlations, calculated both between and within
Table 2 Not surprisingly, given the strong interitem correlations in

Mean Scores for Rating Form Items Table 3, Cronbach’s α for the 10-item competency rating scale
was high: a = .99 and .96 between and within therapists, respec-
Item M (SD) tively. In addition, a preliminary generalizability analysis (therapists
Alliance: Therapist creates and maintains a 3.85 (0.92) crossed with both raters and items; Raters x Items x Therapists [RIT
strong bond with the client. design]) showed that item-by-therapist interactions account for near-
Empathy: Therapist demonstrates a sensitive 3.91 (0.96) zero variance (varIT = 0.004; 0.4%) in observed scores. These
ability and willingness to understand the
preliminary findings indicate little evidence of discriminant validity
client’s thoughts, feelings, and struggles from
the client’s point of view. between rating form items in the current rating context. Thus, to
Collaboration: Therapist and client actively 3.75 (0.98) simplify reporting and interpretation of results, we ignore items as a
cooperate to achieve treatment goals. facet in the generalizability analyses reported in the next section,
Goal consensus: Therapist and client agree 3.60 (1.00) focusing instead on the sources that are most likely to contribute to
about the nature and origin of the client’s
errors of measurement in the assessment of competence from direct
problem, goals for treatment, and the way that

observation.
the two parties will work together to achieve

these goals.
Positive regard: Therapist treats client in a 4.04 (0.90)
consistently warm, supportive, and highly Generalizability Study: R(Seg:S:C:T)
regarding manner.
Accurate reflection: Therapist accurately 3.77 (1.02) Generalizability analyses partition the variance in outcome scores
perceives, understands, and communicates
into components attributable to the object of measurement (called
client feelings and experiences.
Interpersonal skills: Therapist demonstrates 3.79 (0.95) the target) and to different sources of systematic error (called facets).
effective interpersonal skills. In the present study, clinical competency ratings are the outcome
Communication: Therapist communicates 3.80 (0.94) scores of interest, therapists are the targets of observation, and facets
ideas, feelings, and information clearly using examined are raters, clients, sessions for each client, and segments
verbal and nonverbal skills.
Attentiveness to emotion: Therapist effectively 3.43 (1.14) within those sessions. When facets are crossed in the rating design
attends to client’s emotions and affect. (with one another or with the targets), variance components are also
Intervention effectiveness: Therapist 3.70 (1.00) estimated for their interactions.
demonstrates ability to implement Variance attributable to therapists (T) is valid variance, called
therapeutic treatment interventions
effectively. universe score variance in generalizability theory. This is variance
that is expected to replicate across other rating studies involving the
Note. Mean scores were calculated by averaging ratings across raters, same group of therapists, but different raters (also different clients,
therapists, clients, sessions, and segments; Ns ranged from 1,272 to 1,291.
Items were scored on a scale from 1 to 5.
sessions, and segments). The remaining three sources are considered
error variance in a study where our aim is to derive valid scores
reflecting therapist competence. Rater variance (R), which can be
therapists, are reported in Table 3. Between-therapist item correla- considered a rater main effect, reflects the tendency of raters to use
tions ranged from r = .69 to .98 (median r = .92). Within-therapist different parts of the rating scale. When varR is large, this indicates
item correlations ranged from r = .63 to .82 (median r = .73), that some raters tend to assign higher competency ratings (some-
indicating that when a given therapist was perceived as performing times called leniency bias) whereas others tend to assign lower
well on one competency during a given observation segment, they ratings (severity bias).
were typically rated highly on the other competencies for that Rater-by-therapist interaction variance (RT) represents variance
segment as well. All pairwise item correlations differed significantly due to differential perceptions of the same therapist by different
from zero ( p < .05). raters. Large values of varRT are found when, for example, one rater
Table 3
Within- and Between-Therapist Pearson Correlations Among Rating Form Items
Item 1 2 3 4 5 6 7 8 9 10
1. Alliance — .77** .76** .73** .74** .74** .72** .70** .63** .71*
2. Empathy .91** — .73** .70** .80** .78** .76** .73** .70** .75**
3. Collaboration .95** .79* — .82** .67** .74** .73** .72** .68** .75**
4. Goal consensus .92** .72* .97** — .65** .73** .68** .67** .69** .75**
5. Positive regard .91** .97** .78* .69* — .73** .75** .70** .64** .70**
6. Accurate reflection .94** .93** .89** .88** .85** — .78** .76** .73** .79**
7. Interpersonal skills .98** .93** .90** .84** .93** .94** — .81** .72** .81**
8. Communication .98** .93** .92** .90** .89** .98** .98** — .69** .77**
9. Emotion .88** .97** .80* .76* .90** .96** .89** .93** — .76**
10. Intervention .95** .87** .97** .93** .83** .95** .94** .97** .90** —
Note. Correlations above the diagonal (roman values) are computed within therapist; ns ranged from 120 to 181; those below the diagonal (italicized values)
are computed between therapist; n = 8.
* p < .05. ** p < .01.
consistently perceives Therapist A as highly competent (across sessions (with a given client) in therapist performance. Contrary to
clients and observation units) while a different rater consistently our expectations, C:T variance was also quite modest (2.0%),
perceives Therapist A as lower in competence. Because all raters reflecting relatively low variability in competence ratings for a
observe the same segments for each therapist, a large RT variance given therapist across different clients. As explained earlier, this
component indicates that the raters employ different implicit or C:T variance component confounds the client main effect (C) and
explicit criteria for translating observed therapist behaviors into the client-by-therapist interaction (CT) because of the nested G
competence judgments. study design.
In naturalistic observation of psychotherapy, clients are nested Because Raters are crossed with Segments in the G study, Table 4
within therapists. That is, normally each client sees one therapist, includes estimates of the Rater interaction with each of the other
and each therapist sees multiple clients. In the present study, we facets. In addition to the RT interaction (6.6%), indicating differen-
included sessions (nested within therapist–client dyads) and seg- tial perceptions of therapists, there was some evidence that raters
ments (nested within sessions) as additional sources of error. Thus, showed differential perceptions of therapists with specific clients,
the G study design is multiply nested, with the Raters facet crossed (R(C:T); 2.2%); also of specific sessions for each therapist–client
with Segments, which are nested within Sessions nested within dyad, (R(S:C:T); 2.5%); and especially of specific segments within
Clients nested within Therapists. Nesting is indicated by a “:” in the each session, (R(Seg:S:C:T); 25.0%). The residual variance com-
design notation, R(Seg:S:C:T). ponent (25.5%) reflects variability due to systematic sources of error
When some facets are nested in a G study, their main effects not examined in the G study and to random error.
and interactions will be confounded and cannot be estimated
separately. For example, we can conceptually distinguish between Forecasting G Coefficients for Different Measurement
a client main effect (C) and a client–therapist interaction (CT). The C Designs
component indexes variance due to client (i.e., some clients would
elicit more competent performance from their therapist than others, The variance estimates in Table 4 help us address the question
regardless of which therapist they see). The CT component “What sources contribute to variance in test scores?” (Cronbach &
indexes variance due to the interaction of client and therapist Meehl, 1955), which is important for evaluating reliability as well as
(e.g., Therapist A might perform more competently in treating validity of competence scores (Hoyt et al., 2006). In this section, we
Client X, and less competently with Client Y). Because clients derive forecasts of generalizability (G) coefficients under different
are nested within therapists in the G study, we cannot estimate C rating designs to help guide future users of observer-rated compe-
and CT variance components separately. Instead, we estimate a tency scores.
single component (C:T) which indexes variance attributable to both The G coefficient, similar to Cronbach’s α, is an intraclass
of these sources. correlation coefficient (ICC). It reflects the proportion of variance
in competence scores that is attributable to therapist universe scores.
Variance Estimates This is the proportion of variance that is expected to replicate over
multiple assessments involving different sets of raters, clients,
Variance partitioning for this G study is presented in Table 4. sessions, and segments. Three design considerations for observer-
This table provides information about the proportion of variance due rated competence measures are (a) number of raters, (b) nesting of
to Therapists (4.0%) and shows that both Rater main effects (24.2%) raters within targets, and (c) amount of observation time. Because
and Rater–Therapist interactions (6.6%) are substantial sources of rater-based components (especially R, RT, and RSeg:S:C:T)
error in observer-rated competence. Another large source of error is account for a substantial proportion of observed score variance
Seg:S:C:T (7.5%), which indicates substantial within-session vari- (Table 4), the number of raters who evaluate each trainee will be an
ability in therapist competence. By contrast, the S:C:T component important determinant of the generalizability of competence scores
was quite small (0.6%), reflecting relatively little variation between derived from observer ratings. Just as adding items to a rating scale
improves its internal consistency reliability (Cronbach’s α), aggre-
gating scores over multiple raters will improve the interrater reli-
Table 4
ability of competence assessments.
Estimated Variance Components for R(Seg:S:C:T) Model
Although raters were crossed with therapists in the G study (RT),
Source Variance 95% CI (%) this is not necessarily the case for ratings based on observations
during doctoral training. For example, when ratings are provided by
T 0.040 [0.035, 0.045] 4.0 supervisors during practicum or internship training, the rating
R 0.242 [0.233, 0.251] 24.2
C:T 0.020 [0.015, 0.025] 2.0
design is typically nested (R:T) rather than crossed (RT). It is
S:C:T 0.004 [0.000, 0.011] 0.5 important for training programs to understand how this design
Seg:S:C:T 0.075 [0.067, 0.083] 7.5 feature affects dependability of competency ratings.
RT 0.066 [0.059, 0.073] 6.6 A third feature of the rating context concerns how much data
R(C:T) 0.021 [0.014, 0.030] 2.2 raters have on which to base their evaluations of competence.
R(S:C:T) 0.026 [0.013, 0.037] 2.5
R(Seg:S:C:T) 0.249 [0.235, 0.265] 25.0 Depending on the rating context, evaluators may make competence
Residual 0.254 [0.246, 0.263] 25.5 assessments based on relatively limited data (e.g., observing part of
a single session of work with a single client) or a much more
Note. T = Therapists; R = Raters; C = Clients (nested within Therapists);
S = Sessions (nested within Clients); Seg = Segments (nested within extensive work sample (perhaps including direct observation of
Sessions). 95% CIs were computed by percentile bootstrapping work with multiple clients and multiple sessions for each client). In
(resampling), with 10,000 replicated samples. general, we expect more reliable and valid ratings when these are
based on more extensive observation. Using the variance estimates (see Table 4). If we obtain evaluations from four raters per target
in Table 4, we can quantify how the generalizability coefficient is (nR = 4), taking the average of these four ratings as the compe-
expected to improve as the numbers of clients, sessions, and tence score, the Raters facet contributes varR/4 = 0.06 to the
observation time per session increase. resulting observed score variance. This reduction in error vari-
Predicting G Coefficients From Variance Estimates. The G ance is reflected in a corresponding increase in the G coefficient
coefficient is an ICC that indicates the accuracy of generalization estimated for the nR = 4 nested design, compared with the
from a person’s observed score (in this case, a psychotherapy nR = 1 design.
competence rating) to their universe score (their “true” psychother- When raters are crossed with therapists (RT) in the assessment
apy competence) (Shavelson & Webb, 1991). The G coefficient can design, an additional advantage is gained, in that varR is omitted
be interpreted as the proportion of observed score variance that is from the denominator altogether. When all therapists are evaluated
attributable to trainee universe scores. Alternatively and equiva- by the same set of raters, rater leniency or severity biases are
lently, G coefficients can be interpreted as the estimated correlation constant across targets, so they do not contribute to observed score
between two sets of competence ratings produced using the same set variance, and varR is not included in the denominator when
of measurement procedures, but a different sample of raters, clients, computing Gcrossed.

varT
Gcrossed = varSeg∶S∶C∶T varRSeg∶S∶C∶T varRes :
varT + varnC∶T
C
+ varnCS∶C∶T
nS + nC nS nSeg + varnRRT + var
nR nC + nR nC nS + nR nC nS nSeg + nR nC nS nSeg
RC∶T varRS∶C∶T
and segments. G coefficients range from 0 to 1, with higher values Thus, when we are able to cross raters and targets (i.e., have all
reflecting greater consistency of ratings across conditions (facets). raters observe all therapists) in the measurement design, there is a
Determination of the acceptability of G for a given assessment substantial reduction in the estimated observed score variance
procedure is dependent on the context of measurement, but it is (because R was among the largest sources of systematic variance
helpful to keep in mind that G coefficients that take into account in Table 4). This decreases the denominator of the estimated G
multiple sources of error will inevitably be smaller than classical coefficient, which increases the predicted value of G for that design
reliability coefficients that take into account only a single source of (relative to an equivalent nested design). In the next section, we
error (e.g., Schmidt et al., 2003). explore the joint contribution of design (nested vs. crossed), number
Forecasts of coefficients are computed based on two features of of raters, and observation time to dependability of competence
prospective assessment designs: Whether raters are crossed with or scores based on direct observation.
nested within therapists, and the number of levels of each facet Effects of Number of Raters, Rater Nesting, and Observation
(Raters, Clients per therapist, Sessions per client, and Segments—or Time on Predicted G Coefficients. Figure 1 shows the effect, for
minutes viewed—per session). Because we used 5-min segments in both nested and crossed rating designs, of varying the number of
the G study, observation time for each session was specified in raters per therapist and the observation time for each session
5-min increments. For each assessment design of interest, we observed. G coefficients in this figure are computed holding number
computed a G coefficient by dividing the estimated Therapist of clients per therapist (nC = 2) and number of sessions observed
variance (universe score variance, or varT) by the estimated total per client (nS = 2) constant.1 Observation time per session varies
observed score variance—computed as a linear combination of all from 5 to 50 min, so total observation time per therapist, for each
components that contribute to observed score variance for the rater, varies from 20 to 200 min.
specified measurement design. The impact of nesting is clear from comparing the gray lines
For nested designs (R:T), all components estimated in the G study (Gnested) with the black lines (Gcrossed) in Figure 1. Even with
contribute to observed score variance, so all appear in the denomi- nR = 4, the level of dependability achieved by an assessment
nator of the formula for Gnested. procedure in which raters are nested within targets barely exceeds
varT
Gnested = varSeg∶S∶C∶T varRSeg∶S∶C∶T varRes ,
varT + var
nR + nC + nC nS +
R varC∶T varS∶C∶T
nC nS nSeg + varnRRT + var
nR nC + nR nC nS + nR nC nS nSeg + nR nC nS nSeg
RC∶T varRS∶C∶T
where varX represents the variance estimate for facet X from

Table 4 and nX represents the number of levels of facet X in the
hypothetical assessment design. Note that when more than one
1
level of a facet is included in the assessment procedure, error The median distance between upper and lower confidence limits for
estimated G coefficients (computed by percentile bootstrapping, 10,000
variance in the observed scores is reduced. For example, in a resamples) was Mdn = 0.05; range = 0.01–0.10. The median value for
nested design with a single rater per therapist (nR = 1), the Raters the lower limit of these confidence intervals was Mdn = 0.22; range =
facet contributes varR = 0.24 to the variance in observed scores 0.04–0.58.
Figure 1 number of raters and observation time, training programs might

Predicted G Coefficients When Raters Observe Two Clients Per consider questions like: How many clients do we need to observe for
Therapist and Two Sessions Per Client (nC = 2; nS = 2) each therapist to arrive at dependable ratings? How many sessions
for each client? Do we need to observe the full session or is partial
observation adequate?
Examining Table 4, we tentatively conclude that most of the
variability in competence ratings occurs within sessions (variance
accounted for [VAF] by Seg:S:C:T and Residual components was
8% and 26%, respectively). Therapists appeared to be quite consistent
in their level of competence from session to session in working with a
given client (<1% VAF for S:C:T). Variability from client to client in
therapist competence was not negligible, but relatively low (2% VAF
for C:T)—suggesting that therapist competence ratings may not
change substantially as a function of which client is observed.

To clarify this finding and provide guidance on the importance of

observing each trainee’s work with multiple clients, Figure 2 shows
the impact of varying the number of raters and the number of clients
observed for each therapist, in crossed (RT) rating designs. This
figure holds total observation time constant at 60 min per therapist:
As the number of clients observed for each therapist (nClients)
increases, the observation time allotted to each client (sessions and
that for a crossed design with nR = 1. The G coefficients for the segments-within-sessions) decreases.
nested design reach a ceiling of Gnested < .3, even with four raters Figure 2 suggests that, for a given investment of observation
per therapist and 200 min of observation time for each therapist. time, generalizability will be improved by observing each therapist
Equivalent assessment procedures using crossed designs have with a greater number of clients. Observing 15 min with each of four
dependability from 1.5 to 2.5 times as large as that for nested clients yields more dependable competence ratings than observing
designs. Clearly, the crossed rating design in which all raters 60 min with a single client. Thus, training programs should priori-
observe all therapists is preferable when feasible. tize observation with multiple clients, certainly at least two clients
The impact of number of raters on dependability of competency per therapist, in assessing trainee competence.
ratings is also dramatic. With just a single rater evaluating each
therapist, G coefficients are quite low for crossed as well as nested
designs. Increasing the number of raters produces a marked improve- Validity of Competence Ratings
ment in score dependability, especially in crossed rating designs. Generalizability of scores is a necessary but not sufficient condi-
These findings suggest that evaluations from a single rater are tion for their validity. Therefore, generalizability analysis is only a
unlikely to be generalizable assessments of competence, whereas first step toward establishing validity of competence ratings derived
aggregating ratings from multiple observers, especially in crossed from direct observation. G studies establish that there is consensus
designs, provides much more dependable scores. among raters (and across observations of a given therapist across
Finally, Figure 1 shows that increasing observation time increases clients, sessions, and segments), but this begs the question of
generalizability of competence ratings, but also suggests a point of whether these consensual competence scores accurately predict
diminishing returns on time invested. In Figure 1, we examined
assessment designs in which each rater observes four sessions for Figure 2
each therapist (two sessions each with two different clients). The x-axis Predicted G Coefficients When Raters Observe 60 min Per Thera-
in Figure 1 specifies the number of minutes observed in each of these pist (Raters Crossed With Therapists)
sessions. The trajectories in Figure 1 suggest that there is a marked
improvement in generalizability from observing 10 min per session as
compared to 5 min per session. However, the difference in generaliz-
ability between 20- and 50-min observation times is negligible.
Imagine a training program wishing to optimize resources to
maximize generalizability of competence ratings. The program
currently uses the design in Figure 1, with two raters evaluating
all therapists (RT design), and observation time of 10 min per
session. To improve generalizability, the program is considering
doubling the amount of observation time. Based on Figure 1, it is
clear that doubling the number of raters (nR = 4; nmin = 10) yields a
much better return on observation time invested, in terms of
generalizability, than keeping the number of raters but doubling
the observation time for each rater (nR = 2; nmin = 20).
Effects of Number of Clients per Therapist on G Coefficients,
Holding Total Observation Time Constant. In addition to
therapist effectiveness. The present study, because of its small therapist effects in this small data set, we were not able to conduct
number of therapists (n = 8) cannot provide robust evidence for predictive validity analyses for the competence scores.
validity of competence scores; nonetheless, we consider the nature
of validity evidence for competence ratings and describe how this
Discussion
issue can be addressed in the present data set.
The competencies that are the focus of this G study (relational and The purpose of this study was to provide initial generalizability
intervention domains) are theorized to enhance therapist effective- data on ratings of clinical competence based on direct observation,
ness, leading to better client outcomes. Thus, we should be inter- following procedures that are close to those typically used in clinical
ested in whether competence scores predict between-therapist and counseling psychology doctoral programs to evaluate trainee
variance in outcomes, technically referred to as therapist effects competence. Learning what sources contribute to variance in these
(Baldwin & Imel, 2013). Two considerations in seeking validity ratings can assist evaluators to develop reliable (generalizable)
evidence for predictors of therapist effects are (a) how to define rating procedures—and reliability of ratings is a precondition for
effectiveness and (b) how to obtain generalizable estimates of validity of ratings (Hoyt et al., 2006).
therapist effects.
When researchers have access to session-level outcome data, the

Lessons for Evaluating Clinical Competence
outcome for a given client can be defined as change over time.
Change can be estimated as a difference score (pre–post change; An important finding of this study is that there is consensus
e.g., Minami et al., 2007), or alternatively, as a regression-based among expert raters on trainees’ clinical skill level, even when
change score—the latter metric has the advantage of using all these ratings are based on relatively limited amounts of direct
available session data rather than just pre- and postscores. However, observation. When multiple raters observed the same segments
therapist effects are often estimated from three-level growth models and sessions, there was clear evidence of reliability in their
(Session:Client:Therapist) where therapist characteristics (Level 3) judgments about which therapists were most and least competent
are used to predict client slopes (Level-2). In such models change is in this relatively small sample. With four raters and a moderate
defined as a regression slope with a classical interpretation: the amount of observation time (60 min), we forecast that 50% of
predicted change on the outcome variable for each additional session score variance is attributable to consensus among raters about
of psychotherapy. These three methods for quantifying change are trainee competence levels (i.e., Gcrossed = .50). It is noteworthy
not strictly comparable, and in turn, could lead to divergent estimates to see relatively strong evidence of consensus given that (a) we
of therapist effects. Researchers should consider the philosophical provided no training in the use of the rating scale, as rater training
question of the appropriate definition of change when seeking validity is not the norm in most doctoral programs; and (b) raters had
evidence for competence ratings (Cronbach & Furby, 1970). relatively scant information on which to base their judgments—
The second issue for consideration is how to obtain a robust no acquaintance or interaction with trainees nor knowledge of the
estimate of the therapist effect. It is well known that a large clients outside what they observed in the recorded sessions.
proportion of variance in outcomes is attributable to clients Consensus does not necessarily imply accuracy, but it is a
(Lambert, 2013), so a measure of change for a single client will necessary condition for accuracy in person perception (Kenny,
be a poor indicator of effectiveness of the therapist. Obtaining 2019). Thus, the finding of substantial generalizability of ratings
reliable estimates of therapist variance, often quantified as an in the absence of training and with relatively limited clinical
ICC, requires a large pool of therapists (Baldwin & Imel, 2013), observation provides cautious reassurance about the prospects for
with a sizable number of clients for each therapist (e.g., Kenny validity of competence evaluations by expert raters.
et al., 1998, Table 2). Unreliable therapist effect estimates attenuate Other noteworthy findings pertained to the importance of raters as
correlation coefficients and decrease statistical power in predictive a source of ratings variance, and the relatively limited contribution
validity studies. Variability in the designs used to estimate therapist of clients (and client-by-therapist interactions) to the variance in
effects is a likely explanation for effect size heterogeneity in past observed scores. Both dyadic (RT) and especially rater (R) compo-
research examining correlations between competence and outcome nents contributed strongly to rating variance, as is the case with other
(e.g., Tao et al., 2015). types of performance ratings (Hoyt & Kerns, 1999). As seen in
With IRB approval, we augmented the data set to include outcome Figure 1, this has consequences for optimizing rating design. The
data from additional clients for each trainee therapist. These clients predicted G coefficients are less than adequate for rating procedures
had consented to anonymous research use of their outcome data where raters are nested within trainees, and even for crossed designs
through the training clinic Data Archive at University of Wisconsin– 3–4 raters per trainee are needed for G coefficients to approach the
Madison. This resulted in n = 57 clients for the n = 8 therapists .50 level. Supervisor ratings are a common source of nested
(M = 7.1 clients per therapist; sd = 4.4, range = 2–14). To exam- competence evaluation data in doctoral programs: Each trainee is
ine variance in outcomes among therapists, we computed pre–post rated by a different supervisor, unique to their practicum or intern-
change scores for each client and ran a two-level model (clients at ship site. Figure 1 suggests that these data are unlikely to yield
Level 1 and therapists at Level 2) with no predictors using the R generalizable competence scores, even if multiple supervisors were
package nlme (Pinheiro et al., 2020), to estimate variance in out- available at each site to provide ratings. Programs should therefore
comes at both client and therapist levels. Therapist variance for this be cautious about using these ratings for documenting trainee
model was near zero and not statistically significant, with therapists outcomes or for high-stakes progress decisions. Depending on
accounting for only .004% of outcome variance (ICC = .00004). A how programs are using these ratings, however, this lack of gener-
second analysis using regression-based change scores yielded simi- alizability is not a critical deficit. For example, supervisor ratings
lar results (ICC = .0004, ns). With no evidence of variability in can be meaningful longitudinally (tracking ratings by the same
supervisor(s) over time), and for formative feedback to trainees (see important to gauge how well the present findings generalize to
Hoyt et al., 2021, for a further discussion of this issue). the broader trainee population.
An unexpected finding was the relatively modest contribution of Raters were experienced supervisors for whom evaluating com-
client variance to competence scores. Client factors account for a petence of trainees at various levels was a familiar experience. We
large proportion of outcome variance (Lambert, 2013), and it did not provide rater training because such training does not appear
seemed reasonable to expect that trainee therapists might be judged to be a commonly used strategy when obtaining supervisor ratings of
more or less competent depending on the characteristics of their trainee performance. However, a novel feature of the rating proce-
clients. Evidence for client-based variance was not negligible, dure in this study was that raters had no acquaintance with trainees,
however, and Figure 2 suggests that observation of trainee work and based their ratings only on direct observation of trainees’
with two or more clients is desirable from the standpoint of clinical work. While direct observation is a critical means for
generalizability of ratings. gathering data regarding trainee progress (APA, 2019; Standard
A final noteworthy finding was the lack of evidence for item- II.B.3.d), it is not the only source of data available to evaluators in
based variance, and the very high internal consistency of this 10- naturalistic settings. It is therefore unclear how well the variance
item rating form. The lack of evidence for specific-factor error estimates from the present G study generalize to ratings where there
(Schmidt et al., 2003) suggests that raters tended not to discriminate is access to a wider range of data, and future research should
among items in evaluating clinical competence. In theory, we can examine generalizability of ratings when evaluators are acquainted
imagine a therapist who has high competence in expressing empathy with trainees outside the clinical sessions observed.
but demonstrates weak attention to articulating and agreeing on We note that it is difficult to predict how inclusion of trainees’
goals (goal consensus). In the current rating context, however, behaviors outside the clinical context will affect reliability and
trainees rated high on one item tended to be rated high on all, validity of competence ratings. When raters make inferences about
and scores derived from these ratings may be best interpreted as target characteristics based on observed behaviors, overlap drives
reflecting global clinical competence. To the extent that greater consensus (Kenny, 1991). If each rater relies not just on commonly
discernment among these competency subdomains is desired, addi- observed in-session behaviors, but also on trainee behaviors in
tional measurement development work, and likely rater training supervision sessions or the classroom as a basis for competence
procedures, may be needed. On the other hand, if global clinical ratings, this will reduce overlap of observed behaviors and may
competence ratings are what is desired, the evidence from this initial therefore decrease consensus (generalizability of ratings). On the
G study suggests that the length of the rating form can be reduced other hand, increasing the amount of information available to raters
without adversely affecting generalizability of scores. should increase accuracy, as long as the information gleaned from
observation and interaction outside of sessions is relevant to the
trainee’s clinical competence. If additional out-of-session informa-
Limitations and Future Directions tion increases rater accuracy, consensus should also increase
(Kenny, 2004). In short, it is unclear how reliance on information
Sample size is a common limitation in G studies. Using a fully other than direct observation affects reliability and validity of
crossed rating design (all raters observe all trainees) places practical competence ratings, and how rater access to idiosyncratic personal
constraints on the number of trainees (n = 8) and clients (n = 22) information about trainees may affect variance partitioning and
that can be included. It is reasonable to be concerned that such a generalizability of ratings. More research is needed in this area.
small sample would lead to rather imprecise variance estimates, A final limitation of the present study concerns the inability to
which would limit the utility of the findings. However, trainees and provide evidence of the predictive validity of the competence ratings
clients are only two of the variance sources examined here, and the (lack of therapist variance). Reliability is a precondition for validity,
bootstrapped confidence intervals demonstrate that both variance and our findings suggest that adequate reliability of competence
estimates and G coefficients are estimated with a reasonable degree scores can be achieved by following sound evaluation procedures.
of precision in this study. Although variance components and G However, it is possible that even a highly reliable assessment
coefficients may be reliably estimated with a relatively small sample procedure may not be valid for its intended use (Messick, 1989).
of targets (trainees), much larger numbers of therapists and clients Once rating procedures are refined to yield dependable competence
are needed to establish validity of competence ratings as predictors scores, the next step will be to validate these scores as predictors of
of therapist effectiveness. client outcomes.
An additional concern is the representativeness of the sample and Future research on this topic might consider further exploring the
of the rating procedures. Therapists in this study were all masters or validity of the competency ratings in terms of client outcomes, and
doctoral trainees, all but one of whom were in an initial supervised the impact of rater training on reducing rater variance. High Rater
clinical placement. Range restriction is a risk in this recruitment (R) variance is a major concern in observer-rated competence
procedure and can lead to underestimates of therapist variance (T) evaluations, especially when fully crossing raters and targets is
and therefore also to underestimates of forecast G coefficients. not a viable option. Rater training has been shown to reduce R
However, the rating scale was based in part on the forms used variance and thereby improve the dependability of ratings in other
by doctoral programs for students at this stage of training, so these contexts, and could be a direction for future research in clinical
items are expected to discriminate among competence levels for competency ratings (Roch et al., 2012). In a psychotherapy context,
early-stage trainees. And indeed, there was persuasive evidence of Gonsalvez et al. (2013) have sought to reduce rater bias by
therapist variance, both in the variance estimates and G coefficients. standardizing the meaning of each scale anchor. Their Vignette
Nonetheless, it is prudent to be cautious about interpretation of Matching Assessment Tool (VMAT) asks evaluators to read several
findings from a single sample, and follow-up studies will be brief narratives that describe key features of trainees at specified
developmental levels and then select which vignette best describes Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The
the trainee’s current skills. Examining whether using the VMAT in dependability of behavioral measurements: Theory of generalizability for
rater training or as part of a competency rating scale reduces rater single and multiple observations. Wiley.
bias could help improve the reliability of ratings. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological
tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/
h0040957
Fouad, N. A., Grus, C. L., Hatcher, R. L., Kaslow, N. J., Hutchings, P. S.,
Implications for Assessment of Trainee Competence
Madson, M. B., Collins, F. L., Jr., & Crossman, R. E. (2009). Competency
For doctoral programs, tentative recommendations based on these benchmarks: A model for understanding and measuring competence in
initial findings include collecting competence ratings from multiple professional psychology across training levels. Training and Education in
evaluators and having a rating design as close to fully crossed as Professional Psychology, 3(4, Suppl), S5–S26. https://doi.org/10.1037/
possible—that is, having all evaluators observe and provide ratings a0015832
Gonsalvez, C. J., Bushnell, J., Blackman, R., Deane, F., Bliokas, V.,
for all trainees. It is desirable to observe each trainee with more than
Nicholson-Perry, K., Shires, A., Nasstasia, Y., Allan, C., & Knight, R.
one client, although in our sample improvements to generalizability (2013). Assessment of psychology competencies in field placements:
taper off after 2–3 clients per trainee. In our crossed rating design, Standardized vignettes reduce rater bias. Training and Education in
the dependability of ratings began to plateau after approximately Professional Psychology, 7(2), 99–111. https://doi.org/10.1037/a0031617
90–120 min of observation time per trainee. Gonsalvez, C. J., & Freestone, J. (2007). Field supervisors’ assessments of
Programs often design competency rating forms to encompass trainee performance: Are they reliable and valid? Australian Psychologist,
different domains and different behavioral indicators within each 42(1), 23–32. https://doi.org/10.1080/00050060600827615
domain. Our rating form items focused on the relationship and Grus, C. L., Falender, C., Fouad, N. A., & Lavelle, A. K. (2016). A culture of
intervention domains, which are arguably best assessed by direct competence: A survey of implementation of competency-based education
observation of their application in session. Very high correlations and assessment. Training and Education in Professional Psychology,
10(4), 198–205. https://doi.org/10.1037/tep0000126
among these items, both within and between therapists, suggest that
Hatcher, R. L., Fouad, N. A., Grus, C. L., Campbell, L. F., McCutcheon,
raters made few distinctions between the conceptually different S. R., & Leahy, K. L. (2013). Competency benchmarks: Practical steps
skills and behaviors referred to in the 10-item scale. Thus, the scale toward a culture of competence. Training and Education in Professional
may function well as an assessment of global clinical competence Psychology, 7(2), 84–91. https://doi.org/10.1037/a0029401
(with very high internal consistency reliability). Programs desiring Hoyt, W. T. (2002). Bias in participant ratings of psychotherapy process: An
to obtain separate evaluations of different clinical competency initial generalizability study. Journal of Counseling Psychology, 49(1),
domains may need to alter the measurement procedures represented 35–46. https://doi.org/10.1037/0022-0167.49.1.35
here. Possible strategies include (a) more elaborate rating forms Hoyt, W. T., & Melby, J. N. (1999). Dependability of measurement in
(with multiple items per rating dimension); (b) use of rater training counseling psychology: An introduction to generalizability theory. The
to standardize the rating scale and provide examples of trainee Counseling Psychologist, 27(3), 325–352. https://doi.org/10.1177/0011
000099273003
behaviors for each scale anchor (Gonsalvez et al., 2013); and (c) use
Hoyt, W. T., Warbasse, R. E., & Chu, E. Y. (2006). Construct validity in
of separate, domain-specific performance tasks (e.g., Facilitative
counseling psychology research. The Counseling Psychologist, 34(6),
Interpersonal Skills tasks; Anderson et al., 2016) to elicit behaviors 769–805. https://doi.org/10.1177/0011000006287389
most relevant to each competency dimension. Generalizability Hoyt, W. T., & Kerns, M. D. (1999). Magnitude and moderators of bias in
analysis can be an important tool for refining general and specific observer ratings: A meta-analysis. Psychological Methods, 4(4), 403–424.
measurement procedures, and demonstrating reliability and validity https://doi.org/10.1037/1082-989X.4.4.403
of outcome assessments in graduate training in HSP. Hoyt, W. T., Kring, M., & Hamm, E. H. (2021). Considerations for
dependable assessment of trainee competence: Implications from an
initial generalizability study of ratings based on direct observation
[Manuscript submitted for publication]. Counseling Psychology, Uni-
References versity of Wisconsin—Madison.
American Psychological Association. (2012). Revised competency benchmarks Kenny, D. A. (1991). A general model of consensus and accuracy in interper-
in professional psychology. http://www.apa.org/ed/graduate/benchmarks- sonal perception. Psychological Review, 98(2), 155–163. https://doi.org/10
evaluation-system.aspx .1037/0033-295X.98.2.155
American Psychological Association. (2019). Standards of accreditation Kenny, D. A. (2004). PERSON: A general model of interpersonal percep-
for health service psychology and accreditation operating procedures. tion. Personality and Social Psychology Review, 8(3), 265–280. https://
https://www.apa.org/ed/accreditation/about/policies/standards-of-accre doi.org/10.1207/s15327957pspr0803_3
ditation.pdf Kenny, D. A. (2019). Essentials of interpersonal perception. Guilford Press.
Anderson, T., Crowley, M. E. J., Himawan, L., Holmberg, J. K., & Uhlin, Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social
B. D. (2016). Therapist facilitative interpersonal skills and training status: psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook
A randomized clinical trial on alliance and outcome. Psychotherapy of social psychology (4th ed., pp. 233–265). McGraw-Hill.
Research, 26(5), 511–529. https://doi.org/10.1080/10503307.2015. Kopta, S. M., & Lowry, J. L. (2002). Psychometric evaluation of the
1049671 Behavioral Health Questionnaire-20: A brief instrument for assessing
Baldwin, S. A., & Imel, Z. E. (2013). Therapist effects: Findings and global mental health and the three phases of psychotherapy. Psychother-
methods. In M. Lambert (Ed.), Handbook of psychotherapy and behavior apy Research, 12(4), 413–426. https://doi.org/10.1093/ptr/12.4.413
change (6th ed., pp. 258–297). Wiley. Kopta, M., Owen, J., & Budge, S. (2015). Measuring psychotherapy
Cronbach, L. J., & Furby, L. (1970). How we should measure "change": Or outcomes with the Behavioral Health Measure–20: Efficient and com-
should we? Psychological Bulletin, 74(1), 68–80. https://doi.org/10.1037/ prehensive. Psychotherapy, 52(4), 442–448. https://doi.org/10.1037/
h0029382 pst0000035
Lambert, M. J. (2013). Outcome in psychotherapy: The past and important Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater
advances. Psychotherapy: Theory, Research, & Practice, 50(1), 42–51. training revisited: An updated meta-analytic review of frame-of-
https://doi.org/10.1037/a0030682 reference training. Journal of Occupational and Organizational Psy-
Laska, K. M., Gurman, A. S., & Wampold, B. E. (2014). Expanding the lens chology, 85(2), 370–395. https://doi.org/10.1111/j.2044-8325.2011
of evidence-based practice in psychotherapy: A common factors perspec- .02045.x
tive. Psychotherapy, 51(4), 467–481. https://doi.org/10.1037/a0034332 Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement examination of the effects of different sources of measurement error on
(3rd ed., pp. 13–103). MacMillan. reliability estimates for measures of individual differences constructs.
Minami, T., Wampold, B. E., Serlin, R. C., Kircher, J. C., & Brown, G. S. Psychological Methods, 8(2), 206–224. https://doi.org/10.1037/1082-
(2007). Benchmarks for psychotherapy efficacy in adult major depression. 989X.8.2.206
Journal of Consulting and Clinical Psychology, 75(2), 232–243. https:// Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer.
doi.org/10.1037/0022-006X.75.2.232 Sage Publications.
Norcross, J. C., & Lambert, M. J. (2018). Psychotherapy relationships that Tao, K. W., Owen, J., Pace, B. T., & Imel, Z. E. (2015). A meta-analysis of
work III. Psychotherapy: Theory, Research, & Practice, 55(4), 303–315. multicultural competencies and psychotherapy process and outcome.
https://doi.org/10.1037/pst0000193 Journal of Counseling Psychology, 62(3), 337–350. https://doi.org/10

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., & R Core Team. (2020). .1037/cou0000086
nlme: Linear and nonlinear mixed effects models (R package version
3.1-144), https://CRAN.R-project.org/package=nlme
Price, S. D., Callahan, J. L., & Cox, R. J. (2017). Psychometric investigation Received January 20, 2021
of competency benchmarks. Training and Education in Professional Revision received June 9, 2021
Psychology, 11(3), 128–139. https://doi.org/10.1037/tep0000133 Accepted June 10, 2021 ▪

Evaluating Psychotherapist Competence: Testing The Generalizability of Clinical Competence Assessments of Graduate Trainees

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Evaluating Psychotherapist Competence: Testing The Generalizability of Clinical Competence Assessments of Graduate Trainees

Uploaded by

Copyright:

Available Formats

Journal of Counseling Psychology

© 2021 American Psychological Association 2022, Vol. 69, No. 2, 222–234

Evaluating Psychotherapist Competence: Testing the Generalizability

dependability of psychotherapy competence assessments based on video observation of trainees.

Public Signiﬁcance Statement

Every client of each participating therapist was recruited in person

Trainee therapists (n = 8) Raters (n = 11) Clients (n = 22)

Age (years) 26 (2) 23–28 38 (8) 33–60 28 (9) 19–51

Characteristic n (%) n (%) n (%)

Cisgender woman 5 (63%) 7 (64%) 17 (77%)

Cisgender man 2 (25%) 3 (27%) 4 (18%)

Table 2 Not surprisingly, given the strong interitem correlations in

problem, goals for treatment, and the way that

the two parties will work together to achieve

of measurement procedures, but a different sample of raters, clients, computing Gcrossed.

where varX represents the variance estimate for facet X from

Figure 1 number of raters and observation time, training programs might

change substantially as a function of which client is observed.

To clarify this ﬁnding and provide guidance on the importance of

When researchers have access to session-level outcome data, the

https://doi.org/10.1037/pst0000193 Journal of Counseling Psychology, 62(3), 337–350. https://doi.org/10

You might also like