
Psychotherapy Research

ISSN: 1050-3307 (Print) 1468-4381 (Online) Journal homepage: https://www.tandfonline.com/loi/tpsr20

The therapeutic factor inventory-8: Using item response theory to create a brief scale for continuous process monitoring for group psychotherapy

Giorgio A. Tasca, Christine Cabrera, Elizabeth Kristjansson, Rebecca MacNair-Semands, Anthony S. Joyce & John S. Ogrodniczuk

To cite this article: Giorgio A. Tasca, Christine Cabrera, Elizabeth Kristjansson, Rebecca MacNair-
Semands, Anthony S. Joyce & John S. Ogrodniczuk (2016) The therapeutic factor inventory-8:
Using item response theory to create a brief scale for continuous process monitoring for group
psychotherapy, Psychotherapy Research, 26:2, 131-145, DOI: 10.1080/10503307.2014.963729

To link to this article: https://doi.org/10.1080/10503307.2014.963729

Published online: 08 Oct 2014.

Psychotherapy Research, 2016
Vol. 26, No. 2, 131–145, http://dx.doi.org/10.1080/10503307.2014.963729

EMPIRICAL PAPER

The therapeutic factor inventory-8: Using item response theory to create a brief scale for continuous process monitoring for group psychotherapy

GIORGIO A. TASCA¹, CHRISTINE CABRERA², ELIZABETH KRISTJANSSON², REBECCA MACNAIR-SEMANDS³, ANTHONY S. JOYCE⁴, & JOHN S. OGRODNICZUK⁵

¹Department of Psychology, The Ottawa Hospital – General Campus, Ottawa, ON, Canada; ²Department of Psychology, University of Ottawa, Ottawa, ON, Canada; ³Department of Psychology, University of North Carolina at Charlotte, Charlotte, NC, USA; ⁴Department of Psychiatry, University of Alberta, Edmonton, AB, Canada & ⁵Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada
(Received 14 November 2012; revised 29 July 2014; accepted 25 August 2014)

Abstract
Objective: We tested a very brief version of the 23-item Therapeutic Factors Inventory-Short Form (TFI-S), and describe
the use of Item Response Theory (IRT) for the purpose of developing short and reliable scales for group psychotherapy.
Method: Group therapy patients (N = 578) completed the TFI-S on one occasion, and their data were used for the IRT
analysis. Of those, 304 completed the TFI-S and other measures on more than one occasion to assess sensitivity to change,
concurrent, and predictive validity of the brief version. Results: Results suggest that the new TFI-8 is a brief, reliable, and
valid measure of a higher-order group therapeutic factor. Conclusion: The TFI-8 may be used for continuous process
measurement and feedback to improve the functioning of therapy groups.

Keywords: group psychotherapy; statistical methodology; test development; therapeutic factors; item response theory

Correspondence concerning this article should be addressed to Giorgio A. Tasca, Department of Psychology, The Ottawa Hospital – General Campus, 501 Smyth Road – Room 4428, Ottawa ON K1H 8L6, Canada. Email: gtasca@ottawahospital.on.ca
This article was originally published with errors. This version has been updated. Please see Corrigendum (http://dx.doi.org/10.1080/10503307.2014.981446).

© 2014 Society for Psychotherapy Research

The 99-item Therapeutic Factors Inventory (TFI; Lese & MacNair-Semands, 2000; MacNair-Semands & Lese, 2000) was initially developed to assess the 11 group psychotherapy factors defined by Yalom (1995). These group therapeutic factors have been described as the key mechanisms by which change occurs in all types of therapy groups (Corsini & Rosenberg, 1955; Yalom & Leszcz, 2005). However, the length of the original TFI was a barrier to its widespread use (Roy, Turcotte, Montminy, & Lindsay, 2005). The TFI short form (TFI-S) was subsequently developed with 23 items (MacNair-Semands, Ogrodniczuk, & Joyce, 2010) and later with 19 items (TFI-19; Joyce, MacNair-Semands, Tasca, & Ogrodniczuk, 2011) in order to improve its reliability and validity, and to make it more accessible. The TFI-S and TFI-19 had parallel four-factor solutions of instillation of hope, secure emotional expression, awareness of relational impact, and social learning. Joyce and colleagues reported good evidence for predictive validity, as the TFI-19 scales measured early in group therapy predicted treatment outcomes at posttreatment. The authors also reported good concurrent validity with a measure of group engagement and good discriminant validity evidenced by a low correlation with a scale of social desirability (Joyce et al., 2011).

Researchers have argued that group therapeutic factors can be represented by a higher-order single or essential factor (Burlingame, Fuhriman, & Johnson, 2002). This single factor may represent expressed emotion (Castonguay, Pincus, Agras, & Hines, 1998) or cohesion (Budman, Soldz, Demby, Davis, & Merry, 1993; Burlingame et al., 2002) in groups. In fact, although the construct validity of the four-factor TFI-19 was supported by using confirmatory factor analysis, the correlations among the four TFI-19 factors were high, suggesting that a single over-arching therapeutic factor construct may be present (Joyce et al., 2011).

The goals of the current study were to (i) develop and test a very brief version of the TFI in order to adapt the measure to make it more feasible for clinical purposes or for research purposes in routine clinical settings and (ii) describe the use of Item Response Theory (IRT; Baker, 2001; Furr & Bacharach, 2008) for the purpose of developing short and reliable scales for group psychotherapy. Similar pragmatic concerns motivated the development of brief versions of individual therapeutic alliance measures such as the Agnew Relationship Measure-5 (ARM-5; Cahill et al., 2012). This trend is consistent with recent approaches to repeatedly assess processes and outcomes in psychotherapy in order to track client progress and report these back to therapists (e.g., Lambert & Shimokawa, 2011). Lambert and Shimokawa completed a meta-analysis in which they found that such feedback improves therapists’ responsiveness to clients, especially for the cases in which clients are deteriorating. A similar argument can be made for group psychotherapy processes. If group therapists are continually aware, through repeated assessment and feedback, of the state of therapeutic factors in the group, then therapists may be able to respond immediately to address any problematic issues in the group. Such repeated assessments would be simplified if a brief, valid, and reliable measure of group psychotherapeutic factors was available. Accordingly, a main goal of our study was to use IRT analysis to develop a very brief TFI.

IRT Data Analysis

To achieve this with the TFI-S, we first reduced the number of scale items by using an IRT (Baker, 2001; Furr & Bacharach, 2008) approach. IRT comprises modeling techniques that provide a great deal of item-level information; it has many advantages over classical test theory (CTT). For example, item discrimination and difficulty parameters derived from IRT analyses (described below) are unaffected by sample variability. Furthermore, IRT allows researchers to calculate standard errors of measurement at each point along a latent trait scale so that one can select items that provide maximum precision at different levels of the trait (Embretson & Reise, 2000). IRT can be used to evaluate the properties of an existing scale, to optimally shorten this scale when necessary, and to evaluate the performance of this reduced scale (Edelen & Reeve, 2007).

Although IRT models have a long history (e.g., Lord, 1953), their use in psychotherapy research is relatively new (Doucette & Wolf, 2009). Our review of the literature did not reveal any IRT-based study in the group psychotherapy literature. As Doucette and Wolf (2009) argued, psychotherapy researchers often assume measurement precision rather than carefully assessing the quality of measurements used to support theories and decisions about psychotherapy.

CTT considers a test score to consist of the trait level that is being measured and measurement error. This is depicted in the following equation:

\[ \text{Observed score} = \text{True score} + \text{error}. \]

IRT, on the other hand, looks at each individual scale item and considers the level of the measured trait, measurement error, and characteristics of the item, including item discrimination, item difficulty, and guessing (Furr & Bacharach, 2008). This addition of detailed item characteristic information sets IRT apart from CTT.

The item characteristic curve is the “basis of IRT” (Edelen & Reeve, 2007, p. 6). It characterizes the relationship between the probability of correct response on the item, the amount of the latent trait, and the item characteristics. The equation below depicts one of the most common IRT models, the two-parameter or 2PL model (Harris, 1989), which is used with dichotomous (1/0) items.

\[ P(X = 1 \mid \Theta, \alpha, \beta) = \frac{e^{\alpha(\Theta - \beta)}}{1 + e^{\alpha(\Theta - \beta)}} \]

According to this model, the probability of a correct response is dependent on the respondent’s underlying ability and two item characteristics, item difficulty and item discrimination. Item difficulty (β) is an index of how high one’s trait level (Θ or theta) must be to achieve a particular score on an item. The term “difficulty” is used because IRT models were originally developed to assess educational abilities. There, an item was “difficult” if a high level of educational ability was required to answer the item. In our context, difficulty refers to the amount of the trait (e.g., group therapeutic factor) required to score highly on a TFI item. In dichotomous items, the β parameter in an IRT model represents the trait level necessary to respond above a certain threshold with at least a .50 probability (Baker, 2001; Furr & Bacharach, 2008).

Item discrimination (α) is an index of how well the item distinguishes between people with contiguous trait levels, especially those who are high as opposed to those who are low on a trait. For example, a highly discriminatory item will differentiate a participant
who is high in a group therapeutic factor trait from a participant who is moderate or low on that group therapeutic factor trait. Item discrimination is analogous to an item-factor loading or item-test correlation in CTT. Hence, an item that has a high discrimination value is a better indicator of the latent trait. A scale with many items with low discrimination is likely to have inaccurate total scores (Baker, 2001; Furr & Bacharach, 2008).

Polytomous items have multiple response options. Common polytomous IRT models include the Nominal Model for nominal or nonordered responses (Bock, 1972), the Partial Credit Model (PCM; Masters, 1982), the Generalized Partial Credit Model (GPCM; Muraki, 1992), and the Rating Scale Model (Andrich, 1982) for items which may be ordered. Finally, the Graded Response Model (GRM; Samejima, 1969) is designed for items which are clearly ordered along a response continuum (e.g., Likert items; Hays, Morales, & Reise, 2000; Ostini & Nering, 2006; Templin, n.d.). The PCM produces a “dichotomous Rasch model for each pair of adjacent item categories” (Ostini & Nering, 2006, p. 34). The GPCM is an extension of the PCM that allows discrimination to vary. The Rating Scale Model is a version of the PCM for items with the same format (Ostini & Nering, 2006; Templin, n.d.). Currently, there are no well-known 1PL models for ordered polytomous data (Templin, n.d.). The GRM for ordinal items assesses both location (difficulty) and discrimination; it is regularly used in practical testing situations as it is more stable and has fewer data demands than other polytomous models (Ostini & Nering, 2006; Templin, n.d.). We selected the GRM because the items in the TFI are clearly ordered and because of the aforementioned advantages.

The GRM assesses the probability that the item response will be in category k (i.e., representing a response option) or higher. In this model, each item has one discrimination parameter and several β parameters (k − 1 of them, where k is the number of response categories), which are termed “threshold” parameters. These thresholds represent the place on the scale on which the probability of selecting a response is .50, compared to all other responses that are ordinally higher (Van Dam, Earleywine, & Borders, 2010).

Combining information from discrimination and difficulty parameters allows one to assess an item’s performance, or reliability, at different levels of the trait. This is represented by the IRT concept of item information, which is graphically displayed on an item information curve (IIC). These curves display the range of trait levels over which the item provides the most information (measures most reliably); items can be very reliable at some levels of ability or trait and less reliable at other levels (Edelen & Reeve, 2007). For example, an item may discriminate well between people who have high levels of the group therapeutic factor trait, but not discriminate as well between people at the lower end of group therapeutic factor trait dimension. Furthermore, some items provide more overall information than others. IICs can be summed to produce an information curve for the full scale called the test information curve (TIC), which represents the relative precision of a scale across different levels of the trait continuum. Such detailed and nuanced information about items and tests represents a major advance over CTT methods, which only provide one estimate of reliability for all ability levels.

Prior to conducting IRT analyses, two assumptions must be met: (i) the underlying latent trait must be unidimensional, and (ii) at a given level of Θ, or ability, the response of any one item must not depend on the response to any other item (i.e., local independence; DeMars, 2010).

As described above, one of the main purposes of the current study was to develop and test a very brief version of the TFI in order to adapt the measure for clinical purposes or for research purposes in routine clinical settings. Since Joyce and colleagues (2011) suggested that the TFI-S factors could be represented by a single higher-order construct, we hypothesized that the TFI scale could be reduced to a small set of highly reliable items drawn from the original TFI-S factors but presented as a unidimensional scale.

Method

Participants

The IRT analysis was conducted on data combined from three previously published studies that used similar versions of the TFI scales (Joyce et al., 2011; Lese & MacNair-Semands, 2000; MacNair-Semands et al., 2010). For the multilevel modeling (MLM) analyses to assess concurrent, discriminant, and predictive validity, only data from the study published by Joyce and colleagues were used. Participants from the first study (Lese & MacNair-Semands, 2000) were 77 undergraduate and graduate students who participated in counseling and support groups at university counseling centers in three major universities in the Northeast, the Southeast, and the Southwest of the USA; 59 were women; 70 were White, 1 African-American, and 4 were Hispanic/Latino. The mean age was 25.50 years (SD = 8.34). Groups included open-ended, structured, and support therapy groups. Participants from the second TFI study (MacNair-Semands et al., 2010)
were 174 patients consecutively admitted to a day treatment program (DTP) at a university hospital in Edmonton, Canada. Their average age was 37.20 years (SD = 10.65). Most (65%) were women, 91% were White, and 71% received a DSM-IV (American Psychiatric Association, 2000) Axis II disorder diagnosis. The most prevalent Axis I disorder was major depression (70%). Participants attended the DTP all day, five days a week for 18 weeks. The data were collected during the self-awareness group within the DTP, a small, insight-oriented group (approximately eight members) that individual patients attended throughout their tenure in the DTP (Piper, Rosie, Joyce, & Azim, 1996). Participant data from the third TFI study (Joyce et al., 2011) were collected at eight sites across North America. These sites included two urban university counseling centers (one including six groups with 43 participants; one including eight groups with 84 participants); outpatient services at three local hospitals in Vancouver, Canada (one including four groups with 53 participants; two including two groups with 15 participants); an outpatient clinic system in Calgary, Canada (six cohorts including 35 participants); and an outpatient binge-eating disorder treatment program in a general hospital in Ottawa, Canada (12 groups with 103 participants). After exclusions due to nonattendance and missing data, a total of 51 different groups with 380 participants provided data. The sample from Joyce et al.’s (2011) study included 267 women and 93 men (missing data on gender for 20 members), with a mean age of 36.28 years (SD = 13.6). Of the available sample, 85.5% were Caucasian, 5.0% were African-American, 2.8% were Asian/Asian-American, 1.7% were multiracial, 1.1% were Persian/Arabic, 1.4% were Latino/Hispanic, 1.1% were East Indian, and 1.4% reported “other” ethnicity.

Measures

Therapeutic Factors Inventory-Short Form (TFI-S). The 23-item TFI-S is a self-report measure designed to assess individual group members’ perceptions of four broad therapeutic factors: instillation of hope, secure emotional expression, awareness of relational impact, and social learning (MacNair-Semands et al., 2010). The TFI-S items are rated on a 7-point Likert scale that ranges from 1 (Strongly Disagree) to 7 (Strongly Agree). In the previous work, the four factors had coefficient alpha values between 0.71 and 0.91 (MacNair-Semands et al., 2010), indicating adequate internal consistency of the scales (Clark & Watson, 1995).

Group Climate Questionnaire-Short Form (GCQ-S). The GCQ-S (MacKenzie, 1983) is a self-report measure designed to assess individual members’ perceptions of a group’s therapeutic environment. The GCQ-S has 12 items rated on a 7-point Likert scale indicating extent of agreement ranging from 0 (not at all) to 6 (extremely). The items provide for the scoring of three subscales: engagement (five items), avoidance (three items), and conflict (four items). The engagement scale was used in this study to assess group cohesion. It consists of items that call for ratings on the degree of self-disclosure, cognitive understanding, and confrontation occurring in the group. In the present study, based on administration of the measure at Week 4 in all groups of Joyce et al.’s (2011) sample, the engaged scale had a coefficient alpha of .74 (mean inter-item r = .41). The mean inter-item correlation and coefficient alpha indicate adequate internal consistency for the engaged scale (Clark & Watson, 1995).

Brief Symptom Inventory (BSI). The BSI-18 is an 18-item instrument that measures psychological distress (Derogatis, 2000). The BSI-18 is composed of six items each for the dimensions of somatization, depression, and anxiety, scored on a Likert scale of 0–4 to reflect the degree of distress experienced during the preceding two weeks. An overall score representing general level of symptom distress (Global Severity Index) was calculated. For Joyce et al.’s (2011) sample at pretreatment, the Global Severity Index had a coefficient alpha of .93.

Social desirability. We assessed social desirability using the desirability scale of the Personality Research Form (Jackson, 1984). The desirability scale is a 16-item, true–false, self-report measure. Higher scores reflect a greater tendency to provide socially desirable responses. At pretreatment in Joyce et al.’s (2011) study sample, coefficient alpha was .73.

Procedure

After a description of the project by the research coordinator at each site, participants provided written informed consent to participate. The original projects and data collection were approved by the affiliated research ethics board at each site. Prior to the start of their groups, participants completed a demographics questionnaire. For data collected in Lese and MacNair-Semands’ (2000) study, participants completed the TFI-S at one time point during their attendance in their group. For MacNair-Semands et al.’s (2010) study, patients completed
the TFI-S on three occasions (Weeks 4, 10, and 16 of the 18-week program). Ratings from Week 4 were used for the data analysis. For data collected in Joyce et al.’s (2011) study, the TFI-S and GCQ-S were completed by most participants at Weeks 4, 8, and 12, but individuals at sites with rolling membership (i.e., day hospitals) only completed the measures at Weeks 4 and 12. Participants at all sites in Joyce et al.’s (2011) study also completed the social desirability and BSI-18 measures at pretreatment, and the BSI-18 at postsession 12.

Checking Assumptions

PRELIS, a component of the LISREL 8 program (Mels, 2006), was used to examine the assumption of unidimensionality.¹ To do this, we ran a principal components analysis (PCA) on the 23-item scale and checked the ratio of the first to second eigenvalue. According to Lord (1980), this ratio should be at least 3 to 1 in order for the scale to be considered unidimensional. We also ran a PCA on the shortened scale (TFI-8) developed after our first IRT analysis. Finally, we ran a confirmatory factor analysis (CFA) comparing the model fit of the original four-factor model (Joyce et al., 2011) to a model with a higher-order single factor. A chi-square difference test and two information criteria, the Bayesian information criterion (BIC) and the corrected Akaike information criterion (CAIC), were used to assess the comparative fit of the models.

The assumption of local independence for the eight items on the TFI-8 was checked through an analysis of the residual correlation matrix produced as an output from the IRT-FIT program (Bjorner, Smith, Stone, & Sun, 2007). This analysis is a test of whether or not there are substantial dimensions left over after the IRT analysis is performed (Linacre, 2014).

IRT Analysis

The IRT analysis was conducted on all available TFI-S data from Lese and MacNair-Semands’ (2000) and MacNair-Semands et al.’s (2010) studies, and the Session 4 TFI-S data from Joyce et al.’s (2011) study. As the TFI-S has clearly ordinal response categories, the GRM was used for IRT analysis (Embretson & Reise, 2000; Samejima, 1969). The analysis produced one item discrimination (α) and six item difficulty values (β; k − 1) for each item, and three graphics: an IIC and an item category characteristic curve (ICC) for each item, and the TIC for the scale as a whole. Generally, items with discrimination values (α) ranging from 0 to .24 are considered to be very poorly discriminating, those with values of .25–.64 are poorly discriminating, those with values of .65–1.34 are moderate, items with values of 1.35–1.69 are highly discriminating, and items with values above 1.70 are very highly discriminating (Baker, 2001). Typically, item difficulty values (β) range between −3 and +3, with values rising from the lowest to the highest response category within an item indicating a greater amount of latent trait needed to endorse a higher-level response category (i.e., higher levels of the underlying trait should correspond to endorsing “7” as opposed to “4” on the TFI; DeMars, 2010). An IIC depicts the level on the trait scale where a specific item is most reliable. The ICC graphically depicts the relationship between the particular response to an item and the latent trait, each potential response being contingent on trait level, item discrimination, and item difficulty (Baker & Kim, 2004). The TIC indicates whether the scale is more reliable at certain levels of the latent trait (in this case, group therapeutic factor) and depicts the degree of measurement error across the scale. Also, the output generates a marginal reliability index (MRI) that serves as the IRT-based measure of internal consistency (similar to Cronbach’s alpha in CTT; Florida Department of Education, 2005). As in CTT, a minimal acceptable level for the MRI is 0.80 (de Ayala, 2009).

IRT analysis with this data set was conducted using the MULTILOG 7.03 program (Thiessen, 1991). Maximum likelihood methods were used to estimate item discrimination (α) and item difficulty (β) parameters. For further information regarding the item response procedure used by the program, see the MULTILOG manual (Thiessen, 1991).

Multilevel Modeling

First, we tested the group effect with the intra-class correlation coefficient (ρ) to evaluate dependence in the hierarchically nested data with a hierarchical linear modeling (HLM) approach using methods described by Tasca, Illing, Joyce, and Ogrodniczuk (2009). To assess sensitivity to change of the new brief TFI scale, data were analyzed from Weeks 4, 8, and 12 of each group from Joyce et al.’s (2011) study (N = 304 individuals and 47 groups with available data). To evaluate the degree of change across assessments while accounting for group dependencies in the data, a three-level longitudinal HLM approach was undertaken (Tasca et al., 2009). One advantage of HLM is that all the data are used, i.e., cases were not deleted if ratings from an assessment were missing. Level 1 of the model represented the repeated measurement of the TFI scale scores across the three time points within each individual group member. Level 2 represented estimated initial scores (intercepts) and rates of change (slopes) of
individuals nested within the distinct therapy groups in the sample. Level 3 represented intercepts and slopes for the distinct groups themselves. The time parameter at Level 1 was log transformed to model a more pronounced change in TFI-8 scores from Sessions 4–8, and less pronounced change from Sessions 8–12. To assess if change in TFI-8 scores was related to change in symptoms, we used BSI-18 residual change scores calculated by regressing BSI-18 scores at Session 12 on to BSI scores at pretreatment. BSI-18 residual change scores were entered as group-centered predictors at Level 2, and their group level relationship with TFI-8 slopes was examined at Level 3 of the model. Models were developed and effect sizes (pseudo-R²) were assessed by adding predictors in sequential models (Tasca et al., 2009). Appendix 1 shows the full three-level HLM.

To assess predictive validity, the new brief TFI at Week 4 was evaluated as a predictor of post-treatment status on the BSI-18 outcome variable. Status at Week 12 (post-treatment) was assessed with the BSI-18, while controlling for status at Week 4 (i.e., BSI-18 at baseline). A two-level, hierarchically structured HLM approach (participants nested within groups) was employed (Tasca et al., 2009). Error terms for the intercept coefficients at Level 2 (group level) were left free to vary. Appendix 1 also shows the HLM for this two-level model.

Results

The TFI-S was administered to 621 therapy group participants from three studies (Joyce et al., 2011; Lese & MacNair-Semands, 2000; MacNair-Semands et al., 2010); 578 of these provided sufficient data on the TFI-S. This instrument was originally composed of 23 items; however, only 22 items were examined in this study. Item 20 (“This group helps empower me to make a difference in my own life”) from the instillation of hope therapeutic factor was removed from the IRT analysis due to insufficient cases caused by a clerical error in one of the study samples.

Checking Assumptions for IRT

Unidimensionality. Joyce and colleagues (2011) proposed the TFI-S as a measure of four latent therapeutic factors, which may preclude unidimensionality. However, the estimates provided by Joyce and colleagues (2011) for the final structural model of the scales revealed high correlations among the four factors, suggesting a common higher-order construct. First, we conducted a nonlinear PCA to test the unidimensionality of the TFI-S (Hambleton & Rovinelli, 1986; Hattie, 1984, 1985) using PRELIS from the LISREL 8 program (Du Toit, Du Toit, Mels, & Cheng, n.d.). We found that the first component accounted for 46.32% of the variance in the scores, which was larger than the recommended criterion of 20% (Reckase, 1979). The second component was substantially smaller than the first, accounting for only 10.29% of the variance. In our analyses, the first eigenvalue (36.7) had a ratio of 4:1 to the second eigenvalue (8.19), suggesting that there was one dominant dimension (Lord, 1980). A test of the 8-item scale also showed unidimensionality; the ratio of the first (14.5) to the second eigenvalue (3.42) was also 4:1. Second, we used CFA to compare a model with a higher-order single factor to the original four-factor model tested by Joyce et al. (2011). The chi-square difference test between the two CFA models was not significant, Δχ²(2) = 2.25, p = .33; however, the information criteria BIC (1174.51) and CAIC (1224.51) for the single-factor higher-order model were smaller than the respective BIC (1183.98) and CAIC (1235.98) for the four-factor model. Taken together, this evidence suggests that the TFI can be represented by a single higher-order factor.

Local independence. The items of the TFI-8 were tested for local independence using the residual correlation matrix produced by IRT-FIT.¹ First, we assessed the size of the correlations; the largest one was .25, which is small using Cohen’s criteria. Second, we ran a linear factor analysis on the residual correlation matrix using SPSS. Linacre (2009) suggested that an eigenvalue of 2.00 or more indicates local dependence. The largest eigenvalue was 1.44, which was well below Linacre’s criterion of 2.00. Third, following Bjorner’s suggestion (personal communication), we ran a one-factor confirmatory factor analysis with M-PLUS using unweighted least squares estimation. The CFA results also supported the hypothesis of local independence. The comparative fit index (CFI) was 0.56 and the Tucker-Lewis index (TLI) was 0.39, which are well below the currently recommended cut-off of 0.95 (Hooper, Coughlan, & Mullen, 2008). Furthermore, the loadings on the factor were relatively low, ranging from 0.036 to 0.377. The RMSEA, however, was 0.057, which is just below the recommended cut-off of 0.06 for good fit. Altogether, the results indicated that there was not an important dimension remaining after using IRT.

IRT Analysis

TFI-S. Discrimination values for all items are presented in Table I. The discrimination parameters ranged from α = 0.63 to α = 2.27. Item 15 had poor discrimination; items 1, 3, 4, 5, 7, 8, 13, 21, and 23
Table I. Discrimination values of the 22 items of the TFI-S.

Item (and group therapeutic factor) | Discrimination | Standard error
1. Because I have got a lot in common with other group members, I am starting to think that I may have something in common with people outside group too. (Social Learning) | 1.48 | 0.16
2. Things seem more hopeful since joining group. (Instillation of Hope) | 1.98 | 0.18
3. I feel a sense of belonging in this group. (Secure Emotional Expression) | 2.08 | 0.17
4. I find myself thinking about my family a surprising amount in group. (Awareness of Relational Impact) | 0.92 | 0.15
5. Some times I notice that in group I have the same reactions or feelings as I did with my sister, brother, or a parent in my family. (Social Learning) | 0.74 | 0.12
6. In group I have learned that I have more similarities with others than I would have guessed. (Instillation of Hope) | 1.83 | 0.20
7. It is okay for me to be angry in group. (Secure Emotional Expression) | 0.92 | 0.14
8. In group I have really seen the social impact my family has had on my life. (Awareness of Relational Impact) | 1.2 | 0.17
9. My group is kind of like a little piece of the larger world I live in: I see the same patterns, and working them out in group helps me work them out in my outside life. (Social Learning) | 1.77 | 0.19
10. Group helps me feel more positive about my future. (Instillation of Hope) | 2.88 | 0.23
11. It touches me that people in group are caring of each other. (Secure Emotional Expression) | 1.63 | 0.23
12. I pay attention to how others handle difficult situations in my group so I can apply these strategies in my own life. (Awareness of Relational Impact) | 1.76 | 0.20
13. In group sometimes I learn by watching and later imitating what happens. (Social Learning) | 0.95 | 0.13
14. This group helps me recognize how much I have in common with other people. (Instillation of Hope) | 2.35 | 0.23
15. In group, the members are more alike than different from each other. (Secure Emotional Expression) | 1.19 | 0.14
16. It is surprising, but despite needing support from my group, I have also learned to be more self-sufficient. (Awareness of Relational Impact) | 1.73 | 0.16
17. This group inspires me about the future. (Instillation of Hope) | 2.70 | 0.22
18. Even though we have differences, our group feels secure to me. (Secure Emotional Expression) | 1.85 | 0.22
19. By getting honest feedback from members and facilitators, I have learned a lot about my impact on other people. (Awareness of Relational Impact) | 1.81 | 0.18
21. I get to vent my feelings in group. (Secure Emotional Expression) | 1.49 | 0.16
22. Group has shown me the importance of other people in my life. (Awareness of Relational Impact) | 2.12 | 0.18
23. I can “let it all out” in my group. (Secure Emotional Expression) | 1.48 | 0.15

Note. Item 20 data were not available due to clerical error in one of the samples.

had moderate discrimination; items 2, 6, 11, 12, 16, 18, 19, and 22 had high discrimination; and items 9, 10, 14, and 17 were very highly discriminatory. Notably, all of the factors had some items that were moderately discriminatory with the exception of the instillation of hope factor. This factor had a majority of highly discriminatory items (three of the five items for this factor).

On the basis of this original analysis, eight items were selected for further IRT analysis using a polytomous GRM (Embretson & Reise, 2000; Samejima, 1969). The selected items were the two most discriminatory items from each of the four therapeutic factors. Consequently, the conceptual basis of the TFI-S was preserved while the discriminatory capacity was maximized.

TFI-8. Residual correlation values for the eight items appear in Table II. Discrimination values from a separate run for the final eight items ranged from α = 1.19 to α = 2.66 (see Table III). Items 1, 6, 9, and 18 were highly discriminatory and items 3, 12, 17, and 19 were very highly discriminatory. Table IV contains the difficulty values (β) for the final eight items. Across all items, increasing levels of the underlying trait were associated with choosing a higher-level response category (e.g., a participant perceiving more of the group therapeutic factor was more likely to choose a “6” over a “3” response). Within each item, the distance between the highest and lowest difficulty value was between 3.6 and 5.3 units, indicating that response categories were spread reasonably well across the trait range. However, there were somewhat more difficulty values around and below the midpoint (0) of the group therapeutic factor range, so this scale will measure average and lower levels of the group therapeutic factor particularly well.

For illustration purposes, Figure 1 displays the ICCs and IICs for items 3 (α = 1.85) and 18 (α = 1.42). In an ICC, the x-axis represents the trait level (i.e., the group therapeutic factor) and the y-axis represents the probability of a response. The average amount of latent trait is set at 0, with a standard deviation of 1. When an item is more discriminatory, the slope of the response curves is quite steep while those of less discriminatory items slope more gradually (DeMars, 2010). The ICC graph for item
Table II. Residual correlation values from the IRT analysis of the TFI-8.

Item TFI-1 TFI-3 TFI-6 TFI-9 TFI-12 TFI-17 TFI-18

TFI-1 1
TFI-3 0.05 1
TFI-6 0.19 0.06 1
TFI-9 0.11 –0.02 0.09 1
TFI-12 –0.12 –0.13 –0.09 –0.07 1
TFI-17 0.03 0.02 0.04 –0.02 –0.09 1
TFI-18 0.01 0.25 0.03 –0.01 –0.01 0.07 1
TFI-19 –0.01 –0.05 –0.02 0.05 –0.06 0.04 0.06

3 depicts steeper slopes for all potential item responses (i.e., ranging from 1 “Strongly Disagree” to 7 “Strongly Agree”) compared to the more gradual slopes of the item response curves for item 18. On the IICs, the x-axis represents the trait level and the y-axis represents the reliability of score information gathered by the specific item at distinct levels of the trait. Item 18 functions across the entire range of the trait. Item 3 functions more reliably than item 18, although not at the highest level of the trait. Correspondingly, the final difficulty parameter for item 18 is β6 = 2.47, indicating that a higher level of the trait corresponds with a 50% chance of endorsing this response. ICCs and IICs for all other items are presented in Appendix 2.

The graph of the TIC for the TFI-8 is presented in Figure 2. The x-axis represents the trait level (group therapeutic factor) and the y-axis represents the reliability of score information that has been gathered by the test items at particular levels of the latent group therapeutic factor trait. In Figure 2, the solid line represents the test information and the hatched line represents the standard error of measurement of the latent trait. The average amount of latent trait is set at 0 on the x-axis, with a standard deviation of 1 (DeMars, 2010). The TIC is a broad-based unimodal curve that indicates that the TFI-8 scale is reliable across an extensive range of the latent trait, including for those who demonstrate below average and average levels of the group therapeutic factor. The TFI-8 is less sensitive at high levels of the latent trait (>2 SD above the mean trait level), as indicated by the steep downward slope at the high end of the x-axis (Figure 2). Finally, the MRI for the TFI-8 was found to be 0.88, indicating acceptable internal consistency.

Hierarchical Linear Modeling
The intra-class correlation coefficient (ρ; Tasca et al., 2009) indicated that 12% of the variance in TFI-8 scores and 9% of the variance in TFI-8 slopes were attributable to the group effect. These results suggest that the use of three-level HLM to analyze these data was appropriate.

Sensitivity to change. The item means and standard deviations for the TFI-8 scale scores were as follows: At Week 4, M = 4.71, SD = 1.06, n = 312; at Week 8, M = 5.10, SD = 1.02, n = 186; and at Week 12, M = 5.19, SD = 1.17, n = 226. Due to differences in data collection across sites (e.g., certain groups collected TFI-S ratings only at

Table III. TFI-8 items with discrimination and standard error.

Item (and original TFI-S factor)   Discrimination α   Standard error

1. Because I’ve got a lot in common with other group members, I’m starting to think that I may have something in common with people outside group too. (Social Learning)   1.19   0.36
3. I feel a sense of belonging in this group. (Secure Emotional Expression)   1.85   0.14
6. In group I’ve learned that I have more similarities with others than I would have guessed. (Instillation of Hope)   1.47   0.37
9. My group is kind of like a little piece of the larger world I live in: I see the same patterns, and working them out in group helps me work them out in my outside life. (Social Learning)   1.64   0.13
12. I pay attention to how others handle difficult situations in my group so I can apply these strategies in my own life. (Awareness of Relational Impact)   2.66   0.52
17. This group inspires me about the future. (Instillation of Hope)   2.17   0.16
18. Even though we have differences, our group feels secure to me. (Secure Emotional Expression)   1.42   0.37
19. By getting honest feedback from members and facilitators, I’ve learned a lot about my impact on other people. (Awareness of Relational Impact)   1.75   0.14
[Figure 1 panels: item characteristic curve (left; x-axis: ability, -3 to 3; y-axis: probability, 0 to 1.0; one curve per response category 1 to 7) and item information curve (right; x-axis: scale score, -3 to 3; y-axis: information, 0 to 2.5).]

Figure 1. IICs (right) and item response category characteristic curve (left) for items 3 (top) and 18 (bottom).
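
As a rough illustration of what the curves in Figure 1 encode, the following sketch computes graded response model category probabilities for items 3 and 18 from the discrimination values in Table III and the difficulty values in Table IV. It assumes the logistic form of the model without a 1.7 scaling constant, which the text does not specify; the function name `grm_category_probs` is hypothetical and this is not the authors' estimation procedure.

```python
import math

def grm_category_probs(theta, alpha, betas):
    """Category probabilities under a logistic graded response model.

    Assumed form: P*(k) = 1 / (1 + exp(-alpha * (theta - beta_k))).
    The absence of a 1.7 scaling constant is an assumption.
    """
    # Cumulative probability of responding above each threshold.
    cum = [1.0 / (1.0 + math.exp(-alpha * (theta - b))) for b in betas]
    # Pad with 1 (lowest category or higher) and 0 (above the highest).
    bounds = [1.0] + cum + [0.0]
    # A category probability is the difference of adjacent cumulatives.
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]

# Parameters copied from Tables III and IV.
ITEM_3 = (1.85, [-2.69, -1.86, -1.05, -0.25, 0.72, 1.62])
ITEM_18 = (1.42, [-2.18, -0.93, -0.27, 0.92, 1.68, 2.47])

for label, (alpha, betas) in (("item 3", ITEM_3), ("item 18", ITEM_18)):
    probs = grm_category_probs(0.0, alpha, betas)
    print(label, [round(p, 3) for p in probs])
```

Under this form, a respondent whose trait level equals an item's final difficulty value (for example, 2.47 for item 18) has a probability of exactly 0.5 of choosing the highest category, which is the 50% interpretation given in the text.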

Week 4 and at Week 12, while other groups collected ratings at each of the three assessments), missing data, and treatment noncompletion, the individual n available at each assessment varied. As indicated, one advantage of HLM is that reliable parameters can be estimated even with missing data. The three-level HLM models of change appear in Appendix 1. Results indicated that there was a significant increase in TFI-8 scores from Week 4 to Week 8 that was maintained to Week 12, γ100 = 0.92, t(45) = 7.82, p < .001, and the size of the effect was large, pseudo-R² = .33. Further, BSI residual change scores were negatively related to the significant slope in TFI-8 scores over time, γ120 = –0.74, t(45) = 3.01, p = .00, pseudo-R² = .11. This result indicates that at the group level a greater increase in therapeutic factors during therapy was associated with better symptom outcomes, and conversely a lesser increase in TFI-8 was associated with poorer symptom outcomes. These findings suggest that the TFI-8 is sensitive to change across time.

Predictive validity. To assess predictive validity, a two-level random effects HLM was used (Appendix 1). Greater TFI-8 scores at Week 4 significantly predicted reductions in the BSI-18 General Severity Index, γ20 = –0.11, t(40) = 3.54, p < .001, accounting for 42% of the symptom outcome variance.

Concurrent and discriminant validity. Concurrent validity was assessed by correlating the TFI-8 given at Week 4 from Joyce et al.’s (2011) sample with available data (n = 300) from the Engaged GCQ scale. As expected, there was a large positive relationship between the TFI-8 and the engaged scale, r = .55, p < .001. We also found evidence of discriminant validity in that the correlation between
[Figure 2 plot: test information (left y-axis, 0 to 10) and standard error (right y-axis, 0 to 0.61) against scale score (x-axis, -3 to 3). Test information curve: solid line. Standard error curve: dotted line.]

Figure 2. TIC of Therapeutic Factor Inventory–8.

the TFI-8 and the social desirability scale was small, r = .18, p = .003.

Discussion
The major goal of this study was to reduce the length of the TFI-S in a psychometrically sound manner in order to make it even more efficient for clinical and research use in practice settings and to assess the shortened scale’s reliability and validity. The brief TFI-8 items demonstrated very good to excellent item discrimination. Although the TFI-8 performed well overall, the TIC indicated that the measure was better at assessing mid- and lower levels of the group therapeutic factor. The TFI-8 had very good marginal internal consistency, indicating that the items measure the group therapeutic factor construct reliably.

The TFI-8 also demonstrated good concurrent validity by correlating highly with the engaged scale of the GCQ-S. This suggests that the TFI-8 is closely related to group cohesion, but does not completely overlap the cohesion construct, thus representing a related but separate group therapeutic factor. The TFI-8 is not greatly affected by social desirability, which supports the TFI-8’s discriminant validity. We made efforts both to use the IRT method and to select items that reflected each of the four original factors from the TFI-S so that not only does the TFI-8 perform well psychometrically but it also continues to cover the content of all aspects of the original TFI-S scale constructs. Hence, we conceptualize the single TFI-8 factor as feeling hopeful about the processes of emotional expression and relational awareness, which then translate into and promote social learning.

A key finding of this study was that assessment of the group therapeutic factor by the TFI-8 early in group therapy was associated with positive change in general symptom severity, and the effect was large. This lends support to the TFI-8’s predictive validity and to the use of the TFI-8 as a tool for regular feedback to therapists about the ongoing functioning of a group. The TFI-8 changed over time so that most groups felt more hopeful about emotional expression leading to interpersonal learning. In addition, greater change in the group therapeutic factor was associated with greater positive changes in symptoms, and conversely, less positive change in the TFI-8 was associated with poorer outcomes.

Table IV. Difficulty values for TFI-8.

Item β1 β2 β3 β4 β5 β6

1 –2.85 (0.88) –1.76 (0.61) –0.7 (0.34) 0.54 (0.32) 1.13 (0.41) 2.39 (0.73)
3 –2.69 (0.25) –1.86 (0.15) –1.05 (0.10) –0.25 (0.08) 0.72 (0.09) 1.62 (0.14)
6 –3.03 (0.89) –1.48 (0.43) –1.22 (0.38) –0.19 (0.23) 0.85 (0.30) 2.24 (0.54)
9 –3.19 (0.33) –2.23 (0.21) –1.44 (0.14) –0.72 (0.10) 0.24 (0.09) 1.41 (0.14)
12 –2.15 (0.41) –1.47 (0.29) –0.82 (0.20) –0.03 (0.17) 0.45 (0.17) 1.40 (0.27)
17 –2.70 (0.25) –2.03 (0.16) –1.49 (0.12) –0.90 (0.08) –0.02 (0.07) 0.98 (0.09)
18 –2.18 (0.65) –0.93 (0.37) –0.27 (0.29) 0.92 (0.32) 1.68 (0.43) 2.47 (0.70)
19 –2.86 (0.28) –1.91 (0.17) –1.27 (0.12) –0.32 (0.08) 0.58 (0.09) 1.48 (0.41)
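
The description of the difficulty values in the Results can be checked directly against Table IV. The sketch below recomputes, for each item, the distance between the highest and lowest difficulty value, and counts how many values fall below the trait midpoint of 0; the variable names are illustrative only.

```python
# Difficulty values (beta_1 ... beta_6) copied from Table IV.
DIFFICULTY = {
    1: [-2.85, -1.76, -0.70, 0.54, 1.13, 2.39],
    3: [-2.69, -1.86, -1.05, -0.25, 0.72, 1.62],
    6: [-3.03, -1.48, -1.22, -0.19, 0.85, 2.24],
    9: [-3.19, -2.23, -1.44, -0.72, 0.24, 1.41],
    12: [-2.15, -1.47, -0.82, -0.03, 0.45, 1.40],
    17: [-2.70, -2.03, -1.49, -0.90, -0.02, 0.98],
    18: [-2.18, -0.93, -0.27, 0.92, 1.68, 2.47],
    19: [-2.86, -1.91, -1.27, -0.32, 0.58, 1.48],
}

# Distance between the highest and lowest difficulty value per item.
spread = {item: max(values) - min(values) for item, values in DIFFICULTY.items()}

# Number of difficulty values below the midpoint (0) of the trait range.
below = sum(1 for values in DIFFICULTY.values() for value in values if value < 0)
total = sum(len(values) for values in DIFFICULTY.values())

for item, width in spread.items():
    print(f"item {item}: spread = {width:.2f}")
print(f"{below} of {total} difficulty values are below 0")
```

With the tabled values, the spreads run from about 3.55 (item 12) to 5.27 (item 6), in line with the reported range of 3.6 to 5.3 units, and 31 of the 48 difficulty values lie below 0, consistent with the scale measuring average and lower trait levels particularly well.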

One of our motivations for reducing the length of the TFI-S was to make the TFI-8 more user-friendly for use in busy clinical settings. The TFI-8 could help to flag problematic group therapy processes that are prognostic of poorer treatment outcomes among group participants. The procedure of frequently assessing outcomes and providing feedback to therapists can reduce the number of patients who deteriorate in the context of individual therapy (Lambert & Shimokawa, 2011). Repeated assessments with the TFI-8 may also allow the group therapist to immediately identify a rupture in the relationship with the group or an individual that requires a specific intervention in order to repair group functioning (Safran, Muran, & Eubanks-Carter, 2011). For frequent and repeated administrations, the TFI-8 represents an efficient use of resources, and it may function in clinical and research contexts to identify problematic group sessions or participants who are not benefiting from the group. This may be particularly useful for continuous feedback to group therapists in training to supplement supervisory discussions.

There are now a few user-friendly descriptions of the IRT method (e.g., Doucette & Wolf, 2009) and demonstrations of the use of IRT for scale development (e.g., Fraley, Waller, & Brennan, 2000). There are also relatively easy-to-use computer programs to conduct IRT, such as the MULTILOG program (Thiessen, 1991). Despite this, we could not find any published examples of the application of IRT to develop group therapy scales. IRT is an important method for developing and testing group psychotherapy measurement tools in order to make measurement more precise and reliable. As demonstrated here, IRT is also an excellent method of reducing longer unidimensional scales to a more concise measure consisting of the best functioning items. In the case of the TFI-8, the scale is particularly good at discriminating among those with low-to-moderate levels of the factor and is less effective at the higher end of the group therapeutic factor dimension. Hence, the TFI-8 may be most useful for newer groups or groups working toward achieving an adequate level of group therapeutic factor and may not work as well for well-functioning groups who have achieved a high level of group hopefulness, interpersonal learning, and emotional awareness and expression.

This study is consistent with recent research that reflects a move away from highly overlapping factors that are broad and unevenly balanced to a single- or higher-order conceptualization of a group therapeutic mechanism. One of the implications is that trainees who are taught about the 11 therapeutic factors as delineated by Yalom and Leszcz (2005) need education to begin to intentionally facilitate the overall factor of hope, emotional expression, awareness of relational impact, and social learning. Clinically, this may require group therapists to highlight and reflect upon these factors when they emerge in order to improve therapist skills and create better patient retention and outcome.

Limitations
This study had some limitations. First, the samples were drawn from groups in a variety of settings, including day hospitals, outpatient departments, counseling centers, and community agencies. Nevertheless, these may not represent the full range of groups offered in the community, nor are the individuals entirely representative of all those who attend intervention groups, and thus the results may be limited in generalizability by the sample of groups. However, the number and variety of groups represented in this study is larger and more varied than is common for group therapy research, thereby enhancing the ecological validity of this study. Second, sample participants were predominantly White, which may have resulted in biased ratings and may also reduce generalizability. Replicating the results with a more diverse sample would enhance the generalizability to individuals from minority groups. Third, the IRT analyses were conducted on individual-level data, but the data were nested within therapeutic groups. Nested data may result in dependence in the data and inflated Type I error. We used HLM to model change in the outcomes at the group level for this reason. IRT assumes that the item responses are solely dependent on the underlying trait. Therefore, it is possible that the therapeutic group membership may be influencing responses on the TFI-8 (Fox, 2007). For this reason, the IRT results may need to be interpreted with some caution. The HLM analyses of sensitivity to change and predictive validity of the TFI-8 were conducted at the group level, and so those results are likely reliable. Fourth, unfortunately, one item from the original TFI-S was not available for all samples and so this may have limited the pool of items representing the TFI construct. However, the item was missing from the TFI-S hope factor, which has several items with excellent ICC values. Finally, although shorter tests provide practical advantages over longer tests especially for continuous monitoring, some have argued that shorter tests may be less reliable due to larger measurement error (Emons, Sijtsma, & Meijer, 2007). This problem may be mitigated somewhat with the TFI-8 as only items with optimal discriminatory and difficulty values were selected. Nevertheless, in contexts where the

TFI-8 might be used for clinical decision-making purposes, we suggest that the TFI-8 be combined with other sources of information to improve reliability of the decisions.

Conclusions
The use of very brief and reliable repeated measurements in clinical and research contexts is growing due to the benefits of systematic feedback to therapists about their clients’ functioning and the status of group processes and outcomes. IRT allows the group psychotherapy researcher to scrutinize the quality of their measurements and potentially to reduce longer unidimensional scales to more practical lengths. IRT is a psychometric approach that is only recently being used in individual psychotherapy research (Doucette & Wolf, 2009), and this study is the first to use IRT in a group therapy context. IRT may result in better measurements of group therapy processes and outcomes, with the goals of reducing measurement error, increasing precision in defining a construct that supports a theory, and informing clinical decisions regarding the functioning of a group. Results from this study using IRT suggest that the new TFI-8 is a brief, reliable, and valid measure of a higher-order group therapeutic factor. The TFI-8 may be used for continuous process measurement and feedback to improve the functioning of therapy groups.

Acknowledgement
This study was supported by a grant from the Group Psychotherapy Foundation and by a grant from the Canadian Institutes for Health Research. Giorgio A. Tasca holds the Research Chair in Psychotherapy Research, University of Ottawa and the Ottawa Hospital.

Note
1. This was done as the TFI-8 is the final scale and thus it should show no local dependence.

References
American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders (4th ed., text rev.). Washington, DC: Author.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika, 47(1), 105–113. doi:10.1007/BF02293856
Baker, F. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Dekker.
Bjorner, J., Smith, K., Stone, C., & Sun, X. (2007). IRT-FIT: A macro for item fit and local dependence tests under IRT models. Lincoln, RI: Quality Metrics.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51. doi:10.1007/BF02291411
Budman, S. H., Soldz, S., Demby, A., Davis, M., & Merry, J. (1993). What is cohesiveness? An empirical examination. Small Group Research, 24(2), 199–216. doi:10.1177/1046496493242003
Burlingame, G. M., Fuhriman, A., & Johnson, J. E. (2002). Cohesion in group psychotherapy. In J. C. Norcross (Ed.), Psychotherapy relationships that work (pp. 71–88). New York, NY: Oxford University Press.
Cahill, J., Stiles, W. B., Barkham, M., Hardy, G. E., Stone, G., Agnew-Davies, R., & Unsworth, G. (2012). Two short forms of the Agnew Relationship Measure: The ARM-5 and ARM-12. Psychotherapy Research, 22(3), 241–255. doi:10.1080/10503307.2011.643253
Castonguay, L., Pincus, A., Agras, W., & Hines, C. (1998). The role of emotion in group cognitive-behavioral therapy for binge eating disorder: When things have to feel worse before they get better. Psychotherapy Research, 8, 225–238. doi:10.1080/10503309812331332327
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. doi:10.1037/1040-3590.7.3.309
Corsini, R. J., & Rosenberg, B. (1955). Mechanisms of group psychotherapy: Processes and dynamics. The Journal of Abnormal and Social Psychology, 51, 406–411. doi:10.1037/h0048439
de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford.
DeMars, C. (2010). Item response theory: Understanding statistics. New York, NY: Oxford University Press.
Derogatis, L. R. (2000). Brief symptom inventory 18: Administration, scoring, and procedures manual. Minneapolis, MN: National Computer Systems Pearson.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374–389. doi:10.1080/10503300902894422
Du Toit, S., Du Toit, M., Mels, G., & Cheng, Y. (n.d.). LISREL for windows – PRELIS user’s guide. Lincolnwood, IL: Scientific Software International.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16(S1), 5–18. doi:10.1007/s11136-007-9198-0
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers.
Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2007). On the consistency of individual classification using short scales. Psychological Methods, 12(1), 105–120. doi:10.1037/1082-989X.12.1.105
Florida Department of Education. (2005). FCAT handbook – A resource for educators. Tallahassee, FL: Harcourt.
Fox, J. P. (2007). Multilevel IRT modelling in practice with the package mlirt. Journal of Statistical Software, 20(5), 1–16.
Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78, 350–365. doi:10.1037/0022-3514.78.2.350
Furr, R., & Bacharach, V. (2008). Item response theory and Rasch models. In Authors (Ed.), Psychometrics: An introduction. Thousand Oaks, CA: Sage.
Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10, 287–302. doi:10.1177/014662168601000307
Harris, D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8(1), 35–41. doi:10.1111/j.1745-3992.1989.tb00313.x
Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19(1), 49–78. doi:10.1207/s15327906mbr1901_3
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139–164. doi:10.1177/014662168500900204
Hays, R., Morales, L., & Reise, S. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(9 suppl), 1128–1142.
Hooper, D., Coughlan, J., & Mullen, M. R. (2008). Structural equation modelling: Guidelines for determining model fit. Electronic Journal of Business Research Methods, 6, 53–60.
Jackson, D. N. (1984). Personality research form manual (3rd ed.). Port Huron, MI: Research Psychologists.
Joyce, A. S., MacNair-Semands, R., Tasca, G. A., & Ogrodniczuk, J. S. (2011). Factor structure and validity of the Therapeutic Factors Inventory–Short Form. Group Dynamics: Theory, Research, and Practice, 15, 201–219. doi:10.1037/a0024677
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48(1), 72–79. doi:10.1037/a0022238
Lese, K. P., & MacNair-Semands, R. R. (2000). The therapeutic factors inventory: Development of a scale. Group, 24, 303–317. doi:10.1023/A:1026616626780
Linacre, J. M. (2014). A user’s guide to Winsteps, Ministeps Rasch Model Computer Programs. Retrieved from http://www.winsteps.com/winman/principalcomponents.htm
Linacre, J. M. (2009). Local Independence and residual covariance: A study of Olympic figure skating ratings. Journal of Applied Measurement, 10, 157–169.
Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750–751. doi:10.1037/h0063675
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
MacKenzie, K. R. (1983). The clinical application of a group measure. In R. R. Dies & K. R. MacKenzie (Eds.), Advances in group psychotherapy: Integrating research and practice (pp. 159–170). New York, NY: International Universities.
MacNair-Semands, R. R., & Lese, K. P. (2000). Interpersonal problems and the perception of therapeutic factors in group therapy. Small Group Behavior, 31(2), 158–174. doi:10.1177/104649640003100202
MacNair-Semands, R. R., Ogrodniczuk, J. S., & Joyce, A. S. (2010). Structure and initial validation of a short form of the Therapeutic Factors Inventory. International Journal of Group Psychotherapy, 60, 245–281. doi:10.1521/ijgp.2010.60.2.245
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174. doi:10.1007/BF02296272
Mels, G. (2006). LISREL for windows: Getting started guide. Lincolnwood, IL: Scientific Software International, Inc.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176. doi:10.1177/014662169201600206
Ostini, R., & Nering, M. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.
Piper, W. E., Rosie, J. S., Joyce, A. S., & Azim, H. F. A. (1996). Time-limited day treatment for personality disorders: Integration of research and practice in a group program. Washington, DC: American Psychological Association.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230. doi:10.2307/1164671
Roy, V., Turcotte, D., Montminy, L., & Lindsay, J. (2005). Therapeutic factors at the beginning of the intervention process in groups for men who batter. Small Group Research, 36(1), 106–133. doi:10.1177/1046496404270261
Safran, J. D., Muran, J. C., & Eubanks-Carter, C. (2011). Repairing alliance ruptures. Psychotherapy, 48(1), 80–87. doi:10.1037/a0022140
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplements, 17.
Tasca, G. A., Illing, V., Joyce, A. S., & Ogrodniczuk, J. S. (2009). Three-level multilevel growth models for nested change data: A guide for group treatment researchers. Psychotherapy Research, 19, 453–461. doi:10.1080/10503300902933188
Templin, J. (n.d.). IRT models for polytomous data. Lecture 4. ICPSR item response theory workshop.
Thiessen, D. (1991). Multilog (Version 6) [Computer program]. Mooresville, IN: Scientific Software.
Van Dam, N. T., Earleywine, M., & Borders, A. (2010). Measuring mindfulness? An item response theory analysis of the Mindful Attention Awareness Scale. Personality and Individual Differences, 49, 805–810. doi:10.1016/j.paid.2010.07.020
Yalom, I. D. (1995). The theory and practice of group psychotherapy (4th ed.). New York, NY: Basic Books.
Yalom, I. D., & Leszcz, M. (2005). The theory and practice of group psychotherapy (5th ed.). New York, NY: Basic Books.

Appendix 1. Hierarchical linear models

Three-level longitudinal model (for sensitivity to change)

Level 1: $Y_{tij} = \pi_{0ij} + \pi_{1ij} \times (\text{log time}) + e_{tij}$

Level 2: $\pi_{0ij} = \beta_{00j} + \beta_{01j} \times (\text{individual prescores}) + \beta_{02j} \times (\text{BSI residual}) + \zeta_{0ij}$
$\pi_{1ij} = \beta_{10j} + \beta_{11j} \times (\text{individual prescores}) + \beta_{12j} \times (\text{BSI residual}) + \zeta_{1ij}$

Level 3: $\beta_{00j} = \gamma_{000} + \gamma_{001} \times (\text{group prescore}) + \lambda_{00j}$
$\beta_{01j} = \gamma_{010} + \lambda_{01j}$
$\beta_{02j} = \gamma_{020} + \lambda_{02j}$
$\beta_{10j} = \gamma_{100} + \gamma_{101} \times (\text{group prescore}) + \lambda_{10j}$
$\beta_{11j} = \gamma_{110} + \lambda_{11j}$
$\beta_{12j} = \gamma_{120} + \lambda_{12j}$

The dependent variable ($Y_{tij}$) in this model is the TFI-8 score. The growth models shown here used a log transformation for “time” to model a more pronounced change from Session 4 to 8, and less pronounced change from Sessions 8 to 12. Individual scores were group mean centered and group scores were grand mean centered.

Two-level hierarchically nested models (for predictive validity)

Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j} \times (\text{preBSI score}) + \beta_{2j} \times (\text{TFI-8 Week 4}) + \zeta_{ij}$

Level 2: $\beta_{0j} = \gamma_{00} + \lambda_{0j}$
$\beta_{1j} = \gamma_{10} + \lambda_{1j}$
$\beta_{2j} = \gamma_{20} + \lambda_{2j}$

The dependent variable ($Y_{ij}$) in this model is the Brief Symptom Inventory (BSI) General Severity Index scale score at Session 12. Individual BSI prescores and TFI-8 scale scores at Week 4 were grand mean centered.
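
To make the effect of the log transformation of time concrete, the sketch below compares the increments in the time predictor between consecutive assessments. It assumes that time is coded as the natural log of the week number (4, 8, 12); the appendix does not state the exact coding, so this is only one plausible reading rather than the authors' specification.

```python
import math

WEEKS = (4, 8, 12)  # assessment points named in the text

# Assumed coding (hypothetical): natural log of the week number.
log_time = [math.log(week) for week in WEEKS]

# Change in the time predictor between consecutive assessments.
increments = [later - earlier for earlier, later in zip(log_time, log_time[1:])]

for week, value in zip(WEEKS, log_time):
    print(f"week {week}: log time = {value:.3f}")
print("increments:", [round(step, 3) for step in increments])
```

Under this coding, a single linear slope on log time implies a larger predicted change from Week 4 to Week 8 than from Week 8 to Week 12, because the first increment (ln 2, about 0.693) exceeds the second (ln 1.5, about 0.405).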

Appendix 2
Item Characteristic and Item Information Curves for Six of the Eight Group Therapeutic Factor Inventory (TFI-8) Items (Curves for Items
3 and 18 Appear in Figure 1).

[Appendix 2 panels: for each of Items 1, 6, 9, 12, 17, and 19 (panel numbers 1, 3, 4, 5, 6, and 8), an item characteristic curve (x-axis: ability, -3 to 3; y-axis: probability, 0 to 1.0; one curve per response category 1 to 7) and an item information curve (x-axis: scale score, -3 to 3; y-axis: information, 0 to 2.5).]
