Exploring the Meaning of Motivation Across Cultures:
IRT Analyses of the Goal Instability Scale

Alex Casillas
University of Iowa

E. Matthew Schulz
Steven B. Robbins
ACT, Inc.

Paulo Jorge Santos
Porto University

Richard M. Lee
University of Minnesota, Twin Cities

JOURNAL OF CAREER ASSESSMENT, Vol. 14 No. 4, November 2006, 472-489
DOI: 10.1177/1069072705283764
© 2006 Sage Publications

Correspondence concerning this article should be directed to Steven B. Robbins,
Applied Research, ACT, Inc., Iowa City, IA 52243-1681; e-mail: steve.robbins@act.org.

The present study uses item response theory (IRT) to establish comparability
between the English and Portuguese versions of the Goal Instability Scale (GIS), a
measure of generalized motivation. A total of 2,848 American and 679 Portuguese
high school students were administered their respective language versions of the
GIS. Results showed only minor differences in item performance between language
versions, suggesting equivalent measurement of the underlying motivational
construct. Implications regarding the interpretation of IRT analyses for intervention
purposes, as well as future research, are discussed.

Keywords: cross-cultural assessment, goal instability scale, item response theory, motivation, test adaptation

Over the past few decades, increased immigration patterns have created (or aug-
mented) multicultural societies and global education and workforce/vocational
domains (Casillas & Robbins, 2005). As societies face the challenge of addressing
the needs of increasingly diverse populations, many testing professionals have
turned to U.S.-derived psychological, academic, and workforce tests with the
logic that “if an instrument has been shown to be reliable and valid in one cul-
tural context, it may hold potential for benefiting consumers in other cultures”
(Arnold & Matus, 2000, p. 122). Within the academic achievement arena, the
Third International Mathematics and Science Study (Gonzales et al., 2000)
serves as a cogent example of the systematic effort to adapt and track math and
science performance across 100 countries.
One of the challenges of adaptation is to translate “an instrument developed
in one culture and language into the language of the second culture, while pre-
serving the integrity and meaning of the original instrument” (Hulin, Drasgow, &
Komocar, 1982, p. 818). Indeed, the ultimate purpose of instrument adaptation
is not to generate scales with the same scores in both languages but to ensure that
the score metrics are equivalent in both languages and that differences in norms
have not been created artifactually by the translation process (Hulin et al., 1982).
Consequently, with the increased demand for high-quality test adaptations,
researchers and consumers have emphasized the need to evaluate the fidelity of
adapted tests to the original version using a variety of psychometric methods (e.g.,
International Test Commission, 2000; Van de Vijver & Hambleton, 1996).
The present study is concerned with establishing construct equivalence of a
specific measure of generalized motivation, the Goal Instability Scale (GIS;
Robbins & Patton, 1985). The GIS is a 10-item, self-report measure of difficulty
initiating action, setting goals, and maintaining the drive to complete such goals.
The English GIS version was constructed using a rational-empirical approach
based on factor analytic procedures. For each of the 10 items, the respondent
endorses a Likert-type scale ranging from 1 (strongly agree) to 6 (strongly dis-
agree). Higher agreement with GIS items produces lower scores and indicates
greater goal instability (i.e., reduced motivation). Confirmatory factor analyses
using diverse groups have shown that the GIS taps a unitary construct (e.g.,
Noble, Roberts, & Sawyer, 2003; Robbins, Payne, & Chartrand, 1990).
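Because the unweighted total across the 10 items is the operative score here and throughout the analyses below, a minimal scoring sketch may help. This is our illustration in Python, not code from the GIS materials.

```python
def gis_total(responses):
    """Unweighted total score for the 10 GIS items.

    Each response is on the 1 (strongly agree) to 6 (strongly disagree) scale,
    so lower totals reflect stronger agreement, that is, greater goal instability.
    """
    if len(responses) != 10 or not all(1 <= r <= 6 for r in responses):
        raise ValueError("expected 10 responses scored 1-6")
    return sum(responses)  # possible range: 10-60

# Example: strong agreement with every item yields the minimum score of 10,
# the most goal-unstable (least motivated) result the scale can produce.
print(gis_total([1] * 10))  # -> 10
```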
We chose to adapt the GIS for use by Portuguese secondary school students
because, as a generalized measure of motivation, it is highly associated with aca-
demic outcomes (cf. Lese & Robbins, 1994; Robbins et al., 2004; Thombs, 2002).
The GIS is frequently used with normal and “at-risk” student populations where
an easily administered, global self-report measure of motivation is desirable. We
believe that Portugal is at a crossroads regarding its educational environment and
needs to adopt practical measures associated with both at-risk behaviors and posi-
tive academic outcomes. As Robbins et al. (2004) pointed out, motivation is a key
factor in understanding both the academic persistence and performance of students.
In particular, research by the Organization for Economic Cooperation and
Development (OECD) has found that Portuguese adolescents drop out of school
at a higher rate than their peers in many other OECD countries; according to 1998 statistics, only
49% of higher education students complete their degrees (Fundação da
Juventude, 1999). More recent statistics also suggest that the percentage of
Portuguese adolescents and young adults in the 18- to 24-year age group who did
not finish a secondary education was quite high (45%) compared with other
members of the European Union (M = 19%; Ministry of Education, 2003). As a
member of the European Union, Portugal has considerable interest in—and
commitment to—aiding the educational achievement and eventual work success
of its populace. We believe that by measuring the motivation of Portuguese
secondary school students and comparing them with students in other cultures, we
may assist in understanding the reasons for the aforementioned drop-out rates and
developing effective interventions.
In an initial study by Santos, Casillas, and Robbins (2004), the GIS was trans-
lated into Portuguese, revised by a language and context expert, and pilot tested
in a student sample as part of the translation process. Confirmatory factor analysis
using each language version resulted in similar one-factor solutions. Convergent/
discriminant validity estimates within the Portuguese population suggested that
the GIS construct was operating as expected (Santos et al., 2004). These results
established that the Portuguese version of the GIS (GIS-P) has similar dimension-
ality and relations with external variables as the English version.
The purpose of this study is to further compare the internal meaning of English
and Portuguese versions of the GIS. Comparisons of internal meaning are gener-
ally based on criteria for measurement invariance, which means that items func-
tion the same way across groups or cultures. Criteria for measurement invariance
were first formulated from the perspective of exploratory factor analysis (Hui &
Triandis, 1985). Later, criteria were formulated within the framework of structural
equation modeling (Steenkamp & Baumgartner, 1998). More recently, criteria
within the frameworks of item response theory (IRT; Gerber et al., 2002; Raju,
Laffitte, & Byrne, 2002; Reise, Widaman, & Pugh, 1993) and latent class analysis
(Eid, Langeheine, & Diener, 2003) have been put forward. It is beyond the scope
of this study to compare and contrast the criteria and relative strengths of these
various frameworks. Articles by Reise et al. (1993), Raju et al. (2002), and Eid et al.
(2003) include comparisons of approaches. There seems to be general agreement
in this field that the forms and stringency of measurement invariance with which
one needs to be concerned depend on practice and the goals of the study (e.g.,
Steenkamp & Baumgartner, 1998).
The following section presents the criteria for measurement invariance used in
this study. However, before we present these criteria, it is important to note how
certain assumptions led us to use an IRT framework and a particular IRT model
for this comparison. We believe it is important to use a framework that represents
key assumptions in the scoring and intended use of the GIS and to use a model
that allows the assumptions to be evaluated. First, the practice of obtaining only
one measure from the GIS (i.e., goal instability) suggests the assumption of uni-
dimensionality. It is therefore reasonable to use a unidimensional model to evalu-
ate measure invariance. Specifically, we are interested in the fit of the GIS data
to a unidimensional model because that is how the GIS data are treated.
Second, we assume that differences among the GIS items, in terms of their
endorsability by students, define a progression (or descent) into goal instability
(i.e., low motivation) that is shared by most persons. This assumption leads first
to the use of an IRT model and second to the use of “person fit” statistics that
indicate whether a given person’s responses to the GIS items are consistent with
the progression evidenced by the arrangement of items on the latent IRT scale.

IRT models generally locate items on a latent scale that also represents measures
of the trait. The production of person-fit statistics is associated primarily with a
subset of measurement theory and applications within IRT where the arrange-
ment of items on the latent scale is assumed to have meaning for most, if not all,
individuals. For example, items located at one end of the GIS may show how any
person begins the descent into goal instability. Items located at the other end of
the scale may show the final or most advanced stages of goal instability in any
person. These kinds of interpretations have implications for how goal instability
can be addressed through counseling or more general preventive interventions.
Third, we assume that the GIS items contribute equally to the measurement
of goal instability. This assumption is implicit in the fact that the unweighted total
score across GIS items is taken as the measure of goal instability. With this
assumption, it is important to use an IRT model that explicitly incorporates the
assumption of equal weighting and to evaluate the fit of the data to this model. In
a structural modeling framework, one would evaluate the fit of a model in which
equal weights were specified for the items. In an IRT framework, one evaluates
the fit of a model in which the slope parameter in the model is assumed to be a
constant (e.g., 1.0) for all items. The following section shows that the use of a
model with constant slope (i.e., with no slope parameter) has certain advantages
for assessing other facets of measure invariance as well.

The Rating Scale Model and Criteria for Measurement Invariance

The Rating Scale Model (Andrich, 1978) is a unidimensional IRT model for
data where all items share a common set of ordered response categories, such as
exists with a Likert-type scale. A formulation of the model and an interpretation
of its parameters with respect to GIS data, where items are scored 1 (for strongly
agree) to 6 (for strongly disagree), are shown by the following:

ln(Pnij / Pnij-1) = βn − (δi + τj),   j = 1, 2, …, 5   (1)

where τj is a category threshold parameter (a threshold parameter represents the
relative difficulty of choosing category j rather than category j − 1 in response to
any item); Pnij is the probability that person n surmounts exactly j thresholds on
statement i; Pnij-1 is the probability that person n surmounts exactly j – 1 thresh-
olds on statement i; βn is the goal instability of person n; and δi is the location,
or calibration, of item i on the measurement scale.
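To make Equation 1 concrete, the following Python sketch converts a person measure, an item calibration, and the five thresholds into the six category probabilities. All parameter values are illustrative, not estimates from this study.

```python
import math

def rsm_category_probs(beta, delta, taus):
    """Category probabilities for one person and item under the Rating Scale Model.

    beta  : goal instability measure of person n (logits)
    delta : calibration of item i (logits)
    taus  : thresholds tau_1..tau_5 for the 6-category Likert-type scale
    """
    # Cumulative logit for surmounting the first k thresholds; category 1 surmounts none.
    cum = [0.0]
    for tau in taus:
        cum.append(cum[-1] + (beta - (delta + tau)))  # Equation 1, summed over steps
    exp_cum = [math.exp(c) for c in cum]
    total = sum(exp_cum)
    return [e / total for e in exp_cum]  # probabilities of categories 1..6

# Illustrative values only; thresholds are centered to sum to zero by convention.
probs = rsm_category_probs(beta=0.5, delta=0.0, taus=[-1.2, -0.6, 0.1, -0.1, 1.8])
expected_score = sum(k * p for k, p in enumerate(probs, start=1))
```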
An important feature of the Rating Scale Model in the present context is that
it does not include a slope parameter. This is an important difference from the
graded response model (Samejima, 1969), which is often applied to Likert-type
data, and many other IRT models. The slope parameter is so named because it
allows the slope of the item characteristic curve (ICC) to vary across items. The
ICC is the trace line of the expected score on an item as a function of the trait
value, or β. The slope parameter essentially multiplies the additive combination
of other parameters in the model. For example, αi[βn − (δi + τj)] represents the
addition of a slope parameter, αi, to the Rating Scale Model. Models with slope
parameters tend to fit data better and may be useful in exploratory work, but they
do not correspond to practice when the unweighted total score across items is
used to estimate the underlying trait.
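For contrast, the sketch below shows where a slope parameter would enter the same computation, following the expression αi[βn − (δi + τj)] above. This is a generalized-partial-credit-style variant offered purely for illustration; the graded response model itself is formulated differently, through cumulative category boundaries.

```python
import math

def sloped_category_probs(beta, delta, taus, alpha):
    """Category probabilities when each adjacent-category logit is scaled by alpha_i."""
    cum = [0.0]
    for tau in taus:
        cum.append(cum[-1] + alpha * (beta - (delta + tau)))  # alpha_i[beta_n - (delta_i + tau_j)]
    exp_cum = [math.exp(c) for c in cum]
    total = sum(exp_cum)
    return [e / total for e in exp_cum]

# With alpha = 1.0 for every item, this reduces to the Rating Scale Model,
# matching the equal-weighting assumption implied by unweighted total scoring.
```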
Measure invariance in IRT ultimately implies equivalence of item parameters
across groups (Raju et al., 2002; Reise et al., 1993). Parameters in IRT models
typically represent distinct, substantive issues with regard to measure invariance,
so it would be useful to compare model parameters directly across groups. For
example, differences in the Rating Scale Model threshold parameters, τj, across
groups reflect group differences in how categories of the Likert-type scale are
interpreted and used, as opposed to differences in how the items are interpreted
and used. Group differences in how items are interpreted and used are represent-
ed by differences between paired item parameters, δi, across groups. Thus, in the
Rating Scale Model, the meaning of differences in the τj and δi across groups is
separate and clear. This is a consequence of the additive combination of all
model parameters on the right side of the model formulation (Equation 1).
More generally, measure invariance in IRT is not assessed by directly compar-
ing item parameters (see Raju et al., 2002, for a description of measure invariance
regarding a more general class of IRT models). IRT models frequently include
parameters, such as a slope or a pseudo-guessing parameter, such that the combi-
nation of parameters in the model is not completely linear, or additive. Without
additivity, the meaning of model parameters is not separate and clear. For exam-
ple, differences in item location parameters (e.g., δi) across groups cannot be
evaluated independently of differences in a slope parameter.
It is interesting to note that IRT-based studies of measure invariance generally
use the same methods used in the study of differential item functioning (DIF;
Raju et al., 2002; Reise et al., 1993). One of the most popular and powerful meth-
ods of assessing DIF, the Mantel-Haenszel method, is not directly based on IRT
but is mathematically equivalent to comparing item difficulty parameters in the
Rasch model for dichotomously scored (0 or 1) data, which has a single item
parameter (a location, or difficulty, parameter), under certain conditions, including the fit of data to
the model (Holland & Thayer, 1988). The Rating Scale Model is a member of
the Rasch family of measurement models (Rasch, 1960/1980; Wright & Masters,
1982).
Fit statistics commonly used in conjunction with the Rating Scale Model
include a weighted and unweighted mean squared residual, which are referred to
as infit and outfit, respectively (Wright & Masters, 1982). The fit statistics are
computed for each person and item (Wright & Masters, 1982). Only the outfit
statistic will be used in this study. The outfit statistic is comparable to a chi-square
statistic divided by its degrees of freedom. It has an expected value of 1.0 under
the hypothesis that data fit the model. Fit statistics greater than 1.0 indicate
response patterns having more noise than expected according to the probabilities
specified by the model (e.g., Equation 1). Fit statistics outside the range of 0.6 to
1.5 may indicate practically significant overfit (less than 0.6) or underfit (greater
than 1.5).
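As a sketch of how the outfit statistic is assembled for one person, the function below averages squared standardized residuals (observed minus expected score, squared, divided by the model variance) over items. The inputs and the closing example are hypothetical.

```python
def person_outfit(responses, category_probs):
    """Unweighted mean squared standardized residual (outfit) for one person.

    responses      : the person's observed category scores, one per item (1..6)
    category_probs : per item, model probabilities for categories 1..6
    """
    squared_residuals = []
    for x, probs in zip(responses, category_probs):
        categories = range(1, len(probs) + 1)
        expected = sum(k * p for k, p in zip(categories, probs))
        variance = sum(((k - expected) ** 2) * p for k, p in zip(categories, probs))
        squared_residuals.append(((x - expected) ** 2) / variance)
    return sum(squared_residuals) / len(squared_residuals)  # about 1.0 when data fit

# Hypothetical check: against flat (uniform) category probabilities, always
# answering "strongly disagree" yields an outfit well above the 1.5 screen.
flat = [[1.0 / 6] * 6] * 10
print(person_outfit([6] * 10, flat))  # about 2.14
```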
For some practical purposes, such as counseling, only cases of underfit may be
of concern. The responses of overfitting persons conform unusually well to the
arrangement of items on the GIS scale. The assumptions of a counselor or inter-
vention program about how goal instability progresses, as inferred from the order
of items on the GIS scale, would not be invalid for overfitting persons. Likewise,
overfitting items tend to be associated with a higher than average correlation
between the item score and person measure. This relationship extends to the
weight or slope the item would have in structural equation models or IRT models
that allow weights or slopes to vary. Overfitting items tend to be associated with
greater slope or weight values. These kinds of items are not generally viewed as
problem items in instrument development.
In terms of the Rating Scale Model and associated fit statistics, the following
criteria for measurement invariance are explored in this study. The term groups
refers to the English and Portuguese samples taking their respective language
versions of the GIS.
1. Item calibrations (δi) will be the same across groups. To meet this
criterion, paired item calibrations should differ by no more than
0.3 logits (the scale unit in a Rating Scale Model analysis); a sketch
of this screening step appears after this list. This standard is commonly applied in evaluating
measure invariance in educational testing (e.g., Miller, Rotou, &
Twing, 2004). The failure of an item to meet this criterion may be
due to nonequivalence in translation or to more fundamental differ-
ences between populations in how the item defines goal instability.
If the GIS measure is invariant in this respect, one could reasonably
hypothesize that individual counseling and intervention strategies
would not differ by population.
2. Step calibrations (τj) will be the same across groups. To meet this
criterion, the same 0.3 standard described above will be used.
Failure of a step calibration to meet this criterion would call the
translation of category labels into question or suggest more funda-
mental cultural differences in how persons use the Likert-type scale
categories.
3. The proportion of person fit statistics less than 1.5 should be reason-
ably large and comparable across groups. The larger the proportion,
the more the arrangement of items on the IRT scale can be used to
understand the dynamics of goal instability within a given student
and to deliver effective intervention and counseling at the individual
student level. If the proportion is comparable across language ver-
sions, the GIS can be said to have similar potential for counseling
in both populations. If the order and arrangement of items on the
scale differ in the two populations, individual counseling and inter-
vention strategies may differ by population.
4. Item fit statistics will be comparable across groups. In addition to
their use in detecting technical flaws in individual items, such as
scoring errors or ambiguous language that may be interpreted dif-
ferently by different persons, item fit statistics can indicate a variety
of substantively meaningful patterns in the data, such as dependen-
cies among related items (which leads to overfit) or items tapping
content that is not as strongly related to the central trait as others
(which leads to underfit). These substantive patterns, as well as
any superficial characteristics of the item that may cause misfit, are
part of the meaning of the variable and should be the same across
groups. Due to the approximate relations between item fit statistics,
structural equation modeling (SEM) item weights, and IRT item
slope parameters, a comparison of item fit statistics across groups is
about as productive and meaningful with regard to measure invari-
ance as comparing SEM item weights or IRT item slope parameter
estimates across groups.
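A minimal sketch of the 0.3-logit screen referenced in criteria 1 and 2; the labels and values in the example are hypothetical, not estimates from this study.

```python
def flag_noninvariant(calibrations_a, calibrations_b, criterion=0.3):
    """Flag parameters whose between-group difference exceeds the logit screen.

    Both arguments map a label (item or step) to its calibration in one group,
    with both sets expressed on a common scale (mean item calibration = 0).
    """
    return {label: abs(a - calibrations_b[label]) > criterion
            for label, a in calibrations_a.items()}

# Hypothetical example: differences of 0.20 logits fall under the screen.
english = {"item4": 0.15, "item5": 0.30}
portuguese = {"item4": 0.35, "item5": 0.10}
print(flag_noninvariant(english, portuguese))  # -> {'item4': False, 'item5': False}
```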

METHOD

Samples

American participants. American high school juniors and seniors who regis-
tered for the February and April 2002 administrations of the ACT assessment in
63 high schools across the United States were invited as potential participants in
a survey of noncognitive factors and high school performance (see Noble et al.,
2003, for details regarding the sampling procedure). The voluntary nature of
participation and confidentiality of the data were stressed during recruitment. A
total of 2,983 students consented to participate and completed the survey. Of these, 135
surveys (4.5%) were discarded due to missing data. The remaining participants (N =
2,848) had a mean age of 17.7 years (SD = 0.70, range = 16-21 years), were
mostly female (64.2%) and Caucasian (75.8%), and were enrolled in the 11th
grade (67.3%).

Portuguese participants. A total of 679 Portuguese secondary school students
from Grades 10 through 12 participated in the study. Their ages ranged from 15
to 21, with a mean age of 16.8 years (SD = 1.1). Participants were mostly
female (54.3%), with a modal enrollment in the 11th grade (43.2%). The students
were recruited from classes in six urban schools (four public and two private) by
one of the authors, who requested help with a study on academic achievement
and career planning. Administration of the instrument took place in school, dur-
ing class time, after the participants were informed that the general purpose of the
research was to study several aspects of adolescent development. The voluntary
nature of participation and confidentiality of the data were stressed.

The Goal Instability Scale

English version. The GIS (Robbins & Patton, 1985) was constructed for use in
educational settings to explore how lapses in generalized motivation or drive
affected career development and educational attainment processes. Several stud-
ies demonstrate the salience of this construct in a variety of settings, including
counseling (e.g., Robbins & Tucker, 1986), education (Lese & Robbins, 1994;
Robbins & Schwitzer, 1988; Thombs, 2002), aging (Cook, Casillas, Robbins, &
Dougherty, 2005), and health (Elliott, Uswatte, Lewis, & Palmatier, 2000). In our
current sample, GIS psychometrics were commensurate with those of past
research, with a mean score of 45.6 (SD = 9.3, range 10-60; alpha = .85).

Portuguese version. Details of the translation process were reported in Santos
et al. (2004). Careful effort was taken to ensure the proper adaptation of the GIS
for use in the educational setting (Table 1 features the English and Portuguese
versions of the measure). Furthermore, as suggested by the obtained convergent
and discriminant validity relations to educational and psychological measures,
Santos et al. found that the GIS-P operated in a similar fashion to the U.S.
version. In our current sample, GIS-P psychometric properties were similar
to those reported by Santos et al., with a mean of 45.3 (SD = 9.0, range 15-60,
alpha = .83).

Data Analytic Procedure

Rating Scale Model analyses were performed with the computer program
BIGSTEPS (Wright & Linacre, 1991). Three analyses were performed: a sepa-
rate analysis for each group and an analysis of the combined data. Due to the
additivity in the model, parameter estimates are identified up to an arbitrary
choice of scale origin. By convention, the scale origin, or zero, is set to the aver-
age item calibration. This means that with the use of the same items in all three
analyses, parameter estimates from all three analyses are automatically placed on
the same scale and are directly comparable.
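A small sketch of that identification convention, as we understand it: because the model is additive, subtracting the mean item calibration fixes the origin, so separate runs over the same items land on a directly comparable scale.

```python
def center_item_calibrations(deltas):
    """Set the scale origin to the average item calibration (the usual convention)."""
    mean_delta = sum(deltas) / len(deltas)
    # After centering, calibrations from separate analyses of the same items
    # share an origin and can be compared or plotted against one another.
    return [d - mean_delta for d in deltas]
```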
Table 1
English and Portuguese Versions of the Goal Instability Scale

1. English: It's hard to find a reason for working.
   Portuguese: É-me difícil encontrar uma razão para trabalhar.
2. English: I don't seem to make decisions by myself.
   Portuguese: Parece-me que não consigo tomar decisões sozinho(a).
3. English: I have confusion about who I am.
   Portuguese: Sinto-me confuso(a) acerca de quem eu sou.
4. English: I have more ideas than energy.
   Portuguese: Tenho mais ideias do que energia.
5. English: I lose my sense of direction.
   Portuguese: Na minha vida, frequentemente, fico desorientado(a).
6. English: It's easier for me to start than to finish projects.
   Portuguese: É mais fácil, para mim, iniciar projectos do que os concluir.
7. English: I don't seem to get going on anything important.
   Portuguese: Parece-me que não consigo prosseguir com nada de importante.
8. English: I wonder where my life is headed.
   Portuguese: Pergunto-me para onde a minha vida se encontra orientada.
9. English: After a while, I lose sight of my goals.
   Portuguese: Ao fim de algum tempo perco os meus objectivos de vista.
10. English: I don't seem to have the drive to get my work done.
    Portuguese: Parece-me que não tenho energia para realizar o meu trabalho.

Note. Adapted from Santos et al. (2004).

The BIGSTEPS software produces estimates of ancillary statistics, including
(a) the standard error of parameter estimates, (b) item and person outfit statistics,
and (c) reliability estimates. Reliability estimates are based on the traditional
formula with estimates of observed and error variance set to the units of the latent
scale. The reliability estimates are comparable in value to traditional internal
consistency indices, such as Cronbach's alpha.
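The following sketch shows one way to compute that traditional formula. It mirrors the usual Rasch separation reliability (true variance over observed variance, both in logits), though the exact computation BIGSTEPS performs may differ in detail.

```python
def person_reliability(measures, std_errors):
    """Traditional reliability: (observed - error variance) / observed variance.

    measures   : person measures from the Rating Scale Model run (logits)
    std_errors : standard error of each person measure (logits)
    """
    n = len(measures)
    mean = sum(measures) / n
    var_observed = sum((m - mean) ** 2 for m in measures) / n  # observed variance
    var_error = sum(se ** 2 for se in std_errors) / n          # average error variance
    return (var_observed - var_error) / var_observed
```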

RESULTS

Paired item measures (δi) are plotted in Figure 1. To identify the origin of the
scale in the rating scale analysis, the average item measure is arbitrarily set to
zero. Because this is done for both groups, the points in Figure 1 should lie within
measurement error of the identity line under the hypothesis of measurement
invariance. In both groups, items that are relatively easy to endorse, or to agree
with, have positive scale values. It can be seen that Portuguese students were
somewhat more likely to agree with Item 4, “I have more ideas than energy,” and
less likely to agree with Item 5, “I lose my sense of direction,” and Item 6, “It’s
easier for me to start than to finish projects.” Although these differences were sta-
tistically significant, no item exhibited a between-group difference of more than
0.3 logits.
[Figure 1. Item difficulty by language: scatter plot of paired item calibrations, English sample (horizontal axis) versus Portuguese sample (vertical axis).]

Estimates of step thresholds (τj) are plotted separately for each group in Figure
2. In both groups, the average step threshold is zero, by convention. Steps with
more positive scale values require higher levels of goal "stability" (i.e., lower levels
of goal instability) to surmount. Because disagreement received higher scores than
agreement, Step 1 represents the tendency to moderately agree rather than strong-
ly agree with an item, and Step 5 represents the tendency to strongly disagree
rather than moderately disagree with an item. In both groups, the difficulty of
steps increases monotonically with the exception of Step 4, which is easier than
Step 3 in both groups. Step 4 represents the tendency to moderately disagree (Cat-
egory 5) rather than slightly disagree (Category 4) with an item, whereas Step 3
represents the tendency to slightly disagree rather than slightly agree with an item.
A monotonic trend of increasing step difficulty is desirable for efficient measure-
ment (Linacre, 2002). However, exceptions to this trend are common and do not
tend to indicate serious measurement problems. It can be seen from Figure 2 that
the largest between-group difference in step thresholds occurred for Thresholds 2
and 4. Portuguese students were less likely to choose slightly agree over moderately
agree (Step 2) and more likely to choose moderately disagree over slightly disagree
(Step 4). These differences were statistically significant. However, neither differ-
ence exceeded 0.3 logits.
[Figure 2. Steps by language: step calibrations for Steps 1 through 5, plotted for the English and Portuguese samples.]

The percentage of person outfit statistics (i.e., unweighted mean squared
residuals) greater than 1.5 was 19.1 for the American sample (n = 2,848), 13.8 for
the Portuguese sample (n = 679), and 18.1 for the combined analysis (N = 3,527).
Because the weighted average percentage equals the percentage obtained in a
combined analysis, we concluded that a single set of item and threshold parame-
ters, based on a combined analysis, was appropriate for both groups. That is, per-
son fit is substantially the same whether it is computed with respect to a common
or group-specific set of item parameters and step thresholds.
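As a quick check on the weighted-average argument, the two group percentages do reproduce the combined figure:

(2,848 × 19.1 + 679 × 13.8) / 3,527 = (54,396.8 + 9,370.2) / 3,527 ≈ 18.1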
A plot of paired outfit statistics is shown in Figure 3. This plot shows that items
had similar fit statistics in each language group. The correlation between outfit
statistics was .91. Only one item, Item 1, has a fit statistic greater than 1.5. However,
items outside this range are not automatically deleted from an instrument. In the
present case, we suggest that the reasons for the misfit of Item 1 be investigated.
For example, it is possible that the step difficulties for this item are not consistent
with the common step structure illustrated in Figure 2. If this were the case, Item
1 could be allowed to have its own set of step difficulties. In contrast, the reason
for Item 1’s misfit might be related to its content or to subtle connotations associ-
ated with the particular way the item is written. In that case, it might be revised
and tried again or replaced altogether. However, the focus of the present study is
on comparability across language groups; in that regard, it is important to observe
that Item 1 has similar levels of misfit in both language groups.
[Figure 3. Outfit by language: scatter plot of paired item outfit statistics, English sample (horizontal axis) versus Portuguese sample (vertical axis).]

Table 2 shows summary statistics from the combined and separate (i.e., by
language) analyses. The average person measure was about the same for the
English and Portuguese samples, as well as for the combined group. A positive
average means that students tended to disagree with the GIS items. The reliability of person
measures was also the same for the individual language samples and the
combined group. Person and item fit statistics from separate and combined analy-
ses are also summarized in Table 2. As can be seen, means and standard devia-
tions of the fit statistics did not differ substantially across the three analyses.
Although evaluations of data/model fit are largely based on judgment and
experience (Embretson & Reise, 2000), the item fit statistics summarized in
Table 2 fell generally within established standards for a Rating Scale Model anal-
ysis. Given means near 1.0 and standard deviations ranging from .21 to .29, one
would expect few item fit statistics to be outside the range of 0.6 to 1.4. Indeed,
only one item fit statistic was outside this range (Item 1 outfit = 1.51).

DISCUSSION

What Is the GIS Construct?

The GIS construct is defined by the arrangement of items and item steps on a
latent scale. This arrangement is commonly called a “variable map” because it
represents the “journey” from one level of a variable to another, such as from low
goal instability (higher motivation) to high goal instability (lower motivation). In
the present case, we may speak of moving from high to low goal instability over
the course of counseling or intervention or of a student "descending" into goal
instability as a result of the lack of certain support systems, information, or other
factors whose presence would normally prevent such a course of development.

Table 2
Item Response Statistics for Combined and Separate Analyses

                      Person Measure   Person Outfit    Item Outfit    Reliability
                      M                M       SD       M       SD     Person  Item
Combined (N = 3,527)  0.76             1.04    0.72     1.04    0.26   0.81    1.00
English (n = 2,848)   0.78             1.04    0.72     1.04    0.26   0.81    1.00
Portuguese (n = 679)  0.72             1.04    0.71     1.04    0.29   0.81    1.00

Table 3 contains the item variable map for the average step of the GIS con-
struct. This map is based on the analysis of the combined American and Portuguese
data. The combined data were used because, as shown above, the GIS construct
is substantially the same in both groups. As stated in the introduction, a step is the
distance between two Likert-type scale category labels (e.g., going from endorsing
slightly agree to slightly disagree). The map can be described in terms of moving
from low to high goal instability. The average step is used because it seems repre-
sentative of the overall movement from low to high goal instability. For example,
as far as the GIS is able to document, Item 8 is the most likely to be endorsed by
someone who is just descending from low to high goal instability, whereas Item
10 is the last item to be endorsed as part of the descent process. According to the
Rating Scale Model, persons located above this step for a given item are more
likely to disagree (either slightly, moderately, or strongly) than to agree with the
item. The converse is true for persons below this step, who are more likely to
agree (slightly, moderately, or strongly) with the item, thus yielding higher goal
instability scores.

How Can the Meaning of This Construct Be Put to Use?

Information from the variable map may be useful in implementing counseling
and intervention approaches. For example, if a student tends to agree with the
items at the top of the map (e.g., 8, 4, 6) but not with those at the bottom (e.g., 9,
1, 10), it may indicate a moderate level of goal instability. This, in turn, would
suggest that the student has a baseline level of motivation but is experiencing
some difficulty in finding long-term direction and/or following through with pre-
viously set goals. On the other hand, if a student tends to agree with all items on
the map, including 9, 1, and 10, it may indicate a higher level of goal instability,
which would suggest that the student is experiencing low motivation and having
difficulties with completing even basic activities and/or setting simple goals.

Table 3
Item Variable Map for an Average Step Between Low and High Goal Instability

8   I wonder where my life is headed.
4   I have more ideas than energy.
6   It's easier for me to start than to finish projects.
5   I lose my sense of direction.
3   I have confusion about who I am.
2   I don't seem to make decisions by myself.
7   I don't seem to get going on anything important.
9   After a while, I lose sight of my goals.
1   It's hard to find a reason for working.
10  I don't seem to have the drive to get my work done.
Person fit statistics from a Rating Scale Model analysis are generally used to
evaluate the assumption that the arrangement of item steps on the variable map
is broadly descriptive of each person's experience, or journey, in moving
along the variable in either direction. Counseling and intervention approaches
based on the information in the variable map may be effective to the extent that
this assumption holds among individuals and the population at large. Thus, it is
important to assess the level of misfit (i.e., the percentage of person fit statistics
above 1.5). In this study, 19.1% of the American students and 13.8% of the
Portuguese students fell above the optimal level of person fit. Although the level
of person misfit is considerable, it is relatively common in IRT analyses of rating
scale data to show some level of misfit, and it is typically due to response sets
(Schulz & Sun, 2001). An example of a response set is the tendency to respond
in the same category for all items. The greater percentage of person misfit in the
American sample, compared with the Portuguese sample, may indicate a greater
propensity to response sets among American students. However, we do not
believe that the level of person misfit or the possibility of response sets among
some students would make intervention strategies based on the variable map inef-
fective. In fact, person fit statistics could be used in a two-stage approach for
identifying students or groups in need of intervention. First, for the substantial
proportion of students whose fit statistics were less than 1.5, the variable map
could be used to prioritize issues to address and/or to guide resource allocation.
Second, inferences about the goal instability of students or groups with high per-
son misfit could be qualified as temporary until more detailed analyses or ques-
tioning reveal the reasons for misfit.

Is the Motivational Construct Measured by the GIS the Same Across Cultures?

This study explored the cross-cultural and language equivalence between the
English and Portuguese versions of the GIS, a measure of generalized motivation.
Results of IRT analyses suggest that despite some minor differences in item per-
formance between the two test versions, goal instability has the same internal
meaning for both groups. Item parameters and fit statistics were substantially the
same between groups. Step thresholds, representing the relative attractiveness of
adjacent categories on the Likert-type scale, were also comparable between
groups.
One strength of the study is its use of large sample sizes and the Rating Scale
Model. As one of the options afforded by such sample sizes, we could have used
a more highly parameterized model. However, previous research suggests that the
Rating Scale Model tends to outperform more highly parameterized models,
such as the Graded Response Model (Dodd, De Ayala, & Koch, 1995; Lei,
Bassiri, & Schulz, 2003). In terms of weaknesses, analyses showed that the
American sample had a higher level of person misfit. Although we do not believe
that this poses a major risk to our interpretations, it suggests that response styles
and/or other reasons may underlie this level of misfit. We hope to explore this
issue in future research. Furthermore, it would have been ideal to obtain a larger
Portuguese sample, because this would increase our confidence that the results
of this study will generalize to this population.
Although we advocate the use of IRT procedures for making detailed cross-
cultural comparisons, we want to emphasize that analyses based on traditional
classical test theory–based psychometric procedures remain valuable and infor-
mative. Indeed, some types of information may be gleaned from either approach
(e.g., Raju et al., 2002). However, we believe that IRT procedures provide the
most sophisticated methods for establishing equivalence between measures,
which is an important consideration for cross-cultural measurement. By equiva-
lence, we do not mean that the two measures must be identical but rather that
they must be free from item bias (Arnold & Matus, 2000; Butcher & Han, 1996;
Casillas & Robbins, 2005). If different versions of a measure are not equivalent,
one risks making a variety of faulty inferences about the meaning of such mea-
sures and their respective populations, such as inappropriate scaling, inaccurate
comparisons between groups, and faulty conclusions about groups (Hambleton
& Slater, 1997). Thus, if one wishes to adapt an existing measure to another lan-
guage and/or culture, IRT procedures may be indispensable. In contrast, if one
wishes to develop a new test in the target culture from the ground up, IRT proce-
dures may not be necessary, or even desirable. Of course, regardless of the meth-
ods used, the ultimate value of any analytical procedure is whether it can facilitate
the examination of the construct validity of a measure.

Future Research

By demonstrating that the two measures are similarly tapping the construct
measured by the GIS, these results complement the findings of Santos et al.
(2004), who showed that the GIS-P was unidimensional and resulted in a pattern
of convergent and discriminant relations with other measures similar to that dem-
onstrated by the original GIS. However, we realize that these analyses, by them-
selves, do not complete the process of construct validation of the GIS-P. Future
research with external criteria (e.g., GPA, retention rates) is needed to expand our
knowledge of the construct validity of this measure adaptation. As stated before,
Portugal is experiencing a high secondary school dropout rate. Unless interven-
tions are developed to curtail dropout, Portugal will have considerable difficulty
competing within the European Union as well as in the global economy. At the
same time, motivation appears to be a salient indicator of academic performance
and career readiness for Portuguese secondary school students (see Santos et al.,
2004). Thus, future research is needed to determine if interventions specifically
targeted to highlight motivational factors will help to improve students' persistence
and achievement, their ability to set career goals, and their eventual pursuit
of postsecondary training and productive career placements.
Furthermore, research is needed to examine how best to assist Portuguese sec-
ondary school students in pursuing various forms of postsecondary education.
This may require assessment not only of Portuguese students themselves but also
of the Portuguese school system as well as other major influences on students’
academic development (e.g., family networks, peers, Portuguese media). Findings
from this type of research will be of considerable value in the development of
interventions for promoting academic achievement, retention, and the eventual
career placement of Portuguese students. Once this research is under way, it will
be interesting to examine whether some of the strategies used in the United States
(e.g., Robbins & Tucker, 1986) may be applied successfully to Portuguese
students.

REFERENCES

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-
573.
Arnold, B. R., & Matus, Y. E. (2000). Test translation and cultural equivalence methodologies for
use with diverse populations. In I. Cuellar & F. A. Paniagua (Eds.), Handbook of multicultural
mental health (pp. 121-136). San Diego, CA: Academic Press.
Butcher, J. N., & Han, K. (1996). Methods of establishing cross-cultural equivalence. In
J. N. Butcher (Ed.), International adaptations of the MMPI-2: Research and clinical applications
(pp. 44-63). Minneapolis: University of Minnesota Press.
Casillas, A., & Robbins, S. B. (2005). Test adaptation and cross-cultural assessment from a business
perspective: Issues and recommendations. International Journal of Testing, 5(1), 5-21.
Cook, D., Casillas, A., Robbins, S., & Dougherty, L. (2005). Goal continuity and the “Big Five”
as predictors of older adult marital adjustment. Personality and Individual Differences, 38,
519-531.
Dodd, B. G., De Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with polyto-
mous items. Applied Psychological Measurement, 19, 5-22.
Eid, M., Langeheine, R., & Diener, E. (2003). Comparing typological structures across cultures by
multigroup latent class analysis: A primer. Journal of Cross-Cultural Psychology, 34, 195-210.
Elliott, T. R., Uswatte, G., Lewis, L., & Palmatier, A. (2000). Goal instability and adjustment to
physical disability. Journal of Counseling Psychology, 47, 251-265.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum.
Fundação da Juventude. (1999). Diplomados desempregados [Unemployed graduates]. Porto,
Portugal: Author.
Gerber, B., Smith, E. V., Girotti, M., Pelaez, L., Lawless, K., Smolin, L., et al. (2002). Using Rasch
measurement to investigate the cross-form equivalence and clinical utility of Spanish and
English versions of a diabetes questionnaire: A pilot study. Journal of Applied Measurement, 3,
243-271.
Gonzales, P., Calsyn, C., Jocelyn, L., Mak, K., Kastberg, D., Arafeh, S., et al. (2000, December).
Pursuing excellence: Comparisons of international eighth-grade mathematics and science achieve-
ment from a U.S. perspective, 1995 and 1999. Washington, DC: U.S. Department of Education,
National Center for Education Statistics.
Hambleton, R. K., & Slater, S. C. (1997). Item response theory models and testing practices:
Current international status and future directions. European Journal of Psychological Assessment,
13, 21-28.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel–Haenszel
procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence
Erlbaum.
Hui, C. H., & Triandis, H. C. (1985). Measurement in cross-cultural psychology: A review and
comparison of strategies. Journal of Cross-Cultural Psychology, 16, 131-152.
Hulin, C. L., Drasgow, F., & Komocar, J. (1982). Applications of Item Response Theory to analysis
of attitude scale translations. Journal of Applied Psychology, 67, 818-825.
International Test Commission. (2000). International guidelines for test use. Surrey, UK: Author.
Lei, P., Bassiri, D., & Schulz, E. M. (2003). A comparative evaluation of methods of adjusting GPA
for differences in grade assignment practices. Journal of Applied Measurement, 4, 70-86.
Lese, K., & Robbins, S. (1994). Relationship between goal attributes and the academic achieve-
ment of southeast Asian refugee adolescents. Journal of Counseling Psychology, 41, 45-52.
Linacre, J. M. (2002). Understanding Rasch measurement: Optimizing rating scale category effec-
tiveness. Journal of Applied Measurement, 3, 85-106.
Miller, E. G., Rotou, O., & Twing, J. S. (2004). Evaluation of the 0.3 logits screening criterion in
common item equating. Journal of Applied Measurement, 5, 172-177.
Ministry of Education. (2003). Insucesso e abandono escolar em Portugal [School failure and drop-
out in Portugal]. Retrieved April 15, 2004, from http://www.minedu.pt/Scripts/ASP/destaque/
estudo01/docs/sintese.pdf
Noble, J. P., Roberts, W. L., & Sawyer, R. L. (2003, April). Student achievement, behavior, percep-
tions, and other factors affecting college admissions test scores. Paper presented at the annual
meeting of the American Educational Research Association, Chicago.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of
methods based on confirmatory factor analysis and item response theory. Journal of Applied
Psychology, 87, 517-529.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Danish Institute for
Educational Research. Chicago: University of Chicago Press. (Original work published 1960)
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response
theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114,
552-566.
Robbins, S., Lauver, K., Le, H., Davis, D., Langley, R., & Carlstrom, A. (2004). Do psychosocial
and study skill factors predict college outcomes? A meta-analysis. Psychological Bulletin, 130,
261-288.
Robbins, S. B., & Patton, M. J. (1985). Self-psychology and career development: Construction of
the Superiority and Goal Instability scales. Journal of Counseling Psychology, 32, 221-231.
Robbins, S. B., Payne, E. C., & Chartrand, J. M. (1990). Goal instability and later life adjustment.
Psychology and Aging, 5, 447-450.
Robbins, S. B., & Schwitzer, A. M. (1988). Validity of the superiority and goal instability scales as
predictors of women’s adjustment to college life. Measurement and Evaluation in Counseling
Development, 21, 117-123.
Robbins, S. B., & Tucker, K. R., Jr. (1986). Relation of goal instability to self-directed and interac-
tional career counseling workshops. Journal of Counseling Psychology, 33, 418-424.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph, No. 17.
Santos, P. J., Casillas, A., & Robbins, S. B. (2004). Motivational correlates of Portuguese high
schoolers’ vocational identity: Cultural validation of the Goal Instability Scale. Journal of Career
Assessment, 12, 17-32.
Schulz, E. M., & Sun, A. (2001). Controlling for rater effects when comparing survey items with
incomplete Likert data. Journal of Applied Measurement, 2, 337-355.
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-
national consumer research. Journal of Consumer Research, 25, 78-90.
Thombs, D. (2002). Problem behavior and academic achievement among first-semester college
freshmen. Journal of College Student Development, 36, 280-288.
Van de Vijver, F., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines.
European Psychologist, 1, 89-99.
Wright, B. D., & Linacre, J. M. (1991). A user’s guide to BIGSTEPS: Rasch-model computer pro-
gram. Chicago: Mesa Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
