E. Matthew Schulz
Steven B. Robbins
ACT, Inc.
Richard M. Lee
University of Minnesota, Twin Cities
The present study uses item response theory (IRT) to establish comparability
between the English and Portuguese versions of the Goal Instability Scale (GIS), a
measure of generalized motivation. A total of 2,848 American and 679 Portuguese
high school students were administered their respective language versions of the
GIS. Results showed only minor differences in item performance between lan-
guage versions, suggesting equivalent measurement of the underlying motivational
construct. Implications regarding the interpretation of IRT analyses for interven-
tion purposes, as well as future research, are discussed.
Over the past few decades, increased immigration patterns have created (or aug-
mented) multicultural societies and global education and workforce/vocational
domains (Casillas & Robbins, 2005). As societies face the challenge of addressing
the needs of increasingly diverse populations, many testing professionals have
turned to U.S.-derived psychological, academic, and workforce tests with the
logic that “if an instrument has been shown to be reliable and valid in one cul-
tural context, it may hold potential for benefiting consumers in other cultures”
(Arnold & Matus, 2000, p. 122). Within the academic achievement arena, the
Correspondence concerning this article should be directed to Steven B. Robbins, Applied Research, ACT, Inc.,
Iowa City, IA 52243-168l; e-mail: steve.robbins@act.org.
Casillas et al. / IRT AND MOTIVATION 473
IRT models generally locate items on a latent scale that also represents measures
of the trait. The production of person-fit statistics is associated primarily with a
subset of measurement theory and applications within IRT where the arrange-
ment of items on the latent scale is assumed to have meaning for most, if not all,
individuals. For example, items located at one end of the GIS may show how any
person begins the descent into goal instability. Items located at the other end of
the scale may show the final or most advanced stages of goal instability in any
person. These kinds of interpretations have implications for how goal instability
can be addressed through counseling or more general preventive interventions.
Third, we assume that the GIS items contribute equally to the measurement
of goal instability. This assumption is implicit in the fact that the unweighted total
score across GIS items is taken as the measure of goal instability. With this
assumption, it is important to use an IRT model that explicitly incorporates the
assumption of equal weighting and to evaluate the fit of the data to this model. In
a structural modeling framework, one would evaluate the fit of a model in which
equal weights were specified for the items. In an IRT framework, one evaluates
the fit of a model in which the slope parameter in the model is assumed to be a
constant (e.g., 1.0) for all items. The following section shows that the use of a model with a constant slope (i.e., with no separate slope parameter) has certain advantages for assessing other facets of measure invariance as well.
The Rating Scale Model (Andrich, 1978) is a unidimensional IRT model for
data where all items share a common set of ordered response categories, such as
exists with a Likert-type scale. A formulation of the model and an interpretation
of its parameters with respect to GIS data, where items are scored 1 (for strongly
agree) to 6 (for strongly disagree), are shown by the following:
ln[Pni(j+1)/Pnij] = βn – (δi + τj),    j = 1, 2, …, 5    (1)

where Pnij is the probability that person n responds in category j of item i, βn is the trait level of person n, δi is the scale location of item i, and τj is the threshold of step j, common to all items. A more highly parameterized model
allows the slope of the item characteristic curve (ICC) to vary across items. The
ICC is the trace line of the expected score on an item as a function of the trait
value, or β. The slope parameter essentially multiplies the additive combination
of other parameters in the model. For example, αi[βn – (δi +τj)] represents the
addition of a slope parameter, αi, to the Rating Scale Model. Models with slope
parameters tend to fit data better and may be useful in exploratory work, but they
do not correspond to practice when the unweighted total score across items is
used to estimate the underlying trait.
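As a concrete illustration of the Rating Scale Model and of how a slope parameter would enter it, the following sketch computes category probabilities and the expected item score (the ICC) from hypothetical parameter values. It is only an illustration of the model's form, not a reproduction of the estimation performed in the study.

```python
import math

def rsm_probs(beta, delta, taus, alpha=1.0):
    """Category probabilities for one item under the Rating Scale Model.

    beta  -- person trait level in logits (hypothetical value)
    delta -- item location in logits
    taus  -- step thresholds (five steps for the six GIS categories)
    alpha -- optional slope; alpha = 1.0 gives the Rasch-family model
             with no separate slope parameter, as used in the text
    """
    # Log-odds accumulate across steps: each step adds alpha*(beta - (delta + tau_j)).
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + alpha * (beta - (delta + tau)))
    expz = [math.exp(z) for z in logits]
    total = sum(expz)
    return [z / total for z in expz]

def expected_score(beta, delta, taus, alpha=1.0):
    """Expected item score as a function of beta (the trace line, or ICC)."""
    probs = rsm_probs(beta, delta, taus, alpha)
    return sum((j + 1) * p for j, p in enumerate(probs))  # categories scored 1..6
```

Raising alpha above 1.0 steepens the ICC around the item location, which is exactly why a model with varying slopes no longer corresponds to an unweighted total score.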
Measure invariance in IRT ultimately implies equivalence of item parameters
across groups (Raju et al., 2002; Reise et al., 1993). Parameters in IRT models
typically represent distinct, substantive issues with regard to measure invariance,
so it would be useful to compare model parameters directly across groups. For
example, differences in the Rating Scale Model threshold parameters, τj, across
groups reflect group differences in how categories of the Likert-type scale are
interpreted and used, as opposed to differences in how the items are interpreted
and used. Group differences in how items are interpreted and used are represent-
ed by differences between paired item parameters, δi, across groups. Thus, in the
Rating Scale Model, the meaning of differences in the τj and δi across groups is
separate and clear. This is a consequence of the additive combination of all
model parameters on the right side of the model formulation (Equation 1).
In more general IRT models, however, measure invariance cannot be assessed by directly comparing item parameters (see Raju et al., 2002, for a description of measure invariance in a more general class of IRT models). IRT models frequently include
parameters, such as a slope or a pseudo-guessing parameter, such that the combi-
nation of parameters in the model is not completely linear, or additive. Without
additivity, the meaning of model parameters is not separate and clear. For exam-
ple, differences in item location parameters (e.g., δi) across groups cannot be
evaluated independently of differences in a slope parameter.
It is interesting to note that IRT-based studies of measure invariance generally
use the same methods used in the study of differential item functioning (DIF;
Raju et al., 2002; Reise et al., 1993). One of the most popular and powerful meth-
ods of assessing DIF, the Mantel-Haenszel method, is not directly based on IRT
but is mathematically equivalent to comparing item difficulty parameters in the
one item-parameter (a location or difficulty parameter) Rasch model for dichoto-
mously scored (0 or 1) data under certain conditions, including the fit of data to
the model (Holland & Thayer, 1988). The Rating Scale Model is a member of
the Rasch family of measurement models (Rasch, 1960/1980; Wright & Masters,
1982).
Fit statistics commonly used in conjunction with the Rating Scale Model
include a weighted and unweighted mean squared residual, which are referred to
as infit and outfit, respectively (Wright & Masters, 1982). The fit statistics are
computed for each person and item (Wright & Masters, 1982). Only the outfit
statistic will be used in this study. The outfit statistic is comparable to a chi-square
statistic divided by its degrees of freedom. It has an expected value of 1.0 under
the hypothesis that data fit the model. Fit statistics greater than 1.0 indicate
response patterns having more noise than expected according to the probabilities
specified by the model (e.g., Equation 1). Fit statistics outside the range of 0.6 to
1.5 may indicate practically significant overfit (less than 0.6) or underfit (greater
than 1.5).
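The outfit statistic just described can be sketched as an average of squared standardized residuals. This is a simplified illustration with hypothetical inputs; in practice the expected scores and variances come from the fitted Rating Scale Model (e.g., a BIGSTEPS run), and the function name is ours.

```python
def outfit(observed, expected, variances):
    """Unweighted mean squared standardized residual for one person or item.

    observed  -- responses (e.g., a person's scores on the 10 GIS items)
    expected  -- model-expected scores for those responses
    variances -- model variances of those responses
    The statistic has an expected value of 1.0 when the data fit the model.
    """
    squared_residuals = [(x - e) ** 2 / v
                         for x, e, v in zip(observed, expected, variances)]
    return sum(squared_residuals) / len(squared_residuals)
```

Values above roughly 1.5 flag the noisy (underfitting) response patterns discussed here; values below 0.6 indicate overfit.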
For some practical purposes, such as counseling, only cases of underfit may be
of concern. The responses of overfitting persons conform unusually well to the
arrangement of items on the GIS scale. The assumptions of a counselor or inter-
vention program about how goal instability progresses, as inferred from the order
of items on the GIS scale, would not be invalid for overfitting persons. Likewise,
overfitting items tend to be associated with a higher than average correlation
between the item score and person measure. This relationship extends to the
weight or slope the item would have in structural equation models or IRT models
that allow weights or slopes to vary. Overfitting items tend to be associated with
greater slope or weight values. These kinds of items are not generally viewed as
problem items in instrument development.
In terms of the Rating Scale Model and associated fit statistics, the following
criteria for measurement invariance are explored in this study. The term groups
refers to the English and Portuguese samples taking their respective language
versions of the GIS.
1. Item calibrations (δi) will be the same across groups. To meet this
criterion, the difference between paired item calibrations should
differ by no more than 0.3 logits (the scale unit in a Rating Scale
Model analysis). This standard is commonly applied in evaluating
measure invariance in educational testing (e.g., Miller, Rotou, &
Twing, 2004). The failure of an item to meet this criterion may be
due to nonequivalence in translation or to more fundamental differ-
ences between populations in how the item defines goal instability.
If the GIS measure is invariant in this respect, one could reasonably
hypothesize that individual counseling and intervention strategies
would not differ by population.
2. Step calibrations (τj) will be the same across groups. To meet this
criterion, the same 0.3 standard described above will be used.
Failure of a step calibration to meet this criterion would call the
translation of category labels into question or suggest more funda-
mental cultural differences in how persons use the Likert-type scale
categories.
3. The proportion of person fit statistics less than 1.5 should be reason-
ably large and comparable across groups. The larger the proportion,
the more the arrangement of items on the IRT scale can be used to
understand the dynamics of goal instability within a given student
and to deliver effective intervention and counseling at the individual
student level. If the proportion is comparable across language ver-
sions, the GIS can be said to have similar potential for counseling
478 JOURNAL OF CAREER ASSESSMENT / November 2006
METHOD
Samples
American participants. American high school juniors and seniors who regis-
tered for the February and April 2002 administrations of the ACT assessment in
63 high schools across the United States were invited as potential participants in
a survey of noncognitive factors and high school performance (see Noble et al.,
2003, for details regarding the sampling procedure). The voluntary nature of
participation and confidentiality of the data were stressed during recruitment. A
total of 2,983 consented to participate and completed the survey. Of these, 135
surveys (4.5%) were discarded due to missing data. The remaining participants (N =
2,848) had a mean age of 17.7 years (SD = 0.70, range = 16-21 years), were
mostly female (64.2%) and Caucasian (75.8%), and were enrolled in the 11th
grade (67.3%).
and career planning. Administration of the instrument took place in school, dur-
ing class time, after the participants were informed that the general purpose of the
research was to study several aspects of adolescent development. The voluntary
nature of participation and confidentiality of the data were stressed.
English version. The GIS (Robbins & Patton, 1985) was constructed for use in
educational settings to explore how lapses in generalized motivation or drive
affected career development and educational attainment processes. Several stud-
ies demonstrate the salience of this construct in a variety of settings, including
counseling (e.g., Robbins & Tucker, 1986), education (Lese & Robbins, 1994;
Robbins & Schwitzer, 1988; Thombs, 2002), aging (Cook, Casillas, Robbins, &
Dougherty, 2005), and health (Elliott, Uswatte, Lewis, & Palmatier, 2000). In our
current sample, GIS psychometrics were commensurate with those of past
research, with a mean score of 45.6 (SD = 9.3, range 10-60; alpha = .85).
Rating Scale Model analyses were performed with the computer program
BIGSTEPS (Wright & Linacre, 1991). Three analyses were performed: a sepa-
rate analysis for each group and an analysis of the combined data. Due to the
additivity in the model, parameter estimates are identified up to an arbitrary
choice of scale origin. By convention, the scale origin, or zero, is set to the aver-
age item calibration. This means that with the use of the same items in all three
analyses, parameter estimates from all three analyses are automatically placed on
the same scale and are directly comparable.
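The scale-origin convention just described can be illustrated directly: subtracting the mean item calibration from each estimate fixes the origin at zero, so separate runs over the same items share a scale. The numbers below are arbitrary, not BIGSTEPS output.

```python
def center_on_mean(calibrations):
    """Re-express item calibrations with the scale origin (zero) set to the
    average calibration, the identification convention described in the text."""
    mean = sum(calibrations) / len(calibrations)
    return [c - mean for c in calibrations]
```

Two analyses of the same item set, each centered this way, have a common origin and are therefore directly comparable.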
The BIGSTEPS software produces estimates of ancillary statistics, including (a) the standard error of parameter estimates, (b) item and person outfit statistics, and (c) reliability estimates. Reliability estimates are based on the traditional formula, with estimates of observed and error variance expressed in units of the latent scale.
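The traditional reliability formula referred to here can be sketched as follows, with the observed variance of person measures and the mean squared error both in logit units. This is our simplified illustration; the program's actual computation differs in details such as how the variances are estimated.

```python
def separation_reliability(measures, standard_errors):
    """Reliability = (observed variance - mean error variance) / observed
    variance, with both variances in units of the latent (logit) scale."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / n
    error_var = sum(se ** 2 for se in standard_errors) / n
    return (observed_var - error_var) / observed_var
```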
Table 1
English and Portuguese Versions of the Goal Instability Scale
1. It’s hard to find a reason for working. / É-me difícil encontrar uma razão para trabalhar.
2. I don’t seem to make decisions by myself. / Parece-me que não consigo tomar decisões sozinho(a).
3. I have confusion about who I am. / Sinto-me confuso(a) acerca de quem eu sou.
4. I have more ideas than energy. / Tenho mais ideias do que energia.
5. I lose my sense of direction. / Na minha vida, frequentemente, fico desorientado(a).
6. It’s easier for me to start than to finish projects. / É mais fácil, para mim, iniciar projectos do que os concluir.
7. I don’t seem to get going on anything important. / Parece-me que não consigo prosseguir com nada de importante.
8. I wonder where my life is headed. / Pergunto-me para onde a minha vida se encontra orientada.
9. After a while, I lose sight of my goals. / Ao fim de algum tempo perco os meus objectivos de vista.
10. I don’t seem to have the drive to get my work done. / Parece-me que não tenho energia para realizar o meu trabalho.
RESULTS
Paired item measures (δi) are plotted in Figure 1. To identify the origin of the
scale in the rating scale analysis, the average item measure is arbitrarily set to
zero. Because this is done for both groups, the points in Figure 1 should lie within
measurement error of the identity line under the hypothesis of measurement
invariance. In both groups, items that are relatively easy to endorse, or to agree
with, have positive scale values. It can be seen that Portuguese students were
somewhat more likely to agree with Item 4, “I have more ideas than energy,” and
less likely to agree with Item 5, “I lose my sense of direction,” and Item 6, “It’s
easier for me to start than to finish projects.” Although these differences were sta-
tistically significant, no item exhibited a between-group difference of more than
0.3 logits.
[Figure 1. Paired item measures (δi): English sample on the horizontal axis, Portuguese sample on the vertical axis; points are labeled by item number.]
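The comparison described above, paired item calibrations checked against the 0.3-logit screening criterion, amounts to a simple screen. The calibrations below are made-up values for illustration, not the study's estimates.

```python
def noninvariant_items(calib_english, calib_portuguese, criterion=0.3):
    """Return the items whose paired calibrations (in logits, on a common
    scale) differ by more than the screening criterion. Inputs are dicts
    mapping item number to calibration; values here are hypothetical."""
    return {item: round(abs(calib_english[item] - calib_portuguese[item]), 2)
            for item in calib_english
            if abs(calib_english[item] - calib_portuguese[item]) > criterion}
```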
Estimates of step thresholds (τj) are plotted separately for each group in Figure
2. In both groups, the average step threshold is zero, by convention. Steps with
more positive scale values require higher levels of goal “stability” (or lower levels
of goal instability) to take. Because disagreement received higher scores than
agreement, Step 1 represents the tendency to moderately agree rather than strong-
ly agree with an item, and Step 5 represents the tendency to strongly disagree
rather than moderately disagree with an item. In both groups, the difficulty of
steps increases monotonically with the exception of Step 4, which is easier than
Step 3 in both groups. Step 4 represents the tendency to moderately disagree (Cat-
egory 5) rather than slightly disagree (Category 4) with an item, whereas Step 3
represents the tendency to slightly disagree rather than slightly agree with an item.
A monotonic trend of increasing step difficulty is desirable for efficient measure-
ment (Linacre, 2002). However, exceptions to this trend are common and do not
tend to indicate serious measurement problems. It can be seen from Figure 2 that
the largest between-group difference in step thresholds occurred for Thresholds 2
and 4. Portuguese students were less likely to choose slightly agree over moderately
agree (Step 2) and more likely to choose moderately disagree over slightly disagree
(Step 4). These differences were statistically significant. However, neither differ-
ence exceeded 0.3 logits.
The percentage of person outfit statistics (i.e., unweighted mean squared
residuals) greater than 1.5 was 19.1 for the American sample (n = 2,848), 13.8 for
the Portuguese sample (n = 679), and 18.1 for the combined analysis (N = 3,527).
Because the weighted average percentage equals the percentage obtained in a combined analysis, we concluded that a single set of item and threshold parameters, based on a combined analysis, was appropriate for both groups. That is, person fit is substantially the same whether it is computed with respect to a common or group-specific set of item parameters and step thresholds.

[Figure 2. Step calibrations (τj) for Steps 1 through 5, plotted for the English and Portuguese samples.]
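The claim that the weighted average of the group percentages matches the combined-analysis percentage can be verified with the sample sizes and percentages reported above:

```python
# Sample sizes and percentages of person outfit statistics > 1.5, from the text.
n_american, n_portuguese = 2848, 679
pct_american, pct_portuguese = 19.1, 13.8

weighted = (n_american * pct_american + n_portuguese * pct_portuguese) / (
    n_american + n_portuguese)
# The weighted average is about 18.1, matching the percentage reported
# for the combined analysis.
```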
A plot of paired outfit statistics is shown in Figure 3. This plot shows that items
had similar fit statistics in each language group. The correlation between outfit
statistics was .91. Only one item, Item 1, has a fit statistic greater than 1.5. However,
items outside this range are not automatically deleted from an instrument. In the
present case, we suggest that the reasons for the misfit of Item 1 be investigated.
For example, it is possible that the step difficulties for this item are not consistent with the common step structure illustrated in Figure 2. If this were the case, Item
1 could be allowed to have its own set of step difficulties. In contrast, the reason
for Item 1’s misfit might be related to its content or to subtle connotations associ-
ated with the particular way the item is written. In that case, it might be revised
and tried again or replaced altogether. However, the focus of the present study is
on comparability across language groups; in that regard, it is important to observe
that Item 1 has similar levels of misfit in both language groups.
Table 2 shows summary statistics from the combined and separate (i.e., by lan-
guage) analyses. The average person measure was about the same for the English
and Portuguese samples, as well as for the combined group. A positive average
means that students tended to disagree with the GIS items.

[Figure 3. Paired item outfit statistics: English sample on the horizontal axis, Portuguese sample on the vertical axis; points are labeled by item number.]

The reliability of person measures was also the same for the individual language samples and the
combined group. Person and item fit statistics from separate and combined analy-
ses are also summarized in Table 2. As can be seen, means and standard devia-
tions of the fit statistics did not differ substantially across the three analyses.
Although evaluations of data/model fit are largely based on judgment and
experience (Embretson & Reise, 2000), the item fit statistics summarized in
Table 2 fell generally within established standards for a Rating Scale Model anal-
ysis. Given means near 1.0 and standard deviations ranging from .21 to .29, one
would expect few item fit statistics to be outside the range of 0.6 to 1.4. Indeed,
only one item fit statistic was outside this range (Item 1 outfit = 1.51).
DISCUSSION
The GIS construct is defined by the arrangement of items and item steps on a
latent scale. This arrangement is commonly called a “variable map” because it
represents the “journey” from one level of a variable to another, such as from low
goal instability (higher motivation) to high goal instability (lower motivation). In
the present case, we may speak of moving from high to low goal instability over
the course of counseling or intervention or of a student “descending” into goal
Table 2
Item Response Statistics for Combined and Separate Analyses
a. N = 3,520.
b. n = 2,848.
c. n = 679.
Table 3
Item Variable Map for an Average Step
Between Low and High Goal Instability
which would suggest that the student is experiencing low motivation and having
difficulties with completing even basic activities and/or setting simple goals.
Person fit statistics from a Rating Scale Model analysis are generally used to
evaluate the assumption that the arrangement of item steps on the variable map is broadly descriptive of each person’s experience, or journey, in moving
along the variable in either direction. Counseling and intervention approaches
based on the information in the variable map may be effective to the extent that
this assumption holds among individuals and the population at large. Thus, it is
important to assess the level of misfit (i.e., the percentage of person fit statistics
above 1.5). In this study, the person fit statistics of 19.1% of the American students and 13.8% of the Portuguese students exceeded the 1.5 criterion. Although the level
of person misfit is considerable, it is relatively common in IRT analyses of rating
scale data to show some level of misfit, and it is typically due to response sets
(Schulz & Sun, 2001). An example of a response set is the tendency to respond
in the same category for all items. The greater percentage of person misfit in the
American sample, compared with the Portuguese sample, may indicate a greater
propensity to response sets among American students. However, we do not
believe that the level of person misfit or the possibility of response sets among
some students would make intervention strategies based on the variable map inef-
fective. In fact, person fit statistics could be used in a two-stage approach for
identifying students or groups in need of intervention. First, for the substantial
proportion of students whose fit statistics were less than 1.5, the variable map could be used to prioritize issues to be addressed and/or to guide resource allocation.
Second, inferences about the goal instability of students or groups with high per-
son misfit could be qualified as temporary until more detailed analyses or ques-
tioning reveal the reasons for misfit.
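In outline, the two-stage approach sketched above might look like the following; the data structure and cutoff handling are our illustration, not a procedure from the study.

```python
def two_stage_triage(person_fit, cutoff=1.5):
    """Split students by person fit statistic: those fitting the model
    (fit < cutoff) are candidates for variable-map-based prioritization;
    the rest are flagged for follow-up questioning before inferences
    about their goal instability are made.

    person_fit -- dict mapping a student identifier to an outfit statistic
    """
    map_based = sorted(s for s, fit in person_fit.items() if fit < cutoff)
    follow_up = sorted(s for s, fit in person_fit.items() if fit >= cutoff)
    return map_based, follow_up
```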
This study explored the cross-cultural and language equivalence between the
English and Portuguese versions of the GIS, a measure of generalized motivation.
Results of IRT analyses suggest that despite some minor differences in item per-
formance between the two test versions, goal instability has the same internal
meaning for both groups. Item parameters and fit statistics were substantially the
same between groups. Step thresholds, representing the relative attractiveness of
adjacent categories on the Likert-type scale, were also comparable between
groups.
One strength of the study is its use of large sample sizes and the Rating Scale
Model. As one of the options afforded by such sample sizes, we could have used
a more highly parameterized model. However, previous research suggests that the
Rating Scale Model tends to outperform more highly parameterized models,
such as the Graded Response Model (Dodd, De Ayala, & Koch, 1995; Lei,
Bassiri, & Schulz, 2003). In terms of weaknesses, analyses showed that the
American sample had a higher level of person misfit. Although we do not believe
that this poses a major risk to our interpretations, it suggests that response styles
and/or other reasons may underlie this level of misfit. We hope to explore this
issue in future research. Furthermore, it would have been ideal to obtain a larger
Portuguese sample, because this would increase our confidence that the results
of this study will generalize to this population.
Although we advocate the use of IRT procedures for making detailed cross-
cultural comparisons, we want to emphasize that analyses based on traditional
classical test theory–based psychometric procedures remain valuable and infor-
mative. Indeed, some types of information may be gleaned from either approach
(e.g., Raju et al., 2002). However, we believe that IRT procedures provide the
most sophisticated methods for establishing equivalence between measures,
which is an important consideration for cross-cultural measurement. By equiva-
lence, we do not mean that the two measures must be identical but rather that
they must be free from item bias (Arnold & Matus, 2000; Butcher & Han, 1996;
Casillas & Robbins, 2005). If different versions of a measure are not equivalent,
one risks making a variety of faulty inferences about the meaning of such mea-
sures and their respective populations, such as inappropriate scaling, inaccurate
comparisons between groups, and faulty conclusions about groups (Hambleton
& Slater, 1997). Thus, if one wishes to adapt an existing measure to another lan-
guage and/or culture, IRT procedures may be indispensable. In contrast, if one
wishes to develop a new test in the target culture from the ground up, IRT proce-
dures may not be necessary, or even desirable. Of course, regardless of the meth-
ods used, the ultimate value of any analytical procedure is whether it can facilitate
the examination of the construct validity of a measure.
Future Research
By demonstrating that the two measures are similarly tapping the construct
measured by the GIS, these results complement the findings of Santos et al.
(2004), who showed that the GIS-P was unidimensional and resulted in a pattern
of convergent and discriminant relations with other measures similar to that dem-
onstrated by the original GIS. However, we realize that these analyses, by them-
selves, do not complete the process of construct validation of the GIS-P. Future
research with external criteria (e.g., GPA, retention rates) is needed to expand our
knowledge of the construct validity of this measure adaptation. As stated before,
Portugal is experiencing a high secondary school dropout rate. Unless interven-
tions are developed to curtail dropout, Portugal will have considerable difficulty
competing within the European Union as well as in the global economy. At the
same time, motivation appears to be a salient indicator of academic performance
and career readiness for Portuguese secondary school students (see Santos et al.,
2004). Thus, future research is needed to determine if interventions specifically
targeted to highlight motivational factors will help to improve students’ persis-
tence in achievement, their ability to set career goals, and their eventual pursuit
of postsecondary training and productive career placements.
Furthermore, research is needed to examine how best to assist Portuguese sec-
ondary school students in pursuing various forms of postsecondary education.
This may require assessment not only of Portuguese students themselves but also
of the Portuguese school system as well as other major influences on students’
academic development (e.g., family networks, peers, Portuguese media). Findings
from this type of research will be of considerable value in the development of
interventions for promoting academic achievement, retention, and the eventual
career placement of Portuguese students. Once this research is under way, it will
be interesting to examine whether some of the strategies used in the United States
(e.g., Robbins & Tucker, 1986) may be applied successfully to Portuguese
students.
REFERENCES
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Arnold, B. R., & Matus, Y. E. (2000). Test translation and cultural equivalence methodologies for
use with diverse populations. In I. Cuellar & F. A. Paniagua (Eds.), Handbook of multicultural
mental health (pp. 121-136). San Diego, CA: Academic Press.
Butcher, J. N., & Han, K. (1996). Methods of establishing cross-cultural equivalence. In
J. N. Butcher (Ed.), International adaptations of the MMPI-2: Research and clinical applications
(pp. 44-63). Minneapolis: University of Minnesota Press.
Casillas, A., & Robbins, S. B. (2005). Test adaptation and cross-cultural assessment from a business
perspective: Issues and recommendations. International Journal of Testing, 5(1), 5-21.
Cook, D., Casillas, A., Robbins, S., & Dougherty, L. (2005). Goal continuity and the “Big Five”
as predictors of older adult marital adjustment. Personality and Individual Differences, 38,
519-531.
Dodd, B. G., De Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with polyto-
mous items. Applied Psychological Measurement, 19, 5-22.
Eid, M., Langeheine, R., & Diener, E. (2003). Comparing typological structures across cultures by
multigroup latent class analysis: A primer. Journal of Cross-Cultural Psychology, 34, 195-210.
Elliott, T. R., Uswatte, G., Lewis, L., & Palmatier, A. (2000). Goal instability and adjustment to
physical disability. Journal of Counseling Psychology, 47, 251-265.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum.
Fundação da Juventude. (1999). Diplomados desempregados [Unemployed graduates]. Porto,
Portugal: Author.
Gerber, B., Smith, E.V., Girotti, M., Pelaez, L., Lawless, K., Smolin, L., et al. (2002). Using Rasch
measurement to investigate the cross-form equivalence and clinical utility of Spanish and
English versions of a diabetes questionnaire: A pilot study. Journal of Applied Measurement, 3,
243-271.
Gonzales, P., Calsyn, C., Jocelyn, L., Mak, K., Kastberg, D., Arafeh, S., et al. (2000, December).
Pursuing excellence: Comparisons of international eighth-grade mathematics and science achievement from a U.S. perspective, 1995 and 1999. Washington, DC: U.S. Department of Education,
National Center for Education Statistics.
Hambleton, R. K., & Slater, S. C. (1997). Item response theory models and testing practices:
Current international status and future directions. European Journal of Psychological Assessment,
13, 21-28.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel–Haenszel
procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence
Erlbaum.
Hui, C. H., & Triandis, H. C. (1985). Measurement in cross-cultural psychology: A review and
comparison of strategies. Journal of Cross-Cultural Psychology, 16, 131-152.
Hulin, C. L., Drasgow, F., & Komocar, J. (1982). Applications of Item Response Theory to analysis
of attitude scale translations. Journal of Applied Psychology, 67, 818-825.
International Test Commission. (2000). International guidelines for test use. Surrey, UK: Author.
Lei, P., Bassiri, D., & Schulz, E. M. (2003). A comparative evaluation of methods of adjusting GPA
for differences in grade assignment practices. Journal of Applied Measurement, 4, 70-86.
Lese, K., & Robbins, S. (1994). Relationship between goal attributes and the academic achieve-
ment of southeast Asian refugee adolescents. Journal of Counseling Psychology, 41, 45-52.
Linacre, J. M. (2002). Understanding Rasch measurement: Optimizing rating scale category effec-
tiveness. Journal of Applied Measurement, 3, 85-106.
Miller, E. G., Rotou, O., & Twing, J. S. (2004). Evaluation of the 0.3 logits screening criterion in
common item equating. Journal of Applied Measurement, 5, 172-177.
Ministry of Education. (2003). Insucesso e abandono escolar em Portugal [School failure and drop-
out in Portugal]. Retrieved April 15, 2004, from http://www.minedu.pt/Scripts/ASP/destaque/
estudo01/docs/sintese.pdf
Noble, J. P., Roberts, W. L., & Sawyer, R. L. (2003, April). Student achievement, behavior, percep-
tions, and other factors affecting college admissions test scores. Paper presented at the annual
meeting of the American Educational Research Association, Chicago.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of
methods based on confirmatory factor analysis and item response theory. Journal of Applied
Psychology, 87, 517-529.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Danish Institute for
Educational Research. Chicago: University of Chicago Press. (Original work published 1960)
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response
theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114,
552-566.
Robbins, S., Lauver, K., Le, H., Davis, D., Langley, R., & Carlstrom, A. (2004). Do psychosocial
and study skill factors predict college outcomes? A meta-analysis. Psychological Bulletin, 130,
261-288.
Robbins, S. B., & Patton, M. J. (1985). Self-psychology and career development: Construction of
the Superiority and Goal Instability scales. Journal of Counseling Psychology, 32, 221-231.
Robbins, S. B., Payne, E. C., & Chartrand, J. M. (1990). Goal instability and later life adjustment.
Psychology and Aging, 5, 447-450.
Robbins, S. B., & Schwitzer, A. M. (1988). Validity of the superiority and goal instability scales as
predictors of women’s adjustment to college life. Measurement and Evaluation in Counseling
Development, 21, 117-123.
Robbins, S. B., & Tucker, K. R., Jr. (1986). Relation of goal instability to self-directed and interac-
tional career counseling workshops. Journal of Counseling Psychology, 33, 418-424.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph, No. 17.
Santos, P. J., Casillas, A., & Robbins, S. B. (2004). Motivational correlates of Portuguese high
schoolers’ vocational identity: Cultural validation of the Goal Instability Scale. Journal of Career
Assessment, 12, 17-32.
Schulz, E. M., & Sun, A. (2001). Controlling for rater effects when comparing survey items with
incomplete Likert data. Journal of Applied Measurement, 2, 337-355.
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-
national consumer research. Journal of Consumer Research, 25, 78-90.
Thombs, D. (2002). Problem behavior and academic achievement among first-semester college
freshman. Journal of College Student Development, 36, 280-288.
Van de Vijver, F., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines.
European Psychologist, 1, 89-99.
Wright, B. D., & Linacre, J. M. (1991). A user’s guide to BIGSTEPS: Rasch-model computer pro-
gram. Chicago: Mesa Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.