
Investigation of IRT-Based Equating Methods
in the Presence of Outlier Common Items
Huiqin Hu, Data Recognition Corporation
W. Todd Rogers, University of Alberta, Canada
Zarko Vukmirovic, Harcourt Assessment, Inc.

Common items with inconsistent b-parameter estimates may have a serious impact on item response theory (IRT)–based equating results. To find a better way to deal with the outlier common items with inconsistent b-parameters, the current study investigated the comparability of 10 variations of four IRT-based equating methods (i.e., concurrent calibration, separate calibration with test characteristic curve [TCC] and mean/sigma [M/S] transformations, and calibration with fixed common item parameters [FCIP]) when outliers were either ignored or considered. Simulated data were generated for the common-item nonequivalent groups matrix design to reflect the manipulated factors: group ability differences and nonequivalent groups, number/score points of outliers, and types of outliers. When no outliers were present, the TCC and M/S transformations performed the best. When there were outliers, overall, the methods that considered them (except the M/S transformation with outliers weighted) resulted in a vast improvement compared to the methods that ignored them. Index terms: item response theory; equating; outliers; calibration; transformation.

Because many large-scale testing programs use unidimensional item response theory (IRT)
models to develop tests, the use of IRT-based equating methods has become more and more attrac-
tive to large-scale testing practitioners. Kolen and Brennan (2004) pointed out that a crucial aspect
of IRT applications is to study the robustness of the models to violations of the assumptions that
underlie their use. The basic assumptions of the commonly used IRT models are unidimensional-
ity, local independence, and nonspeededness (Hambleton & Murray, 1983; Lord, 1980). When
a given IRT model fits the test data of interest, two features are obtained: Examinee ability esti-
mates are not test dependent, and item parameter estimates are not group dependent. Several stud-
ies have been conducted to explore the effects of violating these assumptions on equating results
obtained using IRT models (e.g., Bolt, 1999; De Champlain, 1996; Dorans & Kingston, 1985;
Lee, Kolen, Frisbie, & Ankenmann, 2001; Yen, 1984). However, only a few studies involved the
assumption of item parameter invariance that is specific to equating with the common-item non-
equivalent groups design (e.g., Bejar & Wingersky, 1981; Cook, Eignor, & Hutton, 1979; Linn,
Levine, Hastings, & Wardrop, 1980; Stocking & Lord, 1983; Vukmirovic, Hu, & Turner, 2003).
In the common-item nonequivalent groups design, the IRT-based equating methods typically
involve two steps (Kolen & Brennan, 2004). First, item parameters of the reference and equated
tests are calibrated separately. Second, item parameter estimates from the equated test are scaled
onto the scale of the parameter estimates for the reference test using a linear transformation

Applied Psychological Measurement, Vol. 32 No. 4, June 2008, 311–333


DOI: 10.1177/0146621606292215
© 2008 Sage Publications


Figure 1
Illustration of an Outlier in a Set of Common Items

[Two scatterplots of common-item a- or b-parameter estimates, with the equated form on the horizontal axis (a/b - Equated) and the reference form on the vertical axis (a/b - Ref), both ranging from –4 to 4. In the left panel, all items fall along the straight line; in the right panel, item i1 falls far from the line.]

method such as the mean/mean (M/M; Loyd & Hoover, 1980), mean/sigma (M/S; Marco, 1977),
or characteristic curve (Haebara, 1980; Stocking & Lord, 1983) methods. Alternatively, the para-
meters of the common items can be held constant in the calibration of the equated test using the
parameters estimated in the reference test (referred to as fixed common item parameters [FCIP]).
As a result, the estimation of the characteristics of the unique items is constrained by the scale of
the common items (Hills, Subhiyah, & Hirsch, 1988; Li, Lissitz, & Yang, 1999; Vukmirovic et al.,
2003; Zenisky, 2001). An alternative procedure to the two-step procedure is IRT concurrent cali-
bration: Examinees’ responses from the two tests to be equated are combined as one data file, and
the parameters are estimated simultaneously. Thus, the parameter estimates are put onto one scale.
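
To illustrate the scaling step, the sketch below shows one of the linear methods, the mean/sigma transformation: the slope A and intercept B are computed from the means and standard deviations of the common items' b-parameter estimates on the two forms, and the equated form's item parameters are then placed on the reference scale. This is a minimal sketch written for this article rather than code from any of the cited studies; the function names and example values are illustrative.

```python
import numpy as np

def mean_sigma_transform(b_common_ref, b_common_eq):
    """Mean/sigma (Marco, 1977) scaling constants.

    b_common_ref : b-parameter estimates of the common items on the reference form.
    b_common_eq  : b-parameter estimates of the same items on the equated form.
    Returns (A, B) such that b_on_reference_scale = A * b_eq + B.
    """
    A = np.std(b_common_ref, ddof=1) / np.std(b_common_eq, ddof=1)
    B = np.mean(b_common_ref) - A * np.mean(b_common_eq)
    return A, B

def rescale_item_parameters(a, b, A, B):
    """Put the equated form's a- and b-parameters onto the reference scale."""
    return a / A, A * b + B

# Usage: compute A and B from the common items, then rescale the whole equated form.
b_ref = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # illustrative values
b_eq = np.array([-0.9, -0.1, 0.4, 1.1, 1.9])
A, B = mean_sigma_transform(b_ref, b_eq)
```
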
In the first step of the two-step procedure, it is expected that the item parameters, such as the
discrimination (a-parameter) and difficulty (b-parameter) parameters, of the common items will
be the same, within sampling error, if they are estimated separately from two randomly equivalent
groups. In contrast, it might be expected that the item parameters will be different but linearly
related if they are estimated from two nonequivalent groups. The guessing parameter (c-parameter),
if specified in the model, will remain the same regardless of the form of group equivalence
(Hambleton & Murray, 1983).
In the case of common-item nonequivalent groups, as indicated, the a- and b-parameters of
common items calibrated separately from two groups may be different. These differences are
due to the indeterminacy of estimation. Because both item parameters and ability parameters are
not known in an IRT model, the mean and standard deviation of the ability scale are often set to
0 and 1, respectively, to obtain the parameter estimates. That is, although the two groups may
differ in ability, the abilities for each group are scaled to have a common mean and standard
deviation. As a result, the common item parameters of two nonequivalent groups are expected to
have a linear relation and become the same once they are transformed to the same scale. If a scat-
terplot is made of the a- or b-parameter estimated from two equating groups, all the points (i.e.,
items) are expected to be located along a straight line or within a narrow band around this
straight line (see the left panel, Figure 1; Hambleton & Murray, 1983; Stocking & Lord, 1983).
However, some outlier items have been observed in practice that are located far away from the
straight line (see the right panel, Item i1, Figure 1) due to estimation errors, disclosure of some
of the common items, and differential curriculum emphasis (e.g., Stocking & Lord, 1983;
Vukmirovic et al., 2003).


When outliers are present in the common items, one may choose to ignore the outliers. However,
researchers have been aware of the effects of outliers with large inconsistent parameter estimates on
IRT-based equating (e.g., Bejar & Wingersky, 1981; Cohen & Kim, 1998; Cook et al., 1979; Hanson
& Feinstein, 1997; Linn et al., 1980; Stocking & Lord, 1983). For example, Stocking and Lord (1983)
pointed out that outliers with poorly estimated item difficulties might negatively affect the estimation
of the equating coefficients A and B when the M/M and M/S transformations were used. Vukmirovic
et al. (2003) found that fixing and not fixing the item parameters with inconsistent b-parameters led to
different equating results when the FCIP was employed.
To remove the possible negative effect of outliers, procedures have been proposed to modify the
M/M and M/S transformations. For example, Cook et al. (1979) restricted the range of the difficulties
used in computing moments. Bejar and Wingersky (1981) suggested giving smaller weights to the out-
liers. Linn et al. (1980) used weighted item difficulties where the weights were the inverse of the items’
squared standard errors. Stocking and Lord (1983) proposed an iterative procedure that employed both
Linn et al.’s and Bejar and Wingersky’s methods. Cohen and Kim (1998) extended Linn et al.’s proce-
dure to calculate the equating coefficients for polytomously scored items. However, it is not clear how
much these procedures improved the equating results. For example, Cook et al.’s method leads to the
deletion of outliers, thereby eliminating their negative effects. However, deletion of outliers may
adversely alter the content and statistical representativeness of the common items, which in turn may
adversely affect the equating results. Linn et al.’s and Cohen and Kim’s solutions may be useful when
the presence of outliers is due to item parameter estimation errors. However, other reasons, such as dis-
closure of some common items, may also produce outliers. In this case, it is not clear whether weight-
ing outliers will lead to a better equating result.
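
As a rough sketch of the weighting idea behind Linn et al.'s (1980) procedure, the mean/sigma constants can be computed from weighted moments, with each common item down-weighted in proportion to its estimation error. The code below is an illustration written for this review, not the authors' implementation; in particular, combining the two forms' standard errors into a single inverse-variance weight is an assumption.

```python
import numpy as np

def weighted_mean_sigma(b_ref, b_eq, se_ref, se_eq):
    """Weighted mean/sigma scaling constants (after Linn et al., 1980).

    Inputs are NumPy arrays over the common items: b-parameter estimates and
    their standard errors from the reference and equated calibrations. Items
    with large standard errors (potential outliers) receive small weights.
    Returns (A, B) such that b_on_reference_scale = A * b_eq + B.
    """
    w = 1.0 / (se_ref ** 2 + se_eq ** 2)   # inverse of the combined squared standard errors
    w = w / w.sum()

    m_ref, m_eq = np.sum(w * b_ref), np.sum(w * b_eq)
    s_ref = np.sqrt(np.sum(w * (b_ref - m_ref) ** 2))
    s_eq = np.sqrt(np.sum(w * (b_eq - m_eq) ** 2))

    A = s_ref / s_eq
    B = m_ref - A * m_eq
    return A, B
```
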
Few studies have addressed the issue of outliers when other IRT-based equating methods, such as
separate calibration with test characteristic curve (TCC) transformation, FCIP, or concurrent calibra-
tion, were employed. Vukmirovic et al. (2003) explored the effects of fixing and not fixing random
outliers using FCIP. However, how to deal with outliers when they appear nonrandomly (e.g., from
one domain or content area) is not clear. Theoretically, one may suggest removing outliers when no
harm would be done to the balance of content in the set of common items (Hanson & Feinstein,
1997). However, further systematic study needs to be conducted to obtain a clearer understanding on
which IRT-based equating method to use and how to deal with outliers under various conditions.
The purpose of this study, therefore, was to find a solution by investigating the comparability of
concurrent calibration, separate calibration with M/S and TCC (Stocking & Lord, 1983) transforma-
tions, and FCIP calibration when the effects of outliers with large inconsistent b-parameter estimates
were either ignored or considered. More specifically, four research questions were addressed: (1) Do
the IRT-based equating methods that consider the influence of outliers produce better results than
the IRT-based equating methods that do not consider the influence of outliers? (2) Is the effect found
in Question 1, if any, confounded by factors such as group ability differences, number/score points
of outliers (the use of the score point is to emphasize the open-ended response outlier items), and
types of outliers (i.e., location of outliers in terms of content coverage and size of b-parameter esti-
mates)? (3) Which of the IRT-based equating methods produces a better result, especially among
the IRT-based equating methods that consider the influence of outliers? (4) Is the effect found in
Question 3, if any, confounded by factors such as group ability differences, number/score points
of outliers, and types of outliers?

Method
In the current study, simulated data were generated for the common-item nonequivalent groups
matrix design to reflect the manipulation of the group ability differences, number/score points


Figure 2
Illustration of Common-Item Nonequivalent Groups Matrix Design

Test Form Y (Year 1):
  Form Y_1: U1 + C_1
  Form Y_2: U1 + C_2
  Form Y_3: U1 + C_3
  ...

Test Form X (Year 2):
  Form X_1: U2 + C_1
  Form X_2: U2 + C_2
  Form X_3: U2 + C_3
  ...

of the outliers, and types of outliers. Ten variations of IRT-based equating methods were used to
equate the simulated data. The equating results produced by the 10 methods were compared and
evaluated; as a result, the performance of the IRT-based equating methods in the presence of out-
liers was investigated.

Equating Design
A simplified common-item nonequivalent groups matrix design with an external anchor test was
employed in the current study. Generally, in a matrix design (see Figure 2), the tests to be equated
are administered on two different test dates. On one test date (e.g., Year 1), multiple test forms (e.g.,
Form Y_1, Form Y_2, and Form Y_3) are administered to different student samples at the same time.
These multiple test forms include the same unique items (e.g., U1) but different sets of common
items, for example, C_1, C_2, and C_3. The unique items administered in Year 2 (e.g., U2) are dif-
ferent from those administered in Year 1. However, the same sets of common items are used. Thus,
examinees’ scores on U1 and U2 can be equated through common items in C_1, C_2, and C_3. The
advantage of this design over the simple common-item nonequivalent groups design, which includes
only one test form containing both unique items and a common set of items on one test date, is that
part of the common item set can be imbedded into each test form in a matrix design without exces-
sively prolonging the test administration time. Meanwhile, examinees are equally motivated to take
both unique and common items. If some examinees take the test next year, the possibility that they
will take the same common items that they did in the previous year is low given the multiple com-
mon item sets in the matrix design. As a result, it is possible to include open-ended response items in
the anchor test without risking test security. Finally, the total number of common items shared
between tests to be equated can be much more than the minimum number (e.g., 20 items or 20% of
the total number of items in a test) suggested by Angoff (1984) and Kolen and Brennan (2004).


Figure 3
Number and Types of Items in Each Test Form

                           Unique Items          Common Items
                           MC    SA    OR        MC    SA    OR
Test Form Y (Year 1)
  Form Y_1                 26Y   5Y    5Y        8A    1A    1A
  Form Y_2                 26Y   5Y    5Y        8B    1B    1B
  Form Y_3                 26Y   5Y    5Y        8C    1C    1C
  ...

Test Form X (Year 2)
  Form X_1                 26X   5X    5X        8A    1A    1A
  Form X_2                 26X   5X    5X        8B    1B    1B
  Form X_3                 26X   5X    5X        8C    1C    1C
  ...

Note. MC = multiple choice; SA = short answer; OR = open-ended response. Y/X mark the unique item sets; A, B, and C mark the common-item sets.

Because of these advantages, the common-item nonequivalent groups matrix design has been
employed in many large-scale testing programs (e.g., Minnesota Comprehensive Assessments, 2002;
Massachusetts Comprehensive Assessment System [2001 MCAS, 2001; 2002 MCAS, 2002]).
The results of the current study are intended to be generalized to equating large-scale achieve-
ment tests with mixed item formats using IRT-based equating methods. Thus, the common-item
nonequivalent groups matrix design was chosen to best simulate the equating of the 2001 and 2002
administrations of the Massachusetts Comprehensive Assessment System (MCAS) mathematics
tests. The simulation test forms included the same item formats as the real test did and covered the
five MCAS mathematics content areas. The number of items included in the simulation was based
on both the available item characteristics from the real mathematics test and the basic considerations
for test development (e.g., content relevance and representativeness and test administration time).
However, to make the study less complex and more feasible, a smaller number of subtest forms (i.e., three) was simulated (see Figure 3). The reduction in the number of subtest forms should not change the key steps taken to equate tests under the common-item nonequivalent groups matrix design.
As shown in Figure 3, test forms Y and X, which were administered in 2 different years, needed
to be equated. Each test form included three subtests. There were 36 unique items in each subtest
of Form Y. Of these, 26 were multiple-choice (MC) items with two score categories (i.e., 0 and 1),
5 were short-answer (SA) items with two score categories (i.e., 0 and 1), and 5 were open-ended
response (OR) items with five score categories (i.e., 0, 1, 2, 3, and 4). Each subform contained the
same set of unique items but a different set of eight common MC items, one common SA item, and
one common OR item. The 30 (30 = 10 × 3) common items together represented the statistical,
content, and item format characteristics of the unique items. The same structure was used for Form
X but with a different set of 36 unique items. Using the 30 common items, the unique items in the
test forms Y and X were equated onto the same scale.


Manipulated and Fixed Factors


The accuracy of IRT-based equating for the common-item nonequivalent groups matrix design
may be influenced by factors such as group ability differences, number/score points of outliers,
types of outliers, and equating method. In the current study, these factors were manipulated.
Group ability differences. Theoretically, in the common-item nonequivalent groups matrix
design, equating is only needed when the groups taking the tests are nonequivalent. However,
for the purpose of comparison, equating was conducted for the situations with equivalent and
nonequivalent groups in the current study. Samples of item responses for test form Y were gen-
erated by sampling the latent trait (θ) from a normal independent distribution (NID) with mean
0 and standard deviation 1 (NID (0, 1)). Two sets of item responses were generated for test form
X by sampling θ from an NID (0, 1) distribution and an NID (1, 1) distribution. The samples with
NID (0, 1) for both test forms were used to examine the case when the two equating groups were
equivalent. The samples with NID (0, 1) for test form Y and NID (1, 1) for test form X were used
to examine the case when the two groups were not equivalent. Including a condition of non-
equivalent groups with a 1–standard deviation ability difference is based on three considera-
tions: (a) A large ability difference can better show which equating methods are most sensitive
to group differences, (b) the findings of the current study can be compared with other similar
studies that often included the same condition (e.g., Hanson & Beguin, 2002), and (c) in real test
situations, especially high-stakes tests with several retests, it is often the case that the two non-
equivalent groups (i.e., retesters vs. first-time testers) have large ability differences.
Number/score points and types of outliers. As mentioned previously, when the b-parameters
of common items are estimated using data collected from two nonequivalent groups, the two sets
of b-parameters are supposed to have a linear relationship. In the scatterplot of item difficulties,
if two perpendicular straight lines are drawn from each item’s x-axis and y-axis position, the
intersection points of the two perpendicular lines are supposed to be on a straight line. Depar-
tures, if any, are due to measurement error. In real tests, however, outliers could be found on
either both sides or one side of the straight line. The departure distance to the straight line often
varies. In the current study, the outliers were constrained as follows: On the scatterplot, if the
distance between the intersection point and its presumed position on the straight line was equal
to or more than two score points, then this item was defined as an outlier. To operationalize this
definition, all the b-parameters for the outliers in Year 2 (i.e., Form X) were 2 score points lower
than in Year 1 (i.e., Form Y), which meant only the outliers located on the left side of the straight
line were investigated.
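
A simple screening routine consistent with this operational definition might look as follows. This is an illustrative sketch, not the authors' procedure; using the mean/sigma line as the reference line and a fixed threshold of 2 are assumptions that mirror the definition above.

```python
import numpy as np

def flag_outlier_common_items(b_ref, b_eq, threshold=2.0):
    """Flag common items whose b-parameters are inconsistent across forms.

    b_ref, b_eq : NumPy arrays of b-parameter estimates for the same common
                  items from the reference-form and equated-form calibrations.
    threshold   : minimum departure, on the reference scale, for an item to
                  be called an outlier (2.0 mirrors the definition above).
    Returns a boolean array; True marks an item flagged as an outlier.
    """
    # Place the equated-form difficulties on the reference scale (mean/sigma).
    A = np.std(b_ref, ddof=1) / np.std(b_eq, ddof=1)
    B = np.mean(b_ref) - A * np.mean(b_eq)
    b_eq_on_ref = A * b_eq + B

    # Departure of each item from the straight line; large departures mark outliers.
    return np.abs(b_eq_on_ref - b_ref) >= threshold
```
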
The following reasons were also considered while conceiving the operational definition of
outliers. In practice, the presence of outliers is often due to the revelation of some common
items, a change in the instructional emphasis of a certain content area, and item parameter esti-
mation error. The most plausible result of the exposure of some common items and subsequent
instructional emphasis on one content area represented by those items is that the corresponding
items become easier when they are administered in the 2nd year. However, as pointed out above,
outliers can be located on both sides of the straight line. In these cases, the possible effects of
outliers on equating results will be cancelled out if the outliers are located on both sides of the
straight line symmetrically. This observation makes theoretical sense, especially for the separate
calibration with M/M and M/S transformations. For the FCIP method, Vukmirovic et al. (2003)
found that the differences in equating results between including and excluding outliers that were
randomly distributed on both sides of the straight line were negligible in size compared to the
differences where outliers were located on one side of the straight line. In practice, it is rare
to find outliers that cancel each other completely. The outliers ‘‘left’’ after cancellation will


then be located on one side of the straight line. Theoretically, which side of the straight line
the outliers or remaining outliers are located should not change the findings of the current
study. Consequently, only the outliers located on the left side of the straight line (i.e., with smal-
ler b-parameters in the 2nd year) were investigated.
Six combinations of number/score points and types of outliers were examined: (a) There were
no outliers in the common items, (b) the outliers were three MC items with three score points
(i.e., the maximum score points for the three MC items was three) and were from one content
area, (c) the outliers were three MC items and were randomly distributed across the five content
areas, (d) the outliers were three MC items with extreme b-parameter estimates (i.e., the range
of the b-parameter estimates from two equating groups was between –1.3979 and –3.6670),
(e) the outliers were five MC items and one OR item with nine score points and were from one
content area, and (f) the outliers were five MC items and one OR item with nine score points and
were randomly distributed across the five content areas. The first condition served as the baseline
against which the other conditions were compared. Conditions 2, 3, 5, and 6 reflected the increase
in the number/score points of outliers caused by the disclosure of some of the common items and
differential curriculum emphasis. The fourth condition represented situations in which estimation
error leads to the presence of outliers. The manipulation of the number/score points and types of
outliers was not only to simulate real tests but also to test the need for content representativeness
of the equating form (see Kolen & Brennan, 2004, p. 10).
Under the first condition, the number/score points and representativeness of the common items
were well controlled. Li et al. (1999) and Tate (2000) suggested that while determining the num-
ber of common items, a polytomously scored item could be treated as several dichotomously
scored items (= the number of score categories – 1). Because the tests simulated in the current
study contained open-ended response items, both the number of items and the score points of the
items were considered when developing the initial set of common items. The numbers of different
types of items were presented for each content area in Table 1. For example, in the content area of
number sense, there were 10 MC, 1 SA, and 1 OR unique items for both Tests X and Y, respec-
tively, and 9 MC, 2 SA, and 2 OR common items. The corresponding score points were 15
(10 × 1 + 1 × 1 + 1 × 4 = 15), 15, and 19. The distributions of items in Table 1 demonstrate that
the number of common items was considered on content grounds.
The means and standard deviations of the b-parameters of the unique and common items are
listed in Table 2. Because the unique and common items contained OR items, the means and stan-
dard deviations were calculated using two approaches. The first approach was based on the number
of items. In this approach, OR items were treated as one item. The location parameter, which repre-
sents the average difficulty of an OR item, was used to represent the b-parameter for that item. The
second approach was based on the score points. In this approach, an OR item was treated as four
dichotomously scored items. The four corresponding step parameters for the OR item were used to
represent the b-parameters. As indicated in the table, the mean difficulties of the two sets of unique
items and the three sets of common items were similar, which indicated that the difficulties of each
subtest form were similar and that the common items represented the unique items.
However, the representativeness of the common items might change after the types of outliers
and the equating methods are manipulated. For example, if the outliers were from one content area
and were removed while conducting equating, then the content representativeness, but not neces-
sarily the statistical representativeness, of common items would be violated. If the outliers were
randomly distributed across the five content areas, removing outliers may not violate the content
and statistical representativeness of common items. Including any type of the outliers in the equat-
ing analyses would not violate the content representativeness; however, the statistical representa-
tiveness of common items would change.


Table 1
The Number of Unique and Common Items in Each Content Area

Content Area             Item                      MC   SA   OR   Number of Items   Score Points
Number sense             Unique items in Form X    10    1    1         12               15
                         Unique items in Form Y    10    1    1         12               15
                         Common items               9    2    2         13               19
Patterns, relations,     Unique items in Form X     6    1    1          8               11
  and functions          Unique items in Form Y     6    1    1          8               11
                         Common items               5    0    1          6                9
Statistics and           Unique items in Form X     5    1    1          7               10
  probability            Unique items in Form Y     5    1    1          7               10
                         Common items               5    0    0          5                5
Geometry                 Unique items in Form X     3    1    1          5                8
                         Unique items in Form Y     3    1    1          5                8
                         Common items               3    0    0          3                3
Measurement              Unique items in Form X     2    1    1          4                7
                         Unique items in Form Y     2    1    1          4                7
                         Common items               2    1    0          3                3
Total                    Unique items in Form X    26    5    5         36               51
                         Unique items in Form Y    26    5    5         36               51
                         Common items              24    3    3         30               39

Note. MC = multiple choice; SA = short answer; OR = open-ended response.

Table 2
Descriptive Statistics for the b-Parameter Estimates of the Unique and Common Items

                                   Based on the Number of Items       Based on the Number of Score Points
Item                               Items    Mean      Standard        Score Points    Mean      Standard
                                                      Deviation                                 Deviation
Unique items in Form Y              36     –0.3074     0.6902              51        –0.0595     1.1130
Unique items in Form X              36     –0.3061     0.7348              51        –0.0460     1.1648
Common items in Forms X and Y       30     –0.2942     0.7525              39        –0.1220     0.8599
Common items in Subtest 1           10     –0.2919     0.6388              13        –0.0963     0.8631
Common items in Subtest 2           10     –0.2888     0.8373              13        –0.1302     0.8622
Common items in Subtest 3           10     –0.3019     0.8459              13        –0.1395     0.9231

IRT-based equating methods. Four commonly used IRT-based equating methods—concurrent
calibration, separate calibration with TCC transformation, separate calibration with M/S transforma-
tion, and FCIP calibration—were used to equate the two test forms without outliers. To emphasize
the presence of outliers and the difference between the methods that did and did not consider the
influence of outliers, the four equating methods were renamed as concurrent calibration with outliers
included (i.e., the possible effects of outliers were ignored), TCC transformation with outliers


included, M/S transformation with outliers included, and FCIP calibration with outliers fixed. The
corresponding methods that considered the influence of outliers were named as concurrent calibra-
tion with outliers excluded, TCC transformation with outliers excluded, M/S transformation with
outliers excluded, M/S transformation with outliers weighted (Cohen & Kim, 1998; Linn et al.,
1980), FCIP calibration with outliers not fixed, and FCIP calibration with outliers excluded. Alto-
gether, 10 variations were investigated when outliers were present.
Other factors, such as sample size, IRT models used for the parameter estimation, and computer
programs used for the parameter estimation, may influence the equating results too. However,
because many studies have investigated these factors (e.g., Childs & Chen, 1999; Hanson & Beguin,
2002; Kolen & Brennan, 2004), they were fixed (i.e., not examined) as in the following sections.
Sample size. The sample size for each subtest form was controlled at 2,000, which is large enough
to produce stable parameter estimates (e.g., Kolen & Brennan, 2004; Zeng, 1991). Consequently,
the total sample size for one test form (either Form X or Form Y) was 6,000 (2,000 × 3 = 6,000).
IRT models. Three IRT models—the three-parameter logistic model (3PL), the two-parameter logistic model (2PL), and the extended graded response model (GRM; Muraki & Bock, 1999)—were chosen for modeling the data generated for test forms X and Y. Based on the observation that it is always possible for an examinee to answer MC items correctly by guessing, the 3PL model was used for modeling the MC items. The 2PL model was used for modeling the SA items. The use of the 2PL model is based on the belief that the probability of answering an SA question correctly by guessing is close to zero and that it is reasonable to assume that the item discrimination parameters differ across items. The extended GRM was used to model the OR items because (a) the scores for the OR items are ordered; (b) adjacent scores (e.g., 2, 3, and 4) can be collapsed into one category, if necessary; and (c) it is meaningful to know the probability of obtaining a higher score rather than a lower score.
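
For reference, the response functions of these models can be written compactly. The sketch below is illustrative code written for this article; the GRM is shown in its usual cumulative-logit form, and the exact PARSCALE parameterization of the extended GRM may differ. The 2PL is obtained from the 3PL by setting the guessing parameter to zero.

```python
import numpy as np

def p_3pl(theta, a, b, c=0.0, D=1.7):
    """3PL probability of a correct response; setting c = 0 gives the 2PL."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def p_grm(theta, a, b_steps, D=1.7):
    """Graded response model category probabilities for one polytomous item.

    b_steps: ordered step (threshold) difficulties, one per score boundary.
    Returns probabilities for scores 0, 1, ..., len(b_steps); they sum to 1.
    """
    b_steps = np.asarray(b_steps, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-D * a * (theta - b_steps)))  # P(score >= k)
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]

# Example: P(correct) on an MC item and score probabilities (0-4) on an OR item at theta = 0.5.
p_mc = p_3pl(0.5, a=1.2, b=0.1, c=0.2)
p_or = p_grm(0.5, a=1.1, b_steps=[-1.0, -0.2, 0.6, 1.4])
```
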
Computer programs. Computer programs were thought to be one factor that might influence
the equating results (Hanson & Beguin, 2002). However, after comparing the performance of
MULTILOG (Thissen, 1991) and BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) in
the context of concurrent calibration and separate calibration with linear transformations, Hanson
and Beguin (2002) concluded that the two programs tended to perform similarly. Childs and Chen
(1999) found that although the MULTILOG and PARSCALE (Muraki & Bock, 1999) parameter-
ized the polytomous IRT models differently (e.g., MULTILOG directly produces only category/
step parameters, whereas PARSCALE produces an overall item location parameter and repro-
duces the category parameters by centering them to zero), similar parameter estimates were
obtained. Thus, only PARSCALE was used in the current study to estimate the parameters. The
PARSCALE control files were altered for the concurrent, separate, and FCIP calibrations.1

Steps for the Computer Simulation


Computer simulations have been employed in many studies to investigate IRT-based equating
methods (e.g., Baker, 1996; Bolt, 1999; Cohen & Kim, 1998; Hanson & Beguin, 2002; Wang,
Hanson, & Harris, 2000). Lehman and Bailey (1968) pointed out that a computer simulation might
be conducted when an experimental study is too costly or impossible. The publication policy of
Psychometrika (Psychometric Society, 1979) indicates that simulation studies should be employed
only if the information cannot be reasonably obtained in other ways (e.g., in an analytical way).
These reasons supported the use of computer simulation in the current study. It is difficult to find
two real tests that need to be equated with all the types of outliers considered in the present study.
In contrast, different conditions of interest can be reflected in the simulated data or implied in the
simulation process. Furthermore, it is almost impossible to pursue a research question, such as


which IRT-based equating method produces a more accurate result using real data, due to the lack
of definite evaluation criteria. A simulation study can solve these problems. The equating results
of the IRT-based equating methods can be compared with the true scores that are known before
the simulated data are generated. Therefore, the accuracy of the performance of the IRT-based
equating methods can be compared.
To conduct the computer simulations, the item parameters have to be determined before gener-
ating the item response sample. To make the generated data similar enough to real data, all the
unique SA and OR item parameters and 90% of the unique MC item parameters estimated from
the MCAS (2001 MCAS, 2001; 2002 MCAS, 2002) Grade 4 mathematics tests were used to gener-
ate the item response sample. Likewise, except for the outliers, the item parameters for the com-
mon items used in the simulation came from the same tests. Once the item parameters were
determined, the computer simulations were completed in four steps:
1. For test form Y, an item response sample was generated for each of the six outlier condi-
tions that had an underlying theta distribution of NID (0, 1).
2. For test form X, an item response sample was generated for each of the six outlier condi-
tions with an underlying theta distribution of NID (0, 1); a second item response sample
was generated for each of the six outlier conditions with an underlying theta distribution of
NID (1, 1). The samples for the test form X were paired with the samples for the test form
Y to represent the 12 response conditions.
3. The response samples for the two test forms were calibrated and/or equated by the four
IRT-based equating methods or their 10 variations. The IRT true-score equating followed
each of these methods.
4. This process was replicated 50 times, which is thought to be sufficient to compare the
results obtained from Step 3 for each condition (Hanson & Beguin, 2002; Harwell, Stone,
Hsu, & Kirisci, 1996).
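
A minimal sketch of Steps 1 and 2 for the dichotomous common items is given below. It is written for this article, not taken from the study: the parameter values are invented, the polytomous OR items (modeled with the graded response model in the study) are omitted, and the outlier manipulation is reduced to shifting three common-item difficulties down by 2 units on Form X.

```python
import numpy as np

rng = np.random.default_rng(2008)

def simulate_3pl_responses(theta, a, b, c, D=1.7):
    """Generate a 0/1 response matrix (examinees in rows, items in columns) under the 3PL."""
    z = D * a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))
    return (rng.uniform(size=p.shape) < p).astype(int)

# Illustrative parameters for the 24 common MC items.
a_c = rng.uniform(0.7, 1.6, size=24)
b_c = rng.normal(-0.3, 0.75, size=24)
c_c = np.full(24, 0.2)

# Step 1: Form Y sample, ability ~ NID(0, 1).
theta_y = rng.normal(0.0, 1.0, size=2000)
resp_y = simulate_3pl_responses(theta_y, a_c, b_c, c_c)

# Step 2: Form X samples, ability ~ NID(0, 1) (equivalent groups) or NID(1, 1)
# (nonequivalent groups); for an outlier condition, three common items are made
# 2 units easier in Year 2 before responses are generated.
b_c_outlier = b_c.copy()
b_c_outlier[:3] -= 2.0
theta_x = rng.normal(1.0, 1.0, size=2000)
resp_x = simulate_3pl_responses(theta_x, a_c, b_c_outlier, c_c)
```
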

Evaluation of the IRT-Based Equating Methods


To evaluate the performance of IRT-based equating methods in the presence of outliers, the
unweighted mean square error for the b-parameters ($\mathrm{MSE}_b$) and the unweighted mean square error for the number-correct true scores ($\mathrm{MSE}_t$) were used. The formulas for each are as follows:

$$\mathrm{MSE}_b = \frac{\sum_{r=1}^{50}\sum_{j=1}^{36}\sum_{k=0}^{m_j}\left(b^{*}_{jkr} - b_{jk}\right)^{2}}{50 \times 51}$$

and

$$\mathrm{MSE}_t = \frac{\sum_{r=1}^{50}\sum_{s=0}^{51}\left(t^{*}_{sr} - \tau_{s}\right)^{2}}{50 \times 52},$$

where $m_j$ is the number of categories minus 1 for item $j$, $b^{*}_{jkr}$ is the b-parameter for score category $k$ of item $j$ in the equated test (Form X) for replication $r$, $b_{jk}$ is the true value of the b-parameter for score category $k$ of item $j$ in the equated test form X, $t^{*}_{sr}$ is the number-correct true score at score point $s$ in the equated test for replication $r$, and $\tau_{s}$ is the true number-correct true score at score point $s$. The mean square errors were further decomposed into systematic errors (MSEb_SE and MSEt_SE) and random errors (MSEb_RE and MSEt_RE) (Gifford & Swaminathan, 1990). The systematic errors were calculated by

$$\mathrm{MSE}_{b\_SE} = \frac{\sum_{j=1}^{36}\sum_{k=0}^{m_j}\left(\bar{b}^{*}_{jk} - b_{jk}\right)^{2}}{51}$$

and

$$\mathrm{MSE}_{t\_SE} = \frac{\sum_{s=0}^{51}\left(\bar{t}^{*}_{s} - \tau_{s}\right)^{2}}{52}.$$

The random errors were calculated by

$$\mathrm{MSE}_{b\_RE} = \frac{\sum_{r=1}^{50}\sum_{j=1}^{36}\sum_{k=0}^{m_j}\left(b^{*}_{jkr} - \bar{b}^{*}_{jk}\right)^{2}}{50 \times 51}$$

and

$$\mathrm{MSE}_{t\_RE} = \frac{\sum_{r=1}^{50}\sum_{s=0}^{51}\left(t^{*}_{sr} - \bar{t}^{*}_{s}\right)^{2}}{50 \times 52},$$

where $\bar{b}^{*}_{jk}$ is the mean of the b-parameters for score category $k$ of item $j$ in the equated test (Form X) across the 50 replications, and $\bar{t}^{*}_{s}$ is the mean of the number-correct true scores at score point $s$ across the 50 replications.
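
To make the error decomposition concrete, the following Python sketch (illustrative code written for this summary, not from the original study; array shapes and variable names are assumptions) computes the total, systematic, and random mean square errors for the b-parameters from a replications-by-parameters array of estimates and the vector of generating values. The same function applies to the number-correct true scores by passing the 50 × 52 array of equated true scores and the vector of true values.

```python
import numpy as np

def mse_decomposition(est, true_values):
    """Decompose total mean square error into systematic and random parts.

    est         : array of shape (n_reps, n_quantities), one row per replication.
    true_values : array of shape (n_quantities,), the generating (true) values.
    Returns (MSE, MSE_SE, MSE_RE); note that MSE = MSE_SE + MSE_RE exactly.
    """
    n_reps, n_quantities = est.shape
    est_bar = est.mean(axis=0)                      # mean estimate across replications

    mse = np.sum((est - true_values) ** 2) / (n_reps * n_quantities)
    mse_se = np.sum((est_bar - true_values) ** 2) / n_quantities      # bias (systematic) part
    mse_re = np.sum((est - est_bar) ** 2) / (n_reps * n_quantities)   # variance (random) part
    return mse, mse_se, mse_re

# Example with the study's dimensions: 50 replications, 51 b-parameters.
rng = np.random.default_rng(0)
b_true = rng.normal(-0.3, 0.75, size=51)
b_est = b_true + 0.1 + rng.normal(0, 0.05, size=(50, 51))   # constant bias plus noise
print(mse_decomposition(b_est, b_true))
```
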
Preliminary results revealed that as the MSEb and MSEt values increased, the corresponding MSEb_SE and MSEt_SE values also increased. In contrast, the corresponding MSEb_RE and MSEt_RE values did not change nearly as much. For example, when no outliers were present, the MSEb, MSEb_SE, and MSEb_RE values for the concurrent calibration under the equivalent groups condition were 0.0157, 0.0108, and 0.0049, respectively (see left side, Panel A, Table 3); the corresponding values for the concurrent calibration under the nonequivalent groups condition were 0.6907, 0.6836, and 0.0071, respectively (see left side, Panel B, Table 3). This example indicated that the changes in the values of MSEb were mainly attributable to changes in the systematic errors (MSEb_SE) rather than in the random errors (MSEb_RE). Moreover, theoretically, only systematic errors reflect the magnitude of the bias introduced by specific equating methods and do not decrease as the sample size increases (Kolen & Brennan, 2004). Consequently, the MSEb_SE and MSEt_SE were used to (a) evaluate the accuracy of the equating methods and (b) make comparisons among the methods. Given space limitations, only the systematic errors for the conditions with outliers are reported in Tables 4 and 5.2
The next question addressed was the following: How does one compare the values of MSEb SE
and MSEt _SE across the various conditions? The relative magnitudes of mean square error values
have been compared in the majority of previously published equating simulation studies to deter-
mine the relative accuracy of the equating methods considered. However, the use of relative mag-
nitudes of systematic errors in the current study was problematic. For example:
1. The MSEt _SE values for the M/S transformation and FCIP calibration in the presence of no
outliers and equivalent groups were 0.0859 and 0.0858, respectively (see Panel A, Table 3).
Based on relative criteria, it may be concluded that the FCIP calibration performed better


Table 3
Mean Square Total, Systematic, and Random Errors of the IRT-Based Equating Methods: No Outliers

                        Item Difficulty b                        Number-Correct True Score t
Method        MSEb      MSEb_SE       MSEb_RE        MSEt       MSEt_SE       MSEt_RE

A. Equivalent groups (NID (0,1) vs. NID (0,1))
Concurrent    0.0157    0.0108 (S)    0.0049         0.1550     0.0459 (S)    0.1091
TCC           0.0189    0.0108 (S)    0.0081         0.2285     0.0646 (S)    0.1638
M/S           0.0176    0.0110 (S)    0.0067         0.2754     0.0859 (S)    0.1895
FCIP          0.0244    0.0118 (S)    0.0127         0.2277     0.0858 (S)    0.1419

B. Nonequivalent groups (NID (0,1) vs. NID (1,1))
Concurrent    0.6907    0.6836 (L)    0.0071         2.7538     2.6989 (M)    0.0549
TCC           0.0699    0.0562 (S)    0.0136         1.6142     1.3888 (S)    0.2255
M/S           0.0288    0.0191 (S)    0.0097         0.3127     0.0960 (S)    0.2167
FCIP          0.1385    0.1302 (M)    0.0083         2.7546     2.7021 (M)    0.0525

Note. S, M, and L represent the size of systematic errors. S refers to small, M refers to moderate, and L refers to large. IRT = item response theory; NID = normal independent distribution; TCC = test characteristic curve; M/S = mean/sigma; FCIP = fixed common item parameters.

than the M/S transformation because the former method had a smaller MSEt SE value than
the latter one. However, the difference between the two values is small and is likely due to
sampling variability.
2. The MSEt _SE values for the concurrent calibration with outliers included and the TCC
transformation with outliers included under the condition of three outliers with extreme
values and equivalent groups were 0.6026 and 3.7795, respectively (see right side, Panel A1,
Table 4). Based on the relative criteria, one may conclude that the concurrent calibration with
outliers included performed better than the TCC transformation with outliers included
because it had a smaller MSEt SE value. Although the conclusion may sound reasonable for
this case, it likely makes no sense in the following case.
3. The MSEt _SE values for the TCC transformation with outliers included and the M/S
transformation with outliers included under the condition of outliers with nine score points
from one content area and equivalent groups were 16.2440 and 19.5356, respectively (see
right side, Panel A1, Table 5). Based on the relative criteria, one may conclude that the
TCC transformation with outliers included performed better than the M/S transformation
with outliers included. However, both values are large, perhaps too large. Consequently,
neither the TCC nor the M/S transformations with outliers included would be recom-
mended in this situation.

These examples led to the following question: What should the minimum size of MSEb SE and
MSEt SE be to claim the systematic error is small, moderate, or large? To answer this question and
to make the discussion consistent, absolute rules for interpreting the sizes of the systematic errors for
the b-parameters and the number-correct true scores were developed. The development of rules was
based on the magnitude of the square roots of the systematic errors (referred to as bias), which repre-
sents the difference between the observed b-parameters or the number-correct true scores and their


Table 4
Systematic Errors for the IRT-Based Equating Methods Under the Condition of Outliers With Three Score Points

                              Item Difficulty b                          Number-Correct True Score t
                      From One     From Any    With Extreme       From One     From Any     With Extreme
Method                Content      Content     Values             Content      Content      Values

A. Equivalent groups (NID (0,1) vs. NID (0,1))
A1. Methods that did not consider the influence of outliers
C + include       (S)  0.0154       0.0164       0.0137       (S)   0.7366       0.7601        0.6026
TCC + include     (M)  0.0907       0.0753       0.0805       (M)   3.6257       2.7929        3.7795
M/S + include     (M)  0.1179       0.1037       0.1658       (L)   9.5457       7.5302       16.4099
FCIP + fixed      (S)  0.0303       0.0346       0.0192       (S)   0.8197       1.0358        0.6189
A2. Methods that considered the influence of outliers
C + exclude       (S)  0.0106       0.0116       0.0120       (S)   0.0648       0.0413        0.0538
TCC + exclude     (S)  0.0104       0.0118       0.0120       (S)   0.0687       0.0927        0.0962
M/S + exclude     (S)  0.0105       0.0117       0.0120       (S)   0.1103       0.0727        0.0683
M/S + weight      (S)  0.0104       0.0117       0.0122       (S)   0.0944       0.0819        0.0958
FCIP + nofixed    (S)  0.0105       0.0105       0.0108       (S)   0.0733       0.0803        0.0711
FCIP + exclude    (S)  0.0107       0.0118       0.0122       (S)   0.0660       0.0625        0.0681

B. Nonequivalent groups (NID (0,1) vs. NID (1,1))
B1. Methods that did not consider the influence of outliers
C + include       (L)  0.5673       0.5583       0.6242       (S)   1.0825       1.0259        2.2091
TCC + include     (M)  0.1697       0.1923       0.0909       (M)   5.6282       6.0737        3.7821
M/S + include     (M)  0.0901       0.1334       0.1218       (L)   6.2887       9.6805       10.8171
FCIP + fixed      (M)  0.0718       0.0691       0.1001       (S)   1.6267       1.5823        2.2923
B2. Methods that considered the influence of outliers
C + exclude       (L)  0.7102       0.7099       0.7047       (M)   3.3607       3.3384        3.1658
TCC + exclude     (S)  0.0497       0.0642       0.0601       (S)   1.2560       1.5888        1.4682
M/S + exclude     (S)  0.0168       0.0181       0.0182       (S)   0.0726       0.0824        0.0860
M/S + weight      (L)  0.4545       0.4508       0.4521       (L)  34.5878      35.1571       34.4702
FCIP + nofixed    (M)  0.1473       0.1473       0.1394       (M)   3.3135       3.3267        3.0315
FCIP + exclude    (M)  0.1497       0.1485       0.1428       (M)   3.4004       3.3558        3.1325

Note. S, M, and L represent the size of systematic errors. S refers to small, M refers to moderate, and L refers to large. IRT = item response theory; NID = normal independent distribution; TCC = test characteristic curve; M/S = mean/sigma; FCIP = fixed common item parameters. C + include: concurrent calibration with outliers included; TCC + include: TCC transformation with outliers included; M/S + include: M/S transformation with outliers included; FCIP + fixed: FCIP calibration with outliers fixed; C + exclude: concurrent calibration with outliers excluded; TCC + exclude: TCC transformation with outliers excluded; M/S + exclude: M/S transformation with outliers excluded; M/S + weight: M/S transformation with outliers weighted; FCIP + nofixed: FCIP calibration with outliers not fixed; FCIP + exclude: FCIP calibration with outliers excluded.
Table 5
Systematic Errors for the IRT-Based Equating Methods Under the Condition of Outliers With Nine Score Points

                              Item Difficulty b                 Number-Correct True Score t
                      From One     From Any              From One     From Any
Method                Content      Content               Content      Content

A. Equivalent groups (NID (0,1) vs. NID (0,1))
A1. Methods that did not consider the influence of outliers
C + include       (S)  0.0454       0.0420           (M)   4.6003       4.5843
TCC + include     (L)  0.4750       0.3966           (L)  16.2440      16.1420
M/S + include     (L)  0.3198       0.3277           (L)  19.5356      26.8693
FCIP + fixed      (M)  0.1364       0.1372           (M)   4.4692       4.7382
A2. Methods that considered the influence of outliers
C + exclude       (S)  0.0107       0.0102           (S)   0.0440       0.0372
TCC + exclude     (S)  0.0107       0.0102           (S)   0.0445       0.0660
M/S + exclude     (S)  0.0107       0.0102           (S)   0.0750       0.0847
M/S + weight      (M)  0.2004       0.2448           (L)  38.3385      31.9253
FCIP + nofixed    (S)  0.0106       0.0090           (S)   0.0591       0.0541
FCIP + exclude    (S)  0.0109       0.0104           (S)   0.0586       0.0524

B. Nonequivalent groups (NID (0,1) vs. NID (1,1))
B1. Methods that did not consider the influence of outliers
C + include       (L)  0.3709       0.4005           (S)   0.3283       0.3633
TCC + include     (L)  0.6227       0.4727           (L)  20.3331      18.8192
M/S + include     (L)  0.2703       0.3476           (L)  14.3375      27.6667
FCIP + fixed      (S)  0.0252       0.0284           (S)   0.3161       0.4353
B2. Methods that considered the influence of outliers
C + exclude       (L)  0.7521       0.7597           (M)   4.5402       4.5644
TCC + exclude     (S)  0.0633       0.0507           (S)   1.6172       1.1761
M/S + exclude     (S)  0.0174       0.0185           (S)   0.0855       0.0770
M/S + weight      (L)  0.4229       0.9049           (L)  37.1089      32.7707
FCIP + nofixed    (M)  0.1839       0.1839           (M)   4.6466       4.6068
FCIP + exclude    (M)  0.1857       0.1873           (M)   4.6911       4.6822

Note. S, M, and L represent the size of systematic errors. S refers to small, M refers to moderate, and L refers to large. IRT = item response theory; NID = normal independent distribution; TCC = test characteristic curve; M/S = mean/sigma; FCIP = fixed common item parameters. C + include: concurrent calibration with outliers included; TCC + include: TCC transformation with outliers included; M/S + include: M/S transformation with outliers included; FCIP + fixed: FCIP calibration with outliers fixed; C + exclude: concurrent calibration with outliers excluded; TCC + exclude: TCC transformation with outliers excluded; M/S + exclude: M/S transformation with outliers excluded; M/S + weight: M/S transformation with outliers weighted; FCIP + nofixed: FCIP calibration with outliers not fixed; FCIP + exclude: FCIP calibration with outliers excluded.

corresponding true b-parameters or number-correct true scores. For the b-parameter, the bias values of 0.2500 (one fourth of the standard deviation of the distribution of the b-parameters) and 0.5000 (one half of the standard deviation of the distribution of the b-parameters) were adopted as the cutoff scores. These values correspond to 0.0625 and 0.2500 in the metric of mean square errors. Consequently, the rules for the MSEb_SE were as follows: (a) MSEb_SE ≤ 0.06: small, (b) 0.06 < MSEb_SE ≤ 0.25: moderate, and (c) MSEb_SE > 0.25: large. The MSEb_SE values were rounded to two decimal points to avoid the situations when an MSEb_SE value is placed in a higher category due to a small difference from a cutoff value. The rules for the MSEt_SE were as follows: (a) MSEt_SE ≤ 2.25: small, (b) 2.25 < MSEt_SE ≤ 6.25: moderate, and (c) MSEt_SE > 6.25: large. As for the case of MSEb_SE, two decimal points were used in judging the size of MSEt_SE.
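
These rules amount to squaring the bias cutoffs (0.25² ≈ 0.06 and 0.50² = 0.25 for the b-parameters; the true-score cutoffs 2.25 and 6.25 correspond to bias values of 1.5 and 2.5 score points) and rounding each systematic error to two decimals before comparing it with the cutoffs. A small helper that applies them might look like this (an illustration written for this summary, not code from the study).

```python
def classify_systematic_error(value, cutoffs=(0.06, 0.25)):
    """Label a systematic error as small (S), moderate (M), or large (L).

    cutoffs: (0.06, 0.25) for MSEb_SE; use (2.25, 6.25) for MSEt_SE.
    """
    v = round(value, 2)          # two-decimal rounding, per the rule described above
    if v <= cutoffs[0]:
        return "S"
    if v <= cutoffs[1]:
        return "M"
    return "L"

# Examples drawn from Table 4: 0.0691 rounds to 0.07 -> 'M'; 2.2091 rounds to 2.21 -> 'S'.
print(classify_systematic_error(0.0691))
print(classify_systematic_error(2.2091, cutoffs=(2.25, 6.25)))
```
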

Results

Comparison of Methods Under Conditions Without Outliers


As shown in Panel A, Table 3, the MSEb_SE and MSEt_SE values for each of the four equating methods were small when the two equating groups were equivalent. In contrast, when the two equating groups differed by one standard deviation of ability, the MSEb_SE and MSEt_SE values varied (see Panel B, Table 3). The TCC and M/S transformations had small MSEb_SE and MSEt_SE values, the FCIP calibration had moderate MSEb_SE and MSEt_SE values, and the concurrent calibration had a large MSEb_SE value but a moderate MSEt_SE value. These results indicate that the four equating methods performed equally accurately, as judged by the MSEb_SE and MSEt_SE criteria, under the condition of equivalent groups, but not when the two equating groups were nonequivalent. The four methods were sensitive, but not equally so, to the presence of nonequivalent groups.

Comparison of Methods Under Conditions With Outliers


Outliers with three score points. The first interesting finding in Table 4 is that the systematic errors of all the equating methods were similar across the conditions in which the outliers were from one content area, were randomly distributed across content areas, or had extreme values. The same finding was also observed for the conditions of outliers with nine score points. However, based on the assumption that the content representativeness of the common items is important, the expectation was that the systematic errors for the methods with outliers from one content area excluded would be greater than those for the methods with randomly distributed outliers excluded. Seemingly, this is not the case.
In contrast, the performance of the 10 variations of IRT-based equating methods was sensitive
to group equivalence. For equivalent groups (see Panel A1, Table 4), the concurrent calibration
with outliers included and the FCIP calibration with outliers fixed had small MSEb _SE and
MSEt _SE values, the TCC transformation with outliers included had moderate MSEb _SE and
MSEt _SE values, and the M/S transformation with outliers included had a moderate MSEb _SE
value but a large MSEt _SE value. In contrast, all the MSEb _SE and MSEt _SE values for the
methods that considered the influence of outliers were small (see Panel A2, Table 4).
For the nonequivalent groups (see Panel B1, Table 4), the TCC and M/S transformations with
outliers included and the FCIP calibration with outliers fixed had moderate MSEb _SE values, the
concurrent calibration with outliers included had a large MSEb _SE value, the concurrent calibra-
tion with outliers included and FCIP calibration with outliers fixed had small MSEt _SE values,
the TCC transformation with outliers included had a moderate MSEt _SE value, and the M/S


transformation with outliers included had a large MSEt _SE value. Examination of the systematic
errors in Panel B2, Table 4 revealed that the MSEb _SE and MSEt _SE values for the TCC and M/
S transformations with outliers excluded were small, the MSEb _SE and MSEt _SE values for the
FCIP calibration with outliers not fixed and excluded were moderate, and the MSEb _SE and
MSEt _SE values for the concurrent calibration with outliers excluded and M/S transformation
with outliers weighted were large or moderate.
Outliers with nine score points. The systematic errors for the conditions of outliers with nine
score points are summarized in Table 5. The purpose of having a condition of outliers with nine
score points is to examine how the 10 variations of IRT-based equating methods perform when the
number/score points of outliers increase. The results are discussed by comparing the systematic
errors in Tables 4 and 5 in this section.
Comparison of the sizes of systematic errors in Panel A1, Table 5 with the sizes of the system-
atic errors in Panel A1, Table 4 revealed that, for equivalent groups, the MSEb _SE and MSEt _SE
systematic errors tended to increase when the number/score points of outliers increased. In con-
trast, comparison of the systematic errors in Panel A2, Table 5 with the corresponding errors in
Panel A2, Table 4 showed that, with the exception of the M/S transformation with outliers
weighted, the methods that considered the influence of outliers had small MSEb _SE and
MSEt _SE values, as did the corresponding methods under the conditions of outliers with three
score points. The M/S transformation with outliers weighted had a moderate MSEb _SE value and
a large MSEt _SE value under the condition of outliers with nine score points but small MSEb _SE
and MSEt _SE values under the conditions of outliers with three score points.
When the systematic errors in Panel B1, Table 5 were compared with those in Panel B1, Table
4, it was found that the MSEb _SE and MSEt _SE values for the concurrent calibration with out-
liers included remained the same size as the number/score points of outliers increased, the
MSEb _SE and MSEt _SE values for the TCC and M/S transformations with outliers included
tended to increase as the number/score points of outliers increased, the MSEb _SE systematic
errors for the FCIP calibration with outliers fixed were moderate under the conditions of outliers
with three score points but small under the condition of outliers with nine score points, and the
MSEt _SE values for the FCIP calibration with outliers fixed were small regardless of the number/
score points of outliers. In contrast, the MSEb _SE and MSEt _SE values for the methods that con-
sidered the influence of outliers (Panel B2, Tables 4 and 5) remained the same size as the number
of outlier items and the number of score points of outliers increased.

Summary and Discussion

When outliers were not present in the data set and the equating groups were equivalent, the
methods of concurrent calibration, TCC transformation, M/S transformation, and FCIP calibration
performed equally well. However, the same cannot be said when the two equating groups were not
equivalent. The four methods were sensitive, but not equally, to the presence of nonequivalent
groups. These findings about the concurrent calibration and the TCC and M/S transformations are
consistent with the previous research (e.g., Hanson & Beguin, 2002).
When outliers were present, under the equivalent groups condition, the methods that considered
the influence of outliers tended to have smaller systematic errors than the methods that did not con-
sider the influence of outliers, which indicated that the former methods performed better. Among
the methods that considered the influence of outliers, with the exception of the M/S transformation
with outliers weighted, the remaining methods produced small systematic errors regardless of the


number/score points and types of outliers, which indicated that these methods performed equally
well under the condition of equivalent groups.
When the equating groups were not equivalent, not all of the systematic errors for the methods
that considered the influence of outliers were smaller than the corresponding values for the methods
that did not. Thus, caution is needed when concluding that methods that consider the influence of
outliers will outperform methods that do not when the equating groups are not equivalent.
For the concurrent calibration, excluding the outliers did not reduce the systematic errors as one
would expect. Because the MSEb_SE value for the concurrent calibration was large even when no
outliers were present, and the MSEt_SE value was small for the concurrent calibration with outliers
included but moderate for the concurrent calibration with outliers excluded, one may conclude that
whether the concurrent calibration performs well depends on multiple factors, such as group equiv-
alence, outliers, and evaluation criteria. Among these, group equivalence is the most important fac-
tor. Although the FCIP calibration had smaller systematic errors than the concurrent calibration,
the same conclusion applies. For the TCC and M/S transformations, excluding outliers produced
small systematic errors. In contrast, including outliers resulted in moderate or large systematic
errors. The M/S transformation with outliers weighted produced large systematic errors under the
conditions of outliers with three and nine score points and nonequivalent equating groups. This last
observation is likely attributable to the weighting used in this method. As described previously, this
method uses the weighted item difficulties to calculate the equating coefficients, where the weights
are inversely proportional to the standard errors of the item difficulty estimates. Under the non-
equivalent groups conditions, one group had an ability distribution with a mean of 1 and a standard
deviation of 1, which meant that the standard errors of the item difficulty estimates were large when
the item responses from this group were used. Unfortunately, this method used these large standard
errors to weight the item difficulties, which in turn led to large systematic errors.
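To make the weighting issue concrete, the following sketch implements a generic weighted mean/sigma linking in which each common item's difficulty contributes in proportion to the inverse of its summed squared standard errors. The weighting formula, function name, and argument names are assumptions made for illustration; the exact weights used by the M/S transformation with outliers weighted in this study may differ.

```python
import numpy as np

def weighted_mean_sigma(b_new, b_old, se_new, se_old):
    """Weighted mean/sigma linking coefficients (illustrative sketch).

    Assumed weights: inversely proportional to the summed squared
    standard errors of each common item's difficulty estimates.
    Returns (A, B) such that b on the old scale = A * b_new + B.
    """
    w = 1.0 / (se_new ** 2 + se_old ** 2)
    w = w / w.sum()
    m_new, m_old = np.sum(w * b_new), np.sum(w * b_old)
    s_new = np.sqrt(np.sum(w * (b_new - m_new) ** 2))
    s_old = np.sqrt(np.sum(w * (b_old - m_old) ** 2))
    A = s_old / s_new
    B = m_old - A * m_new
    return A, B

# When one group's responses yield uniformly large standard errors,
# the weights are driven largely by those errors rather than by item
# quality -- one plausible reading of the pattern discussed above.
```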
The results of the current study revealed that there was an interaction among the IRT-based
equating methods, group equivalence, and number/score points of outliers. For the methods that
did not consider the influence of outliers, the MSEb_SE and MSEt_SE values tended to increase
as the number/score points of outliers increased under the conditions with equivalent groups. How-
ever, the same cannot be said for the corresponding methods under the conditions of nonequivalent
groups. For the methods that considered the influence of outliers, with the exception of the M/S trans-
formation with outliers weighted, the sizes of the MSEb_SE and MSEt_SE values for the other
methods remained the same as the number/score points of outliers increased under the equivalent
and nonequivalent groups conditions. Among the methods that considered the influence of outliers,
the TCC and M/S transformations with outliers excluded performed the best (i.e., with small
MSEb_SE and MSEt_SE values), regardless of the group equivalence and number/score points of
outliers.
Selecting common items is often based on both content and statistical grounds (Kolen & Bren-
nan, 2004). Many researchers have investigated the importance of the assumption of content and
statistical representativeness of common items (e.g., Cook & Petersen, 1987; Harris, 1991; Klein
& Jarjoura, 1985; Kromrey, Parshall, & Yi, 1998; Petersen, Marco, & Stewart, 1982; Yang, 1997).
However, the conclusions are not consistent. For example, Cook and Petersen (1987) reviewed
several studies that considered anchor test properties. They pointed out that content and statistical
representativeness was especially important when groups varied in ability. Yang (1997) found
that the accuracy of equating depended on the content representativeness of the anchor items,
regardless of which equating method (Tucker linear or two IRT-based methods) was used to equate
two test forms. However, Harris (1991) examined content and statistical nonrepresentativeness
and found that content itself did not greatly influence equating results. The current study
investigated content and statistical representativeness from the perspective of dealing with com-
mon items that had outlier b-values. Because excluding the outliers that appear in only one content
area may violate the content representativeness of the common item set, it might be expected that
the systematic errors for the methods that excluded the outliers in one content area should be
greater than the systematic errors for the methods that excluded outliers randomly distributed
across the content areas. However, the similarity among the systematic errors under the conditions
of the different types of outliers suggests that the violation of the assumption of content representa-
tiveness of the common items did not influence the performance of the IRT-based equating meth-
ods. The observation that the mean and standard deviation of the common items remained
essentially unchanged while the types of outliers changed may indicate that the statistical repre-
sentativeness of common items affects the equating results more directly than the content repre-
sentativeness of common items. This finding is consistent with the conclusions drawn by Harris
(1991) and Kromrey et al. (1998).

Implications for Practice


The results of the current study reveal that if outliers with inconsistent b-parameter estimates
are detected in the common items, especially when the outliers drift in one direction, the influence
of such outliers should be removed. Violation of content representativeness should be of less con-
cern if removing the outliers does not change the statistical representativeness. The methods that can
be considered are the TCC and M/S transformations with outliers excluded. If the two equating
groups have similar ability distributions, the concurrent calibration with outliers excluded and the
FCIP calibration with outliers excluded or not fixed can also be considered. When the two equating
groups have a large difference in their ability distributions, the M/S transformation with outliers
weighted and the concurrent calibration with outliers excluded are not recommended.
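As a concrete, if simplified, illustration of this recommendation, the fragment below drops flagged outlier common items and then recomputes unweighted mean/sigma linking coefficients from the remaining set. The flagging input, the function name, and the use of the mean/sigma method in place of the TCC method are assumptions made for the sketch, not the operational procedure of any particular program; the content balance of the retained common items should still be reviewed separately.

```python
import numpy as np

def link_without_outliers(b_new, b_old, flagged):
    """Exclude flagged outlier common items, then relink with mean/sigma.

    b_new, b_old : arrays of common-item b-estimates from the new and
                   old forms (separately calibrated)
    flagged      : boolean array marking items judged to have
                   inconsistent b-estimates (hypothetical detection step)

    Returns (A, B) such that a new-form b placed on the old scale
    equals A * b + B.
    """
    keep = ~np.asarray(flagged, dtype=bool)
    b_n = np.asarray(b_new, dtype=float)[keep]
    b_o = np.asarray(b_old, dtype=float)[keep]
    A = b_o.std() / b_n.std()
    B = b_o.mean() - A * b_n.mean()
    return A, B
```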

Limitations and the Need for Further Research


The current study shares the limitations of a typical simulation study. The first limitation is how
realistic the simulated data are, even though efforts have been made to reduce the difference
between the simulated data and real data. In the present study, care was taken to ensure that the
simulated forms closely matched the actual test forms used to determine the item parameters.
However, future research is needed in which actual student responses are used to determine the fit
between the simulated results and actual results.
The second limitation of computer simulations is that often only a small number of conditions
are investigated. For example, the current study is limited to the investigation of IRT-based equat-
ing methods in the presence of outliers with inconsistent b-parameter estimates. The influence of
outliers with inconsistent a- and/or c-parameter estimates was not considered. The selection of the
b-parameter was based on the observations that poorly estimated item difficulties had a serious
impact on the equating results (Stocking & Lord, 1983) and that the a- and c-parameter estimates
are not as stable as the b-parameter estimates (Ironson, 1983). Research is needed to determine the
influence of outlier items defined by their values for the a-parameters, c-parameters, and item
characteristic curves that represent the interaction of item parameters.
The current study is also limited to the investigation of outliers located on the left side of the
straight line in the scatterplot of b-parameters and with the Year 2 b-parameters two score points
lower than the Year 1 b-parameters. Future work should examine the effects of outliers that fall on
both sides or on the right side of the line and whose distance from it is more or less than two
score points.
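For future work along these lines, the outlier manipulation can be parameterized directly. The sketch below is a hypothetical generator, not the simulation code used in this study: a negative shift reproduces the left-side pattern examined here (Year 2 difficulties lower than Year 1), whereas a positive shift or a mixture of signs would produce the right-side and two-sided conditions suggested above.

```python
import numpy as np

def perturb_common_items(b_year2, outlier_idx, shift=-2.0):
    """Return a copy of the Year 2 b-values with selected common items
    shifted so that they act as outliers.

    shift : size and direction of the b-shift; -2.0 mimics the
            two-score-point downward drift studied here, and other
            values cover the distances proposed for future research.
    """
    b = np.array(b_year2, dtype=float)
    b[np.asarray(outlier_idx, dtype=int)] += shift
    return b

# Example: shift three of ten hypothetical common items downward.
b2 = np.linspace(-1.5, 1.5, 10)
print(perturb_common_items(b2, [2, 5, 7]))
```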


In the current study, it was found that content representativeness of common items had little
impact on the equating results. This conclusion is associated with the fact that the mean and stan-
dard deviation of the common items remained essentially unchanged even when the assumption of
content representativeness was violated. Further study is needed to investigate whether content
representativeness of common items has a direct causal effect on equating accuracy or whether its
effect is moderated by other factors such as statistical representativeness.
Absolute rules were proposed in the current study to distinguish small, moderate, and large sys-
tematic errors of b-parameters and number-correct true scores. The development of these rules
was somewhat subjective. More research is needed to determine whether these rules will hold in
other studies.

Notes

1. The sample control files for each of these methods are available from the first author.
2. The full set of results is available from the first author.

References

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.

Baker, F. B. (1996). An investigation of the sampling distributions of equating coefficients. Applied Psychological Measurement, 20(1), 45-57.

Bejar, I., & Wingersky, M. S. (1981). An application of item response theory to equating the Test of Standard Written English (College Board Report No. 81-8). Princeton, NJ: Educational Testing Service. (ETS No. 81-35)

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12(3), 383-407.

Childs, R. A., & Chen, W.-H. (1999). Obtaining comparable item parameter estimates in MULTILOG and PARSCALE for two polytomous IRT models. Applied Psychological Measurement, 23(4), 371-379.

Cohen, A. S., & Kim, S.-H. (1998). An investigation of linking methods under the graded response model. Applied Psychological Measurement, 22(2), 116-130.

Cook, L. L., Eignor, D. R., & Hutton, L. R. (1979, April). Considerations in the application of latent trait theory to objective-based criterion-referenced tests. Paper presented at the meeting of the American Educational Research Association, San Francisco.

Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11, 225-244.

De Champlain, A. F. (1996). The effect of multidimensionality on IRT true-score equating for subgroups of examinees. Journal of Educational Measurement, 33(2), 181-201.

Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22(4), 249-262.

Gifford, J. A., & Swaminathan, H. (1990). Bias and the effect of priors in Bayesian estimation of parameters of item response models. Applied Psychological Measurement, 14, 33-43.

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.

Hambleton, R. K., & Murray, L. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 71-94). British Columbia, Canada: Educational Research Institute of British Columbia.

Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1), 3-24.

Hanson, B. A., & Feinstein, Z. S. (1997). Application of a polynomial log linear model to assessing differential item functioning for common items in the common-item equating design (ACT Research Report Series 97-1). Iowa City, IA: American College Testing.

Harris, D. J. (1991, April). Equating with nonrepresentative common item sets and non-equivalent groups. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Harwell, M., Stone, C. A., Hsu, T.-C., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101-125.

Hills, J. R., Subhiyah, R. G., & Hirsch, T. M. (1988). Equating minimum-competency tests: Comparison of methods. Journal of Educational Measurement, 25(3), 221-231.

Ironson, G. H. (1983). Using item response theory to measure bias. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 155-174). British Columbia, Canada: Educational Research Institute of British Columbia.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices. New York: Springer.

Kromrey, J. D., Parshall, C. G., & Yi, Q. (1998, April). The effects of content representativeness and differential weighting on test equating: A Monte Carlo study. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Lee, G., Kolen, M. J., Frisbie, D. A., & Ankenmann, R. D. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25(4), 357-372.

Lehman, R. S., & Bailey, D. E. (1968). Digital computing: Fortran IV and its applications in behavioural science. New York: John Wiley.

Li, Y. H., Lissitz, R. W., & Yang, Y.-N. (1999, April). Estimating IRT equating coefficients for tests with polytomously and dichotomously scored items. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1980). An investigation of item bias in a test of reading comprehension (Tech. Rep. No. 163). Urbana: Center for the Study of Reading, University of Illinois.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179-193.

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139-160.

Minnesota Comprehensive Assessments Grade 3 & 5 Technical Manual. (2002). Retrieved November 26, 2005, from http://education.state.mn.us/mde/static/001879.pdf

Muraki, E., & Bock, R. D. (1999). PARSCALE: IRT item analysis and test scoring for rating-scale data (Version 3.5) [Computer software]. Chicago: Scientific Software.

Petersen, N. C., Marco, G. L., & Stewart, E. E. (1982). A test of the adequacy of linear score equating models. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 71-135). New York: Academic Press.

Psychometric Society. (1979). Publication policy regarding Monte Carlo studies. Psychometrika, 44, 133-134.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Tate, R. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37(4), 329-346.

Thissen, D. (1991). MULTILOG user's guide: Multiple, categorical item analysis and test scoring using item response theory (Version 6.0). New York: Springer.

2001 MCAS technical report. (2001). Retrieved November 26, 2005, from http://www.doe.mass.edu/mcas/2002/news/01techrpt.pdf

2002 MCAS technical report. (2002). Retrieved November 26, 2005, from http://www.doe.mass.edu/mcas/2003/news/02techrpt.pdf

Vukmirovic, Z., Hu, H., & Turner, J. C. (2003, April). The effects of outliers on IRT equating with fixed common item parameters. Paper presented at the meeting of the National Council on Measurement in Education, Chicago.

Wang, T.-Y., Hanson, B. A., & Harris, D. J. (2000). The effectiveness of circular equating as a criterion for evaluating equating. Applied Psychological Measurement, 24(3), 195-210.

Yang, W. (1997, April). The effects of content mix and equating method on the accuracy of test equating using anchor-item design. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125-145.

Zeng, L. (1991). Standard errors of linear equating for the single-group design (ACT Research Report 91-4). Iowa City, IA: American College Testing.

Zenisky, A. L. (2001, October). Investigating the accumulation of equating error in fixed common item parameter linking: A simulation study. Paper presented at the annual meeting of the Northeastern Educational Research Association, Kerhonkson, NY.

Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple group IRT analysis and test maintenance for binary items [Computer program]. Chicago: Scientific Software International.

Author's Address

Address all correspondence to Huiqin Hu, DRC, 13490 Bass Lake Road, Maple Grove, MN 55311; e-mail: hhu@datarecognitioncorp.com.
