Investigation of IRT-Based Equating Methods in the Presence of Outlier Common Items
Applied Psychological Measurement (http://apm.sagepub.com), published by SAGE Publications (http://www.sagepublications.com)
Common items with inconsistent b-parameter estimates may have a serious impact on item response theory (IRT)–based equating results. To find a better way to deal with outlier common items with inconsistent b-parameters, the current study investigated the comparability of 10 variations of four IRT-based equating methods (i.e., concurrent calibration, separate calibration with test characteristic curve [TCC] and mean/sigma [M/S] transformations, and calibration with fixed common item parameters [FCIP]) when outliers were either ignored or considered. Simulated data were generated for the common-item nonequivalent groups matrix design to reflect the manipulated factors: group ability differences between nonequivalent groups, number/score points of outliers, and types of outliers. When no outliers were present, the TCC and M/S transformations performed the best. When there were outliers, overall, the methods that considered them (except the M/S transformation with outliers weighted) resulted in a vast improvement compared with the methods that ignored them. Index terms: item response theory; equating; outliers; calibration; transformation.
Because many large-scale testing programs use unidimensional item response theory (IRT)
models to develop tests, the use of IRT-based equating methods has become more and more attrac-
tive to large-scale testing practitioners. Kolen and Brennan (2004) pointed out that a crucial aspect
of IRT applications is to study the robustness of the models to violations of the assumptions that
underlie their use. The basic assumptions of the commonly used IRT models are unidimensional-
ity, local independence, and nonspeededness (Hambleton & Murray, 1983; Lord, 1980). When
a given IRT model fits the test data of interest, two features are obtained: Examinee ability esti-
mates are not test dependent, and item parameter estimates are not group dependent. Several stud-
ies have been conducted to explore the effects of violating these assumptions on equating results
obtained using IRT models (e.g., Bolt, 1999; De Champlain, 1996; Dorans & Kingston, 1985;
Lee, Kolen, Frisbie, & Ankenmann, 2001; Yen, 1984). However, only a few studies involved the
assumption of item parameter invariance that is specific to equating with the common-item non-
equivalent groups design (e.g., Bejar & Wingersky, 1981; Cook, Eignor, & Hutton, 1979; Linn,
Levine, Hastings, & Wardrop, 1980; Stocking & Lord, 1983; Vukmirovic, Hu, & Turner, 2003).
In the common-item nonequivalent groups design, the IRT-based equating methods typically
involve two steps (Kolen & Brennan, 2004). First, item parameters of the reference and equated
tests are calibrated separately. Second, item parameter estimates from the equated test are scaled
onto the scale of the parameter estimates for the reference test using a linear transformation
Figure 1
Illustration of an Outlier in a Set of Common Items
(Two scatterplots of common-item a/b-parameter estimates, with "a/b – Equated" on the horizontal axes; in the left panel the points fall along a straight line, whereas in the right panel item i1 lies far from it. Graphic omitted.)
method such as the mean/mean (M/M; Loyd & Hoover, 1980), mean/sigma (M/S; Marco, 1977),
or characteristic curve (Haebara, 1980; Stocking & Lord, 1983) methods. Alternatively, the para-
meters of the common items can be held constant in the calibration of the equated test using the
parameters estimated in the reference test (referred to as fixed common item parameters [FCIP]).
As a result, the estimation of the characteristics of the unique items is constrained by the scale of
the common items (Hills, Subhiyah, & Hirsch, 1988; Li, Lissitz, & Yang, 1999; Vukmirovic et al.,
2003; Zenisky, 2001). An alternative procedure to the two-step procedure is IRT concurrent cali-
bration: Examinees’ responses from the two tests to be equated are combined as one data file, and
the parameters are estimated simultaneously. Thus, the parameter estimates are put onto one scale.
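As an illustration, the M/S step of such a linear transformation can be sketched as follows. The function name and arrays are hypothetical; the rescaling b* = Ab + B and a* = a/A follows the standard M/S formulas rather than any code from the study.

```python
import numpy as np

def mean_sigma_transform(b_equated, b_reference, a_equated=None):
    """Mean/sigma (M/S) scale transformation (after Marco, 1977).

    Estimates the linear coefficients A and B from the common items'
    b-parameter estimates and places the equated test's parameters on
    the reference scale. Illustrative sketch, not the study's code.
    """
    b_eq = np.asarray(b_equated, dtype=float)
    b_ref = np.asarray(b_reference, dtype=float)
    A = b_ref.std(ddof=1) / b_eq.std(ddof=1)   # slope: ratio of SDs
    B = b_ref.mean() - A * b_eq.mean()         # intercept
    out = {"A": A, "B": B, "b": A * b_eq + B}  # b* = A b + B
    if a_equated is not None:
        out["a"] = np.asarray(a_equated, dtype=float) / A  # a* = a / A
    return out
```

The same coefficients are then applied to every parameter of the equated test, not only to the common items.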
In the first step of the two-step procedure, it is expected that the item parameters, such as the
discrimination (a-parameter) and difficulty (b-parameter) parameters, of the common items will
be the same, within sampling error, if they are estimated separately from two randomly equivalent
groups. In contrast, it might be expected that the item parameters will be different but linearly
related if they are estimated from two nonequivalent groups. The guessing parameter (c-parameter),
if specified in the model, will remain the same regardless of the form of group equivalence
(Hambleton & Murray, 1983).
In the case of common-item nonequivalent groups, as indicated, the a- and b-parameters of
common items calibrated separately from two groups may be different. These differences are
due to the indeterminacy of estimation. Because both item parameters and ability parameters are
not known in an IRT model, the mean and standard deviation of the ability scale are often set to
0 and 1, respectively, to obtain the parameter estimates. That is, although the two groups may
differ in ability, the abilities for each group are scaled to have a common mean and standard
deviation. As a result, the common item parameters of two nonequivalent groups are expected to
have a linear relation and become the same once they are transformed to the same scale. If a scat-
terplot is made of the a- or b-parameter estimated from two equating groups, all the points (i.e.,
items) are expected to be located along a straight line or within a narrow band around this
straight line (see the left panel, Figure 1; Hambleton & Murray, 1983; Stocking & Lord, 1983).
However, some outlier items have been observed in practice that are located far away from the
straight line (see the right panel, Item i1, Figure 1) due to estimation errors, disclosure of some
of the common items, and differential curriculum emphasis (e.g., Stocking & Lord, 1983;
Vukmirovic et al., 2003).
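The "far from the straight line" judgment illustrated in Figure 1 can also be mimicked numerically. The sketch below is not a procedure from the article: it simply fits a least-squares line through the paired b-estimates and flags items whose standardized residual exceeds a cutoff; the function name and cutoff are assumptions.

```python
import numpy as np

def flag_outlier_common_items(b_equated, b_reference, z_cut=3.0):
    """Illustrative screen for outlier common items.

    Fits the least-squares line through the paired b-parameter
    estimates and flags items whose standardized residual exceeds
    z_cut, mimicking the visual judgment in Figure 1.
    """
    x = np.asarray(b_equated, dtype=float)
    y = np.asarray(b_reference, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)      # best-fit straight line
    resid = y - (slope * x + intercept)         # vertical distances
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return np.abs(z) > z_cut                    # True = candidate outlier
```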
When outliers are present in the common items, one may choose to ignore the outliers. However,
researchers have been aware of the effects of outliers with large inconsistent parameter estimates on
IRT-based equating (e.g., Bejar & Wingersky, 1981; Cohen & Kim, 1998; Cook et al., 1979; Hanson
& Feinstein, 1997; Linn et al., 1980; Stocking & Lord, 1983). For example, Stocking and Lord (1983)
pointed out that outliers with poorly estimated item difficulties might negatively affect the estimation
of the equating coefficients A and B when the M/M and M/S transformations were used. Vukmirovic
et al. (2003) found that fixing and not fixing the item parameters with inconsistent b-parameters led to
different equating results when the FCIP was employed.
To remove the possible negative effect of outliers, procedures have been proposed to modify the
M/M and M/S transformations. For example, Cook et al. (1979) restricted the range of the difficulties
used in computing moments. Bejar and Wingersky (1981) suggested giving smaller weights to the out-
liers. Linn et al. (1980) used weighted item difficulties where the weights were the inverse of the items’
squared standard errors. Stocking and Lord (1983) proposed an iterative procedure that employed both
Linn et al.’s and Bejar and Wingersky’s methods. Cohen and Kim (1998) extended Linn et al.’s proce-
dure to calculate the equating coefficients for polytomously scored items. However, it is not clear how
much these procedures improved the equating results. For example, Cook et al.’s method leads to the
deletion of outliers, thereby eliminating their negative effects. However, deletion of outliers may
adversely alter the content and statistical representativeness of the common items, which in turn may
adversely affect the equating results. Linn et al.’s and Cohen and Kim’s solutions may be useful when
the presence of outliers is due to item parameter estimation errors. However, other reasons, such as dis-
closure of some common items, may also produce outliers. In this case, it is not clear whether weight-
ing outliers will lead to a better equating result.
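A minimal sketch of such a weighted transformation, assuming inverse-squared-standard-error weights in the spirit of Linn et al. (1980); the function name and the way the two standard errors are combined are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def weighted_mean_sigma(b_eq, b_ref, se_eq, se_ref):
    """Weighted M/S coefficients: items with larger standard errors
    (potential outliers) contribute less to A and B. Sketch only."""
    b_eq = np.asarray(b_eq, dtype=float)
    b_ref = np.asarray(b_ref, dtype=float)
    # weight = inverse of the (combined) squared standard errors
    w = 1.0 / (np.asarray(se_eq, float) ** 2 + np.asarray(se_ref, float) ** 2)
    w = w / w.sum()
    m_eq, m_ref = np.sum(w * b_eq), np.sum(w * b_ref)       # weighted means
    s_eq = np.sqrt(np.sum(w * (b_eq - m_eq) ** 2))          # weighted SDs
    s_ref = np.sqrt(np.sum(w * (b_ref - m_ref) ** 2))
    A = s_ref / s_eq
    B = m_ref - A * m_eq
    return A, B
```

With equal standard errors the weights are uniform and the result reduces to the ordinary M/S coefficients.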
Few studies have addressed the issue of outliers when other IRT-based equating methods, such as
separate calibration with test characteristic curve (TCC) transformation, FCIP, or concurrent calibra-
tion, were employed. Vukmirovic et al. (2003) explored the effects of fixing and not fixing random
outliers using FCIP. However, how to deal with outliers when they appear nonrandomly (e.g., from
one domain or content area) is not clear. Theoretically, one may suggest removing outliers when no
harm would be done to the balance of content in the set of common items (Hanson & Feinstein,
1997). However, further systematic study needs to be conducted to obtain a clearer understanding on
which IRT-based equating method to use and how to deal with outliers under various conditions.
The purpose of this study, therefore, was to find a solution by investigating the comparability of
concurrent calibration, separate calibration with M/S and TCC (Stocking & Lord, 1983) transforma-
tions, and FCIP calibration when the effects of outliers with large inconsistent b-parameter estimates
were either ignored or considered. More specifically, four research questions were addressed: (1) Do
the IRT-based equating methods that consider the influence of outliers produce better results than
the IRT-based equating methods that do not consider the influence of outliers? (2) Is the effect found
in Question 1, if any, confounded by factors such as group ability differences, number/score points
of outliers (the use of the score point is to emphasize the open-ended response outlier items), and
types of outliers (i.e., location of outliers in terms of content coverage and size of b-parameter esti-
mates)? (3) Which of the IRT-based equating methods produces a better result, especially among
the IRT-based equating methods that consider the influence of outliers? (4) Is the effect found in
Question 3, if any, confounded by factors such as group ability differences, number/score points
of outliers, and types of outliers?
Method
In the current study, simulated data were generated under the common-item nonequivalent groups matrix design to reflect the manipulation of group ability differences, number/score points of the outliers, and types of outliers. Ten variations of IRT-based equating methods were used to equate the simulated data. The equating results produced by the 10 methods were compared and evaluated; as a result, the performance of the IRT-based equating methods in the presence of outliers was investigated.

Figure 2
Illustration of Common-Item Nonequivalent Groups Matrix Design
(Layout of Test Forms Y and X; graphic omitted.)
Equating Design
A simplified common-item nonequivalent groups matrix design with an external anchor test was
employed in the current study. Generally, in a matrix design (see Figure 2), the tests to be equated
are administered on two different test dates. On one test date (e.g., Year 1), multiple test forms (e.g.,
Form Y 1, Form Y 2, and Form Y 3) are administered to different student samples at the same time.
These multiple test forms include the same unique items (e.g., U1) but different sets of common
items, for example, C_1, C_2, and C_3. The unique items administered in Year 2 (e.g., U2) are dif-
ferent from those administered in Year 1. However, the same sets of common items are used. Thus,
examinees’ scores on U1 and U2 can be equated through common items in C_1, C_2, and C_3. The
advantage of this design over the simple common-item nonequivalent groups design, which includes
only one test form containing both unique items and a common set of items on one test date, is that
part of the common item set can be embedded into each test form in a matrix design without excessively prolonging the test administration time. Meanwhile, examinees are equally motivated to take
both unique and common items. If some examinees take the test next year, the possibility that they
will take the same common items that they did in the previous year is low given the multiple com-
mon item sets in the matrix design. As a result, it is possible to include open-ended response items in
the anchor test without risking test security. Finally, the total number of common items shared
between tests to be equated can be much more than the minimum number (e.g., 20 items or 20% of
the total number of items in a test) suggested by Angoff (1984) and Kolen and Brennan (2004).
Figure 3
Number and Types of Items in Each Test Form
(In Year 1, Forms Y_1, Y_2, and Y_3 each contain the same 26 MC, 5 SA, and 5 OR unique items plus a distinct common set of 8 MC, 1 SA, and 1 OR items, labeled A, B, and C, respectively; Test Form X follows the same layout with its own unique items. Graphic omitted.)
Because of these advantages, the common-item nonequivalent groups matrix design has been
employed in many large-scale testing programs (e.g., Minnesota Comprehensive Assessments, 2002;
Massachusetts Comprehensive Assessment System [2001 MCAS, 2001; 2002 MCAS, 2002]).
The results of the current study are intended to be generalized to equating large-scale achieve-
ment tests with mixed item formats using IRT-based equating methods. Thus, the common-item
nonequivalent groups matrix design was chosen to best simulate the equating of the 2001 and 2002
administrations of the Massachusetts Comprehensive Assessment System (MCAS) mathematics
tests. The simulation test forms included the same item formats as the real test did and covered the
five MCAS mathematics content areas. The number of items included in the simulation was based
on both the available item characteristics from the real mathematics test and the basic considerations
for test development (e.g., content relevance and representativeness and test administration time).
However, to make the study less complex and doable, a smaller number of subtest forms (i.e., 3)
were simulated (see Figure 3). The reduction in the numbers of subtest forms should not change the
key steps taken to equate tests under the common-item nonequivalent groups matrix design.
As shown in Figure 3, test forms Y and X, which were administered in 2 different years, needed
to be equated. Each test form included three subtests. There were 36 unique items in each subtest
of Form Y. Of these, 26 were multiple-choice (MC) items with two score categories (i.e., 0 and 1),
5 were short-answer (SA) items with two score categories (i.e., 0 and 1), and 5 were open-ended
response (OR) items with five score categories (i.e., 0, 1, 2, 3, and 4). Each subform contained the
same set of unique items but a different set of eight common MC items, one common SA item, and
one common OR item. The 30 (30 = 10 × 3) common items together represented the statistical,
content, and item format characteristics of the unique items. The same structure was used for Form
X but with a different set of 36 unique items. Using the 30 common items, the unique items in the
test forms Y and X were equated onto the same scale.
The simulated outliers would then be located on one side of the straight line. Theoretically, which side of the straight line
the outliers or remaining outliers are located should not change the findings of the current
study. Consequently, only the outliers located on the left side of the straight line (i.e., with smal-
ler b-parameters in the 2nd year) were investigated.
Six combinations of number/score points and types of outliers were examined: (a) There were
no outliers in the common items, (b) the outliers were three MC items with three score points
(i.e., the maximum score points for the three MC items was three) and were from one content
area, (c) the outliers were three MC items and were randomly distributed across the five content
areas, (d) the outliers were three MC items and with extreme b-parameter estimates (i.e., the range
of the b-parameter estimates from two equating groups was between –1.3979 and –3.6670),
(e) the outliers were five MC items and one OR item with nine score points and were from one
content area, and (f) the outliers were five MC items and one OR item with nine score points and
were randomly distributed across the five content areas. The first condition served as the baseline
against which the other conditions were compared. Conditions 2, 3, 5, and 6 reflected the increase
in the number/score points of outliers caused by the disclosure of some of the common items and
differential curriculum emphasis. The fourth condition represented situations in which estimation
error leads to the presence of outliers. The manipulation of the number/score points and types of
outliers was not only to simulate real tests but also to test the need for content representativeness
of the equating form (see Kolen & Brennan, 2004, p. 10).
Under the first condition, the number/score points and representativeness of the common items
were well controlled. Li et al. (1999) and Tate (2000) suggested that while determining the num-
ber of common items, a polytomously scored item could be treated as several dichotomously
scored items (= the number of score categories – 1). Because the tests simulated in the current
study contained open-ended response items, both the number of items and the score points of the
items were considered when developing the initial set of common items. The numbers of different
types of items are presented for each content area in Table 1. For example, in the content area of
number sense, there were 10 MC, 1 SA, and 1 OR unique items for each of Tests X and Y, and 9 MC, 2 SA, and 2 OR common items. The corresponding score points were 15
(10 × 1 + 1 × 1 + 1 × 4 = 15), 15, and 19. The distributions of items in Table 1 demonstrate that
the number of common items was considered on content grounds.
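The score-point bookkeeping described above, with an OR item counted as (number of score categories − 1) = 4 dichotomous items, is simple arithmetic; the sketch below, with hypothetical names, reproduces the number sense example.

```python
# Maximum score points per item format: MC and SA are dichotomous (0/1);
# an OR item with five categories (0-4) counts as 4 points (Li et al.,
# 1999; Tate, 2000). Names below are illustrative, not from the article.
MAX_POINTS = {"MC": 1, "SA": 1, "OR": 4}

def score_points(counts):
    """counts: dict such as {"MC": 10, "SA": 1, "OR": 1}."""
    return sum(n * MAX_POINTS[fmt] for fmt, n in counts.items())

# Number sense content area, as described in the text:
unique_number_sense = score_points({"MC": 10, "SA": 1, "OR": 1})  # 10 + 1 + 4 = 15
common_number_sense = score_points({"MC": 9, "SA": 2, "OR": 2})   # 9 + 2 + 8 = 19
```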
The means and standard deviations of the b-parameters of the unique and common items are
listed in Table 2. Because the unique and common items contained OR items, the means and stan-
dard deviations were calculated using two approaches. The first approach was based on the number
of items. In this approach, OR items were treated as one item. The location parameter, which repre-
sents the average difficulty of an OR item, was used to represent the b-parameter for that item. The
second approach was based on the score points. In this approach, an OR item was treated as four
dichotomously scored items. The four corresponding step parameters for the OR item were used to
represent the b-parameters. As indicated in the table, the mean difficulties of the two sets of unique
items and the three sets of common items were similar, which indicated that the difficulties of each
subtest form were similar and that the common items represented the unique items.
However, the representativeness of the common items might change after the types of outliers
and the equating methods are manipulated. For example, if the outliers were from one content area
and were removed while conducting equating, then the content representativeness, but not neces-
sarily the statistical representativeness, of common items would be violated. If the outliers were
randomly distributed across the five content areas, removing outliers may not violate the content
and statistical representativeness of common items. Including any type of the outliers in the equat-
ing analyses would not violate the content representativeness; however, the statistical representa-
tiveness of common items would change.
Table 1
The Number of Unique and Common Items in Each Content Area
Table 2
Descriptive Statistics for the b-Parameter Estimates of the Unique and Common Items
(Columns: Item set; number of Items, Mean, Standard Deviation; Score Points, Mean, Standard Deviation. Values omitted.)
The four methods that ignored the influence of outliers were concurrent calibration with outliers included, TCC transformation with outliers included, M/S transformation with outliers included, and FCIP calibration with outliers fixed. The
corresponding methods that considered the influence of outliers were named as concurrent calibra-
tion with outliers excluded, TCC transformation with outliers excluded, M/S transformation with
outliers excluded, M/S transformation with outliers weighted (Cohen & Kim, 1998; Linn et al.,
1980), FCIP calibration with outliers not fixed, and FCIP calibration with outliers excluded. Alto-
gether, 10 variations were investigated when outliers were present.
Other factors, such as sample size, IRT models used for the parameter estimation, and computer
programs used for the parameter estimation, may influence the equating results too. However,
because many studies have investigated these factors (e.g., Childs & Chen, 1999; Hanson & Beguin,
2002; Kolen & Brennan, 2004), they were fixed (i.e., not examined), as described in the following sections.
Sample size. The sample size for each subtest form was controlled at 2,000, which is large enough
to produce stable parameter estimates (e.g., Kolen & Brennan, 2004; Zeng, 1991). Consequently,
the total sample size for one test form (either Form X or Form Y) was 6,000 (2,000 × 3 = 6,000).
IRT models. Three IRT models—the three-parameter logistic model (3PL), the two-parameter logistic model (2PL), and the extended graded response model (GRM; Muraki & Bock, 1999)—were
chosen for modeling the data generated for test forms X and Y. Based on the observation that it is
always possible for an examinee to answer MC items correctly by guessing, the 3PL model was
used for modeling the MC items. The 2PL model was used for modeling the SA items. The use of
the 2PL model is based on the belief that the probability of answering an SA question correctly by guessing is close to zero and that it is reasonable to assume that the item discrimination parameters are
different. The extended GRM was used to model the OR items because (a) the scores for the OR
items are ordered; (b) the adjacent scores (e.g., 2, 3, and 4) can be collapsed as one category, if nec-
essary; and (c) it is meaningful to know the probability of obtaining a higher score rather than a lower score.
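For reference, the three response models can be written as small probability functions. This is a generic sketch of 3PL, 2PL, and graded-response-model curves; the scaling constant D = 1.7 is a common convention and an assumption here, not a value taken from the article.

```python
import math

D = 1.7  # conventional scaling constant for logistic IRT models (assumed)

def p_3pl(theta, a, b, c):
    """3PL correct-response probability (used here for MC items)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def p_2pl(theta, a, b):
    """2PL probability (SA items): the 3PL with guessing c = 0."""
    return p_3pl(theta, a, b, 0.0)

def grm_category_probs(theta, a, bs):
    """Graded response model category probabilities for an OR item.

    bs: ordered step parameters b_1 < ... < b_m. Returns
    P(X = 0), ..., P(X = m) as differences of cumulative curves.
    """
    cum = [1.0] + [1 / (1 + math.exp(-D * a * (theta - b))) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]
```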
Computer programs. Computer programs were thought to be one factor that might influence
the equating results (Hanson & Beguin, 2002). However, after comparing the performance of
MULTILOG (Thissen, 1991) and BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) in
the context of concurrent calibration and separate calibration with linear transformations, Hanson
and Beguin (2002) concluded that the two programs tended to perform similarly. Childs and Chen
(1999) found that although the MULTILOG and PARSCALE (Muraki & Bock, 1999) parameter-
ized the polytomous IRT models differently (e.g., MULTILOG directly produces only category/
step parameters, whereas PARSCALE produces an overall item location parameter and repro-
duces the category parameters by centering them to zero), similar parameter estimates were
obtained. Thus, only PARSCALE was used in the current study to estimate the parameters. The
PARSCALE control files were altered for the concurrent, separate, and FCIP calibrations.1
With real data, it is difficult to determine which IRT-based equating method produces a more accurate result because definite evaluation criteria are lacking. A simulation study can solve these problems. The equating results
of the IRT-based equating methods can be compared with the true scores that are known before
the simulated data are generated. Therefore, the accuracy of the performance of the IRT-based
equating methods can be compared.
To conduct the computer simulations, the item parameters have to be determined before gener-
ating the item response sample. To make the generated data similar enough to real data, all the
unique SA and OR item parameters and 90% of the unique MC item parameters estimated from
the MCAS (2001 MCAS, 2001; 2002 MCAS, 2002) Grade 4 mathematics tests were used to gener-
ate the item response sample. Likewise, except for the outliers, the item parameters for the com-
mon items used in the simulation came from the same tests. Once the item parameters were
determined, the computer simulations were completed in four steps:
1. For test form Y, an item response sample was generated for each of the six outlier condi-
tions that had an underlying theta distribution of NID (0, 1).
2. For test form X, an item response sample was generated for each of the six outlier condi-
tions with an underlying theta distribution of NID (0, 1); a second item response sample
was generated for each of the six outlier conditions with an underlying theta distribution of
NID (1, 1). The samples for the test form X were paired with the samples for the test form
Y to represent the 12 response conditions.
3. The response samples for the two test forms were calibrated and/or equated by the four
IRT-based equating methods or their 10 variations. The IRT true-score equating followed
each of these methods.
4. This process was replicated 50 times, which is thought to be sufficient to compare the
results obtained from Step 3 for each condition (Hanson & Beguin, 2002; Harwell, Stone,
Hsu, & Kirisci, 1996).
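Steps 1 and 2 above can be sketched as follows; the seed, the single-item 3PL generator, and all parameter values are illustrative assumptions, and Steps 3 and 4 are only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

def simulate_mc_responses(thetas, a, b, c, D=1.7):
    """Generate 0/1 responses to one MC item under the 3PL model."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (thetas - b)))
    return (rng.random(thetas.size) < p).astype(int)

# Steps 1-2: ability samples for the two forms; 2,000 examinees per subtest.
theta_y = rng.normal(0.0, 1.0, size=2000)        # Form Y group, NID(0, 1)
theta_x_equiv = rng.normal(0.0, 1.0, size=2000)  # Form X, equivalent group
theta_x_shift = rng.normal(1.0, 1.0, size=2000)  # Form X, NID(1, 1)

# Steps 3-4 (not shown): calibrate/equate each paired sample with the 10
# method variations, then repeat the whole process for 50 replications.
responses = simulate_mc_responses(theta_y, a=1.0, b=0.0, c=0.2)
```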
The mean square errors were decomposed into systematic errors (MSEb_SE and MSEt_SE) and random errors (MSEb_RE and MSEt_RE) (Gifford & Swaminathan, 1990). The systematic errors were calculated by

$$\mathrm{MSE}_{b\_SE} = \frac{\sum_{j=1}^{36} \sum_{k=0}^{m_j} \left( \bar{b}^{*}_{jk} - b_{jk} \right)^{2}}{51}$$

and

$$\mathrm{MSE}_{t\_SE} = \frac{\sum_{s=0}^{51} \left( \bar{t}^{*}_{s} - \tau_{s} \right)^{2}}{52}.$$

The random errors were calculated by

$$\mathrm{MSE}_{b\_RE} = \frac{\sum_{r=1}^{50} \sum_{j=1}^{36} \sum_{k=0}^{m_j} \left( b^{*}_{jkr} - \bar{b}^{*}_{jk} \right)^{2}}{50 \times 51}$$

and

$$\mathrm{MSE}_{t\_RE} = \frac{\sum_{r=1}^{50} \sum_{s=0}^{51} \left( t^{*}_{sr} - \bar{t}^{*}_{s} \right)^{2}}{50 \times 52},$$

where $b_{jk}$ and $\tau_{s}$ are the true b-parameters and number-correct true scores, $b^{*}_{jkr}$ and $t^{*}_{sr}$ are their estimates in replication $r$, and $\bar{b}^{*}_{jk}$ and $\bar{t}^{*}_{s}$ are the corresponding averages over the 50 replications.
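The error decomposition can be sketched generically: the systematic error compares the replication mean with the true values, and the random error measures spread around that mean. The function below illustrates this logic; it is a generic sketch, not the study's code.

```python
import numpy as np

def systematic_and_random_error(estimates, truth):
    """Systematic and random mean square errors.

    estimates: array of shape (R, K) -- R replications of K recovered
    values (e.g., b-parameters or number-correct true scores).
    truth: length-K array of generating values.
    """
    est = np.asarray(estimates, dtype=float)
    tru = np.asarray(truth, dtype=float)
    mean_est = est.mean(axis=0)                 # average over replications
    mse_se = np.mean((mean_est - tru) ** 2)     # systematic error
    mse_re = np.mean((est - mean_est) ** 2)     # random error
    return mse_se, mse_re
```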
Table 3
Mean Square Total, Systematic, and Random Errors of the IRT-Based Equating Methods: No Outliers
Note. S, M, and L represent the size of systematic errors. S refers to small, M refers to moderate, and L refers
to large. IRT = item response theory; NID = normal independent distribution; TCC = test characteristic
curve; M/S = mean/sigma; FCIP = fixed common item parameters.
than the M/S transformation because the former method had a smaller MSEt_SE value than the latter one. However, the difference between the two values is small and is likely due to sampling variability.
2. The MSEt_SE values for the concurrent calibration with outliers included and the TCC transformation with outliers included under the condition of three outliers with extreme values and equivalent groups were 0.6026 and 3.7795, respectively (see right side, Panel A1, Table 4). Based on the relative criteria, one may conclude that the concurrent calibration with outliers included performed better than the TCC transformation with outliers included because it had a smaller MSEt_SE value. Although the conclusion may sound reasonable for this case, it likely makes no sense in the following case.
3. The MSEt_SE values for the TCC transformation with outliers included and the M/S transformation with outliers included under the condition of outliers with nine score points from one content area and equivalent groups were 16.2440 and 19.5356, respectively (see right side, Panel A1, Table 5). Based on the relative criteria, one may conclude that the TCC transformation with outliers included performed better than the M/S transformation with outliers included. However, both values are large, perhaps too large. Consequently, neither the TCC nor the M/S transformation with outliers included would be recommended in this situation.
These examples led to the following question: What should the minimum sizes of MSEb_SE and MSEt_SE be to claim that a systematic error is small, moderate, or large? To answer this question and to make the discussion consistent, absolute rules for interpreting the sizes of the systematic errors for the b-parameters and the number-correct true scores were developed. The rules were based on the magnitude of the square roots of the systematic errors (referred to as bias), which represent the differences between the observed b-parameters or number-correct true scores and their corresponding true values. For the b-parameter, bias values of 0.2500 (one fourth of the standard deviation of the distribution of the b-parameters) and 0.5000 (one half of that standard deviation) were adopted as the cutoff scores. These values correspond to 0.0625 and 0.2500 in the metric of mean square errors. Consequently, the rules for MSEb_SE were as follows: (a) MSEb_SE ≤ 0.06: small; (b) 0.06 < MSEb_SE ≤ 0.25: moderate; and (c) MSEb_SE > 0.25: large. The MSEb_SE values were rounded to two decimal points to avoid situations in which a value is placed in a higher category due to a small difference from a cutoff value. The rules for MSEt_SE were as follows: (a) MSEt_SE ≤ 2.25: small; (b) 2.25 < MSEt_SE ≤ 6.25: moderate; and (c) MSEt_SE > 6.25: large. As for MSEb_SE, two decimal points were used in judging the size of MSEt_SE.

Table 4 (continued)
(Columns: Method; outliers From One Content, From Any Content, With Extreme Values. Values omitted.)

Table 5 (continued)
(Same column layout; values omitted.)
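These absolute rules are straightforward to apply mechanically; the sketch below (with hypothetical helper names) encodes the cutoffs and the round-to-two-decimals convention.

```python
def classify_mse(value, small_cut, moderate_cut):
    """Apply the absolute size rules, rounding to two decimals first
    so a tiny excess over a cutoff does not bump the category."""
    v = round(value, 2)
    if v <= small_cut:
        return "S"
    if v <= moderate_cut:
        return "M"
    return "L"

# b-parameter rules use cutoffs 0.06 and 0.25;
# number-correct true-score rules use 2.25 and 6.25.
def label_b(v):
    return classify_mse(v, 0.06, 0.25)

def label_t(v):
    return classify_mse(v, 2.25, 6.25)
```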
Results
transformation with outliers included had a large MSEt_SE value. Examination of the systematic errors in Panel B2, Table 4 revealed that the MSEb_SE and MSEt_SE values for the TCC and M/S transformations with outliers excluded were small, the MSEb_SE and MSEt_SE values for the FCIP calibration with outliers not fixed and excluded were moderate, and the MSEb_SE and MSEt_SE values for the concurrent calibration with outliers excluded and the M/S transformation with outliers weighted were large or moderate.
Outliers with nine score points. The systematic errors for the conditions of outliers with nine score points are summarized in Table 5. The purpose of having a condition of outliers with nine score points is to examine how the 10 variations of the IRT-based equating methods perform when the number/score points of outliers increase. The results are discussed in this section by comparing the systematic errors in Tables 4 and 5.
Comparison of the sizes of the systematic errors in Panel A1, Table 5 with those in Panel A1, Table 4 revealed that, for equivalent groups, the MSEb_SE and MSEt_SE systematic errors tended to increase when the number/score points of outliers increased. In contrast, comparison of the systematic errors in Panel A2, Table 5 with the corresponding errors in Panel A2, Table 4 showed that, with the exception of the M/S transformation with outliers weighted, the methods that considered the influence of outliers had small MSEb_SE and MSEt_SE values, as did the corresponding methods under the conditions of outliers with three score points. The M/S transformation with outliers weighted had a moderate MSEb_SE value and a large MSEt_SE value under the condition of outliers with nine score points but small MSEb_SE and MSEt_SE values under the conditions of outliers with three score points.
When the systematic errors in Panel B1, Table 5 were compared with those in Panel B1, Table
4, four patterns emerged: the MSEb _SE and MSEt _SE values for the concurrent calibration with
outliers included remained the same size as the number/score points of outliers increased; the
MSEb _SE and MSEt _SE values for the TCC and M/S transformations with outliers included
tended to increase as the number/score points of outliers increased; the MSEb _SE systematic
errors for the FCIP calibration with outliers fixed were moderate under the conditions of outliers
with three score points but small under the condition of outliers with nine score points; and the
MSEt _SE values for the FCIP calibration with outliers fixed were small regardless of the number/
score points of outliers. In contrast, the MSEb _SE and MSEt _SE values for the methods that con-
sidered the influence of outliers (Panel B2, Tables 4 and 5) remained the same size as the number
of outlier items and the number of score points of outliers increased.
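The comparisons above rest on mean-squared-error summaries of how far transformed item-difficulty estimates fall from their generating values. A minimal sketch of how such a criterion for b-parameters might be computed follows; the function name, the toy values, and the use of a single replication (rather than averaging over replications) are illustrative assumptions, not the article's implementation:

```python
import numpy as np

def mse_b(b_true, b_est, A, B):
    """Mean squared error between transformed b-estimates and generating values.

    b_est are difficulty estimates on the new-form scale; A and B are the
    slope and intercept of the linear scale transformation placing them on
    the old-form (base) scale.
    """
    b_transformed = A * np.asarray(b_est) + B
    return float(np.mean((b_transformed - np.asarray(b_true)) ** 2))

# Toy illustration: estimates uniformly shifted by 0.5
b_true = np.array([-1.0, 0.0, 1.0])
b_est = b_true + 0.5
print(mse_b(b_true, b_est, A=1.0, B=0.0))   # 0.25
print(mse_b(b_true, b_est, A=1.0, B=-0.5))  # 0.0
```

A criterion of this shape makes the tabled patterns interpretable: a transformation that absorbs the systematic shift drives the error toward zero, while outlier-contaminated coefficients leave a residual.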
When outliers were not present in the data set and the equating groups were equivalent, the
methods of concurrent calibration, TCC transformation, M/S transformation, and FCIP calibration
performed equally well. However, the same cannot be said when the two equating groups were not
equivalent. The four methods were sensitive, though not equally so, to the presence of nonequivalent
groups. These findings about the concurrent calibration and the TCC and M/S transformations are
consistent with the previous research (e.g., Hanson & Beguin, 2002).
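The TCC transformation discussed here is a characteristic-curve method in the Stocking and Lord (1983) sense: the slope A and intercept B of the scale transformation are chosen to minimize the squared distance between the two test characteristic curves computed over the common items. The sketch below illustrates that criterion for a 2PL model; the theta grid, the optimizer, and the simulated parameter values are illustrative assumptions, not the article's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def p_2pl(theta, a, b):
    """2PL response probabilities: rows are theta points, columns are items."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b[None, :])))

def stocking_lord(a_old, b_old, a_new, b_new, theta=np.linspace(-4, 4, 41)):
    """Find (A, B) minimizing the squared gap between test characteristic curves.

    New-form parameters are rescaled as a/A and A*b + B before the TCCs
    are compared (the Stocking-Lord criterion).
    """
    def loss(x):
        A, B = x
        tcc_old = p_2pl(theta, a_old, b_old).sum(axis=1)
        tcc_new = p_2pl(theta, a_new / A, A * b_new + B).sum(axis=1)
        return np.sum((tcc_old - tcc_new) ** 2)
    res = minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead")
    return res.x  # (A, B)

# Recovery check: new-form parameters generated from a known transformation
# (A = 1.2, B = 0.3) should be recovered approximately.
a_old = np.array([1.0, 1.2, 0.8])
b_old = np.array([-0.5, 0.0, 0.7])
A_hat, B_hat = stocking_lord(a_old, b_old, a_old * 1.2, (b_old - 0.3) / 1.2)
```

Because every common item's curve enters the summed TCC, a single outlier b-parameter pulls the minimizer away from the true coefficients, which is why excluding outliers improves this method in the results above.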
When outliers were present, under the equivalent groups condition, the methods that considered
the influence of outliers tended to have smaller systematic errors than the methods that did not con-
sider the influence of outliers, which indicated that the former methods performed better. Among
the methods that considered the influence of outliers, with the exception of the M/S transformation
with outliers weighted, the remaining methods produced small systematic errors regardless of the
number/score points and types of outliers, which indicated that these methods performed equally
well under the condition of equivalent groups.
When the equating groups were not equivalent, not all of the systematic errors for the methods
that considered the influence of outliers were smaller than the corresponding values for the meth-
ods that did not consider the influence of outliers. Thus, caution needs to be taken when drawing
conclusions about whether methods that consider the influence of outliers will perform better than
methods that do not consider the influence of outliers when the equating groups are not equivalent.
For the concurrent calibration, excluding the outliers did not reduce the systematic errors as one
would expect. Because the MSEb _SE value for the concurrent calibration was large, even when no
outliers were present, and the MSEt _SE was small for the concurrent calibration with outliers
included but moderate for concurrent calibration with outliers excluded, one may conclude that
whether the concurrent calibration performs well depends on multiple factors such as group equiv-
alence, outliers, and evaluation criteria. Among these, group equivalence is the most important fac-
tor. Although the FCIP calibration had smaller systematic errors than the concurrent calibration,
the same conclusion applies. For the TCC and M/S transformations, excluding outliers produced
small systematic errors. In contrast, including outliers resulted in moderate or large systematic
errors. The M/S transformation with outliers weighted produced large systematic errors under the
conditions of outliers with three and nine score points and nonequivalent equating groups. This last
observation is likely attributable to the weighting used in this method. As described previously, this
method uses the weighted item difficulties to calculate the equating coefficients, where the weights
are inversely proportional to the standard errors of the item difficulty estimates. Under the non-
equivalent groups conditions, one group had an ability distribution with mean 1 and standard deviation 1, which means the standard errors of the item difficulty estimates are large when the item
responses from this group are used. Unfortunately, this method uses these large standard errors to
weight the item difficulties, which in turn leads to large systematic errors.
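The weighting scheme described above can be made concrete. In a mean/sigma transformation, A and B come from the means and standard deviations of the common items' difficulty estimates on the two scales; in the weighted variant, each difficulty contributes in proportion to the reciprocal of its standard error. The helper below is a sketch under that reading; the function name, the exact form of the weights, and applying one set of weights to both scales are assumptions, not the article's specification:

```python
import numpy as np

def weighted_mean_sigma(b_old, b_new, se):
    """Mean/sigma equating coefficients with difficulties weighted by 1/SE.

    Returns (A, B) such that A * b_new + B lies on the old-form scale.
    When all SEs are comparable this reduces to the ordinary mean/sigma
    method; when the calibrating group yields uniformly large SEs, the
    weighted moments themselves become unstable, which is the failure
    mode discussed in the text.
    """
    w = 1.0 / np.asarray(se)
    w = w / w.sum()                      # normalize weights to sum to 1
    m_old = np.sum(w * b_old)
    m_new = np.sum(w * b_new)
    s_old = np.sqrt(np.sum(w * (b_old - m_old) ** 2))
    s_new = np.sqrt(np.sum(w * (b_new - m_new) ** 2))
    A = s_old / s_new
    B = m_old - A * m_new
    return A, B

# With equal SEs this reduces to the ordinary mean/sigma result (A = 1, B = 0.5)
b_old = np.array([-1.0, 0.0, 1.0])
b_new = np.array([-1.5, -0.5, 0.5])
A, B = weighted_mean_sigma(b_old, b_new, np.ones(3))
```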
The results of the current study revealed that there was an interaction among the IRT-based
equating methods, group equivalence, and number/score points of outliers. For the methods that
did not consider the influence of outliers, the MSEb _SE and MSEt _SE values tended to increase
as the number/score points of outliers increased under the conditions with equivalent groups. However, the same cannot be said for the corresponding methods under the conditions of nonequivalent
groups. For the methods that considered the influence of outliers, with the exception of M/S trans-
formation with outliers weighted, the sizes of the MSEb _SE and MSEt _SE values for the other
methods remained the same as the number/score points of outliers increased under equivalent and
nonequivalent groups conditions. Among the methods that considered the influence of outliers,
the TCC and M/S transformations with outliers excluded performed the best (i.e., with small
MSEb _SE and MSEt _SE values), regardless of the group equivalence and number/score points of
outliers.
Selecting common items is often based on both content and statistical grounds (Kolen & Bren-
nan, 2004). Many researchers have investigated the importance of the assumption of content and
statistical representativeness of common items (e.g., Cook & Petersen, 1987; Harris, 1991; Klein
& Jarjoura, 1985; Kromrey, Parshall, & Yi, 1998; Petersen, Marco, & Stewart, 1982; Yang, 1997).
However, the conclusions are not consistent. For example, Cook and Petersen (1987) reviewed
several studies that considered anchor test properties. They pointed out that content and statistical
representativeness was especially important when groups varied in ability. Yang (1997) found
that the accuracy of equating depended on the content representativeness of the anchor items, no
matter which equating method (Tucker linear or either of two IRT-based methods) was used to equate
two test forms. However, Harris (1991) examined content and statistical nonrepresentativeness
and found that content itself did not greatly influence equating results. The current study
investigated content and statistical representativeness from the perspective of dealing with com-
mon items that had outlier b-values. Because excluding the outliers that appear in only one content
area may violate the content representativeness of the common item set, it might be expected that
the systematic errors for the methods that excluded the outliers in one content area should be
greater than the systematic errors for the methods that excluded outliers randomly distributed
across the content areas. However, the similarity among the systematic errors under the conditions
of the different types of outliers suggests that the violation of the assumption of content representativeness of the common items did not influence the performance of the IRT-based equating
methods. The observation that the mean and standard deviation of the common items remained
essentially unchanged while the types of outliers changed may indicate that the statistical repre-
sentativeness of common items affects the equating results more directly than the content repre-
sentativeness of common items. This finding is consistent with the conclusions drawn by Harris
(1991) and Kromrey et al. (1998).
In the current study, it was found that content representativeness of common items had little
impact on the equating results. This conclusion is associated with the fact that the mean and stan-
dard deviation of the common items remained essentially unchanged, whereas the assumption of
content representativeness was violated. Further study is needed to investigate whether content
representativeness of common items has a direct causal effect on equating accuracy or whether its
effect is moderated by other factors such as statistical representativeness.
Absolute rules were proposed in the current study to distinguish small, moderate, and large sys-
tematic errors of b-parameters and number-correct true scores. The development of these rules
was somewhat subjective. More research is needed to determine whether these rules will hold in
other studies.
Notes
1. The sample control files for each of these methods are available from the first author.
2. The full set of results is available from the first author.
References

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.

Baker, F. B. (1996). An investigation of the sampling distributions of equating coefficients. Applied Psychological Measurement, 20(1), 45-57.

Bejar, I., & Wingersky, M. S. (1981). An application of item response theory to equating the Test of Standard Written English (College Board Report No. 81-8). Princeton, NJ: Educational Testing Service. (ETS No. 81-35)

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12(3), 383-407.

Childs, R. A., & Chen, W.-H. (1999). Obtaining comparable item parameter estimates in MULTILOG and PARSCALE for two polytomous IRT models. Applied Psychological Measurement, 23(4), 371-379.

Cohen, A. S., & Kim, S.-H. (1998). An investigation of linking methods under the graded response model. Applied Psychological Measurement, 22(2), 116-130.

Cook, L. L., Eignor, D. R., & Hutton, L. R. (1979, April). Considerations in the application of latent trait theory to objective-based criterion-referenced tests. Paper presented at the meeting of the American Educational Research Association, San Francisco.

Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11, 225-244.

De Champlain, A. F. (1996). The effect of multidimensionality on IRT true-score equating for subgroups of examinees. Journal of Educational Measurement, 33(2), 181-201.

Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22(4), 249-262.

Gifford, J. A., & Swaminathan, H. (1990). Bias and the effect of priors in Bayesian estimation of parameters of item response models. Applied Psychological Measurement, 14, 33-43.

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.

Hambleton, R. K., & Murray, L. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 71-94). British Columbia, Canada: Educational Research Institute of British Columbia.

Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1), 3-24.

Hanson, B. A., & Feinstein, Z. S. (1997). Application of a polynomial log linear model to assessing differential item functioning for common items in the common-item equating design (ACT Research Report Series 97-1). Iowa City, IA: American College Testing.

Harris, D. J. (1991, April). Equating with nonrepresentative common item sets and non-equivalent groups. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Harwell, M., Stone, C. A., Hsu, T.-C., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101-125.

Hills, J. R., Subhiyah, R. G., & Hirsch, T. M. (1988). Equating minimum-competency tests: Comparison of methods. Journal of Educational Measurement, 25(3), 221-231.

Ironson, G. H. (1983). Using item response theory to measure bias. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 155-174). British Columbia, Canada: Educational Research Institute of British Columbia.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices. New York: Springer.

Kromrey, J. D., Parshall, C. G., & Yi, Q. (1998, April). The effects of content representativeness and differential weighting on test equating: A Monte Carlo study. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Lee, G., Kolen, M. J., Frisbie, D. A., & Ankenmann, R. D. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25(4), 357-372.

Lehman, R. S., & Bailey, D. E. (1968). Digital computing: Fortran IV and its applications in behavioural science. New York: John Wiley.

Li, Y. H., Lissitz, R. W., & Yang, Y.-N. (1999, April). Estimating IRT equating coefficients for tests with polytomously and dichotomously scored items. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1980). An investigation of item bias in a test of reading comprehension (Tech. Rep. No. 163). Urbana: Center for the Study of Reading, University of Illinois.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179-193.

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139-160.

Minnesota Comprehensive Assessments Grade 3 & 5 Technical Manual. (2002). Retrieved November 26, 2005, from http://education.state.mn.us/mde/static/001879.pdf

Muraki, E., & Bock, R. D. (1999). PARSCALE: IRT item analysis and test scoring for rating-scale data (Version 3.5) [Computer software]. Chicago: Scientific Software.

Petersen, N. C., Marco, G. L., & Stewart, E. E. (1982). A test of the adequacy of linear score equating models. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 71-135). New York: Academic Press.

Psychometric Society. (1979). Publication policy regarding Monte Carlo studies. Psychometrika, 44, 133-134.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Tate, R. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37(4), 329-346.

Thissen, D. (1991). MULTILOG user's guide: Multiple, categorical item analysis and test scoring using item response theory (Version 6.0). New York: Springer.

2001 MCAS technical report. (2001). Retrieved November 26, 2005, from http://www.doe.mass.edu/mcas/2002/news/01techrpt.pdf

2002 MCAS technical report. (2002). Retrieved November 26, 2005, from http://www.doe.mass.edu/mcas/2003/news/02techrpt.pdf

Vukmirovic, Z., Hu, H., & Turner, J. C. (2003, April). The effects of outliers on IRT equating with fixed common item parameters. Paper presented at the meeting of the National Council on Measurement in Education, Chicago.

Wang, T.-Y., Hanson, B. A., & Harris, D. J. (2000). The effectiveness of circular equating as a criterion for evaluating equating. Applied Psychological Measurement, 24(3), 195-210.

Yang, W. (1997, April). The effects of content mix and equating method on the accuracy of test equating using anchor-item design. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125-145.

Zeng, L. (1991). Standard errors of linear equating for the single-group design (ACT Research Report 91-4). Iowa City, IA: American College Testing.

Zenisky, A. L. (2001, October). Investigating the accumulation of equating error in fixed common item parameter linking: A simulation study. Paper presented at the annual meeting of the Northeastern Educational Research Association, Kerhonkson, NY.

Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple group IRT analysis and test maintenance for binary items [Computer program]. Chicago: Scientific Software International.

Author's Address

Address all correspondence to Huiqin Hu, DRC, 13490 Bass Lake Road, Maple Grove, MN 55311; e-mail: hhu@datarecognitioncorp.com.