
Through use of the Rasch model, all items are assumed to have discriminating power equal to that of the ideal ICC. Therefore, all items should have infit mean square (INFIT MNSQ) values equal to unity, or within a predetermined range, regardless of the group of students used. However, some items may record INFIT MNSQ values outside the predetermined range, depending on the subgroup of the general population being tested. Such items are considered biased, because they do not discriminate equally for all subgroups of the general population being tested.


The usually reported fit statistics focus on two aspects of fit (infit and outfit), each of which is routinely reported in both an unstandardized form (mean squares) and a standardized form (t or Z).
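As a concrete illustration of the unstandardized forms: outfit is the plain mean of squared standardized residuals, while infit weights each squared residual by its information (variance), which is largest for persons near the 0.5 probability level. The following sketch, assuming a dichotomous Rasch model (function and variable names are illustrative, not from the study), computes both statistics for a single item:

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (infit_mnsq, outfit_mnsq) for one item.

    responses -- 0/1 scores of each person on this item
    thetas    -- person ability estimates (logits)
    b         -- item difficulty estimate (logits)
    """
    infit_num = infit_den = outfit_sum = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        w = p * (1.0 - p)                # residual variance (information)
        outfit_sum += (x - p) ** 2 / w   # outfit: unweighted mean of squared z
        infit_num += (x - p) ** 2        # infit: information-weighted mean
        infit_den += w
    return infit_num / infit_den, outfit_sum / len(responses)
```

Because the weights concentrate on persons near 50% probability, infit is less sensitive than outfit to a few surprising responses from persons far from the item's difficulty.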

One of the key item fit statistics is the infit mean square (INFIT MNSQ).
The infit mean square measures the consistency of fit of the students to the
item characteristic curve for each item with weighted consideration given to
those persons close to the 0.5 probability level. The acceptable range of the
infit mean squares statistic for each item in this study was taken to be from 0.77 to 1.30 (Adams & Khoo,
1993). Values above 1.30 indicate that an item does not discriminate well, whereas values below 0.77 indicate that an item provides redundant information. Hence, consideration must be given to excluding items that fall outside this range. In
calibration, items that do not fit the Rasch model and which are outside the
acceptable range must be deleted from the analysis (Rentz & Bashaw, 1975;
Wright & Stone, 1979; Kolen & Whitney, 1981; Smith & Kramer, 1992).
Hence, two items in the FIMS data (Items 13 and 29), two items in the SIMS data (Items 21 and 29) and one item in the TIMS data (Item T1b, No. 148; a further item, No. 94, had already been excluded from the international TIMSS analysis) were removed from the calibration analyses because they misfitted the Rasch model. Consequently, 68 items for FIMS, 70 for SIMS and 156 for TIMS fitted the Rasch model.
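The deletion rule described above amounts to a simple screen over each item's INFIT MNSQ value against the 0.77-1.30 range. A minimal sketch (the item labels here are illustrative, not the actual FIMS/SIMS/TIMS items):

```python
def screen_items(infit_mnsq, lo=0.77, hi=1.30):
    """Split items into kept and removed lists by whether their
    INFIT MNSQ falls inside the acceptable range (Adams & Khoo, 1993)."""
    kept, removed = [], []
    for item, mnsq in infit_mnsq.items():
        (kept if lo <= mnsq <= hi else removed).append(item)
    return kept, removed
```

Items in the removed list would then be deleted before re-running the calibration.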

The other way of investigating the fit of the Rasch scale to data is to
examine the estimates for each case. The case estimates express the
performance level of each student on the total scale. In order to identify
whether the cases fit the scale or not, it is important to examine the case
OUTFIT mean square statistic (OUTFIT MNSQ), which measures the consistency of the fit of the items to the student characteristic curve for each
student, with special consideration given to extreme items. In this study, the
general guideline used for interpreting t as a sign of misfit is t > 5 (Wright & Stone, 1979, p. 169). That is, if the OUTFIT MNSQ value of a person has a t value greater than 5, that person does not fit the scale and is deleted from the
analysis. However, in this analysis, no person was deleted, because the t
values for all cases were less than 5.
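The case-screening rule can be expressed the same way as the item screen. In this sketch the t values are assumed to have been produced already by the calibration software; the case labels are hypothetical:

```python
def screen_cases(case_t, t_max=5.0):
    """Return the cases whose OUTFIT MNSQ t value exceeds the cutoff
    (t > 5; Wright & Stone, 1979); these would be deleted from the analysis."""
    return [case for case, t in case_t.items() if t > t_max]
```

In the present analysis this list would be empty, since all cases had t values below 5.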

We used Rasch analysis to explore item difficulty, item discrimination, item redundancy, and person-item patterns within the two-dimensional PCM. Two indices are commonly used to determine item fit levels: (1) infit/outfit, and (2) standardized z-values (ZSTD). Item infit and outfit values that are less than the expected value of 1.0 are indicative of over-fitting items (see also Bond & Fox, 2001). In these cases,
too little variance occurs relative to the estimated model. Item infit and outfit values
larger than the expected value indicate under-fitting items. Here excessive variance
occurs relative to the estimated model. Depending on the sample size per item,
different acceptability intervals for infit and outfit are typically employed (e.g.,
Adams, Wu, & Macaskill, 1997; Bond & Fox, 2001). Moderate cutoff levels (i.e.,
infit/outfit acceptability values) between 0.8 and 1.2, and ZSTD item fit values
between −2 and +2, were applied to our dataset. Employing these cutoff values, we
could identify six (10%) misfitting items based on infit, and nine (14%) misfitting
items based on ZSTD infit. Based on outfit values, 19 items (30%) were misfitting, and based on ZSTD outfit 20 (32%) items were misfitting. However, many items
exhibited unacceptable values on more than one of those indices (see Appendix A).
In addition to these four indices produced in the Rasch analysis, we also evaluated
the items based on a traditional discrimination index. In this analysis 20 items
were identified with insufficient discrimination (i.e., < 0.30). In summary, 29 items
(46%) were identified to display unacceptable fit values (based on at least one of
above indices). Given the high percentage of misfitting items, it is reasonable to
investigate item properties in greater detail to obtain information about which items
should be removed from the instrument.
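A minimal sketch of this multi-index screen, flagging an item if any of the five indices is unacceptable. The field names and cutoffs mirror the ones used above, but the data structure itself is an assumption for illustration:

```python
def flag_items(stats, mnsq_lo=0.8, mnsq_hi=1.2, z_abs=2.0, disc_min=0.30):
    """Return {item: [reasons]} for items with at least one unacceptable
    value: infit/outfit MNSQ outside [0.8, 1.2], |ZSTD| > 2, or a
    traditional discrimination index below 0.30."""
    flagged = {}
    for item, s in stats.items():
        reasons = []
        for key in ("infit", "outfit"):          # mean square indices
            if not (mnsq_lo <= s[key] <= mnsq_hi):
                reasons.append(key)
        for key in ("zstd_infit", "zstd_outfit"):  # standardized indices
            if abs(s[key]) > z_abs:
                reasons.append(key)
        if s["disc"] < disc_min:                 # discrimination index
            reasons.append("disc")
        if reasons:
            flagged[item] = reasons
    return flagged
```

Keeping the list of reasons per item makes it easy to see, as in Appendix A, which items exhibit unacceptable values on more than one index.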
Item fit statistics (i.e., infit/outfit and ZSTD) show which items fit the estimated
model. Therefore, the number of misfitting items also indicates the quality of a
model. To investigate the question of whether a two-dimensional model fits better
than a uni-dimensional model, one could also investigate which model displays
fewer misfitting items. Based on infit, outfit, and ZSTD values, we could identify 19
misfitting items regarding the uni-dimensional PCM while the two-dimensional

revealed 21 misfitting items. This result contradicts the findings regarding model
dimensionality as discussed above. However, the number of misfitting items is only
one indicator regarding model fit. Deviance statistics, information criteria, and
theoretical considerations (see above) indicate that a two-dimensional model should be preferred over a uni-dimensional model.
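The information criteria mentioned above trade model deviance against the number of estimated parameters, so a multidimensional model is preferred only if it lowers the deviance by more than its extra parameters cost. A sketch of the two standard criteria, with illustrative numbers (not the values from this study):

```python
import math

def aic(deviance, n_params):
    """Akaike information criterion: deviance penalized by 2 per parameter."""
    return deviance + 2 * n_params

def bic(deviance, n_params, n_obs):
    """Bayesian information criterion: penalty grows with log(sample size)."""
    return deviance + n_params * math.log(n_obs)
```

For either criterion, the model with the smaller value is preferred.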
We used infit/outfit and
standardized z-values (ZSTD) to determine item fit levels
in the reduced item dataset. As shown in Table 3 and Appendix B, for the reduced
item analysis of fit patterns, far fewer items displayed poor fit than the original item
set. Specifically, 0% of the reduced dataset items had unacceptable fit values using
four different item quality measures (infit, ZSTD infit, outfit, and ZSTD outfit).
Approximately 48% of items in the reduced dataset displayed poor discrimination
values (i.e., <0.3), in contrast to 32% in the original dataset. Overall, 46% of items in the original dataset showed at least one unacceptable fit value, in contrast to 48% of items in the reduced dataset showing unacceptable discrimination values. Thus, significantly more items were characterized by acceptable fit values in the reduced dataset than in the original dataset (Table 4). In both cases the percentage of items with poor discrimination values is high, and within the reduced item set it is higher because of the smaller number of items. Finally, as in our analysis of the original Lombrozo et al. item set, a Wright Map was generated to examine person-item distribution patterns (Neumann, Neumann, & Nehm, 2010).

Interpretation of fit statistics:

Misfit statistic too low (e.g., Z < −2.0; MnSq < 0.7): response string less than modeled (e.g., 1111100000); overfit; deterministic; muted; too good to be true, likely due to item dependence; Guttman pattern.

Predicted fit statistic (e.g., −2.0 < Z < +2.0; 0.7 < MnSq < 1.3): response string as modeled (e.g., 1110101000); good fit; stochastic; productive for measurement; expected; Rasch pattern.

Misfit statistic too high (e.g., Z > +2.0; MnSq > 1.3): response string larger than modeled (e.g., 0100100010); underfit; erratic; noisy; unexpected responses, likely due to special knowledge or guessing; unpredictable pattern.