C. Ovalle (✉)
Centro de Justicia Educacional, PUC, Santiago de Chile, Chile
e-mail: claudia.ovalle@uc.cl
D. Alvares
Departamento de Estadística, PUC, Santiago de Chile, Chile
1 Introduction
The guessing parameter is due to Birnbaum (1968) and corresponds to the lower asymptote, with a value greater than 0, of the item characteristic curve. The guessing parameter represents the probability of a correct response by an individual with very low ability. Difficulty, on the other hand, refers to the probability that a student with a certain level of ability responds correctly to a given item (San Martín and De Boeck 2015). Finally, discrimination refers to the ability of the item to differentiate students who know the content from those who do not (Tuerlinckx and De Boeck 2005). This parameter is represented by the slope of the item characteristic curve. There are different models to estimate the item parameters. The first is the one-parameter logistic model with guessing (1PL-G), a special case of the 3PL model in which the discrimination parameters are fixed at 1 (San Martín et al. 2006). The second is the Rasch model (Rasch 1960), which focuses on the difficulty parameter. The third is the 2PL model, which includes discrimination and difficulty parameters. The fourth is the 3PL model (difficulty, discrimination, and guessing) of Birnbaum (1968). All models are presented in Table 1.
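The models above can be viewed as special cases of a single response function. As a rough illustration (not the authors' code; the parameter values are arbitrary), a Python sketch of the 3PL family:

```python
import math

def irt_prob(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL family.

    theta: ability, b: difficulty, a: discrimination, c: guessing.
    a=1, c=0 gives the Rasch/1PL model; c>0 with a=1 gives 1PL-G;
    a free with c=0 gives 2PL; all three free gives 3PL.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The special cases, for an examinee of average ability (theta = 0)
# facing an item of average difficulty (b = 0):
p_1pl  = irt_prob(0.0, 0.0)                 # Rasch/1PL
p_1plg = irt_prob(0.0, 0.0, c=0.25)         # 1PL-G: guessing floor of 0.25
p_2pl  = irt_prob(0.0, 0.0, a=2.0)          # 2PL: steeper slope
p_3pl  = irt_prob(0.0, 0.0, a=2.0, c=0.25)  # 3PL
```

Note that as ability decreases, the probability approaches the guessing parameter c rather than 0, which is the lower asymptote described above.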
A Bayesian Graphical and Probabilistic Proposal for Bias Analysis 71
3 Item Bias
Bias or differential item functioning (DIF) arises when the probability of a correct
response between people with the same value of the latent trait (ability) differs
between groups, for example, whenever the difficulty of an item depends on the
membership to a subgroup based on race, ethnicity, or gender (Berger and Tutz
2016). The present study is focused on uniform bias, that is, when individuals
from different subgroups with the same skill level have different probabilities of
solving an item and these differences do not depend on their ability. Zwick (2012)
reviews the criteria for the detection of biased items and identifies the flagging rules
used by ETS (Educational Testing Service). The author concludes that rule “C” is
insufficient to establish critical bias of the items even when the samples are large.
The rule indicates that a biased item must have a Mantel-Haenszel Delta (MH D) statistic with an absolute value greater than 1.5 that is significant at the 5 percent level. A similar rule indicates that the critical value of the MH Delta is 1.645, the 95th percentile. If the ETS bias classification is used with the MH Delta statistic, which focuses on the difficulty parameter, then an item of category “A,” or without bias, has a delta that is nonsignificant or below 1 in absolute value. A category “B” item has an absolute delta between 1 and 1.5 and is significant, and a category “C” item has an absolute delta of at least 1.5 and is also significant. In terms of the item parameters, an MH Delta of 1.0 is equivalent to −2.35 · β2 and therefore indicates the cutoff point for an item to have type B or type C bias in terms of the difficulty
parameter. The present proposal is based on a Bayesian approach, and it is centered
on the values of the difference of the difficulty parameter for different groups. For
example, we compared the difficulty parameter between groups of technical students
versus academic track students who took a standardized test. While the non-Bayesian approach seeks the (estimated) parameter values that maximize the probability of the observed data, the Bayesian approach used in the present study makes use of prior distributions of the parameters of interest, and inferences are based on samples from the posterior distributions, which summarize all the information about the parameters of interest (Gonzalez 2010a,b, p. 2). That is, the probability distribution of the parameter of interest itself is used.
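The A/B/C flagging rules described above can be sketched as a small classifier. This Python rendering of the stated thresholds (1.0 and 1.5 in absolute value, plus significance at the 5% level) is illustrative, not ETS software:

```python
def ets_category(mh_delta, significant):
    """Classify an item by the ETS A/B/C flagging rules (illustrative sketch).

    mh_delta: Mantel-Haenszel Delta (MH D) statistic for the item.
    significant: whether MH D differs significantly from zero at the 5% level.
    """
    d = abs(mh_delta)
    if d >= 1.5 and significant:
        return "C"   # large DIF: candidate for removal or revision
    if d >= 1.0 and significant:
        return "B"   # moderate DIF
    return "A"       # negligible DIF
```

Under these rules, an item with a large but nonsignificant delta still lands in category "A", which is one reason the rules can understate bias in practice.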
4 Method
Since the purpose of the present study is to analyze whether a standardized test is fair for all students, in particular for those who come from a vocational/technical high school, using the measurement of the parameters in the selected IRT models (1PL, 1PL-G, 2PL, 3PL), we use a graphical Bayesian interpretation of the differences in the parameters in the mathematics tests from a national standardized test (80 items from the PSU). The question that guides the present study is: “Does differential item functioning (DIF) persist in the items of the mathematics subtest that is not due to the ability of students (latent trait) but is conditional on aspects such as the type of curriculum (academic versus technical/vocational)?”
In the present proposal, the DIF analysis will be developed with the χ² Mantel-Haenszel statistic. This is a descriptive analysis that will be carried out with the DIFAS software for dichotomous items.
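The Mantel-Haenszel chi-square aggregates 2×2 tables (group × response) across ability strata. A minimal Python sketch, assuming one table per stratum with the focal group in the first row and correct responses in the first column:

```python
import numpy as np

def mantel_haenszel_chi2(tables):
    """Continuity-corrected Mantel-Haenszel chi-square for DIF on one item.

    tables: array of shape (K, 2, 2), one 2x2 table per ability stratum;
    rows = group (focal, reference), columns = response (correct, incorrect).
    """
    tables = np.asarray(tables, dtype=float)
    A = tables[:, 0, 0]                   # focal-group correct counts
    n1 = tables[:, 0, :].sum(axis=1)      # focal-group totals per stratum
    n0 = tables[:, 1, :].sum(axis=1)      # reference-group totals
    m1 = tables[:, :, 0].sum(axis=1)      # correct-response totals
    m0 = tables[:, :, 1].sum(axis=1)      # incorrect-response totals
    T = tables.sum(axis=(1, 2))           # stratum totals
    expected = n1 * m1 / T                # E[A] under no DIF
    var = n1 * n0 * m1 * m0 / (T ** 2 * (T - 1))
    return (abs(A.sum() - expected.sum()) - 0.5) ** 2 / var.sum()
```

Under the null hypothesis of no DIF the statistic follows a chi-square distribution with one degree of freedom, so values above roughly 3.84 are significant at the 5% level.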
All models predict the probability of a correct response, Yij = 1. The parameter β∗ij = βj + β∗j gi combines the difficulty βj with an interaction term β∗j, which represents the negative or positive increment of the difficulty parameter for the TP (vocational/technical) group compared to the SH (academic) group of students, with gi indicating group membership.
The Bayesian approach was used to calculate the item parameters in all IRT models. The following priors were used in the estimation: the guessing parameter γi was given a Beta(0.5, 0.5) distribution, the discrimination αi a Uniform(0, 100) distribution, and the difficulty βi a Normal(0, 100) distribution. The person latent ability θi was estimated in all models and is distributed as Normal(0, σ²gi), where σ²gi is the variance of the ability parameter for each group gi of vocational or academic students.
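As a minimal illustration of this estimation strategy (not the actual models of the study), the following Python sketch draws posterior samples of a single 1PL item difficulty under the Normal(0, 100) prior stated above, treating abilities as known and using a random-walk Metropolis sampler on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: known abilities and responses to one 1PL item
# with true difficulty 0.8 (hypothetical values for illustration).
theta = rng.normal(0.0, 1.0, size=2000)
beta_true = 0.8
y = rng.random(2000) < 1.0 / (1.0 + np.exp(-(theta - beta_true)))

def log_posterior(beta):
    """1PL log-likelihood plus the Normal(0, 100) prior (variance 100)."""
    p = 1.0 / (1.0 + np.exp(-(theta - beta)))
    return np.sum(y * np.log(p) + (~y) * np.log(1 - p)) - beta ** 2 / 200.0

# Random-walk Metropolis: accept a proposal with probability
# min(1, posterior ratio); the kept draws approximate the posterior.
samples, beta, lp = [], 0.0, log_posterior(0.0)
for _ in range(4000):
    prop = beta + rng.normal(0.0, 0.1)
    lp_prop = log_posterior(prop)
    if np.log(rng.random()) < lp_prop - lp:
        beta, lp = prop, lp_prop
    samples.append(beta)
post = np.array(samples[1000:])  # discard burn-in
```

The retained draws in `post` can then be summarized by their median and percentiles, which is exactly the information the graphical representation below displays per item.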
For each parameter calculated by means of the IRT models (1PL-G, 1PL, 2PL, 3PL), a graphical representation was produced. The horizontal axis represents all of
the test items, and the vertical axis represents the difference in the value of the
difficulty parameter for each one of these items. This representation helps to
establish if the difference affects the minority group (vocational/technical students)
in comparison to a majority group (academic students). The difference between
groups in the difficulty parameter is represented for each item by a point in the
graph. The graphical representation includes the credibility intervals that indicate
the probability that the difference between groups in the parameter (conditional on
the data) is greater or less than zero.
4.4 Data
In the year 2017, for university admission in 2018, approximately 295,531 students registered for the PSU standardized test, and 262,139 (89%) took the subtests of language and mathematics (DEMRE, 2017). Among these students, almost 90,000 belonged to the technical/vocational track.
4.5 Sample
We sampled 136,918 students from the academic track and 56,798 students from the technical/vocational track.
5 Results
The analysis was done separately for the four equivalent forms of the mathematics PSU test (each form with 80 items, which can be combined and repeated in different ways). The DIF estimations established which items have potential bias. In order to detect DIF, the Mantel-Haenszel chi-square (χ² MH) was calculated. We used the R language to estimate the model parameters and the DIFAS 5.0 software for the χ² MH analysis. In order to establish whether an item has DIF, two criteria were used: χ² MH has to be significant, and the CDR (combined decision rule) should be true. The latter rule implies that the MH LOR Z (standardized log odds ratio) falls outside the range −2.0 to 2.0 (indicating that the item has DIF), with negative LOR Z values indicating DIF in favor of the minority group and positive values favoring the majority group.
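The combined decision rule can be sketched as follows; the ±2.0 cutoff on LOR Z follows the text, and the function itself is an illustration rather than the DIFAS implementation:

```python
def cdr_flag(chi2_significant, lor_z):
    """Combined decision rule (CDR) sketch for one dichotomous item.

    chi2_significant: whether the MH chi-square is significant.
    lor_z: standardized Mantel-Haenszel log odds ratio; the sign shows
    which group the item favors (assumed cutoff: |LOR Z| > 2.0).
    """
    if not (chi2_significant and abs(lor_z) > 2.0):
        return "no DIF"
    return "DIF favors minority" if lor_z < 0 else "DIF favors majority"
```

Both conditions must hold before an item is flagged, which makes the rule conservative compared to the Bayesian criterion used below.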
The first graphic representation obtained with the 1PL-G model is presented in
Fig. 1. The horizontal axis corresponds to the PSU test items (n = 180 items),
while the vertical axis corresponds to the difference between groups (technical vs.
academic track) for the value of the difficulty parameter.
The vertical axis value is the difference between the medians of the posterior
distributions obtained with multiple samples. These samples were obtained using
a Bayesian approximation for the calculation of the difficulty parameter. The
difference between groups in the difficulty parameter has a range between 3 and
−3 (on the vertical axis).
In this sense, the differences that approach 0 indicate that the item is appropriate
and it is not presenting an important difference between the two groups of students
(technical vs. academic). On the contrary, those items in which the differences
between groups are closer to 3 or closer to −3 are the items which are “problematic”
since they do not measure the groups in the same way. The items in the upper band
(above 0.5) indicate that the difficulty is greater for the technical student group
compared to the academic group. According to the representation, the vast majority
of the items in the different areas are over 0.5 and should be reconsidered before
including them in the PSU test.
The graphical Bayesian representations of the difference between groups in the difficulty parameters based on the 1PL-G, 1PL, 2PL, and 3PL models are displayed in Figs. 1, 2, 3, and 4. In summary, the representations show that bias, defined as a high probability that the difference between the groups is above 0, is present in a large percentage of the items.
6 Conclusion
A visual representation of item parameter differences can help determine item bias,
and it can help in decision-making regarding item selection to benefit minority
groups. In the present study, the differences in the difficulty parameter between
academic vs. technical track students were represented (Figs. 1, 2, 3, and 4).
Although ETS (Educational Testing Service) flagging rules may underestimate item
bias affecting minorities, our Bayesian visual representation determined a more
precise approach: it showed bias in 97% of items according to the 1PL and 1PL-
G models, 89% in the 2PL model, and 93% in the 3PL model. When we used
the ETS flagging rules, they showed that all items had a type “A” or minimal
bias (Table 2), underestimating the differences between student groups. Also, our visual Bayesian analysis is more effective at establishing bias against minorities (such as vocational/technical students) than traditional measures such as the χ² Mantel-Haenszel statistic.
In the present study, the Mantel-Haenszel statistic detected item bias in between 28% and 45% of the items in each of the test forms (see Table 2), missing the large bias found in the present study. Also, the Bayesian analysis permitted the calculation of posterior distributions of the difficulty parameter, making the bias analysis more accurate. Finally,
the difficulty parameter is suitable to compare groups in a bias analysis.
Acknowledgments This study was funded by project Conicyt PIA CIE 160007.
References
Berger, M., & Tutz, G. (2016). Detection of uniform and non-uniform differential item functioning
by item focused trees. Journal of Educational and Behavioral Statistics, 41(6), 559–592.
Birnbaum, A. (1968). Statistical theories of mental test scores. Reading: Addison-Wesley.
Fariña, P., González, J., & San Martín, E. (2019). The use of an Identifiability-based strategy for
the interpretation of parameters in the 1PL-G and Rasch models. Psychometrika, 84, 511–528.
Gonzalez, J. (2010a). Bayesian estimation in IRT. International Journal of Psychological Research, 3(1), 164–176.
Gonzalez, J. (2010b). Bayesian Methods in Psychological Research: The case of IRT. International
Journal of Psychological Research, 3(1), 164–176.
Maris, G., & Bechger, T. (2009). On interpreting the model parameters for the three parameter
logistic model. Measurement, 7(2), 75–88.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
San Martin, E. (2016). Identification of item response theory models, in van der Linden, W. (Ed.).
(2016). Handbook of item response theory. New York: Chapman and Hall/CRC, https://doi.org/
10.1201/b19166.
San Martín, E., & De Boeck, P. (2015). What do you mean by a difficult item? On the interpretation of the difficulty parameter in a Rasch model. In R. Millsap, D. Bolt, L. van der Ark, & W.-C. Wang (Eds.), Quantitative psychology research (Springer Proceedings in Mathematics & Statistics, Vol. 89). Cham: Springer.
San Martín, E., Del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing.
Applied Psychological Measurement, 30(3), 183–203.
San Martín, E., Rolin, J., & Castro, L. M. (2013). Identification of the 1PL model with guessing
parameter: Parametric and semi-parametric results. Psychometrika, 78(2), 341–379.
Tuerlinckx, F., & De Boeck, P. (2005). Two interpretations of the discrimination parameter.
Psychometrika, 70(4), 629–650.
Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement (p. 130). ETS Research Reports.