
A Bayesian Graphical and Probabilistic

Proposal for Bias Analysis

Claudia Ovalle and Danilo Alvares

Abstract One of the main concerns in educational policies is to analyze whether a
national test is fair for all students, especially for economically and socially
disadvantaged groups. In the current literature, there are some methodological
proposals that analyze this problem through comparative approaches of performance
by groups. However, these methodologies do not provide an intuitive graphical and
probabilistic interpretation, which would be useful to aid the educational decision-
making process. Therefore, the objective of this work is to bridge these gaps
through a methodological proposal based on the one-, two-, and three-parameter
logistic models, where we evaluate the performance of each group using the
difficulty parameter estimated from a Bayesian perspective. The difference between
the parameters of each group and its respective 95% credible interval are displayed
in graphical form. In addition, we also calculate the mean of the posterior probability
of all the differences of each parameter for the groups compared. This probabilistic
measurement provides a more accurate perception of intergroup performance by
analyzing all items together. Our methodology is illustrated with the 2018 Chilean
University Selection Test (PSU) data, where the analyzed groups are students from
(regular) high schools versus technical high schools. A sensitivity analysis across
the logistic models is presented. All analyses were performed using the R language
with the JAGS program.

Keywords Differential item functioning · Mantel-Haenszel · Item analysis · Graphical representation

C. Ovalle (✉)
Centro de Justicia Educacional, PUC, Santiago de Chile, Chile
e-mail: claudia.ovalle@uc.cl
D. Alvares
Departamento de Estadística, PUC, Santiago de Chile, Chile

© Springer Nature Switzerland AG 2020
M. Wiberg et al. (eds.), Quantitative Psychology, Springer Proceedings in
Mathematics & Statistics 322, https://doi.org/10.1007/978-3-030-43469-4_6

1 Introduction

In the field of education, it is necessary to measure and evaluate student
learning. However, obtaining items that measure different population groups
equally well is hindered by potential bias and by differences in the difficulty,
pseudo-chance (guessing), and discrimination parameters that may favor a majority
group over a minority. It is therefore necessary to examine the performance of
test items that carry high-stakes consequences for students. One way of doing
this is to measure and graphically represent the between-group differences in the
difficulty, guessing, and discrimination parameters proposed in Item Response
Theory (IRT). In this theory, a student's ability is a latent trait, and the
characteristics of the items can be measured and represented graphically (e.g.,
by means of the item information curve). The present proposal is novel in that it
is not limited to an interpretation based on the 3PL model, but incorporates
measures of sensitivity by integrating the interpretation of the parameters in the
1PL-G, 1PL, 2PL, and 3PL models, all estimated with a Bayesian approach in the
R language with the JAGS program. Likewise, it proposes a new graphical
representation that displays the difference in the parameters between groups,
facilitating the visual detection of test bias. This representation allows the
researcher to observe the same parameter for several items at once, which is not
possible when an information curve is drawn for each item separately. It thus
provides a more informative analysis of performance between the groups, since
all the items are analyzed together and not separately as is done with the item
information curve.

2 IRT Models for Parameter Estimation

The use of the guessing parameter is due to Birnbaum (1968) and corresponds to the
lower asymptote, with a value greater than 0, of the item characteristic curve. The
guessing parameter represents the probability of a correct response by an individual
with very low ability. The difficulty, in turn, relates to the probability that a student
with a certain level of ability responds correctly to a given item (San Martín and
De Boeck 2015). Finally, discrimination refers to the ability of the item to
differentiate students who know the content from those who do not (Tuerlinckx and
De Boeck 2005); this parameter is represented by the slope of the item characteristic
curve. There are different models to estimate the item parameters. The first is the
one-parameter logistic model with guessing (1PL-G), a special case of the 3PL model
in which the discrimination parameters are set to 1 (San Martín et al. 2006). The
second is the Rasch model (Rasch 1960), which focuses on the difficulty parameter.
The third is the 2PL model, which includes discrimination and difficulty parameters.
The fourth is the 3PL model (difficulty, discrimination, and guessing) of Birnbaum
(1968). All models are presented in Table 1.

Table 1 IRT models

Model   G(θi, ωj)                        Item parameter         Parameter space
1PL     F(θi − βj)                       ωj = βj                (θ1:I, ω1:J) ∈ R^I × R^J
2PL     F(αj θi − βj)                    ωj = (αj, βj)          (θ1:I, ω1:J) ∈ R^I × R^J × R^J
1PL-G   γj + (1 − γj) F(θi − βj)         ωj = (βj, γj)          (θ1:I, ω1:J) ∈ R^I × R^J × (0, 1)^J
3PL     γj + (1 − γj) F(αj θi − βj)      ωj = (αj, βj, γj)      (θ1:I, ω1:J) ∈ R^I × R^J × R^J × (0, 1)^J

1PL-G models, in which the discrimination parameter is fixed at 1, are widely used
in the literature (Fariña, González, & San Martín 2019; San Martín et al. 2013).
The 1PL-G is also preferred because issues arise when interpreting the parameters
of the 3PL model (Maris and Bechger 2009), and the convenience of using
two-parameter models under different specifications has been reported (San Martín
2016). In the present study, we opted for the difficulty parameter, since the main
objective is to compare two groups (focal and reference) so that we can find the
differences in the items that affect the minority group (technical high school
students vs. academic track students). For this we focus on the 1PL, 1PL-G, 2PL,
and 3PL models. From these four models, a proposal was developed to establish a
selection criterion for the items that should make up a standardized test applicable
to different groups of students, based on the between-group differences in the
difficulty parameter. Specifically, this was done to compare students of the
technical high school track vs. students of the academic track and thus to reduce
the bias in favor of one or the other group. This research provides a novel
approach to item bias by means of a graphical representation of the difference in
the difficulty parameter.

3 Item Bias

Bias or differential item functioning (DIF) arises when the probability of a correct
response differs between groups of people with the same value of the latent trait
(ability), for example, whenever the difficulty of an item depends on membership
in a subgroup based on race, ethnicity, or gender (Berger and Tutz 2016). The
present study focuses on uniform bias, that is, when individuals from different
subgroups with the same skill level have different probabilities of solving an item
and these differences do not depend on their ability. Zwick (2012) reviews the
criteria for the detection of biased items and identifies the flagging rules used by
ETS (Educational Testing Service). The author concludes that rule "C" is
insufficient to establish critical bias of the items even when the samples are large.
The rule indicates that a biased item must have a Mantel-Haenszel Delta (MH D)
statistic with an absolute value greater than 1.5, significant at the 5 percent level.
A similar rule indicates that the critical value of the MH Delta is 1.645, the 95th
percentile. If the ETS bias classification is used with the MH Delta statistic, which
focuses on the difficulty parameter, then a category "A" item (without bias) has a
Delta that is nonsignificant or less than 1 in absolute value, a category "B" item
has a Delta between 1 and 1.5 and is significant, and a category "C" item has a
Delta of at least 1.5 and is also significant. In terms of the item parameters, an MH
Delta of 1.0 is equivalent to −2.35 times the between-group difference in the
difficulty parameter β̂, and therefore indicates the cutoff point for an item to have
type B or type C bias in terms of difficulty.

The present proposal is based on a Bayesian approach and is centered on the values
of the difference in the difficulty parameter between groups. For example, we
compared the difficulty parameter between groups of technical students versus
academic track students who took a standardized test. While the non-Bayesian
approach seeks the (estimated) parameter values that maximize the probability of
the observed data, the Bayesian approach used in the present study makes use of
prior distributions for the parameters of interest, and the inferences are based on
samples from the posterior distributions, which can be used to summarize all the
information about the parameters of interest (Gonzalez 2010a, b, p. 2). That is,
the probability distribution of the parameter of interest is used.
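For illustration, the conversion from the Mantel-Haenszel common odds ratio to the ETS Delta scale (MH D-DIF = −2.35 ln αMH), together with a simplified version of the A/B/C flagging rules, can be sketched in R; the function names are ours, and the classification below ignores the significance checks described above for brevity:

    # Convert the Mantel-Haenszel common odds ratio to the ETS Delta scale
    # (MH D-DIF = -2.35 * log(alpha_MH)).
    mh_delta <- function(alpha_mh) {
      -2.35 * log(alpha_mh)
    }

    # Simplified ETS A/B/C classification based only on |MH D-DIF|;
    # the full rules also require the significance tests described above.
    ets_category <- function(delta) {
      d <- abs(delta)
      if (d < 1.0) "A" else if (d < 1.5) "B" else "C"
    }

    mh_delta(1.0)                 # 0: no DIF on the Delta scale
    ets_category(mh_delta(2.0))   # odds ratio of 2 -> |Delta| ~ 1.63 -> "C"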

4 Method

Since the purpose of the present study is to analyze whether a standardized test is
fair for all students, in particular for those who come from a vocational/technical
high school, using the measurement of the parameters in the selected IRT models
(1PL, 1PL-G, 2PL, 3PL), we proceed with a graphical Bayesian interpretation of
the differences in the parameters in the mathematics tests from a national
standardized test (80 items from the PSU). The question that guides the present
study is: "Does differential item functioning (DIF) persist in the items of the
mathematics subtest that is not due to the ability of students (latent trait) but may
be conditional on aspects such as the type of curriculum (academic versus
technical/vocational)?"

4.1 Descriptive Analysis

In the present proposal, the DIF analysis will be developed with the Mantel-Haenszel
χ² statistic. This descriptive analysis will be carried out with the DIFAS software
for dichotomous items.
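DIFAS output is not reproduced here, but as a sketch, base R's mantelhaen.test can compute the same MH χ² for a single dichotomous item, stratifying examinees by score bands (all variables below are simulated for illustration):

    # Hypothetical data: item response (0/1), group (TP = technical, SH = academic),
    # and a total-score stratum for each examinee.
    set.seed(1)
    n      <- 1000
    group  <- factor(sample(c("TP", "SH"), n, replace = TRUE))
    strata <- factor(sample(1:5, n, replace = TRUE))   # score bands
    item   <- rbinom(n, 1, ifelse(group == "TP", 0.55, 0.65))

    # 2 x 2 x K table: correct/incorrect by group within each score stratum
    tab <- table(item, group, strata)

    # Mantel-Haenszel chi-square test and common odds ratio estimate
    mantelhaen.test(tab)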

4.2 Bias Analysis Comparing the Difficulty Parameter with 1PL, 1PL-G, 2PL, and 3PL Models

The following models were used:


One-parameter logistic (1PL) model:

P(Yij = 1) = exp(θi − β*ij) / (1 + exp(θi − β*ij))

One-parameter logistic with guessing (1PL-G) model:

P(Yij = 1) = γj + (1 − γj) · exp(θi − β*ij) / (1 + exp(θi − β*ij))

Two-parameter logistic (2PL) model:

P(Yij = 1) = exp(αj (θi − β*ij)) / (1 + exp(αj (θi − β*ij)))

Three-parameter logistic (3PL) model:

P(Yij = 1) = γj + (1 − γj) · exp(αj (θi − β*ij)) / (1 + exp(αj (θi − β*ij)))
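Since the 1PL, 1PL-G, and 2PL are all special cases of the 3PL, a single R helper (ours, not from the paper) suffices to evaluate these response probabilities:

    # P(Y_ij = 1) under the 3PL. Setting alpha = 1 with gamma free gives the
    # 1PL-G, gamma = 0 with alpha free gives the 2PL, and both restrictions
    # together give the 1PL. plogis() is the logistic CDF exp(x)/(1 + exp(x)).
    p_correct <- function(theta, beta_star, alpha = 1, gamma = 0) {
      gamma + (1 - gamma) * plogis(alpha * (theta - beta_star))
    }

    p_correct(theta = 0, beta_star = -0.5, alpha = 1.2, gamma = 0.2)  # 3PL example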

All models predict the probability of a correct response, Yij = 1. The parameter
β*ij = βj + Δj gi combines the item difficulty βj with an interaction term Δj, which
represents the negative or positive increment of the difficulty parameter for the TP
(vocational/technical) group (gi = 1) relative to the SH (academic) group (gi = 0).

The Bayesian approach was used to estimate the item parameters in all IRT models.
In the estimation, the following priors were used: the guessing parameter γj was
assigned a Beta(0.5, 0.5) distribution, the discrimination αj a Uniform(0, 100)
distribution, and the difficulty βj a Normal(0, 100) distribution. The person latent
ability θi was a parameter estimated in all models, and it is distributed as
Normal(0, σ²gi), where σ²gi is the variance of the ability for each of the groups gi
of vocational or academic students.
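The paper's estimation was done in R with JAGS; the original code is not reproduced, but a minimal sketch of the 3PL specification with the group difficulty increment, under the priors just described, might look as follows. The names delta, tp, and g, the Normal(0, 100) prior on the increment, and the uniform prior on the group standard deviations are our assumptions:

    library(rjags)

    model_string <- "
    model {
      for (i in 1:N) {
        theta[i] ~ dnorm(0, prec[g[i]])        # ability; group-specific variance
        for (j in 1:J) {
          # group-shifted difficulty: beta*_ij = beta_j + delta_j * tp_i
          logit(p[i, j]) <- alpha[j] * (theta[i] - (beta[j] + delta[j] * tp[i]))
          prob[i, j] <- gamma[j] + (1 - gamma[j]) * p[i, j]
          Y[i, j] ~ dbern(prob[i, j])
        }
      }
      for (j in 1:J) {
        alpha[j] ~ dunif(0, 100)     # discrimination (fix to 1 for 1PL/1PL-G)
        beta[j]  ~ dnorm(0, 0.01)    # difficulty; JAGS uses precision (0.01 = variance 100)
        delta[j] ~ dnorm(0, 0.01)    # TP-vs-SH difficulty increment (assumed prior)
        gamma[j] ~ dbeta(0.5, 0.5)   # guessing (fix to 0 for 1PL/2PL)
      }
      for (k in 1:2) {
        sigma[k] ~ dunif(0, 10)      # ability SD per group (assumed prior)
        prec[k] <- pow(sigma[k], -2)
      }
    }
    "

    # Y: N x J matrix of 0/1 responses; g: group index (1 = SH, 2 = TP); tp: 0/1 indicator
    jm <- jags.model(textConnection(model_string),
                     data = list(Y = Y, N = nrow(Y), J = ncol(Y), g = g, tp = tp))
    update(jm, 1000)                                  # burn-in
    post <- coda.samples(jm, c("beta", "delta"), n.iter = 5000)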

4.3 Graphical Representation

For each parameter estimated by means of the IRT models (1PL, 1PL-G, 2PL, 3PL),
a graphical representation was produced. The horizontal axis represents all of
the test items, and the vertical axis represents the difference between groups in
the value of the difficulty parameter for each of these items. This representation
helps to establish whether the difference affects the minority group
(vocational/technical students) in comparison to the majority group (academic
students). The between-group difference in the difficulty parameter is represented
for each item by a point in the graph. The representation includes the credible
intervals, which indicate the probability that the between-group difference in the
parameter (conditional on the data) is greater or less than zero.
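A minimal sketch of this plot, assuming post holds the MCMC draws of the per-item difficulty increments (e.g., the delta[j] samples from the model above):

    # draws: iterations x items matrix of posterior samples of (beta_TP - beta_SH)
    m     <- as.matrix(post)
    draws <- m[, grep("^delta\\[", colnames(m))]

    med <- apply(draws, 2, median)                             # posterior medians
    ci  <- apply(draws, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals

    items <- seq_len(ncol(draws))
    plot(items, med, ylim = range(ci), pch = 16,
         xlab = "Item", ylab = "Difference in difficulty (TP - SH)")
    segments(items, ci[1, ], items, ci[2, ])                   # credible interval bars
    abline(h = 0, lty = 2)                                     # no-difference reference line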

4.4 Data

In the year 2017, for the university admission of the year 2018, approximately
295,531 students registered for the PSU standardized test, and 262,139 (89%) took
the subtests of language and mathematics (DEMRE, 2017). Among these students,
almost 90,000 belonged to the technical/vocational track.

4.5 Sample

We sampled 136,918 students from the academic track and 56,798 students from
the technical/vocational track.

5 Results

5.1 Descriptive Statistics

The analysis was done separately for the four equivalent forms of the PSU
mathematics test (each form with 80 items, which can be combined and repeated in
different ways). With the DIF estimates, it was established which items have
potential bias. In order to detect DIF, the Mantel-Haenszel chi-square (MH χ²) was
calculated. We used the R language to estimate the model parameters, and we used
the DIFAS 5.0 software for the MH χ² analysis. In order to establish whether an
item has DIF, two criteria were used: the MH χ² has to be significant, and the CDR
(combined decision rule) should be true. The latter rule implies that the MH LOR
(log odds ratio) Z statistic falls outside the range −2.0 to 2.0 (indicating that the
item has DIF), with negative values indicating DIF in favor of the minority group
and positive values favoring the majority group.
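A direct coding of these two criteria, with the thresholds as described above, could look like the following; results is a hypothetical per-item data frame, not DIFAS output:

    # Flag DIF when the MH chi-square is significant and the CDR holds
    # (|LOR Z| > 2.0); 'results' is assumed to have columns mh_p (MH chi-square
    # p-value) and lor_z (standardized log odds ratio).
    flag_dif <- function(results) {
      with(results, mh_p < 0.05 & abs(lor_z) > 2.0)
    }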

5.2 Graphical Representation

The first graphical representation, obtained with the 1PL model, is presented in
Fig. 1. The horizontal axis corresponds to the PSU test items (n = 180 items),
while the vertical axis corresponds to the difference between groups (technical vs.
academic track) in the value of the difficulty parameter.

The vertical axis value is the difference between the medians of the posterior
distributions obtained with multiple samples. These samples were obtained using
a Bayesian approximation for the calculation of the difficulty parameter. The
between-group difference in the difficulty parameter ranges between −3 and 3 (on
the vertical axis).

In this sense, differences that approach 0 indicate that the item is appropriate
and does not present an important difference between the two groups of students
(technical vs. academic). On the contrary, items for which the between-group
differences are closer to 3 or to −3 are "problematic," since they do not measure
the groups in the same way. Items in the upper band (above 0.5) indicate that the
difficulty is greater for the technical student group than for the academic group.
According to the representation, the vast majority of the items in the different
areas are above 0.5 and should be reconsidered before being included in the PSU
test.

The graphical Bayesian representations of the between-group differences in the
difficulty parameters based on the 1PL, 1PL-G, 2PL, and 3PL models are displayed
in Figs. 1, 2, 3, and 4. In summary, the representations show that bias, defined as
the posterior probability that the between-group difference is above 0, is present
in a large percentage of the items:

P(βTP − βSH > 0 | data): 97% (1PL), 100% (1PL-G), 89% (2PL), 93% (3PL)
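Under the earlier sketch, these summaries can be read directly off the posterior draws; our reading of the reported figure is the mean across items of the per-item posterior probabilities (again assuming draws is the iterations x items matrix of βTP − βSH samples):

    # Per-item posterior probability that the difficulty is higher for the TP group
    p_item <- colMeans(draws > 0)

    # Mean posterior probability across all items (the summary reported per model)
    mean(p_item)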

6 Conclusion

A visual representation of item parameter differences can help determine item bias,
and it can support decision-making regarding item selection to benefit minority
groups. In the present study, the differences in the difficulty parameter between
academic and technical track students were represented (Figs. 1, 2, 3, and 4).
Although the ETS (Educational Testing Service) flagging rules may underestimate
item bias affecting minorities, our Bayesian visual representation provided a more
precise approach: it showed bias in 97% of items under the 1PL model, 100% under
the 1PL-G model, 89% under the 2PL model, and 93% under the 3PL model. When
we used the ETS flagging rules, they showed that all items had type "A" or minimal
bias (Table 2), underestimating the differences between student groups. Our visual
Bayesian analysis is thus more effective at establishing bias against minorities
(such as vocational/technical students) than traditional measures such as the
Mantel-Haenszel χ².

Fig. 1 Difference (TP-SH) between difficulty parameters, 1PL model. [Figure omitted: per-item medians and 95% credible intervals; content areas: Algebra, Geometry, Numbers, Probability]

Fig. 2 Difference (TP-SH) between difficulty parameters, 1PL-G model. [Figure omitted: per-item medians and 95% credible intervals; content areas: Algebra, Geometry, Numbers, Probability]

Fig. 3 Difference (TP-SH) between difficulty parameters, 2PL model. [Figure omitted: per-item medians and 95% credible intervals; content areas: Algebra, Geometry, Numbers, Probability]

Fig. 4 Difference (TP-SH) between difficulty parameters, 3PL model. [Figure omitted: per-item medians and 95% credible intervals; content areas: Algebra, Geometry, Numbers, Probability]

In the present study, the Mantel-Haenszel statistic detected item bias in between
28% and 45% of the items in each of the test forms (see Table 2), missing the
large bias found by our approach. The Bayesian analysis also permitted the
calculation of posterior distributions of the difficulty parameter, making the bias
analysis more accurate. Finally, the difficulty parameter is suitable for comparing
groups in a bias analysis.

Acknowledgments This study was funded by project Conicyt PIA CIE 160007.

Table 2 MH statistic for all forms of the PSU test (2018)

Form   Items   MH LOR Z majority   MH LOR Z minority   CDR          ETS
111    80      19 (23.7%)          21 (26.2%)          36 (45.0%)   A
112    80      14 (17.5%)          13 (16.2%)          23 (28.7%)   A
113    80      17 (21.2%)          18 (22.5%)          33 (41.2%)   A
114    80      12 (15.0%)          12 (15.0%)          27 (33.7%)   A

References

Berger, M., & Tutz, G. (2016). Detection of uniform and non-uniform differential item functioning
by item focused trees. Journal of Educational and Behavioral Statistics, 41(6), 559–592.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Fariña, P., González, J., & San Martín, E. (2019). The use of an Identifiability-based strategy for
the interpretation of parameters in the 1PL-G and Rasch models. Psychometrika, 84, 511–528.
Gonzalez, J. (2010a). Bayesian estimation in IRT. International Journal of Psychological Research,
3(1), 164–176.
Gonzalez, J. (2010b). Bayesian Methods in Psychological Research: The case of IRT. International
Journal of Psychological Research, 3(1), 164–176.
Maris, G., & Bechger, T. (2009). On interpreting the model parameters for the three parameter
logistic model. Measurement, 7(2), 75–88.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen:
Danish Institute for Educational Research.
San Martín, E. (2016). Identification of item response theory models. In W. J. van der Linden
(Ed.), Handbook of item response theory. New York: Chapman and Hall/CRC. https://doi.org/
10.1201/b19166.
San Martín, E., & De Boeck, P. (2015). What do you mean by a difficult item? On the
interpretation of the difficulty parameter in a Rasch model. In R. Millsap, D. Bolt, L. van der Ark,
& W.-C. Wang (Eds.), Quantitative psychology research. Springer Proceedings in Mathematics
& Statistics (Vol. 89). Cham: Springer.
San Martín, E., Del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing.
Applied Psychological Measurement, 30(3), 183–203.
San Martín, E., Rolin, J., & Castro, L. M. (2013). Identification of the 1PL model with guessing
parameter: Parametric and semi-parametric results. Psychometrika, 78(2), 341–379.
Tuerlinckx, F., & De Boeck, P. (2005). Two interpretations of the discrimination parameter.
Psychometrika, 70(4), 629–650.
Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging
rules, minimum sample size requirements, and criterion refinement. ETS Research Reports.
