An Introduction To Differential Item Functioning
The International Journal of Educational and Psychological Assessment
September 2012, Vol. 11(2)
listed. Neither the list of the software nor the studies cited are meant to be
exhaustive. Rather, these are intended to orient the reader.
Differential Item Functioning
Differential Item Functioning (DIF) occurs when examinees with the same
ability level but from two different groups have different probabilities of endorsing
an item (Clauser & Mazor, 1998). It is synonymous with statistical bias where one
or more parameters of the statistical model are under- or overestimated (Camilli,
2006; Wiberg, 2007). Whenever DIF is present in an item, the source(s) of this
variance should be investigated to ensure that it is not a case of bias. Any item
flagged as showing DIF is biased if, and only if, the source of variance is irrelevant
to the construct being measured by the test. In other words, it is a source of
construct-irrelevant variance and the groups perform differentially on an item
because of a grouping factor (Messick, 1989, 1994).
There are at least two groups, i.e. focal and reference groups, in any DIF
study. The focal group, a group of minorities for example, is the potentially
disadvantaged group. The group which is considered to be potentially advantaged
by the test is called the reference group. Note, however, that naming the groups is
not always clear-cut; in such cases, the choice of labels is essentially arbitrary.
There are two types of DIF, namely uniform and non-uniform DIF.
Uniform DIF occurs when one group performs better than the other group at all
ability levels. That is, almost all members of one group outperform almost all
members of the other group who are at the same ability levels. In the case of
non-uniform DIF, members of one group are favored up to a point on the ability
scale, and from that point on the relationship is reversed. That is, there is an
interaction between grouping and ability level.
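The distinction can be illustrated numerically with a two-parameter logistic item response function (the functional form and all parameter values here are hypothetical, chosen purely for illustration):

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic response probability at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Uniform DIF: equal discrimination but the item is harder for the focal
# group, so the reference group is favored at every ability level.
for theta in (-2.0, 0.0, 2.0):
    assert p_correct(theta, 1.0, 0.0) > p_correct(theta, 1.0, 0.5)

# Non-uniform DIF: the discriminations differ, the curves cross, and the
# favored group reverses partway along the ability scale.
assert p_correct(-2.0, 1.5, 0.0) < p_correct(-2.0, 0.5, 0.0)
assert p_correct(2.0, 1.5, 0.0) > p_correct(2.0, 0.5, 0.0)
```

In the uniform case the difficulty parameter differs between groups; in the non-uniform case the discrimination parameter differs, which produces the crossing curves described above.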
As stated earlier, DIF occurs when two groups of the same ability levels
have different chances of endorsing an item. Thus, a criterion is needed for
matching the examinees for ability. The process is called conditioning and the
criterion is dubbed the matching criterion. Matching is of two types: internal and
external. In the case of internal matching, the criterion is the observed or latent
score of the test itself. For external matching, the observed or latent score of
another test is considered as the criterion. External matching can become
problematic because in such cases the assumption is that the supplementary test
itself is free of bias and that it is testing the same construct as the test of focus
(McNamara & Roever, 2006).
DIF is not evidence for bias in the test. It is evidence of bias if, and only if,
the factor causing DIF is irrelevant to the construct underlying the test. If that factor
is part of the construct, it is called impact rather than bias. The decision as to
whether the real source of DIF in an item is part of the construct being gauged is
ultimately subjective. Usually, a panel of experts is consulted to lend more validity
to the interpretations.
are valid. Moreover, a test needs to be fair for different test takers. In other words,
the test should not be biased with respect to test-taker characteristics, e.g., males
vs. females, blacks vs. whites, etc. Examining such an issue requires a statistical
approach to test analysis that can first determine whether the test items are
functioning differentially among test-taking groups and then detect the sources of
this variance (Geranpayeh & Kunnan, 2006). One of the approaches suggested for
such purposes is DIF.
Studying the differential performance of different test taking groups is
essential in the test development and test use procedures. If the sources of DIF are
irrelevant to the construct being measured by the test, it is a source of bias and the
validity of the test is under question. The higher the stakes of the test, the more
serious the consequences of its use. With high-stakes tests, it is incumbent
upon test users to ensure that their test is free of bias and that the interpretations
made of the test scores are valid.
Test fairness analysis and the search for test bias are closely interwoven. In
fact, they are two sides of the same coin: whenever a test is biased, it is not fair and
vice versa. The search for fairness has gained new impetus during the last two
decades mainly due to advances within Critical Language Testing (CLT). The
proponents of CLT believe that all uses of language tests are politically
motivated. Tests are, as they suggest, means of manipulating society and imposing
the will of the system on individuals (see Shohamy 2001).
DIF analysis provides only a partial answer to fairness issues. It focuses
only on the differential performance of two groups on an item. Therefore, whenever
no groupings are involved in a test, then DIF is not applicable. However, when
groupings are involved, the possibility that the items are favoring one group exists.
If this happens, then the test may not be fair for the disfavored group. Thus, DIF
analysis should be applied in such contexts to obviate the problem.
DIF Methodology
McNamara and Roever (2006, p. 93) have classified methods of DIF
detection into four categories:
1. Analyses based on item difficulty (e.g. transformed item difficulty index
(TID) or delta plot).
2. Nonparametric methods. These methods make use of contingency tables
and chi-square methods.
3. Item-response-theory-based approaches, which include the one-, two-, and
three-parameter logistic models.
4. Other approaches. These methods have not been developed primarily for
DIF detection but they can be utilized for this purpose. They include
multifaceted Rasch measurement and generalizability theory.
Despite the diversity of techniques, only a limited number of them appear
to be in current use. DIF detection techniques based on difficulty indices are not
common. Although they are conceptually simple and their application does not
require understanding complicated mathematical formulas, they face certain
2012 Time Taylor Academic Journals ISSN 2094-0734
problems, including the fact that they assume equal discrimination across all items
and that there is no matching for ability (McNamara & Roever, 2006). If the first
assumption is not met, the results can be misleading (Angoff, 1993; Scheuneman &
Bleistein, 1989). When an item has a high discrimination level, it shows large
differences between the groups. On the other hand, differences between the
groups will not be significant in an item with low discrimination.
As indicated above, the DIF indices based upon item difficulty are not
common. Thus, they will not be discussed here. (For a detailed account of DIF
detection methods, both traditional and modern, see Kamata & Vaughn, 2004;
Scheuneman & Bleistein, 1989; Wiberg, 2007.) In the next sections, a
general discussion of the most frequently used DIF detection methods will be
presented.
Logistic Regression
Logistic regression, first proposed for DIF detection by Swaminathan and
Rogers (1990), is used when there are one or more independent variables, which
are typically continuous, and a binary or dichotomous dependent variable (Pampel,
2000; Swaminathan & Rogers, 1990; Zumbo, 1999). In applying logistic regression
to DIF detection, one attempts to see whether item performance, a wrong or right
answer, can be predicted from total scores alone, from total scores plus group
membership, and from total scores, group membership, and interaction between
them. The procedure can be formulaically presented as follows:
ln[P / (1 − P)] = b0 + b1θ + b2G + b3(θG)

where P is the probability of a correct response, θ is the total score, G is group
membership, and θG is the interaction between them.
At each step, the improvement in model fit reflects the contribution of the newly
added variable. As such, logistic regression involves three successive
steps:
1. The conditioning variable or the total score is entered into the model
2. The grouping variable is added
3. The interaction term is also entered.
As a test of the significance of DIF, the Chi-square value of step 1 is
calculated and subtracted from the Chi-square value of step three. This is an overall
index of the significance of the DIF. The Chi-square value of step 2 can be
subtracted from that of step 3 to provide a significance test of non-uniform DIF. In
addition, comparing the Chi-square value of steps 1 and 2 is a good indicator of
uniform DIF.
Zumbo (1999) argued that logistic regression has three main advantages
over other DIF detection techniques in that one:
- need not categorize a continuous criterion variable,
- can model both uniform and non-uniform DIF
- can generalize the binary logistic regression model for use with ordinal item
scores. (p. 23)
Also, Wiberg (2007) noted that the logistic regression and the Mantel-
Haenszel statistics (to be explained in the next section) have gained particular
attention due to the fact that they can be utilized for detecting DIF in small sample
sizes. For example, Zumbo (1999) pointed out that 200 people per group are
needed, a modest sample size compared to that required by other models such as
the three-parameter IRT, which requires over 1,000 test takers per group.
McNamara and Roever (2006) also stated that "logistic regression is useful
because it allows modeling of uniform and non-uniform DIF, is nonparametric, can
be applied to dichotomous and rated items, and requires less complicated
computing than IRT-based analysis" (p. 116).
There are a number of software packages for doing DIF analysis using logistic
regression. LORDIF (Choi, Gibbons, & Crane, 2011) conducts DIF analysis for
dichotomous and polytomous items using both ordinal logistic regression and IRT. In
addition, SPSS can also be used for doing DIF analysis through both the MH and logistic
regression. Zumbo (1999) and Kamata and Vaughn (2004) provide examples of such
analyses. Magis, Béland, Tuerlinckx, and De Boeck (2010) have also introduced
an R package for DIF detection, called difR, that can apply nine DIF detection
techniques including the logistic regression and the MH.
There are a number of studies that have applied logistic regression for DIF
detection. Shabani (2008) utilized logistic regression to analyze a version of the
University of Tehran English Proficiency test (UTEPT) for the presence of DIF
due to gender differences. Kim (2001) conducted a DIF analysis of the
polytomously scored speaking items in the SPEAK test (the Speaking Proficiency
English Assessment Kit), a test developed by the Educational Testing Service. The
participants were divided into two different groups: the East Asian and the
European groups. He utilized the IRT likelihood ratio test and logistic regression to
detect the differentially functioning items. Davidson (2004) investigated the
comparability of test scores for non-aboriginal and aboriginal students.
Standardization
The Standardization index is computed as:

STD P-DIF = Σs Ks(PFs − PRs) / Σs Ks

where Ks is the relative frequency of the focal group members at score level s, PFs is the
proportion of the focal group at score level s correctly responding to the item, and
PRs is the proportion of reference group members scoring s who endorse the item.
There are two versions of this technique based on whether the sign of the
difference is taken into account or not: unsigned proportion difference and the
signed proportion difference (Wiberg, 2007). The former is also referred to as the
standardized p-difference. The standardized p-difference index is more common.
The item will be flagged as DIF if the absolute value of this index is above 0.1.
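A minimal sketch of the standardized p-difference with hypothetical score levels (the counts and proportions below are invented for illustration):

```python
# Hypothetical score levels: (focal-group count at the level,
# proportion correct in the focal group, proportion correct in the
# reference group).
levels = [
    (30, 0.20, 0.35),
    (50, 0.45, 0.55),
    (20, 0.70, 0.80),
]

# Weighted average of the focal-minus-reference differences, weighted by
# the focal group's relative frequency at each score level.
numerator = sum(n * (p_f - p_r) for n, p_f, p_r in levels)
std_p_dif = numerator / sum(n for n, _, _ in levels)
flagged = abs(std_p_dif) > 0.1   # the 0.1 rule of thumb from the text
print(round(std_p_dif, 3))       # → -0.115
```

Here the index is negative, meaning the focal group answers the item correctly less often than matched reference-group members, and its absolute value exceeds 0.1, so the item would be flagged.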
Despite conceptual and statistical simplicity, the standardization procedure
is not so prevalent due to the large sample sizes that it requires (McNamara &
Roever, 2006). Another shortcoming of the procedure is that it has no significance
tests (Clauser & Mazor, 1998).
One of the most recent software packages introduced for DIF detection through the
Standardization procedure is EASY-DIF (González et al., 2011), which also
applies the Mantel-Haenszel procedure. Also, STDIF (Robin, 2001) is a
free DOS-based program to compute DIF through the Standardization approach.
STDIF also has a manual (Zenisky, Robin, & Hambleton, 2009) which is freely
available. The software and the manual are both available at:
http://www.umass.edu/remp/software/STDIF.html.
Zenisky, Hambleton, and Robin (2003) utilized STDIF to apply a two-stage methodology for evaluating DIF in large-scale state assessment data. These
researchers were concerned with the merits of iterative approaches to DIF detection.
In a later study, Zenisky et al. (2004) also applied STDIF to identify gender
DIF in a large-scale science assessment. As the authors explain, their methodology
was a variant of the Standardization technique. Lawrence, Curley, and McHale
(1988) also applied the Standardization technique to detect differentially
functioning items in the reading comprehension and sentence completion items in
the verbal section of the Scholastic Aptitude Test (SAT). Freedle and Kostin
(1997) conducted an ethnic comparison study using the DIF methodology. They
scrutinized a large number of items from SAT and GRE exams comparing the
performance of the Black and White examinees. Gallagher (2004) has applied the
Standardization procedure, logistic regression, and the MH to investigate the
reading performance differences between African-American and White students
taking a nationally normed reading test.
Mantel-Haenszel
The Mantel-Haenszel (MH) procedure was first proposed for DIF analysis
by Holland and Thayer (as cited in Kamata & Vaughn, 2004). The basic idea is to
calculate the odds of correctly endorsing an item for the focal group relative to the
reference group. If there are large differences, DIF is present.
According to Scheuneman and Bleistein (1989), "The MH estimate is a
weighted average of the odds ratios at each of j ability levels" (p. 262). That is, the
odds ratios of success at each ability level are estimated and then combined as a
weighted average across all ability levels.
Table 1 shows the hypothetical performance of two groups of test takers,
focal and reference groups, on an item.

Table 1
Empirical Probabilities

                    Correct    Incorrect
Reference group       .70         .30
Focal group           .40         .60

The odds of a correct response are thus .70/.30 = 2.33 for the reference group
and .40/.60 = 0.66 for the focal group.
Finally, we want to know how much more likely the members of the
reference group are to respond correctly than the members of the focal group. To
this end, we compute the odds ratio:
OR = 2.33/0.66 ≈ 3.5
Simply put, the odds ratio in the above example shows that members of the
reference group are three and a half times more likely than members of the focal
group to endorse the item.
However, note that we have calculated the odds ratio for only one ability level.
The overall DIF index is obtained by combining the odds ratios across all ability
levels as a weighted average. The resulting index is the Mantel-Haenszel odds
ratio, denoted αMH. This index is usually transformed by taking its natural
logarithm:

βMH = ln(αMH)
A negative βMH indicates DIF in favor of the focal group whereas a positive
βMH shows DIF favoring the reference group (Wiberg, 2007). Sometimes, αMH is
further transformed onto the ETS delta scale:

ΔMH = -2.35 ln(αMH)
A positive value of ΔMH indicates that the item is more difficult for
the reference group, while a negative value shows that the focal group faces more
difficulty with the item.
The Educational Testing Service uses the MH statistic in DIF analysis.
Items flagged as DIF are further classified into three types (Zieky, 1993) to avoid
"identifying items that display practically trivial but statistically significant DIF"
(Clauser & Mazor, 1998, p. 39). Items are identified as showing type A DIF if the
absolute value of ΔMH is smaller than 1.0 or not significantly different
from zero. Type C DIF occurs when the absolute value of ΔMH is greater
than 1.5 and significantly greater than 1.0. All other DIF items are flagged as
type B.
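The MH odds ratio, the delta transformation, and the ETS classification can be sketched as follows. The 2x2 tables are hypothetical, and the estimator uses the standard Mantel-Haenszel weighted form rather than a simple average of the per-level odds ratios:

```python
import math

# Hypothetical 2x2 tables, one per score level:
# (reference correct, reference incorrect, focal correct, focal incorrect)
strata = [
    (14, 6, 8, 12),
    (18, 2, 12, 8),
]

# Standard Mantel-Haenszel common odds ratio: a weighted combination of
# the per-level odds ratios.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den                    # > 1 favors the reference group
delta_mh = -2.35 * math.log(alpha_mh)   # ETS MH D-DIF (delta) scale

def ets_category(delta, sig_vs_zero, sig_vs_one):
    """ETS A/B/C rule; the significance flags come from a separate test."""
    if abs(delta) < 1.0 or not sig_vs_zero:
        return "A"  # negligible DIF
    if abs(delta) > 1.5 and sig_vs_one:
        return "C"  # large DIF
    return "B"      # moderate DIF
```

With these tables the odds consistently favor the reference group, so alpha_mh exceeds 1, delta_mh is negative, and (assuming significance) the item would fall in category C.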
The main software packages for DIF analysis using the MH are DIFAS (Penfield,
2005), EZDIF (Waller, 1998), and, more recently, EASY-DIF (González, Padilla,
Hidalgo, Gómez-Benito, & Benítez, 2011). Another relevant program is
DICHODIF (Rogers, Swaminathan, & Hambleton, 1993), which can apply both the
MH and logistic regression. Also, LERTAP (Nelson, 2000) is an Excel-based
classical item-analysis program that can do DIF analysis using the MH. Its
student version is freely available and the full version is available from
http://assess.com/xcart/product.php?productid=235&cat=21&page=1. For more
helpful information about the software, see http://lertap.curtin.edu.au/. Winsteps
(Linacre, 2010) also provides MH-based DIF estimates as part of its output.
Elder (1996) conducted a study to determine whether language background
may lead to DIF. She examined reading and listening subsections of the Australian
Language Certificates (ALC), a test given to Australian school-age learners from
diverse language backgrounds. Her participants included those who were enrolled
in language classes in years 8 and 9. The languages of her focus were Greek, Italian,
and Chinese. Elder (1996) compared the performance of background speakers
(those who used to speak the target language plus English at home) with
non-background speakers (those who were only exposed to English at home). She
applied the Mantel-Haenszel procedure to detect DIF. Ryan and Bachman (1992)
also utilized the Mantel-Haenszel procedure to compare the performance of
groups of male and female test takers on the FCE and TOEFL tests. Allalouf and
Abramzon (2008) investigated the differences between groups from different first
language backgrounds, namely Arabic and Russian, using the Mantel-Haenszel.
Ockey (2007) applied both IRT and the MH to compare the performance of
English language learner (ELL) and non-ELL 8th-grade students' scores on
National Assessment of Educational Progress (NAEP) math word problems. For an
overview of the applications of the Mantel-Haenszel procedure to detect DIF, see
Guilera, Gómez-Benito, and Hidalgo (2009).
Item Response Theory
The main difference between IRT DIF detection techniques and other
methods, including logistic regression and the MH, is that in non-IRT
approaches "examinees are typically matched on an observed variable (such as
total test score), and then counts of examinees in the focal and reference groups
getting the studied item correct or incorrect are compared" (Clauser & Mazor, 1998,
p. 35). That is, the conditioning or matching criterion is the observed score.
In IRT-based methods, however, matching is based on the examinees' estimated
ability level, or the latent trait, θ.
The three-parameter logistic (3PL) model gives the probability of a correct
response to an item as:

P(θ) = c + (1 − c) / (1 + e^(−a(θ − b)))

where:
b is the difficulty parameter,
a is the discrimination parameter,
c is the guessing or pseudo-chance parameter, and
θ is the ability level.
The basic idea in detecting DIF through IRT models is that if DIF is
present in an item, the ICCs of the item for the reference and the focal groups
should be different (Thissen, Steinberg, & Wainer, 1993). However, where there is
no DIF, the item parameters and hence ICCs should be almost the same. It is
evident that the ICCs will differ if the item parameters vary from one group
to another. Thus, one possible way of detecting DIF through IRT is to compare
the item parameters across the two groups. If the item parameters are significantly
different, then DIF is indicated.
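This comparison can be sketched numerically: with separately calibrated parameters for the two groups (the values below are hypothetical), the unsigned area between the two ICCs gives a rough measure of the amount of DIF:

```python
import math

def p3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical parameter estimates from separate group calibrations:
ref = dict(a=1.2, b=0.0, c=0.2)   # reference group
foc = dict(a=1.2, b=0.6, c=0.2)   # focal group: the item looks harder

# Unsigned area between the two ICCs over an ability grid, as a rough
# effect size for the amount of DIF (zero if the curves coincide).
grid = [i / 10 for i in range(-40, 41)]
area = sum(abs(p3pl(t, **ref) - p3pl(t, **foc)) for t in grid) * 0.1
```

A formal test would also compare the parameter estimates against their standard errors; the area statistic simply quantifies how far apart the two curves are.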
IRT DIF can be computed using BILOG-MG (Scientific Software
International, 2003) for dichotomously scored items and PARSCALE (Muraki &
Bock, 2002) and MULTILOG (Thissen, 1991) for polytomously scored items.
In addition, for small sample sizes, nonparametric IRT can be employed using the
TestGraf software (Ramsay, 2001). For an exemplary study of the application of
TestGraf for DIF detection, see Laroche, Kim, and Tomiuk (1998). Finally, the
IRTDIF software (Kim & Cohen, 1992) can do DIF analysis under the IRT
framework.
Pae (2004) undertook a DIF study of examinees with different academic
backgrounds sitting the English subtest of the Korean National Entrance Exam
for Colleges and Universities. He applied the three-parameter IRT model through
MULTILOG for DIF analysis. Before applying the IRT, however, Pae (2004) also
did an initial DIF analysis using the MH procedure to detect suspect
items. Geranpayeh and Kunnan (2006) also examined the existence of differentially
functioning items on the listening section of the Certificate in Advanced
English examination for test takers from three different age groups. Uiterwijk and
Vallen (2005) investigated the performance of second-generation
immigrant (SGI) students and native Dutch (ND) students in the Final Test
of Primary Education in the Netherlands. Both IRT and Mantel-Haenszel were
applied in their study.
The Rasch Model
Although the one-parameter logistic model and the Rasch model are
mathematically similar, they were developed independently of each other. In fact, a
number of scholars (e.g., Pollitt, 1997) believe that the IRT models are totally
different from the Rasch model.
The Rasch model focuses on the probability of person n endorsing item i.
In modeling this probability, it essentially takes into account person ability and
item difficulty: the probability is a function of the difference between person
ability and item difficulty. The following formula shows just this:
P(x = 1 | β, δ) = f(β − δ)

where β is person ability and δ is item difficulty. The formula simply states that
the probability of endorsing the item is a function of the difference between person
ability, β, and item difficulty, δ. This is possible because item difficulty and
person ability are on the same scale in the Rasch model. It is also intuitively
appealing to conceive of probability in such terms. The Rasch model assumes that
any person taking the test has an amount of the construct gauged by the test and
that any item also shows an amount of the construct. These values work in the
opposite direction. Thus, it is the difference between item difficulty and person
ability that counts.
Three cases can be considered for any encounter of persons and items
(Wilson, 2005):
1. Item difficulty and person ability are the same, β − δ = 0, and the person
has an equal probability of endorsing or failing the item. Thus, the probability is
.5.
2. Person ability is greater than item difficulty, β − δ > 0, and the person
has more than a .5 probability of endorsing the item.
3. Person ability is lower than item difficulty, β − δ < 0, and the probability
of giving a correct response to the item is less than .5.
The exact formula for the Rasch model is the following:

Ln[Pni / (1 − Pni)] = βn − δi

where Pni is the probability that person n gives a correct response to item i.
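The three cases above can be checked directly with the Rasch probability function (the ability and difficulty values are hypothetical):

```python
import math

def rasch_p(beta, delta):
    """Rasch probability of a correct response: exp(b-d)/(1+exp(b-d))."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

# The three cases, with hypothetical ability/difficulty values:
assert rasch_p(1.0, 1.0) == 0.5    # ability equals difficulty
assert rasch_p(2.0, 1.0) > 0.5     # ability exceeds difficulty
assert rasch_p(0.5, 1.0) < 0.5     # difficulty exceeds ability
```

Only the difference between the two parameters enters the function, which is exactly the point made above.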
To detect DIF, the difficulty estimates of an item from separate calibrations in
the two groups can be compared with a t statistic:

t = (di2 − di1) / √(si1² + si2²)

where di1 is the difficulty of item i in the calibration based on group 1, di2 is the
difficulty of item i in the calibration based on group 2, si1 is the standard error of
estimate for di1, and si2 is the standard error of estimate for di2. Baghaei (2009),
Bond and Fox (2007), and
Wilson (2005) present excellent introductory level expositions of the Rasch model.
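The Rasch-based comparison of separate calibrations reduces to a few lines (the difficulty estimates and standard errors below are hypothetical):

```python
import math

# Hypothetical difficulty estimates for item i from two separate
# calibrations, with their standard errors of estimate.
d_i1, s_i1 = -0.40, 0.12   # calibration on group 1
d_i2, s_i2 = 0.15, 0.14    # calibration on group 2

# Approximate t statistic for the difference between the two estimates.
t = (d_i2 - d_i1) / math.sqrt(s_i1**2 + s_i2**2)
flagged = abs(t) > 2.0     # a common rule of thumb for flagging DIF
```

Here the item is noticeably harder in the second calibration relative to its standard errors, so it would be flagged for closer inspection.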
Among the software packages for DIF analysis using the Rasch model are ConQuest
(Wu, Adams, Wilson, & Haldane, 2007), Winsteps (Linacre, 2010), and Facets
(Linacre, 2009).
Karami (2011) applied the Rasch model to investigate the existence of
DIF items in the UTEPT for male and female examinees. He applied Linacre's
Winsteps for DIF analysis. Also, Karami (2010) used Winsteps to examine
the UTEPT items for possible DIF for test takers from different academic
backgrounds. Elder, McNamara, and Congdon (2003) also applied the Rasch
model to examine the performance of native and non-native speakers on a test of
academic English.
Furthermore, Takala and Kaftandjieva (2000) undertook a study to
investigate the presence of DIF in the vocabulary subtest of the Finnish Foreign
Language Certificate Examination, an official, national high-stakes foreign-language
examination based on a bill passed by Parliament. To detect DIF, they utilized the
One Parameter Logistic Model (OPLM), a modification of the Rasch model where
item discrimination is not considered to be one but is input as a known constant.
Pallant and Tennant (2007) also applied the Rasch model to scrutinize the utility of
the Hospital Anxiety and Depression Scale (HADS) total score (HADS-14) as a
measure of psychological distress.
Conclusion
DIF analysis aims to detect items that differentially favor examinees of the
same ability levels but from different groups. The technical requirements of this
methodology, however, have deterred non-mathematically oriented researchers.
Even researchers who do not apply these techniques in their own studies need to
be familiar with them in order to fully appreciate published papers that report
such analyses.
This paper attempted to provide a non-technical introduction to the basic
principles of DIF analysis. Five DIF detection techniques were explained: Logistic
Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch
model. For each technique, a number of the most widely applied software along with some
studies applying those techniques were briefly cited. The interested reader may refer to
such studies for further information about their application. It is hoped that the
exposition offered here will enable researchers to appreciate and enjoy reading
studies that have conducted a DIF analysis.
References
Alderson, J. C., & Urquhart, A. (1985). The effect of students' academic discipline
on their performance on ESP reading tests. Language Testing, 2, 192–204.
Allalouf, A., & Abramzon, A. (2008). Constructing better second language
assessments based on differential item functioning analysis. Language
Assessment Quarterly, 5, 120–141.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology.
In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–4).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Baghaei, P. (2009). Understanding the Rasch model. Mashad: Mashad Islamic Azad
University Press.
Baker, F. (2001). The basics of item response theory. ERIC Clearinghouse on
Assessment and Evaluation, University of Maryland, College Park, MD.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental
measurement in the human sciences. London: Lawrence Erlbaum.
Brown, J. D. (1999). The relative importance of persons, items, subtests and
languages to TOEFL test variance. Language Testing, 16, 217–238.
Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement
(pp. 221–256). New York: American Council on Education & Praeger series
on higher education.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency
tests. Language Testing, 2(2), 155–163.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). lordif: An R package for
detecting differential item functioning using iterative hybrid ordinal
logistic regression/item response theory and Monte Carlo simulations.
Journal of Statistical Software, 39(8), 1–30.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify
differentially functioning test items. Educational Measurement: Issues and
Practice, 17, 31–44.
Davidson, B. (2004). Comparability of test scores for non-aboriginal and aboriginal
students (Doctoral dissertation, University of British Columbia, 2004).