The International Journal of Educational and Psychological Assessment


September 2012, Vol. 11(2)

An Introduction to Differential Item Functioning


Hossein Karami

University of Tehran, Iran


Abstract
Differential Item Functioning (DIF) has been increasingly applied in fairness studies in psychometric circles. Judicious application of this methodology, however, requires an understanding of the technical complexities involved, which has become an impediment for researchers who are not mathematically oriented. This paper is an attempt to bridge the gap. It provides a non-technical introduction to the fundamental concepts involved in DIF analysis. In addition, an introductory-level explanation of a number of the most frequently applied DIF detection techniques is offered, including Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch model. For each method, a number of relevant software packages are also introduced.

Key words: Differential Item Functioning, Validity, Fairness, Bias


Introduction
Differential Item Functioning (DIF) occurs when two groups of equal ability
levels are not equally able to correctly answer an item. In other words, one group
does not have an equal chance of getting an item right though its members have
comparable ability levels to the other group. If the factor leading to DIF is not part
of the construct being tested, then the test is biased.
DIF analysis has been increasingly applied in psychometric circles for
detecting bias at the item level (Zumbo, 1999). Language testing researchers have
also followed suit and have exploited DIF analysis in their fairness studies. They
have conducted a plethora of research studies to investigate the existence of bias in
their tests. These studies have focused on such factors as gender (e.g. Ryan &
Bachman, 1992; Karami, 2011; Takala & Kaftandjieva, 2000), language background
(Chen & Henning, 1985; Brown, 1999; Elder, 1996; Kim, 2001; Ryan & Bachman,
1992), and academic background or content knowledge (Alderson & Urquhart,
1985; Hale, 1988; Karami, 2010; Pae, 2004).
Despite the widespread application of DIF analysis in psychometric circles, it seems that the inherent complexity of the concepts involved has hampered its wider application among less mathematically oriented researchers. This paper is an attempt to bridge this gap by providing a non-technical introduction to the fundamental concepts in DIF analysis. The paper begins with an overview of the basic concepts involved. Then, a brief overview of the development of fairness studies and DIF analyses during the last century follows. The paper ends with a detailed, though non-technical, explanation of a number of the most widely used DIF detection techniques. For each technique, a number of the most widely used software packages are also introduced. A few studies applying the relevant techniques are also
listed. Neither the list of the software nor the studies cited are meant to be
exhaustive. Rather, these are intended to orient the reader.
Differential Item Functioning
Differential Item Functioning (DIF) occurs when examinees with the same
ability level but from two different groups have different probabilities of endorsing
an item (Clauser & Mazor, 1998). It is synonymous with statistical bias, where one or more parameters of the statistical model are under- or overestimated (Camilli, 2006; Wiberg, 2007). Whenever DIF is present in an item, the source(s) of this
variance should be investigated to ensure that it is not a case of bias. Any item
flagged as showing DIF is biased if, and only if, the source of variance is irrelevant
to the construct being measured by the test. In other words, it is a source of
construct-irrelevant variance and the groups perform differentially on an item
because of a grouping factor (Messick, 1989, 1994).
There are at least two groups, i.e. focal and reference groups, in any DIF
study. The focal group, a group of minorities for example, is the potentially
disadvantaged group. The group which is considered to be potentially advantaged
by the test is called the reference group. Note, however, that naming the groups is not always clear-cut; in such cases, which group is labeled focal and which reference is essentially arbitrary.
There are two types of DIF, namely uniform and non-uniform DIF.
Uniform DIF occurs when a group performs better than another group on all
ability levels. That is, almost all members of a group outperform almost all
members of the other group who are at the same ability levels. In the case of non-uniform DIF, members of one group are favored up to a point on the ability scale
and from that point on the relationship is reversed. That is, there is an interaction
between grouping and ability level.
As stated earlier, DIF occurs when two groups of the same ability levels
have different chances of endorsing an item. Thus, a criterion is needed for
matching the examinees for ability. The process is called conditioning and the criterion is dubbed the matching criterion. Matching is of two types: internal and
external. In the case of internal matching, the criterion is the observed or latent
score of the test itself. For external matching, the observed or latent score of
another test is considered as the criterion. External matching can become
problematic because in such cases the assumption is that the supplementary test
itself is free of bias and that it is testing the same construct as the test of focus
(McNamara & Roever, 2006).
DIF is not evidence for bias in the test. It is evidence of bias if, and only if,
the factor causing DIF is irrelevant to the construct underlying the test. If that factor
is part of the construct, it is called impact rather than bias. The decision as to
whether the real source of DIF in an item is part of the construct being gauged is
totally subjective. Usually, a panel of experts is consulted to give more validity to the
interpretations.

The Development of DIF


The origins of bias analysis can be traced back to the early twentieth century
(McNamara & Roever, 2006). At the time, researchers were concerned with
developing tests that measured raw intelligence. A number of studies conducted at
the time, however, showed that the socio-economic status of the test takers was a
confounding variable. Thus, they aimed to partial out this variance through purging
items that functioned differently for examinees with high and low socio-economic
status.
In the 1960s, the focus of bias studies shifted from intelligence tests to areas
where social equity was a major concern (Angoff, 1993). The role of fairness in tests
became highlighted. A variety of techniques were developed for detecting bias in
the tests. There was a problem with all these bias-detection techniques: all the
techniques required performance on a criterion test. Criterion measures could not
be obtained until tests were in use, however, making test-bias detection procedures inapplicable (Scheuneman & Bleistein, 1989, p. 256). Consequently, researchers turned to devising a plethora of item-level detection procedures.
The Golden Rule settlement, the outcome of a suit filed in 1976, was a landmark in bias studies because legal issues entered the scene (McNamara & Roever, 2006). The Golden
Rule Insurance Company filed a suit against the Educational Testing Service and
the Illinois Department of Insurance due to an alleged bias against blacks in the
tests they developed. The court issued a verdict in favor of the Golden Rule
Insurance Company. The ETS was considered liable for the tests it developed and
was legally ordered to make every effort to rule out bias in its tests.
The main point about the settlement was the fact that bias analysis turned
out to be a legal issue as the test developing agencies were legally held responsible
for the consequences of their tests. The case also highlighted the significance of Samuel Messick's (1980, 1989) work, which emphasized the consequential aspects of tests in his validation framework.
A number of researchers (e.g. Linn & Drasgow, 1987) opposed the verdict
emphasizing that simply discarding items showing DIF may render the test invalid
by making it less representative of the construct measured by the test. Another reason they put forward was that the items may reflect true differences between the test-taking groups, so that the test simply mirrors real-world ability differences.
The proponents of the settlement, however, argued that there is no reason
to believe that there are ability differences between the test takers simply because
they are from different test taking groups. Thus, any observed differences in the
performance of, say, blacks and whites, is cause for concern and a source of construct-irrelevant variance in Messick's terminology. Since then, a number of
techniques have been developed to detect differentially functioning items.
DIF, Validity, and Fairness
The primary concern in test development and test use, as Bachman (1990)
suggests, is demonstrating that the interpretations and uses we make of test scores
are valid. Moreover, a test needs to be fair for different test takers. In other words, the test should not be biased with respect to test takers' characteristics, e.g. males vs. females, blacks vs. whites, etc. Examining such an issue requires, at the least, a statistical approach to test analysis that can first establish whether the test items are functioning differentially across test-taking groups and then detect the sources of this variance (Geranpayeh & Kunnan, 2007). One of the approaches suggested for such purposes is DIF analysis.
Studying the differential performance of different test taking groups is
essential in the test development and test use procedures. If the sources of DIF are
irrelevant to the construct being measured by the test, it is a source of bias and the validity of the test is in question. The higher the stakes of the test, the more serious the consequences of test use. With high-stakes tests, it is incumbent upon the test users to ensure that their test is free of bias and that the interpretations made of the test scores are valid.
Test fairness analysis and the search for test bias are closely interwoven. In
fact, they are two sides of the same coin: whenever a test is biased, it is not fair and
vice versa. The search for fairness has gained new impetus during the last two
decades mainly due to advances within Critical Language Testing (CLT). The
proponents of the CLT believe that all uses of language tests are politically
motivated. Tests are, as they suggest, means of manipulating society and imposing
the will of the system on individuals (see Shohamy 2001).
DIF analysis provides only a partial answer to fairness issues. It focuses only on the differential performance of two groups on an item. Therefore, whenever
no groupings are involved in a test, then DIF is not applicable. However, when
groupings are involved, the possibility that the items are favoring one group exists.
If this happens, then the test may not be fair for the disfavored group. Thus, DIF
analysis should be applied in such contexts to obviate the problem.
DIF Methodology
McNamara and Roever (2006, p. 93) have classified methods of DIF
detection into four categories:
1. Analyses based on item difficulty (e.g. transformed item difficulty index
(TID) or delta plot).
2. Nonparametric methods. These methods make use of contingency tables
and chi-square methods.
3. Item-response-theory-based approaches, which include 1-, 2-, and 3-parameter logistic models.
4. Other approaches. These methods have not been developed primarily for
DIF detection but they can be utilized for this purpose. They include
multifaceted Rasch measurement and generalizability theory.
Despite the diversity of techniques, only a limited number of them appear
to be in current use. DIF detection techniques based on difficulty indices are not common. Although they are conceptually simple and their application does not require understanding complicated mathematical formulas, they face certain
problems, including the fact that they assume equal discrimination across all items and that there is no matching for ability (McNamara & Roever, 2006). If the first assumption is not met, the results can be misleading (Angoff, 1993; Scheuneman & Bleistein, 1989). When an item has a high discrimination level, it shows large differences between the groups. On the other hand, differences between the groups will not be significant for an item with low discrimination.
As indicated above, the DIF indices based upon item difficulty are not
common. Thus, they will not be discussed here. (For a detailed account of DIF
detection methods, both traditional and modern, see the following: Kamata & Vaughn, 2004; Scheuneman & Bleistein, 1989; Wiberg, 2007). In the next sections, a
general discussion of the most frequently used DIF detection methods will be
presented.
Logistic Regression
Logistic regression, first proposed for DIF detection by Swaminathan and Rogers (1990), is basically used when we have one or more independent variables, which are usually continuous, and a binary or dichotomous dependent variable (Pampel, 2000; Swaminathan & Rogers, 1990; Zumbo, 1999). In applying logistic regression
to DIF detection, one attempts to see whether item performance, a wrong or right
answer, can be predicted from total scores alone, from total scores plus group
membership, and from total scores, group membership, and interaction between
them. The procedure can be presented formulaically as follows:

\ln\left(\frac{P_{mi}}{1 - P_{mi}}\right) = \beta_0 + \beta_1 \theta_m + \beta_2 G_m + \beta_3 (\theta_m G_m)

In the formula, \beta_0 is the intercept, \beta_1 \theta_m is the effect of the conditioning variable \theta_m, which is usually the total score on the test, \beta_2 G_m is the effect of the grouping variable G_m, and finally \beta_3 (\theta_m G_m) is the ability-by-grouping interaction effect. If the conditioning variable alone is enough to predict item performance, with relatively small residuals, then no DIF is present. If group membership, \beta_2 G_m, adds to the precision of the prediction, uniform DIF is detected; that is, one group performs better than the other across ability levels. Finally, if, in addition to total scores and grouping, an interaction effect, signified by \beta_3 (\theta_m G_m) in the formula, is also needed for a more precise prediction of item performance, it is a case of non-uniform DIF (Zumbo, 1999).
Also, note that the formula is based on the logistic function, denoted by

\ln\left(\frac{P_{mi}}{1 - P_{mi}}\right)

where P_{mi} is the probability of a correct answer to item i by person m and 1 - P_{mi} is the probability of a wrong response. In simple words, it is the natural logarithm of the odds of success, that is, the ratio of the probability of a correct response to the probability of an incorrect one.
Identifying DIF through logistic regression is similar to stepwise regression in that successive models are built up, with each step entering a new variable to see whether the new model is an improvement over the previous one due to the
presence of the new variable. As such, logistic regression involves three successive
steps:
1. The conditioning variable, i.e. the total score, is entered into the model.
2. The grouping variable is added.
3. The interaction term is also entered.
As a test of the significance of DIF, the Chi-square value of step 1 is calculated and subtracted from the Chi-square value of step 3. This serves as an overall index of the significance of DIF. The Chi-square value of step 2 can be subtracted from that of step 3 to provide a significance test of non-uniform DIF. In addition, comparing the Chi-square values of steps 1 and 2 provides an indicator of uniform DIF.
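To make the three-step procedure concrete, the following is a minimal Python sketch of the nested-model comparisons described above. It assumes a data set with a 0/1 item response, a total score used as the matching criterion, and a 0/1 group indicator; the column names and the use of the statsmodels library are illustrative assumptions, not the implementation used in any particular study or program.

import pandas as pd
import statsmodels.api as sm
from scipy import stats

def lr_dif(df: pd.DataFrame) -> dict:
    # df is assumed to have columns "item" (0/1 response), "total" (matching
    # score), and "group" (0 = reference, 1 = focal); the names are hypothetical.
    y = df["item"]
    # Step 1: conditioning variable (total score) only.
    m1 = sm.Logit(y, sm.add_constant(df[["total"]])).fit(disp=0)
    # Step 2: add group membership.
    m2 = sm.Logit(y, sm.add_constant(df[["total", "group"]])).fit(disp=0)
    # Step 3: add the ability-by-group interaction.
    df = df.assign(interaction=df["total"] * df["group"])
    m3 = sm.Logit(y, sm.add_constant(df[["total", "group", "interaction"]])).fit(disp=0)
    # Likelihood-ratio chi-squares between the nested models.
    overall = 2 * (m3.llf - m1.llf)       # step 3 vs. step 1: overall DIF, 2 df
    uniform = 2 * (m2.llf - m1.llf)       # step 2 vs. step 1: uniform DIF, 1 df
    nonuniform = 2 * (m3.llf - m2.llf)    # step 3 vs. step 2: non-uniform DIF, 1 df
    return {"overall": (overall, stats.chi2.sf(overall, 2)),
            "uniform": (uniform, stats.chi2.sf(uniform, 1)),
            "non-uniform": (nonuniform, stats.chi2.sf(nonuniform, 1))}

Each entry pairs a chi-square value with its p-value, reproducing the significance tests described in the preceding paragraph.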
Zumbo (1999) argued that logistic regression has three main advantages
over other DIF detection techniques in that one:
- need not categorize a continuous criterion variable,
- can model both uniform and non-uniform DIF
- can generalize the binary logistic regression model for use with ordinal item
scores. (p. 23)
Also, Wiberg (2007) noted that logistic regression and the Mantel-Haenszel statistic (to be explained below) have gained particular attention due to the fact that they can be utilized for detecting DIF in small sample sizes. For example, Zumbo (1999) pointed out that 200 people per group are needed. This is a modest sample size compared to that required by other models such as the three-parameter IRT model, which requires over 1000 test takers per group.
McNamara and Roever (2006) also stated that "Logistic regression is useful because it allows modeling of uniform and non-uniform DIF, is nonparametric, can be applied to dichotomous and rated items, and requires less complicated computing than IRT-based analysis" (p. 116).
There are a number of software packages for DIF analysis using logistic regression. LORDIF (Choi, Gibbons, & Crane, 2011) conducts DIF analysis for dichotomous and polytomous items using both ordinal logistic regression and IRT. In addition, SPSS can also be used for DIF analysis through both the MH and logistic regression; Zumbo (1999) and Kamata and Vaughn (2004) provide examples of such analyses. Magis, Béland, Tuerlinckx, and De Boeck (2010) have also introduced an R package for DIF detection, called difR, that can apply nine DIF detection techniques, including logistic regression and the MH.
There are a number of studies that have applied logistic regression for DIF
detection. Shabani (2008) utilized logistic regression to analyze a version of the
University of Tehran English Proficiency test (UTEPT) for the presence of DIF
due to gender differences. Kim (2001) conducted a DIF analysis of the
polytomously scored speaking items in the SPEAK test (the Speaking Proficiency
English Assessment Kit), a test developed by the Educational Testing Service. The
participants were divided into two different groups: the East Asian and the
European groups. He utilized the IRT likelihood ratio test and logistic regression to
detect the differentially functioning items. Davidson (2004) has investigated the
comparability of the performances of non-aboriginal and aboriginal students. Lee, Breland, and Muraki (2004) examined the comparability of computer-based testing (CBT) writing prompts in the Test of English as a Foreign Language (TOEFL) for examinees of different native language backgrounds, with a focus on European (German, French, and Spanish) and East Asian (Chinese, Japanese, and Korean) native language groups as reference and focal groups, respectively.
Standardization
The idea here is to compute the difference between the proportion of test
takers, from both focal and reference groups, who answer the item correctly at each
score level. More weight is attached to score levels with more test takers
(McNamara & Roever, 2006). The procedure can be presented formulaically as (Clauser & Mazor, 1998):

D_{STD} = \frac{\sum_{s} w_s (P_{fs} - P_{rs})}{\sum_{s} w_s}

where w_s is the relative frequency of group members (typically the focal group) at score level s, P_{fs} is the proportion of the focal group at score level s correctly responding to the item, and P_{rs} is the proportion of reference group members scoring s who endorse the item.
There are two versions of this technique, based on whether the sign of the difference is taken into account or not: the unsigned proportion difference and the signed proportion difference (Wiberg, 2007). The latter is also referred to as the standardized p-difference and is the more common of the two. The item is flagged as showing DIF if the absolute value of this index is above 0.1.
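As an illustration of the computation just described, the following minimal sketch computes the standardized p-difference from raw data, weighting each score level by the number of focal group members at that level. The array names, the string group labels, and the choice of focal-group counts as weights are assumptions made for the example.

import numpy as np

def std_p_dif(item, total, group):
    # item: NumPy array of 0/1 responses; total: matching scores;
    # group: array of "focal"/"reference" labels. All names are hypothetical.
    num, den = 0.0, 0.0
    for s in np.unique(total):
        focal = item[(total == s) & (group == "focal")]
        ref = item[(total == s) & (group == "reference")]
        if focal.size == 0 or ref.size == 0:
            continue                      # no comparison possible at this level
        w = focal.size                    # weight: focal-group count at level s
        num += w * (focal.mean() - ref.mean())
        den += w
    return num / den                      # |value| > 0.1 is commonly flagged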
Despite conceptual and statistical simplicity, the standardization procedure
is not so prevalent due to the large sample sizes that it requires (McNamara &
Roever, 2006). Another shortcoming of the procedure is that it has no significance
tests (Clauser & Mazor, 1998).
One of the most recent software packages introduced for DIF detection through the standardization procedure is EASY-DIF (González et al., 2011). EASY-DIF also applies the Mantel-Haenszel procedure, explained in the next section. Also, STDIF (Robin, 2001) is a free DOS-based program that computes DIF through the standardization approach. STDIF also has a manual (Zenisky, Robin, & Hambleton, 2009) which is freely available. The software and the manual are both available at: http://www.umass.edu/remp/software/STDIF.html.
Zenisky, Hambleton, and Robin (2003) utilized STDIF to apply a two-stage methodology for evaluating DIF in large-scale state assessment data. These researchers were concerned with the merits of iterative approaches to DIF detection. In a later study, Zenisky et al. (2004) also applied STDIF to identify gender DIF in a large-scale science assessment. As the authors explain, their methodology was a variant of the standardization technique. Lawrence, Curley, and McHale (1988) also applied the standardization technique to detect differentially functioning reading comprehension and sentence completion items in the verbal section of the Scholastic Aptitude Test (SAT). Freedle and Kostin (1997) conducted an ethnic comparison study using the DIF methodology. They
scrutinized a large number of items from the SAT and GRE exams, comparing the performance of Black and White examinees. Gallagher (2004) applied the standardization procedure, logistic regression, and the MH to investigate the reading performance differences between African-American and White students taking a nationally normed reading test.
Mantel-Haenszel
The Mantel-Haenszel (MH) procedure was first proposed for DIF analysis
by Holland and Thayer (as cited in Kamata & Vaughn, 2004). The basic idea is to
calculate the odds of correctly endorsing an item for the focal group relative to the
reference group. If there are large differences, DIF is present.
According to Scheuneman and Bleistein (1989), "The MH estimate is a weighted average of the odds ratios at each of j ability levels" (p. 262). That is, the odds ratios of success at each ability level are estimated and then combined across all ability levels.
Table 1 shows the hypothetical performance of two groups of test takers,
focal and reference groups, on an item.
Table 1

Hypothetical Performance of Two Groups on an Item

                    Correct    Incorrect    Total
Reference group       14           6          20
Focal group            8          12          20
Total                 22          18          40

The first step in calculating the MH statistic is to compute the probabilities of correct and incorrect responses for both groups. The empirical probabilities are shown in Table 2. The second step is to find out how much more likely the members of either group are to answer the item correctly rather than incorrectly. For the reference group, the odds are:

odds(reference) = 0.7/0.3 = 2.33

Similarly, the odds of giving a correct answer to the item for the focal group are as follows:

odds(focal) = 0.4/0.6 = 0.67
Table 2

Empirical Probabilities

                    Correct    Incorrect
Reference group       .7          .3
Focal group           .4          .6

Finally, we want to know how much more likely the members of the reference group are to respond correctly than the members of the focal group. To this aim, we compute the odds ratio:

odds ratio = 2.33/0.67 = 3.5

Simply put, the odds ratio in the above example shows that members of the reference group are three and a half times more likely than members of the focal group to endorse the item.
However, note that we have calculated the odds ratio for only one ability level. The overall DIF index is therefore obtained by averaging (with appropriate weights) the odds ratios over all ability levels. The resulting index is the Mantel-Haenszel common odds ratio, denoted by \alpha_{MH}. This index is usually transformed as follows:

\beta_{MH} = \ln(\alpha_{MH})

A negative \beta_{MH} indicates DIF in favor of the focal group, whereas a positive \beta_{MH} shows DIF favoring the reference group (Wiberg, 2007). Sometimes, \alpha_{MH} is further transformed into:

\Delta_{MH} = -2.35 \ln(\alpha_{MH})

A positive value of \Delta_{MH} indicates that the item is more difficult for the reference group, while a negative value shows that the focal group faces more difficulty with the item.
The Educational Testing Service uses the MH statistic in DIF analysis. Items flagged for DIF are further classified into three types (Zieky, 1993) to avoid "identifying items that display practically trivial but statistically significant DIF" (Clauser & Mazor, 1998, p. 39). Items are identified as showing type A DIF if the absolute value of \Delta_{MH} is smaller than 1.0 or not significantly different from zero. Type C DIF occurs when the absolute value of \Delta_{MH} is greater than 1.5 and significantly different from 1.0. All other DIF items are flagged as type B.
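The following minimal sketch illustrates how the Mantel-Haenszel common odds ratio and the delta transform can be computed from raw data, using 0/1 responses, total scores as the matching criterion, and string group labels; the variable names are illustrative, and the sketch is not the implementation used by any of the programs mentioned below.

import numpy as np

def mantel_haenszel(item, total, group):
    # Accumulate the weighted cross-products of the 2 x 2 tables at each
    # score level s; a/b = reference correct/incorrect, c/d = focal.
    num, den = 0.0, 0.0
    for s in np.unique(total):
        at_s = (total == s)
        a = np.sum(at_s & (group == "reference") & (item == 1))
        b = np.sum(at_s & (group == "reference") & (item == 0))
        c = np.sum(at_s & (group == "focal") & (item == 1))
        d = np.sum(at_s & (group == "focal") & (item == 0))
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den                   # common odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)    # ETS delta metric
    return alpha_mh, delta_mh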
The main software packages for DIF analysis using the MH are DIFAS (Penfield, 2005), EZDIF (Waller, 1998), and, more recently, EASY-DIF (González, Padilla, Hidalgo, Gómez-Benito, & Benítez, 2011). Another relevant program is DICHODIF (Rogers, Swaminathan, & Hambleton, 1993), which can apply both the MH and logistic regression. Also, LERTAP (Nelson, 2000) is an Excel-based classical item-analysis package that can do DIF analysis using the MH. Its student version is freely available and the full version is available from http://assess.com/xcart/product.php?productid=235&cat=21&page=1. For more helpful information about the software, see also http://lertap.curtin.edu.au/. Winsteps (Linacre, 2010) also provides MH-based DIF estimates as part of its output.
Elder (1996) conducted a study to determine whether language background
may lead to DIF. She examined reading and listening subsections of the Australian
Language Certificates (ALC), a test given to Australian school-age learners from
diverse language backgrounds. Her participants included those who were enrolled
in language classes in years 8 and 9. The languages of focus were Greek, Italian, and Chinese. Elder (1996) compared the performance of background speakers (those who spoke the target language plus English at home) with non-background speakers (those who were only exposed to English at home). She applied the Mantel-Haenszel procedure to detect DIF. Ryan and Bachman (1992) also utilized the Mantel-Haenszel procedure to compare the performance of male and female test takers on the FCE and TOEFL tests. Allalouf and Abramzon (2008) investigated the differences between groups from different first language backgrounds, namely Arabic and Russian, using the Mantel-Haenszel. Ockey (2007) applied both IRT and the MH to compare English language learner (ELL) and non-ELL 8th-grade students' scores on National Assessment of Educational Progress (NAEP) math word problems. For an overview of the applications of the Mantel-Haenszel procedure to detect DIF, see Guilera, Gómez-Benito, and Hidalgo (2009).
Item Response Theory
The main difference between IRT-based DIF detection techniques and other methods, including logistic regression and the MH, is the fact that in non-IRT approaches, "examinees are typically matched on an observed variable (such as total test score), and then counts of examinees in the focal and reference groups getting the studied item correct or incorrect are compared" (Clauser & Mazor, 1998, p. 35). That is, the conditioning or matching criterion is the observed score. In IRT-based methods, by contrast, matching is based on the examinees' estimated ability level, or the latent trait, \theta.

Figure 1. A Typical ICC


Methods based on item response theory are conceptually elegant though mathematically very complicated. The building block of IRT is the item characteristic curve (ICC) (see Baker, 2001; DeMars, 2010; Embretson & Reise, 2000;
Hambleton, Swaminathan, & Rogers, 1991). It is a smooth S-shaped curve which depicts the relationship between ability level and the probability of a correct response to the item. As is evident from Figure 1, the probability of a correct response approaches one at the higher end of the ability scale, never actually reaching one. Similarly, at the lower end of the ability scale, the probability approaches, but never reaches, zero.
IRT uses three features to describe the shape of the ICC: item difficulty, item discrimination, and a guessing factor. Based on how many of these parameters are involved in the estimation of the relationship between ability and item response patterns, there are three IRT models, namely the one-, two-, and three-parameter logistic models.
In the one parameter logistic model and the Rasch model, it is assumed that
all items have the same discrimination level. The two parameter IRT model takes
account of item difficulty and item discrimination. However, guessing is assumed to
be uniform across ability levels. Finally, the three parameter model includes a
guessing parameter in addition to item difficulty and discrimination.
The models provide a mathematical equation for the relation of the responses to ability levels (Baker, 2001). The equation for the three-parameter model is:

P(\theta) = c + (1 - c) \frac{1}{1 + e^{-a(\theta - b)}}

where:
b is the difficulty parameter,
a is the discrimination parameter,
c is the guessing or pseudo-chance parameter, and
\theta is the ability level.
The basic idea in detecting DIF through IRT models is that if DIF is present in an item, the ICCs of the item for the reference and the focal groups will differ (Thissen, Steinberg, & Wainer, 1993). Where there is no DIF, however, the item parameters, and hence the ICCs, should be almost the same. The ICCs will clearly differ if the item parameters vary from one group to another. Thus, one possible way of detecting DIF through IRT is to compare the item parameters estimated separately in the two groups. If the item parameters are significantly different, the item is flagged for DIF.
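As a rough illustration of this logic, the following sketch evaluates the three-parameter ICC for two hypothetical sets of item parameters, one estimated in each group, and summarizes the discrepancy between the two curves with a simple unsigned-area measure. The parameter values and the area summary are assumptions made for the example; operational IRT-based DIF procedures rely on formal estimation and statistical tests of the parameter differences.

import numpy as np

def icc_3pl(theta, a, b, c):
    # Three-parameter logistic ICC: guessing floor c, slope a, difficulty b.
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 401)
p_ref = icc_3pl(theta, a=1.2, b=0.0, c=0.2)   # hypothetical reference-group ICC
p_foc = icc_3pl(theta, a=1.2, b=0.5, c=0.2)   # hypothetical focal-group ICC
area = np.sum(np.abs(p_ref - p_foc)) * (theta[1] - theta[0])
print(f"Unsigned area between the two ICCs: {area:.3f}")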
IRT-based DIF can be computed using BILOG-MG (Scientific Software International, 2003) for dichotomously scored items, and PARSCALE (Muraki & Bock, 2002) and MULTILOG (Thissen, 1991) for polytomously scored items. In addition, for small sample sizes, nonparametric IRT can be employed using the TestGraf software (Ramsay, 2001). For an exemplary study of the application of TestGraf for DIF detection, see Laroche, Kim, and Tomiuk (1998). Finally, the IRTDIF software (Kim & Cohen, 1992) can do DIF analysis under the IRT framework.
Pae (2004) undertook a DIF study of examinees with different academic
backgrounds sitting the English subtest of the Korean National Entrance Exam
for Colleges and Universities. He applied the three-parameter IRT model through MULTILOG for the DIF analysis. Before applying IRT, however, Pae (2004) also did an initial DIF analysis using the MH procedure to detect suspect items. Geranpayeh and Kunnan (2007) also examined the existence of differentially functioning items on the listening section of the Certificate in Advanced English examination for test takers from three different age groups. Uiterwijk and Vallen (2005) investigated the performance of second-generation immigrant (SGI) students and native Dutch (ND) students on the Final Test of Primary Education in the Netherlands. Both IRT and the Mantel-Haenszel were applied in their study.
The Rasch Model
Although the one-parameter logistic model and the Rasch model are
mathematically similar, they were developed independently of each other. In fact, a
number of scholars (e.g. Pollitt, 1997) believe that the IRT models are totally
different from the Rasch model.
The Rasch model focuses on the probability that person n endorses item i. In modeling this probability, it essentially takes into account person ability and item difficulty: the probability is a function of the difference between the two. The following formula shows just this:

P(x_{ni} = 1 \mid \theta_n, \delta_i) = f(\theta_n - \delta_i)

where \theta_n is person ability and \delta_i is item difficulty. The formula simply states that the probability of endorsing the item is a function of the difference between person ability, \theta_n, and item difficulty, \delta_i. This is possible because item difficulty and person ability are on the same scale in the Rasch model. It is also intuitively appealing to conceive of probability in such terms. The Rasch model assumes that any person taking the test possesses an amount of the construct gauged by the test and that any item also reflects an amount of the construct. These values work in opposite directions. Thus, it is the difference between item difficulty and person ability that counts.
Three cases can be considered for any encounter between a person and an item (Wilson, 2005):
1. Person ability and item difficulty are the same, \theta_n - \delta_i = 0, and the person has an equal probability of endorsing or failing the item; the probability is .5.
2. Person ability is greater than item difficulty, \theta_n - \delta_i > 0, and the person has a probability greater than .5 of endorsing the item.
3. Person ability is lower than item difficulty, \theta_n - \delta_i < 0, and the probability of giving a correct response to the item is less than .5.
The exact formula for the Rasch model is the following:

\ln\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - \delta_i
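For illustration, the following minimal sketch evaluates the Rasch success probability for a few hypothetical person-item encounters, reproducing the three cases listed above; the ability and difficulty values are invented for the example.

import math

def rasch_p(theta, delta):
    # Probability of success as a function of the ability-difficulty difference.
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

print(rasch_p(1.0, 1.0))   # ability equals difficulty: 0.5
print(rasch_p(2.0, 1.0))   # ability exceeds difficulty: about 0.73
print(rasch_p(0.0, 1.0))   # difficulty exceeds ability: about 0.27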

The Rasch model provides us with sample-independent item difficulty indices. DIF therefore occurs when this invariance does not hold in a particular application of the model (Engelhard, 2009); that is, the indices turn out to depend on the sample that takes the test. The amount of DIF is calculated with the separate calibration t-test approach first proposed by Wright and Stone (1979; see Smith, 2004). The formula is the following:

t = \frac{d_{i2} - d_{i1}}{\sqrt{s_{i2}^2 + s_{i1}^2}}

where d_{i1} is the difficulty of item i in the calibration based on group 1, d_{i2} is the difficulty of item i in the calibration based on group 2, s_{i1} is the standard error of estimate for d_{i1}, and s_{i2} is the standard error of estimate for d_{i2}. Baghaei (2009), Bond and Fox (2007), and Wilson (2005) present excellent introductory-level expositions of the Rasch model.
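The separate calibration t statistic above is simple enough to compute directly once the two calibrations are available. The following sketch does so for a single hypothetical item whose difficulty and standard error have been estimated separately in two groups; the numbers are invented for illustration.

import math

def separate_calibration_t(d1, se1, d2, se2):
    # d1, d2: item difficulties from the two calibrations; se1, se2: their
    # standard errors. All values here are hypothetical.
    return (d2 - d1) / math.sqrt(se1 ** 2 + se2 ** 2)

t = separate_calibration_t(d1=-0.20, se1=0.12, d2=0.25, se2=0.14)
print(f"t = {t:.2f}")   # |t| greater than about 2 is a common rule of thumb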
Among the software packages for DIF analysis using the Rasch model are ConQuest (Wu, Adams, Wilson, & Haldane, 2007), Winsteps (Linacre, 2010), and Facets (Linacre, 2009).
Karami (2011) applied the Rasch model to investigate the existence of DIF items in the UTEPT for male and female examinees. He applied Linacre's Winsteps for the DIF analysis. Also, Karami (2010) used Winsteps to examine
the UTEPT items for possible DIF for test takers from different academic
backgrounds. Elder, McNamara, and Congdon (2003) also applied the Rasch
model to examine the performance of native and non-native speakers on a test of
academic English.
Furthermore, Takala and Kaftandjieva (2000) undertook a study to
investigate the presence of DIF in the vocabulary subtest of the Finnish Foreign
Language Certificate Examination, an official, national high-stakes foreign-language
examination based on a bill passed by Parliament. To detect DIF, they utilized the
One Parameter Logistic Model (OPLM), a modification of the Rasch model in which item discrimination is not assumed to be one but is input as a known constant. Pallant and Tennant (2007) also applied the Rasch model to scrutinize the utility of the Hospital Anxiety and Depression Scale (HADS) total score (HADS-14) as a measure of psychological distress.
Conclusion
DIF analysis aims to detect items that differentially favor examinees of the
same ability levels but from different groups. The technical requirements of this methodology, however, have hampered non-mathematically oriented researchers. Even if researchers do not apply these techniques in their own studies, they have to be familiar with them in order to fully appreciate published papers that report such analyses.
This paper attempted to provide a non-technical introduction to the basic
principles of DIF analysis. Five DIF detection techniques were explained: Logistic
Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch
model. For each technique, a number of the most widely applied software packages, along with some studies applying those techniques, were briefly cited. The interested reader may refer to
such studies for further information about their application. It is hoped that the

exposition offered here will enable researchers to appreciate and enjoy reading
studies that have conducted a DIF analysis.
References
Alderson, J. C., & Urquhart, A. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 2, 192–204.
Allalouf, A., & Abramzon, A. (2008). Constructing better second language assessments based on differential item functioning analysis. Language Assessment Quarterly, 5, 120–141.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–4). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baghaei, P. (2009). Understanding the Rasch model. Mashad: Mashad Islamic Azad University Press.
Baker, F. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. London: Lawrence Erlbaum.
Brown, J. D. (1999). The relative importance of persons, items, subtests and languages to TOEFL test variance. Language Testing, 16, 217–238.
Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (pp. 221–256). New York: American Council on Education & Praeger series on higher education.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155–163.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). Lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1–30.
Clauser, E. B., & Mazor, M. K. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.
Davidson, B. (2004). Comparability of test scores for non-aboriginal and aboriginal students (Doctoral dissertation, University of British Columbia, 2004). UMI Proquest Digital Dissertation.
DeMars, C. E. (2010). Item response theory. New York: Oxford University Press.
Elder, C. (1996). The effect of language background on foreign language test performance: The case of Chinese, Italian, and Modern Greek. Language Learning, 46, 233–282.
Elder, C., McNamara, T. F., & Congdon, P. (2003). Understanding Rasch measurement: Rasch techniques for detecting bias in performance assessments: An example comparing the performance of native and non-
native speakers on a test of academic English. Journal of Applied Measurement, 4, 181–197.
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers.
Engelhard, G. (2009). Using item response theory and model-data fit to conceptualize differential item and person functioning for students with disabilities. Educational and Psychological Measurement, 69, 585–602.
Freedle, R., & Kostin, I. (1997). Predicting black and white differential item functioning in verbal analogy performance. Intelligence, 24, 417–444.
Gallagher, M. (2004). A study of differential item functioning: Its use as a tool for urban educators to analyze reading performance (Unpublished doctoral dissertation, Kent State University). UMI Proquest Digital Dissertation.
Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4, 190–222.
González, A., Padilla, J. L., Hidalgo, M. D., Gómez-Benito, J., & Benítez, I. (2011). EASY-DIF: Software for analyzing differential item functioning using the Mantel-Haenszel and standardization procedures. Applied Psychological Measurement, 35, 483–484.
Guilera, G., Gómez-Benito, J., & Hidalgo, M. D. (2009). Scientific production on the Mantel-Haenszel procedure as a way of detecting DIF. Psicothema, 21(3), 492–498.
Hale, G. A. (1988). Student major field and text content: Interactive effects on reading comprehension in the Test of English as a Foreign Language. Language Testing, 5, 49–61.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Kamata, A., & Vaughn, B. K. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2, 49–69.
Karami, H. (2010). A Differential Item Functioning analysis of a language proficiency test: An investigation of background knowledge bias. Unpublished master's thesis, University of Tehran, Iran.
Karami, H. (2011). Detecting gender bias in a language proficiency test. International Journal of Language Studies, 5, 167–178.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89–114.
Kim, S.-H., & Cohen, A. S. (1992). IRTDIF: A computer program for IRT differential item functioning analysis. Applied Psychological Measurement, 16, 158.
Laroche, M., Kim, C., & Tomiuk, M. A. (1998). Translation fidelity: An IRT analysis of Likert-type scale items from a culture change measure for Italian-Canadians. Advances in Consumer Research, 25, 240–245.
Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988). Differential item functioning for males and females on SAT verbal reading subscore items (Report No. 88-4). New York: College Entrance Examination Board.
Lee, Y. W., Breland, H., & Muraki, E. (2004). Comparability of TOEFL CBT writing prompts for different native language groups (TOEFL Research Report No. RR-77). Princeton, NJ: Educational Testing Service. Retrieved September 29, 2011, from http://www.ets.org/Media/Research/pdf/RR-04-24.pdf.
Linacre, J. M. (2009). FACETS Rasch-model computer program (Version 3.66.0) [Computer software]. Chicago, IL: Winsteps.com.
Linacre, J. M. (2010). Winsteps (Version 3.70.0) [Computer software]. Beaverton, Oregon: Winsteps.com.
Linn, R. L., & Drasgow, F. (1987). Implications of the Golden Rule settlement for test construction. Educational Measurement: Issues and Practice, 6, 13–17.
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA & Oxford: Blackwell.
Messick, S. (1980). Test validation and the ethics of assessment. American Psychologist, 35, 1012–1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). New York: American Council on Education & Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Muraki, E., & Bock, D. (2002). PARSCALE 4.1 [Computer program]. Chicago: Scientific Software International, Inc.
Nelson, L. R. (2000). Item analysis for tests and surveys using Lertap 5. Perth, Western Australia: Curtin University of Technology (www.lertap.curtin.edu.au).
Ockey, G. J. (2007). Investigating the validity of math word problems for English language learners with DIF. Language Assessment Quarterly, 4(2), 149–164.
Pae, T. (2004). DIF for learners with different academic backgrounds. Language Testing, 21, 53–73.
Pallant, J. F., & Tennant, A. (2007). An introduction to the Rasch measurement model: An example using the Hospital Anxiety and Depression Scale (HADS). British Journal of Clinical Psychology, 46, 1–18.
Pampel, F. (2000). Logistic regression: A primer. Thousand Oaks, CA: Sage.
Penfield, R. D. (2005). DIFAS: Differential Item Functioning Analysis System. Applied Psychological Measurement, 29, 150–151.
Pollitt, A. (1997). Rasch measurement in latent trait models. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education. Volume 7: Language testing and assessment (pp. 243–254). Dordrecht: Kluwer Academic.
Ramsay, J. O. (2001). TestGraf: A program for the graphical analysis of multiple-choice test and questionnaire data [Computer software and manual]. Montreal, Canada: McGill University.
Robin, F. (2001). STDIF: Standardization-DIF analysis program [Computer program]. Amherst, MA: University of Massachusetts, School of Education.
Rogers, H. J., Swaminathan, H., & Hambleton, R. K. (1993). DICHODIF: A FORTRAN program for DIF analysis of dichotomously scored item response data [Computer software]. Amherst: University of Massachusetts.
Roznowski, M., & Reith, J. (1999). Examining the measurement quality of tests containing differentially functioning items: Do biased items result in poor measurement? Educational and Psychological Measurement, 59, 248–269.
Ryan, K., & Bachman, L. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12–29.
Sasaki, M. (1991). A comparison of two methods for detecting differential item functioning in an ESL placement test. Language Testing, 8(2), 95–111.
Scheuneman, J. D., & Bleistein, C. A. (1989). A consumer's guide to statistics for identifying differential item functioning. Applied Measurement in Education, 2, 255–275.
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests. London: Longman/Pearson Education.
Smith, R. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5(4), 430–449.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323–340.
Thissen, D. (1991). MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory (Version 6.0). Chicago: Scientific Software.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum Associates.
Uiterwijk, H., & Vallen, T. (2005). Linguistic sources of item bias for second generation immigrants in Dutch tests. Language Testing, 22, 211–234.
Waller, N. G. (1998). EZDIF: Detection of uniform and nonuniform differential item functioning with the Mantel-Haenszel and logistic regression procedures. Applied Psychological Measurement, 22, 391.
Wiberg, M. (2007). Measuring and detecting differential item functioning in criterion-referenced licensing tests: A theoretic comparison of methods. Educational Measurement, Technical Report No. 2.
Wilson, M. (2005). Constructing measures: An item response modeling approach. London: Lawrence Erlbaum Associates.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2: Generalized item response modeling software [Computer program]. Camberwell: Australian Council for Educational Research.

Zenisky, A. L., Hambleton, R. K., & Robin, F. (2003). Detection of differential item functioning in large-scale state assessments: A study evaluating a two-stage approach. Educational and Psychological Measurement, 63(1), 49–62.
Zenisky, A. L., Hambleton, R. K., & Robin, F. (2004). DIF detection and interpretation in large-scale science assessments: Informing item-writing practices. Educational Assessment, 9(1&2), 61–78.
Zenisky, A. L., Robin, F., & Hambleton, R. K. (2009). Differential item functioning analyses with STDIF: User's guide. Amherst, MA: University of Massachusetts, Center for Educational Assessment.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–348). Hillsdale, NJ: Lawrence Erlbaum Associates.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.
About the Author
Hossein Karami (hkarami@ut.ac.ir) is currently a Ph.D. candidate in TEFL and an
instructor at the Faculty of Foreign Languages and Literature, University of Tehran,
Iran. His research interests include various aspects of language testing in general,
and Differential Item Functioning, validity, and fairness in particular.
