The Subjective and Objective Interface of Bias Detection on Language Tests
Test validity is predicated on there being a lack of bias in tasks, items, or test content.
It is well-known that factors such as test candidates’ mother tongue, life experiences,
and socialization practices of the wider community may serve to inject subtle interac-
tions between individuals’ background and the test content. When the gender of the
test candidate interacts further with these factors, the potential for item bias to influ-
ence test performances grows. A dilemma faced by test designers concerns how they
can proactively screen test content for possible sources of bias. Conventional prac-
tices in many contexts rely on the subjective opinion of review panels in detecting
sensitive topical content and potentially biased material and items. In the last 2 de-
cades this practice has been rivaled by the increased availability of item bias diagnos-
tic software. Few studies have compared the relative accuracy and cost utility of the
two approaches in the domain of language assessment. This study makes just that
comparison. A 4-passage, 20-item reading comprehension test was given to a strati-
fied sample of 825 high school students and college undergraduates at 5 Japanese in-
stitutions. The sampling included a focus group of 468 female students compared to a
reference group of 357 male English as a foreign language (EFL) learners. The test
passages and items were also given to a panel of 97 in-service and preservice EFL
teachers for subjective ratings of potential gender bias. The results of the actual item
responses were then empirically checked for evidence of differential item function-
ing using Simultaneous Item Bias analysis, the Mantel-Haenszel Delta method, and
logistic regression. Concordance analyses of the subjective and objective methods
suggest that subjective screening of bias overestimates the extent of actual item bias.
Implications for cost-effective approaches to item bias detection are discussed.
The issue of test bias has always been central in the consideration of test validity. Bias
has been of concern because inferences about the results of test outcomes often lead
Correspondence should be addressed to Steven J. Ross, School of Policy Studies, Kwansei Gakuin
University, Gakuen 2–1, Sanda, Hyogo, Japan 6691337. E-mail: sross@ksc.kwansei.ac.jp
230 ROSS AND OKABE
For Romance language speakers, the deductive task is to parse “conflagration” for
its affixation and locate the core free morpheme. Once located, the Romance lan-
BIAS DETECTION ON LANGUAGE TESTS 231
guage speaker can compare the root to similar known free morphemes in the
reader’s native language, for instance, incendio or conflagración.
The Chinese speaker, in contrast, starts at the same deductive step, but must
compare the free root morpheme to all other previously learned morphemes (i.e.,
most probably, “flag”). The resulting difference leads Spanish speakers to follow a
semantically based second step, while Chinese speakers are likely to split between
a semantic and phonetic comparison strategy. The item response accuracy in such
cases favors the Romance language speakers, even when matched with Chinese
counterparts for overall proficiency.
The transferability factor applies to orthographic phenomena as well. Brown
and Iwashita (1996) detected bias favoring Chinese learners of Japanese over na-
tive English speakers, whose native language orthography is typologically most
distant from Japanese. Given the fact that modern written Japanese relies on Chi-
nese character compounds for the formation of nominal phrases, as well as the root
forms of many verbs, Chinese students of Japanese can transfer their knowledge of
semantic roots for many Japanese words and compounds, even without knowledge
of their corresponding phonemic representations, or exact semantic reference.
Here a similar strategic difference emerges for speakers of Chinese versus
speakers of an Indo-European language. While the exact compound might not exist
in modern written Chinese, the component Chinese characters provide a deductive
strategy to Chinese learners of Japanese that is not available to English speakers.
For instance, the following sentence contains the compound 新幹線 (bullet train),
which does not have a direct counterpart in Chinese.
The component characters 新 “new,” 幹 “trunk,” and 線 “line” provide the basis
for a lexical inference that the compound refers to a kind of rail transportation
system. For an English-speaking learner of Japanese, the cognitive load falls on
deducing the meaning of the whole compound from its components. Here, a mixed
grapheme to phoneme strategy is most likely if 新 “new” and 線 “line” are
recognized as “shin” and “sen.” The lexical inference here might entail filling in the
missing component 幹 “trunk” with a syllable that matches the surrounding
“shin___sen” for successful compound word recognition.
Examining transferability on a macrolevel, Ross (2000), while controlling for
biographical and experiential factors such as age, educational background, and
hours of ESL learning, found weaker evidence of a language distance factor. The
distance factor comprised canonical syntactic structure, orthography, and
typological grouping, which served to influence the relative rates of learning
English by 72 different groups of migrants to Australia.
The overall picture of transfer bias suggests that on the microlevel, particularly
in studies that triangulate two different native languages against a target language,
evidence of transfer bias tends to be identifiable. When many languages are com-
pared and individual differences in experiential and cognitive variables are fac-
tored in, transfer bias at the macro or language typological level appears to be less
readily identifiable.
A second type of bias in language assessment arises from differential exposure to
a target language that candidates might experience. Ryan and Bachman (1992), for
instance, considered Test of English as a Foreign Language (TOEFL) type items to be
more culturally oriented toward the North American context than a British comparison,
the First Certificate in English. Language learners with exposure to instruction
in American English and TOEFL preparation courses were thought to have a
greater chance on such items than learners whose exposure did not prepare them for
the cultural framework TOEFL samples in its reading and listening items. Their
findings suggest that high stakes language tests for admissions such as TOEFL may
indirectly include knowledge of cultural reference in addition to the core linguistic
constructs considered to be the object of measurement. Presumably this phenome-
non would be observable on language tests such as the International English Lan-
guage Testing System (IELTS), which is designed to qualify candidates for admis-
sions to universities in the United Kingdom, New Zealand, or Australia.
Cultural background comparisons in second language performance assessments
have demonstrated how speech community norms may transfer into assessment pro-
cesses like oral proficiency interviews. While not overtly recognized as a source of
assessment bias, interlanguage pragmatic transfer has been seen to influence the per-
formances of Asian speakers when compared to European speakers (Young, 1995;
Young & Halleck, 1998; Young & Milanovic, 1992). The implication is that if as-
sessments are norm-referenced, speakers from discourse communities favoring ver-
bosity may be advantaged in assessment such as interactive interviews. This obser-
vation apparently extends to semi-direct speech tasks such as the SPEAK test. Kim
(2001), for instance, found differential rating functions for pronunciation and grammar
ratings for Asians when compared to equal-ability European test candidates. The
implication here is that raters apply the rating scale differently.
In considering possible sources of bias in university admissions, Zwick and
Sklar (2003) opined that the foreign language component on the SAT II created a
“bilingual advantage” for particular candidates for admission to the University of
California. If candidates had been raised in bilingual households, for instance, they
would be expected to score higher on the foreign language listening comprehen-
sion component, which is an optional third subscore on the SAT II. This test is re-
quired for undergraduate admissions to all campuses at the University of Califor-
nia. The issue of bias in this case stems from the assumption that the foreign
language component was presumably conceptualized as an achievement indicator,
when in fact the highest scoring candidates are from bilingual households. The
perceived advantage is that such candidates develop their proficiency not through
coursework and scholarship, but through naturalistic exposure.
Elder (1997) reported on a similar fairness issue arising from the use of second
language tests for access to higher education in Australia. Elder noted that the
score weighting policy on the Victoria Certificate of Education, functioning as it
does as a qualification for university admission in that state, explicitly profiles the
language learning history of the test candidate. This form of candidate profiling
aimed to reweight the influence of the foreign language scores on the admissions
qualification so as to minimize the preferential bias bilingual candidates enjoyed
over conventional foreign language learners. Elder found interactions between
English and the profile categorizations were not symmetric across different for-
eign language test candidatures and concluded that efforts to adjust for differential
exposure profiles are fraught with difficulty.
A third category of bias in language assessment deals with differences in social-
ization patterns. Socialization patterns might involve academic tracking early in a
school student’s educational career, usually into either science or humanities aca-
demic tracks in high school (Pae, 2004). In some cultural contexts, academic track-
ing might correspond to gender socialization practices as well.
In contrast to cultural assumptions made about the verbal advantage females
have over males, Hyde and Linn (1988) concluded in a meta-analysis of 165 studies
of gender differences on all facets of verbal tests that there was an effect size of
d = .11 for gender differences. To them, this constituted little firm evidence to support
the assumed female verbal advantage. Willingham and Cole (1997), and
Zwick (2002) concur with this interpretation, noting that gender differences have
steadily diminished over the last four decades and now account for no more than
1% of the total variation on ability tests in general. Willingham and Cole (1997, p.
348) however, noted that females tend to frequent the top 10% in standardized tests
of reading and writing.
Surveys of gender differences on the Advanced Placement Test, used for university
admissions to the more selective American universities, suggest reasons
why verbal differences in literacy still tend to persist. Dwyer and Johnson (1997, p.
136) describe considerable effect size differences between college-bound males
and females in preference for language studies. This finding would suggest that in
the North American context socialization patterns could serve to channel high
school students into academic tracks that tend to correlate with gender.
To date, language socialization issues have not been central in foreign or second
language test bias analyses in multicultural contexts because of the more immedi-
ate and salient influences of exposure and transfer on high stakes tests. In contexts
that are not characterized by multiculturalism, a more subtle threat of bias may be
related to how socialization practices steer males and females into different aca-
demic domains, and in doing so cumulatively serve to make gender in particular
knowledge domains differentially salient. When language tests inadvertently sam-
ple particular domains more than others, the issue of schematic knowledge inter-
acting with the gender of the test candidate takes on a new level of importance.
Four sample subtests written for a high stakes university admissions test were used in
the study. The subtests were all from the fourth section of a six section English as a
foreign language (EFL) test given annually to approximately 630,000 Japanese high
school seniors. The results of the exam are norm-referenced and serve to qualify candidates
for secondary examinations to specific academic departments at national and
public universities (Ingulsrud, 1994). Increasingly, private Japanese universities use
the results of the Center examination for admissions decisions, making the test the
most influential gate-keeping device in the Japanese educational system.
The format of the EFL test is a “discrete point” type of test of language structure
and vocabulary, sampling the high school syllabus mandated by the Japanese Min-
istry of Education. It is construed as an achievement test because only vocabulary
and grammatical structures occurring in about 40 high school textbooks sanc-
tioned by the Ministry of Education are sampled on the test. The six sections of the
examination cover knowledge of segmental pronunciation, tonic word stress, dis-
crete point grammar, word order, paragraph coherence and cohesion, interpreta-
tion of short texts describing graphics and data in tabular format, interactive
dialogic discourse in the form of a transcribed conversation, and comprehension of
METHOD
ence group of 357 male EFL learners. The aim of the sampling was to approximate
the range of scores normally observed in the population of Japanese high school
seniors. The 20-item test was given in multiple-choice format with enough time (1
hr) for completion, and was followed with a survey about the age, gender, and
language learning experiences of the sample test candidates.
Materials
The test section specifications call for a three to four paragraph text describing
graphs, figures, or tables written as specimens of social science types of academic
writing. In the case of the experimental test, four of these passages were used. Each
of the passages had five items that tested readers’ comprehension of the passage
content. The themes sampled on the test can be seen in Table 1.
The experimental test comprised four short reading passages, which closely
approximate the format and content of Section Four of the Center Examination.
The sampling of students in this study yielded a mean and variance similar to the
operational test. Table 2 lists descriptive statistics for the test.
TABLE 1
Experimental Passage Order and
Thematic Content
TABLE 2
Mean, Standard Deviation, Internal Consistency, and Sample Size
vey was thus devised to sample early, mid, and late career EFL teachers who were
assumed to represent the larger population of language teaching professionals
from whom future test moderation panel members are drafted. In-service teachers
(n = 37) were surveyed individually.
In addition to the sampling of in-service teachers, a larger group of preservice
EFL teachers in training were also surveyed so as to compare the ratings provided
by seasoned professional teachers with neophyte teachers (n = 60). All respon-
dents were asked to examine the four test passages and each of the 20 items on the
test before rating the likelihood that each item would favor male or female test can-
didates. The preservice teachers in training completed the survey during Teaching
English as a Foreign Language (TEFL) Methodology course meetings.
The rating scale used and instructions are shown in the Appendix.
A variety of options now exist for detecting DIF. Comparative research suggests
that DIF methods tend to differ in the extent of Type I error and power. Whitmore
and Schumacker (1999), for instance, found logistic regression more accurate than
an analysis of variance approach. A direct comparison of logistic regression and the
Mantel-Haenszel procedure (Rogers & Swaminathan, 1993) indicated moderate
differences in power. Swanson, Clauser, Case, Nungester, and Featherman (2002)
more recently approached DIF with hierarchical logistic regression and found it to
be more accurate than standard logistic regression or Mantel-Haenszel estimates.
In this approach, different possible sources of item bias can be dummy-coded and
nested in the multilevel design. Recent uses of logistic regression for DIF extend to
polytomous rating categories (Lee, Breland, & Muraki, 2005) but still enable an
examination of nonuniform DIF through interaction terms between matching
scores and group membership.
Although multilevel modeling approaches offer extended opportunities for test-
ing nested sources of potential DIF, the single level methods, such as logistic re-
gression and Mantel-Haenszel approaches, have tended to prevail in DIF studies.
Penfield (2001) compared three variants of Mantel-Haenszel according to differ-
ences in the criterion significance level, and concluded that the generalized ap-
proach provided the lowest error and most power. Zwick and Thayer (2002) found
that modifications of the Mantel-Haenszel procedure involving an empirical Bayes
approach showed promise of greater potential for bias detection. A direct compari-
son of the Mantel-Haenszel procedure with Simultaneous Item Bias (SIB;
Narayanan & Swaminathan, 1994) concluded that the Mantel-Haenszel procedure
yielded smaller Type I error rates relative to SIB.
In this study, three empirical methods of detecting DIF were used. The choice
of bias detection methods was based on their overall frequency of use in empirical
DIF studies. The three methods were thought to represent conventional approaches
to DIF research, and thus best operationalize “objective” approaches to be
compared with subjective methods.
Mantel-Haenszel Delta was computed from six sets of equipercentile-matched
ability subgroups cross tabulated by gender. Differences in the observed Deltas for
the matched males and females were evaluated against a chi-square distribution.
This method matches males and females along the latent ability continuum and de-
tects improbable discontinuities between the expected percentage of success and
the observed data.
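The Mantel-Haenszel computation described above can be sketched as follows. This is a minimal illustration rather than the software used in the study. It assumes each matched ability subgroup is summarized as a 2 × 2 table of group by item outcome, applies the conventional -2.35 ln(alpha) rescaling to obtain Delta, and uses the continuity-corrected chi-square with one degree of freedom. The function name `mantel_haenszel` and the counts in the usage lines are hypothetical.

```python
import math

def mantel_haenszel(strata):
    """Return (alpha, delta, chi2) for one item.

    strata: one tuple per matched ability level, holding the counts
    (ref_right, ref_wrong, focal_right, focal_wrong).
    """
    num = den = 0.0                      # parts of the common odds ratio
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:                        # level too sparse to contribute
            continue
        num += a * d / n
        den += b * c / n
        n_ref, n_focal = a + b, c + d
        n_right, n_wrong = a + c, b + d
        observed += a                    # reference-group successes
        expected += n_ref * n_right / n  # expected successes under no DIF
        variance += n_ref * n_focal * n_right * n_wrong / (n * n * (n - 1))
    alpha = num / den                    # common odds ratio across levels
    delta = -2.35 * math.log(alpha)      # negative: harder for focal group
    chi2 = (abs(observed - expected) - 0.5) ** 2 / variance
    return alpha, delta, chi2

# Hypothetical counts for six matched levels (not data from the study).
levels = [(30, 10, 20, 20)] * 6
alpha, delta, chi2 = mantel_haenszel(levels)
print(alpha, delta, chi2)
```

With males as the reference group, a negative Delta in this sketch would point to an item that is relatively harder for the matched female candidates.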
The second method of detecting bias was a logistic regression performed on the di-
chotomously scored outcomes for each of the 20 items. The baseline model tested the
effects of gender controlling for each student’s total score (Camilli & Shepard, 1994).
In this binary regression, the probability of success should be solely influenced by the
individual’s overall ability. In the event of no bias, only the test score will account for
systematic covariance between the item responses on a particular item. If bias does af-
fect a particular item, the variable encoding gender will covary with the item response
independently of the covariance between the score and the outcome. Further, if bias is
associated with particular levels of ability on the latent score continuum, a nonuniform
DIF can be diagnosed with a Gender × Total Score interaction term.
Item response = constant + gender + score + (gender × score)
In the event a nonuniform DIF is confirmed not to exist, the interaction term can
be deleted to yield a main effect for gender, controlling for test score. Gender ef-
fects are then tested for nonrandomness against a t-distribution.
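A minimal sketch of the logistic regression model above, assuming gender is dummy coded (0 = reference, 1 = focal) and the total score serves as the matching variable. The fitting routine is a plain Newton-Raphson loop written for illustration only. The helper names `solve`, `fit_logistic`, and `dif_predictors` are hypothetical, and standard errors for the significance test of the gender effect are not computed here.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_logistic(rows, y, iterations=25):
    """Fit a logistic regression; returns [constant, coefficients...]."""
    xs = [[1.0] + list(r) for r in rows]      # prepend the constant
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(xs, y):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))  # predicted P(correct)
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve(hess, grad)              # Newton-Raphson update
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def dif_predictors(gender, score):
    """Predictors for the full model: gender, score, gender x score."""
    return [gender, score, gender * score]
```

Fitting `dif_predictors` rows tests for nonuniform DIF through the interaction coefficient. If that term is dropped, passing `[gender, score]` rows yields the main effect for gender controlling for test score.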
The third empirical method was a simultaneous item bias (SIB) analysis utilizing item
response theory (Shealy & Stout, 1993). The SIB approach was performed on each of the 20
items in turn. The sums of all the other items were used in rotation as ability estimates
in matching male and female examinees via a regression approach. This approach
employs the matching strategy of the Mantel-Haenszel method, and uses the total
score based on k-1 items as a concurrent covariate for each of the item bias tests.
Differences in estimates of DIF were evaluated against a z distribution.
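The matching logic of this step can be illustrated with a simplified sketch. It matches examinees on the rest score (the total over the other k-1 items) and averages the reference-minus-focal difference in proportion correct across matched levels, weighted by the number of examinees at each level. It deliberately omits the regression correction and the standard error that the full SIB procedure applies before the estimate is referred to a z distribution, so it is not a substitute for dedicated SIB software. The function name and data layout are assumptions made for illustration.

```python
from collections import defaultdict

def rest_score_difference(responses, groups, item):
    """Weighted mean difference in proportion correct on one item.

    responses: list of 0/1 answer lists, one per examinee.
    groups: "R" (reference) or "F" (focal) for each examinee.
    item: index of the studied item.
    """
    cells = defaultdict(lambda: {"R": [], "F": []})
    for answers, group in zip(responses, groups):
        rest = sum(answers) - answers[item]   # score on the other items
        cells[rest][group].append(answers[item])
    weighted = 0.0
    total = 0
    for cell in cells.values():
        if cell["R"] and cell["F"]:           # need both groups to compare
            n = len(cell["R"]) + len(cell["F"])
            p_ref = sum(cell["R"]) / len(cell["R"])
            p_focal = sum(cell["F"]) / len(cell["F"])
            weighted += n * (p_ref - p_focal)
            total += n
    return weighted / total if total else 0.0
```

A positive value in this sketch would indicate that, among examinees matched on the remaining items, the reference group answers the studied item correctly more often.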
The composite results of the three different approaches to estimating DIF for each
of the 20 items are given in Table 3. Each of the three objective measures employs a
different test statistic to assess the likelihood of the observed bias statistic. Analogous
to meta-analytic methods, the different effects can be assessed as standardized
metrics. To this end, each DIF estimate, controlled for overall candidate ability, is
presented as a conventional probability (p < .05) of rejecting the null hypothesis.
Table 3 indicates that the Mantel-Haenszel and SIB approaches are equally parsimonious
in detecting gender bias on the 20-item test. Both of these methods employ
ability matches of men and women along the latent ability continuum. In contrast,
the logistic regression approach, which uses the total score as a covariate,
appears slightly more likely to detect bias. All three methods concur in detecting
gender bias on the Soccer item 13 shown in Table 4.
TABLE 3
Objective Bias Probabilities Per Item
Note. Items in bold represent significant probabilities, where p < .05. MH = Mantel–Haenszel;
SIB = simultaneous item bias.
The panel of 97 preservice and in-service teachers was categorized into male and
female subgroups of novices and experienced teachers based on survey responses.
TABLE 4
Biased Item No. 13 From the Soccer Passage
Item Description
The aim of this subdivision of the subjective raters was to explore possible sources
of differential sensitivity to bias in the test questions. In contrast to the objective
methods of diagnosing item bias, the subjective ratings do not employ any infor-
mation about individual ability inferred from the total score. Subjective estimates
rely completely on the apparent schematic content and presumed world knowledge
needed to answer each item. Further, because ratings were on a Likert type scale,
differences between the observed mean rating and the null hypothesis needed to be
tested to provide bias probabilities¹ comparable to those in Table 3. To this end, the
mean rating of gender bias on each of the 20 items was tested against the hypothe-
sis that male versus female advantage on each item equaled zero. Table 5 contains
the subjectively estimated probabilities that each item is biased.
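The test of each item's mean rating against zero can be sketched as a one-sample t test. The sketch assumes ratings are coded so that zero means no advantage for either gender and that the sign gives the direction of the presumed advantage. That coding, the function name, and the example ratings are assumptions, since the rating scale itself appears only in the Appendix.

```python
import math
import statistics

def one_sample_t(ratings, mu=0.0):
    """Return (t, df) for the mean of ratings tested against mu."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)            # sample standard deviation
    t = (mean - mu) / (sd / math.sqrt(n))     # standard error in denominator
    return t, n - 1

# Hypothetical ratings for one item from eight raters (not study data).
t, df = one_sample_t([1, -1, 2, 0, 1, 1, 0, 2])
print(t, df)
```

The resulting t would then be referred to a t distribution with n - 1 degrees of freedom to obtain the exact probability for the item.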
In contrast with the objective measures of bias, the subjective analysis diagno-
ses considerably more bias. Complete subjective agreement in diagnosing bias oc-
curs in 6 of the 20 items. As the third column in Table 5 suggests, it appears that ex-
perienced male (EM) teachers are the most inclined to assume there is gender bias.
This subgroup in fact sees bias in the majority of items. Experienced female teachers,
in contrast, are the most conservative in assuming that schematic content indicates
possible test item bias. The novice male teachers-in-training correspond to
their more experienced male counterparts in assuming there is a schematic bias in
two of the four test passages.
Of particular interest is the tendency of the subjective raters to apply the bias diagno-
sis not to individual items, but to entire test passages. It appears likely that these male re-
spondents equate content sensitivity with test bias. Sources of this confusion will be ex-
amined in the narrative accounts provided by some of the male teachers. The tendency to
see content schema as bias suggests that subjective raters see topical domains as the key
source of possible bias. Both male and female in-service teachers would be expected to
share equivalently accurate knowledge about the cumulative consequences of socializa-
tion on Japanese teenagers’ world-knowledge. As Table 5 would suggest, however, the
experienced male teachers appear to overgeneralize the extent of possible schematic
knowledge differences between male and female students. The domains that appear to
be high bias risk to male teachers involve spatial processing (Visuals 6–9), all of the
items concerned with a sports tournament (Soccer 11–15), and all of the items about the
passage describing changes in transportation (Transport 16–20).
¹Subjective ratings were tested against a null hypothesis by assuming the population mean bias
(mu) equals zero and testing the observed subjective mean against a single sample t distribution. Exact
probabilities of the observed t tests were then used in Table 3.
TABLE 5
Subjective Bias Probabilities Per Item
Novice Expert
hoc reflective account of why bias would be expected in test items. The three ac-
counts provide subjective evidence as to the sources of the putative sensitivity or
bias in test items. Three facets of belief about gender differences were included in
the interview. The first was a global impression of the test materials. The second
was concerned with the four passages used in the actual reading test. The third
question in the interview phase addressed how each male teacher assumes his col-
leagues are aware of gender differences among students.
formance between boys and girls; boys may do better in mathematical and
logical types of tasks. Anyway, girls do much better than boys as far as Eng-
lish exams are concerned. So I don’t think there is any particular gender dif-
ference among different English tests.”
Here the male teacher contends that there is systematic difference in the female
students’ understanding of science and logical content. Yet assuming that the lan-
guage test content does not narrowly sample such domains, he claims there is no
bias on the test in question. In fact, he suggests that the advantage is on the side of
the female students, because the domain is language study. This view perhaps re-
flects a prevailing social construct in Japan. Science and math are perceived as
male domains, so test content focused on language gives females an advantage.
This younger male teacher, while adhering to the pattern of projected bias about
the passages, seems aware that Japanese teachers in general entertain the belief that
male–female differences in schematic background knowledge actually exist. Interestingly,
no account of the source of such putative differences, whether natural or
constructed through socialization processes, is given.
The account provided by this mid-career male teacher does not specifically
nominate any particular biased passages; rather, he contends that gender differ-
ences are pervasive. This interpretation matches the overall pattern of these Japa-
nese male teachers seeing ubiquitous gender differences. Interestingly, this ac-
count contends that the modality of the test serves to remove gender
differences—that the multiple choice format neutralizes the potential for bias. This
account conflicts with meta-analyses of test format (Willingham & Cole, 1997),
which have found that multiple-choice tends to favor male test candidates. The as-
sertion here is that multiple choice format removes the “natural” advantage in lan-
guage ability that female students are assumed to possess.
This male teacher’s account seems to contradict the statistical data. While
schema sampled in the passages may differ according to life experiences of male
and female students, the extent of that difference is not enough to trigger system-
atic bias in answering the test questions. It is odd therefore that male teach-
ers—both in-service and preservice teachers—have tended to assume there is a
systematic handicap for female students. The implied source of the bias is numeri-
cal reasoning, and since the four passages do not require much more than simple
arithmetic, no clear source of bias is identified. The one biased item (no. 13) nested
in the Soccer Playoff Schedule passage, is not referred to at all in the oral account.
The possible source of bias is said to be in male versus female domains of “inter-
est.” Students not interested in sports in general would be disadvantaged on a
sports-related topic.
This teacher does not expect that male and female teachers differ in their diagnos-
tic accuracy when it comes to gender bias on tests. Rather, the source of differences
may emerge as a consequence of teachers’ age and experience. Presumably, gender
differences observed over a career in the classroom serve to reinforce expectations
about what female students are likely to be familiar with. However, if the content is
part of the taught curriculum, the potential for bias is considered minimized, especially
when the test format requires selection from alternative answers.
Teacher C asserts that gender differences favor female students, and observed
differences are not necessarily the consequence of biased items. The account itself
seems to assume that language, content, and discourse are inseparable elements,
and somehow female students prevail. This account does not correspond well with
the objective data, where male teachers tend to assume that content sampling serves
to handicap female test candidates, who otherwise would enjoy an advantage in the
foreign language domain.
Once again the assumption is that the multiple choice method of testing serves
to reduce the potential natural advantage that female students would enjoy. Perfor-
mance assessments, in contrast, are thought to favor female students. What is strik-
ing about this account is the assumption that language advantages for female for-
eign language learners are natural, and not the consequences of differential
streaming, reinforcement, or social engineering.
When these three teachers are asked for their overall impression and their own
opinion about gender differences, their first answer is always similar: they don’t
think there is significant gender difference between girls and boys in the four pas-
sages. Girls are just better at overall performance on English exams. However, as
the interview went on and the subjects were asked about passages, these male
teachers revealed their ideas about the sources of gender differences.
Because these high school teachers find gender differences according to the
method of testing quite obvious, they do not seem to pay much attention to the dif-
ferences shown according to the topical domain of each passage. The overall pat-
tern suggests that hermeneutic assessment of bias issues is susceptible to hyper-
sensitivity. These male teachers don’t construe possible schematic differences as
products of differential socialization practices, but apparently tend to over-gener-
alize them as natural categories of gender differences.
CONCORDANCE ANALYSES
After the subjective and objective analyses of bias on the 20 test items were com-
piled, a direct comparison was undertaken. The objective and subjective probability
of bias estimates were converted into effect sizes. This conversion to a standard met-
ric allows for a direct comparison between the objective and subjective estimates of
bias. In the subjective and objective estimates of item bias, different indicators of statistical
significance were used. To make all indicators directly comparable, an effect
size conversion (Shadish, Robinson, & Lu, 1999) yielded a single indicator to facilitate
direct comparisons. Zero effect size indicates an estimate of no bias. Negative
valence on the effect size estimates indicates bias thought to favor males.
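The text cites an effect size conversion without giving formulas. The sketch below shows commonly used textbook conversions from t, z, and one-degree-of-freedom chi-square statistics to the r and d metrics. These formulas and function names are assumptions offered for illustration and are not necessarily the computations performed by the cited program.

```python
import math

def d_from_one_sample_t(t, n):
    """Standardized mean difference from a one-sample t with n raters."""
    return t / math.sqrt(n)

def r_from_z(z, n):
    """Correlation-type effect size from a z statistic on n examinees."""
    return z / math.sqrt(n)

def r_from_chi2(chi2, n, sign=1):
    """Effect size from a chi-square with 1 df; sign sets the direction."""
    return sign * math.sqrt(chi2 / n)

def d_from_r(r):
    """Convert r to d so that all estimates share one metric."""
    return 2 * r / math.sqrt(1 - r * r)

# Hypothetical statistics (not values from the study). A negative sign
# follows the convention above that negative values favor males.
print(d_from_one_sample_t(-4.0, 16))
print(d_from_r(r_from_z(-2.0, 400)))
```

Placing every estimate on the d metric is what allows the small and large effect benchmarks (Cohen, 1988) to be applied to objective and subjective estimates alike.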
As Table 6 suggests, the subjective estimates of bias produce larger effect indi-
cators of bias than do the objective methods. For Soccer 13, the item detected by
the objective methods to produce a bias favoring males, the effect size of the bias is
in the small effect range (Cohen, 1988). The subjective diagnostics of bias, in con-
trast, are mainly in the large effect range (± .80). While the one authentically bi-
ased item gets concordant agreement between the subjective and objective ap-
TABLE 6
Effect Sizes for Objective and Subjective Bias Estimates
Novice Expert
Note. Items in bold represent biased test items. MH = Mantel–Haenszel; SIB = simultaneous item
bias.
FIGURE 1 Interaction of experience and rater gender on bias effect sizes. Experience = 0 re-
fers to preservice EFL teachers; Experience = 1 refers to in-service teachers. Sex = 1 refers to
male teachers; Sex = 2 refers to female teachers.
parts. Table 7 lists the main effects and the test of the interaction between
experience and the sex of the teachers.²
² We have used “sex” to refer to teacher gender so as to distinguish it from the gender of the students
TABLE 7
Analysis of Variance of Experience and Teacher Sex
Source df F p
TABLE 8
Kendall’s Coefficient of Concordance W
detection. The subjective ratings of gender bias are strongly in agreement about the
existence of bias in the test items, though even the three samples of narrative ac-
counts do not provide much consistent insight into why there is such hypersensitiv-
ity to the issue of gender differences.
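For readers unfamiliar with the statistic named in Table 8, the following is a minimal sketch of Kendall's coefficient of concordance W with the usual correction for tied ratings. It is not the authors' computation: the function names and the input layout (one list of item ratings per rater) are assumptions made for illustration.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from 1 to n, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for m raters (rows) rating the same n items (columns)."""
    m = len(ratings)
    n = len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    totals = [sum(row[i] for row in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over each rater's groups of tied scores.
    ties = sum(sum(c ** 3 - c for c in Counter(row).values()) for row in ratings)
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater ties all items")
    return 12 * s / denom


if __name__ == "__main__":
    # Hypothetical ratings on a signed scale, three raters by four items.
    print(kendalls_w([[-2, 0, 1, 3], [-3, -1, 0, 2], [-1, 0, 2, 3]]))
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings), so a high W shows only that raters agree with one another, not that their shared judgment matches the empirical evidence.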
CONCLUSION
The three conventional empirical methods of estimating item bias via differential
item functioning, namely the simultaneous item bias approach, the Mantel-Haenszel
Delta, and logistic regression, show strong concordance. They largely agreed that
the majority of the 20 items on the four test passages were not biased, and all
three detected the single biased item (Soccer 13).
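As an illustration of one of the three methods, the Mantel-Haenszel Delta can be sketched in a few lines. The sketch assumes examinees have already been grouped into strata by total score and that each stratum supplies a 2 x 2 table of counts; the function name and the input layout are hypothetical, and no significance test is included.

```python
import math


def mantel_haenszel_delta(strata):
    """Return (common odds ratio, MH delta) for one item.

    strata: iterable of tuples
        (ref_correct, ref_wrong, focal_correct, focal_wrong),
    one tuple per ability stratum. In this study's design the reference
    group is male and the focal group is female (an assumption carried
    over from the sampling description).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        raise ValueError("odds ratio is undefined for these counts")
    alpha = num / den
    # Negative delta means the item favors the reference group.
    delta = -2.35 * math.log(alpha)
    return alpha, delta


if __name__ == "__main__":
    # Hypothetical counts for a single item across two score strata.
    alpha, delta = mantel_haenszel_delta([(30, 10, 20, 20), (25, 5, 20, 10)])
    print(alpha, delta)
```

With males as the reference group, a negative delta corresponds to the convention used in the concordance analysis, where negative values indicate bias favoring males.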
The subjective ratings of bias suggested that novice and experienced teachers
tend to overestimate the extent of gender bias. The tendency is to see the
schematic domain of whole passages as likely to induce bias. The male teachers
sampled may simply confuse potential sensitivity issues with actual test bias. The
finding that both novice and experienced male raters identified phantom bias
also suggests a possible stereotypical assumption about knowledge domains
thought to favor male versus female test takers. The lack of empirical evidence
of item bias on 19 of the 20 experimental test items further suggests that female
Japanese test candidates are more familiar with particular schematic domains
than their male teachers might give them credit for.
The reality is that high stakes language tests in Japan are rarely screened empir-
ically for item bias. Current practice calls for moderation panels to conduct sensi-
tivity reviews of draft test materials. As many of these panels are predominantly
made up of males, the potential for the needless omission of unbiased test material
is implied by the findings of this study. A large false-positive error rate is likely to
render such hermeneutic assessments of bias inefficient in terms of cost-utility cri-
teria. This conclusion is derived from the observation that for many high stakes ad-
missions exams, the test passages and items are crafted at high cost by sequestered
test construction committees. When possible sensitivity is confused with authentic
bias in candidate test items, unbiased items and often whole passages are omitted
despite their potential validity. It is possible that the eventual homogenization of
test passages can even work against content validity in that authentic texts repre-
senting the wider domain of real world usage are more likely to be excluded in the
subjective sensitivity reviews. Because empirical counterevidence is rarely used to
correct the tendency to equate sensitivity with bias, the high false-positive error
rate that results from relying exclusively on subjective moderation panels wastes
considerable resources spent drafting test passages for these high stakes foreign
language examinations.
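The false-positive argument can be made concrete with a small sketch that compares panel flags against empirical flags for the same items. The function name and the example flags are invented for illustration and are not the study's data; the empirical result is treated as the criterion, which is itself an assumption.

```python
def false_positive_rate(panel_flags, dif_flags):
    """Share of empirically unbiased items that the panel flagged as biased.

    panel_flags, dif_flags: equal-length sequences of booleans, one per item.
    """
    if len(panel_flags) != len(dif_flags):
        raise ValueError("both sequences must cover the same items")
    pairs = list(zip(panel_flags, dif_flags))
    false_pos = sum(1 for panel, dif in pairs if panel and not dif)
    true_neg = sum(1 for panel, dif in pairs if not panel and not dif)
    if false_pos + true_neg == 0:
        raise ValueError("no empirically unbiased items to evaluate")
    return false_pos / (false_pos + true_neg)


if __name__ == "__main__":
    # Invented example: 20 items, one with empirical DIF, eight panel flags.
    dif = [False] * 20
    dif[12] = True
    panel = [False] * 20
    for item in (0, 1, 2, 3, 12, 13, 14, 15):
        panel[item] = True
    print(false_positive_rate(panel, dif))
```

Multiplying such a rate by the drafting cost per item gives a rough estimate of the resources lost to needless omissions.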
The implication of this study is that, in addition to moderation reviews, empirical
item bias analyses should be conducted on high stakes foreign language tests used
for university admissions. Faulty items that slip through the moderation process can
be identified and omitted from the test before scoring is finalized. This approach,
given the widespread availability of item bias detection software, is likely to be,
in the long run, the most cost-effective and valid way of removing biased items
from high stakes language tests.
ACKNOWLEDGMENTS
This study was supported by a kaken grant from the Japanese Ministry of Education
and Science.
We thank Sugiyama Naoto, Tom Robb, Ishikawa Tomohito, Isono Morihiko,
Ozawa Masato, and Ishihara Satoru for their assistance in data collection.
REFERENCES
American Educational Research Association, American Psychological Association, & National Coun-
cil on Measurement in Education Joint Committee on Standards for Educational and Psychological
Testing. (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Brown, A., & Iwashita, N. (1996). Language background and item difficulty: The development of a
computer-adaptive test of Japanese. System, 24, 199–206.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand Oaks, CA:
Sage.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language
Testing, 2, 155–163.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ:
Lawrence Erlbaum Associates, Inc.
Dwyer, C., & Johnson, L. (1997). Grades, accomplishments, and correlates. In W. Willingham & N.
Cole (Eds.), Gender and fair assessment (pp. 127–156). Mahwah, NJ: Lawrence Erlbaum Associates,
Inc.
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14(3), 261–277.
Hyde, J., & Linn, M. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bul-
letin, 104(1), 53–69.
Ingulsrud, J. (1994). An entrance test to Japanese universities: Social and historical context. In C. Hill
& K. Parry (Eds.), From testing to assessment: English as an international language (pp. 61–81).
London: Longman.
Kim, M. (2001). Detecting DIF across the different groups in a speaking test. Language Testing, 18(1),
89–114.
Lee, Y. W., Breland, H., & Muraki, E. (2005). Comparability of TOEFL CBT writing prompts for
different native language groups. International Journal of Testing, 5(2), 131–158.
Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item
bias procedures for detecting differential item functioning. Applied Psychological Measurement,
18(4), 315–328.
Pae, T. (2004). DIF for examinees with different academic backgrounds. Language Testing, 21(1),
53–73.
Penfield, R. (2001). Assessing differential item functioning among multiple groups: A comparison of
three Mantel-Haenszel procedures. Applied Measurement in Education, 14(3), 235–259.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of the logistic regression and Man-
tel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measure-
ment, 17(2), 105–116.
Ross, S. J. (2000). Individual difference factors on the Certificate in Spoken and Written English. In G.
Brindley (Ed.), Studies in Adult Migrant English (pp. 191–214). Sydney: National Center for English
Language Teaching and Research.
Ryan, K., & Bachman, L. (1992). Differential item functioning of two tests of EFL proficiency. Lan-
guage Testing, 9(1), 12–29.
Sasaki, M. (1991). A comparison of two methods of detecting differential item functioning in an ESL
placement test. Language Testing, 8(2), 95–111.
Shadish, W., Robinson, L., & Lu, C. (1999). ES: A computer program for effect size calculation. Min-
neapolis: Assessment Systems Corporation.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF
from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika,
58(2), 159–194.
Siegel, S., & Castellan, N. J., Jr. (1988). Non-parametric statistics for the behavioral sciences. New
York: McGraw-Hill.
Swanson, D., Clauser, B., Case, S., Nungester, R., & Featherman, C. (2002). Analysis of differential
item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Be-
havioral Statistics, 27(1), 53–75.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language
Testing, 17(3), 323–340.
Whitmore, M., & Schumacker, R. (1999). A comparison of logistic regression and analysis of variance
differential item functioning decision methods. Educational and Psychological Measurement, 59(6),
910–927.
Willingham, W., & Cole, N. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum As-
sociates, Inc.
Young, R. (1995). Conversational styles in language proficiency interviews. Language Learning, 45,
3–41.
Young, R., & Halleck, G. (1998). ‘Let them eat cake!’ or How to avoid losing your head in
cross-cultural conversations. In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the
assessment of oral proficiency (pp. 355–382). Amsterdam: Benjamins.
Young, R., & Milanovic, M. (1992). Discourse variation in oral proficiency interviews. Studies in Sec-
ond Language Acquisition, 14, 403–424.
Zwick, R. (2002). Fair game? The use of standardized admissions tests in higher education. New York:
Routledge Falmer.
Zwick, R., & Sklar, J. (2003, April). California and the SAT: A reanalysis of University of California
admissions data. Paper presented at AERA, Chicago, IL.
Zwick, R., & Thayer, D. (2002). An application of an empirical Bayes’ enhancement of Man-
tel-Haenszel differential item functioning analysis to a computerized adaptive test. Applied Psycho-
logical Measurement, 25(1), 57–76.
APPENDIX
Gender Bias Survey
There is no need to provide answers to the test questions. Please inspect the test
passages, graphs, and charts, then rate the differential difficulty with which male or
female students would find each test question.
Rating Scale:
Rate –3 if you think male students would be highly advantaged
Rate –2 if you think male students would be moderately advantaged
Rate –1 if you think male students would be slightly advantaged
Rate 0 if you think there is no differential advantage
Rate 1 if you think female students would be slightly advantaged
Rate 2 if you think female students would be moderately advantaged
Rate 3 if you think that female students would be highly advantaged
About You
Please complete the survey
Pre-Service
Pre-Service Graduate
College Faculty
In-Service High School Teacher
Other
23) In what ways do you consider language test content to give differential ad-
vantages to either male or female high school students? Please answer
freely.