
INTERNATIONAL JOURNAL OF TESTING, 6(3), 229–253

Copyright © 2006, Lawrence Erlbaum Associates, Inc.

The Subjective and Objective Interface of
Bias Detection on Language Tests
Steven J. Ross and Junko Okabe
Kwansei Gakuin University
Kobe-Sanda, Japan

Test validity is predicated on there being a lack of bias in tasks, items, or test content.
It is well-known that factors such as test candidates’ mother tongue, life experiences,
and socialization practices of the wider community may serve to inject subtle interac-
tions between individuals’ background and the test content. When the gender of the
test candidate interacts further with these factors, the potential for item bias to influ-
ence test performances grows. A dilemma faced by test designers concerns how they
can proactively screen test content for possible sources of bias. Conventional prac-
tices in many contexts rely on the subjective opinion of review panels in detecting
sensitive topical content and potentially biased material and items. In the last 2 de-
cades this practice has been rivaled by the increased availability of item bias diagnos-
tic software. Few studies have compared the relative accuracy and cost utility of the
two approaches in the domain of language assessment. This study makes just that
comparison. A 4-passage, 20-item reading comprehension test was given to a strati-
fied sample of 825 high school students and college undergraduates at 5 Japanese in-
stitutions. The sampling included a focus group of 468 female students compared to a
reference group of 357 male English as a foreign language (EFL) learners. The test
passages and items were also given to a panel of 97 in-service and preservice EFL
teachers for subjective ratings of potential gender bias. The results of the actual item
responses were then empirically checked for evidence of differential item function-
ing using Simultaneous Item Bias analysis, the Mantel-Haenszel Delta method, and
logistic regression. Concordance analyses of the subjective and objective methods
suggest that subjective screening of bias overestimates the extent of actual item bias.
Implications for cost-effective approaches to item bias detection are discussed.

Correspondence should be addressed to Steven J. Ross, School of Policy Studies, Kwansei Gakuin
University, Gakuen 2–1, Sanda, Hyogo, Japan 6691337. E-mail: sross@ksc.kwansei.ac.jp

The issue of test bias has always been central in the consideration of test validity. Bias
has been of concern because inferences about the results of test outcomes often lead
to consequences affecting the life-course trajectories of test candidates, such as in
the use of tests for employment, admissions, or professional certification. Test re-
sults may be considered unambiguously fair to the extent candidates are compared,
as in the case of norm-referenced tests, on only the domain-relevant constructs in-
cluded in the measurement instrument devised for the purpose. In the real world of
testing practice, uncontaminated construct relevant domain coverage is often more
an ideal than a reality. This is especially true when the testing construct involves do-
mains of knowledge or ability related to language learning.

ISSUES IN SECOND LANGUAGE ASSESSMENT BIAS

Language learning, particularly second or foreign language learning, is influenced
to no small degree by factors that interact with, and that are sometimes even inde-
pendent of, the direct consequences of formal classroom-based achievement. Yet
in many high stakes contexts, foreign or second language ability is used as a
gate-keeping criterion for employment and admissions decisions. Further, inclu-
sion of foreign language ability on selection tests is often predicated on the as-
sumption that candidates’ relative standing reflects the cumulative effects of
achievement propelled by long-term commitment to diligent scholarship. These
assumptions do not often factor in the possibly biasing influences of cross-linguis-
tic transfer and naturalistic acquisition on individual differences in test outcomes.
Constructing high stakes measures to be free of these kinds of bias presents a chal-
lenging task to language test designers, particularly when the implicit meritocratic
intention is to reward scholastic achievement.
Studies of bias on language tests have tended to fall into the three broad categories
of transfer, experience, and socialization practices. The first, which accounts for the
influence of transfer from a first-learned language to a second or foreign language,
addresses the extent of bias occurring when speakers of different first languages are
tested on a common second language. Chen and Henning (1985), for instance, noted
the transferability of Latin cognates from Spanish to English lexical recognition,
which served to bias native speakers of Spanish over native speakers of Chinese.
Working in the same vein, Sasaki (1991) corroborated Chen and Henning using a dif-
ferent DIF detection method. Both of these studies suggested that when novel words
are encountered by Spanish and Chinese speakers, the cognitive task of lexical infer-
ence differs. For instance, consider the following sample sentence:

Residents evacuated their homes during the conflagration.

For Romance language speakers, the deductive task is to parse “conflagration” for
its affixation and locate the core free morpheme. Once located, the Romance lan-
guage speaker can compare the root to similar known free morphemes in the
reader’s native language, for instance, incendio or conflagracion.
The Chinese speaker, in contrast, starts at the same deductive step, but must
compare the free root morpheme to all other previously learned morphemes (i.e.,
most probably, “flag”). The resulting difference leads Spanish speakers to follow a
semantically based second step, while Chinese speakers are likely to split between
a semantic and phonetic comparison strategy. The item response accuracy in such
cases favors the Romance language speakers, even when matched with Chinese
counterparts for overall proficiency.
The transferability factor applies to orthographic phenomena as well. Brown
and Iwashita (1996) detected bias favoring Chinese learners of Japanese over na-
tive English speakers, whose native language orthography is typologically most
distant from Japanese. Given the fact that modern written Japanese relies on Chi-
nese character compounds for the formation of nominal phrases, as well as the root
forms of many verbs, Chinese students of Japanese can transfer their knowledge of
semantic roots for many Japanese words and compounds, even without knowledge
of their corresponding phonemic representations, or exact semantic reference.
Here a similar strategic difference emerges for speakers of Chinese versus
speakers of an Indo-European language. While the exact compound might not ex-
ist in modern written Chinese, the component Chinese characters provide a deduc-
tive strategy to Chinese learners of Japanese that is not available to English speak-
ers. For instance, consider the compound 新幹線 (bullet train), which does not have a
direct counterpart in Chinese.

The component characters 新 “new,” 幹 “trunk,” and 線 “line” provide the basis
for a lexical inference that the compound refers to a kind of rail transportation sys-
tem. For an English-speaking learner of Japanese, the cognitive load falls on de-
ducing the meaning of the whole compound from its components. Here, a mixed
grapheme to phoneme strategy is most likely if 新 “new” and 線 “line” are recog-
nized as “shin” and “sen.” The lexical inference here might entail filling in the
missing component 幹 “trunk” with a syllable that matches the surrounding
“shin___sen” for successful compound word recognition.
Examining transferability on a macrolevel, Ross (2000), while controlling for
biographical and experiential factors such as age, educational background, and
hours of ESL learning, found weaker evidence of a language distance factor. The
distance factor was comprised of canonical syntactic structure, orthography, and
typological grouping which served to influence the relative rates of learning
English by 72 different groups of migrants to Australia.
The overall picture of transfer bias suggests that on the microlevel, particularly
in studies that triangulate two different native languages against a target language,
evidence of transfer bias tends to be identifiable. When many languages are com-
pared and individual differences in experiential and cognitive variables are fac-
tored in, transfer bias at the macro or language typological level appears to be less
readily identifiable.
A second type of bias in language assessment arises from differential exposure to
a target language that candidates might experience. Ryan and Bachman (1992), for
instance, considered Test of English as a Foreign Language (TOEFL) type items to be
more culturally oriented toward the North American context than a British compari-
son, the First Certificate in English. Language learners with exposure to instruction
in American English and TOEFL test preparation courses were thought to have a
greater chance on such items than learners whose exposure did not prepare them for
the cultural framework TOEFL samples in its reading and listening items. Their
findings suggest that high stakes language tests for admissions such as TOEFL may
indirectly include knowledge of cultural reference in addition to the core linguistic
constructs considered to be the object of measurement. Presumably this phenome-
non would be observable on language tests such as the International English Lan-
guage Testing System (IELTS), which is designed to qualify candidates for admis-
sions to universities in the United Kingdom, New Zealand, or Australia.
Cultural background comparisons in second language performance assessments
have demonstrated how speech community norms may transfer into assessment pro-
cesses like oral proficiency interviews. While not overtly recognized as a source of
assessment bias, interlanguage pragmatic transfer has been seen to influence the per-
formances of Asian speakers when compared to European speakers (Young, 1995;
Young & Halleck, 1998; Young & Milanovic, 1992). The implication is that if as-
sessments are norm-referenced, speakers from discourse communities favoring ver-
bosity may be advantaged in assessment such as interactive interviews. This obser-
vation apparently extends to semi-direct speech tasks such as the SPEAK test. Kim
(2001), for instance, found differential rating functions for pronunciation and gram-
mar ratings for Asians when compared to equal-ability European test candidates. The
implication here is that raters apply the rating scale differently.
In considering possible sources of bias in university admissions, Zwick and
Sklar (2003) opined that the foreign language component on the SAT II created a
“bilingual advantage” for particular candidates for admission to the University of
California. If candidates had been raised in bilingual households, for instance, they
would be expected to score higher on the foreign language listening comprehen-
sion component, which is an optional third subscore on the SAT II. This test is re-
quired for undergraduate admissions to all campuses at the University of Califor-
nia. The issue of bias in this case stems from the assumption that the foreign
language component was presumably conceptualized as an achievement indicator,
when in fact the highest scoring candidates are from bilingual households. The
perceived advantage is that such candidates develop their proficiency not through
coursework and scholarship, but through naturalistic exposure.
Elder (1997) reported on a similar fairness issue arising from the use of second
language tests for access to higher education in Australia. Elder noted that the
score weighting policy on the Victoria Certificate of Education, functioning as it
does as a qualification for university admission in that state, explicitly profiles the
language learning history of the test candidate. This form of candidate profiling
aimed to reweight the influence of the foreign language scores on the admissions
qualification so as to minimize the preferential bias bilingual candidates enjoyed
over conventional foreign language learners. Elder found interactions between
English and the profile categorizations were not symmetric across different for-
eign language test candidatures and concluded that efforts to adjust for differential
exposure profiles are fraught with difficulty.
A third category of bias in language assessment deals with differences in social-
ization patterns. Socialization patterns might involve academic tracking early in a
school student’s educational career, usually into either science or humanities aca-
demic tracks in high school (Pae, 2004). In some cultural contexts, academic track-
ing might correspond to gender socialization practices as well.
In contrast to cultural assumptions made about the verbal advantage females
have over males, Hyde and Linn (1988) concluded in a meta-analysis of 165 stud-
ies of gender differences on all facets of verbal tests that there was an effect size of
D = .11 for gender differences. To them, this constituted little firm evidence to sup-
port the assumed female verbal advantage. Willingham and Cole (1997), and
Zwick (2002) concur with this interpretation, noting that gender differences have
steadily diminished over the last four decades and now account for no more than
1% of the total variation on ability tests in general. Willingham and Cole (1997, p.
348) however, noted that females tend to frequent the top 10% in standardized tests
of reading and writing.
Surveys of gender differences on The Advanced Placement Test, used for uni-
versity admissions to the more selective American universities, suggest reasons
why verbal differences in literacy still tend to persist. Dwyer and Johnson (1997, p.
136) describe considerable effect size differences between college-bound males
and females in preference for language studies. This finding would suggest that in
the North American context socialization patterns could serve to channel high
school students into academic tracks that tend to correlate with gender.
To date, language socialization issues have not been central in foreign or second
language test bias analyses in multicultural contexts because of the more immedi-
ate and salient influences of exposure and transfer on high stakes tests. In contexts
that are not characterized by multiculturalism, a more subtle threat of bias may be
related to how socialization practices steer males and females into different aca-
demic domains, and in doing so cumulatively serve to make gender in particular
knowledge domains differentially salient. When language tests inadvertently sam-
ple particular domains more than others, the issue of schematic knowledge inter-
acting with the gender of the test candidate takes on a new level of importance.
In a study of differential item function (DIF) on a foreign language vocabulary test
for Finnish secondary students, Takala and Kaftandjieva (2000) found that individ-
ual vocabulary items showed domain-sampling effects, whereas the total score on the
test did not reflect systematic gender bias. Their study identified how words sampled
from male activity domains such as mechanics and sports might yield higher scores
for male test candidates than for females at the same ability level. Their approach
used conventional statistical analyses of DIF, which, according to some current stan-
dards of test practices, would serve to identify and eliminate biased items before test
scores are interpreted (American Educational Research Association, American Psy-
chological Association, & National Council on Measurement in Education, 1999).
With such practices for bias-free testing, faulty items would be screened through
sensitivity review and content moderation prior to test administration, and then sub-
jected to DIF analyses before the final score tally.
The issue of interest we address in this article is how gender bias on foreign lan-
guage tests devised for high stakes purposes can be diagnosed when accepted cul-
tural practices disfavor the use of empirical analysis of item functioning prior to
score interpretation. In this study we address the issue of the accuracy of sensitivity
review and bias screening through content moderation prior to test administration
by comparing the judgments of both expert and novice moderation groups with the
results of three different empirical approaches to DIF.

BACKGROUND TO THE STUDY

Four sample subtests written for a high stakes university admissions test were used in
the study. The subtests were all from the fourth section of a six section English as a
foreign language (EFL) test given annually to approximately 630,000 Japanese high
school seniors. The results of the exam are norm-referenced and serve to qualify can-
didates for secondary examinations to specific academic departments at national and
public universities (Ingulsrud, 1994). Increasingly, private Japanese universities use
the results of the Center examination for admissions decisions, making the test the
most influential gate-keeping device in the Japanese educational system.
The format of the EFL test is a “discrete point” type of test of language structure
and vocabulary, sampling the high school syllabus mandated by the Japanese Min-
istry of Education. It is construed as an achievement test because only vocabulary
and grammatical structures occurring in about 40 high school textbooks sanc-
tioned by the Ministry of Education are sampled on the test. The six sections of the
examination cover knowledge of segmental pronunciation, tonic word stress, dis-
crete point grammar, word order, paragraph coherence and cohesion, interpreta-
tion of short texts describing graphics and data in tabular format, interactive
dialogic discourse in the form of a transcribed conversation, and comprehension of
a 400-word reading comprehension passage. All items, usually 50 in all, are in
multiple-choice format to facilitate machine scoring.
The test is constructed by a committee of 20 examiners who convene 40 days
each year to draft, moderate, and revise the examination before its administration
in January each year. On several occasions during the test construction period the
draft passages and items are sent out to an external moderation panel for sensitivity
and bias review. The external moderation panel, whose membership is not known
to the test committee members, is composed of former committee members and
examination committee chairpersons. Their task is to critique the draft passages
and items and to recommend changes, large and small. On occasion the modera-
tion panel recommends substitution of entire draft test sections. This usually oc-
curs when issues of test sensitivity or bias are raised. The criteria for sensitivity are
themselves highly subjective and variable across moderation panels. For some, test
content should involve “heart-warming” topics that avoid dark or pessimistic
themes. For others, avoiding references to specific social or ethnic groups may be
the most important criterion.
The four passages included in the study were originally drafted for the fourth sec-
tion of the EFL language examination. The specifications for the fourth section call
for three or four paragraphs describing charts, figures, or tabular data concerning hy-
pothetical experimental or survey data in a social science domain. This section of the
test is known to be the most domain-sensitive, because the content sampling usually
sits at the borderline of where male–female differences in experiential schemata be-
gin to emerge in the population.
The four passages were never used in the operational test, but were held in re-
serve as alternates. All four had at various stages of development undergone exter-
nal review by the moderation panel and were found to be possibly too gender sensi-
tive, thus ending further investment in committee time for their revision.
The operational test is not screened with DIF statistics prior to score interpreta-
tion. The current test policy endorsed by the Japanese testing community is predi-
cated on the assumption that the moderation panel reviews are sufficiently accurate
in detecting faulty, insensitive, or biased items before any are used on the opera-
tional test. The research issue addressed here thus considers empirical evidence of
the accuracy of the subjective approach currently used, and directly examines evi-
dence that subjective interpretation of gender bias in fact concurs with objective
analyses using empirical methods common to DIF analysis.

METHOD

The four-passage, 20-item reading comprehension test was given to a stratified
sample of 825 high school students and college undergraduates at five institutions.
The sampling included a focus group of 468 female students compared to a refer-
ence group of 357 male EFL learners. The aim of the sampling was to approximate
the range of scores normally observed in the population of Japanese high school
seniors. The 20-item test was given in multiple-choice format with enough time (1
hr) for completion, and was followed with a survey about the age, gender, and
language learning experiences of the sample test candidates.

Materials
The test section specifications call for a three to four paragraph text describing
graphs, figures, or tables written as specimens of social science types of academic
writing. In the case of the experimental test, four of these passages were used. Each
of the passages had five items that tested readers’ comprehension of the passage
content. The themes sampled on the test can be seen in Table 1.
The experimental test comprised four short reading passages, which closely
approximate the format and content of Section Four of the Center Examination.
The sampling of students in this study yielded a mean and variance similar to those of the
operational test. Table 2 lists descriptive statistics for the test.

TABLE 1
Experimental Passage Order and Thematic Content

Passage    Thematic Content

I          Letter rotation experiment
II         Visual illusions experiment
III        Soccer league tournament
IV         Survey of transportation use changes

TABLE 2
Mean, Standard Deviation, Internal Consistency, and Sample Size

M        SD      Reliability    Sample Size    Items

12.36    4.14    .780           825            20

Bias Survey Procedure

A test bias survey was constructed for use by in-service and preservice EFL teach-
ers. The sampling of high school level teachers parallels the normal career path of
Japanese members of a typical moderation panel. The actual external moderation
panel is composed of university faculty members, most of whom had followed a
career path starting with junior and senior high school EFL teaching. The bias sur-
vey was thus devised to sample early, mid, and late career EFL teachers who were
assumed to represent the larger population of language teaching professionals
from whom future test moderation panel members are drafted. In-service teachers
(n = 37) were surveyed individually.
In addition to the sampling of in-service teachers, a larger group of preservice
EFL teachers in training were also surveyed so as to compare the ratings provided
by seasoned professional teachers with neophyte teachers (n = 60). All respon-
dents were asked to examine the four test passages and each of the 20 items on the
test before rating the likelihood that each item would favor male or female test can-
didates. The preservice teachers in training completed the survey during Teaching
English as a Foreign Language (TEFL) Methodology course meetings.
The rating scale used and instructions are shown in the Appendix.

ANALYSES: OBJECTIVE DIFFERENTIAL ITEM FUNCTIONING ANALYSIS

A variety of options now exist for detecting DIF. Comparative research suggests
that DIF methods tend to differ in the extent of Type I error and power. Whitmore
and Schumacker (1999), for instance, found logistic regression more accurate than
an analysis of variance approach. A direct comparison of logistic regression and
the Mantel-Haenszel procedure (Rogers & Swaminathan, 1993) indicated moderate
differences in power. Swanson, Clauser, Case, Nungester, and Featherman (2002)
more recently approached DIF with hierarchical logistic regression and found it to
be more accurate than standard logistic regression or Mantel-Haenszel estimates.
In this approach, different possible sources of item bias can be dummy-coded and
nested in the multilevel design. Recent uses of logistic regression for DIF extend to
polytomous rating categories (Lee, Breland, & Muraki, 2005) but still enable an
examination of nonuniform DIF through interaction terms between matching
scores and group membership.
Although multilevel modeling approaches offer extended opportunities for test-
ing nested sources of potential DIF, the single level methods, such as logistic re-
gression and Mantel-Haenszel approaches, have tended to prevail in DIF studies.
Penfield (2001) compared three variants of Mantel-Haenszel according to differ-
ences in the criterion significance level, and concluded that the generalized ap-
proach provided the lowest error and most power. Zwick and Thayer (2002) found
that modifications of the Mantel-Haenszel procedure involving an empirical Bayes
approach showed promise of greater potential for bias detection. A direct compari-
son of the Mantel-Haenszel procedure with Simultaneous Item Bias (SIB;
Narayanan & Swaminathan, 1994) concluded that the Mantel-Haenszel procedure
yielded smaller Type I error rates relative to SIB.
In this study, three empirical methods of detecting DIF were used. The choice
of bias detection methods was based on their overall frequency of use in em-
pirical DIF studies. The three methods were thought to represent conventional ap-
proaches to DIF research, and thus best operationalize the “objective” approaches to be
compared with subjective methods.
Mantel-Haenszel Delta was computed from six sets of equipercentile-matched
ability subgroups cross tabulated by gender. Differences in the observed Deltas for
the matched males and females were evaluated against a chi-square distribution.
This method matches males and females along the latent ability continuum and de-
tects improbable discontinuities between the expected percentage of success and
the observed data.
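To make the computation concrete, a minimal sketch of this kind of stratified Mantel-Haenszel analysis for a single dichotomous item follows. The function and variable names are illustrative rather than taken from the operational analysis, and a six-band equal-frequency split of the total score stands in for the equipercentile matching described above.

```python
# Illustrative Mantel-Haenszel DIF sketch for one dichotomous item (not the study's code).
import numpy as np
from scipy.stats import chi2

def mantel_haenszel_dif(item, total, focal, n_strata=6):
    """item: 0/1 responses; total: matching total score; focal: True for the focal group."""
    item, total, focal = map(np.asarray, (item, total, focal))
    cuts = np.quantile(total, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(total, cuts)              # approximate equipercentile ability bands

    num = den = chi_num = chi_den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(m & ~focal & (item == 1))       # reference correct
        b = np.sum(m & ~focal & (item == 0))       # reference incorrect
        c = np.sum(m & focal & (item == 1))        # focal correct
        d = np.sum(m & focal & (item == 0))        # focal incorrect
        n = a + b + c + d
        if n < 2:
            continue
        num += a * d / n                           # pieces of the common odds ratio
        den += b * c / n
        exp_a = (a + b) * (a + c) / n
        var_a = (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
        chi_num += a - exp_a
        chi_den += var_a

    alpha_mh = num / den                           # Mantel-Haenszel common odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)            # ETS Delta metric; negative favors the reference group
    chi_sq = (abs(chi_num) - 0.5) ** 2 / chi_den   # continuity-corrected MH chi-square
    return delta_mh, chi_sq, chi2.sf(chi_sq, df=1)
```

On the common ETS convention, absolute Delta values below 1 are usually treated as negligible, while values above roughly 1.5 that are also statistically significant mark items for review.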
The second method of detecting bias was a logistic regression performed on the di-
chotomously scored outcomes for each of the 20 items. The baseline model tested the
effects of gender controlling for each student’s total score (Camilli & Shepard, 1994).
In this binary regression, the probability of success should be solely influenced by the
individual’s overall ability. In the event of no bias, only the test score will account for
systematic covariance between the item responses on a particular item. If bias does af-
fect a particular item, the variable encoding gender will covary with the item response
independently of the covariance between the score and the outcome. Further, if bias is
associated with particular levels of ability on the latent score continuum, a nonuniform
DIF can be diagnosed with a Gender × Total Score interaction term.
logit[P(correct response)] = constant + gender + score + (gender × score)
In the event a nonuniform DIF is confirmed not to exist, the interaction term can
be deleted to yield a main effect for gender, controlling for test score. Gender ef-
fects are then tested for nonrandomness against a t-distribution.
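A minimal sketch of this two-step screen, fitted to simulated data, is shown below; the column names and the data-generation step are illustrative assumptions, not the study’s data set.

```python
# Illustrative logistic regression DIF screen for one item (simulated data, not the study's).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 825
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),                      # 1 = focal group, 0 = reference group
    "score": rng.integers(0, 21, n),                      # total test score (matching variable)
})
# Simulate an item whose difficulty depends only on ability, i.e., an unbiased item.
p = 1 / (1 + np.exp(-(-3 + 0.3 * df["score"])))
df["correct"] = rng.binomial(1, p)

# Full model: the interaction term carries the evidence of nonuniform DIF.
full = smf.logit("correct ~ female + score + female:score", data=df).fit(disp=0)
if full.pvalues["female:score"] >= 0.05:
    # No nonuniform DIF: drop the interaction and test the gender main effect,
    # controlling for total score (uniform DIF).
    reduced = smf.logit("correct ~ female + score", data=df).fit(disp=0)
    print("uniform DIF p =", round(reduced.pvalues["female"], 3))
else:
    print("nonuniform DIF p =", round(full.pvalues["female:score"], 3))
```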
The third empirical method used a simultaneous item bias procedure utilizing item response
theory (Shealy & Stout, 1993). The SIB approach was performed on each of the 20
items in turn. The sums of all other items were used in rotation as ability estimates
in matching male and female examinees via a regression approach. This approach
employs the matching strategy of the Mantel-Haenszel method, and uses the total
score based on k-1 items as a concurrent covariate for each of the item bias tests. Dif-
ferences in estimates of DIF were evaluated against a z distribution.
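The following simplified sketch conveys the rest-score matching idea behind this approach; it omits the regression correction that Shealy and Stout apply, so it should be read as an illustration of the matching logic rather than the full SIB estimator, and all names are ours.

```python
# Simplified rest-score DIF sketch in the spirit of SIB (omits the Shealy-Stout regression correction).
import numpy as np

def rest_score_dif(responses, focal, item_index):
    """responses: examinees x items 0/1 array; focal: True for focal-group examinees."""
    responses, focal = np.asarray(responses), np.asarray(focal)
    item = responses[:, item_index]
    rest = responses.sum(axis=1) - item            # k-1 rest score used as the matching variable

    weighted_diff = weight = 0.0
    for k in np.unique(rest):
        ref = item[(rest == k) & ~focal]           # matched reference-group responses
        foc = item[(rest == k) & focal]            # matched focal-group responses
        if len(ref) == 0 or len(foc) == 0:
            continue
        weighted_diff += len(foc) * (ref.mean() - foc.mean())
        weight += len(foc)
    return weighted_diff / weight                  # positive values suggest the item favors the reference group
```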
The composite results of the three different approaches to estimating DIF for each
of the 20 items are given in Table 3. Each of the three objective measures employs a
different test statistic to assess the likelihood of the observed bias statistic. Analo-
gous to meta-analytic methods, the different effects can be assessed as standardized
metrics. To this end, each DIF estimate, controlled for overall candidate ability, is
presented as a conventional probability (p < .05) of rejecting the null hypothesis.
TABLE 3
Objective Bias Probabilities Per Item

Item          MH Delta    Logistic    SIB

Letters1      0.98        0.142       0.791
Letters2      0.88        0.918       0.761
Letters3      0.2         0.901       0.133
Letters4      0.69        0.617       0.557
Letters5      0.96        0.981       0.768
Visuals6      0.39        0.029       0.686
Visuals7      0.17        0.292       0.199
Visuals8      0.36        0.178       0.281
Visuals9      0.71        0.357       0.974
Visuals10     0.24        0.106       0.361
Soccer11      0.97        0.659       0.96
Soccer12      0.87        0.776       0.806
Soccer13      0.001       0.036       0.001
Soccer14      0.37        0.7         0.414
Soccer15      0.47        0.456       0.583
Transprt16    0.61        0.099       0.827
Transprt17    0.39        0.48        0.31
Transprt18    0.29        0.048       0.539
Transprt19    0.48        0.071       0.48
Transprt20    0.84        0.529       0.605

Note. Items in bold represent significant probabilities, where p < .05. MH = Mantel–Haenszel;
SIB = simultaneous item bias.

Table 3 indicates that the Mantel-Haenszel and SIB approaches are equally parsimonious in
detecting gender bias on the 20-item test. Both of these methods employ ability matches of men
and women along the latent ability continuum. In contrast, the logistic regression approach,
which uses the total score as a covariate, appears slightly more likely to detect bias. All three
methods concur in detecting gender bias on Soccer item 13, shown in Table 4.

ANALYSIS: SUBJECTIVE ESTIMATES OF BIAS

The panel of 97 preservice and in-service teachers was categorized into male and
female subgroups of novices and experienced teachers based on survey responses.

TABLE 4
Biased Item No. 13 From the Soccer Passage

Item Description

13 If the Fighters defeat the Sharks by a score of 1–0 then:


1 The Lions will play the Sharks.
2 The Fighters will play the Bears.
3 The Sharks will play the Eagles.
4 The Fighters will play the Lions.
The aim of this subdivision of the subjective raters was to explore possible sources
of differential sensitivity to bias in the test questions. In contrast to the objective
methods of diagnosing item bias, the subjective ratings do not employ any infor-
mation about individual ability inferred from the total score. Subjective estimates
rely completely on the apparent schematic content and presumed world knowledge
needed to answer each item. Further, because ratings were on a Likert type scale,
differences between the observed mean rating and the null hypothesis needed to be
tested to provide bias probabilities1 comparable to those in Table 3. To this end, the
mean rating of gender bias on each of the 20 items was tested against the hypothe-
sis that male versus female advantage on each item equaled zero. Table 5 contains
the subjectively estimated probabilities that each item is biased.
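A minimal sketch of this one-sample test, using hypothetical ratings for a single item on the scale described in the Appendix, is shown below.

```python
# Illustrative one-sample t test of subjective bias ratings against a mean of zero.
import numpy as np
from scipy.stats import ttest_1samp

ratings = np.array([-1, 0, -1, -2, 0, -1, 0, -1])   # hypothetical ratings for one item (negative = favors males)
t_stat, p_value = ttest_1samp(ratings, popmean=0)    # H0: the population mean rating equals zero
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```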
In contrast with the objective measures of bias, the subjective analysis diagno-
ses considerably more bias. Complete subjective agreement in diagnosing bias oc-
curs in 6 of the 20 items. As the third column in Table 5 suggests, it appears that ex-
perienced male (EM) teachers are the most inclined to assume there is gender bias.
This subgroup in fact sees bias in the majority of items. Experienced female teach-
ers, in contrast, are the most conservative in assuming that schematic content indi-
cates possible test item bias. The novice male teachers-in-training correspond to
their more experienced male counterparts in assuming there is a schematic bias in
two of the four test passages.
Of particular interest is the tendency of the subjective raters to apply the bias diagno-
sis not to individual items, but to entire test passages. It appears likely that these male re-
spondents equate content sensitivity with test bias. Sources of this confusion will be ex-
amined in the narrative accounts provided by some of the male teachers. The tendency to
see content schema as bias suggests that subjective raters see topical domains as the key
source of possible bias. Both male and female in-service teachers would be expected to
share equivalently accurate knowledge about the cumulative consequences of socializa-
tion on Japanese teenagers’ world-knowledge. As Table 5 would suggest, however, the
experienced male teachers appear to overgeneralize the extent of possible schematic
knowledge differences between male and female students. The domains that appear to
be high bias risk to male teachers involve spatial processing (Visuals 6–9), all of the
items concerned with a sports tournament (Soccer 11–15), and all of the items about the
passage describing changes in transportation (Transport 16–20).

TABLE 5
Subjective Bias Probabilities Per Item

                     Novice                  Expert
Item Label     Males      Females      Males      Females

Letters1       0.337      0.71         0.494      0.136
Letters2       0.165      0.42         0.666      0.104
Letters3       0.082      0.452        0.494      0.426
Letters4       0.999      0.194        0.33       0.263
Letters5       0.999      0.497        0.33       0.435
Visuals6       0.082      0.628        0.005      0.426
Visuals7       0.165      0.008        0.056      0.104
Visuals8       0.04       0.001        0.002      0.003
Visuals9       0.999      0.728        0.012      0.538
Visuals10      0.584      0.341        0.541      0.671
Soccer11       0.035      0.007        0.001      0.001
Soccer12       0.009      0.001        0.001      0.001
Soccer13       0.001      0.001        0.001      0.001
Soccer14       0.004      0.001        0.001      0.001
Soccer15       0.005      0.009        0.001      0.001
Transprt16     0.19       0.552        0.001      0.165
Transprt17     0.673      0.473        0.01       0.272
Transprt18     0.19       0.124        0.001      0.336
Transprt19     0.19       0.151        0.004      0.5
Transprt20     0.19       0.044        0.004      0.385

Note. Items in bold represent significant probabilities, where p < .05.

1Subjective ratings were tested against the null hypothesis by assuming the population mean bias
(mu) equals zero and testing the observed subjective mean against a single-sample t distribution. Exact
probabilities of the observed t tests were then used in Table 5.

Subjective Accounts of Bias

As a way of accounting for the presumed bias in the test items, interviews with
three veteran male instructors were undertaken. These interviews provide a post
hoc reflective account of why bias would be expected in test items. The three ac-
counts provide subjective evidence as to the sources of the putative sensitivity or
bias in test items. Three facets of belief about gender differences were included in
the interview. The first was a global impression of the test materials. The second
was concerned with the four passages used in the actual reading test. The third
question in the interview phase addressed how each male teacher assumes his col-
leagues are aware of gender differences among students.

Teacher A (mid-30s, male)


Overall impression and belief
“In general, I believe that boys are better than girls at understanding sci-
entific and logical essays. Having said so, I normally don’t pay attention to
such gender differences. The actual English test materials, when compared
with Japanese materials, the topics are easier and the contents are less com-
plicated. Thus I tend to focus students on how to solve the tasks and get
higher scores rather than on the comprehension of the contents. In this
sense, rather than the topic, the form of the tasks may cause a different per-
formance between boys and girls; boys may do better in mathematical and
logical types of tasks. Anyway, girls do much better than boys as far as Eng-
lish exams are concerned. So I don’t think there is any particular gender dif-
ference among different English tests.”

Here the male teacher contends that there is a systematic difference in the female
students’ understanding of science and logical content. Yet assuming that the lan-
guage test content does not narrowly sample such domains, he claims there is no
bias on the test in question. In fact, he suggests that the advantage is on the side of
the female students, because the domain is language study. This view perhaps re-
flects a prevailing social construct in Japan. Science and math are perceived as
male domains, so test content focused on language gives females an advantage.

Teacher A’s Analysis of Four Test Passages


Passage 1 (Letter Rotation): “I found this an interesting passage, but I
don’t think there would be any gender difference in the performance.”
Passage 2 (Human Factors): “As for the topic, girls may hold more inter-
est in it. However, I think boys would do better in such task type using
graphs. As a result there may not be any difference between them. Anyway, I
believe boys are good at such mathematical tasks.”
Passage 3 (Soccer Playoffs): “The topic is soccer and I don’t think there is
any gender difference in their interest and knowledge about it as far as high
school students are concerned. Rather than the topic, the knowledge about
this table may make the difference. However, again, I think this table is quite
familiar to both genders since we use it often in school ball games or other
club activities.”
Passage 4 (Transportation): “The topic is about transportation and I think
boys can do better here. Boys like vehicles, machines etc., don’t they? I think
boys definitely like and do well in passages about sports, transportation, an-
imation, and pop idols, while girls are good at fashion, songs, singers, mov-
ies, cooking, girls’comics and trendy dramas. If they are asked to read a pas-
sage about fashion, boys wouldn’t understand much.”

This male teacher’s perception of domains of interest and experience corre-
sponds to the overall tendency of male teachers to ascribe knowledge and abilities
to female and male students differentially. Interestingly, two of the passages, Hu-
man Factors and Transportation, are considered in the male domain of interest,
while the Soccer Playoff Schedule is deemed less likely to trigger schematic bias.
This testimony concurs with the statistical tendency of Japanese male teachers to
assume that differences in schematic content trigger bias even when no such bias is
detected objectively. Oddly, this teacher does not predict that the one passage (Soc-
cer Playoffs) with objective evidence of gender bias would in fact yield any.
Teacher A: About Teachers’ Awareness


“I think teachers’awareness about gender difference varies according to
the generation. In general, older generations (50s, 60s) have a clearer and
very often wrong image of gender difference among students.”

This younger male teacher, while adhering to the pattern of projected bias about
the passages, seems aware that Japanese teachers in general entertain the notion that
male–female differences in schematic background knowledge actually exist. Interest-
ingly, no account of the source of such putative differences, whether natural or
constructed through socialization processes, is given.

Teacher B (mid-40s, male)


Overall impression and belief
“In general, I didn’t notice gender performance differences in different
passages, apart from the fact that girls do better than boys on English tests
as a whole. If I dare raise an example of gender difference, girls might do
particularly better when it’s about fashion or designing. These topics are
very unfamiliar to boys. As for the task types, I don’t think multiple choice
type would cause any gender difference in the performance. Certainly girls
tend to perform much better than boys in essay type or self-expressive types
of tasks. In this sense, this kind of paper test may be easier for boys to show
their knowledge or ability. Performance-based types of exams such as essays
or interviews would reveal the gender difference far more dramatically.
Girls certainly are said to be weak at map reading or space/direction recog-
nition. However, paper tests wouldn’t go that far to reveal the difference be-
tween genders. Similarly, boys are said to be good at logical structure, but an
English test in multiple choice format wouldn’t be appropriate to prove it.”

The account provided by this mid-career male teacher does not specifically
nominate any particular biased passages; rather, he contends that gender differ-
ences are pervasive. This interpretation matches the overall pattern of these Japa-
nese male teachers seeing ubiquitous gender differences. Interestingly, this ac-
count contends that the modality of the test serves to remove gender
differences—that the multiple choice format neutralizes the potential for bias. This
account conflicts with meta-analyses of test format (Willingham & Cole, 1997),
which have found that multiple-choice tends to favor male test candidates. The as-
sertion here is that multiple choice format removes the “natural” advantage in lan-
guage ability that female students are assumed to possess.

Teacher B’s Analysis of Four Test Passages


“I felt that passage[s] 2 and 4 might show some gender differences. Pas-
sage 2 is about shapes and this is about math knowledge. Passage 4 is about
transportation and it is a sociological topic. The former requires rather par-
ticular vocabulary such as ‘triangle’, which is not very frequent in the nor-
mal English textbooks. So, boys who had read something related to math in
English could answer this much more easily. Similarly passage 4 (Transpor-
tation) has some specific vocabulary and I think it is more familiar to boys.
Passage 1 (Letter Rotation) is about angles and directions. If one can under-
stand what these pictures mean, then it’s easy to answer. Passage 3’s topic is
soccer, but the point of this passage is this table. So I don’t think this is diffi-
cult for those not interested in sports. Certainly this table may require some
attention, but this is a very common type of table and we see it on TV daily. In
the end, these 4 passages don’t require any special mathematical knowledge.
The only difference is in Passage 2 (Human Factors); this may require math-
ematical vocabulary to understand the context and it may cause gender dif-
ferences in the end.”

This male teacher’s account seems to contradict the statistical data. While
schema sampled in the passages may differ according to life experiences of male
and female students, the extent of that difference is not enough to trigger system-
atic bias in answering the test questions. It is odd therefore that male teach-
ers—both in-service and preservice teachers—have tended to assume there is a
systematic handicap for female students. The implied source of the bias is numeri-
cal reasoning, and since the four passages do not require much more than simple
arithmetic, no clear source of bias is identified. The one biased item (no. 13), nested
in the Soccer Playoff Schedule passage, is not referred to at all in the oral account.
The possible source of bias is said to be in male versus female domains of “inter-
est.” Students not interested in sports in general would be disadvantaged on a
sports-related topic.

Teacher B: About Teachers’ Awareness


“I think a teacher’s own common sense should be good enough to judge
any gender-bias in the exam. Certainly some teachers, especially older ones,
tend to have different gender images about students. But more importance
lies in whether the exam questions are about what they’ve learned in the cur-
riculum or not. This is because the gender difference would show very little
influence on the multiple-choice tests.”

This teacher does not expect that male and female teachers differ in their diagnos-
tic accuracy when it comes to gender bias on tests. Rather, the source of differences
may emerge as a consequence of teachers’ age and experience. Presumably, gender
differences observed over a career in the classroom serve to reinforce expectations
about what female students are likely to be familiar with. However, if the content is
part of the taught curriculum, the potential for bias is considered minimized, espe-
cially when the test format required selection from alternative answers.

Teacher C (mid-50s, male)


Overall impression and belief
“Overall, female students perform always better than males on language
tests. Rather than the gender, whether s/he knows the topic or not would af-
fect the performance. Language ability cannot simply be the pure knowledge
of the language. It cannot be separated from the content knowledge or dis-
course knowledge.”

Teacher C asserts that gender differences favor female students, and observed
differences are not necessarily the consequence of biased items. The account itself
seems to assume that language, content, and discourse are inseparable elements,
and somehow female students prevail. This account does not correspond well with
the objective data, were male teachers tend to assume that content sampling serves
to handicap female test candidates, who otherwise would enjoy an advantage in the
foreign language domain.

Teacher C’s Analysis of Four Test Passages


“Since our school has more female than male students, we EFL teach-
ers naturally choose female-oriented topics or tasks such as fashion or
movies. We sometimes feel uncomfortable to select topics such as baseball
with them. Here the topic (of Passage 3) is soccer, and I think in this case
the topic would hinder the female students’ performance, although this
passage can be solved without knowing about soccer. Similarly, passage 4,
which is about transportation, may be preferred by male students. As for
the task types, this multiple-choice type is easy for both male and female
students. Female students are particularly good at essay type questions.
Boys are somehow always bad at expressing themselves using words. So
the multiple-choice tasks wouldn’t reveal gender difference clearly. As for
Passages 1 and 2, I don’t think there is any difference between boys and
girls. These are quite gender-neutral and you need only your common
sense to understand the questions.”

Once again the assumption is that the multiple choice method of testing serves
to reduce the potential natural advantage that female students would enjoy. Perfor-
mance assessments, in contrast, are thought to favor female students. What is strik-
ing about this account is the assumption that language advantages for female for-
eign language learners are natural, and not the consequences of differential
streaming, reinforcement, or social engineering.
Teacher C: About Teachers’ Awareness


“Sometimes students show interest in the topic that seems very tough or
unfamiliar for them. So it is all up to the teachers to make the material inter-
esting and thought-provoking. Even though some topics, for example, fash-
ion, are not very familiar to males, teachers can still make it interesting and
enjoyable.”

SUMMARY OF TEACHER ACCOUNTS

When these three teachers were asked for their overall impression and their own
opinion about gender differences, their first answer was always similar: they did not
think there was a significant gender difference between girls and boys in the four pas-
sages. Girls are just better at overall performance on English exams. However, as
the interviews went on and the subjects were asked about individual passages, these male
teachers revealed their ideas about the sources of gender differences.
Because these high school teachers find gender differences according to the
method of testing quite obvious, they do not seem to pay much attention to the dif-
ferences shown according to the topical domain of each passage. The overall pat-
tern suggests that hermeneutic assessment of bias issues is susceptible to hyper-
sensitivity. These male teachers don’t construe possible schematic differences as
products of differential socialization practices, but apparently tend to over-gener-
alize them as natural categories of gender differences.

CONCORDANCE ANALYSES

After the subjective and objective analyses of bias on the 20 test items were com-
piled, a direct comparison was undertaken. The objective and subjective probability
of bias estimates were converted into effect sizes. This conversion to a standard met-
ric allows for a direct comparison between the objective and subjective estimates of
bias. In the subjective and objective estimates of item bias, different indicators of sta-
tistical significance were used. To make all indicators directly comparable, an effect
size conversion (Shadish, Robinson, & Lu, 1999) yielded a single indicator to facili-
tate direct comparisons. Zero effect size indicates an estimate of no bias. Negative
valence on the effect size estimates indicates bias thought to favor males.
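One way such a conversion can be carried out is sketched below; the exact routine in the ES program of Shadish, Robinson, and Lu (1999) may differ, so the mapping from a two-tailed probability through a standard normal deviate to a d-type effect size should be read as an assumption for illustration.

```python
# Illustrative p-to-effect-size conversion (the published ES routine may differ in detail).
import numpy as np
from scipy.stats import norm

def p_to_d(p_two_tailed, favors_males, n_focal, n_reference):
    z = norm.isf(p_two_tailed / 2)                 # |z| corresponding to the two-tailed p
    z = -z if favors_males else z                  # negative valence = bias favoring males
    r = z / np.sqrt(n_focal + n_reference)         # z to a correlation-type effect
    return 2 * r / np.sqrt(1 - r ** 2)             # r to Cohen's d

# Example: the logistic regression probability for Soccer 13 with the study's group sizes.
print(round(p_to_d(0.036, favors_males=True, n_focal=468, n_reference=357), 3))
```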
TABLE 6
Effect Sizes for Objective and Subjective Bias Estimates

                                                           Novice                 Expert
Item          MH Delta   SIB       Logistic Regression     Males      Females     Males      Females

Letters1      0.002      0.103     0.018                   –0.384     0.076       –0.216     0.56
Letters2      –0.009     –0.007    –0.021                  –0.561     0.167       0.137      0.613
Letters3      0.09       0.0087    0.105                   –0.711     0.155       –0.218     0.294
Letters4      –0.028     –0.035    –0.041                  0.0        –0.275      –0.312     0.417
Letters5      0.003      0.001     0.02                    0.0        0.14        0.312      0.289
Visuals6      0.06       0.153     0.028                   –0.711     –0.1        –0.94      –0.294
Visuals7      0.096      0.074     0.09                    –0.561     0.558       –0.623     0.613
Visuals8      0.064      0.094     0.075                   0.851      0.696       –1.04      –1.18
Visuals9      0.026      0.068     0.002                   0.0        –0.071      –0.833     –0.202
Visuals10     0.082      0.113     0.064                   –0.271     –0.196      –0.193     0.156
Soccer11      0.002      0.031     0.003                   –0.876     –0.568      –1.11      –1.33
Soccer12      0.011      0.02      0.017                   –1.11      –0.696      –1.11      –1.33
Soccer13      –0.23      –0.147    –0.264                  –1.45      –0.696      –1.11      –1.33
Soccer14      0.063      0.027     0.057                   –1.24      –0.696      –1.11      –1.33
Soccer15      0.05       0.052     0.038                   –1.2       –0.549      –1.11      –1.33
Transprt16    0.035      0.116     0.015                   –0.529     –0.123      –1.11      –0.52
Transprt17    0.06       0.049     0.071                   –0.167     0.148       –0.85      –0.469
Transprt18    0.074      –0.139    0.043                   –0.529     –0.32       –1.11      –0.357
Transprt19    0.049      0.12      0.049                   –0.529     –0.298      –0.966     –0.249
Transprt20    0.014      0.044     0.036                   –0.529     0.0         –0.966     –0.322

Note. Items in bold represent biased test items. MH = Mantel–Haenszel; SIB = simultaneous item
bias.

As Table 6 suggests, the subjective estimates of bias produce larger effect indicators of bias
than do the objective methods. For Soccer 13, the item detected by the objective methods to
produce a bias favoring males, the effect size of the bias is in the small effect range (Cohen,
1988). The subjective diagnostics of bias, in contrast, are mainly in the large effect range
(± .80). While the one authentically biased item gets concordant agreement between the
subjective and objective approaches, the magnitude of the bias estimation is disproportionately
large in the subjective estimation.
The subjective judgments of item bias apparently differ according to the experi-
ence and gender of the preservice and in-service teachers in this study. To examine
whether there is a conditional proclivity to identify bias subjectively, a factorial
analysis of variance was performed on the individual test items’ mean effect size as
the dependent variable and dummy codes for rater experience and gender as inde-
pendent variables. As Figure 1 indicates, there is no significant interaction be-
tween experience and the raters’ own gender.
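A minimal sketch of this 2 × 2 factorial analysis, run on simulated rater data in place of the survey responses, is given below; the column names are illustrative.

```python
# Illustrative 2 x 2 ANOVA of rater experience and rater sex on mean item effect sizes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
raters = pd.DataFrame({
    "experience": rng.integers(0, 2, 97),          # 0 = preservice, 1 = in-service
    "sex": rng.integers(1, 3, 97),                 # 1 = male, 2 = female
})
# Each rater's mean effect size across the 20 items (negative = perceived male advantage).
raters["effect_size"] = rng.normal(-0.3, 0.4, 97)

model = smf.ols("effect_size ~ C(experience) * C(sex)", data=raters).fit()
print(anova_lm(model, typ=2))                      # main effects and the interaction term
```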
Both experienced and nonexperienced male EFL teachers tend to rate the test
items as being biased in favor of male test candidates. The negative effect sizes
reflect bias towards male test candidates, while an effect size of zero would indi-
cate no anticipated bias. The factorial analysis of variance indicates that there is
a main effect for the gender of the teachers. The near-significance of experience
also suggests that preservice teachers in general tend to be less prone to assum-
ing that test items favor male test candidates more than their female counter-
parts. Table 7 lists the main effects and test of the interaction between experience and the
sex of the teachers.2

FIGURE 1 Interaction of experience and rater gender on bias effect sizes. Experience = 0 re-
fers to preservice EFL teachers; Experience = 1 refers to in-service teachers. Sex = 1 refers to
male teachers; Sex = 2 refers to female teachers.

TABLE 7
Analysis of Variance of Experience and Teacher Sex

Source                         df     F        p

Experience                     1      3.170    .079
Teacher sex                    1      9.635    .003
Experience × Teacher Sex       1      0.020    .888
Error                          76

Note. Dependent variable is the effect size.

2We have used “sex” to refer to teacher gender so as to distinguish it from the gender of the students
in the subjective and objective analyses.

COMPARATIVE CONCORDANCE ANALYSES

As a final comparison of the differences between subjective and objective approaches to bias
detection, separate concordance analyses were performed on the bias estimate effects. The three
objective methods of detecting bias on the test items show a strong concordance in agreement
about the lack of systematic gender bias on the 20 test items. Table 8 indicates a Concordance W
of .690 among objective DIF detection methods.

TABLE 8
Kendall’s Coefficient of Concordance W

Objective DIF Methods                  Subjective Bias Ratings

W = .690, p = .004, df = 19            W = .709, p = .000, df = 19

Note. DIF = differential item functioning.

As Siegel and Castellan (1988) noted, a concordance indicator does not necessarily provide
evidence that the agreement is unidirectional. This point is illustrated in the agreement among
the subjective raters of gender bias. Here, the Concordance W is comparably large (.709), but
indicates the opposite of the conclusion reached by employing objective analysis methods of bias
detection. The subjective ratings of gender bias are strongly in agreement about the
existence of bias in the test items, though even the three samples of narrative ac-
counts do not provide much consistent insight into why there is such hypersensitiv-
ity to the issue of gender differences.
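For readers who wish to reproduce this kind of agreement index, a minimal sketch of Kendall's W computed over a methods-by-items matrix of bias estimates follows; it applies no tie correction and is our illustration rather than the routine used in the study.

```python
# Illustrative Kendall's coefficient of concordance W over an m-methods x n-items matrix.
import numpy as np
from scipy.stats import rankdata, chi2

def kendalls_w(estimates):
    estimates = np.asarray(estimates, dtype=float)
    m, n = estimates.shape
    ranks = np.apply_along_axis(rankdata, 1, estimates)    # rank the items within each method
    rank_sums = ranks.sum(axis=0)
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    w = 12 * s / (m ** 2 * (n ** 3 - n))                   # no correction for tied ranks
    chi_sq = m * (n - 1) * w                               # approximate chi-square test of W
    return w, chi2.sf(chi_sq, df=n - 1)
```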

CONCLUSION

The three conventional empirical methods of estimating item bias via differential
item functioning show strong concordance. The three methods used in the analy-
sis, the simultaneous item bias approach, the Mantel-Haenszel Delta, and logistic
regression, were largely concordant in identifying that the majority of the 20 items
on the four test passages were not biased. A single biased item (Soccer 13) was
correctly detected by the three different objective methods.
The subjective ratings of bias suggested that novice and experienced teachers
tend to overestimate the extent of gender bias. The tendency is to see the sche-
matic domain of whole passages as likely to induce bias. The male teachers sam-
pled may simply confuse potential sensitivity issues with actual test bias. The
finding that both novice and experienced male raters identified phantom bias
also suggests a possible stereotypical assumption about knowledge domains
thought to favor males versus female test takers. The lack of empirical evidence
of item bias on 19 out of 20 experimental test items further suggests that female
Japanese test candidates are more familiar with particular schematic domains
than their male teachers might give them credit for.
The reality is that high stakes language tests in Japan are rarely screened empirically for item bias. Current practice calls for moderation panels to conduct sensitivity reviews of draft test materials. Because many of these panels are made up predominantly of males, the findings of this study imply a potential for the needless omission of unbiased test material. A large false positive error rate is likely to render such hermeneutic assessments of bias inefficient in terms of cost-utility criteria. This conclusion follows from the observation that for many high stakes admissions exams, the test passages and items are crafted at high cost by sequestered test construction committees. When possible sensitivity is confused with authentic bias in candidate test items, unbiased items, and often whole passages, are omitted despite their potential validity. The eventual homogenization of test passages may even work against content validity, in that authentic texts representing the wider domain of real world usage are more likely to be excluded in subjective sensitivity reviews. Because empirical counterevidence is rarely used to correct the tendency to equate sensitivity with bias, the exclusive use of subjective moderation panels on these high stakes foreign language examinations produces a high false positive error rate and, in turn, considerable resources wasted in drafting test passages.
The implication of this study is that, in addition to moderation reviews, empirical item bias analyses should be conducted on high stakes foreign language tests used for university admissions. Faulty items that slip through the moderation process can then be identified and omitted from the test before scoring is finalized. Given the widespread availability of item bias detection software, this approach is likely to prove, in the long run, the most cost-effective and valid way of removing biased items from high stakes language tests.

ACKNOWLEDGMENTS

This study was supported by a Kaken grant from the Japanese Ministry of Education and Science.
We thank Sugiyama Naoto, Tom Robb, Ishikawa Tomohito, Isono Morihiko,
Ozawa Masato, and Ishihara Satoru for their assistance in data collection.


APPENDIX
Gender Bias Survey

There is no need to provide answers to the test questions. Please inspect the test passages, graphs, and charts, and then rate the extent to which male or female students would find each test question differentially difficult.

Rating Scale:
Rate –3 if you think male students would be highly advantaged
Rate –2 if you think male students would be moderately advantaged
Rate –1 if you think male students would be slightly advantaged
Rate 0 if you think there is no differential advantage
Rate 1 if you think female students would be slightly advantaged
Rate 2 if you think female students would be moderately advantaged
Rate 3 if you think that female students would be highly advantaged

Please be sure to rate each of the 20 test questions.

About You
Please complete the survey

21) Your Gender:


1 Male 2 Female

22) Current Occupation:



Pre-Service
Pre-Service Graduate
College Faculty
In-Service High School Teacher
Other

23) In what ways do you consider language test content to give differential ad-
vantages to either male or female high school students? Please answer
freely.
