
This article was downloaded by: [University of Cambridge]

On: 18 December 2014, At: 02:15


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

The Journal of Economic Education


Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/vece20

Differential Item Functioning and Male-Female Differences on Multiple-Choice Tests in Economics

William B. Walstad & Denise Robson
Published online: 25 Mar 2010.

To cite this article: William B. Walstad & Denise Robson (1997) Differential Item Functioning
and Male-Female Differences on Multiple-Choice Tests in Economics, The Journal of Economic
Education, 28:2, 155-171

To link to this article: http://dx.doi.org/10.1080/00220489709595917


Taylor & Francis makes every effort to ensure the accuracy of all the information
(the “Content”) contained in the publications on our platform. However, Taylor
& Francis, our agents, and our licensors make no representations or warranties
whatsoever as to the accuracy, completeness, or suitability for any purpose of the
Content. Any opinions and views expressed in this publication are the opinions and
views of the authors, and are not the views of or endorsed by Taylor & Francis. The
accuracy of the Content should not be relied upon and should be independently
verified with primary sources of information. Taylor and Francis shall not be liable
for any losses, actions, claims, proceedings, demands, costs, expenses, damages,
and other liabilities whatsoever or howsoever caused arising directly or indirectly in
connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any
substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing,
systematic supply, or distribution in any form to anyone is expressly forbidden.
Terms & Conditions of access and use can be found at http://www.tandfonline.com/
page/terms-and-conditions
Differential Item Functioning
and Male-Female Differences on
Multiple-Choice Tests in Economics
William B. Walstad and Denise Robson

Differences in the understanding of economics by males and females have
long been studied in economic education. Research generally shows that test
scores in economics are higher for males than females at the high school and
college levels (Siegfried 1979). These gender differences in economic
understanding are worth investigating because differences in test scores
affect course grades
and student attitudes. They also signal to students possible areas of comparative
advantage that may ultimately influence the choice of college major or career. In
addition, test scores shape teachers’ perceptions of students’ abilities and the
instructional strategies they use in the classroom. Researchers also use test scores
to assess the effectiveness of educational programs and their distribution of ben-
efits to students.
Several reasons have been offered to explain the gender differences on tests of
economic achievement. Lower test scores in economics for females have been
attributed to social and cultural influences that create sex-role stereotypes that
reduce female interest and achievement in a traditionally male-dominated subject
such as economics (Ladd 1977; Jackstadt and Grootaert 1980). A second expla-
nation considers the possibility that cognitive differences between males and fe-
males, such as mathematical, spatial, or verbal skills, may result in performance
differences on economics tests (Williams, Waldauer, and Duggal 1992; Ander-
son, Benjamin, and Fuss 1994; Hirschfeld, Moore, and Brown 1995). A third rea-
son focuses on instructional differences that may limit the economic understand-
ing of females. Among these are a “chilly” classroom climate for women, biased
educational materials, and poor teacher role models (Ferber 1990; Horvath,
Beaudin, and Wright 1992). Finally, the fixed or constructed-response format of
an economics test may influence test results. Several studies have reported that
females do relatively worse on multiple-choice tests and relatively better (or at
least the same) on essay tests (Ferber, Birnbaum, and Green 1983; Lumsden and
Scott 1987).
In this study, we offer another explanation for the gender difference in test
scores when multiple-choice tests are used as the measurement instrument: the

William B. Walstad is a professor of economics and director, National Center for Research in
Economic Education, University of Nebraska-Lincoln. Denise Robson is an assistant professor of
economics, University of Wisconsin-Oshkosh. The authors appreciated the helpful comments from
Peter Kennedy and an anonymous referee.

Spring 1997 155


differential functioning of test items. Conclusions from previous research in eco-
nomic education are based largely on studies of gender differences in the overall
test scores, usually standardized multiple-choice tests. Most of these tests con-
tain about 30 to 50 test items and cover a range of economic concepts. It is quite
likely that only certain test items are producing the significant differences in the
economic understanding of males and females reflected in the total score. Elim-
ination of the biased items from a multiple-choice test may eliminate the gender
difference in economics achievement in the total scores. Even if item bias is not
the only factor explaining the gender difference in multiple-choice test scores,
the presence of item bias will lead to overestimates of the differences in test
scores and distort the interpretation of research findings.1
The problem can be demonstrated by examining gender differences in scores
on the Test of Economic Literacy (TEL) (Soper and Walstad 1987). This test was
selected because it has been used in many research studies at the high school
level, but it has never been statistically analyzed for possible gender bias in item
performance.2 Of those who participated in the national norming of the TEL,
males had a mean score of 22.60 questions correct out of 46 TEL questions, and
females had a mean score of 21.51 questions correct. Although the gender dif-
ference was slight (1.09 questions), it was statistically significant, and the differ-
ence is important because conclusions about the effectiveness of programs are
often based on TEL differences of about this size.3 If only certain items account
for this difference, however, then conclusions reported in previous research may
be overstated. The discussion that follows explains how potentially biased items
on the TEL, or other standardized multiple-choice tests used in economics, can
be identified and how future tests or revisions can address this problem.
We conducted the study in two stages. First, we analyzed TEL item data to
identify how well the items worked for males and females after controlling for
estimates of economic ability. This analysis used item response theory (IRT) to
obtain estimates of item characteristics (e.g., difficulty, discrimination, pseudo-
guessing) and economic ability. IRT methods were used to identify items with
large differences in male and female performance after controlling for economic
ability. In the second stage, we removed biased items from the original TEL. The
gender differences in overall scores were then re-assessed based on the modified
TEL scores. Also, we used the modified TEL to reestimate a regression equation
used in a previous study that showed differences in the achievement of males and
females. The empirical results showed a significant decrease in gender difference
in either the group mean comparisons or the effect of a sex variable in a regres-
sion equation when the modified TEL score was used for the analysis, but the dif-
ference was still present.

DIF AND ITEM RESPONSE THEORY


Test developers define an unbiased item as one for which the probability of a
correct response is the same for all persons of a given ability level, regardless of
sex, race, socioeconomic status, or other group variable that may be of interest
(Shepard 1982). Not all items are biased, however, just because there are group
differences. Other factors (e.g., socialization, cognitive differences, instruction,
or test format), may cause different groups of the same ability to perform differ-
ently on a test item. Nevertheless, some group difference on an item may be
caused by item bias, or what is more correctly described as differential item
functioning (DIF) (Holland and Wainer 1993). The DIF term removes the more pejo-
rative bias from the discussion and suggests that items may work for different
groups in positive and negative ways across the ability spectrum. DIF measure-
ment can be used to reduce this source of test invalidity and allows researchers
to concentrate on the other explanations for group differences in test scores.
IRT is used to identify DIF items. This theory specifies the relationship be-
tween the observable examinee test performance and the unobservable traits or
abilities assumed to underlie performance on the test. This relationship is de-
scribed by a monotonically increasing function called an item characteristic
curve (ICC). The function can be graphed in an S-shaped curve with ability on
the horizontal axis and the probability of a correct response at a given ability
level on the vertical axis. The graph shows that as ability level increases, the
probability of correctly responding to an item will increase. Item bias, or DIF,
can be illustrated by plotting the ICCs for each group (e.g., males and females)
on the same graph. Unbiased, or non-DIF, items will have ICCs for the two
groups that substantially overlap and have the same basic shape across the abili-
ty spectrum. DIF items will have group ICCs that cross and substantially differ
in shape. The calculation of the area between the group ICCs provides a precise
measure of the degree of DIF.
The most common IRT model for the analysis of multiple-choice data is a
three-parameter model that describes three characteristics of test items: diffi-
culty, discrimination, and chance or pseudo-guessing (Birnbaum 1968; Lord
1980). The ICC for the three-parameter model is given by the equation

P_i(θ) = c_i + (1 − c_i) / [1 + e^(−D·a_i·(θ − b_i))],    (1)

where
P_i(θ) is the probability that a student with ability θ answers item i correctly
b_i is the difficulty parameter of item i
n is the number of items in the test
e is the transcendental number 2.718
D is a scaling parameter equaling 1.702
a_i is the discrimination parameter for item i and
c_i is the pseudo-chance parameter for item i.

The three-item parameters mean that the ICCs will differ in location (difficul-
ty), slope, and lower asymptote. The location or difficulty parameter b is the point
on the ability scale where the slope of the ICC is maximized. The greater the b
parameter, the harder the item.4 The a parameter is proportional to the slope of the
ICC at the b location on the ability scale. Items with steeper slopes are better dis-
criminators than those with less slope. The c parameter provides a nonzero lower



asymptote for the ICC, giving low-ability students a nonzero probability of get-
ting an item correct. The c parameter is sometimes mistakenly called a guessing
parameter, but it is typically lower than one would expect from random guessing,
possibly because of the attractiveness of one or more distractors for the item. D is
a scaling factor that makes the logistic function approximate the normal ogive
function. When D equals 1.702, values of P(θ) for the normal ogive and the logis-
tic model differ in absolute value by less than 0.01 for all values of θ. The scaling
factor is used to take advantage of the relationship between the logistic ogive and
the normal ogive. Any property derived by a logistic ogive should be approxi-
mately true for the normal ogive, but the logistic ogive does not require integra-
tion, so it is easier to evaluate (Hambleton and Swaminathan 1985).
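As an illustration, the three-parameter ICC in equation (1) can be evaluated directly. This is a minimal sketch; the item parameters below are hypothetical, chosen only to show the curve's behavior, and are not TEL estimates:

```python
import math

def icc_3pl(theta, a, b, c, D=1.702):
    """Three-parameter logistic ICC (equation 1): probability that a
    student with ability theta answers the item correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: discrimination a = 1.0, difficulty b = 0.5,
# pseudo-chance lower asymptote c = 0.20.
p_low = icc_3pl(-3.0, a=1.0, b=0.5, c=0.20)   # near the floor c
p_mid = icc_3pl(0.5, a=1.0, b=0.5, c=0.20)    # at theta = b: c + (1 - c)/2
p_high = icc_3pl(3.0, a=1.0, b=0.5, c=0.20)   # approaching 1
```

At θ = b the probability is exactly c + (1 − c)/2 (.60 here), and the curve is steepest at that point, which is why the b location anchors the discrimination parameter.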
Several desirable measurement features are obtained from IRT models. First,
estimates of the item characteristics or parameters (difficulty, discrimination,
chance or pseudo-guessing) are not dependent on the sample to which the items
were administered. Item parameter estimates obtained from different samples of
students will be the same, except for sampling error. Second, estimates of student
ability are not dependent on the test given. Ability estimates for students obtained
from different sets of test items will be the same, except for sampling error. IRT
estimation produces item and ability parameter estimates that are said to be
invariant. This invariance is achieved by incorporating information about the
items into the estimation of ability and information about the students' ability
into the estimation of item parameters.
Although other methods are available for identifying DIF, IRT is preferred
because each ICC is unique, and, except for random variations, the same curve is
found irrespective of the nature of the group for which the function is plotted. The
three-parameter ICC is a comprehensive item-response model that accounts for dif-
ferences between groups not only in terms of ability but also in terms of discrimi-
nating power and differences in the pseudo-guessing parameter. These latter two
differences are usually ignored by other methods (Holland and Wainer 1993).5

IRT ESTIMATION

The BILOG computer program was used to estimate ability (θ) and the item
parameters (Mislevy and Bock 1990). This program maximizes the probability of
obtaining the observed data when using a marginal maximum likelihood (MML)
approach. The marginal probability, or the probability of a correct response to an
item for a student who has been randomly selected from a population with a
distribution of ability g(θ), is ∫P(θ)g(θ)dθ. If a sample of N students is selected,
the corresponding marginal likelihood function for the observed data is

L(a, b, c | U) = Π_(j=1)^N ∫ Π_(i=1)^n P_i(θ)^(u_ij) Q_i(θ)^(1−u_ij) g(θ) dθ,    (2)

where
P_i(θ) is the probability of a correct response to item i
Q_i(θ) is 1 − P_i(θ)
u_ij is the observed response to item i by student j (coded 0 if wrong, 1
if right)
g(θ) is the distribution of ability levels, estimated in conjunction with
the parameter estimates
U is the matrix of observed item responses of all students to all items
and
a, b, c are vectors of item parameters, one triple (a_i, b_i, c_i) for each of the n
items (Mislevy and Stocking 1989; Mislevy and Bock 1990).

The marginal maximum likelihood (MML) method begins with use of the test
score to approximate the distribution of ability, g(θ), for a randomly selected
group of examinees. The initial ability distribution is used to integrate the abili-
ty parameters out of the maximum likelihood function, so the values of the item
parameters can be found. These parameter estimates are treated as if they are
equal to their true values. BILOG produces maximum likelihood estimates of θ
using the equation

L(θ | a, b, c, U) = Π_(j=1)^N Π_(i=1)^n P_i(θ_j)^(u_ij) Q_i(θ_j)^(1−u_ij) g(θ),    (3)

where i, j, a, b, c, and U are as defined in equation (2);

P_i(θ_j) is the probability of a correct response to item i by student j, obtained
from equation (1)
Q_i(θ_j) is 1 − P_i(θ_j)
θ is the vector of known examinee abilities, one for each student, but
because the true θs are unknown, the ability distribution is used (Mis-
levy and Stocking 1989, 58-60).

These new ability estimates were used to refine g(θ) in equation (2), and the item
parameters were reestimated. We continued this iterative process until the values
of the estimates did not change between two successive stages (Hambleton,
Swaminathan, and Rogers 1991).
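The marginal probability ∫P(θ)g(θ)dθ that underlies equation (2) can be checked numerically. This sketch assumes a standard normal ability distribution and illustrative (non-TEL) item parameters, with a simple trapezoidal rule standing in for the quadrature BILOG uses:

```python
import math

def icc_3pl(theta, a, b, c, D=1.702):
    # Three-parameter logistic ICC from equation (1).
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def marginal_p_correct(a, b, c, lo=-6.0, hi=6.0, n=2001):
    """Probability that a randomly selected student answers the item
    correctly: P(theta) integrated against a N(0, 1) ability density
    g(theta), using the trapezoidal rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        theta = lo + k * h
        g = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        w = 0.5 if k in (0, n - 1) else 1.0
        total += w * icc_3pl(theta, a, b, c) * g
    return total * h

# For b = 0 the ICC is symmetric about average ability, so the
# marginal probability is exactly c + (1 - c)/2 = 0.60.
p_marg = marginal_p_correct(a=1.0, b=0.0, c=0.20)
```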
IRT requires a reasonable fit between the model and the test data before esti-
mation. The key assumption of IRT is unidimensionality, which means that only
one construct (or ability) is being measured by the items on the test. One com-
mon method for assessing unidimensionality is factor analysis of the item data
(Hambleton and Swaminathan 1985). A factor is a latent trait (e.g., ability) that
is being measured by the test. Eigenvalues from the factor analysis indicate how
much variance is accounted for by each factor. The ratio of the eigenvalue of the
first factor and the eigenvalue of the second factor is calculated to assess the
dominance of the first factor in the data set. If the ratio is high, then this result
provides evidence that the test is unidimensional and essentially one factor (abil-
ity) is being estimated in an IRT model.
Principal component factor analysis was performed on the TEL item data for
males and females separately and for the total sample. Factor loadings were
similar for all groups, so only the overall analysis is reported. The first factor
accounted for about 15 percent of the variance in the test data and had an eigen-
value of 7.06. The eigenvalue of the first factor was 4.58 times greater than the
second eigenvalue, and the ratio of the first factor to all other factors was even
greater. Clearly, the first factor was dominant and at least 4.58 times greater than
any other factor. The TEL data appeared to meet the unidimensionality assump-
tion for IRT estimation.
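The eigenvalue-ratio check for unidimensionality can be sketched as follows. The binary responses here are simulated from a single latent ability (synthetic data, not the TEL responses), so the first principal component should dominate:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2,000 simulated students answering 46 binary items, all driven by
# one latent ability plus item-specific noise (illustrative only).
n_students, n_items = 2000, 46
theta = rng.standard_normal(n_students)
loadings = rng.uniform(0.5, 1.5, n_items)
logits = np.outer(theta, loadings) + rng.standard_normal((n_students, n_items))
responses = (logits > 0).astype(float)

# Principal-component eigenvalues of the inter-item correlation matrix,
# sorted from largest to smallest.
eigvals = np.linalg.eigvalsh(np.corrcoef(responses.T))[::-1]
ratio = eigvals[0] / eigvals[1]   # dominance of the first factor
```

A first-to-second eigenvalue ratio well above one, as with the TEL's 4.58, is taken as evidence that essentially one ability is being measured.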

DIF IDENTIFICATION
The assessment of DIF first involves the estimation of the item parameters (a,
b, c) for each group (males and females) so that they can be placed on the same
θ (ability) scale. The estimates of item difficulty (b parameter) and ability are
expressed as logits (log odds units). The scale for ability is normally distributed
with a mean of zero and a variance of one.
As defined previously, DIF exists if students with equal ability but from dif-
ferent groups have an unequal probability of answering an item correctly. DIF
can be illustrated by plotting the ICCs for males and females. Figure 1 shows the
ICCs for TEL item 40. This item does not indicate DIF because the two curves
are almost identical throughout the ability range. At an average level of ability,
both males and females have about a .44 probability of getting this item correct.
By contrast, Figure 2 presents the ICCs for TEL item 43 and clearly shows poten-
tial item bias. The male ICC lies above the female ICC throughout the ability
range. At an average ability level, males have about a .30 probability of a correct
response, compared with only a .24 probability for females.
The calculation of the area between ICCs for each group provides a measure
of the degree of DIF in an item. The area measure can be signed, or unsigned, by
taking the absolute value.6 The signed measure indicates the direction of the DIF.
A positive value would mean that the male ICC is above the female ICC. For
example, the signed DIF for item 43 is 0.626 (Figure 2). A negative value would
mean that the female ICC is above the male ICC (Figure 3). Item 22 has a signed
area of -0.319. The advantage of the unsigned over the signed area is that when
the DIF is not uniform (i.e., low-ability females do better than low-ability males,
but high-ability males do better than high-ability females), a signed measure may
not indicate DIF because the positive and negative areas would cancel each other
out. This effect is found in TEL item 44, which has a signed DIF of 0.128 and an
unsigned DIF of 0.549. These DIF values indicate that the ICCs for item 44 cross
and differ in shape (Figure 4).7 In general, the signed area determines the direc-
tion of the DIF, and the unsigned area indicates the strength of the DIF.
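The signed and unsigned area measures can be sketched with numerical integration over the plotted ability range. The parameter pairs below are hypothetical, constructed to show uniform and nonuniform DIF rather than to reproduce any TEL item:

```python
import math

def icc_3pl(theta, a, b, c, D=1.702):
    # Three-parameter logistic ICC from equation (1).
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def dif_areas(params_m, params_f, lo=-4.0, hi=4.0, n=4001):
    """Signed and unsigned area between the male and female ICCs by the
    trapezoidal rule; positive signed area means the male curve lies
    above the female curve on balance."""
    h = (hi - lo) / (n - 1)
    signed = unsigned = 0.0
    for k in range(n):
        theta = lo + k * h
        d = icc_3pl(theta, *params_m) - icc_3pl(theta, *params_f)
        w = 0.5 if k in (0, n - 1) else 1.0
        signed += w * d * h
        unsigned += w * abs(d) * h
    return signed, unsigned

# Uniform DIF: same shape, item simply easier for males (b shifted).
s_uni, u_uni = dif_areas((1.0, -0.2, 0.2), (1.0, 0.3, 0.2))
# Nonuniform DIF: curves cross at b, so signed areas largely cancel.
s_non, u_non = dif_areas((1.6, 0.0, 0.2), (0.6, 0.0, 0.2))
```

For the uniform item the signed and unsigned areas agree; for the crossing item the signed area collapses toward zero while the unsigned area stays large, the pattern TEL item 44 shows (0.128 signed versus 0.549 unsigned).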

DIF AND THE TEL


The cutoffs for classifying items as DIF and omitting them from a test are ulti-
mately decided by the test developer. There are tradeoffs in this decision. If a
conservative cutoff is used, too few items may be omitted for the test to be con-
sidered unbiased. Conversely, a liberal cutoff may identify too many items, and
[FIGURE 1. Question 40: ICCs for males and females, plotting the probability of a correct response against ability (−4 to 4).]

their removal would destroy a test's validity and reliability and make test devel-
opment more costly. Cutoffs generally range from a conservative .70 to a lib-
eral .45.8 For the TEL, the conservative cutoff of .70 flagged one DIF item (32).
The liberal cutoff of .45 identified 6 DIF items (1, 5, 22, 32, 43, 44), or 13
percent of the TEL. The text and correct answers for these 6 TEL items are
presented in the appendix.
A less arbitrary procedure for choosing a cutoff has been suggested by Oshima,
McGinty, and Flowers (1994). Using this procedure, we computed the area of
difference in ICCs for a pair of random samples of men for each of the 46 items.
These 46 values were used to estimate a normal distribution, and a two-tailed cut-
off value from this distribution for a .01 level of significance was recorded. This
procedure was repeated 28 times with different pairs of random samples of men.
The largest of the 28 resulting cutoff values was chosen to reflect statistical
significance.

[FIGURE 2. Question 43: ICCs for males and females, plotting the probability of a correct response against ability (−4 to 4).]

Remarkably, this cutoff of .45 was identical to the liberal cutoff. The items
flagged by the IRT method, using the .45 cutoff and the unsigned area as the
measure of DIF, are shown in Table 1.
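The resampling procedure can be sketched in outline. In actual use, each replication would estimate ICCs from two random samples of males and compute the 46 unsigned areas under the null of no DIF; here those null area values are simulated stand-ins, so only the cutoff logic (normal fit, two-tailed .01 point, maximum over 28 replications) is illustrated:

```python
import math
import random

random.seed(1)

def cutoff_from_null_areas(areas, z=2.576):
    """Two-tailed .01 cutoff from a normal distribution fitted to the
    null (same-group) unsigned DIF areas."""
    n = len(areas)
    mean = sum(areas) / n
    var = sum((x - mean) ** 2 for x in areas) / (n - 1)
    return mean + z * math.sqrt(var)

replication_cutoffs = []
for _ in range(28):
    # Stand-in for the 46 unsigned areas from one male-male comparison.
    null_areas = [abs(random.gauss(0.15, 0.08)) for _ in range(46)]
    replication_cutoffs.append(cutoff_from_null_areas(null_areas))

# The largest of the 28 replication cutoffs becomes the DIF threshold.
dif_cutoff = max(replication_cutoffs)
```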
The benefit of using the unsigned area to identify DIF items can be illustrated
with the data in Table 1. When DIF is not uniform across the ability scale, the
signed area will understate the degree of DIF because positive areas showing a
male advantage in one part of the ability scale is offset by negative areas show-
ing a female advantage in another part of the ability scale. In fact, the signed area
results would flag two fewer items (22, 44) for DIF when compared with the
unsigned area results.
Using the unsigned area to assess DIF is also more accurate than using the dif-
ficulty difference (Table 1, column 8) to judge items because the unsigned area
takes into account the ability factor. The data in Table 1 show that only 6 items
[FIGURE 3. Question 22: ICCs for males and females, plotting the probability of a correct response against ability (−4 to 4).]

(32, 43, 5, 1, 44, and 22) are flagged as DIF using the unsigned measure, or 13
percent of the test. These items, however, would not be the same ones that would
be flagged if a statistically significant difference (.05 level) in male-female per-
centage correct was the criterion for judging item bias. If that criterion was used,
23 items (1, 2, 5, 6, 9, 11, 13, 15, 17, 18, 20, 22, 23, 25, 26, 31, 32, 34, 39, 42,
43, 44, and 46), or half the test, would be flagged as potentially biased in column
8. When these 23 items were rank-ordered based on the size of the difficulty dif-
ference (largest to smallest), the relative position of the 6 DIF identified items
(32, 43, 5, 1, 44, and 22) varied markedly: items 5 and 43 held the top two posi-
tions; items 32 and 1 were ranked 7th and 8th; and items 44 and 22 were at the
bottom of this distribution (20th and 23rd, respectively). Even a more restrictive
.01 level of statistical significance flags 19 items, or 41 percent of the test. Sim-
ply looking at difficulty differences in the percentage correct for males and
[FIGURE 4. Question 44: ICCs for males and females, plotting the probability of a correct response against ability (−4 to 4).]

TABLE 1
TEL Items Flagged for DIF (IRT, unsigned-area cutoff = 0.453)

Item     Performs  Content    Cognition    Question    Signed   Unsigned  Difficulty
number   better    category   category     type        area     area      difference*

32       Male      Macro      Analysis     Textual      0.763    0.763     5.00
43       Male      Intl.      Analysis     Textual      0.626    0.626     9.00
5        Male      Fund.      Analysis     Numerical    0.618    0.618    12.00
1        Male      Fund.      Application  Textual      0.454    0.458     4.00
44       Male      Intl.      Analysis     Numerical    0.128    0.549     3.00
22       Female    Micro      Analysis     Textual     -0.319    0.553     3.00

*Difficulty difference is the percentage difference between males and females in getting the item correct.


females overstates possible item bias on the TEL and is an inaccurate indicator
of biased items.
Why females of similar ability to males performed worse on the five items and
better on the one item (Table 1) is difficult to explain or attribute solely to item
bias. A review of the text wording of these items revealed no language or stereo-
typical examples that would cause problems for females. Four of the five items
on which males did better, however, are categorized at the higher cognitive lev-
els of analysis in Bloom's taxonomy. Females did better than males on only one
of the five analytical items. Also, two of the three numerical items on the TEL
were flagged as DIF in favor of males. Thus, on the one hand, these results sup-
port the hypothesis that high school females may have more difficulty with ques-
tions involving numerical, spatial, or higher reasoning skills. On the other hand,
the results support the hypothesis that there may be item bias on the TEL that
needs to be corrected.

RESULTS FROM AN IRT-MODIFIED TEL


The means and standard deviations for the modified TEL of 40 items and the
original TEL of 46 items are reported in Table 2. Males scored 0.73 of a point
higher than females on the modified TEL, compared with a 1.09-point difference
on the original TEL. For students with and without economic instruction, the dif-
ference was 1.18 and 0.60 questions, respectively; and on the modified TEL, the
difference was 0.823 and 0.256 questions, respectively. The overall anti with-
economic-instruction differences were still statistically significant, but they were
a step closer to ensuring that the test is free of DIF items.
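The group-mean comparisons in Table 2 can be approximately reproduced from the summary statistics alone. This sketch uses the unequal-variance two-sample t formula, so small discrepancies from the published values may reflect a pooled-variance calculation:

```python
import math

def t_from_summary(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic from group means, standard deviations,
    and sample sizes (unequal-variance form)."""
    se = math.sqrt(s1 * s1 / n1 + s2 * s2 / n2)
    return (m1 - m2) / se

# Overall male-female comparison, original TEL (Table 2).
t_original = t_from_summary(22.601, 8.779, 2118, 21.511, 7.757, 2005)
# Overall male-female comparison, modified TEL.
t_modified = t_from_summary(19.985, 8.153, 2118, 19.258, 7.265, 2005)
```

Both values come out near the published 4.21 and 3.02, showing the gender gap shrinking but remaining significant after the DIF items are removed.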
The modified TEL also shows reasonable measurement properties. It is still a
reliable measure; the Cronbach alpha is .86 versus .87 for the original TEL.
Although the elimination of 13 percent of the TEL items may change the con-
tent validity of the test, the removal of items is distributed across all content cat-
egories: 2 of 12 fundamental items, 1 of 13 micro items, 1 of 13 macro items,
and 2 of 8 international items. The modified test still shows construct validity

TABLE 2
Original and Modified TEL Means, by Sex

                             Males                       Females
Variable              Mean     SD       N         Mean     SD       N        t

Original TEL
  Overall            22.601   8.779   2,118      21.511   7.757   2,005    4.21*
  With economics     23.960   8.942   1,506      22.776   7.895   1,390    3.76*
  Without economics  19.255   7.369     612      18.652   6.602     615    1.51
Modified TEL
  Overall            19.985   8.153   2,118      19.258   7.265   2,005    3.02*
  With economics     21.249   8.266   1,506      20.426   7.378   1,390    2.82*
  Without economics  16.874   6.954     612      16.618   6.250     615    0.68

*Significant at the .01 level.


because there is a statistically significant difference in the mean scores for stu-
dents with and without economics (4.38 for males and 3.81 for females). A case
can be made that the modified TEL is still a valid and reliable measure of gen-
eral economics achievement at the high school level, albeit with a reduced set
of questions.
If the modified TEL is considered to be a reasonable proxy for the original
TEL, it would be worthwhile to compare the regression results of the two mea-
sures. A basic regression equation from a study by Walstad and Soper (1989) was
selected for comparison. The regression model and rationale for the variables
included in the equation are extensively described in that study, so only a short
description of the variables is provided here.
In the regression equations, the dependent variable (original TEL score or IRT-
modified TEL score) was explained by personal characteristics, instructor
factors, school characteristics, and location variables. Personal characteristics
were captured by dummy variables for gender (MALE = 1), year in high school
(SENIOR = 1), and race (BLACK = 1). A set of three dummy variables was
included to control for type of course: an economics course (ECON = 1), a con-
sumer economics course (CONECON = 1), and a social studies course that
included economics (SSECON = 1), with a social studies course that did not
include economics as the omitted category. The other educational factor in the
regression equation was the number of course credits in economics earned by the
teacher of the student (TCOUR).
Several variables were used to control for school characteristics. The first was
a dummy variable for whether the school district participated in a national eco-
nomic education program (DEEP = 1). The second was the number of students in
the school (SIZE in common logs). The third factor was a set of two income dum-
my variables, one for whether the school was considered to be in a middle-income
area (MINCOME = 1) and one for whether the school was considered to be in a
high-income area (HINCOME = 1), with low income as the omitted category.
The type of community and region in which the school was located were cap-
tured by two other sets of dummy variables. The equation included variables for
whether the school was in a suburban area (SUBURB = 1) or an urban area
(URBAN = 1), with rural serving as the omitted variable. Three regional vari-
ables were included to cover the northeast region (NOREAST = 1), southern
region (SOUTH = l), and west region (WEST = l), with the midwestern region
being the omitted variable.
The regression results from estimating the equation using the original and
modified TEL as a dependent variable are presented in Table 3. The sample used
for the regression was the 2,019 students who participated in the TEL norming
for whom there were complete data on all the variables. The biggest difference
in the results was the general reduction in the size of coefficients from the origi-
nal TEL to the modified TEL equation. The obvious reason for this change was
the reduced range of the modified TEL (40 items) compared with the original
TEL (46 items). In all other respects, the results were strikingly similar. Both
forms of the TEL (original and modified) regressions explained 48 percent of the
variance in the dependent variable. Also, the direction of the coefficient signs and
TABLE 3
TEL Regression Results

              Original TEL (n = 2,019)     Modified TEL(a) (n = 2,019)
Variable      Coefficient       t          Coefficient       t

MALE             1.015        3.616*          0.683        2.630*
IQ               0.282       26.372*          0.265       26.750*
SENIOR           1.501        4.262*          1.367        4.190*
BLACK           -2.025        4.112*         -1.920        4.210*
ECON             2.302        5.347*          1.986        4.981*
CONECON         -1.511        2.532**        -1.532        2.772*
SSECON           0.039        0.062          -0.211        0.362
TCOUR            0.786       10.022*          0.715        9.857*
DEEP             1.060        2.840*          1.036        2.999*
SIZE             2.538        2.772*          2.401        2.832
MINCOME          1.374        3.078*          1.004        2.430**
HINCOME          2.887        4.656*          2.709        4.719*
SUBURB           0.204        0.425           0.089        0.199
URBAN            0.965        1.738           0.887        1.725
NOREAST          0.349        0.646           0.317        0.633
SOUTH            4.700        1.664           4.616        0.582
WEST             0.098        0.208           0.060        0.136
Constant       -10.361        3.858         -10.576        4.132

Adj. R²          0.477                        0.480
SEE              6.266                        5.802

(a) Modified TEL using 40 non-DIF items.
*Significant at the .01 level. **Significant at the .05 level.

the statistical significance of the variables did not change when the modified TEL
was substituted for the original TEL.
Given the differences in the dependent variables of the two regression equations, it is difficult to compare the coefficients directly; a rough comparison, however, can be made. We calculated the percentage change from the original coefficient to see if a significant decrease existed in the male coefficient. The dependent variable on the modified TEL was 13 percent smaller than that on the original TEL. The coefficient on the male variable was 33 percent smaller in the modified analysis, a statistically significant decrease at the .01 level. In comparison, the coefficient on ECON, the next largest statistically significant change, decreased by 14 percent. The coefficient on MINCOME fell by 27 percent, but the level of significance on that coefficient fell from .01 to .05. The rest of the coefficients changed by less than the decrease in the dependent variable at a statistically significant level of .01.
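The percentage decreases cited above can be recomputed directly from the Table 3 coefficients; this sketch reproduces only the arithmetic, not the significance test of the decrease:

```python
# Coefficients from Table 3: (original TEL, modified TEL).
coefs = {
    "MALE": (1.015, 0.683),
    "ECON": (2.302, 1.986),
    "MINCOME": (1.374, 1.004),
}

def pct_decrease(original, modified):
    """Percentage decrease from the original to the modified coefficient."""
    return 100 * (original - modified) / original

for name, (orig, mod) in coefs.items():
    print(f"{name}: {pct_decrease(orig, mod):.0f}% smaller")
# MALE: 33% smaller, ECON: 14% smaller, MINCOME: 27% smaller
```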
Overall, the largest decrease in coefficient size was for the gender variable.
This result supports the initial hypothesis that item bias leads to an overestimate
of gender differences in test scores. Even with the six DIF items removed, how-
ever, a statistically significant difference remained in the TEL scores of males
and females. This result suggests that item bias is not the only factor contributing to gender differences. The other factors mentioned in the introduction (socialization, cognitive differences, instructional effects, or test format) likely explain the remaining gender differences.

CONCLUSION
Many studies using multiple-choice tests in high school and college have found differences in economics test scores favoring males. Although many reasons have been offered to explain the gender difference in economic understanding, we considered an explanation that has not been examined in previous research on economic education: differential item functioning (DIF). The results suggest that males and females may perform differently on particular items on a multiple-choice test in economics, even after controlling for group differences in economic ability.

We outlined statistical procedures that use item response theory (IRT) for identifying items affected by DIF, and data from the national norming of the TEL, a standardized test for high school students, were analyzed for DIF. The results showed a statistically significant difference in the scores of males and females before DIF items were removed but a statistically significant decrease in that difference when the modified TEL scores were used. These results suggest that DIF is not the only source of gender differences in economic understanding. Other factors, such as differential reasoning, socialization, instructional practices, or the format used for testing, may contribute.
Test development work in the future will need to account for gender differ-
ences in test items. If certain items show DIF, then they should be eliminated
from the test because they are masking the true performance of students. Identi-
fying items with DIF and removing them from the measurement instruments
improves test validity. Future research also needs to be conducted on the reasons
why males and females perform differently on DIF items, especially when the
explanation is not clearly apparent from inspecting the content of an item. DIF
analysis and follow-up research on items will be invaluable for improving the
major tests used for research in economic education.

APPENDIX
Text and Correct Answers for Six TEL Items
1. When the United States trades wheat to Saudi Arabia in exchange for oil:
*a. both countries gain.
b. both countries lose.
c. the United States gains, Saudi Arabia loses.
d. Saudi Arabia gains, the United States loses.
5. Sandy Smith can take a job paying $10,000 a year when she graduates from high school, or she can go to college and pay $5,000 a year for tuition. Measured in dollars, what is her opportunity cost of going to college next year?
a. $0.
b. $5,000.
c. $10,000.
*d. $15,000.
22. “ANOTHER SHIP WRECKED-For the fourth time in six years, Rocky Point claims more victims. Millions of dollars in ships and cargo have been lost. Ships heading into the nearby port must come dangerously close to this well-known hazard. Citizens are concerned that no lighthouse protects shipping into our port.” Private businesses are NOT likely to build a lighthouse because:
a. ship owners won't pay for lighthouses because they buy insurance policies to protect themselves from losses.
*b. the light from the lighthouse can be used even by ships that do not pay a fee for the service.
c. it would cost a private business more than it would cost the government to build a lighthouse.
d. the cost of building the lighthouse is too high.
32. Which one of the following groups typically is hurt the most by unexpected inflation?
a. Manufacturers
*b. Bondholders
c. Borrowers
d. Farmers
43. To correct a balance of trade deficit, many members of Congress want to increase import tariffs. If this occurs, then we should also expect:
a. increased U.S. imports and exports.
*b. decreased U.S. imports and exports.
c. increased U.S. imports and decreased U.S. exports.
d. decreased U.S. imports and increased U.S. exports.

Question 44 is based on the following table:

Prices of Foreign Currencies in U.S. Dollars

Currency           Year 1   Year 2   Year 3   Year 4
German mark          .40      .30      .50      .45
Canadian dollar     1.00      .50      .70      .90
British pound       2.20     2.40     1.30     1.60

44. The change in the value of the British pound from Year 1 to Year 2 could be explained by a market for pounds that had experienced:
a. increased supply and decreased demand.
*b. decreased supply and stable demand.
c. stable supply and decreased demand.
d. stable supply and stable demand.

*Indicates correct answer.

NOTES
1. The bias problem is not just an issue with multiple-choice tests. Bias may exist in essay or constructed-response testing and scoring that favors females (Bennett and Ward 1993).
2. The statistical procedures discussed in this article were costly and not widely used for test development when the TEL revision began in 1985.
3. To put this in perspective, the mean difference in TEL scores for those with and without economics was 4.96. Thus, the mean gender difference is equivalent to about 22 percent of the total gain in economic understanding.
4. In the one- and two-parameter models, the b parameter will lie at P(θ) = .50, indicating that the higher the b parameter, the greater the ability required to have a 50 percent chance of getting the item correct. In the three-parameter model, b will lie halfway between the lower asymptote and 1 on the P(θ) scale.
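The claim in this note can be checked directly from the three-parameter logistic model, a standard IRT form (the function below is an illustrative sketch, not the authors' estimation code):

```python
import math

def icc_3pl(theta, a, b, c):
    """P(theta) for the three-parameter logistic model:
    a is the discrimination, b the difficulty, and c the
    lower asymptote (guessing) parameter."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

# At theta = b the exponent is 0, so P(b) = c + (1 - c)/2 = (1 + c)/2:
# halfway between the lower asymptote c and 1. With c = 0 (the one- and
# two-parameter cases), this reduces to P(b) = .50.
```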
5. A simpler procedure (the delta-plot) standardizes the differences in difficulties (percentage correct) between males and females (see Angoff 1982). Any item falling outside some specified interval is considered relatively easier for one group than for the other. Only three of the six items flagged by the IRT method were also flagged by this simpler procedure (1, 5, 43). Four items that were not flagged by IRT were found, using the delta-plot method, to be easier for females than for males. All four items were found to have differences in discrimination, which explains why they were not flagged by the IRT method. The simpler method assumes equal discriminating power for all items, which is not required for IRT estimation.
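Angoff's delta-plot rests on transforming each group's proportion correct onto a normal-deviate "delta" scale (mean 13, standard deviation 4). A sketch of that transformation (the flagging interval itself is analyst-chosen and not reproduced here):

```python
from statistics import NormalDist

def delta(p_correct):
    """Angoff delta transform: maps proportion correct to a scale
    with mean 13 and SD 4; harder items receive larger deltas."""
    return 13 + 4 * NormalDist().inv_cdf(1 - p_correct)

# Each item yields a (male delta, female delta) point; items falling
# too far from the major axis of the scatter are flagged as relatively
# easier for one group.
```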
6. For a discussion of the area equation, see Hambleton, Swaminathan, and Rogers (1991). For a discussion of other methods for DIF identification, see Holland and Wainer (1993, 67-113).
7. Item 22, in Figure 3, also indicates nonuniform DIF: the signed area is -0.319 and the unsigned
area is 0.553. In this case, there is little or no difference at the low- and middle-ability levels, but
at the high-ability level, females have a higher probability of getting the item correct than do
males.
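The signed/unsigned distinction in this note can be illustrated numerically: when two item characteristic curves cross (nonuniform DIF), positive and negative differences cancel in the signed area but accumulate in the unsigned area. A sketch with hypothetical item parameters, not the actual item 22 estimates:

```python
import math

def icc(theta, a, b, c=0.0):
    """Three-parameter logistic item characteristic curve."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def dif_areas(params_m, params_f, lo=-3.0, hi=3.0, n=600):
    """Midpoint-rule approximations of the signed and unsigned
    areas between two ICCs over the ability range [lo, hi]."""
    step = (hi - lo) / n
    signed = unsigned = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * step
        diff = icc(theta, *params_m) - icc(theta, *params_f)
        signed += diff * step
        unsigned += abs(diff) * step
    return signed, unsigned

# Hypothetical crossing ICCs: same difficulty, different discrimination.
# The signed area is near zero while the unsigned area is not, the
# numerical signature of nonuniform DIF.
s, u = dif_areas((1.5, 0.0, 0.0), (0.6, 0.0, 0.0))
```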
8. Rudner (1977) suggested a conservative cutoff of 0.70 and a liberal cutoff of 0.40. Hambleton, Swaminathan, and Rogers (1991) used a cutoff of 0.498. For the liberal cutoff, we decided to split the difference between the two.

REFERENCES
Anderson, G., D. Benjamin, and M. A. Fuss. 1994. The determinants of success in university introductory economics courses. Journal of Economic Education 25 (Spring): 99-119.
Angoff, William H. 1982. Use of difficulty and discrimination indices for detecting item bias. In Handbook of methods for detecting test bias, ed. R. A. Berk. Baltimore, Md.: Johns Hopkins University Press.
Bennett, R. E., and W. C. Ward, eds. 1993. Construction versus choice in cognitive measurement. Hillsdale, N.J.: Erlbaum.
Birnbaum, A. 1968. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.
Ferber, M. A. 1990. Gender and the study of economics. In The principles of economics course, ed. Phillip Saunders and William B. Walstad, 44-60. New York: McGraw-Hill.
Ferber, M. A., B. G. Birnbaum, and C. A. Green. 1983. Gender differences in economic knowledge: A reevaluation of the evidence. Journal of Economic Education 14 (Spring): 24-37.
Hambleton, R. K., and H. Swaminathan. 1985. Item response theory: Principles and applications. Boston, Mass.: Kluwer-Nijhoff.
Hambleton, R. K., H. Swaminathan, and H. J. Rogers. 1991. Fundamentals of item response theory. Newbury Park, Calif.: Sage.
Hirschfeld, M., R. L. Moore, and E. Brown. 1995. Exploring the gender gap on the GRE subject test in economics. Journal of Economic Education 26 (Winter): 3-16.
Holland, P. W., and H. Wainer, eds. 1993. Differential item functioning. Hillsdale, N.J.: Erlbaum.
Horvath, J., B. Q. Beudin, and S. P. Wright. 1992. Persisting in the introductory economics course: An exploration of gender differences. Journal of Economic Education 23 (Spring): 101-108.
Jackstadt, S. L., and C. Grootaert. 1980. Gender, gender stereotyping, and socioeconomic background as determinants of economic knowledge and learning. Journal of Economic Education 12 (Winter): 34-40.
Ladd, H. F. 1977. Male-female differences in precollege economic education. In Perspectives on economic education, ed. D. R. Wentworth, W. Lee Hansen, and Sharryl H. Hawke, 145-155. New York: Joint Council on Economic Education (now National Council on Economic Education).
Lord, F. 1980. Applications of IRT to practical testing problems. Hillsdale, N.J.: Erlbaum.
Lumsden, K. G., and A. Scott. 1987. The economics student re-examined: Male-female differences in comprehension. Journal of Economic Education 18 (Fall): 365-75.
Mislevy, R. J., and R. D. Bock. 1990. BILOG: Item analysis and test scoring with binary logistic models. 2d ed. Computer program. Mooresville, Ind.: Scientific Software, Inc.
Mislevy, R. J., and M. L. Stocking. 1989. A consumer's guide to LOGIST and BILOG. Applied Psychological Measurement 13 (March): 57-75.
Oshima, T. C., D. McGinty, and C. P. Flowers. 1994. Differential item functioning for a test with a cutoff score: Use of limited closed-interval measures. Applied Measurement in Education 7 (3): 195-209.
Rudner, L. M. 1977. An approach to biased item identification using latent trait measurement theory. Paper presented at the 61st annual meeting of the American Educational Research Association, New York, April 4-8.
Shepard, L. A. 1982. Definition of bias. In Handbook of methods for detecting test bias, ed. R. A. Berk, 9-29. Baltimore, Md.: Johns Hopkins University Press.
Siegfried, J. J. 1979. Male-female differences in economic education. Journal of Economic Education 10 (Spring): 1-11.
Soper, J. C., and W. B. Walstad. 1987. Test of economic literacy: Examiner's manual. 2d ed. New York: Joint Council on Economic Education (now the National Council on Economic Education).
Walstad, W. B., and J. C. Soper. 1989. What is high school economics? Factors contributing to student achievement and attitudes. Journal of Economic Education 20 (Winter): 23-38.
Williams, M. L., C. Waldauer, and V. G. Duggal. 1992. Gender differences in economic knowledge: An extension of the analysis. Journal of Economic Education 23 (Summer): 219-31.
