
Adverse Impact Tests 1

Testing for Adverse Impact When Sample Size is Small

Michael W. Collins

Illinois Institute of Technology

Scott B. Morris

Illinois Institute of Technology

Journal of Applied Psychology, 93(2), 463-471

DOI: 10.1037/0021-9010.93.2.463

© 2008 American Psychological Association

This article may not exactly replicate the final version published in the APA journal. It is not the

copy of record. The final published version can be obtained at

http://psycnet.apa.org/journals/apl/93/2/463/
Testing for Adverse Impact When Sample Size is Small

Abstract

Adverse impact evaluations often call for evidence that the disparity between groups in

selection rates is statistically significant, and practitioners must choose which test statistic to

apply in this situation. To identify the most effective testing procedure, several alternate test

statistics were compared in terms of Type I error rates and power, focusing on situations with

small samples. Significance testing was found to be of limited value due to low power for all

tests. Among the alternate test statistics, the widely-used Z-test on the difference between two

proportions performed reasonably well, except when sample size was extremely small. A test

suggested by Upton (1982) provided slightly better control of Type I error under some

conditions, but generally produced results similar to the Z-test. Use of the Fisher Exact Test and

Yates continuity-corrected chi-square test is not recommended, due to overly conservative Type

I error rates and substantially lower power than the Z-test.


Adverse impact refers to group differences in the outcome of an employment decision.

Adverse impact analyses play a central role in many employment discrimination lawsuits, and

have become a regular part of the evaluation of employee selection procedures. Consequently, it

is important that adverse impact analyses are based on the best statistical procedures available.

The most common approach for evaluating adverse impact is based on the 4/5ths rule

outlined in the Uniform Guidelines on Employee Selection Procedures (Uniform Guidelines;

U.S. Equal Employment Opportunity Commission, 1978). A limitation of the 4/5ths rule is that it

does not take into account the potential impact of sampling error (Morris & Lobsenz, 2000).

When sample size is small, the 4/5ths rule will often identify cases of adverse impact even when

selection rates are equal in the population (Roth, Bobko & Switzer, 2006).

In order to account for sampling error, the 4/5ths rule can be supplemented with a test of

statistical significance (Roth et al., 2006). For large samples, significance can be tested using a

Z-test on the difference between two independent proportions, often referred to as the “2 SD

rule” [Office of Federal Contract Compliance Programs (OFCCP) Compliance Manual, 1993].

When the sample size is small, the Fisher Exact Test is often recommended as a more

appropriate test (OFCCP, 1993; Siskin & Trippi, 2005).

The choice between the Fisher Exact Test and the Z-test (or the equivalent chi-square

test) has long been debated (Camilli, 1990; Haber, 1990; Upton, 1982). Although there are clear

advantages to using an exact test, the Fisher Exact Test is based on a different theoretical model

than the Z-test (Kroll, 1989; Siskin & Trippi, 2005), and each will be appropriate in different

situations. In this paper, we review the statistical models underlying these tests and discuss their

applicability to adverse impact analysis. In addition, we report simulations of the Type I error

rates and power of alternate adverse impact tests.


It is important to note that in practice, decisions about adverse impact are not solely based

on statistical evidence. Courts may consider a variety of factors to determine whether a prima

facie case of discrimination has been made. The Uniform Guidelines recommend that adverse

impact statistics be interpreted in light of the hiring organization’s recruiting practices that

encourage or discourage minority applicants. In addition, when sample size is small, the Uniform

Guidelines suggest that adverse impact statistics might be supplemented with data from other

similar jobs or for the same job across time. The current research only addresses methods for

evaluating statistical evidence, and therefore does not fully model the process of determining

whether adverse impact exists. However, because statistical evidence usually plays a central role

in such decisions, identifying the most effective statistical tools is essential to ensure accurate

decisions.

Alternate Test Statistics

Adverse impact can be defined using passing rates on a particular component of a

selection process, or based on the outcome of a decision related to selection, promotion or other

employment action. For simplicity, we will use the term ‘selection rate’ to refer to the rate of

successful outcomes, regardless of the type of decision involved.

Adverse impact statistics are typically based on a comparison of selection rates for two

predetermined groups (referred to here as the minority and majority groups). Although there may

be more than two subgroups in the applicant pool (e.g., multiple ethnic groups), the context in

which the analysis is conducted (e.g., a claim of discrimination) typically identifies the relevant

minority and majority groups. Therefore, in developing a model for adverse impact data, we will

focus on the selection rates for the two groups of interest, and ignore data that might be present

for other groups. In a later section we will discuss the implications of assessing adverse impact in

situations where the relevant groups are not identified a priori.


Adverse impact analysis can be expressed as a test for association in a 2 x 2 contingency

table, as illustrated in Table 1. In this table, NPmin, NPmaj, NFmin and NFmaj reflect the number of

applicants in each cell, Nmin, Nmaj, NPT and NFT are the marginal totals, and N is the total number

of applicants. Pmin represents the marginal proportion of minority applicants, and SRT reflects the

marginal proportion of applicants who pass the selection test.

The most widely recognized procedure for detecting adverse impact is the 4/5ths rule

outlined in the Uniform Guidelines. The basic statistic for the 4/5ths rule is the impact ratio (IR),

which is defined as the selection rate for the minority group divided by the selection rate for the

majority group. Adverse impact is indicated when this ratio is less than four-fifths. The impact ratio

is an index of effect size; it was intended to direct the attention of regulatory agencies toward

settings with substantial disparities in selection outcomes. It is not based on a formal sampling

model and does not assess the impact of sampling error on the result.
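As a concrete illustration, the impact ratio and the 4/5ths screen can be computed as follows (a minimal Python sketch; the function name and applicant counts are ours, not from the article):

```python
def impact_ratio(np_min, n_min, np_maj, n_maj):
    """Impact ratio: minority selection rate divided by majority selection rate."""
    return (np_min / n_min) / (np_maj / n_maj)

# 4/5ths rule: adverse impact is indicated when IR < 0.8.
# Illustrative data: 3 of 10 minority and 6 of 10 majority applicants selected.
ir = impact_ratio(3, 10, 6, 10)   # 0.3 / 0.6 = 0.5
flagged = ir < 0.8                # True: flagged under the 4/5ths rule
```

Note that, as the article emphasizes, this computation involves no sampling model; it flags the observed disparity regardless of how few applicants produced it.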

The Uniform Guidelines also suggest a heuristic method of assessing the impact of chance

on adverse impact results in small samples, which we will call the Reverse One rule (Roth et al.,

2006, referred to this as the “N of 1” or “flip-flop” rule). The Guidelines state that adverse impact

would not be found when, “…the selection of one different person for one job would shift the result

from adverse impact against one group to a situation in which that group has a higher selection rate

than the other group.” The adjusted impact ratio (IRadj) suggested by this strategy is,

IRadj = [(NPmin + 1) / Nmin] / [(NPmaj - 1) / Nmaj].

Applying this strategy, a finding of adverse impact would require that: (1) the impact ratio (IR) is

less than 4/5, and (2) that the adjusted impact ratio (IRadj) is less than 1.0. Although this approach

does assess the impact of chance on the result, it is not based on a formal statistical model, and it is

not known how well this heuristic will work in practice.
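The Reverse One adjustment can be sketched the same way (illustrative counts, chosen to show a case where the adverse impact finding survives the check):

```python
def adjusted_impact_ratio(np_min, n_min, np_maj, n_maj):
    """Reverse One rule: shift one selection from the majority to the minority group."""
    return ((np_min + 1) / n_min) / ((np_maj - 1) / n_maj)

# Same illustrative data: 3 of 10 minority and 6 of 10 majority applicants selected.
ir = (3 / 10) / (6 / 10)                      # 0.5: fails the 4/5ths screen
ir_adj = adjusted_impact_ratio(3, 10, 6, 10)  # (4/10) / (5/10) = 0.8
adverse_impact = ir < 0.8 and ir_adj < 1.0    # True: finding survives the Reverse One check
```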


Formal statistical significance tests have also been used to analyze adverse impact data. A

common test is the Z-test for the difference between proportions (OFCCP, 1993), or equivalently,

the Pearson chi-square test for independence in a 2 x 2 table. The Z-test evaluates the null

hypothesis of equal population selection rates for the two groups being compared,

Z = (NPmin/Nmin - NPmaj/Nmaj) / sqrt[ SRT*(1 - SRT) / (N*Pmin*(1 - Pmin)) ].
Also known as the 2-SD test, Z is considered significant if the difference is more than roughly

two standard deviations above or below zero (or more precisely, |Z| > 1.96, corresponding to a

two-tailed α = .05).

The Z-test is based on large-sample theory, and may not be appropriate in small samples.

Specifically, when any of the cell frequencies are less than five, the ability of the test to achieve

the nominal Type I error rate is questionable. When Pmin and SRT are small, this value may be

less than five, even for moderately large N. For example, when the minority group makes up

10% of the applicant pool, and 10% of the applicants pass the test, the total sample size must

exceed 500 in order to meet this requirement.
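A minimal sketch of the pooled two-proportion Z-test in Python (the applicant counts and the one-tailed critical value of 1.645 are illustrative choices, not values from the article):

```python
import math

def z_two_proportions(np_min, n_min, np_maj, n_maj):
    """Pooled Z-test on the difference between two independent proportions."""
    n = n_min + n_maj
    sr_t = (np_min + np_maj) / n   # overall selection rate (pooled proportion)
    p_min = n_min / n              # proportion of minority applicants
    se = math.sqrt(sr_t * (1 - sr_t) / (n * p_min * (1 - p_min)))
    return (np_min / n_min - np_maj / n_maj) / se

# Illustrative data: 10 of 40 minority and 30 of 60 majority applicants selected.
z = z_two_proportions(10, 40, 30, 60)  # z = -2.5
significant = z < -1.645               # one-tailed test at the .05 level: True
```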

A number of alternate test statistics have been suggested for situations where the sample

size is too small for the Z-test. The most common is the Fisher Exact Test (Kroll, 1989; OFCCP,

1993; Siskin & Trippi, 2005). The Fisher Exact Test provides the exact probability of obtaining

the observed frequency table (or one more extreme) under the null hypothesis, with the

additional assumption that the marginal frequencies are fixed (Fleiss, 1981).
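With all margins fixed, the one-tailed Fisher probability is a hypergeometric tail sum over the feasible values of the minority-pass cell. A direct sketch (our own helper and illustrative counts, not the routine used in the article's simulations):

```python
from math import comb

def fisher_exact_one_tailed(np_min, n_min, np_maj, n_maj):
    """One-tailed Fisher Exact Test: probability, with all margins fixed,
    of observing np_min or fewer minority selections."""
    n = n_min + n_maj
    npt = np_min + np_maj              # total selected (fixed column margin)
    lo = max(0, npt - n_maj)           # smallest feasible minority-pass cell
    tail = sum(comb(n_min, k) * comb(n_maj, npt - k) for k in range(lo, np_min + 1))
    return tail / comb(n, npt)

# Illustrative data: 10 of 40 minority and 30 of 60 majority applicants selected.
p = fisher_exact_one_tailed(10, 40, 30, 60)
```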

Another statistic often recommended for small samples is the chi-square test with Yates’

continuity correction (Camilli & Hopkins, 1978). Although not an exact test, Yates’ correction
has been shown to provide a close approximation to the Fisher Exact Test (Fleiss, 1981), while

being considerably easier to compute. Yates’ continuity-corrected test is calculated as:


Y2 = N * [ |(NFmin)(NPmaj) - (NFmaj)(NPmin)| - N/2 ]^2 / [ (Nmin)(Nmaj)(NPT)(NFT) ].

Yates’ test is evaluated against a chi-square distribution with 1 df.

Upton (1982) suggested another alternative. Based on the known bias in the Pearson chi-

square statistic, Upton (1982) suggested a corrected chi-square statistic (U2),

U2 = (N - 1) * [ (NPmin)(NFmaj) - (NFmin)(NPmaj) ]^2 / [ (Nmin)(Nmaj)(NPT)(NFT) ].

U2 is evaluated against a chi-square distribution with 1 df.
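Both corrected statistics are simple functions of the four cell frequencies. A sketch with illustrative counts (the directional convention of testing against the α = .10 critical value, 2.706, is standard for a one-tailed .05 test, rejecting only when the minority rate is lower):

```python
def yates_chi_square(np_min, nf_min, np_maj, nf_maj):
    """Yates' continuity-corrected chi-square for a 2 x 2 table."""
    n = np_min + nf_min + np_maj + nf_maj
    num = n * (abs(nf_min * np_maj - nf_maj * np_min) - n / 2) ** 2
    den = (np_min + nf_min) * (np_maj + nf_maj) * (np_min + np_maj) * (nf_min + nf_maj)
    return num / den

def upton_chi_square(np_min, nf_min, np_maj, nf_maj):
    """Upton's (1982) (N-1)-corrected chi-square for a 2 x 2 table."""
    n = np_min + nf_min + np_maj + nf_maj
    num = (n - 1) * (np_min * nf_maj - np_maj * nf_min) ** 2
    den = (np_min + nf_min) * (np_maj + nf_maj) * (np_min + np_maj) * (nf_min + nf_maj)
    return num / den

# Illustrative table: minority 10 pass / 30 fail, majority 30 pass / 30 fail.
y2 = yates_chi_square(10, 30, 30, 30)   # compare against 2.706 (1 df, directional .05)
u2 = upton_chi_square(10, 30, 30, 30)
```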

One- and Two-tailed tests. Tests of statistical significance can be either directional (i.e.,

one-tailed) or non-directional (two-tailed). Adverse impact statistics are often evaluated using

two-tailed significance levels (e.g., OFCCP, 1993), even though the hypothesis is usually

directional. Typically, the focal minority group will be determined by a claim of discrimination

or a history of under-representation in the workforce, and the purpose of the test is to identify

potential discrimination against this group. In such situations, finding a higher selection rate for

the minority than for the majority group would typically not be interpreted as an indication of

discrimination. Therefore, one-tailed significance levels are appropriate (Paetzold & Willborn,

1994).

There may be settings in which adverse impact statistics are evaluated in a more

exploratory fashion, without predetermined minority and majority groups, and non-directional

significance tests should be used in these situations. The use of one-tailed significance tests will

generally increase statistical power for all significance tests, and will have little effect on the

relative performance of alternate tests. The current research will focus only on directional tests.
It is important to note that tests based on the chi-square distribution are inherently non-

directional. Chi-square tests are computed using the sum of squared differences, and therefore,

differences in either direction will yield a positive chi-square. In order to conduct a directional

test, the probability obtained from the standard chi-square test should be halved. Alternatively,

the critical chi-square value can be set using double the desired significance level (i.e., α = .10 for

a directional test at the .05 level).

Alternate Sampling Models for 2 x 2 Contingency Tables

Statisticians have long debated the appropriateness of alternate test statistics for assessing

significance in 2 x 2 contingency tables. The most appropriate statistic depends on the theoretical

sampling model, that is, how a particular sample contingency table is generated from the

population. Historically, the test for association in a 2 x 2 contingency table has been expressed

in terms of one of three sampling models (Barnard, 1947; Fleiss, 1981; Kroll, 1989).

In order to define statistical significance, a set of applicants is viewed as a random sample

from a population of potential applicants. The relevant population will be defined by a variety of

factors, including the geographical region of the employment opportunity, the employer’s

recruiting practices and the minimum qualifications established for the job.1 A group of job

applicants are conceptualized as a sample from this population, because some qualified

individuals, either due to lack of awareness or convenience, will not attend a particular

administration of the selection procedure. Alternately, the sample of applicants could be viewed

as the individuals who take an exam at a particular point in time, and the population as applicants

who might potentially take part in future administrations of the test.

Independence Trial. In the first model, the marginal proportions are assumed to be fixed.

That is, Pmin and SRT are not sample estimates, but rather define the population to which

inferences will be generalized. In this model, the data are not viewed as a random sample from a
larger population. Instead, the probability of the result is considered among alternate random

assignment of participants to treatment conditions (Camilli, 1990). Consider an experiment

where participants are randomly assigned to one of two treatments and then observed on a

dichotomous outcome. Under the null hypothesis that the two variables (treatment group and

outcome) are independent, the number of successes for each treatment depends only on which

individuals were assigned to each group. If a different random assignment were applied to the

same group of participants, the results of the 2 x 2 table would differ. Tests based on this model

evaluate the probability that the observed result could have occurred due to random assignment,

assuming that the two variables are independent. Under this sampling model, the cell frequencies

will have a hypergeometric distribution. This model led to the development of the Fisher Exact

Test, which is closely approximated by the Yates (1934) correction to the chi-square test for

independence.

Comparative Trial. In the Comparative Trial model, participants can be viewed as

random samples from two distinct populations (e.g., minority and majority). The proportion from

each population is fixed. In other words, the marginal proportion on one variable (e.g., Pmin) is

assumed to be constant across replications. The second marginal proportion (SRT) is estimated

from the sample data. Assuming independence, each row of Table 1 would have a binomial

distribution, with parameters Ni and SRT. This model is the basis for the Z-test on independent

proportions.

Double Dichotomy. In the third model, neither marginal is assumed to be fixed.

Participants are viewed as a random sample from a population that is characterized by two

dichotomous characteristics. No purposive sampling or assignment to groups is used, and the

proportion in each group can vary across samples. Both population marginal proportions are

therefore unknown and must be estimated from the sample data. Assuming independence, the
expected value of each cell frequency will be the product of the marginal frequencies. The

sample frequencies will be distributed as N observations from a multinomial distribution with

parameters Pmin*SRT, Pmin*(1-SRT), (1-Pmin)*SRT, and (1-Pmin)*(1-SRT).
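Under this model, one sample of N applicants is a single multinomial draw over the four cells, with cell probabilities formed as products of the marginals. A pure-Python stand-in (the marginal values are illustrative; the article's simulations used an IMSL routine):

```python
import random

random.seed(1)
p_min, sr_t = 0.3, 0.5   # illustrative marginal proportions, not from the article

# Under independence, the four cell probabilities are the products of the marginals:
# Pmin*SRT, Pmin*(1-SRT), (1-Pmin)*SRT, (1-Pmin)*(1-SRT).
cell_p = [p_min * sr_t, p_min * (1 - sr_t),
          (1 - p_min) * sr_t, (1 - p_min) * (1 - sr_t)]

def multinomial_sample(n, probs):
    """One multinomial draw: classify n applicants into the cells at random."""
    counts = [0] * len(probs)
    for cell in random.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts

sample = multinomial_sample(50, cell_p)   # four cell frequencies summing to 50
```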

Model for adverse impact. Because the choice of test statistic depends on the sampling

model (Siskin & Trippi, 2005), it is important to identify the model that best describes typical

adverse impact data. This choice is complicated by the variety of approaches used to make

selection decisions. No single model will be appropriate for all situations in which adverse

impact is evaluated.

One approach to making selection decisions uses a fixed cutoff score. The passing score

might be established by the test developers in reference to normative data, or based on the

minimum qualifications required for a particular job. This situation is best represented by the

Double Dichotomy model. The population consists of all potential applicants, who can be

classified on minority/majority status and whether they would pass or fail the cut score. The

available data represent a sample from this population. Neither the proportion of minority

applicants nor the selection rate involve purposive sampling or random assignment, and sample

estimates of both variables are expected to differ from their population values. If multiple

samples were taken, they would not be expected to have exactly the same selection rate or

proportion of minority applicants. Therefore, the situation is best modeled as a random sample

with two unknown marginal proportions, as reflected in the Double Dichotomy model.

For other approaches to selection, the Double Dichotomy model may not be appropriate.

In top-down selection, candidates are ranked on their test scores (or based on a composite score

from a battery of tests), and then selected from the top down until a fixed number of positions are

filled. In this situation, the selection rate is fixed based on the employer’s staffing needs. If a

different sample had been used, the number passing would have been the same. However, Pmin is
likely to vary across samples and is best treated as an estimate of an unknown population

parameter. Further, because the selection decision depends on the rank position of applicants in a

particular sample, the selected and non-selected groups are sample-specific, and do not reflect

two distinct populations as in the Comparative Trial model. In fact, none of the three sampling

models adequately describe this type of decision.

Another common approach to making selection decisions uses score bands. A band refers

to a range of test scores that are treated as equivalent for the purpose of a selection decision. All

individuals within the top band or bands are considered to have passed the test, and move on to

be evaluated using other selection criteria2. Because bands are often defined using a fixed range

of scores below the highest score in the sample, the position of the band and the number of

people in the top band(s) will vary across samples. Neither marginal would be fixed. Further,

whether an individual passes depends on his or her position relative to the highest score in the

sample, and the selection rate cannot be considered a characteristic of the population. None of

the sampling models adequately represent this situation.

In some situations, it is difficult to conceptualize a population from which the data are a

random sample. For example, when evaluating a promotion decision, the pool of candidates is

relatively fixed. If the decision were repeated at a different point in time, the set of candidates

under consideration would be mostly the same. In such cases, probabilities based on randomly

sampling from a population, as in the Comparative Trial and Double Dichotomy models, would

not apply. Similarly, probabilities based on random reassignment of participants, as in the

Independence Trial model, would not be appropriate. Without some theoretical process for

producing different patterns of data (e.g., random sampling or random reassignment), statistical

significance cannot be defined3.


Comparison of alternate test statistics. A number of previous studies have compared the

performance of the alternate test statistics for 2x2 contingency tables (Kroll, 1989; Upton, 1982).

A clear result from these studies is that the choice of a test depends on the sampling model.

Under the Independence Trial model, the Fisher Exact Test has Type I error rates closest to the

nominal alpha level, closely followed by Yates’ test, while the Z-test and Upton’s chi-square can

have excessive Type I error rates when expected cell frequencies are less than 5. In contrast,

when the data are produced by the Double Dichotomy or Comparative Trial models, both the

Fisher Exact Test and Yates’ test tend to have overly conservative Type I error rates. The Z-test

generally produces Type I error rates closer to the nominal alpha level, but has inflated Type I

error rates under some conditions. Upton’s test also tends to be slightly inflated under some

conditions, but generally less so than the Z-test.

Because the Independence Trial model does not represent typical personnel selection

data, there is reason to question the appropriateness of the Fisher Exact Test and Yates’ test for

adverse impact analysis. The tendency of these tests to be conservative under the other sampling

models indicates that the Fisher Exact Test and Yates’ test will be less likely than other tests to

identify true cases of adverse impact. For selection decisions based on a fixed cut score, which

can be represented by the Double Dichotomy model, past research suggests that the Z-test and

Upton’s test will perform well. However, for other decision types (e.g., banding or top-down

selection), none of the sampling models are a perfect fit, and it is unclear whether the results of

previous studies will generalize to these settings.

Simulation of Type I Error and Power

A series of Monte Carlo simulations were conducted to identify rejection rates for each of

the adverse impact detection methods: the Z-test on the difference between proportions (Z),

Upton’s adjusted chi-square test (U2), the Fisher Exact Test (F), Yates’ correction to the chi-square test (Y2), the 4/5ths rule (4/5), and the Reverse One rule (R1). Separate simulations were

used to represent different types of selection decisions described above (fixed cutoff, top-down

selection, test-score banding). Overall, the different simulations produced very similar results.

For the sake of brevity, only the fixed cutoff will be discussed in detail here. Full results for the

other methods are available from the authors.

Method

Data were generated based on the Double Dichotomy model. A Fortran 90 program

generated 10,000 random samples consisting of frequencies in each of the four cells (i.e.,

minority vs. majority and pass vs. fail). Sample frequency distributions were generated using a

multinomial pseudo random number generator (RMNTN) from the International Mathematical

and Statistical Library (IMSL, 1984).

The multinomial sampling distribution depends on the sample size (N) and the proportion

of the population in each cell. Rather than specifying these proportions directly, we specified

parameters that are more easily interpreted in terms of personnel selection decisions (IR, SRT and

Pmin), and then computed the population cell proportions from these parameters. The simulation

was repeated for a wide range of realistic levels for each of these parameters. We chose three

levels of IR, representing extreme (IR = .1), small (IR = .8), and no adverse impact (IR = 1). SRT

was set at .1, .3, .5, and .7 to represent a wide range of selection settings and labor market

conditions. Pmin was set at .1, .3 and .5. Conditions with SRT=.7 and Pmin=.5 were excluded from

the simulation, because some values of IR are not possible under these conditions. Sample size

(N) values were chosen to represent a range of relatively small sample sizes (N = 20, 50, or 100).

The choice was made to focus on small sample size conditions because these are the conditions

under which the choice of significance test is likely to make the most difference.
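The article does not spell out the mapping from (IR, SRT, Pmin) to the four population cell proportions; one derivation consistent with its definitions (SRT = Pmin*SRmin + (1-Pmin)*SRmaj, with SRmin = IR*SRmaj) can be sketched as:

```python
def cell_proportions(ir, sr_t, p_min):
    """Recover the four population cell proportions from IR, SRT and Pmin.
    Solves SRT = Pmin*SRmin + (1-Pmin)*SRmaj subject to SRmin = IR*SRmaj."""
    sr_maj = sr_t / (1 - p_min + p_min * ir)   # majority selection rate
    sr_min = ir * sr_maj                       # minority selection rate
    return [p_min * sr_min, p_min * (1 - sr_min),
            (1 - p_min) * sr_maj, (1 - p_min) * (1 - sr_maj)]

# One of the simulated conditions: small adverse impact, moderate selection rate.
cells = cell_proportions(ir=0.8, sr_t=0.5, p_min=0.3)
```

The returned proportions sum to one and reproduce the specified IR and SRT, which is how the simulation conditions can be parameterized in interpretable terms.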
For each sample, each of the test statistics was computed as described above. The Fisher

Exact Test was obtained using the IMSL (1984) DCTTWO routine. The rejection rate was

defined as the proportion of samples4 in which the statistic was significant at the α = .05 level

(one-tailed). The test was considered significant only if the minority SR was smaller than the

majority SR. Tests based on the chi-square distribution (Y2 and U2) were considered significant

if the test statistic was larger than the critical chi-square value with 1 df and α = .10. Because

results in the non-hypothesized direction were ignored, this corresponds to a directional α = .05.

Rejection rates correspond to Type I error rates for IR=1.0, and to power when IR = 0.8 or IR =

0.1.

Results

Type I Error Rates. Empirical Type I error rates for the alternate test statistics are

presented in Table 2. The accuracy of each approach depended on the sample size, proportion

of minority applicants and overall selection rate. The combined influence of these three factors

can be summarized to a large extent by the expected frequency of the smallest cell, which is

N*Pmin*SRT when SRT ≤ .5, and N*Pmin*(1-SRT) when SRT > .5.

As in past research (Roth et al., 2006), the 4/5ths rule had very high false positive rates.

When the smallest expected frequency was less than five, false positive rates were often above .4

and reached as high as .77. Even for larger N, false positive rates were often over .2.

Figure 1 summarizes the false positive (Type I error) rates for the different significance

tests, and for the Reverse One rule. Applying the Reverse One rule avoided the most excessive

false positive rates found with the 4/5ths rule. When the minimum expected frequency was five

or less, false positive rates for the Reverse One rule were consistently less than .25, considerably

lower than with the 4/5ths rule. For larger N, results for the Reverse One method were similar to
the 4/5ths rule. Although it performed better than the 4/5ths rule, the high false positive rates for

the Reverse One rule are problematic.

Across conditions, the Z-test and Upton’s test produced very similar results. When the

smallest expected frequency was greater than five, the empirical Type I error rates tended to

range from .047 to .056, which is a negligible deviation from the nominal alpha level of .05.

Furthermore, as long as the minimum expected frequency was greater than two, the empirical

Type I error rates for the Z and Upton’s tests ranged from .04 to .06. This range is slightly larger

than the level of error recommended in past research (Bradley, 1978); however, these values

would not lead to substantially inaccurate conclusions.

When the minimum expected frequency was less than two, the Z-test and Upton’s test

performed similarly, with the Z-test slightly more liberal than Upton’s test. Both tests

consistently showed inflated Type I error rates when the selection rate was high. When the

proportion of minorities and the selection rate were both low, Type I error rates were overly

conservative (.026 or lower). Under other conditions both tests approached nominal Type I error

rates. In the few cases where the Z-test produced excessive rejection rates, Upton’s test was also

liberal, but to a lesser degree. For example, in the worst case, where the rejection rate for the Z-

test was .076, the rejection rate for Upton’s was slightly lower at .071.

The Fisher Exact Test provided better control of Type I errors, with empirical rejection

rates never exceeding .05. However, this control of Type I error occurred

because the test is overly conservative. The empirical rejection rates were consistently lower than

.03. The empirical rejection rates became smaller as the minimum expected frequency decreased,

reaching levels below .01. In terms of accurately reflecting the nominal Type I error rate, the

Fisher Exact Test was less accurate than the Z-test or Upton’s test for all of the conditions

investigated.
Results for Yates’ test were generally quite close to the Fisher Exact Test, except when

N, Pmin and SRT were all small, in which case the Type I error rate for Yates’ test was excessive

(.083). It should be noted that this anomalous result was found only with the fixed cutoff

scenario, and was not replicated in simulations based on top-down selection or banding-based

decisions.

Power. Generally, all tests became more powerful as sample size increased and as the

proportion of minorities and the selection rate increased. When the degree of adverse impact in

the population was small (IR=.8), all tests had extremely low power across conditions (see Table

3). Figure 2 illustrates the power curve for each test statistic as a function of the minimum

expected frequency when SRT=.5. Even under the best conditions (N=100, Pmin=.5, SRT=.5)

power was only .29 for the Z-test and .23 for the Fisher Exact Test. Power curves were nearly

identical for the Z-test and Upton's test, and were consistently lower for the Fisher Exact Test

and Yates' corrected chi-square. However, differences between the statistics were generally

small, with power for Z and Upton's test ranging from .06 to .09 higher than Yates' test and the

Fisher Exact Test. Power curves were similar to Figure 2 for lower selection rates, and slightly

steeper when SRT=.7.

Power differences were larger when a large degree of adverse impact was present in the

population (IR=.1). As shown in Table 4, the Z-test and Upton’s test were uniformly more

powerful than Yates’ chi-square and the Fisher Exact Test. Figure 3 displays power curves for

each statistic as a function of the minimum expected frequency when SRT=.5. For the Z- and

Upton tests, power greater than .8 was found as long as the minimum expected frequency was

greater than 3. For the Fisher Exact Test and Yates' chi-square, a minimum expected frequency

greater than 4 guaranteed power of .8. Differences between the alternatives were greatest when

power was moderate. For example, when the minimum expected frequency was 2.5, power was .75 for the Z-test but only .50 for the Fisher Exact Test. Power curves were similar to Figure 3 for

smaller selection rates, and slightly steeper when SRT =.7.

Power was consistently higher for the 4/5ths rule than for any of the alternatives, but this

came at the cost of inflated false positive rates, as discussed above. Power for the Reverse One

rule was generally less than the 4/5ths rule, but greater than the significance tests. For example,

when the population IR = 0.1, SRT = .3, Pmin = .1 and N=50, power was .27 for the Z-test, .66 for

the Reverse One rule, and .96 for the 4/5ths rule.

Situations Without a Pre-determined Comparison Group.

Although it may often be appropriate to focus on predetermined minority and majority

groups, there may be situations in which the comparison group is determined based on the

sample data. A strict reading of the Uniform Guidelines defines adverse impact in terms of the

selection rate for the minority group compared to the group with the highest selection rate. Under

this approach, the comparison group would not be the same in all samples, a situation that is not

represented by any of the sampling models described above.

This approach introduces a second form of sampling error into the data. Because the

comparison group is determined based on the sample selection rates, sampling error influences

which groups are included in the analysis. Choosing the group with the highest selection rate

increases the chance that a difference will be found, even when the differences are solely due to

sampling error. Consequently, Type I error rates will be inflated.

A more appropriate approach for evaluating statistical significance in this situation would

be to conduct a single test comparing all groups. A multiple-group generalization of the Z-test

can be performed using the standard k x 2 (group x outcome) chi-square test (Fleiss, 1981). This

test evaluates the null hypothesis that the selection rates are equal in all subpopulations, and will be referred to here as the multiple-group chi-square.
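A minimal pure-Python sketch of this k x 2 Pearson chi-square (the three-group counts in the example are hypothetical):

```python
def multigroup_chisq(table):
    """Pearson chi-square for a k x 2 (group x fail/pass) frequency table.
    Tests the null hypothesis that selection rates are equal across groups."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum((obs - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))
    return stat, len(table) - 1          # df = (k - 1) * (2 - 1)

# Hypothetical (fail, pass) counts for three groups of 10 applicants each.
stat, df = multigroup_chisq([[7, 3], [5, 5], [4, 6]])
```

The statistic is referred to a chi-square distribution with k - 1 degrees of freedom; for three groups, the .05 critical value is 5.99.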


To explore this scenario, we replicated the simulation of a test with a fixed selection rate,

but with three groups. For this simulation, the population selection rates were the same across

groups, so significant results represent Type I errors. The simulation was repeated for sample

sizes of 20, 50 and 100, and selection rates of .1, .3, .5, and .7. The proportion of the population

from the focal minority group was set at .1, .2 or .3. The proportion from the second group was

equal to the first group, with the third group making up a larger proportion (.8, .6 or .4). The Z-

test and Fisher Exact Test were computed comparing the first minority group to the group with

the highest selection rate. The multiple-group chi-square test was computed using all three

groups.
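The simulation design can be illustrated with a simplified two-group version: draw repeated samples from a population in which both groups share the same selection rate, apply the one-tailed Z-test to each, and count rejections. All parameter values below are illustrative, and samples where the test is undefined are dropped, as in note 4.

```python
import math
import random

def simulate_type1(n=50, sel_rate=0.3, p_min=0.3, reps=2000, seed=1):
    """Monte Carlo estimate of the Z-test's Type I error rate when the two
    subpopulations share the same selection rate (no adverse impact)."""
    rng = random.Random(seed)
    rejections = usable = 0
    for _ in range(reps):
        n_min = sum(rng.random() < p_min for _ in range(n))   # minority count
        n_maj = n - n_min
        s_min = sum(rng.random() < sel_rate for _ in range(n_min))
        s_maj = sum(rng.random() < sel_rate for _ in range(n_maj))
        pool = (s_min + s_maj) / n
        if n_min == 0 or n_maj == 0 or pool in (0.0, 1.0):
            continue                       # Z is undefined; drop the sample
        se = math.sqrt(pool * (1 - pool) * (1 / n_min + 1 / n_maj))
        z = (s_min / n_min - s_maj / n_maj) / se
        usable += 1
        if 0.5 * math.erfc(-z / math.sqrt(2)) < 0.05:   # one-tailed at .05
            rejections += 1
    return rejections / usable
```

With 10,000 replications per cell, as in the study, the empirical rates stabilize near the values reported in Table 2.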

Table 5 presents the results for the Z-test, the Fisher Exact Test, and the multiple-group

chi-square. As expected, the Z-test had inflated Type I error rates under many conditions,

reaching as high as .13. The multiple-group chi-square test performed better than the Z-test, but

Type I error rates were slightly elevated under many conditions. However, when the smallest

expected frequency was greater than 2, the Type I error rates for the multiple-group chi-square

test did not exceed .07. Type I error rates for the Fisher Exact Test did not substantially exceed

the nominal alpha of .05; however, the test was often overly conservative, with Type I error rates

substantially below .05. Overall, the results support use of the multiple-group chi-square test.

Discussion

Current guidelines (e.g., OFCCP, 1993) recommend using a significance test combined

with the 4/5ths rule for adverse impact assessment. Based on the results of this study, the use of the Z-test in this context seems reasonably well justified. The Z-test provided a

good balance of maintaining the nominal Type I error rate and maximizing power. In contrast,

the Fisher Exact Test and Yates’ chi-square were overly conservative except for extremely large

N, and consequently had lower power under many conditions. Given its lower power to detect
true cases of adverse impact, recommendations to use the Fisher Exact Test for adverse impact

assessment (OFCCP, 1993; Siskin & Trippi, 2005) should be reconsidered.
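The conservatism of the exact test stems from the discreteness of the hypergeometric distribution: attainable p-values jump in steps, so the effective alpha usually falls well below .05. A minimal sketch of the one-tailed p-value (the counts in the example are hypothetical):

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher Exact Test p-value for the 2 x 2 table
    [[a, b], [c, d]], where a = minority selected, b = minority rejected,
    c = majority selected, d = majority rejected.  Sums hypergeometric
    probabilities over tables at least as extreme (fewer minority hires)."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    return sum(comb(row1, x) * comb(row2, col1 - x)
               for x in range(max(0, col1 - row2), a + 1)) / denom

# Hypothetical: 1 of 10 minority vs. 5 of 10 majority applicants selected.
p = fisher_one_tailed(1, 9, 5, 5)   # about .070, not significant at .05
```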

No significance test performed well when the sample size was extremely small. When the

smallest expected frequency was less than two, the Z-test was overly conservative in some cases,

and overly liberal in others. Upton’s chi-square test provided slightly better control of Type I error under the conditions where the Z-test was inflated, and maintained power equal to that of the Z-test. Overall, the results indicate Upton’s chi-square is a reasonable alternative to

the Z-test when N is small.
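Upton’s statistic is commonly described as the Pearson chi-square rescaled by (N - 1)/N; a sketch under that assumption (the example counts are hypothetical):

```python
def upton_chisq(a, b, c, d):
    """Upton's (1982) adjusted chi-square for the 2 x 2 table [[a, b], [c, d]].
    Assumes the usual description: Pearson chi-square scaled by (N - 1) / N."""
    n = a + b + c + d
    pearson = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    return pearson * (n - 1) / n   # compare to 3.84, the chi-square(1) critical value

u = upton_chisq(3, 7, 20, 20)   # slightly below the uncorrected Pearson value
```

Because the adjustment shrinks the Pearson statistic only by a factor of (N - 1)/N, the two tests diverge meaningfully only when N is very small, consistent with the results above.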

The results were generally consistent with past research on significance tests for 2 x 2 tables (Upton, 1982) and demonstrate that these findings generalize to several types of selection decisions (fixed cutoff, top-down, banding). At the same time, the simulations may not

have fully captured all aspects of real-world employment decisions. For example, the simulations

did not model situations involving deliberate attempts to minimize adverse impact. Targeted

recruiting efforts may change the composition of the applicant pool. Similarly, methods that give

preferential treatment to the minority group (e.g., selecting minorities first within a test score

band, as in Cascio, Outtz, Zedeck & Goldstein, 1991) will tend to reduce group differences in

selection rates. We would expect such affirmative action activities to primarily influence group

differences at the population level. Beyond their effect on power by reducing the effect size,

these practices should have little effect on the performance of statistical significance tests.

All of the significance tests had extremely low power under many of the conditions

investigated, highlighting an inherent limitation of statistical significance testing under these

conditions. If statistical significance testing is applied, we recommend one-tailed tests, both to gain power and because in most situations there is a clear expectation about which group will be disadvantaged by the selection procedure. Even with one-tailed tests, power
was found to be low under many conditions. As a result, alternative approaches to assessing

adverse impact should be considered, such as the use of confidence intervals on the adverse

impact ratio (Morris & Lobsenz, 2000), or pooling of results across samples to increase the

precision of the statistics that guide adverse impact decisions. Additional research is needed to

explore the usefulness and limitations of these strategies.
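One such alternative can be sketched as an approximate confidence interval on the adverse impact ratio based on the standard log risk-ratio variance. This is a generic delta-method sketch, not necessarily the specific interval derived by Morris and Lobsenz (2000); the counts are hypothetical.

```python
import math

def impact_ratio_ci(sel_min, n_min, sel_maj, n_maj, z_crit=1.96):
    """Approximate 95% confidence interval on the adverse impact ratio,
    using the usual delta-method variance of the log risk ratio."""
    p1, p2 = sel_min / n_min, sel_maj / n_maj
    se_log = math.sqrt((1 - p1) / (n_min * p1) + (1 - p2) / (n_maj * p2))
    log_ir = math.log(p1 / p2)
    return (math.exp(log_ir - z_crit * se_log),
            math.exp(log_ir + z_crit * se_log))

lo, hi = impact_ratio_ci(3, 10, 20, 40)   # wide interval around IR = .60
```

For these small samples the interval spans roughly .22 to 1.62, covering both severe adverse impact and parity, which makes the imprecision of the point estimate explicit.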


References

Aguinis, H. (2004). Test-score banding in human resource selection: Technical, legal,

and societal issues. Westport, CT: Praeger.

Barnard, G. A. (1947). Significance tests for 2 x 2 tables. Biometrika, 34, 123-138.

Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-

analytic matrix incorporating cognitive ability, alternative predictors, and job performance.

Personnel Psychology, 52, 561-589.

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical

Psychology, 31, 144-152.

Buonasera, A. K., Kuang, D., Dunleavy, E. M., & Mueller, L. (2006, April). The

implications of frequent appliers on adverse impact analyses. Poster presented at the Annual

Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.

Camilli, G. (1990). The test of homogeneity for 2 x 2 contingency tables: A review of

and some personal opinions on the controversy. Psychological Bulletin, 108, 135-145.

Camilli, G., & Hopkins, K. D. (1978). Applicability of chi-square to 2 x 2

contingency tables with small expected frequencies. Psychological Bulletin, 85, 163-167.

Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implication of

six methods of test score use in personnel selection. Human Performance, 4, 233-264.

Conover, W. J. (1974). Some reasons for not using the Yates continuity correction

on 2 x 2 contingency tables. Journal of the American Statistical Association, 69, 374-376.

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons.


Fleiss, J. L. (1994). Measures of effect size for categorical data. In H. Cooper & L. V.

Hedges (Eds.), The Handbook of Research Synthesis (pp. 245-260). NY: Russell Sage

Foundation.

Grizzle, J. E. (1967). Continuity correction in the χ2 test for 2 x 2 tables. American

Statistician, 21, 28-32.

Haber, M. (1986). An exact unconditional test for the 2 x 2 comparative trial.

Psychological Bulletin, 99, 129-132.

Haber, M. (1990). Comments on “The test of homogeneity for 2 x 2 contingency tables:

A review of and some personal opinions on the controversy” by G. Camilli. Psychological

Bulletin, 108, 146-149.

Hazelwood School District v. United States, 433 U.S. 299 (1977).

International Mathematical and Statistical Library (1984). User's Manual: IMSL Library,

Problem-solving software system for mathematical and statistical FORTRAN programming

(Vol. 3, ed. 9.2). Houston, TX: IMSL.

Kroll, N. E. A. (1989). Testing independence in 2 x 2 contingency tables. Journal of

Educational Statistics, 14, 47-79.

Morris, S. B., & Lobsenz, R. (2000). Significance tests and confidence intervals for the

adverse impact ratio. Personnel Psychology, 53, 89-111.

Morris, S. B., & Henry, M. S. (2000, April). Using Meta-Analysis to Estimate Adverse

Impact. Paper presented at the 15th Annual Conference of the Society for Industrial and

Organizational Psychology, New Orleans, LA.

Office of Federal Contract Compliance Programs (1993). Federal contract compliance

manual. Washington, D.C.: Department of Labor, Employment Standards Administration, Office

of Federal Contract Compliance Programs (SUDOC# L 36.8: C 76/993).


Paetzold, R. L., & Willborn, S. L. (1994). Statistics in discrimination: Using statistical

evidence in discrimination cases. Colorado Springs, CO: Shepard's/McGraw-Hill.

Roth, P. L., Bobko, P., & Switzer, F. S. (2006). Modeling the behavior of the 4/5ths rule

for determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91, 507-

522.

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score

adjustment in preemployment testing. American Psychologist, 49, 929-954.

Siskin, B. R., & Trippi, J. (2005). Statistical issues in litigation. In F. J. Landy (Ed.),

Employment Discrimination Litigation: Behavioral, Quantitative and Legal Perspectives. San

Francisco, CA: Jossey-Bass.

U.S. Equal Employment Opportunity Commission, Civil Service Commission,

Department of Labor, and Department of Justice (1978). Uniform guidelines on employee

selection procedures. Federal Register, 43, 38290-38315.

Upton, G. J. G. (1982). A comparison of alternative tests for the 2 x 2 comparative trial. Journal of the Royal Statistical Society A, 145, 86-105.

Yates, F. (1934). Contingency tables involving small numbers and the χ2 test. Journal of

the Royal Statistical Society Supplement, 1, 217-235.

Yates, F. (1984). Tests of significance for 2 x 2 contingency tables. Journal of the Royal Statistical Society A, 147, 426-463.


Notes
1. The choice of an appropriate reference population can have a substantial impact on

adverse impact results, and is often a point of debate in discrimination cases (e.g., Hazelwood v.

US, 1977). However, once a particular definition of the population has been adopted, this choice

will have no effect on the behavior of samples drawn from that population, and therefore will not

impact the conclusions drawn in this paper about the performance of alternate statistical tests.
2. A variety of methods have been proposed for using test score bands, which differ in how

the band width is determined and how applicants are chosen from within a band. The effect of

alternate banding techniques on adverse impact has been the issue of considerable research

(Aguinis, 2004), and will not be further explored here. We will consider only fixed bands where

everyone within the band moves on to be evaluated using other selection criteria.
3. When a large number of applicants repeatedly reapply for the same position, the samples

will not be completely independent, and the sampling models may not accurately represent the

variability of results across repeated test administrations (Buonasera, Kuang, Dunleavy, &

Mueller, 2006).
4. Under some conditions, samples were produced where the significance tests could not be computed. For example, if either SRT or Pmin is one or zero, the denominator of the Z, Yates, and Upton statistics

will be zero, and therefore, the statistic is not defined. Samples where the statistical tests could

not be computed were excluded from the analysis. Thus, empirical rejection rates were computed

as the proportion of the remaining samples that produced significant results.


Table 1

Cross-tabulated Frequency Table of Selection Outcomes

Reference     Fail Test/Not Selected    Pass Test/Selected    Total    Proportion

Minority      NFmin                     NPmin                 Nmin     Pmin
Majority      NFmaj                     NPmaj                 Nmaj     1 - Pmin
Total         NFT                       NPT                   N

Proportion    1 - SRT                   SRT



Table 2

Empirical Type I error rates of the Z-test (Z), Upton’s chi-square (U), Fisher Exact Test (F), Yates’ chi-square (Y), 4/5ths rule (4/5), and Reverse One rule (R1) for selection decisions with a fixed cut score.

10% Minority 30% Minority 50% Minority


N Z U F Y 4/5 R1 Z U F Y 4/5 R1 Z U F Y 4/5 R1
10% Selected
20 0 0 0 0.083 0.765 0.001 0.009 0.008 0 0.002 0.544 0.033 0.046 0.034 0.004 0.004 0.424 0.066
50 0.001 0.001 0 0.013 0.610 0.013 0.026 0.026 0.005 0.003 0.449 0.141 0.052 0.050 0.015 0.013 0.410 0.164
100 0.002 0.002 0 0.001 0.448 0.096 0.046 0.045 0.017 0.013 0.390 0.232 0.051 0.050 0.024 0.023 0.364 0.243
30% Selected
20 0.009 0.008 0 0.003 0.539 0.034 0.043 0.040 0.007 0.005 0.416 0.124 0.059 0.053 0.016 0.015 0.369 0.144
50 0.022 0.022 0.005 0.003 0.415 0.142 0.048 0.045 0.023 0.020 0.337 0.250 0.053 0.052 0.028 0.028 0.314 0.260
100 0.042 0.041 0.016 0.011 0.358 0.225 0.054 0.053 0.031 0.029 0.270 0.267 0.053 0.052 0.034 0.034 0.234 0.234
50% Selected
20 0.037 0.027 0.004 0.004 0.395 0.063 0.056 0.05 0.015 0.014 0.354 0.144 0.060 0.054 0.019 0.018 0.317 0.167
50 0.052 0.050 0.014 0.014 0.351 0.167 0.05 0.049 0.025 0.024 0.250 0.235 0.054 0.053 0.029 0.029 0.228 0.227
100 0.047 0.046 0.022 0.021 0.279 0.237 0.05 0.049 0.031 0.031 0.168 0.168 0.048 0.048 0.032 0.032 0.136 0.136
70% Selected
20 0.076 0.071 0.008 0.013 0.316 0.063 0.063 0.058 0.016 0.017 0.271 0.135
50 0.070 0.068 0.020 0.022 0.260 0.139 0.056 0.056 0.027 0.027 0.150 0.149
100 0.057 0.057 0.024 0.026 0.181 0.169 0.054 0.054 0.033 0.035 0.071 0.071

*Type I error was defined as the proportion of significant results out of 10,000 samples when the population had no adverse impact

(IR=1.0). The conditions with 50% minority and 70% selected were excluded from the simulation.

Table 3

Power of the Z-test (Z), Upton's chi-square (U), the Fisher Exact Test (F), Yates’ chi-square (Y), 4/5ths rule (4/5), and Reverse One rule (R1) when the population adverse impact ratio is 0.8.

10% Minority 30% Minority 50% Minority


N Z U F Y 4/5 R1 Z U F Y 4/5 R1 Z U F Y 4/5 R1
10% Selected
20 0 0 0 0.090 0.809 0.001 0.012 0.011 0.001 0.003 0.607 0.044 0.056 0.045 0.007 0.007 0.501 0.085
50 0 0 0 0.011 0.673 0.021 0.039 0.038 0.008 0.005 0.535 0.199 0.079 0.075 0.027 0.025 0.502 0.230
100 0.004 0.004 0.001 0 0.528 0.128 0.081 0.081 0.035 0.025 0.521 0.339 0.103 0.102 0.056 0.053 0.504 0.376
30% Selected
20 0.013 0.012 0.001 0.002 0.611 0.044 0.076 0.071 0.014 0.01 0.547 0.192 0.108 0.098 0.032 0.03 0.494 0.224
50 0.039 0.039 0.008 0.005 0.525 0.215 0.118 0.115 0.063 0.057 0.507 0.405 0.132 0.13 0.076 0.075 0.515 0.455
100 0.087 0.087 0.042 0.032 0.523 0.362 0.150 0.148 0.103 0.097 0.509 0.506 0.179 0.178 0.128 0.128 0.505 0.504
50% Selected
20 0.072 0.057 0.009 0.009 0.506 0.106 0.125 0.113 0.037 0.035 0.521 0.257 0.137 0.127 0.055 0.054 0.493 0.308
50 0.114 0.110 0.043 0.040 0.515 0.289 0.178 0.175 0.106 0.106 0.510 0.494 0.212 0.207 0.136 0.135 0.51 0.509
100 0.156 0.155 0.091 0.086 0.515 0.464 0.253 0.251 0.188 0.188 0.517 0.517 0.292 0.291 0.229 0.229 0.501 0.501
70% Selected
20 0.162 0.152 0.026 0.033 0.484 0.137 0.191 0.178 0.063 0.064 0.495 0.307
50 0.189 0.185 0.078 0.081 0.504 0.350 0.288 0.287 0.175 0.177 0.494 0.493
100 0.247 0.246 0.148 0.156 0.491 0.474 0.444 0.443 0.356 0.365 0.503 0.503

*Power was defined as the proportion of significant results out of 10,000 samples. The conditions with 50% minority and 70%

selected were excluded from the simulation.



Table 4

Power of the Z-test (Z), Upton's chi-square (U), the Fisher Exact Test (F), Yates’ chi-square (Y), 4/5ths rule (4/5), and Reverse One rule (R1) when the population adverse impact ratio is 0.1.

10% Minority 30% Minority 50% Minority


N Z U F Y 4/5 R1 Z U F Y 4/5 R1 Z U F Y 4/5 R1
10% Selected
20 0 0 0 0.090 0.973 0.002 0.047 0.043 0.004 0.004 0.938 0.129 0.296 0.254 0.065 0.066 0.922 0.356
50 0 0 0 0.010 0.947 0.043 0.248 0.245 0.077 0.054 0.958 0.634 0.670 0.66 0.417 0.398 0.977 0.843
100 0.015 0.013 0.002 0.001 0.935 0.358 0.696 0.694 0.498 0.429 0.988 0.944 0.919 0.918 0.839 0.828 0.997 0.986
30% Selected
20 0.046 0.043 0.005 0.005 0.939 0.131 0.512 0.503 0.213 0.165 0.972 0.729 0.843 0.827 0.633 0.620 0.99 0.927
50 0.267 0.264 0.086 0.06 0.956 0.664 0.926 0.924 0.851 0.833 0.998 0.993 0.996 0.996 0.989 0.989 1 1
100 0.741 0.739 0.531 0.456 0.991 0.956 0.999 0.999 0.996 0.996 1 1 1 1 1 1 1 1
50% Selected
20 0.352 0.304 0.076 0.077 0.945 0.409 0.893 0.879 0.705 0.694 0.994 0.948 0.998 0.997 0.987 0.987 1 0.999
50 0.756 0.748 0.499 0.480 0.983 0.882 0.999 0.999 0.996 0.996 1 1 1 1 1 1 1 1
100 0.963 0.963 0.912 0.905 0.999 0.993 1 1 1 1 1 1 1 1 1 1 1 1
70% Selected
20 0.746 0.729 0.309 0.362 0.968 0.609 0.997 0.996 0.977 0.978 0.999 0.98
50 0.948 0.944 0.822 0.835 0.996 0.954 1 1 1 1 1 1
100 0.997 0.997 0.990 0.992 0.999 0.998 1 1 1 1 1 1

*Power was defined as the proportion of significant results out of 10,000 samples. The conditions with 50% minority and 70%

selected were excluded from the simulation.



Table 5

Empirical Type I error rates of the Z-test (Z), Fisher Exact Test (F), and Multi-group chi-square test (MG) in a 3-group comparison.

10% in Focal Group 20% in Focal Group 30% in Focal Group

N Z F MG Z F MG Z F MG
10% Selected
20 0.027 0 0.066 0.030 0.001 0.048 0.045 0.002 0.040
50 0.024 0.001 0.050 0.040 0.004 0.047 0.078 0.013 0.056
100 0.039 0.004 0.043 0.069 0.018 0.051 0.086 0.034 0.065
30% Selected
20 0.074 0.002 0.051 0.081 0.008 0.058 0.092 0.015 0.061
50 0.073 0.010 0.050 0.086 0.026 0.059 0.090 0.038 0.068
100 0.078 0.025 0.054 0.084 0.041 0.065 0.085 0.052 0.066
50% Selected
20 0.119 0.004 0.048 0.104 0.019 0.065 0.110 0.023 0.071
50 0.096 0.022 0.064 0.087 0.033 0.068 0.087 0.039 0.066
100 0.088 0.033 0.061 0.086 0.046 0.064 0.087 0.055 0.070
70% Selected
20 0.183 0.013 0.111 0.129 0.020 0.086 0.117 0.019 0.077
50 0.112 0.022 0.070 0.091 0.029 0.069 0.094 0.039 0.069
100 0.094 0.031 0.066 0.090 0.045 0.068 0.096 0.054 0.069

*Type I error was defined as the proportion of significant results out of 10,000 samples when the population had no adverse impact

(IR=1.0).
Figure Captions

Figure 1

Type I error rates for alternate test statistics with a fixed cut score.

Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher =

Fisher Exact Test; R1 = Reverse One rule.

Figure 2

Power of adverse impact tests with a fixed cut score when IR=.8 and SRT=.5.

Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher =

Fisher Exact Test; R1 = Reverse One rule.

Figure 3

Power of adverse impact tests with a fixed cut score when IR=.1 and SRT=.5.

Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher =

Fisher Exact Test; R1 = Reverse One rule.

Figure 4

Type I error rates in 3-group comparison.

Note: Z = Z-test; Fisher = Fisher Exact Test; MG = Multi-group chi-square test.


[Figure 1: line graph of Type I error rate (0 to .30) against minimum expected N (0 to 30) for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 2: line graph of power (0 to 1) against minimum expected N (0 to 30) for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 3: line graph of power (0 to 1) against minimum expected N (0 to 30) for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 4: line graph of Type I error rate (0 to .20) against minimum expected N (0 to 30) for Z, Fisher, and MG.]
