Michael W. Collins
Scott B. Morris
DOI: 10.1037/0021-9010.93.2.463
This article may not exactly replicate the final version published in the APA journal. It is not the copy of record.
http://psycnet.apa.org/journals/apl/93/2/463/
Adverse Impact Tests 2
Testing for Adverse Impact When Sample Size is Small
Abstract
Adverse impact evaluations often call for evidence that the disparity between groups in
selection rates is statistically significant, and practitioners must choose which test statistic to
apply in this situation. To identify the most effective testing procedure, several alternate test
statistics were compared in terms of Type I error rates and power, focusing on situations with
small samples. Significance testing was found to be of limited value due to low power for all
tests. Among the alternate test statistics, the widely-used Z-test on the difference between two
proportions performed reasonably well, except when sample size was extremely small. A test
suggested by Upton (1982) provided slightly better control of Type I error under some
conditions, but generally produced results similar to the Z-test. Use of the Fisher Exact Test and the Yates continuity-corrected chi-square test is not recommended, due to overly conservative Type I error rates.
Adverse impact analyses play a central role in many employment discrimination lawsuits, and
have become a regular part of the evaluation of employee selection procedures. Consequently, it
is important that adverse impact analyses are based on the best statistical procedures available.
The most common approach for evaluating adverse impact is based on the 4/5ths rule
(U.S. Equal Employment Opportunity Commission, 1978). A limitation of the 4/5ths rule is that it
does not take into account the potential impact of sampling error (Morris & Lobsenz, 2000).
When sample size is small, the 4/5ths rule will often identify cases of adverse impact even when
selection rates are equal in the population (Roth, Bobko & Switzer, 2006).
In order to account for sampling error, the 4/5ths rule can be supplemented with a test of
statistical significance (Roth et al., 2006). For large samples, significance can be tested using a
Z-test on the difference between two independent proportions, often referred to as the “2 SD
rule” [Office of Federal Contract Compliance Programs (OFCCP) Compliance Manual, 1993].
When the sample size is small, the Fisher Exact Test is often recommended as a more appropriate alternative.
The choice between the Fisher Exact Test and the Z-test (or the equivalent chi-square
test) has long been debated (Camilli, 1990; Haber, 1990; Upton, 1982). Although there are clear
advantages to using an exact test, the Fisher Exact Test is based on a different theoretical model
than the Z-test (Kroll, 1989; Siskin & Trippi, 2005), and each will be appropriate in different
situations. In this paper, we review the statistical models underlying these tests and discuss their
applicability to adverse impact analysis. In addition, we report simulations of the Type I error rates and statistical power of these tests across a range of small-sample conditions.
A determination of adverse impact does not rest solely on statistical evidence. Courts may consider a variety of factors to determine whether a prima
facie case of discrimination has been made. The Uniform Guidelines recommend that adverse
impact statistics be interpreted in light of the hiring organization’s recruiting practices that
encourage or discourage minority applicants. In addition, when sample size is small, the Uniform
Guidelines suggest that adverse impact statistics might be supplemented with data from other
similar jobs or for the same job across time. The current research only addresses methods for
evaluating statistical evidence, and therefore does not fully model the process of determining
whether adverse impact exists. However, because statistical evidence usually plays a central role
in such decisions, identifying the most effective statistical tools is essential to ensure accurate
decisions.
Adverse impact may be evaluated at any stage of the selection process, or based on the outcome of a decision related to selection, promotion or other employment action. For simplicity, we will use the term 'selection rate' to refer to the rate of favorable outcomes for each group, regardless of the type of decision.
Adverse impact statistics are typically based on a comparison of selection rates for two
predetermined groups (referred to here as the minority and majority groups). Although there may
be more than two subgroups in the applicant pool (e.g., multiple ethnic groups), the context in
which the analysis is conducted (e.g., a claim of discrimination) typically identifies the relevant
minority and majority groups. Therefore, in developing a model for adverse impact data, we will
focus on the selection rates for the two groups of interest, and ignore data that might be present
for other groups. In a later section we will discuss the implications of assessing adverse impact in situations involving more than two groups.
Adverse impact data can be represented as a 2 x 2 contingency table, as illustrated in Table 1. In this table, NPmin, NPmaj, NFmin and NFmaj reflect the number of
applicants in each cell, Nmin, Nmaj, NPT and NFT are the marginal totals, and N is the total number
of applicants. Pmin represents the marginal proportion of minority applicants, and SRT reflects the overall selection rate (NPT / N).
The most widely recognized procedure for detecting adverse impact is the 4/5ths rule
outlined in the Uniform Guidelines. The basic statistic for the 4/5ths rule is the impact ratio (IR),
which is defined as the selection rate for the minority group divided by the selection rate for the
majority group. Adverse impact is indicated when this ratio is less than four-fifths. The impact ratio
is an index of effect size; it was intended to direct the attention of regulatory agencies toward
settings with substantial disparities in selection outcomes. It is not based on a formal sampling
model and does not assess the impact of sampling error on the result.
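To make the computation concrete, the impact ratio and the 4/5ths criterion can be evaluated directly from the cell counts in Table 1. The sketch below uses hypothetical counts, not data from the simulations reported here:

```python
# Illustrative computation of the impact ratio and the 4/5ths rule.
# NP = number passing, N = number of applicants, for each group.

def impact_ratio(np_min, n_min, np_maj, n_maj):
    """Minority selection rate divided by majority selection rate."""
    return (np_min / n_min) / (np_maj / n_maj)

ir = impact_ratio(np_min=8, n_min=20, np_maj=30, n_maj=50)  # hypothetical counts
print(round(ir, 3))  # 0.667
print(ir < 4 / 5)    # True -> adverse impact indicated by the 4/5ths rule
```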
The Uniform Guidelines also suggest a heuristic method of assessing the impact of chance
on adverse impact results in small samples, which we will call the Reverse One rule (Roth et al.,
2006, referred to this as the “N of 1” or “flip-flop” rule). The Guidelines state that adverse impact
would not be found when, “…the selection of one different person for one job would shift the result
from adverse impact against one group to a situation in which that group has a higher selection rate
than the other group.” The adjusted impact ratio (IRadj) suggested by this strategy is,
IRadj = [(NPmin + 1) / Nmin] / [(NPmaj - 1) / Nmaj] .    (1)
Applying this strategy, a finding of adverse impact would require that: (1) the impact ratio (IR) is
less than 4/5, and (2) that the adjusted impact ratio (IRadj) is less than 1.0. Although this approach
does assess the impact of chance on the result, it is not based on a formal statistical model, and it is not clear how well it controls decision errors.
A more formal approach is to test the difference in selection rates for statistical significance. The most common test is the Z-test for the difference between proportions (OFCCP, 1993), or equivalently,
the Pearson chi-square test for independence in a 2 x 2 table. The Z-test evaluates the null
hypothesis of equal population selection rates for the two groups being compared,
Z = (NPmin / Nmin - NPmaj / Nmaj) / sqrt[ SRT (1 - SRT) / ( N Pmin (1 - Pmin) ) ] .    (2)
Also known as the 2-SD test, Z is considered significant if the difference is more than roughly
two standard deviations above or below zero (or more precisely, |Z| > 1.96, corresponding to a
two-tailed alpha = .05).
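A minimal implementation of this Z-test, again with hypothetical cell counts:

```python
from math import sqrt

def z_test(np_min, n_min, np_maj, n_maj):
    """Z-test on the difference between two independent proportions,
    using the pooled standard error."""
    n = n_min + n_maj
    sr_t = (np_min + np_maj) / n   # overall selection rate
    p_min = n_min / n              # proportion of minority applicants
    diff = np_min / n_min - np_maj / n_maj
    se = sqrt(sr_t * (1 - sr_t) / (n * p_min * (1 - p_min)))
    return diff / se

z = z_test(8, 20, 30, 50)  # hypothetical counts; z is about -1.52
```

Because the standard error is pooled, this statistic is identical to the familiar two-proportion z-test with 1/n1 + 1/n2 in the denominator.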
The Z-test is based on large-sample theory, and may not be appropriate in small samples.
Specifically, when any of the cell frequencies are less than five, the ability of the test to achieve
the nominal Type I error rate is questionable. When Pmin and SRT are small, this value may be
less than five, even for moderately large N. For example, when the minority group makes up 10% of the applicant pool, and 10% of the applicants pass the test, the total sample size must reach 500 before the smallest expected cell frequency reaches five.
A number of alternate test statistics have been suggested for situations where the sample
size is too small for the Z-test. The most common is the Fisher Exact Test (Kroll, 1989; OFCCP,
1993; Siskin & Trippi, 2005). The Fisher Exact Test provides the exact probability of obtaining
the observed frequency table (or one more extreme) under the null hypothesis, with the
additional assumption that the marginal frequencies are fixed (Fleiss, 1981).
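Because the hypergeometric probabilities involve only binomial coefficients, the one-tailed Fisher Exact Test can be computed directly. The sketch below sums the probability of a minority pass count as small as or smaller than the one observed, given fixed margins:

```python
from math import comb

def fisher_exact_one_tailed(np_min, n_min, np_maj, n_maj):
    """One-tailed Fisher Exact Test: P(minority passes <= observed)
    under independence, with all marginal totals fixed."""
    npt = np_min + np_maj   # total number passing (fixed margin)
    n = n_min + n_maj

    def hyper(k):           # hypergeometric point probability
        return comb(n_min, k) * comb(n_maj, npt - k) / comb(n, npt)

    return sum(hyper(k) for k in range(max(0, npt - n_maj), np_min + 1))

p = fisher_exact_one_tailed(8, 20, 30, 50)  # hypothetical counts
```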
Another statistic often recommended for small samples is the chi-square test with Yates’
continuity correction (Camilli & Hopkins, 1978). Although not an exact test, Yates’ correction
has been shown to provide a close approximation to the Fisher Exact Test (Fleiss, 1981), while being much simpler to compute.
Upton (1982) suggested another alternative. Based on the known bias in the Pearson chi-square test, Upton proposed replacing N with N - 1, yielding the adjusted statistic

U2 = (N - 1)(NPmin NFmaj - NPmaj NFmin)^2 / [ (Nmin)(Nmaj)(NPT)(NFT) ] .    (3)
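Both adjusted statistics are simple functions of the four cell counts. In the sketch below, rows are groups and columns are pass/fail, so a = NPmin, b = NFmin, c = NPmaj, d = NFmaj:

```python
def _denom(a, b, c, d):
    """Product of the four marginal totals of the table [[a, b], [c, d]]."""
    return (a + b) * (c + d) * (a + c) * (b + d)

def pearson_chi2(a, b, c, d):
    """Uncorrected Pearson chi-square (equal to the Z statistic squared)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / _denom(a, b, c, d)

def yates_chi2(a, b, c, d):
    """Yates' continuity-corrected chi-square."""
    n = a + b + c + d
    adj = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * adj ** 2 / _denom(a, b, c, d)

def upton_u2(a, b, c, d):
    """Upton's (1982) adjustment: N replaced by N - 1 in the numerator."""
    n = a + b + c + d
    return (n - 1) * (a * d - b * c) ** 2 / _denom(a, b, c, d)
```

Note that Upton's statistic is simply (N - 1)/N times the Pearson chi-square.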
One- and Two-tailed tests. Tests of statistical significance can be either directional (i.e.,
one-tailed) or non-directional (two-tailed). Adverse impact statistics are often evaluated using
two-tailed significance levels (e.g., OFCCP, 1993), even though the hypothesis is usually
directional. Typically, the focal minority group will be determined by a claim of discrimination
or a history of under-representation in the workforce, and the purpose of the test is to identify
potential discrimination against this group. In such situations, finding a higher selection rate for
the minority than for the majority group would typically not be interpreted as an indication of
discrimination. Therefore, one-tailed significance levels are appropriate (Paetzold & Willborn,
1994).
There may be settings in which adverse impact statistics are evaluated in a more
exploratory fashion, without predetermined minority and majority groups, and a non-directional significance test should be used in these situations. The use of one-tailed significance tests will
generally increase statistical power for all significance tests, and will have little effect on the
relative performance of alternate tests. The current research will focus only on directional tests.
It is important to note that tests based on the chi-square distribution are inherently non-
directional. Chi-square tests are computed using the sum of squared differences, and therefore,
differences in either direction will yield a positive chi-square. In order to conduct a directional
test, the probability obtained from the standard chi-square test should be halved. Alternatively, the critical chi-square value can be set using double the desired significance level (i.e., alpha = .10 for a directional test at alpha = .05).
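For a 1-df statistic this conversion can be written directly, using the identity between the chi-square (1 df) and squared-normal distributions. The function below is an illustration, not part of any cited procedure:

```python
from math import erfc, sqrt

def directional_chi2_p(chi2_stat, diff_sign):
    """One-tailed p-value from a 1-df chi-square statistic.

    diff_sign is the sign of (minority rate - majority rate); the
    two-tailed p is halved only when the disparity is in the
    hypothesized (negative) direction."""
    p_two = erfc(sqrt(chi2_stat / 2))  # P(chi-square with 1 df > stat)
    return p_two / 2 if diff_sign < 0 else 1 - p_two / 2
```

Equivalently, comparing the statistic to the two-tailed alpha = .10 critical value (about 2.706) yields a directional test at alpha = .05.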
Statisticians have long debated the appropriateness of alternate test statistics for assessing
significance in 2 x 2 contingency tables. The most appropriate statistic depends on the theoretical
sampling model, that is, how a particular sample contingency table is generated from the
population. Historically, the test for association in a 2 x 2 contingency table has been expressed
in terms of one of three sampling models (Barnard, 1947; Fleiss, 1981; Kroll, 1989).
In personnel selection, applicants can be viewed as a sample from a population of potential applicants. The relevant population will be defined by a variety of
factors, including the geographical region of the employment opportunity, the employer’s
recruiting practices and the minimum qualifications established for the job.1 A group of job
applicants are conceptualized as a sample from this population, because some qualified
individuals, either due to lack of awareness or convenience, will not attend a particular
administration of the selection procedure. Alternately, the sample of applicants could be viewed as the individuals who take an exam at a particular point in time, and the population as applicants across all administrations of that exam.
Independence Trial. In the first model, the marginal proportions are assumed to be fixed.
That is, Pmin and SRT are not sample estimates, but rather define the population to which
inferences will be generalized. In this model, the data are not viewed as a random sample from a
larger population. Instead, the probability of the result is considered among alternate random arrangements of the same individuals. The classic example is a randomized experiment, where participants are randomly assigned to one of two treatments and then observed on a
dichotomous outcome. Under the null hypothesis that the two variables (treatment group and
outcome) are independent, the number of successes for each treatment depends only on which
individuals were assigned to each group. If a different random assignment were applied to the
same group of participants, the results of the 2 x 2 table would differ. Tests based on this model
evaluate the probability that the observed result could have occurred due to random assignment,
assuming that the two variables are independent. Under this sampling model, the cell frequencies
will have a hypergeometric distribution. This model led to the development of the Fisher Exact
Test, which is closely approximated by the Yates (1934) correction to the chi-square test for
independence.
Comparative Trial. In the second model, the two groups are viewed as independent random samples from two distinct populations (e.g., minority and majority). The proportion sampled from each population is fixed. In other words, the marginal proportion on one variable (e.g., Pmin) is
assumed to be constant across replications. The second marginal proportion (SRT) is estimated
from the sample data. Assuming independence, each row of Table 1 would have a binomial
distribution, with parameters Ni and SRT. This model is the basis for the Z-test on independent
proportions.
Double Dichotomy. In the third model, participants are viewed as a random sample from a population that is characterized by two dichotomous variables (e.g., group membership and selection outcome). Neither marginal proportion is fixed, so the proportion in each group can vary across samples. Both population marginal proportions are
therefore unknown and must be estimated from the sample data. Assuming independence, the
expected value of each cell frequency will be N times the product of the marginal proportions, and the cell frequencies will follow a multinomial distribution.
Model for adverse impact. Because the choice of test statistic depends on the sampling
model (Siskin & Trippi, 2005), it is important to identify the model that best describes typical
adverse impact data. This choice is complicated by the variety of approaches used to make
selection decisions. No single model will be appropriate for all situations in which adverse
impact is evaluated.
One approach to making selection decisions uses a fixed cutoff score. The passing score
might be established by the test developers in reference to normative data, or based on the
minimum qualifications required for a particular job. This situation is best represented by the
Double Dichotomy model. The population consists of all potential applicants, who can be
classified on minority/majority status and whether they would pass or fail the cut score. The
available data represent a sample from this population. Neither the proportion of minority
applicants nor the selection rate involve purposive sampling or random assignment, and sample
estimates of both variables are expected to differ from their population values. If multiple
samples were taken, they would not be expected to have exactly the same selection rate or
proportion of minority applicants. Therefore, the situation is best modeled as a random sample
with two unknown marginal proportions, as reflected in the Double Dichotomy model.
For other approaches to selection, the Double Dichotomy model may not be appropriate.
In top-down selection, candidates are ranked on their test scores (or based on a composite score
from a battery of tests), and then selected from the top down until a fixed number of positions are
filled. In this situation, the selection rate is fixed based on the employer’s staffing needs. If a
different sample had been used, the number passing would have been the same. However, Pmin is
likely to vary across samples and is best treated as an estimate of an unknown population
parameter. Further, because the selection decision depends on the rank position of applicants in a
particular sample, the selected and non-selected groups are sample-specific, and do not reflect
two distinct populations as in the Comparative Trial model. In fact, none of the three sampling models exactly matches the top-down selection process.
Another common approach to making selection decisions uses score bands. A band refers
to a range of test scores that are treated as equivalent for the purpose of a selection decision. All
individuals within the top band or bands are considered to have passed the test, and move on to
be evaluated using other selection criteria2. Because bands are often defined using a fixed range
of scores below the highest score in the sample, the position of the band and the number of
people in the top band(s) will vary across samples. Neither marginal would be fixed. Further,
whether an individual passes depends on his or her position relative to the highest score in the
sample, and the selection rate cannot be considered a characteristic of the population. None of the three sampling models fully describes selection based on score bands.
In some situations, it is difficult to conceptualize a population from which the data are a
random sample. For example, when evaluating a promotion decision, the pool of candidates is
relatively fixed. If the decision were repeated at a different point in time, the set of candidates
under consideration would be mostly the same. In such cases, probabilities based on randomly
sampling from a population, as in the Comparative Trial and Double Dichotomy models, would be difficult to justify. Similarly, because candidates are not randomly assigned to groups, probabilities based on random reassignment, as in the Independence Trial model, would not be appropriate. Without some theoretical process for
producing different patterns of data (e.g., random sampling or random reassignment), statistical significance tests have no clear interpretation.
Several simulation studies have compared the performance of the alternate test statistics for 2x2 contingency tables (Kroll, 1989; Upton, 1982).
A clear result from these studies is that the choice of a test depends on the sampling model.
Under the Independence Trial model, the Fisher Exact Test has Type I error rates closest to the
nominal alpha level, closely followed by Yates’ test, while the Z-test and Upton’s chi-square can
have excessive Type I error rates when expected cell frequencies are less than 5. In contrast,
when the data are produced by the Double Dichotomy or Comparative Trial models, both the
Fisher Exact Test and Yates’ test tend to have overly conservative Type I error rates. The Z-test
generally produces Type I error rates closer to the nominal alpha level, but has inflated Type I
error rates under some conditions. Upton's test also tends to be slightly inflated under some conditions, though generally less so than the Z-test.
Because the Independence Trial model does not represent typical personnel selection
data, there is reason to question the appropriateness of the Fisher Exact Test and Yates’ test for
adverse impact analysis. The tendency of these tests to be conservative under the other sampling
models indicates that the Fisher Exact Test and Yates’ test will be less likely than other tests to
identify true cases of adverse impact. For selection decisions based on a fixed cut score, which
can be represented by the Double Dichotomy model, past research suggests that the Z-test and
Upton’s test will perform well. However, for other decision types (e.g., banding or top-down
selection), none of the sampling models are a perfect fit, and it is unclear whether the results of past research will generalize to these settings.
A series of Monte Carlo simulations were conducted to identify rejection rates for each of
the adverse impact detection methods: the Z-test on the difference between proportions (Z),
Upton’s adjusted chi-square test (U2), the Fisher Exact Test (F), Yates’ correction to the chi
square test (Y2), the 4/5ths rule (4/5), and the Reverse One rule (R1). Separate simulations were
used to represent different types of selection decisions described above (fixed cutoff, top-down
selection, test-score banding). Overall, the different simulations produced very similar results.
For the sake of brevity, only the fixed cutoff will be discussed in detail here. Full results for the other selection models are available from the authors.
Method
Data were generated based on the Double Dichotomy model. A Fortran 90 program
generated 10,000 random samples consisting of frequencies in each of the four cells (i.e.,
minority vs. majority and pass vs. fail). Sample frequency distributions were generated using a
multinomial pseudo random number generator (RMNTN) from the International Mathematical and Statistical Library (IMSL, 1984).
The multinomial sampling distribution depends on the sample size (N) and the proportion
of the population in each cell. Rather than specifying these proportions directly, we specified
parameters that are more easily interpreted in terms of personnel selection decisions (IR, SRT and
Pmin), and then computed the population cell proportions from these parameters. The simulation
was repeated for a wide range of realistic levels for each of these parameters. We chose three
levels of IR, representing extreme (IR = .1), small (IR = .8), and no adverse impact (IR = 1). SRT
was set at .1, .3, .5, and .7 to represent a wide range of selection settings and labor market
conditions. Pmin was set at .1, .3 and .5. Conditions with SRT=.7 and Pmin=.5 were excluded from
the simulation, because some values of IR are not possible under these conditions. Sample size
(N) values were chosen to represent a range of relatively small sample sizes (N = 20, 50, or 100).
The choice was made to focus on small sample size conditions because these are the conditions
under which the choice of significance test is likely to make the most difference.
For each sample, each of the test statistics was computed as described above. The Fisher
Exact Test was obtained using the IMSL (1984) DCTTWO routine. The rejection rate was
defined as the proportion of samples4 in which the statistic was significant at the alpha = .05 level
(one-tailed). The test was considered significant only if the minority SR was smaller than the
majority SR. Tests based on the chi-square distribution (Y2 and U2) were considered significant
if the test statistic was larger than the critical chi-square value with 1 df and alpha = .10. Because
results in the non-hypothesized direction were ignored, this corresponds to a directional alpha = .05.
Rejection rates correspond to Type I error rates for IR=1.0, and to power when IR = 0.8 or IR =
0.1.
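The simulation design can be sketched compactly in modern code. The version below mirrors the logic described above (multinomial sampling of the four cells, one-tailed Z-test at alpha = .05) but is a simplified illustration, not the original Fortran 90 program:

```python
import random
from math import sqrt

def cell_probs(ir, sr_t, p_min):
    """Population cell proportions computed from IR, SRT and Pmin."""
    sr_maj = sr_t / (p_min * ir + 1 - p_min)
    sr_min = ir * sr_maj
    return [p_min * sr_min, p_min * (1 - sr_min),
            (1 - p_min) * sr_maj, (1 - p_min) * (1 - sr_maj)]

def rejection_rate(ir, sr_t, p_min, n, reps=2000, seed=1):
    """Proportion of samples where the one-tailed Z-test rejects at .05."""
    rng = random.Random(seed)
    probs = cell_probs(ir, sr_t, p_min)
    rejections = 0
    for _ in range(reps):
        cells = [0, 0, 0, 0]  # [min pass, min fail, maj pass, maj fail]
        for k in rng.choices(range(4), weights=probs, k=n):
            cells[k] += 1
        a, b, c, d = cells
        if min(a + b, c + d, a + c, b + d) == 0:
            continue          # degenerate table: no test possible
        sr_all = (a + c) / n
        se = sqrt(sr_all * (1 - sr_all) / (n * ((a + b) / n) * ((c + d) / n)))
        z = (a / (a + b) - c / (c + d)) / se
        rejections += z < -1.645   # one-tailed alpha = .05
    return rejections / reps
```

Under the null (IR = 1), the returned rate estimates the Type I error rate; under IR < 1 it estimates power.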
Results
Type I Error Rates. Empirical Type I error rates for the alternate test statistics are
presented in Table 2. The accuracy of each approach depended on the sample size, proportion of minority applicants and overall selection rate. The combined influence of these three factors can be summarized to a large extent by the expected frequency of the smallest cell, which is a function of N, Pmin and SRT.
As in past research (Roth et al., 2006), the 4/5ths rule had very high false positive rates.
When the smallest expected frequency was less than five, false positive rates were often above .4
and reached as high as .77. Even for larger N, false positive rates were often over .2.
Figure 1 summarizes the false positive (Type I error) rates for the different significance
tests, and for the Reverse One rule. Applying the Reverse One rule avoided the most excessive
false positive rates found with the 4/5ths rule. When the minimum expected frequency was five
or less, false positive rates for the Reverse One rule were consistently less than .25, considerably
lower than with the 4/5ths rule. For larger N, results for the Reverse One method were similar to
the 4/5ths rule. Although it performed better than the 4/5ths rule, the high false positive rates for the Reverse One rule limit its usefulness as a safeguard against chance results.
Across conditions, the Z-test and Upton’s test produced very similar results. When the
smallest expected frequency was greater than five, the empirical Type I error rates tended to
range from .047 to .056, which is a negligible deviation from the nominal alpha level of .05.
Furthermore, as long as the minimum expected frequency was greater than two, the empirical
Type I error rates for the Z and Upton’s tests ranged from .04 to .06. This range is slightly larger
than the level of error recommended in past research (Bradley, 1978); however, these values still fall reasonably close to the nominal alpha level.
When the minimum expected frequency was less than two, the Z-test and Upton’s test
performed similarly, with the Z-test slightly more liberal than Upton’s test. Both tests
consistently showed inflated Type I error rates when the selection rate was high. When the
proportion of minorities and the selection rate were both low, Type I error rates were overly
conservative (.026 or lower). Under other conditions both tests approached nominal Type I error
rates. In the few cases where the Z-test produced excessive rejection rates, Upton’s test was also
liberal, but to a lesser degree. For example, in the worst case, where the rejection rate for the Z-
test was .076, the rejection rate for Upton’s was slightly lower at .071.
The Fisher Exact Test provided better control for Type I errors, with empirical rejection rates never exceeding .05. However, this control of Type I error occurred
because the test is overly conservative. The empirical rejection rates were consistently lower than
.03. The empirical rejection rates became smaller as the minimum expected frequency decreased,
reaching levels below .01. In terms of accurately reflecting the nominal Type I error rate, the
Fisher Exact Test was less accurate than the Z-test or Upton’s test for all of the conditions
investigated.
Results for Yates’ test were generally quite close to the Fisher Exact Test, except when
N, Pmin and SRT were all small, in which case the Type I error rate for Yates’ test was excessive
(.083). It should be noted that this anomalous result was found only with the fixed cutoff
scenario, and was not replicated in simulations based on top-down selection or banding-based
decisions.
Power. Generally, all tests became more powerful as sample size increased and as the
proportion of minorities and the selection rate increased. When the degree of adverse impact in
the population was small (IR=.8), all tests had extremely low power across conditions (see Table
3). Figure 2 illustrates the power curve for each test statistic as a function of the minimum
expected frequency when SRT=.5. Even under the best conditions (N=100, Pmin=.5, SRT=.5)
power was only .29 for the Z-test and .23 for the Fisher Exact Test. Power curves were nearly
identical for the Z-test and Upton's test, and were consistently lower for the Fisher Exact Test
and Yates' corrected chi-square. However, differences between the statistics were generally
small, with power for Z and Upton's test ranging from .06 to .09 higher than Yates' test and the
Fisher Exact Test. Power curves were similar to Figure 2 for lower selection rates, and power was slightly lower overall.
Power differences were larger when a large degree of adverse impact was present in the
population (IR=.1). As shown in Table 4, the Z-test and Upton’s test were uniformly more
powerful than Yates’ chi-square and the Fisher Exact Test. Figure 3 displays power curves for
each statistic as a function of the minimum expected frequency when SRT=.5. For the Z- and
Upton tests, power greater than .8 was found as long as the minimum expected frequency was
greater than 3. For the Fisher Exact Test and Yates' chi-square, a minimum expected frequency
greater than 4 guaranteed power of .8. Differences between the alternatives were greatest when
power was moderate. For example, when the minimum expected frequency was 2.5, power was .75 for the Z-test but only .50 for the Fisher Exact Test. Power curves were similar to Figure 3 for other selection rates.
Power was consistently higher for the 4/5ths rule than for any of the alternatives, but this
came at the cost of inflated false positive rates, as discussed above. Power for the Reverse One
rule was generally less than the 4/5ths rule, but greater than the significance tests. For example,
when the population IR = 0.1, SRT = .3, Pmin = .1 and N=50, power was .27 for the Z-test, .66 for
the Reverse One rule, and .96 for the 4/5ths rule.
Multiple groups. Although adverse impact analyses typically involve two predetermined groups, there may be situations in which the comparison group is determined based on the
sample data. A strict reading of the Uniform Guidelines defines adverse impact in terms of the
selection rate for the minority group compared to the group with the highest selection rate. Under
this approach, the comparison group would not be the same in all samples, a situation that is not reflected in any of the sampling models described above.
This approach introduces a second form of sampling error into the data. Because the
comparison group is determined based on the sample selection rates, sampling error influences
which groups are included in the analysis. Choosing the group with the highest selection rate
increases the chance that a difference will be found, even when the differences are solely due to chance.
A more appropriate approach for evaluating statistical significance in this situation would
be to conduct a single test comparing all groups. A multiple-group generalization of the Z-test
can be performed using the standard k x 2 (group x outcome) chi-square test (Fleiss, 1981). This
test evaluates the null hypothesis that the selection rates are equal in all subpopulations. This test was examined in an additional simulation, conducted in the same manner as described above, but with three groups. For this simulation, the population selection rates were the same across
groups, so significant results represent Type I errors. The simulation was repeated for sample
sizes of 20, 50 and 100, and selection rates of .1, .3, .5, and .7. The proportion of the population
from the focal minority group was set at .1, .2 or .3. The proportion from the second group was
equal to the first group, with the third group making up a larger proportion (.8, .6 or .4). The Z-
test and Fisher Exact Test were computed comparing the first minority group to the group with
the highest selection rate. The multiple-group chi-square test was computed using all three
groups.
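A minimal version of the k x 2 chi-square statistic used here (an illustration, not the Fleiss, 1981, presentation):

```python
def kx2_chi2(pass_counts, totals):
    """Chi-square test of equal selection rates across k groups.

    pass_counts[i] and totals[i] give passes and applicants for group i.
    Under the null, the statistic has k - 1 degrees of freedom."""
    n = sum(totals)
    sr_all = sum(pass_counts) / n
    stat = 0.0
    for p, t in zip(pass_counts, totals):
        for observed, expected in ((p, t * sr_all), (t - p, t * (1 - sr_all))):
            stat += (observed - expected) ** 2 / expected
    return stat

stat = kx2_chi2([8, 9, 30], [20, 20, 50])  # hypothetical three-group data
```

With k = 2 the statistic reduces to the ordinary Pearson chi-square (the square of the Z statistic).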
Table 5 presents the results for the Z-test, the Fisher Exact Test and the multiple group
chi-square. As expected, the Z-test had inflated Type I error rates under many conditions,
reaching as high as .13. The multiple-group chi-square test performed better than the Z-test, but
Type I error rates were slightly elevated under many conditions. However, when the smallest
expected frequency was greater than 2, the Type I error rates for the multiple-group chi-square
test did not exceed .07. Type I error rates for the Fisher Exact Test did not substantially exceed
the nominal alpha of .05; however, the test was often overly conservative, with Type I error rates
substantially below .05. Overall, the results support use of the multiple-group chi-square test.
Discussion
Current guidelines (e.g., OFCCP, 1993) recommend using a significance test combined
with the 4/5ths rule for adverse impact assessment. Based on the results from the current study,
the current use of the Z-test in this context seems reasonably well justified. The Z-test provided a
good balance of maintaining the nominal Type I error rate and maximizing power. In contrast,
the Fisher Exact Test and Yates’ chi-square were overly conservative except for extremely large
N, and consequently had lower power under many conditions. Given its lower power to detect
true cases of adverse impact, recommendations to use the Fisher Exact Test for adverse impact analysis should be reconsidered.
No significance test performed well when the sample size was extremely small. When the
smallest expected frequency was less than two, the Z-test was overly conservative in some cases,
and overly liberal in others. Upton’s chi-square test provided slightly better control for Type I
errors under the conditions where the Z-test was inflated, and maintained power levels that were
equal to the Z-test. Overall, the results indicate Upton's chi-square is a reasonable alternative to the Z-test.
The results were generally consistent with past research on significance tests for 2x2
tables (Upton, 1982) and demonstrate the generalizability of these results to several types of
selection decisions (fixed cutoff, top-down, banding). At the same time, the simulations may not
have fully captured all aspects of real-world employment decisions. For example, the simulations
did not model situations involving deliberate attempts to minimize adverse impact. Targeted
recruiting efforts may change the composition of the applicant pool. Similarly, methods that give
preferential treatment to the minority-group (e.g., selecting minorities first within a test score
band, as in Cascio, Outtz, Zedeck & Goldstein, 1991) will tend to reduce group differences in
selection rates. We would expect such affirmative action activities to primarily influence group
differences at the population level. Beyond their effect on power by reducing the effect size,
these practices should have little effect on the performance of statistical significance tests.
All of the significance tests had extremely low power under many of the conditions
examined. One-tailed tests are recommended, both to gain the benefit of added power and
because in most situations there is a clear expectation about the group being disadvantaged by
the selection procedure. Even using one-tailed tests, power was found to be low under many
conditions. As a result, alternative approaches to assessing
adverse impact should be considered, such as the use of confidence intervals on the adverse
impact ratio (Morris & Lobsenz, 2000), or pooling of results across samples to increase the
precision of the statistics that guide adverse impact decisions. Additional research is needed to
evaluate these alternative approaches.
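A confidence interval on the adverse impact ratio could be built in several ways; the sketch below (an assumption of this example, not the specific method of Morris & Lobsenz, 2000) uses the standard log-ratio normal approximation for a ratio of two proportions.

```python
from math import exp, log, sqrt

def adverse_impact_ratio_ci(hired_min, n_min, hired_maj, n_maj, z_crit=1.96):
    """Approximate 95% CI for the adverse impact ratio (ratio of selection
    rates), via the usual normal approximation on the log scale.

    A sketch; assumes both groups have nonzero selection rates.
    """
    p1 = hired_min / n_min
    p2 = hired_maj / n_maj
    ir = p1 / p2                                    # adverse impact ratio
    # Standard error of log(p1/p2) for independent binomial proportions.
    se_log = sqrt((1 - p1) / (n_min * p1) + (1 - p2) / (n_maj * p2))
    lo = exp(log(ir) - z_crit * se_log)
    hi = exp(log(ir) + z_crit * se_log)
    return ir, lo, hi
```

With small samples such an interval is typically very wide, e.g., easily spanning the 4/5ths threshold of 0.8, which conveys the imprecision of the point estimate more directly than a nonsignificant test result does.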
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-
analytic matrix incorporating cognitive ability, alternative predictors, and job performance.
Buonasera, A. K., Kuang, D., Dunleavy, E. M., & Mueller, L. (2006, April). The
implications of frequent appliers on adverse impact analyses. Poster presented at the Annual
Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.
and some personal opinions on the controversy. Psychological Bulletin, 108, 135-145.
contingency tables with small expected frequencies. Psychological Bulletin, 85, 163-167.
Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implication of
six methods of test score use in personnel selection. Human Performance, 4, 233-264.
Conover, W. J. (1974). Some reasons for not using the Yates’ continuity correction
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). Wiley Series in
Hedges (Eds.), The Handbook of Research Synthesis (pp. 245-260). NY: Russell Sage
Foundation.
International Mathematical and Statistical Library (1984). User's Manual: IMSL Library,
Morris, S. B., & Lobsenz, R. (2000). Significance tests and confidence intervals for the
Morris, S. B., & Henry, M. S. (2000, April). Using Meta-Analysis to Estimate Adverse
Impact. Paper presented at the 15th Annual Conference of the Society for Industrial and
Roth, P. L., Bobko, P., & Switzer, F. S. (2006). Modeling the behavior of the 4/5ths rule
for determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91, 507-
522.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score
Siskin, B. R., & Trippi, J. (2005). Statistical issues in litigation. In Landy, F. J. (ed.),
Yates, F. (1934). Contingency tables involving small numbers and the χ2 test. Journal of
Yates, F. (1984). Tests of significance for 2 x 2 contingency tables. Journal of the Royal
adverse impact results, and is often a point of debate in discrimination cases (e.g., Hazelwood v.
US, 1977). However, once a particular definition of the population has been adopted, this choice
will have no effect on the behavior of samples drawn from that population, and therefore will not
impact the conclusions drawn in this paper about the performance of alternate statistical tests.
2. A variety of methods have been proposed for using test score bands, which differ in how
the band width is determined and how applicants are chosen from within a band. The effect of
alternate banding techniques on adverse impact has been the subject of considerable research
(Aguinis, 2004), and will not be further explored here. We will consider only fixed bands where
everyone within the band moves on to be evaluated using other selection criteria.
3. When a large number of applicants repeatedly reapply for the same position, the samples
will not be completely independent, and the sampling models may not accurately represent the
variability of results across repeated test administrations (Buonasera, Kuang, Dunleavy, &
Mueller, 2006).
4. Under some conditions, samples were produced where the significance tests could not be
computed. For example, if either SRT or Pmin is one or zero, the denominator of Z, Y2, and U2
will be zero, and the statistic is therefore undefined. Samples where the statistical tests could not
be computed were excluded from the analysis; thus, empirical rejection rates were computed
only on those samples where the test statistics were defined.
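A minimal sketch of the kind of Monte Carlo procedure described (the function name, sample sizes, and replication count are illustrative; the article used 10,000 samples per condition): draw 2x2 tables from a population with equal selection rates, skip samples where the statistic is undefined, and count rejections among the rest.

```python
import random

def simulate_type1(n_min, n_maj, sr, reps=2000, seed=1):
    """Empirical Type I error of a one-tailed Z-test at alpha = .05, under a
    population with no adverse impact (IR = 1.0, common selection rate sr)."""
    random.seed(seed)
    reject = valid = 0
    for _ in range(reps):
        h1 = sum(random.random() < sr for _ in range(n_min))  # minority hires
        h2 = sum(random.random() < sr for _ in range(n_maj))  # majority hires
        total = h1 + h2
        if total == 0 or total == n_min + n_maj:
            continue                     # Z undefined; excluded, as in note 4
        p = total / (n_min + n_maj)
        se = (p * (1 - p) * (1 / n_min + 1 / n_maj)) ** 0.5
        z = (h1 / n_min - h2 / n_maj) / se
        valid += 1
        reject += z < -1.645             # one-tailed critical value, alpha=.05
    return reject / valid                # empirical rejection rate
```

The rejection rate should hover near .05 when expected frequencies are adequate, and drift away from it (conservative or liberal) as they shrink, which is the pattern the tables below summarize.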
Table 2
Empirical Type I error rates of the Z-test (Z), Upton’s chi square (U), Fisher Exact Test (F), Yates’ chi square (Y), 4/5th Test (4/5), and
Reverse One rule (R1) for selection decisions with a fixed cut score.
*Type I error was defined as the proportion of significant results out of 10,000 samples when the population had no adverse impact
(IR=1.0). The conditions with 50% minority and 70% selected were excluded from the simulation.
Table 3
Power of the Z-test (Z), Upton's chi square (U), the Fisher Exact test (F), Yates’ chi square (Y), 4/5th Test (4/5), and Reverse One rule (R1)
when the population adverse impact ratio is 0.8.
*Power was defined as the proportion of significant results out of 10,000 samples. The conditions with 50% minority and 70%
selected were excluded from the simulation.
Table 4
Power of the Z-test (Z), Upton's chi square (U), the Fisher Exact test (F), Yates’ chi square (Y), 4/5th Test (4/5), and Reverse One rule (R1)
when the population adverse impact ratio is 0.1.
*Power was defined as the proportion of significant results out of 10,000 samples. The conditions with 50% minority and 70%
selected were excluded from the simulation.
Table 5
Empirical Type I error rates of the Z-test (Z), Fisher Exact Test (F), and Multi-group chi-square test (MG) in a 3-group comparison.
N Z F MG Z F MG Z F MG
10% Selected
20 0.027 0 0.066 0.030 0.001 0.048 0.045 0.002 0.040
50 0.024 0.001 0.050 0.040 0.004 0.047 0.078 0.013 0.056
100 0.039 0.004 0.043 0.069 0.018 0.051 0.086 0.034 0.065
30% Selected
20 0.074 0.002 0.051 0.081 0.008 0.058 0.092 0.015 0.061
50 0.073 0.010 0.050 0.086 0.026 0.059 0.090 0.038 0.068
100 0.078 0.025 0.054 0.084 0.041 0.065 0.085 0.052 0.066
50% Selected
20 0.119 0.004 0.048 0.104 0.019 0.065 0.110 0.023 0.071
50 0.096 0.022 0.064 0.087 0.033 0.068 0.087 0.039 0.066
100 0.088 0.033 0.061 0.086 0.046 0.064 0.087 0.055 0.070
70% Selected
20 0.183 0.013 0.111 0.129 0.020 0.086 0.117 0.019 0.077
50 0.112 0.022 0.070 0.091 0.029 0.069 0.094 0.039 0.069
100 0.094 0.031 0.066 0.090 0.045 0.068 0.096 0.054 0.069
*Type I error was defined as the proportion of significant results out of 10,000 samples when the population had no adverse impact
(IR=1.0).
Figure Captions
Figure 1
Type I error rates for alternate test statistics with a fixed cut score.
Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher = Fisher Exact Test; R1 = Reverse One rule.
Figure 2
Power of adverse impact tests with a fixed cut score when IR=.8 and SRT=.5.
Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher = Fisher Exact Test; R1 = Reverse One rule.
Figure 3
Power of adverse impact tests with a fixed cut score when IR=.1 and SRT=.5.
Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher = Fisher Exact Test; R1 = Reverse One rule.
Figure 4
Type I error rates of the Z-test, Fisher Exact Test, and multi-group chi-square test in a 3-group comparison.
Note: Z = Z-test; Fisher = Fisher Exact Test; MG = Multi-group chi-square test.
[Figure 1 here: Type I error rate (0 to 0.3) plotted against minimum expected N (0 to 30), with lines for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 2 here: power (0 to 1) plotted against minimum expected N (0 to 30), with lines for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 3 here: power (0 to 1) plotted against minimum expected N (0 to 30), with lines for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 4 here: Type I error rate (0 to 0.2) plotted against minimum expected N (0 to 30), with lines for Z, Fisher, and MG.]