
Adverse Impact Tests 1

Testing for Adverse Impact When Sample Size is Small

Michael W. Collins

Illinois Institute of Technology

Scott B. Morris

Illinois Institute of Technology

Journal of Applied Psychology, 93(2), 463-471

DOI: 10.1037/0021-9010.93.2.463

© 2008 American Psychological Association

This article may not exactly replicate the final version published in the APA journal. It is not the

copy of record. The final published version can be obtained at

http://psycnet.apa.org/journals/apl/93/2/463/
Testing for Adverse Impact When Sample Size is Small

Abstract

Adverse impact evaluations often call for evidence that the disparity between groups in

selection rates is statistically significant, and practitioners must choose which test statistic to

apply in this situation. To identify the most effective testing procedure, several alternate test

statistics were compared in terms of Type I error rates and power, focusing on situations with

small samples. Significance testing was found to be of limited value due to low power for all

tests. Among the alternate test statistics, the widely-used Z-test on the difference between two

proportions performed reasonably well, except when sample size was extremely small. A test

suggested by Upton (1982) provided slightly better control of Type I error under some

conditions, but generally produced results similar to the Z-test. Use of the Fisher Exact Test and

Yates continuity-corrected chi-square test is not recommended, due to overly conservative Type

I error rates and substantially lower power than the Z-test.


Adverse impact refers to group differences in the outcome of an employment decision.

Adverse impact analyses play a central role in many employment discrimination lawsuits, and

have become a regular part of the evaluation of employee selection procedures. Consequently, it

is important that adverse impact analyses are based on the best statistical procedures available.

The most common approach for evaluating adverse impact is based on the 4/5ths rule

outlined in the Uniform Guidelines on Employee Selection Procedures (Uniform Guidelines;

U.S. Equal Employment Opportunity Commission, 1978). A limitation of the 4/5ths rule is that it

does not take into account the potential impact of sampling error (Morris & Lobsenz, 2000).

When sample size is small, the 4/5ths rule will often identify cases of adverse impact even when

selection rates are equal in the population (Roth, Bobko & Switzer, 2006).

In order to account for sampling error, the 4/5ths rule can be supplemented with a test of

statistical significance (Roth et al., 2006). For large samples, significance can be tested using a

Z-test on the difference between two independent proportions, often referred to as the “2 SD

rule” [Office of Federal Contract Compliance Programs (OFCCP) Compliance Manual, 1993].

When the sample size is small, the Fisher Exact Test is often recommended as a more

appropriate test (OFCCP, 1993; Siskin & Trippi, 2005).

The choice between the Fisher Exact Test and the Z-test (or the equivalent chi-square

test) has long been debated (Camilli, 1990; Haber, 1990; Upton, 1982). Although there are clear

advantages to using an exact test, the Fisher Exact Test is based on a different theoretical model

than the Z-test (Kroll, 1989; Siskin & Trippi, 2005), and each will be appropriate in different

situations. In this paper, we review the statistical models underlying these tests and discuss their

applicability to adverse impact analysis. In addition, we report simulations of the Type I error

rates and power of alternate adverse impact tests.


It is important to note that in practice, decisions about adverse impact are not solely based

on statistical evidence. Courts may consider a variety of factors to determine whether a prima

facie case of discrimination has been made. The Uniform Guidelines recommend that adverse

impact statistics be interpreted in light of the hiring organization’s recruiting practices that

encourage or discourage minority applicants. In addition, when sample size is small, the Uniform

Guidelines suggest that adverse impact statistics might be supplemented with data from other

similar jobs or for the same job across time. The current research only addresses methods for

evaluating statistical evidence, and therefore does not fully model the process of determining

whether adverse impact exists. However, because statistical evidence usually plays a central role

in such decisions, identifying the most effective statistical tools is essential to ensure accurate

decisions.

Alternate Test Statistics

Adverse impact can be defined using passing rates on a particular component of a

selection process, or based on the outcome of a decision related to selection, promotion or other

employment action. For simplicity, we will use the term ‘selection rate’ to refer to the rate of

successful outcomes, regardless of the type of decision involved.

Adverse impact statistics are typically based on a comparison of selection rates for two

predetermined groups (referred to here as the minority and majority groups). Although there may

be more than two subgroups in the applicant pool (e.g., multiple ethnic groups), the context in

which the analysis is conducted (e.g., a claim of discrimination) typically identifies the relevant

minority and majority groups. Therefore, in developing a model for adverse impact data, we will

focus on the selection rates for the two groups of interest, and ignore data that might be present

for other groups. In a later section we will discuss the implications of assessing adverse impact in

situations where the relevant groups are not identified a priori.


Adverse impact analysis can be expressed as a test for association in a 2 x 2 contingency

table, as illustrated in Table 1. In this table, NPmin, NPmaj, NFmin and NFmaj reflect the number of

applicants in each cell, Nmin, Nmaj, NPT and NFT are the marginal totals, and N is the total number

of applicants. Pmin represents the marginal proportion of minority applicants, and SRT reflects the

marginal proportion of applicants who pass the selection test.

The most widely recognized procedure for detecting adverse impact is the 4/5ths rule

outlined in the Uniform Guidelines. The basic statistic for the 4/5ths rule is the impact ratio (IR),

which is defined as the selection rate for the minority group divided by the selection rate for the

majority group. Adverse impact is indicated when this ratio is less than four-fifths. The impact ratio

is an index of effect size; it was intended to direct the attention of regulatory agencies toward

settings with substantial disparities in selection outcomes. It is not based on a formal sampling

model and does not assess the impact of sampling error on the result.
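As a concrete illustration, the impact ratio and the 4/5ths screen can be computed as follows (a minimal Python sketch; the function name and applicant counts are ours, not from the article):

```python
def impact_ratio(np_min, n_min, np_maj, n_maj):
    """Impact ratio: minority selection rate divided by majority selection rate."""
    return (np_min / n_min) / (np_maj / n_maj)

# 4/5ths rule: adverse impact is indicated when IR < 0.8.
# Illustrative data: 3 of 10 minority and 6 of 10 majority applicants selected.
ir = impact_ratio(3, 10, 6, 10)   # 0.3 / 0.6 = 0.5
flagged = ir < 0.8                # True: flagged under the 4/5ths rule
```

Note that, as the article emphasizes, this computation involves no sampling model; it flags the observed disparity regardless of how few applicants produced it.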

The Uniform Guidelines also suggest a heuristic method of assessing the impact of chance

on adverse impact results in small samples, which we will call the Reverse One rule (Roth et al.,

2006, referred to this as the “N of 1” or “flip-flop” rule). The Guidelines state that adverse impact

would not be found when, “…the selection of one different person for one job would shift the result

from adverse impact against one group to a situation in which that group has a higher selection rate

than the other group.” The adjusted impact ratio (IRadj) suggested by this strategy is,

IRadj = [(NPmin + 1) / Nmin] / [(NPmaj - 1) / Nmaj].

Applying this strategy, a finding of adverse impact would require that: (1) the impact ratio (IR) is

less than 4/5, and (2) that the adjusted impact ratio (IRadj) is less than 1.0. Although this approach

does assess the impact of chance on the result, it is not based on a formal statistical model, and it is

not known how well this heuristic will work in practice.
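The Reverse One adjustment can be sketched the same way (illustrative counts, chosen to show a case where the adverse impact finding survives the check):

```python
def adjusted_impact_ratio(np_min, n_min, np_maj, n_maj):
    """Reverse One rule: shift one selection from the majority to the minority group."""
    return ((np_min + 1) / n_min) / ((np_maj - 1) / n_maj)

# Same illustrative data: 3 of 10 minority and 6 of 10 majority applicants selected.
ir = (3 / 10) / (6 / 10)                      # 0.5: fails the 4/5ths screen
ir_adj = adjusted_impact_ratio(3, 10, 6, 10)  # (4/10) / (5/10) = 0.8
adverse_impact = ir < 0.8 and ir_adj < 1.0    # True: finding survives the Reverse One check
```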


Formal statistical significance tests have also been used to analyze adverse impact data. A

common test is the Z-test for the difference between proportions (OFCCP, 1993), or equivalently,

the Pearson chi-square test for independence in a 2 x 2 table. The Z-test evaluates the null

hypothesis of equal population selection rates for the two groups being compared,

Z = (NPmin/Nmin - NPmaj/Nmaj) / sqrt[ SRT*(1 - SRT) / (N*Pmin*(1 - Pmin)) ].
Also known as the 2-SD test, Z is considered significant if the difference is more than roughly

two standard deviations above or below zero (or more precisely, |Z| > 1.96, corresponding to a

two-tailed α = .05).

The Z-test is based on large-sample theory, and may not be appropriate in small samples.

Specifically, when any of the cell frequencies are less than five, the ability of the test to achieve

the nominal Type I error rate is questionable. When Pmin and SRT are small, this value may be

less than five, even for moderately large N. For example, when the minority group makes up

10% of the applicant pool, and 10% of the applicants pass the test, the total sample size must

exceed 500 in order to meet this requirement.
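A minimal sketch of the pooled two-proportion Z-test in Python (the applicant counts and the one-tailed critical value of 1.645 are illustrative choices, not values from the article):

```python
import math

def z_two_proportions(np_min, n_min, np_maj, n_maj):
    """Pooled Z-test on the difference between two independent proportions."""
    n = n_min + n_maj
    sr_t = (np_min + np_maj) / n   # overall selection rate (pooled proportion)
    p_min = n_min / n              # proportion of minority applicants
    se = math.sqrt(sr_t * (1 - sr_t) / (n * p_min * (1 - p_min)))
    return (np_min / n_min - np_maj / n_maj) / se

# Illustrative data: 10 of 40 minority and 30 of 60 majority applicants selected.
z = z_two_proportions(10, 40, 30, 60)  # z = -2.5
significant = z < -1.645               # one-tailed test at the .05 level: True
```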

A number of alternate test statistics have been suggested for situations where the sample

size is too small for the Z-test. The most common is the Fisher Exact Test (Kroll, 1989; OFCCP,

1993; Siskin & Trippi, 2005). The Fisher Exact Test provides the exact probability of obtaining

the observed frequency table (or one more extreme) under the null hypothesis, with the

additional assumption that the marginal frequencies are fixed (Fleiss, 1981).
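With all margins fixed, the one-tailed Fisher probability is a hypergeometric tail sum over the feasible values of the minority-pass cell. A direct sketch (our own helper and illustrative counts, not the routine used in the article's simulations):

```python
from math import comb

def fisher_exact_one_tailed(np_min, n_min, np_maj, n_maj):
    """One-tailed Fisher Exact Test: probability, with all margins fixed,
    of observing np_min or fewer minority selections."""
    n = n_min + n_maj
    npt = np_min + np_maj              # total selected (fixed column margin)
    lo = max(0, npt - n_maj)           # smallest feasible minority-pass cell
    tail = sum(comb(n_min, k) * comb(n_maj, npt - k) for k in range(lo, np_min + 1))
    return tail / comb(n, npt)

# Illustrative data: 10 of 40 minority and 30 of 60 majority applicants selected.
p = fisher_exact_one_tailed(10, 40, 30, 60)
```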

Another statistic often recommended for small samples is the chi-square test with Yates’

continuity correction (Camilli & Hopkins, 1978). Although not an exact test, Yates’ correction
has been shown to provide a close approximation to the Fisher Exact Test (Fleiss, 1981), while

being considerably easier to compute. Yates’ continuity-corrected test is calculated as:


Y2 = N * [ |(NFmin)(NPmaj) - (NFmaj)(NPmin)| - N/2 ]^2 / [ (Nmin)(Nmaj)(NPT)(NFT) ].

Yates’ test is evaluated against a chi-square distribution with 1 df.

Upton (1982) suggested another alternative. Based on the known bias in the Pearson chi-

square statistic, Upton (1982) suggested a corrected chi-square statistic (U2),

U2 = (N - 1) * [ (NPmin)(NFmaj) - (NFmin)(NPmaj) ]^2 / [ (Nmin)(Nmaj)(NPT)(NFT) ].

U2 is evaluated against a chi-square distribution with 1 df.
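Both corrected statistics are simple functions of the four cell frequencies. A sketch with illustrative counts (the directional convention of testing against the α = .10 critical value, 2.706, is standard for a one-tailed .05 test, rejecting only when the minority rate is lower):

```python
def yates_chi_square(np_min, nf_min, np_maj, nf_maj):
    """Yates' continuity-corrected chi-square for a 2 x 2 table."""
    n = np_min + nf_min + np_maj + nf_maj
    num = n * (abs(nf_min * np_maj - nf_maj * np_min) - n / 2) ** 2
    den = (np_min + nf_min) * (np_maj + nf_maj) * (np_min + np_maj) * (nf_min + nf_maj)
    return num / den

def upton_chi_square(np_min, nf_min, np_maj, nf_maj):
    """Upton's (1982) (N-1)-corrected chi-square for a 2 x 2 table."""
    n = np_min + nf_min + np_maj + nf_maj
    num = (n - 1) * (np_min * nf_maj - np_maj * nf_min) ** 2
    den = (np_min + nf_min) * (np_maj + nf_maj) * (np_min + np_maj) * (nf_min + nf_maj)
    return num / den

# Illustrative table: minority 10 pass / 30 fail, majority 30 pass / 30 fail.
y2 = yates_chi_square(10, 30, 30, 30)   # compare against 2.706 (1 df, directional .05)
u2 = upton_chi_square(10, 30, 30, 30)
```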

One- and Two-tailed tests. Tests of statistical significance can be either directional (i.e.,

one-tailed) or non-directional (two-tailed). Adverse impact statistics are often evaluated using

two-tailed significance levels (e.g., OFCCP, 1993), even though the hypothesis is usually

directional. Typically, the focal minority group will be determined by a claim of discrimination

or a history of under-representation in the workforce, and the purpose of the test is to identify

potential discrimination against this group. In such situations, finding a higher selection rate for

the minority than for the majority group would typically not be interpreted as an indication of

discrimination. Therefore, one-tailed significance levels are appropriate (Paetzold & Willborn,

1994).

There may be settings in which adverse impact statistics are evaluated in a more

exploratory fashion, without predetermined minority and majority groups, and non-directional

significance tests should be used in these situations. The use of one-tailed significance tests will

generally increase statistical power for all significance tests, and will have little effect on the

relative performance of alternate tests. The current research will focus only on directional tests.
It is important to note that tests based on the chi-square distribution are inherently non-

directional. Chi-square tests are computed using the sum of squared differences, and therefore,

differences in either direction will yield a positive chi-square. In order to conduct a directional

test, the probability obtained from the standard chi-square test should be halved. Alternatively,

the critical chi-square value can be set using double the desired significance level (i.e., α = .10 for

a directional test at the .05 level).

Alternate Sampling Models for 2 x 2 Contingency Tables

Statisticians have long debated the appropriateness of alternate test statistics for assessing

significance in 2 x 2 contingency tables. The most appropriate statistic depends on the theoretical

sampling model, that is, how a particular sample contingency table is generated from the

population. Historically, the test for association in a 2 x 2 contingency table has been expressed

in terms of one of three sampling models (Barnard, 1947; Fleiss, 1981; Kroll, 1989).

In order to define statistical significance, a set of applicants is viewed as a random sample

from a population of potential applicants. The relevant population will be defined by a variety of

factors, including the geographical region of the employment opportunity, the employer’s

recruiting practices and the minimum qualifications established for the job.1 A group of job

applicants are conceptualized as a sample from this population, because some qualified

individuals, either due to lack of awareness or convenience, will not attend a particular

administration of the selection procedure. Alternately, the sample of applicants could be viewed

as the individuals who take an exam at a particular point in time, and the population as applicants

who might potentially take part in future administrations of the test.

Independence Trial. In the first model, the marginal proportions are assumed to be fixed.

That is, Pmin and SRT are not sample estimates, but rather define the population to which

inferences will be generalized. In this model, the data are not viewed as a random sample from a
larger population. Instead, the probability of the result is considered among alternate random

assignment of participants to treatment conditions (Camilli, 1990). Consider an experiment

where participants are randomly assigned to one of two treatments and then observed on a

dichotomous outcome. Under the null hypothesis that the two variables (treatment group and

outcome) are independent, the number of successes for each treatment depends only on which

individuals were assigned to each group. If a different random assignment were applied to the

same group of participants, the results of the 2 x 2 table would differ. Tests based on this model

evaluate the probability that the observed result could have occurred due to random assignment,

assuming that the two variables are independent. Under this sampling model, the cell frequencies

will have a hypergeometric distribution. This model led to the development of the Fisher Exact

Test, which is closely approximated by the Yates (1934) correction to the chi-square test for

independence.

Comparative Trial. In the Comparative Trial model, participants can be viewed as

random samples from two distinct populations (e.g., minority and majority). The proportion from

each population is fixed. In other words, the marginal proportion on one variable (e.g., Pmin) is

assumed to be constant across replications. The second marginal proportion (SRT) is estimated

from the sample data. Assuming independence, each row of Table 1 would have a binomial

distribution, with parameters Ni and SRT. This model is the basis for the Z-test on independent

proportions.

Double Dichotomy. In the third model, neither marginal is assumed to be fixed.

Participants are viewed as a random sample from a population that is characterized by two

dichotomous characteristics. No purposive sampling or assignment to groups is used, and the

proportion in each group can vary across samples. Both population marginal proportions are

therefore unknown and must be estimated from the sample data. Assuming independence, the
expected value of each cell frequency will be the product of the marginal frequencies. The

sample frequencies will be distributed as N observations from a multinomial distribution with

parameters Pmin*SRT, Pmin*(1-SRT), (1-Pmin)*SRT, and (1-Pmin)*(1-SRT).
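Under this model, one sample of N applicants is a single multinomial draw over the four cells, with cell probabilities formed as products of the marginals. A pure-Python stand-in (the marginal values are illustrative; the article's simulations used an IMSL routine):

```python
import random

random.seed(1)
p_min, sr_t = 0.3, 0.5   # illustrative marginal proportions, not from the article

# Under independence, the four cell probabilities are the products of the marginals:
# Pmin*SRT, Pmin*(1-SRT), (1-Pmin)*SRT, (1-Pmin)*(1-SRT).
cell_p = [p_min * sr_t, p_min * (1 - sr_t),
          (1 - p_min) * sr_t, (1 - p_min) * (1 - sr_t)]

def multinomial_sample(n, probs):
    """One multinomial draw: classify n applicants into the cells at random."""
    counts = [0] * len(probs)
    for cell in random.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts

sample = multinomial_sample(50, cell_p)   # four cell frequencies summing to 50
```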

Model for adverse impact. Because the choice of test statistic depends on the sampling

model (Siskin & Trippi, 2005), it is important to identify the model that best describes typical

adverse impact data. This choice is complicated by the variety of approaches used to make

selection decisions. No single model will be appropriate for all situations in which adverse

impact is evaluated.

One approach to making selection decisions uses a fixed cutoff score. The passing score

might be established by the test developers in reference to normative data, or based on the

minimum qualifications required for a particular job. This situation is best represented by the

Double Dichotomy model. The population consists of all potential applicants, who can be

classified on minority/majority status and whether they would pass or fail the cut score. The

available data represent a sample from this population. Neither the proportion of minority

applicants nor the selection rate involve purposive sampling or random assignment, and sample

estimates of both variables are expected to differ from their population values. If multiple

samples were taken, they would not be expected to have exactly the same selection rate or

proportion of minority applicants. Therefore, the situation is best modeled as a random sample

with two unknown marginal proportions, as reflected in the Double Dichotomy model.

For other approaches to selection, the Double Dichotomy model may not be appropriate.

In top-down selection, candidates are ranked on their test scores (or based on a composite score

from a battery of tests), and then selected from the top down until a fixed number of positions are

filled. In this situation, the selection rate is fixed based on the employer’s staffing needs. If a

different sample had been used, the number passing would have been the same. However, Pmin is
likely to vary across samples and is best treated as an estimate of an unknown population

parameter. Further, because the selection decision depends on the rank position of applicants in a

particular sample, the selected and non-selected groups are sample-specific, and do not reflect

two distinct populations as in the Comparative Trial model. In fact, none of the three sampling

models adequately describe this type of decision.

Another common approach to making selection decisions uses score bands. A band refers

to a range of test scores that are treated as equivalent for the purpose of a selection decision. All

individuals within the top band or bands are considered to have passed the test, and move on to

be evaluated using other selection criteria2. Because bands are often defined using a fixed range

of scores below the highest score in the sample, the position of the band and the number of

people in the top band(s) will vary across samples. Neither marginal would be fixed. Further,

whether an individual passes depends on his or her position relative to the highest score in the

sample, and the selection rate cannot be considered a characteristic of the population. None of

the sampling models adequately represent this situation.

In some situations, it is difficult to conceptualize a population from which the data are a

random sample. For example, when evaluating a promotion decision, the pool of candidates is

relatively fixed. If the decision were repeated at a different point in time, the set of candidates

under consideration would be mostly the same. In such cases, probabilities based on randomly

sampling from a population, as in the Comparative Trial and Double Dichotomy models, would

not apply. Similarly, probabilities based on random reassignment of participants, as in the

Independence Trial model, would not be appropriate. Without some theoretical process for

producing different patterns of data (e.g., random sampling or random reassignment), statistical

significance cannot be defined3.


Comparison of alternate test statistics. A number of previous studies have compared the

performance of the alternate test statistics for 2x2 contingency tables (Kroll, 1989; Upton, 1982).

A clear result from these studies is that the choice of a test depends on the sampling model.

Under the Independence Trial model, the Fisher Exact Test has Type I error rates closest to the

nominal alpha level, closely followed by Yates’ test, while the Z-test and Upton’s chi-square can

have excessive Type I error rates when expected cell frequencies are less than 5. In contrast,

when the data are produced by the Double Dichotomy or Comparative Trial models, both the

Fisher Exact Test and Yates’ test tend to have overly conservative Type I error rates. The Z-test

generally produces Type I error rates closer to the nominal alpha level, but has inflated Type I

error rates under some conditions. Upton’s test also tends to be slightly inflated under some

conditions, but generally less so than the Z-test.

Because the Independence Trial model does not represent typical personnel selection

data, there is reason to question the appropriateness of the Fisher Exact Test and Yates’ test for

adverse impact analysis. The tendency of these tests to be conservative under the other sampling

models indicates that the Fisher Exact Test and Yates’ test will be less likely than other tests to

identify true cases of adverse impact. For selection decisions based on a fixed cut score, which

can be represented by the Double Dichotomy model, past research suggests that the Z-test and

Upton’s test will perform well. However, for other decision types (e.g., banding or top-down

selection), none of the sampling models are a perfect fit, and it is unclear whether the results of

previous studies will generalize to these settings.

Simulation of Type I Error and Power

A series of Monte Carlo simulations were conducted to identify rejection rates for each of

the adverse impact detection methods: the Z-test on the difference between proportions (Z),

Upton’s adjusted chi-square test (U2), the Fisher Exact Test (F), Yates’ correction to the chi-square test (Y2), the 4/5ths rule (4/5), and the Reverse One rule (R1). Separate simulations were

used to represent different types of selection decisions described above (fixed cutoff, top-down

selection, test-score banding). Overall, the different simulations produced very similar results.

For the sake of brevity, only the fixed cutoff will be discussed in detail here. Full results for the

other methods are available from the authors.

Method

Data were generated based on the Double Dichotomy model. A Fortran 90 program

generated 10,000 random samples consisting of frequencies in each of the four cells (i.e.,

minority vs. majority and pass vs. fail). Sample frequency distributions were generated using a

multinomial pseudo random number generator (RMNTN) from the International Mathematical

and Statistical Library (IMSL, 1984).

The multinomial sampling distribution depends on the sample size (N) and the proportion

of the population in each cell. Rather than specifying these proportions directly, we specified

parameters that are more easily interpreted in terms of personnel selection decisions (IR, SRT and

Pmin), and then computed the population cell proportions from these parameters. The simulation

was repeated for a wide range of realistic levels for each of these parameters. We chose three

levels of IR, representing extreme (IR = .1), small (IR = .8), and no adverse impact (IR = 1). SRT

was set at .1, .3, .5, and .7 to represent a wide range of selection settings and labor market

conditions. Pmin was set at .1, .3 and .5. Conditions with SRT=.7 and Pmin=.5 were excluded from

the simulation, because some values of IR are not possible under these conditions. Sample size

(N) values were chosen to represent a range of relatively small sample sizes (N = 20, 50, or 100).

The choice was made to focus on small sample size conditions because these are the conditions

under which the choice of significance test is likely to make the most difference.
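The article does not spell out the mapping from (IR, SRT, Pmin) to the four population cell proportions; one derivation consistent with its definitions (SRT = Pmin*SRmin + (1-Pmin)*SRmaj, with SRmin = IR*SRmaj) can be sketched as:

```python
def cell_proportions(ir, sr_t, p_min):
    """Recover the four population cell proportions from IR, SRT and Pmin.
    Solves SRT = Pmin*SRmin + (1-Pmin)*SRmaj subject to SRmin = IR*SRmaj."""
    sr_maj = sr_t / (1 - p_min + p_min * ir)   # majority selection rate
    sr_min = ir * sr_maj                       # minority selection rate
    return [p_min * sr_min, p_min * (1 - sr_min),
            (1 - p_min) * sr_maj, (1 - p_min) * (1 - sr_maj)]

# One of the simulated conditions: small adverse impact, moderate selection rate.
cells = cell_proportions(ir=0.8, sr_t=0.5, p_min=0.3)
```

The returned proportions sum to one and reproduce the specified IR and SRT, which is how the simulation conditions can be parameterized in interpretable terms.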
For each sample, each of the test statistics was computed as described above. The Fisher

Exact Test was obtained using the IMSL (1984) DCTTWO routine. The rejection rate was

defined as the proportion of samples4 in which the statistic was significant at the α = .05 level

(one-tailed). The test was considered significant only if the minority SR was smaller than the

majority SR. Tests based on the chi-square distribution (Y2 and U2) were considered significant

if the test statistic was larger than the critical chi-square value with 1 df and α = .10. Because

results in the non-hypothesized direction were ignored, this corresponds to a directional α = .05.

Rejection rates correspond to Type I error rates for IR=1.0, and to power when IR = 0.8 or IR =

0.1.

Results

Type I Error Rates. Empirical Type I error rates for the alternate test statistics are

presented in Table 2. The accuracy of each approach depended on the sample size, proportion

of minority applicants and overall selection rate. The combined influence of these three factors

can be summarized to a large extent by the expected frequency of the smallest cell, which is

N*Pmin*SRT when SRT ≤ .5, and N*Pmin*(1-SRT) when SRT > .5.

As in past research (Roth et al., 2006), the 4/5ths rule had very high false positive rates.

When the smallest expected frequency was less than five, false positive rates were often above .4

and reached as high as .77. Even for larger N, false positive rates were often over .2.

Figure 1 summarizes the false positive (Type I error) rates for the different significance

tests, and for the Reverse One rule. Applying the Reverse One rule avoided the most excessive

false positive rates found with the 4/5ths rule. When the minimum expected frequency was five

or less, false positive rates for the Reverse One rule were consistently less than .25, considerably

lower than with the 4/5ths rule. For larger N, results for the Reverse One method were similar to
the 4/5ths rule. Although it performed better than the 4/5ths rule, the high false positive rates for

the Reverse One rule are problematic.

Across conditions, the Z-test and Upton’s test produced very similar results. When the

smallest expected frequency was greater than five, the empirical Type I error rates tended to

range from .047 to .056, which is a negligible deviation from the nominal alpha level of .05.

Furthermore, as long as the minimum expected frequency was greater than two, the empirical

Type I error rates for the Z and Upton’s tests ranged from .04 to .06. This range is slightly larger

than the level of error recommended in past research (Bradley, 1978); however, these values

would not lead to substantially inaccurate conclusions.

When the minimum expected frequency was less than two, the Z-test and Upton’s test

performed similarly, with the Z-test slightly more liberal than Upton’s test. Both tests

consistently showed inflated Type I error rates when the selection rate was high. When the

proportion of minorities and the selection rate were both low, Type I error rates were overly

conservative (.026 or lower). Under other conditions both tests approached nominal Type I error

rates. In the few cases where the Z-test produced excessive rejection rates, Upton’s test was also

liberal, but to a lesser degree. For example, in the worst case, where the rejection rate for the Z-

test was .076, the rejection rate for Upton’s was slightly lower at .071.

The Fisher Exact Test provided better control of Type I errors, with empirical rejection

rates never exceeding .05. However, this control of Type I error occurred

because the test is overly conservative. The empirical rejection rates were consistently lower than

.03. The empirical rejection rates became smaller as the minimum expected frequency decreased,

reaching levels below .01. In terms of accurately reflecting the nominal Type I error rate, the

Fisher Exact Test was less accurate than the Z-test or Upton’s test for all of the conditions

investigated.
Results for Yates’ test were generally quite close to the Fisher Exact Test, except when

N, Pmin and SRT were all small, in which case the Type I error rate for Yates’ test was excessive

(.083). It should be noted that this anomalous result was found only with the fixed cutoff

scenario, and was not replicated in simulations based on top-down selection or banding-based

decisions.

Power. Generally, all tests became more powerful as sample size increased and as the

proportion of minorities and the selection rate increased. When the degree of adverse impact in

the population was small (IR=.8), all tests had extremely low power across conditions (see Table

3). Figure 2 illustrates the power curve for each test statistic as a function of the minimum

expected frequency when SRT=.5. Even under the best conditions (N=100, Pmin=.5, SRT=.5)

power was only .29 for the Z-test and .23 for the Fisher Exact Test. Power curves were nearly

identical for the Z-test and Upton's test, and were consistently lower for the Fisher Exact Test

and Yates' corrected chi-square. However, differences between the statistics were generally

small, with power for Z and Upton's test ranging from .06 to .09 higher than Yates' test and the

Fisher Exact Test. Power curves were similar to Figure 2 for lower selection rates, and slightly

steeper when SRT=.7.

Power differences were larger when a large degree of adverse impact was present in the

population (IR=.1). As shown in Table 4, the Z-test and Upton’s test were uniformly more

powerful than Yates’ chi-square and the Fisher Exact Test. Figure 3 displays power curves for

each statistic as a function of the minimum expected frequency when SRT=.5. For the Z- and

Upton tests, power greater than .8 was found as long as the minimum expected frequency was

greater than 3. For the Fisher Exact Test and Yates' chi-square, a minimum expected frequency

greater than 4 guaranteed power of .8. Differences between the alternatives were greatest when

power was moderate. For example, when the minimum expected frequency was 2.5, power was .75 for the Z-test but only .50 for the Fisher Exact Test. Power curves were similar to Figure 3 for

smaller selection rates, and slightly steeper when SRT =.7.

Power was consistently higher for the 4/5ths rule than for any of the alternatives, but this

came at the cost of inflated false positive rates, as discussed above. Power for the Reverse One

rule was generally less than the 4/5ths rule, but greater than the significance tests. For example,

when the population IR = 0.1, SRT = .3, Pmin = .1 and N=50, power was .27 for the Z-test, .66 for

the Reverse One rule, and .96 for the 4/5ths rule.

Situations Without a Pre-determined Comparison Group.

Although it may often be appropriate to focus on predetermined minority and majority

groups, there may be situations in which the comparison group is determined based on the

sample data. A strict reading of the Uniform Guidelines defines adverse impact in terms of the

selection rate for the minority group compared to the group with the highest selection rate. Under

this approach, the comparison group would not be the same in all samples, a situation that is not

represented by any of the sampling models described above.

This approach introduces a second form of sampling error into the data. Because the

comparison group is determined based on the sample selection rates, sampling error influences

which groups are included in the analysis. Choosing the group with the highest selection rate

increases the chance that a difference will be found, even when the differences are solely due to

sampling error. Consequently, Type I error rates will be inflated.

A more appropriate approach for evaluating statistical significance in this situation would

be to conduct a single test comparing all groups. A multiple-group generalization of the Z-test

can be performed using the standard k x 2 (group x outcome) chi-square test (Fleiss, 1981). This

test evaluates the null hypothesis that the selection rates are equal in all subpopulations, and will be referred to here as the multiple-group chi-square.
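A minimal pure-Python sketch of this k x 2 Pearson chi-square (the three-group counts in the example are hypothetical):

```python
def multigroup_chisq(table):
    """Pearson chi-square for a k x 2 (group x fail/pass) frequency table.
    Tests the null hypothesis that selection rates are equal across groups."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum((obs - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))
    return stat, len(table) - 1          # df = (k - 1) * (2 - 1)

# Hypothetical (fail, pass) counts for three groups of 10 applicants each.
stat, df = multigroup_chisq([[7, 3], [5, 5], [4, 6]])
```

The statistic is referred to a chi-square distribution with k - 1 degrees of freedom; for three groups, the .05 critical value is 5.99.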


To explore this scenario, we replicated the simulation of a test with a fixed selection rate,

but with three groups. For this simulation, the population selection rates were the same across

groups, so significant results represent Type I errors. The simulation was repeated for sample

sizes of 20, 50 and 100, and selection rates of .1, .3, .5, and .7. The proportion of the population

from the focal minority group was set at .1, .2 or .3. The proportion from the second group was

equal to the first group, with the third group making up a larger proportion (.8, .6 or .4). The Z-

test and Fisher Exact Test were computed comparing the first minority group to the group with

the highest selection rate. The multiple-group chi-square test was computed using all three

groups.
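The simulation design can be illustrated with a simplified two-group version: draw repeated samples from a population in which both groups share the same selection rate, apply the one-tailed Z-test to each, and count rejections. All parameter values below are illustrative, and samples where the test is undefined are dropped, as in note 4.

```python
import math
import random

def simulate_type1(n=50, sel_rate=0.3, p_min=0.3, reps=2000, seed=1):
    """Monte Carlo estimate of the Z-test's Type I error rate when the two
    subpopulations share the same selection rate (no adverse impact)."""
    rng = random.Random(seed)
    rejections = usable = 0
    for _ in range(reps):
        n_min = sum(rng.random() < p_min for _ in range(n))   # minority count
        n_maj = n - n_min
        s_min = sum(rng.random() < sel_rate for _ in range(n_min))
        s_maj = sum(rng.random() < sel_rate for _ in range(n_maj))
        pool = (s_min + s_maj) / n
        if n_min == 0 or n_maj == 0 or pool in (0.0, 1.0):
            continue                       # Z is undefined; drop the sample
        se = math.sqrt(pool * (1 - pool) * (1 / n_min + 1 / n_maj))
        z = (s_min / n_min - s_maj / n_maj) / se
        usable += 1
        if 0.5 * math.erfc(-z / math.sqrt(2)) < 0.05:   # one-tailed at .05
            rejections += 1
    return rejections / usable
```

With 10,000 replications per cell, as in the study, the empirical rates stabilize near the values reported in Table 2.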

Table 5 presents the results for the Z-test, the Fisher Exact Test, and the multiple-group

chi-square. As expected, the Z-test had inflated Type I error rates under many conditions,

reaching as high as .13. The multiple-group chi-square test performed better than the Z-test, but

Type I error rates were slightly elevated under many conditions. However, when the smallest

expected frequency was greater than 2, the Type I error rates for the multiple-group chi-square

test did not exceed .07. Type I error rates for the Fisher Exact Test did not substantially exceed

the nominal alpha of .05; however, the test was often overly conservative, with Type I error rates

substantially below .05. Overall, the results support use of the multiple-group chi-square test.

Discussion

Current guidelines (e.g., OFCCP, 1993) recommend using a significance test combined

with the 4/5ths rule for adverse impact assessment. Based on the results of this study, the use of the Z-test in this context seems reasonably well justified. The Z-test provided a

good balance of maintaining the nominal Type I error rate and maximizing power. In contrast,

the Fisher Exact Test and Yates’ chi-square were overly conservative except for extremely large

N, and consequently had lower power under many conditions. Given its lower power to detect
true cases of adverse impact, recommendations to use the Fisher Exact Test for adverse impact

assessment (OFCCP, 1993; Siskin & Trippi, 2005) should be reconsidered.
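The conservatism of the exact test stems from the discreteness of the hypergeometric distribution: attainable p-values jump in steps, so the effective alpha usually falls well below .05. A minimal sketch of the one-tailed p-value (the counts in the example are hypothetical):

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher Exact Test p-value for the 2 x 2 table
    [[a, b], [c, d]], where a = minority selected, b = minority rejected,
    c = majority selected, d = majority rejected.  Sums hypergeometric
    probabilities over tables at least as extreme (fewer minority hires)."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    return sum(comb(row1, x) * comb(row2, col1 - x)
               for x in range(max(0, col1 - row2), a + 1)) / denom

# Hypothetical: 1 of 10 minority vs. 5 of 10 majority applicants selected.
p = fisher_one_tailed(1, 9, 5, 5)   # about .070, not significant at .05
```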

No significance test performed well when the sample size was extremely small. When the

smallest expected frequency was less than two, the Z-test was overly conservative in some cases,

and overly liberal in others. Upton’s chi-square test provided slightly better control of Type I error under the conditions where the Z-test was inflated, and maintained power equal to that of the Z-test. Overall, the results indicate Upton’s chi-square is a reasonable alternative to

the Z-test when N is small.
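Upton’s statistic is commonly described as the Pearson chi-square rescaled by (N - 1)/N; a sketch under that assumption (the example counts are hypothetical):

```python
def upton_chisq(a, b, c, d):
    """Upton's (1982) adjusted chi-square for the 2 x 2 table [[a, b], [c, d]].
    Assumes the usual description: Pearson chi-square scaled by (N - 1) / N."""
    n = a + b + c + d
    pearson = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    return pearson * (n - 1) / n   # compare to 3.84, the chi-square(1) critical value

u = upton_chisq(3, 7, 20, 20)   # slightly below the uncorrected Pearson value
```

Because the adjustment shrinks the Pearson statistic only by a factor of (N - 1)/N, the two tests diverge meaningfully only when N is very small, consistent with the results above.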

The results were generally consistent with past research on significance tests for 2 x 2 tables (Upton, 1982) and demonstrate that these findings generalize to several types of selection decisions (fixed cutoff, top-down, banding). At the same time, the simulations may not

have fully captured all aspects of real-world employment decisions. For example, the simulations

did not model situations involving deliberate attempts to minimize adverse impact. Targeted

recruiting efforts may change the composition of the applicant pool. Similarly, methods that give

preferential treatment to the minority group (e.g., selecting minorities first within a test score

band, as in Cascio, Outtz, Zedeck & Goldstein, 1991) will tend to reduce group differences in

selection rates. We would expect such affirmative action activities to primarily influence group

differences at the population level. Beyond their effect on power by reducing the effect size,

these practices should have little effect on the performance of statistical significance tests.

All of the significance tests had extremely low power under many of the conditions

investigated, highlighting an inherent limitation of statistical significance testing under these

conditions. If statistical significance testing is applied, we recommend one-tailed tests, both to gain power and because in most situations there is a clear expectation about which group will be disadvantaged by the selection procedure. Even with one-tailed tests, power
was found to be low under many conditions. As a result, alternative approaches to assessing

adverse impact should be considered, such as the use of confidence intervals on the adverse

impact ratio (Morris & Lobsenz, 2000), or pooling of results across samples to increase the

precision of the statistics that guide adverse impact decisions. Additional research is needed to

explore the usefulness and limitations of these strategies.
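One such alternative can be sketched as an approximate confidence interval on the adverse impact ratio based on the standard log risk-ratio variance. This is a generic delta-method sketch, not necessarily the specific interval derived by Morris and Lobsenz (2000); the counts are hypothetical.

```python
import math

def impact_ratio_ci(sel_min, n_min, sel_maj, n_maj, z_crit=1.96):
    """Approximate 95% confidence interval on the adverse impact ratio,
    using the usual delta-method variance of the log risk ratio."""
    p1, p2 = sel_min / n_min, sel_maj / n_maj
    se_log = math.sqrt((1 - p1) / (n_min * p1) + (1 - p2) / (n_maj * p2))
    log_ir = math.log(p1 / p2)
    return (math.exp(log_ir - z_crit * se_log),
            math.exp(log_ir + z_crit * se_log))

lo, hi = impact_ratio_ci(3, 10, 20, 40)   # wide interval around IR = .60
```

For these small samples the interval spans roughly .22 to 1.62, covering both severe adverse impact and parity, which makes the imprecision of the point estimate explicit.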


References

Aguinis, H. (2004). Test-score banding in human resource selection: Technical, legal,

and societal issues. Westport, CT: Praeger.

Barnard, G. A. (1947). Significance tests for 2 x 2 tables. Biometrika, 34, 123-138.

Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-

analytic matrix incorporating cognitive ability, alternative predictors, and job performance.

Personnel Psychology, 52, 561-589.

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical

Psychology, 31, 144-152.

Buonasera, A. K., Kuang, D., Dunleavy, E. M., & Mueller, L. (2006, April). The

implications of frequent appliers on adverse impact analyses. Poster presented at the Annual

Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.

Camilli, G. (1990). The test of homogeneity for 2 x 2 contingency tables: A review of

and some personal opinions on the controversy. Psychological Bulletin, 108, 135-145.

Camilli, G., & Hopkins, K. D. (1978). Applicability of chi-square to 2 x 2

contingency tables with small expected frequencies. Psychological Bulletin, 85, 163-167.

Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implication of

six methods of test score use in personnel selection. Human Performance, 4, 233-264.

Conover, W. J. (1974). Some reasons for not using the Yates continuity correction

on 2 x 2 contingency tables. Journal of the American Statistical Association, 69, 374-376.

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons.


Fleiss, J. L. (1994). Measures of effect size for categorical data. In H. Cooper & L. V.

Hedges (Eds.), The Handbook of Research Synthesis (pp. 245-260). NY: Russell Sage

Foundation.

Grizzle, J. E. (1967). Continuity correction in the χ2 test for 2 x 2 tables. American

Statistician, 21, 28-32.

Haber, M. (1986). An exact unconditional test for the 2 x 2 comparative trial.

Psychological Bulletin, 99, 129-132.

Haber, M. (1990). Comments on “The test of homogeneity for 2 x 2 contingency tables:

A review of and some personal opinions on the controversy” by G. Camilli. Psychological

Bulletin, 108, 146-149.

Hazelwood School District v. United States, 433 U.S. 299 (1977).

International Mathematical and Statistical Library (1984). User's Manual: IMSL Library,

Problem-solving software system for mathematical and statistical FORTRAN programming

(Vol. 3, ed. 9.2). Houston, TX: IMSL.

Kroll, N. E. A. (1989). Testing independence in 2 x 2 contingency tables. Journal of

Educational Statistics, 14, 47-79.

Morris, S. B., & Lobsenz, R. (2000). Significance tests and confidence intervals for the

adverse impact ratio. Personnel Psychology, 53, 89-111.

Morris, S. B., & Henry, M. S. (2000, April). Using Meta-Analysis to Estimate Adverse

Impact. Paper presented at the 15th Annual Conference of the Society for Industrial and

Organizational Psychology, New Orleans, LA.

Office of Federal Contract Compliance Programs (1993). Federal contract compliance

manual. Washington, D.C.: Department of Labor, Employment Standards Administration, Office

of Federal Contract Compliance Programs (SUDOC# L 36.8: C 76/993).


Paetzold, R. L., & Willborn, S. L. (1994). Statistics in discrimination: Using statistical

evidence in discrimination cases. Colorado Springs, CO: Shepard's/McGraw-Hill.

Roth, P. L., Bobko, P., & Switzer, F. S. (2006). Modeling the behavior of the 4/5ths rule

for determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91, 507-

522.

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score

adjustment in preemployment testing. American Psychologist, 49, 929-954.

Siskin, B. R., & Trippi, J. (2005). Statistical issues in litigation. In F. J. Landy (Ed.),

Employment Discrimination Litigation: Behavioral, Quantitative and Legal Perspectives. San

Francisco, CA: Jossey-Bass.

U.S. Equal Employment Opportunity Commission, Civil Service Commission,

Department of Labor, and Department of Justice (1978). Uniform guidelines on employee

selection procedures. Federal Register, 43, 38290-38315.

Upton, G. J. G. (1982). A comparison of alternative tests for the 2 x 2 comparative trial. Journal of the Royal Statistical Society A, 145, 86-105.

Yates, F. (1934). Contingency tables involving small numbers and the χ2 test. Journal of

the Royal Statistical Society Supplement, 1, 217-235.

Yates, F. (1984). Tests of significance for 2 x 2 contingency tables. Journal of the Royal Statistical Society A, 147, 426-463.


Notes
1. The choice of an appropriate reference population can have a substantial impact on

adverse impact results, and is often a point of debate in discrimination cases (e.g., Hazelwood v.

US, 1977). However, once a particular definition of the population has been adopted, this choice

will have no effect on the behavior of samples drawn from that population, and therefore will not

impact the conclusions drawn in this paper about the performance of alternate statistical tests.
2. A variety of methods have been proposed for using test score bands, which differ in how

the band width is determined and how applicants are chosen from within a band. The effect of

alternate banding techniques on adverse impact has been the issue of considerable research

(Aguinis, 2004), and will not be further explored here. We will consider only fixed bands where

everyone within the band moves on to be evaluated using other selection criteria.
3. When a large number of applicants repeatedly reapply for the same position, the samples

will not be completely independent, and the sampling models may not accurately represent the

variability of results across repeated test administrations (Buonasera, Kuang, Dunleavy, &

Mueller, 2006).
4. Under some conditions, samples were produced where the significance tests could not be computed. For example, if either SRT or Pmin is one or zero, the denominator of the Z, Yates, and Upton statistics

will be zero, and therefore, the statistic is not defined. Samples where the statistical tests could

not be computed were excluded from the analysis. Thus, empirical rejection rates were computed

as the proportion of the remaining samples that produced significant results.


Table 1

Cross-tabulated Frequency Table of Selection Outcomes

Reference     Fail Test/Not Selected    Pass Test/Selected    Total    Proportion

Minority      NFmin                     NPmin                 Nmin     Pmin
Majority      NFmaj                     NPmaj                 Nmaj     1 - Pmin
Total         NFT                       NPT                   N

Proportion    1 - SRT                   SRT



Table 2

Empirical Type I error rates of the Z-test (Z), Upton’s chi-square (U), Fisher Exact Test (F), Yates’ chi-square (Y), 4/5ths rule (4/5), and Reverse One rule (R1) for selection decisions with a fixed cut score.

10% Minority 30% Minority 50% Minority


N Z U F Y 4/5 R1 Z U F Y 4/5 R1 Z U F Y 4/5 R1
10% Selected
20 0 0 0 0.083 0.765 0.001 0.009 0.008 0 0.002 0.544 0.033 0.046 0.034 0.004 0.004 0.424 0.066
50 0.001 0.001 0 0.013 0.610 0.013 0.026 0.026 0.005 0.003 0.449 0.141 0.052 0.050 0.015 0.013 0.410 0.164
100 0.002 0.002 0 0.001 0.448 0.096 0.046 0.045 0.017 0.013 0.390 0.232 0.051 0.050 0.024 0.023 0.364 0.243
30% Selected
20 0.009 0.008 0 0.003 0.539 0.034 0.043 0.040 0.007 0.005 0.416 0.124 0.059 0.053 0.016 0.015 0.369 0.144
50 0.022 0.022 0.005 0.003 0.415 0.142 0.048 0.045 0.023 0.020 0.337 0.250 0.053 0.052 0.028 0.028 0.314 0.260
100 0.042 0.041 0.016 0.011 0.358 0.225 0.054 0.053 0.031 0.029 0.270 0.267 0.053 0.052 0.034 0.034 0.234 0.234
50% Selected
20 0.037 0.027 0.004 0.004 0.395 0.063 0.056 0.05 0.015 0.014 0.354 0.144 0.060 0.054 0.019 0.018 0.317 0.167
50 0.052 0.050 0.014 0.014 0.351 0.167 0.05 0.049 0.025 0.024 0.250 0.235 0.054 0.053 0.029 0.029 0.228 0.227
100 0.047 0.046 0.022 0.021 0.279 0.237 0.05 0.049 0.031 0.031 0.168 0.168 0.048 0.048 0.032 0.032 0.136 0.136
70% Selected
20 0.076 0.071 0.008 0.013 0.316 0.063 0.063 0.058 0.016 0.017 0.271 0.135
50 0.070 0.068 0.020 0.022 0.260 0.139 0.056 0.056 0.027 0.027 0.150 0.149
100 0.057 0.057 0.024 0.026 0.181 0.169 0.054 0.054 0.033 0.035 0.071 0.071

*Type I error was defined as the proportion of significant results out of 10,000 samples when the population had no adverse impact

(IR=1.0). The conditions with 50% minority and 70% selected were excluded from the simulation.

Table 3

Power of the Z-test (Z), Upton's chi-square (U), the Fisher Exact Test (F), Yates’ chi-square (Y), 4/5ths rule (4/5), and Reverse One rule (R1) when the population adverse impact ratio is 0.8.

10% Minority 30% Minority 50% Minority


N Z U F Y 4/5 R1 Z U F Y 4/5 R1 Z U F Y 4/5 R1
10% Selected
20 0 0 0 0.090 0.809 0.001 0.012 0.011 0.001 0.003 0.607 0.044 0.056 0.045 0.007 0.007 0.501 0.085
50 0 0 0 0.011 0.673 0.021 0.039 0.038 0.008 0.005 0.535 0.199 0.079 0.075 0.027 0.025 0.502 0.230
100 0.004 0.004 0.001 0 0.528 0.128 0.081 0.081 0.035 0.025 0.521 0.339 0.103 0.102 0.056 0.053 0.504 0.376
30% Selected
20 0.013 0.012 0.001 0.002 0.611 0.044 0.076 0.071 0.014 0.01 0.547 0.192 0.108 0.098 0.032 0.03 0.494 0.224
50 0.039 0.039 0.008 0.005 0.525 0.215 0.118 0.115 0.063 0.057 0.507 0.405 0.132 0.13 0.076 0.075 0.515 0.455
100 0.087 0.087 0.042 0.032 0.523 0.362 0.150 0.148 0.103 0.097 0.509 0.506 0.179 0.178 0.128 0.128 0.505 0.504
50% Selected
20 0.072 0.057 0.009 0.009 0.506 0.106 0.125 0.113 0.037 0.035 0.521 0.257 0.137 0.127 0.055 0.054 0.493 0.308
50 0.114 0.110 0.043 0.040 0.515 0.289 0.178 0.175 0.106 0.106 0.510 0.494 0.212 0.207 0.136 0.135 0.51 0.509
100 0.156 0.155 0.091 0.086 0.515 0.464 0.253 0.251 0.188 0.188 0.517 0.517 0.292 0.291 0.229 0.229 0.501 0.501
70% Selected
20 0.162 0.152 0.026 0.033 0.484 0.137 0.191 0.178 0.063 0.064 0.495 0.307
50 0.189 0.185 0.078 0.081 0.504 0.350 0.288 0.287 0.175 0.177 0.494 0.493
100 0.247 0.246 0.148 0.156 0.491 0.474 0.444 0.443 0.356 0.365 0.503 0.503

*Power was defined as the proportion of significant results out of 10,000 samples. The conditions with 50% minority and 70%

selected were excluded from the simulation.



Table 4

Power of the Z-test (Z), Upton's chi-square (U), the Fisher Exact Test (F), Yates’ chi-square (Y), 4/5ths rule (4/5), and Reverse One rule (R1) when the population adverse impact ratio is 0.1.

10% Minority 30% Minority 50% Minority


N Z U F Y 4/5 R1 Z U F Y 4/5 R1 Z U F Y 4/5 R1
10% Selected
20 0 0 0 0.090 0.973 0.002 0.047 0.043 0.004 0.004 0.938 0.129 0.296 0.254 0.065 0.066 0.922 0.356
50 0 0 0 0.010 0.947 0.043 0.248 0.245 0.077 0.054 0.958 0.634 0.670 0.66 0.417 0.398 0.977 0.843
100 0.015 0.013 0.002 0.001 0.935 0.358 0.696 0.694 0.498 0.429 0.988 0.944 0.919 0.918 0.839 0.828 0.997 0.986
30% Selected
20 0.046 0.043 0.005 0.005 0.939 0.131 0.512 0.503 0.213 0.165 0.972 0.729 0.843 0.827 0.633 0.620 0.99 0.927
50 0.267 0.264 0.086 0.06 0.956 0.664 0.926 0.924 0.851 0.833 0.998 0.993 0.996 0.996 0.989 0.989 1 1
100 0.741 0.739 0.531 0.456 0.991 0.956 0.999 0.999 0.996 0.996 1 1 1 1 1 1 1 1
50% Selected
20 0.352 0.304 0.076 0.077 0.945 0.409 0.893 0.879 0.705 0.694 0.994 0.948 0.998 0.997 0.987 0.987 1 0.999
50 0.756 0.748 0.499 0.480 0.983 0.882 0.999 0.999 0.996 0.996 1 1 1 1 1 1 1 1
100 0.963 0.963 0.912 0.905 0.999 0.993 1 1 1 1 1 1 1 1 1 1 1 1
70% Selected
20 0.746 0.729 0.309 0.362 0.968 0.609 0.997 0.996 0.977 0.978 0.999 0.98
50 0.948 0.944 0.822 0.835 0.996 0.954 1 1 1 1 1 1
100 0.997 0.997 0.990 0.992 0.999 0.998 1 1 1 1 1 1

*Power was defined as the proportion of significant results out of 10,000 samples. The conditions with 50% minority and 70%

selected were excluded from the simulation.



Table 5

Empirical Type I error rates of the Z-test (Z), Fisher Exact Test (F), and Multi-group chi-square test (MG) in a 3-group comparison.

10% in Focal Group 20% in Focal Group 30% in Focal Group

N Z F MG Z F MG Z F MG
10% Selected
20 0.027 0 0.066 0.030 0.001 0.048 0.045 0.002 0.040
50 0.024 0.001 0.050 0.040 0.004 0.047 0.078 0.013 0.056
100 0.039 0.004 0.043 0.069 0.018 0.051 0.086 0.034 0.065
30% Selected
20 0.074 0.002 0.051 0.081 0.008 0.058 0.092 0.015 0.061
50 0.073 0.010 0.050 0.086 0.026 0.059 0.090 0.038 0.068
100 0.078 0.025 0.054 0.084 0.041 0.065 0.085 0.052 0.066
50% Selected
20 0.119 0.004 0.048 0.104 0.019 0.065 0.110 0.023 0.071
50 0.096 0.022 0.064 0.087 0.033 0.068 0.087 0.039 0.066
100 0.088 0.033 0.061 0.086 0.046 0.064 0.087 0.055 0.070
70% Selected
20 0.183 0.013 0.111 0.129 0.020 0.086 0.117 0.019 0.077
50 0.112 0.022 0.070 0.091 0.029 0.069 0.094 0.039 0.069
100 0.094 0.031 0.066 0.090 0.045 0.068 0.096 0.054 0.069

*Type I error was defined as the proportion of significant results out of 10,000 samples when the population had no adverse impact

(IR=1.0).
Figure Captions

Figure 1

Type I error rates for alternate test statistics with a fixed cut score.

Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher =

Fisher Exact Test; R1 = Reverse One rule.

Figure 2

Power of adverse impact tests with a fixed cut score when IR=.8 and SRT=.5.

Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher =

Fisher Exact Test; R1 = Reverse One rule.

Figure 3

Power of adverse impact tests with a fixed cut score when IR=.1 and SRT=.5.

Note: Z = Z-test; CHIU = Upton’s chi-square test; CHIY = Yates’ chi-square test; Fisher =

Fisher Exact Test; R1 = Reverse One rule.

Figure 4

Type I error rates in 3-group comparison.

Note: Z = Z-test; Fisher = Fisher Exact Test; MG = Multi-group chi-square test.


[Figure 1: line graph of Type I error rate (0 to .30) against minimum expected N (0 to 30) for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 2: line graph of power (0 to 1) against minimum expected N (0 to 30) for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 3: line graph of power (0 to 1) against minimum expected N (0 to 30) for Z, CHIU, CHIY, Fisher, and R1.]
[Figure 4: line graph of Type I error rate (0 to .20) against minimum expected N (0 to 30) for Z, Fisher, and MG.]
