
Testing the Assumptions of Analysis of Variance
In: Best Practices in Quantitative Methods

By: Yanyan Sheng


Edited by: Jason Osborne
Pub. Date: 2011
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9781412940658
Online ISBN: 9781412995627
DOI: https://dx.doi.org/10.4135/9781412995627
Print pages: 324-340
© 2008 SAGE Publications, Inc. All Rights Reserved.

Testing the Assumptions of Analysis of Variance

Yanyan Sheng

ANOVA Assumptions

The analysis of variance (ANOVA) F test is commonly employed to test the omnibus null hypothesis regarding the effect of categorical independent variables (or factors) on a continuous dependent variable. It is generally well known that certain assumptions have to be satisfied in order for the F test to produce valid statistical results. Briefly, the assumptions are independence of observations, normality in population distributions, and homogeneity of population variances (Penfield, 1994; Scheffé, 1959; Stevens, 1996). The first assumption can be handled through research design and sampling frame, whereas the latter two assumptions concern the populations under investigation and therefore are oftentimes beyond the control of the researcher. Statistical methods are called robust if their inferences are not seriously invalidated by violations of the assumptions (Miller, 1986; Scheffé, 1959). Robustness is often operationalized as the actual Type I error rate (the probability of erroneously rejecting a true null hypothesis) being near the nominal α level. When the assumptions for the ANOVA F test are violated, the Type I error rate can be seriously inflated, resulting in spurious rejections of the null hypothesis, and statistical power (the probability of correctly rejecting a false null hypothesis) can be reduced. Hence, inferences made about a given set of data may be invalid.

Many applied researchers in the social sciences are tempted simply to skip checking the assumptions or to ignore violations of them (e.g., Breckler, 1990; Keselman et al., 1998; Micceri, 1989). A great number of studies in the literature have evaluated the ANOVA F test under various degrees of assumption violation (e.g., Glass, Peckham, & Sanders, 1972; Harwell, Rubinstein, Hayes, & Olds, 1992; Lix, Keselman, & Keselman, 1996; Scheffé, 1959; among others), concluding that the F test is robust to some but not all violation situations. Consequently, checking the necessary ANOVA assumptions and, more importantly, understanding the performance of the ANOVA F test under various degrees of assumption violation are essential for an applied educational or psychological researcher to understand specific data-analytic conditions and to guard against false results from invalid F test procedures, for "the relevant question is not whether ANOVA assumptions are met exactly, but rather whether the plausible violations of the assumptions have serious consequences on the validity of probability statements based on the standard assumptions" (Glass et al., 1972, p. 237).

When violation of a particular ANOVA assumption seriously jeopardizes the validity of statistical inferences for
a given data set, one may choose to either transform the data (such as a logarithm transformation or a square
root transformation) and perform the usual F test or use an alternative procedure that is robust to assumption
violations, which can be a parametric alternative such as the Welch (1951) test or a nonparametric alternative
such as the Kruskal–Wallis (Kruskal & Wallis, 1952) test.


However, none of these options are optimal in all situations. Transformation can create difficulties in
interpretation, for the results are based on the transformed scores instead of the original scores (e.g.,
Krutchkoff, 1988). Conclusions that are drawn on transformed data do not always transfer neatly to the
original measurements. Furthermore, transformation may not provide a simple solution, for a variety of
transformations can be adopted depending on the type and the degree of assumption violation presented in
the data (cf. Oshima & Algina, 1992).

Regarding the robust alternatives, studies have shown that they may be superior to the ANOVA F test in the majority of assumption violation situations (e.g., Levy, 1978; Tomarken & Serlin, 1986), but each procedure suffers from its own weaknesses.1 For instance, the Kruskal–Wallis procedure can have substantially greater statistical power than the F test when the data are nonnormal (Blair & Higgins, 1980), but it is sensitive to heterogeneity of variance, especially with unequal group sizes (Tomarken & Serlin, 1986). The Welch test is robust to variance heterogeneity but not to the presence of high skewness (Lix et al., 1996). The effects of assumption violations on ANOVA alternative procedures have not yet been fully investigated. Furthermore, some of the alternative procedures are applicable only in one-way designs; studies are needed to extend them to a diverse range of design types. Given these considerations and the popularity of the F test, it is vital to understand the degrees of violation and the consequences of each violation for the validity of the statistical inference made on a particular set of data.

In this chapter, I describe the assumptions for the ANOVA F test in the order in which they should be checked, present procedures for testing them, and discuss the consequences of each assumption violation. For simplicity of illustration, one-way ANOVA is considered; however, examples of factorial designs are also provided for assessing the assumptions.

The linear model underlying a simple one-way ANOVA fixed effects design is

Yij = μj + ∊ij,     (1)

where Yij is the dependent variable associated with the ith observation in the jth treatment group for i = 1, …, nj and j = 1, …, k (where N = n1 + n2 + … + nk is the total sample size), μj is the population mean for the jth group, and ∊ij is the random error associated with Yij. The null hypothesis for the omnibus F test, H0: μ1 = μ2 = … = μk, is tested by comparing the computed F statistic to a critical F value at the α level with k − 1 and N − k degrees of freedom (df), F1−α(k − 1, N − k). It is assumed that the ∊ij are independent and normally distributed with a mean of 0 and a common variance, σ², within each of the k levels, that is, ∊ij ~ N(0, σ²). The linear model for factorial ANOVA designs can be generalized based on the number of factors in the model.
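For instance, a two-factor fixed effects model with interaction can be written, under the same error assumptions, as Yijk = μ + αj + βk + (αβ)jk + ∊ijk, where αj and βk denote the main effects of the two factors and (αβ)jk denotes their interaction effect.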

Independence of Observations

Independence of random errors (i.e., ∊ij in Equation 1) means that the value of one observation on the dependent variable, Yij, provides no information about, and is not influenced by, the value of any other observation, Yi'j'. The independence of observations assumption is a basic requirement of experimental design. Violation of this assumption leads to dependent or correlated observations. An illustrative example of such a violation is students taught by a particular instructor in an educational setting: They are not independent because they share the same instruction. We say the students are nested within the instructor. One may also argue that individuals in a corporation are nested, and hence nonindependent, due to similarity in their background and experiences. With such nonindependent/nested data, one has to adopt techniques such as nested ANOVA designs and/or hierarchical linear modeling (HLM).

Whether the independence assumption holds can be decided and controlled by the researcher when planning and executing the experiment. In most research situations, the requirement of independence is typically met through randomization (i.e., using a random sample of separate, unrelated subjects).

Test the Assumption

For a simple one–way ANOVA design, an intraclass correlation (ICC) can be used to assess whether this
assumption is tenable (Stevens, 1996). One can compute the ICC using the SPSS RELIABILITY procedure.
But the data layout has to be constructed differently from the format in the usual ANOVA analysis—for
example, the layout in the dist data (see Appendix A for a description of the data). For instance, to evaluate
the assumption of independence of correct responses (numright) from three treatment conditions (condit),
one has to create three continuous variables (c1, c2, c3) out of numright, each representing correct responses
in one treatment condition, so the layout of the data is as shown in Table 22.1. The responses are randomly
arranged in each group. It has to be noted that the order of the subjects in each group does not affect the
extent to which the scores correlate, although the ICC and p values could be different with another ordering.

Table 22.2 displays the SPSS syntax and output for computing the ICC with the data shown in Table 22.1. A
large p value (p = .920) suggests low nonsignificant correlations among observations from the three treatment
conditions. Hence, the independence of observations can be assumed.
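As a sketch of what such a RELIABILITY specification involves (the exact subcommands here are an assumption, not the verbatim Table 22.2 syntax; c1 to c3 are the restructured variables described above):

RELIABILITY
  /VARIABLES=c1 c2 c3
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /ICC=MODEL(ONEWAY) CIN=95 TESTVAL=0.

The single-measure ICC and its F test in the resulting output provide the correlation and the p value referred to above.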

However, this procedure is applicable only in a single-factor design. One may also test this assumption using a run chart of the residual scores (e.g., eij = Yij − Ȳj for the one-way design, where Ȳj denotes the sample estimate of the jth population mean), particularly if the time order of data collection is available. Unlike the method with the ICC, this procedure can be used in factorial ANOVAs as well. To illustrate, suppose one sets up a two-factor ANOVA investigating how correct responses (numright) are affected by the background conditions (condit) and the participant's gender (gender) in the dist data. Because the data do not record the time order of data collection, one may use the participant's id number (id) to check the independence assumption. Residuals are obtained by fitting a factorial ANOVA model using the SPSS GLM procedure and are then plotted against id using the SPSS GRAPH procedure. Figure 22.1 shows the SPSS syntax and the chart; with no clear patterns in the residuals, the independence assumption can be assumed.
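A minimal sketch of these two steps (RES_1 is the default name SPSS assigns to residuals saved by GLM; the subcommands are an assumption about the Figure 22.1 syntax):

GLM numright BY condit gender
  /SAVE=RESID
  /DESIGN=condit gender condit*gender.
GRAPH
  /SCATTERPLOT(BIVAR)=id WITH RES_1.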


Table 22.1 Number of Correct Responses in the Three Treatment Conditions


Effects of Violations

In cases of nonindependence, the scores/observations of a subject are influenced by other subjects or by previous scores. Failure to satisfy the independence assumption can have serious effects on the validity of probability statements in the ANOVA F procedure (Glass et al., 1972; Harwell et al., 1992; Scheffé, 1959; Stevens, 1996), leading to tests with inflated Type I error and reduced power even when the sample size is large (Scariano & Davenport, 1987). The nature of the nonindependence (i.e., a positive or negative relationship among the observations) has different effects on F tests. Specifically, positive correlations yield a more liberal test, whereas negative correlations lead to a more conservative test (Cochran, 1947). In addition, the higher the correlations, the more the actual significance level departs from the nominal level.

Table 22.2 SPSS Syntax and Output for Evaluating the Independence Assumption With a Single Factor

Figure 22.1 SPSS syntax and output for a visual check of the independence assumption.


Table 22.3 ANOVA F Tests for the Data Simulated From Uncorrelated and Correlated Distributions

To further illustrate the effect of violation on Type I and Type II errors, let's consider two simulated data sets, each containing three treatment groups and six observations per group, as displayed in Table 22.3. The two data sets represent two conditions: with and without independence of observations. In the latter case, where the assumption is violated, moderate correlations (ρ = 0.5) between treatment groups are assumed. In addition, in both sets, the values in Groups 1, 2, and 3 are random draws from normal distributions with a common variance of 1 and means of 0, 1, and 2, respectively. The effect size is kept similar in both conditions. This specification means that the true population means differ among the three groups, so the omnibus F test should show significance for both data sets. Indeed, with the uncorrelated data (i.e., when the independence assumption is satisfied), the ANOVA F test is significant at the .05 level (p = .038), with a .636 probability of correctly rejecting the false null hypothesis (see "Summary of Analysis" in Table 22.3 for the uncorrelated data). However, the correlated data result in a nonsignificant F test with a higher p value (p = .063) and a lower power estimate (.542).
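The original simulation code is not shown; one way to generate such correlated draws is to mix a shared standard normal component into each group's scores, as in the following sketch (all names are illustrative):

* Scores in the same row share the #common term, so the between-group correlation is 0.5 and each score has variance 0.5 + 0.5 = 1.
INPUT PROGRAM.
LOOP #i = 1 TO 6.
  COMPUTE #common = RV.NORMAL(0, 1).
  COMPUTE g1 = 0 + SQRT(0.5) * #common + SQRT(0.5) * RV.NORMAL(0, 1).
  COMPUTE g2 = 1 + SQRT(0.5) * #common + SQRT(0.5) * RV.NORMAL(0, 1).
  COMPUTE g3 = 2 + SQRT(0.5) * #common + SQRT(0.5) * RV.NORMAL(0, 1).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.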

Outliers


There are many reasons for an extreme score in ANOVA (discussed in detail in Chapter 14). As with other procedures, outliers can be detected using the standardized residuals, so that any value more than 3 standard deviations away from the mean can be considered an outlier. Using the dist data, for a two-factor ANOVA using the background conditions (condit) and the participant's gender (gender) to model correct responses (numright), standardized residuals can be obtained using the GLM procedure shown in Table 22.4. From the saved standardized residuals displayed in the table, it is obvious that case id 35 is an outlier, with a standardized residual of −3.37. A sensitivity analysis can subsequently be carried out to check whether this outlier seriously affects the F test. Table 22.5 summarizes the ANOVA results using the original data (the upper part of the table) and using the data without the outlier (the lower part of the table). The two analyses, resulting in different F statistics, p values, and estimated effect sizes, indicate that the presence of the outlier does affect inferences drawn from the F tests. In particular, the presence of this outlier depresses the estimated effect sizes, especially for the interaction. Hence, the observation with id 35 should be examined more carefully to determine whether it should be retained in the data.
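A minimal sketch of such a specification (ZRE_1 is the default name SPSS assigns to standardized residuals saved by GLM; the flagging step is an illustrative addition, not part of the Table 22.4 syntax):

GLM numright BY condit gender
  /SAVE=ZRESID
  /DESIGN=condit gender condit*gender.
* Flag cases whose standardized residual exceeds 3 in absolute value.
COMPUTE outlier = (ABS(ZRE_1) > 3).
EXECUTE.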

Homogeneity of Population Variances

The assumption of homogeneity of variances states that the error variances are the same across the populations under investigation (e.g., σ1² = σ2² = … = σk² for a one-way design). In performing F tests, the common variance, σ², estimated by the mean square error (or within-group mean square), is the denominator of the F ratio. Conceptually, the mean square error is the pooled, or weighted, average of the sample variances; it estimates the baseline variation that is present in the response variable. When populations differ widely in variances, this average is a poor summary measure.

Test the Assumption

A visual check of the homogeneity assumption can be made by inspecting and comparing side-by-side boxplots of the residuals (e.g., eij = Yij − Ȳj) from the different populations. The relative height of the boxes, or the interquartile range (IQR), that is, the distance between the upper quartile (75th percentile) and the lower quartile (25th percentile), indicates the relative variability of the residual scores in each treatment condition. Equal variability can be assumed if the boxplots have relatively equal spread or IQR across populations.


Table 22.4 SPSS Syntax for Detecting Within–Cell Outliers


Table 22.5 Sensitivity Analysis With and Without Case 35


Consider the dist example of the correct responses (numright) obtained from the three treatment conditions (condit) for male and female (gender) participants, after removing the potential outlying case (id 35). The boxplots of the saved residuals, as well as the SPSS syntax for obtaining them, are shown in Figure 22.2. It is obvious from the figure that the boxplots are dissimilar in their spread, or IQR. Specifically, the box for the (female, constant sound) condition is the smallest, whereas that for the (male, no sound) condition shows the largest distance; the latter appears to be about twice as wide as the former. However, given the unit of the scale, it is hard to decide whether the difference in the IQRs, and hence in the population variances, is substantial. One has to check the significance of this inequality numerically.
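One way to produce such cell-level boxplots is sketched below (the combined cell indicator is an illustrative device, not necessarily the Figure 22.2 syntax):

GLM numright BY condit gender
  /SAVE=RESID
  /DESIGN=condit gender condit*gender.
* Combine the two factors into a single cell code for plotting.
COMPUTE cell = 10 * condit + gender.
EXECUTE.
EXAMINE VARIABLES=RES_1 BY cell
  /PLOT=BOXPLOT
  /STATISTICS=NONE.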

There are a number of tests for the homogeneity of variance assumption, including Bartlett's chi-square test, Hartley's F-max test, Cochran's C test, Levene's test, and Brown–Forsythe's test, and most software packages incorporate at least one of them. The first three tests are sensitive to even slight departures from normality (Box, 1953), whereas the latter two have been shown to be fairly robust even when the underlying distributions deviate markedly from normality (Olejnik & Algina, 1987) and hence are recommended. They have the added advantage of simplifying the overall procedure by allowing normality tests to be conducted after the homogeneity of variances has been tested, which is explained further in a later section.


The Brown–Forsythe test is a modification of the Levene test (Brown & Forsythe, 1974). The null hypothesis for the two tests is that the variances are homogeneous (H0: σ1² = σ2² = … = σk²). In both tests, the original values of the dependent variable (Yij) are transformed to derive a dispersion variable based on their absolute deviations from the respective group means (|Yij − Ȳj|) for the Levene test, or from the respective group medians (|Yij − mj|, where mj denotes the median of the jth group) for the Brown–Forsythe test. ANOVA is subsequently performed on the dispersion variable, and the significance level for the test of homogeneity of variance is then the p value for the ANOVA F test on this variable. Thus, if that F test is significant at a predefined critical level (usually .05), the hypothesis of equal variances is rejected, leading to the conclusion that the assumption is violated.

Generally, Brown–Forsythe's test is preferred over Levene's test (when available) when there are unequal sample sizes in the two (or more) groups that are to be compared. In addition, it has been reported to be the most robust and the best at providing power to detect variance differences while protecting against Type I error (Conover, Johnson, & Johnson, 1981; Olejnik & Algina, 1987).

The previous side-by-side boxplots suggested that the residuals differ considerably in their spread across groups. Formal statistical tests of the homogeneity of variances in correct responses (numright) across the three treatment conditions (condit) for male and female (gender) participants can be conducted by setting up the null and alternative hypotheses as H0: σ1² = σ2² = … = σ6² and H1: not all σj² are equal, respectively.

The result (displayed in Table 22.6) shows a nonsignificant Levene's test on the dispersion variable, F(5, 68) = 2.175, p = .067. Hence, the null hypothesis is retained, and equal variances can be assumed at the .05 level.
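In SPSS, Levene's test for a factorial design can be requested through the homogeneity option of the GLM/UNIANOVA procedure; a minimal sketch (the subcommands are an assumption, not the verbatim Table 22.6 syntax) is:

UNIANOVA numright BY condit gender
  /PRINT=HOMOGENEITY
  /DESIGN=condit gender condit*gender.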


Figure 22.2 SPSS syntax and output for a visual check of the equal variances assumption.


Table 22.6 SPSS Syntax and Output for an Example of Using Levene's Test

Brown–Forsythe's test is not directly implemented in SPSS (it is, however, implemented in SAS). The procedure can become very tedious if there are many factors or groups (in which case, one may consider using another statistical package such as SAS). For the above example, the Brown–Forsythe test, for which the procedures are detailed in Appendix B, results in a nonsignificant F ratio, F(5, 68) = 1.587, p = .175 (see Table B1 in Appendix B). The test shows no evidence of heterogeneity. Therefore, one can assume homogeneity of variances and proceed with an ANOVA F test for the differences among the population means.

Although Levene's and Brown–Forsythe's tests are robust to nonnormality, some authors (e.g., Glass & Hopkins, 1996) have pointed out that the tests themselves rely on an assumption of homogeneous variances (of the absolute deviations from the means or medians), and hence it is not clear how robust they are in the presence of substantial variance heterogeneity combined with unequal sample sizes.

Effects of Violations

The prevailing conclusion drawn from early studies on the effects of variance heterogeneity (see Glass et
al., 1972, for a review of this earlier literature) indicated that the F test was quite robust against violations
of this assumption, especially when sample sizes were equal. However, the robustness of the test has been
somewhat misinterpreted in the literature (Krutchkoff, 1988). Research initiated by Box (1953) has shown that
F tests are not robust to all degrees of variance heterogeneity even when the sample sizes are equal (e.g., Rogan & Keselman, 1977; Weerahandi, 1995; among others). It is true that the value of the F statistic does
not change dramatically due to small or even moderate departures from this assumption when sample sizes are equal. However, as the degree of heterogeneity increases, the F test becomes increasingly biased, with inflated error rates and reduced power, especially with small sample sizes. The effect of variance heterogeneity may be more serious when sample sizes are unequal. It is also noted that when the variances are positively paired with the sample sizes (i.e., when the group with the largest sample size also has the largest variance), the true Type I error rate will be less than the nominal significance level. On the other hand, when the variances are negatively paired with the sample sizes, the true Type I error rate will exceed the nominal level (Glass et al., 1972; Harwell et al., 1992).

When variances are heterogeneous, Welch's (1951) test should be used in most one-way designs with balanced or unbalanced data. However, this test is sensitive to large skewness in the data. Lix et al. (1996) pointed out that skewness greater than 2.0 might result in inflated error rates for the Welch test; they further recommended using sample sizes of no less than 10. It should be noted that extension of Welch's test to a variety of univariate and multivariate designs is possible (Lix & Keselman, 1995).

To further illustrate the effects of variance heterogeneity, let's consider the dist data again. One of the variables in the data is HETERVAR, which was simulated from three normal distributions, each corresponding to the constant sound, random sound, and no sound conditions, with population means of 3, 10, and 6 and standard deviations of 0.1, 10, and 15, respectively. The F test, Welch's test, and Levene's test can be performed simultaneously with the SPSS ONEWAY procedure. The syntax, together with the output, is shown in Table 22.7.
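A minimal sketch of such a ONEWAY call (assumed, not the verbatim Table 22.7 syntax):

ONEWAY hetervar BY condit
  /STATISTICS=HOMOGENEITY WELCH.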

The Levene test is significant with an extremely small p value (p < .001), indicating a great degree of variance heterogeneity in HETERVAR among the three treatment conditions. This in turn results in an F test that is not significant at the .05 level, F = 2.878 (p = .063, η² = .074), suggesting that the data do not provide sufficient evidence against the null hypothesis of equal population means. However, the Welch test, which does not assume equal variances, reports significant mean differences, FW(2, 32) = 6.875, p = .003, which is consistent with the actual situation. As pointed out earlier, Welch's test is not commonly implemented in statistical packages and is available only for one-way designs in SPSS. Hence, one has to adopt other robust tests when variance homogeneity is not attainable.

In this example, we see that the heterogeneity of variance has greatly reduced the power of the F test even with equal sample sizes (nj = 25). For the effect with unequal sample sizes, readers can use the simulated variable hetervar in the hsb data, evaluate the equal variances assumption of this variable across the three socioeconomic status (ses) levels, and compare the F test result with the Welch test result. The effects of unequal variances on the Type I error rates and power of F tests have mainly been investigated using Monte Carlo studies. For an excellent review as well as a summary of the consequences of the violations, see Harwell et al. (1992).


Normality in Treatment Population

In addition to equal variances, ANOVA fixed–factor models make an additional distributional assumption that
the random errors are normally distributed in all treatment conditions.
Table 22.7 SPSS Syntax and Output for the ANOVA F Test and the Welch Test When Variances Are Heterogeneous

Test the Assumption

The normality assumption can be assessed by applying graphics, descriptive statistics, and formal statistical tests to the residual scores (e.g., eij = Yij − Ȳj). Graphics provide the simplest and most direct way to examine the shape of a distribution. While histograms can aid in visual inspection of normality, normal probability plots (graphs of the empirical quantiles based on the data against the actual quantiles of the standard normal


distribution, also called normal Q–Q plots) are considered to be more informative. The points will fall close
to a straight line if the distribution is approximately normal. Points that are above the straight–line pattern
suggest that residuals are smaller than expected for normal data. Points that are below the straight–line
pattern suggest that residuals are bigger than expected for normal data.

Simple descriptive statistics (i.e., the coefficients of skewness and kurtosis) also provide important information relevant to this issue. For example, a perfect normal distribution is symmetric, with a skewness of zero, and moderately spread, with a kurtosis of zero. If the result of dividing the skewness or kurtosis coefficient by its standard error exceeds 2 in absolute value, the distribution departs significantly from a normal distribution.

More precise information can be obtained by performing one of the statistical tests of normality to determine the probability that the sample came from a normally distributed population (e.g., the Kolmogorov–Smirnov [KS] test or the Shapiro–Wilk test). For both tests, a p value smaller than a predefined critical level (usually .05) suggests rejection of the null hypothesis that the data follow a normal distribution. It should be noted that tests of normality do not reflect the magnitude of the departure; they indicate only whether the data depart from normality. Small p values indicate that the distribution almost certainly departs from normality; they do not, however, indicate that the distribution departs substantially from normality, especially when the sample size is large. In general, the Shapiro–Wilk test is too sensitive to minor deviations from normality, even the KS test is relatively sensitive, and none of the statistical tests can substitute for a visual examination of the data.
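In SPSS, the histogram, the normal Q–Q plot, the skewness and kurtosis coefficients with their standard errors, and both normality tests can be obtained in one pass by running the EXAMINE procedure on the saved residuals; a minimal sketch (RES_1 again being GLM's default residual name) is:

EXAMINE VARIABLES=RES_1
  /PLOT=HISTOGRAM NPPLOT
  /STATISTICS=DESCRIPTIVES.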

Effects of Violations

The fixed effects model F test is relatively robust to deviations from normality (see Glass et al., 1972; Harwell
et al., 1992, for summaries of the effects), except when sample sizes are small and power is being evaluated.

Studies indicate that severe skewness can have a greater impact than kurtosis (e.g., Lumley, Diehr, Emerson, & Chen, 2002; Scheffé, 1959), although its impact on the F test varies. Tiku (1964) concluded that distributions with skewness in different directions had a greater effect on Type I errors than distributions with skewness in the same direction. Harwell et al. (1992) observed that when sample sizes are unequal, skewed distributions can result in slightly inflated Type I error rates. Skovlund and Fenstad (2001) also reported that the level of significance is sensitive to severely skewed distributions with unequal sample sizes and unequal variances.

To illustrate the effects of nonnormality due to skewness or kurtosis, let's consider the NORMAL, SKEWED, and UNIFORM variables in the dist data. Observations for the three variables were randomly simulated from a normal, a positively skewed, and a uniform distribution, respectively, so that the cell means are the same for each distribution. It should be noted that a uniform distribution is an extreme case of a distribution that is flatter than the standard normal, with a kurtosis of about −1.2. Figure 22.3 displays the histograms of the three variables aggregated over the three treatment conditions. The overall distributional shapes suggest that SKEWED and UNIFORM deviate from normal distributions in that the former is right-skewed whereas the latter looks platykurtic.

F tests are then performed to test the overall mean differences among the three conditions (condit) in NORMAL, SKEWED, and UNIFORM. To make sure that nonnormality is not confounded with variance heterogeneity, the homogeneity of variance is checked using the Levene procedure, and the resulting nonsignificant p values suggest that equal variances can be assumed for all three F tests. Table 22.8 summarizes the ANOVA results, which indicate a significant mean difference in SKEWED, F(2, 72) = 3.584, p = .033, ηp² = .091, but not in NORMAL, F(2, 72) = .662, p = .519, ηp² = .018, or UNIFORM, F(2, 72) = .123, p = .884, ηp² = .003.


Figure 22.3 Histograms of aggregated normal, skewed, and uniform variables.


Table 22.8 F Tests of Mean Differences Between condit in normal, skewed, and uniform

Table 22.9 SPSS Syntax and Output for the Kruskal–Wallis Test

a. Kruskal-Wallis test.

b. Grouping variable: PRESENTATION CONDITION.



To evaluate the F results, the Kruskal–Wallis (KW; Kruskal & Wallis, 1952) test, a nonparametric alternative to the F test, is conducted to test the respective mean differences across condit for the three variables. This test does not assume normal distributions; however, it can still be affected by some forms of nonnormality when variances are unequal (Cribbie & Keselman, 2003; Oshima & Algina, 1992; Tomarken & Serlin, 1986). The KW test results, as well as the SPSS syntax, are shown in Table 22.9. In contrast to the significant F result obtained with the positively skewed variable (SKEWED), the KW test reports a slightly higher p value (p = .058), resulting in a nonsignificant test at the .05 level. The p values for the NORMAL and UNIFORM variables are also a little larger than the corresponding p values from the F tests, although these differences do not alter the conclusions. From this example, we see that skewed distributions can result in spurious rejection of the null hypothesis. For practice, the reader can also conduct the F test and the KW test on the simulated variable nonnormal across the levels of ses in the hsb data to further understand the effect of the assumption violation.
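A minimal sketch of the KW specification (the value range 1 to 3 is an assumption about how condit is coded):

NPAR TESTS
  /K-W=normal skewed uniform BY condit(1 3).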

Summary Remarks

As most applied researchers who employ inferential statistical techniques make extensive use of ANOVA procedures, the importance of testing the F test assumptions and understanding the effects of violations cannot be overstated. The notion that dependence among observations, heterogeneity of variance, or nonnormality does not affect inferences based on the ANOVA F test has to be dispelled. Dependence of observations can be avoided by carefully designing the research through randomization. However, unequal variances and nonnormal distributions have to be evaluated even when the study is carefully designed. From the previous discussions, we see that F tests are relatively robust to nonnormality when equal variances hold, but not to variance heterogeneity even when normality holds, regardless of whether sample sizes are equal. Cramer (1994), among others, examined the effect of simultaneous violations of the normality and equality of variance assumptions, reported serious problems, and suggested that the ANOVA F test be avoided when simultaneous violations are detected. Consequently, it is essential for a researcher to compare treatment means with an appreciation for the underlying assumptions of the F test.

Although several statistical procedures are available for examining a particular ANOVA assumption, no technique is completely adequate. Statistical tests are flawed in two major ways: (a) The assumption tests were not designed to show that the null hypothesis is true, and thus the usual .05 critical level might not be an appropriate significance level, and (b) the precision of the tests depends on sample size. The F test is fairly robust with respect to nonnormality, and the degree of robustness depends on the number of observations per treatment; yet as the number of observations increases, normality tests become more sensitive, that is, more likely to declare nonnormality. Likewise, in testing the variance homogeneity assumption, one is likely to retain the null hypothesis with small samples even when variances are heterogeneous but to reject it with large samples even when the differences between treatment variances are too small to be a problem. Therefore, given these problems associated with statistical tests, the use of graphical approaches is generally recommended as a key part of any examination of assumptions.


When one is conducting an ANOVA with a between-groups factor, it is more appropriate to use the residual scores than the raw scores to check the assumptions. The ANOVA test is based on the F ratio, which is the between-groups variance divided by the within-group variance. The latter is the average squared deviation of scores around their group means (i.e., the average of the squared residuals). To ensure that no observation has a strong influence on the calculation of the within-group variance, we should examine these residuals and look for heterogeneity of variances, outliers, or extremely nonnormal distributions. If the assumptions of normality and variance homogeneity are reasonable, then certain characteristics should be evident in the residuals, and deviations from these characteristics suggest that the ANOVA assumptions may not be met. It has to be noted again that a test of normality on the residuals is not appropriate if there is a heterogeneity of variance problem. Hence, it is recommended that normality be examined after the unequal variances problem has been dealt with.

As practice, the reader can follow the procedures described in this chapter to evaluate the assumptions for a two-factor ANOVA design using students' sex (sex) and socioeconomic status (ses) to model their science scores (sci) in the hsb data.

Appendix A: Data Set

Example Data

Throughout the chapter, a major data set is used to illustrate the procedures for evaluating ANOVA assumptions and setting up contrast coding. The data are based on a study investigating the effect of background sound on learning. It is predicted that students will learn most effectively with a constant background sound, as opposed to an unpredictable sound or no sound at all. To test this, participants were asked to study a passage of text for 30 minutes and were then given 20 multiple-choice questions to answer based on the material. Each participant was randomly assigned to one of three studying groups: One group had background sound at a constant volume, a second group studied with noise that changed volume periodically in the background, and a third group had no sound at all.

The data are stored in an SPSS data file, dist.sav. Each variable is described in the accompanying table. numright is the number of correct responses over the 20 items, and condit is the grouping variable with the three experimental conditions. The other quantitative variables (normal, skewed, uniform, and hetervar) are simulated from distributions that are normal, positively skewed, uniform, and heterogeneous in variances, respectively. Furthermore, independ and depend are variables simulated from distributions with and without independent observations.

Practice Data

Another data set is provided for those who want to practice or experiment with the techniques described in this chapter. The data are revised from the High School and Beyond study (Glass & Hopkins, 1996) and are stored in an SPSS data file, hsb.sav. Achievement (reading, math, and science scores) and demographic (sex, ses) information was recorded for a nationally representative sample of 600 high school seniors. hetervar and nonnormal are variables simulated from distributions with unequal variances and nonnormal shapes, respectively. A description of each variable is as follows.


Appendix B: Brown–Forsythe's Test

For the dist example, within–group medians—that is, medians of correct responses (numright) for male and
female participants (gender) at three treatment conditions (condit)—are first computed using the MEANS
procedure:


Then the numright scores are transformed to derive the dispersion variable based on their absolute deviations from the respective group medians (|Yij − mj|) using the following syntax:
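* Map each case to its group median via the saved fitted values (pre_1); the cutoffs follow the medians from the MEANS output above.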

IF (pre_1 < 13) dispersion = ABS(numright - 13).

IF (pre_1 > 13 & pre_1 < 15) dispersion = ABS(numright - 15).

IF (pre_1 > 15 & pre_1 < 16.5) dispersion = ABS(numright - 16.5).

IF (pre_1 > 16.5) dispersion = ABS(numright - 18).

A new variable, dispersion, is thereby created in SPSS. The upper part of Table B1 shows the values of the original numright and the dispersion variables in each treatment condition. An ANOVA F test is then performed to test group differences in dispersion with the ONEWAY procedure, using the fitted values as the factor:

ONEWAY dispersion BY pre_1.

The F test results, which constitute the Brown–Forsythe test, are summarized at the bottom of Table B1.

Note
1. Readers can refer to Chapter 18 (this volume) on robust methods for more information, although the
chapter is more focused on regression–style methodology.


Table B1 An Example of Using the Brown–Forsythe Test to Test Homogeneity of Variance

References

Blair, R. C., & Higgins, J. J. (1980). A comparison of the power of Wilcoxon's rank-sum statistic to that of Student's t statistic under various nonnormal distributions. Journal of Educational Statistics, 5, 309-335.

Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40, 318-335.

Breckler, S. J. (1990). Application of covariance structure modeling in psychology: Cause for concern? Psychological Bulletin, 107, 260-273.

Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69, 364-367.

Cochran, W. G. (1947). Some consequences when the assumptions for the analysis of variance are not satisfied. Biometrics, 3, 22-38.

Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics, 23, 351-361.

Cramer, D. (1994). Introducing statistics for social research: Step-by-step calculations and computer techniques using SPSS. London: Routledge.

Cribbie, R. A., & Keselman, H. J. (2003). The effects of nonnormality on parametric, nonparametric, and model comparison approaches to pairwise comparisons. Educational and Psychological Measurement, 63, 615-635.

Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Needham Heights, MA: Allyn & Bacon.

Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237-288.

Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-339.

Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., et al. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.

Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.

Krutchkoff, R. G. (1988). One-way fixed effects analysis of variance when the error variances may be unequal. Journal of Statistical Computation and Simulation, 30, 177-183.

Levy, K. J. (1978). An empirical comparison of the ANOVA F-test with alternatives which are more robust against heterogeneity of variance. Journal of Statistical Computation and Simulation, 8, 49-57.

Lix, L. M., & Keselman, H. J. (1995). Approximate degrees of freedom tests: A unified perspective on testing for mean equality. Psychological Bulletin, 117, 547-560.

Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66, 579-619.

Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151-169.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

Miller, R. G. (1986). Beyond ANOVA, basics of applied statistics. New York: John Wiley.

Olejnik, S. F., & Algina, J. (1987). Type I error rates and power estimates of selected parametric and nonparametric tests of scale. Journal of Educational Statistics, 12, 45-61.

Oshima, T. C., & Algina, J. (1992). Type I error rates for James's second-order test and Wilcox's Hm test under heteroscedasticity and non-normality. British Journal of Mathematical and Statistical Psychology, 42, 255-263.

Penfield, D. A. (1994). Choosing between parametric and nonparametric tests. Journal of Experimental Education, 62, 343-360.

Rogan, J. C., & Keselman, H. J. (1977). Is the ANOVA F-test robust to variance heterogeneity when sample sizes are equal? An investigation via a coefficient of variation. American Educational Research Journal, 14, 493-498.

Scariano, S. M., & Davenport, J. M. (1987). The effects of violations of independence in the one-way ANOVA. The American Statistician, 41, 123-129.

Scheffé, H. (1959). The analysis of variance. New York: John Wiley.

Skovlund, E., & Fenstad, G. U. (2001). Should we always choose a nonparametric test when comparing two apparently nonnormal distributions? Journal of Clinical Epidemiology, 54, 86-92.

Stevens, J. (1996). Applied multivariate statistics for the social sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

Tiku, M. L. (1964). Approximating the general nonnormal variance-ratio sampling distributions, 46, 114-122.

Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance-covariance heterogeneity and specific noncentrality structures. Psychological Bulletin, 99, 90-99.

Weerahandi, S. (1995). ANOVA under unequal error variances. Biometrics, 51, 589-599.

Welch, B. L. (1951). On the comparison of several mean values: An alternative approach. Biometrika, 38, 330-336.

http://dx.doi.org/10.4135/9781412995627.d27
