
Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a method for decomposing the variance in a measured outcome into variance that can be explained, such as by a regression model or an experimental treatment assignment, and variance that cannot be explained, which is often attributable to random error. Using this decomposition into component sums of squares, certain test statistics can be calculated that can be used to describe the data or even justify model selection. This article first discusses how to decompose the variance into explained and unexplained components in both a regression and an experimental context, and then how to use analysis of variance to explore and justify regression model selection.

In the familiar regression context, the sum of squares can be decomposed as follows, where $Y_i$ is individual $i$'s outcome, $\bar{Y}$ is the mean of the outcomes, $\hat{Y}_i$ is individual $i$'s fitted value based on the OLS estimates, and $e_i$ is the resulting residual:

$$\sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N} e_i^2$$

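where

$$SS_{\text{total}} = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$$

is the total variance in the observations,

$$SS_{\text{regression}} = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2$$

refers to the variance explained by the regression, and

$$SS_{\text{error}} = \sum_{i=1}^{N} e_i^2$$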

is the variance due to the error term, also known as the unexplained variance. Commonly, we would write this decomposition as:

$$SS_{\text{total}} = SS_{\text{regression}} + SS_{\text{error}}$$

The equations above show how the total variance in the observations can be decomposed into that variance which can be explained by the regression equation and that variance which can be attributed to the random error term in the regression model.
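The decomposition above can be checked numerically. The following Python sketch fits a one-regressor OLS line and computes the three sums of squares. The data values and the helper names (`ols_simple`, `sums_of_squares`) are assumptions made for illustration and do not come from the article.

```python
# Illustrative data (assumed for this sketch; not from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]


def mean(values):
    return sum(values) / len(values)


def ols_simple(x, y):
    """Return (intercept, slope) of the OLS line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sums_of_squares(x, y):
    """Return (SS_total, SS_regression, SS_error) for the fitted line."""
    intercept, slope = ols_simple(x, y)
    y_bar = mean(y)
    fitted = [intercept + slope * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_total, ss_regression, ss_error


ss_total, ss_regression, ss_error = sums_of_squares(x, y)
# The two components add up to the total, apart from floating-point rounding.
gap = abs(ss_total - (ss_regression + ss_error))
print(ss_total, ss_regression, ss_error, gap)
```

The identity relies on the fitted values coming from OLS with an intercept. With an arbitrary fitted line the cross term does not vanish and the two components need not add up to the total.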

Analysis of variance is not restricted to use with regression models. The concept of decomposing variance can be applied to other models of the world, such as an experimental model. The following is the decomposition of a one-way layout experimental design in which an experimenter randomly assigns observations to one of I treatment assignments. Each treatment assignment has J observations assigned to it. In the case of a randomized controlled trial with only one treatment regime and N subjects randomly assigned to treatment with a 1/2 probability, this would mean that I = 2, one treated group and one control group, where each group has size J. In this framework, the variance decomposition would be as follows:

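$$\sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} (\bar{Y}_{i.} - \bar{Y}_{..})^2 + \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i.})^2$$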

where

$$\bar{Y}_{i.} = \frac{1}{J} \sum_{j=1}^{J} Y_{ij}$$

is defined as the average response under the $i$th treatment and

$$\bar{Y}_{..} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} Y_{ij}$$

is defined as the overall average of all observations, regardless of treatment assignment. Commonly, this sum of squares expression is written as

$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}$$

where $SS_{\text{between}}$ refers to the part of the variance that can be attributed to the different treatment assignments and $SS_{\text{within}}$ refers to the variance that can be attributed to random error within a treatment assignment. From this, we can see that $SS_{\text{between}}$ and $SS_{\text{regression}}$, from the regression framework, both refer to explained variance, while $SS_{\text{within}}$ and $SS_{\text{error}}$ both refer to unexplained variance.
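The one-way decomposition can be illustrated with a short Python sketch. The responses below are hypothetical (I = 3 treatments with J = 4 observations each), and the function name `one_way_sums_of_squares` is an assumption of this example.

```python
# Hypothetical responses: I = 3 treatments, J = 4 observations per treatment.
groups = [
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 8.0, 9.0, 8.0],
    [1.0, 2.0, 3.0, 2.0],
]


def one_way_sums_of_squares(groups):
    """Return (SS_total, SS_between, SS_within) for a balanced one-way layout."""
    all_obs = [y for group in groups for y in group]
    grand_mean = sum(all_obs) / len(all_obs)
    group_means = [sum(group) / len(group) for group in groups]

    ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
    # Each treatment mean is counted once per observation in that treatment.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (y - m) ** 2 for group, m in zip(groups, group_means) for y in group
    )
    return ss_total, ss_between, ss_within


ss_total, ss_between, ss_within = one_way_sums_of_squares(groups)
print(ss_total, ss_between, ss_within)
```

In this made-up data set the treatment means are far apart relative to the spread inside each treatment, so most of the total sum of squares falls in the between component.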

Typically, using the above decompositions, an analysis of variance table is constructed. In the regression context, where $p$ is defined as the number of independent regressors and $n$ is the number of observations, the ANOVA table typically looks like:

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | $p$ | $SS_{\text{regression}}$ | $SS_{\text{regression}}/p$ | $\dfrac{SS_{\text{regression}}/p}{SS_{\text{error}}/(n-p-1)}$ |
| Error | $n-p-1$ | $SS_{\text{error}}$ | $SS_{\text{error}}/(n-p-1)$ | |
| Total | $n-1$ | $SS_{\text{total}}$ | | |

This table gives us a sense for how to break down our analysis. In the regression framework, the degrees of freedom for the regression is $p$, the number of parameters in the regression equation excluding the intercept. The degrees of freedom for error is $n - p - 1$. Finally, the total number of degrees of freedom is defined as $df_{\text{regression}} + df_{\text{error}} = n - 1$. The column MS refers to the mean square, which is defined as SS/df for each row in the table.

If we were in the experimental one-way layout design, then the first row would refer to the between variance, and the error row would refer to the within variance. The degrees of freedom for the model is the number of treatments less one, $(I - 1)$ for $I$ treatments. The degrees of freedom for error is defined as the number of treatments times the number of trials in each treatment less one, or $I(J - 1)$ for $J$ trials in each treatment.

The above table also contains a column called F, which refers to the F-test. The F-test is a way of using the analysis of variance to determine whether the coefficients on all of the regressors in a regression equation are jointly zero. In a one-way experimental analysis, the F-test determines whether the means of the treatment groups are significantly different. If we have more than one treatment and one control group, then the F-test tests whether any of the treatment effects is significantly different from zero. The null hypothesis for an F-test is that all coefficients in our model, other than the intercept, are jointly equal to 0. The F-test is defined as:

$$F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{SS_{\text{regression}}/p}{SS_{\text{error}}/(n-p-1)}$$

Using the ANOVA table, we can also determine the $R^2$ value of our treatment or model. We define $R^2$ as follows:

$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}}$$

The $R^2$ refers to how much of the variance is explained by the model. A high $R^2$ value means that much of the variation is explained by the model, which in a regression framework means that the model fits the data well. This also implies that very little of the variance is attributed to the random error term. In the experimental framework, a high $R^2$ value would mean that much of the variation is explained by the treatment assignment, and little of the variance is due to random error within those treatment assignments.
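The table entries, the F statistic and $R^2$ all follow mechanically from the two sums of squares and the counts $n$ and $p$. The sketch below shows one way to compute them in Python. The function name `anova_summary` and the input values are hypothetical and chosen only for illustration.

```python
def anova_summary(ss_regression, ss_error, n, p):
    """Compute ANOVA table quantities for n observations and p regressors."""
    df_regression = p
    df_error = n - p - 1
    ms_regression = ss_regression / df_regression  # mean square, explained
    ms_error = ss_error / df_error                 # mean square, unexplained
    ss_total = ss_regression + ss_error
    return {
        "df": (df_regression, df_error, df_regression + df_error),
        "MS": (ms_regression, ms_error),
        "F": ms_regression / ms_error,
        "R2": ss_regression / ss_total,
    }


# Hypothetical inputs: SS_regression = 72, SS_error = 6, n = 12, p = 2.
summary = anova_summary(72.0, 6.0, n=12, p=2)
print(summary)
```

To carry out the test, the resulting F value would be compared with a critical value of the F distribution with $(p, n - p - 1)$ degrees of freedom, taken from a table or from statistical software.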

Model Selection and Analysis of Variance

Analysis of variance is generally used with linear regression to assess model selection. When selecting the best model, we seek to strike a balance between goodness of fit and parsimony. If two models fit the data equally well, the model selected should include only those explanatory variables that explain a significant degree of the variance in the response variable. The question is how to distinguish between important and trivial variables in a way that is systematic. Analysis of variance is one method for identifying the parameters of interest. It is important to stress that ANOVA makes all the same assumptions made by normal linear regression. Furthermore, in general applications of ANOVA, the explanatory variables must all be mutually orthogonal, although in some limited cases this orthogonality is not necessary to make a reasonable justification for model choice.

In order to determine which covariates are important for the regression model, analysis of variance can be run multiple times in succession to determine whether adding an additional covariate contributes any more to the explained variance. If adding an additional covariate significantly reduces the unexplained variance, then there is justification for including that covariate in the model. However, if adding the additional covariate does not reduce the unexplained variance in a significant way, then there is justification for leaving it out of the model. Note that the order in which covariates are added matters, since the F-test and the reduction in the sum of squares depend on which covariates were previously added to the model.

Consider two normal linear models:

$$y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon,$$

$$y = \alpha + \beta_1 X_1 + \varepsilon$$

where $X_1$ and $X_2$ are explanatory variables, $\alpha$ is the intercept, $\beta_1$ and $\beta_2$ are the parameters of interest, and $\varepsilon$ is the error term. The second is obviously a simpler version of the first. We can think of model two as the version of model one in which $\beta_2$ is restricted to 0. For this reason, we often refer to model one as the unrestricted model and model two as the restricted model. The question is which model is better. If the restricted model fits the data equally well, then adding complexity does not improve the accuracy of the estimation significantly and the simpler (restricted) model is preferable.

Analysis of variance is a common method used to compare the relative fit of the two related models. This method analyzes the degree to which residual variance changes with the addition of explanatory variables to the basic model. Note that the vector of residuals for the restricted model (where $\beta_2 = 0$) can be broken into two components:

$$y - \hat{y}_1 = (y - \hat{y}_2) + (\hat{y}_2 - \hat{y}_1)$$

where $\hat{y}_1 = \hat{\beta}_1 X_1$ and $\hat{y}_2 = \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$. Thus, the vector of residuals for the restricted model consists of the vector of residuals for the unrestricted model plus the difference between the fitted values of the two models. By construction of OLS, the vectors $y - \hat{y}_2$ and $\hat{y}_2 - \hat{y}_1$ are orthogonal, and the Pythagorean theorem implies that the residual sum of squares for the restricted model is just the residual sum of squares for the unrestricted model plus the difference in the sums of squares for the two models, or equivalently:

$$SS_{\beta_1} = SS_{\beta_1,\beta_2} + (SS_{\beta_1} - SS_{\beta_1,\beta_2})$$
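The orthogonality claim and the resulting Pythagorean identity can be verified numerically. The sketch below uses made-up data and a small helper, `ols_fitted`, that solves the normal equations by Gauss-Jordan elimination. Both the data and the helper are assumptions of this example rather than anything prescribed by the article. The restricted and unrestricted fits are computed without an intercept, matching the forms of the fitted values given above.

```python
def ols_fitted(columns, y):
    """Least-squares fitted values from regressing y on the given columns.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination
    with partial pivoting. Assumes the columns are linearly independent.
    """
    n, k = len(y), len(columns)
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(columns[r][i] * columns[c][i] for i in range(n)) for c in range(k)]
        + [sum(columns[r][i] * y[i] for i in range(n))]
        for r in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[r][k] / a[r][r] for r in range(k)]
    return [sum(beta[c] * columns[c][i] for c in range(k)) for i in range(n)]


# Made-up data for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9]

fit_restricted = ols_fitted([x1], y)        # y-hat_1
fit_unrestricted = ols_fitted([x1, x2], y)  # y-hat_2

resid_unrestricted = [yi - f for yi, f in zip(y, fit_unrestricted)]
fit_difference = [f2 - f1 for f1, f2 in zip(fit_restricted, fit_unrestricted)]

# Inner product of the two component vectors; zero up to rounding.
dot = sum(r * d for r, d in zip(resid_unrestricted, fit_difference))

ss_restricted = sum((yi - f) ** 2 for yi, f in zip(y, fit_restricted))
ss_unrestricted = sum(r ** 2 for r in resid_unrestricted)
ss_difference = sum(d ** 2 for d in fit_difference)
print(dot, ss_restricted, ss_unrestricted + ss_difference)
```

The identity holds because the restricted model is nested in the unrestricted one, so the difference in fitted values lies in the column space of the unrestricted design and the unrestricted residuals are orthogonal to that space.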

While adding complexity reduces the amount of unexplained variance in the residuals, it also reduces the degrees of freedom. This trade-off motivates the principle of parsimony in model selection. We want to choose the model that produces the most precise estimates of the parameters of interest: the model that includes only those explanatory variables that explain a significant amount of the variance in the dependent variable. Under the assumptions of the normal linear model and the null hypothesis that $\beta_2 = 0$, the quantities $SS_{\beta_1} - SS_{\beta_1,\beta_2}$ and $SS_{\beta_1,\beta_2}$ are mutually independent and, once scaled by the error variance, each has a $\chi^2$ distribution. The F-test is therefore the appropriate test of whether the inclusion of each additional explanatory variable in the model significantly improves the precision of estimation. In this case, the F-test would look as follows:

$$F = \frac{(SS_{\beta_1} - SS_{\beta_1,\beta_2})/(p - q)}{SS_{\beta_1,\beta_2}/(n - p - 1)}$$

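where $p$ and $q$ represent the number of parameters in the unrestricted and restricted model respectively (excluding the intercept). The null hypothesis is that the unrestricted model does not provide a significantly better fit than the restricted model; reject the null if the F calculated from the data is greater than the critical value of the F distribution with $(p - q, n - p - 1)$ degrees of freedom. The models, their sums of squares, mean squares and F-tests can be displayed in an analysis of variance table.

| Model | df | SS | MS | F |
|---|---|---|---|---|
| $y = \alpha + \varepsilon$ | 1 | $SS_{\text{error}} - SS_{\alpha}$ | $\dfrac{SS_{\text{error}} - SS_{\alpha}}{n - (n-1)}$ | $\dfrac{SS_{\text{error}} - SS_{\alpha}}{SS_{\alpha}/(n-1)}$ |
| $y = \alpha + \beta_1 X_1 + \varepsilon$ | 2 | $SS_{\alpha} - SS_{\beta_1}$ | $\dfrac{SS_{\alpha} - SS_{\beta_1}}{(n-1) - (n-2)}$ | $\dfrac{SS_{\alpha} - SS_{\beta_1}}{SS_{\beta_1}/(n-2)}$ |
| $y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ | 3 | $SS_{\beta_1} - SS_{\beta_1,\beta_2}$ | $\dfrac{SS_{\beta_1} - SS_{\beta_1,\beta_2}}{(n-2) - (n-3)}$ | $\dfrac{SS_{\beta_1} - SS_{\beta_1,\beta_2}}{SS_{\beta_1,\beta_2}/(n-3)}$ |
| Total | 3 | $SS_{\text{error}}$ | $SS_{\text{error}}/(n-3)$ | |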
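A minimal Python sketch of this model comparison follows. It fits the intercept-only, restricted and unrestricted models to made-up data, computes the F statistic for adding $X_2$ to a model that already contains $X_1$, and also reports how much $X_2$ reduces the residual sum of squares when it is entered before rather than after $X_1$. The data, the helper `ols_rss` and the function `nested_f` are assumptions of this example.

```python
def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns.

    Solves the normal equations by Gauss-Jordan elimination with partial
    pivoting. Assumes the design columns are linearly independent.
    """
    n = len(y)
    design = [[1.0] * n] + [list(c) for c in columns]
    k = len(design)
    a = [
        [sum(design[r][i] * design[c][i] for i in range(n)) for c in range(k)]
        + [sum(design[r][i] * y[i] for i in range(n))]
        for r in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[r][k] / a[r][r] for r in range(k)]
    fitted = [sum(beta[c] * design[c][i] for c in range(k)) for i in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def nested_f(rss_restricted, rss_unrestricted, n, p, q):
    """F statistic comparing nested models with q < p regressors (no intercept counted)."""
    numerator = (rss_restricted - rss_unrestricted) / (p - q)
    denominator = rss_unrestricted / (n - p - 1)
    return numerator / denominator


# Made-up data; x2 is deliberately correlated with x1 (not orthogonal).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.5, 1.9, 3.4, 3.8, 5.6, 5.9, 7.2, 8.4]
y = [2.3, 4.1, 6.8, 8.2, 11.5, 12.1, 14.6, 17.0]

rss_0 = ols_rss([], y)         # intercept only
rss_1 = ols_rss([x1], y)       # restricted model
rss_2 = ols_rss([x2], y)       # X2 entered first instead of X1
rss_12 = ols_rss([x1, x2], y)  # unrestricted model

f_x2_given_x1 = nested_f(rss_1, rss_12, n=len(y), p=2, q=1)

# Sum of squares attributed to X2 depends on when it enters the model.
reduction_x2_first = rss_0 - rss_2
reduction_x2_second = rss_1 - rss_12
print(f_x2_given_x1, reduction_x2_first, reduction_x2_second)
```

Because the two regressors here are strongly correlated, $X_2$ appears to explain a great deal when entered first and very little when entered after $X_1$. The F value would be compared with a critical value of the F distribution with $(p - q, n - p - 1)$ degrees of freedom, for example from a printed table.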

Often the results of an ANOVA table are used to justify the inclusion of each individual variable in the model. As the model expands from $p$ to $p + 1$ explanatory variables, the F-test evaluates the hypothesis that the parameter $\beta_{p+1} = 0$, given that the assumptions of the model are satisfied. If the explanatory variables that comprise the design matrix are all mutually orthogonal and we have the correct model, then the ANOVA results can be used to determine whether the inclusion of $X_{p+1}$ significantly improves the fit of the model. Without orthogonality, however, the results depend on the order in which the variables are added. As successive variables are added to the model, only the variance of the part of the variable that is orthogonal to the previously included variables is removed from the variance of the error. It is false to conclude that an insignificant F statistic implies anything about the relationship between that variable and the response if the condition of orthogonality does not hold.

This article has discussed the definition of ANOVA and how it is often applied to regression and experimental data. ANOVA is a decomposition of variance into component parts. There is variance which is attributable to a model, such as a regression model or an experimental treatment, and variance which is attributable to random error. Variance attributable to a model is commonly referred to as $SS_{\text{regression}}$ or $SS_{\text{between}}$, whereas variance due to random error is referred to as $SS_{\text{error}}$ or $SS_{\text{within}}$. Analysis of variance is often used to construct an ANOVA table, which succinctly presents the variance decomposition.

This method can also be used to justify regression model selection. The goal of model selection is to strike a balance between fit and degrees of freedom. ANOVA can be used to determine how much extra variance a marginal explanatory variable explains while also weighing the loss of a degree of freedom. An F-test is used to justify the inclusion of a marginal explanatory variable. It is important to note, however, that the order in which variables are added to a model matters in these tests unless the variables are orthogonal to one another. The decomposition of variance using the analysis of variance is a powerful tool for describing data and the fit of a model.

Adrienne Hosek
UC Berkeley

Erin Hartman
UC Berkeley

See also: Quantitative Methods, Basic Assumptions: Regression, Linear and Multiple

Further Reading

Davison, A. C. 2003. Statistical Models. New York: Cambridge University Press.

Rice, John A. 1995. Mathematical Statistics and Data Analysis. Belmont: Duxbury Press.

Hill, R. Carter, Judge, George G., and Griffiths, W. E. 2001. Undergraduate Econometrics. New York: Wiley Press.