
Using R for an Analysis of Variance (ANOVA)

In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional entries (all current as of version R-2.4.1). Sample texts from an R session are highlighted with gray shading.

In a one-way analysis of variance we are trying to find evidence that a single independent variable influences the results for a dependent variable. Suppose, for example, that we are evaluating a multivitamin tablet for iron to see if there is too much variability between individual tablets. Taking three tablets and analyzing each four times, we obtain the following results in mg Fe/g tablet.

Trial   Tablet A   Tablet B   Tablet C
1       5.67       5.75       4.74
2       5.67       5.47       4.45
3       5.55       5.43       4.65
4       5.57       5.45       4.94

In this example the different tablets are the independent variable and the dependent variable is the concentration of iron.

For a one-way analysis of variance the data must be in a data frame containing two columns, one an index for the independent variable and the other for the dependent variable's values. Because the raw data usually are already available in separate vectors, creating the data frame requires two steps: first, creating a data frame with one column for each level of the independent variable (here, one column per tablet) and, second, stacking the columns into a single column while simultaneously creating an indexing column. The syntax for stacking a data frame is

stack(dataframe)

> tabA = c(5.67, 5.67, 5.55, 5.57)
> tabB = c(5.75, 5.47, 5.43, 5.45)
> tabC = c(4.74, 4.45, 4.65, 4.94)
> tabs = data.frame(tabA, tabB, tabC)   # gather vectors into dataframe
> tabs
  tabA tabB tabC
1 5.67 5.75 4.74
2 5.67 5.47 4.45
3 5.55 5.43 4.65
4 5.57 5.45 4.94
> tablets = stack(tabs)   # stacks the data frame's columns
                          # into a single column labeled
                          # values and a second column
                          # labeled ind containing the
                          # original column labels as an
                          # index
> tablets
   values  ind
1    5.67 tabA
2    5.67 tabA
3    5.55 tabA
4    5.57 tabA
5    5.75 tabB
6    5.47 tabB
7    5.43 tabB
8    5.45 tabB
9    4.74 tabC
10   4.45 tabC
11   4.65 tabC
12   4.94 tabC
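The two-column layout described above can also be built by hand, which may help show what stack( ) produces. The following is a minimal sketch, not part of the original handout; the name tablets_manual and the two comparison flags are hypothetical, chosen only for illustration.

```r
# Iron results (mg Fe/g tablet) from the handout's table
tabA <- c(5.67, 5.67, 5.55, 5.57)
tabB <- c(5.75, 5.47, 5.43, 5.45)
tabC <- c(4.74, 4.45, 4.65, 4.94)

# Route 1: stack a wide data frame, as in the handout
tablets <- stack(data.frame(tabA, tabB, tabC))

# Route 2 (illustrative alternative): build the two columns by hand
tablets_manual <- data.frame(
  values = c(tabA, tabB, tabC),
  ind    = factor(rep(c("tabA", "tabB", "tabC"), each = 4))
)

# Both routes should describe the same 12 observations
same_values <- all(tablets$values == tablets_manual$values)
same_groups <- all(as.character(tablets$ind) == as.character(tablets_manual$ind))
str(tablets)
```

Either route gives one row per measurement, which is the layout that the model formula used in the analysis of variance expects.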

Now we are ready for the analysis of variance, the syntax for which is

anova(lm(dependent variable ~ independent variable, data = dataframe))

where lm( ) is the command for a linear model (more on this in a later handout); thus

> anova(lm(values ~ ind, data = tablets))
Analysis of Variance Table

Response: values
          Df  Sum Sq Mean Sq F value    Pr(>F)
ind        2 2.05787 1.02893  45.239 2.015e-05 ***
Residuals  9 0.20470 0.02274
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The row labeled ind provides the between-sample variance (the variance between the three tablets) and the row labeled Residuals provides the within-sample variance (the variance between the replicates for each tablet). The p-value of 2.015 × 10^-5 tells us that, if the three tablets had the same mean iron content, there would be only about a 0.002% probability of obtaining an F value this large by chance; thus, we have strong evidence that there is a difference between the mean results for the three tablets.

Having found evidence for a significant difference between the three tablets, we next seek to clarify where that difference lies. Fisher's least significant difference test is not a feature of R. Instead, we use Tukey's Honest Significant Difference Test, the syntax for which is

TukeyHSD(aov(dependent variable ~ independent variable, data = dataframe), conf.level = 0.95)
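To see where the entries in the analysis of variance table come from, the between-sample and within-sample sums of squares can be computed directly. This is a sketch under the assumption of equal replication, using the handout's data; the variable names (ss_between, ss_within, f_value, p_value) are illustrative and do not appear in the original.

```r
# Rebuild the stacked data frame from the handout's data
tablets <- stack(data.frame(
  tabA = c(5.67, 5.67, 5.55, 5.57),
  tabB = c(5.75, 5.47, 5.43, 5.45),
  tabC = c(4.74, 4.45, 4.65, 4.94)
))

grand_mean  <- mean(tablets$values)
group_means <- tapply(tablets$values, tablets$ind, mean)
group_sizes <- tapply(tablets$values, tablets$ind, length)

# Between-sample sum of squares: spread of tablet means about the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-sample sum of squares: spread of replicates about their own tablet mean
own_mean  <- ave(tablets$values, tablets$ind)   # tablet mean repeated per row
ss_within <- sum((tablets$values - own_mean)^2)

df_between <- nlevels(tablets$ind) - 1
df_within  <- nrow(tablets) - nlevels(tablets$ind)

# F is the ratio of the two mean squares; the p-value is the upper tail of F
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within,
        f_value = f_value, p_value = p_value))
```

These quantities correspond to the Sum Sq, F value, and Pr(>F) columns of the table reported by anova( ).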

where the optional argument conf.level defaults to 0.95 unless otherwise specified.

> TukeyHSD(aov(values ~ ind, data = tablets))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = values ~ ind, data = tablets)

$ind
           diff        lwr        upr     p adj
tabB-tabA -0.09 -0.3877412  0.2077412 0.6866791
tabC-tabA -0.92 -1.2177412 -0.6222588 0.0000321
tabC-tabB -0.83 -1.1277412 -0.5322588 0.0000731

The table at the bottom of the output shows the actual differences (diff) between the mean values for each pair of tablets and the lower (lwr) and upper (upr) boundaries for this difference. If the range between the lower and upper boundaries includes a difference of zero (0), then there is no evidence for a significant difference between the means; if the range does not include zero, then the difference is significant. In this case we find that the differences between Tablet C and each of Tablets A and B are significant, but we have no evidence to suggest a significant difference between Tablets A and B. The values of p adj give the p-value for each difference after adjusting for multiple comparisons; for example, the adjusted p-value of 0.687 for Tablets A and B is far greater than 0.05, so that difference is not significant at the 95% confidence level, whereas the adjusted p-values for the two comparisons involving Tablet C are both less than 0.0001.
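The lower and upper boundaries can be reproduced from the studentized range distribution, which may make the "does the interval include zero" rule more concrete. The sketch below assumes equal replication (four analyses per tablet) and uses illustrative names (mse, q_crit, half_width, significant) that are not in the original handout.

```r
# Rebuild the stacked data frame from the handout's data
tablets <- stack(data.frame(
  tabA = c(5.67, 5.67, 5.55, 5.57),
  tabB = c(5.75, 5.47, 5.43, 5.45),
  tabC = c(4.74, 4.45, 4.65, 4.94)
))

fit <- aov(values ~ ind, data = tablets)

# Residual (within-sample) mean square from the fitted model
mse <- deviance(fit) / df.residual(fit)

n_per_group <- 4                      # replicates per tablet (equal here)
k           <- nlevels(tablets$ind)   # number of tablets being compared

# Critical value of the studentized range for a 95% family-wise level
q_crit <- qtukey(0.95, nmeans = k, df = df.residual(fit))

# With equal replication every pairwise interval is diff +/- half_width
half_width <- q_crit * sqrt(mse / n_per_group)

# Flag the pairs whose interval excludes zero
tukey       <- TukeyHSD(fit)$ind
significant <- tukey[, "lwr"] > 0 | tukey[, "upr"] < 0
print(half_width)
print(significant)
```

Because every tablet has the same number of replicates, a single half-width applies to all three comparisons; with unequal replication each pair would need its own value.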

ANOVA With More Than One Independent Variable


Suppose that the data for the analysis of iron in a multivitamin has another independent variable: the acid used to dissolve the tablet. For example, the first two replicates for each tablet might have been obtained by dissolving the tablet in HCl and the third and fourth replicates by dissolving the tablet in HNO3; thus

Acid Used   Trial   Tablet A   Tablet B   Tablet C
HCl         1       5.67       5.75       4.74
HCl         2       5.67       5.47       4.45
HNO3        1       5.55       5.43       4.65
HNO3        2       5.57       5.45       4.94

With two independent variables, we need to examine the effect of any difference between tablets and between acids, as well as any interaction between the acids and the tablets. To accomplish a two-way analysis of variance we need a data frame containing three columns, one for the dependent variable's values, one providing an index for the tablets and one providing an index for the acid.

> acid = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2)   # 1 = HCl, 2 = HNO3
> tabletsnew = data.frame(tablets, acid)
> anova(lm(values ~ ind*acid, data = tabletsnew))
Analysis of Variance Table

Response: values
          Df  Sum Sq Mean Sq F value    Pr(>F)
ind        2 2.05787 1.02893 49.9078 0.0001823 ***
acid       1 0.00213 0.00213  0.1035 0.7586088
ind:acid   2 0.07887 0.03943  1.9127 0.2277221
Residuals  6 0.12370 0.02062
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This time the command anova includes the formula values ~ ind*acid where the symbol * indicates that the tablets (ind), the type of acid (acid), and the interaction between the two (ind:acid) are all to be tested for significance. In this case only the tablet is a significant factor; thus, there is an effect due to the tablets, but not due to the choice of acid or to any interaction between the tablets and the acids.
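The * operator in the formula is shorthand for the two main effects plus their interaction. The following sketch, which is not part of the original handout, fits both spellings and compares the resulting tables. It also codes the acid as a labeled factor rather than the numbers 1 and 2; that coding is an assumption made here for readability, and with only two acids it should give the same table as the numeric coding.

```r
# Rebuild the stacked data frame from the handout's data
tablets <- stack(data.frame(
  tabA = c(5.67, 5.67, 5.55, 5.57),
  tabB = c(5.75, 5.47, 5.43, 5.45),
  tabC = c(4.74, 4.45, 4.65, 4.94)
))

# Acid index as a labeled factor (assumed coding): replicates 1-2 in HCl,
# replicates 3-4 in HNO3, repeated for each of the three tablets
acid <- factor(rep(c("HCl", "HCl", "HNO3", "HNO3"), times = 3))
tabletsnew <- data.frame(tablets, acid)

# Compact formula, as in the handout
fit_star <- lm(values ~ ind * acid, data = tabletsnew)

# Expanded formula: main effects plus the ind:acid interaction
fit_long <- lm(values ~ ind + acid + ind:acid, data = tabletsnew)

tab_star <- anova(fit_star)
tab_long <- anova(fit_long)
print(tab_star)
```

Writing the formula in expanded form makes it easy to drop the interaction term (values ~ ind + acid) when only the main effects are of interest.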
