
ANOVA (Analysis of Variance)

ANOVA is a statistical technique that assesses potential differences in a scale-level dependent variable by a
nominal-level variable having 2 or more categories.  For example, an ANOVA can examine potential differences in IQ
scores by Country (US vs. Canada vs. Italy vs. Spain). The ANOVA, developed by Ronald Fisher in 1918, extends the t and z tests, which only allow the nominal-level variable to have two categories. This test is also called the Fisher analysis of variance.

ANOVA Example
Calcium is an essential mineral that regulates the heart, is important for blood clotting and for building healthy bones. The
National Osteoporosis Foundation recommends a daily calcium intake of 1000-1200 mg/day for adult men and women. While
calcium is contained in some foods, most adults do not get enough calcium in their diets, so they take supplements. Unfortunately, some of the supplements have side effects, such as gastric distress, making them difficult for some patients to take on a regular basis.

 A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults
with osteopenia (a low bone density which may lead to osteoporosis) and adults with osteoporosis. Adults 60 years of age with
normal bone density, osteopenia and osteoporosis are selected at random from hospital records and invited to participate in the
study. Each participant's daily calcium intake is measured based on reported food intake and supplements. The data are shown
below.   

Normal Bone Density   Osteopenia   Osteoporosis
1200                  1000          890
1000                  1100          650
 980                   700         1100
 900                   800          900
 750                   500          400
 800                   700          350

Is there a statistically significant difference in mean calcium intake in patients with normal bone density as compared to patients
with osteopenia and osteoporosis? We will run the ANOVA using the five-step approach.

 Step 1. Set up hypotheses and determine level of significance

H0: μ1 = μ2 = μ3
H1: The means are not all equal
α = 0.05

 Step 2. Select the appropriate test statistic.  

The test statistic is the F statistic for ANOVA, F=MSB/MSE.

 Step 3. Set up decision rule.  

In order to determine the critical value of F we need degrees of freedom, df1 = k - 1 and df2 = N - k. In this example, df1 = k - 1 = 3 - 1 = 2 and df2 = N - k = 18 - 3 = 15. The critical value is 3.68 and the decision rule is as follows: Reject H0 if F > 3.68.
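
If you would rather look the critical value up with software than in a printed F table, here is a minimal sketch in Python (assuming SciPy is installed):

# A minimal sketch (assuming SciPy is available) that looks up the critical F value
# for alpha = 0.05 with df1 = 2 and df2 = 15, rather than reading it from a table.
from scipy.stats import f

alpha = 0.05
df1, df2 = 2, 15               # df1 = k - 1, df2 = N - k
critical_f = f.ppf(1 - alpha, df1, df2)
print(round(critical_f, 2))    # 3.68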

 Step 4. Compute the test statistic.  


To organize our computations we will complete the ANOVA table. In order to compute the sums of squares we must first
compute the sample means for each group and the overall mean.  

Normal Bone Density   Osteopenia     Osteoporosis
n1 = 6                n2 = 6         n3 = 6
Mean = 938.3          Mean = 800.0   Mean = 715.0
 If we pool all N=18 observations, the overall mean is 817.8. 

We can now compute the between-treatment sum of squares:

SSB = Σ nj (X̄j - X̄)²

Substituting the group means and the overall mean:

SSB = 6(938.3 - 817.8)² + 6(800.0 - 817.8)² + 6(715.0 - 817.8)²

Finally, carrying the unrounded means through the arithmetic,

SSB = 87,201.9 + 1,896.3 + 63,379.6 = 152,477.7

Next, we compute the error sum of squares, SSE.

SSE requires computing the squared differences between each observation and its group mean. We will compute SSE in parts.
For the participants with normal bone density:

Normal Bone Density   (X - 938.3)   (X - 938.3)²
1200                    261.7         68,486.9
1000                     61.7          3,806.9
 980                     41.7          1,738.9
 900                    -38.3          1,466.9
 750                   -188.3         35,456.9
 800                   -138.3         19,126.9
Total                       0        130,083.3

Thus, for the normal bone density group, the error sum of squares is 130,083.3.

For participants with osteopenia:

Osteopenia   (X - 800.0)   (X - 800.0)²
1000            200          40,000
1100            300          90,000
 700           -100          10,000
 800              0               0
 500           -300          90,000
 700           -100          10,000
Total             0         240,000

Thus, for the osteopenia group, the error sum of squares is 240,000.
For participants with osteoporosis:

Osteoporosis   (X - 715.0)   (X - 715.0)²
 890              175          30,625
 650              -65           4,225
1100              385         148,225
 900              185          34,225
 400             -315          99,225
 350             -365         133,225
Total               0         449,750

Thus, for the osteoporosis group, the error sum of squares is 449,750, and SSE = 130,083.3 + 240,000 + 449,750 = 819,833.3.

 We can now construct the ANOVA table.

Source of Variation   Sums of Squares (SS)   Degrees of freedom (df)   Mean Squares (MS)   F
Between Treatments    152,477.7              2                         76,238.9            1.395
Error or Residual     819,833.3              15                        54,655.6
Total                 972,311.0              17

 Step 5. Conclusion.  

We do not reject H0 because 1.395 < 3.68. We do not have statistically significant evidence at α = 0.05 to show that there is a difference in mean calcium intake in patients with normal bone density as compared to patients with osteopenia and osteoporosis. Are the differences in mean calcium intake clinically meaningful? If so, what might account for the lack of statistical significance?
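
As a check on the hand calculations, here is a minimal sketch in Python (assuming SciPy is installed) that runs the same one-way ANOVA on the calcium data from the tables above:

# A minimal sketch (assuming SciPy is available) that re-runs the calcium example
# with scipy.stats.f_oneway; the three lists are the intakes from the tables above.
from scipy.stats import f_oneway

normal       = [1200, 1000, 980, 900, 750, 800]
osteopenia   = [1000, 1100, 700, 800, 500, 700]
osteoporosis = [890, 650, 1100, 900, 400, 350]

f_stat, p_value = f_oneway(normal, osteopenia, osteoporosis)
print(round(f_stat, 3), round(p_value, 3))   # F ≈ 1.395, p ≈ 0.278 (> 0.05, so do not reject H0)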

What is a Chi Square Test?


There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:
 A chi-square goodness of fit test determines whether sample data match a hypothesized population distribution. For more details on this type, see: Goodness of Fit Test.
 A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.
 A very small chi-square test statistic means that your observed data fit your expected data extremely well. In other words, there is no evidence of a relationship.
 A very large chi-square test statistic means that the observed data do not fit the expected data well. In other words, there is evidence of a relationship.

What is a Chi-Square Statistic?


The formula for the chi-square statistic used in the chi square test is:

Χ²c = Σ [ (Oi - Ei)² / Ei ]

The subscript "c" is the degrees of freedom. "O" is your observed value and E is your expected value. It's very rare that you'll want to actually use this formula to calculate a chi-square statistic by hand. The summation symbol means that you'll have to
perform a calculation for every single data item in your data set. As you can probably imagine, the calculations can get very,
very, lengthy and tedious. Instead, you’ll probably want to use technology:
 Chi Square Test in SPSS.
 Chi Square P-Value in Excel.
A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of
variables: numerical (countable) variables and non-numerical (categorical) variables. The chi-squared statistic is a single number
that tells you how much difference exists between your observed counts and the counts you would expect if there were no
relationship at all in the population.
There are a few variations on the chi-square statistic. Which one you use depends upon how you collected the data and which
hypothesis is being tested. However, all of the variations use the same idea, which is that you are comparing your expected
values with the values you actually collect. One of the most common forms can be used for contingency tables:

Χ² = Σi [ (Oi - Ei)² / Ei ]

Where O is the observed value, E is the expected value and "i" is the "ith" position in the contingency table.

A low value for chi-square means your observed counts are close to your expected counts. In theory, if your observed and expected values were equal ("no difference") then chi-square would be zero, an event that is unlikely to happen in real life.
Deciding whether a chi-square test statistic is large enough to indicate a statistically significant difference isn't as easy as it seems.
It would be nice if we could say a chi-square test statistic > 10 means a difference, but unfortunately that isn't the case.
You could take your calculated chi-square value and compare it to a critical value from a chi-square table. If the chi-square value
is more than the critical value, then there is a significant difference.
You could also use a p-value. First state the null hypothesis and the alternate hypothesis. Then generate a chi-square curve for
your results along with a p-value (See: Calculate a chi-square p-value Excel). Small p-values (under 5%) usually indicate that a
difference is significant (or “small enough”).
Tip: The chi-square statistic can only be used on counts. It can't be used for percentages, proportions, means or similar statistical values. For example, if you have 10 percent of 200 people, you would need to convert that to a count (20) before you can run a test statistic.

Chi Square P-Values.


A chi square test will give you a p-value. The p-value will tell you if your test results are significant or not. In order to perform a
chi square test and get the p-value, you need two pieces of information:
1. Degrees of freedom. That’s just the number of categories minus 1.
2. The alpha level(α). This is chosen by you, or the researcher. The usual alpha level is 0.05 (5%), but you could also
have other levels like 0.01 or 0.10.
 
In elementary statistics or AP statistics, both the degrees of freedom (df) and the alpha level are usually given to you in a
question. You don’t normally have to figure out what they are. You may have to figure out the df yourself, but it’s pretty simple:
count the categories and subtract 1.
Degrees of freedom are placed as a subscript after the chi-square (Χ²) symbol. For example, the following chi-square shows 6 df: Χ²₆. And this chi-square shows 4 df: Χ²₄.
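
If you have software handy, a minimal Python sketch (assuming SciPy is installed) can return the critical value or the p-value directly; the statistic 14.2 below is just a hypothetical value for illustration:

# A minimal sketch (assuming SciPy is available): given df and alpha, look up the
# chi-square critical value, or turn a statistic into a p-value, instead of using a table.
from scipy.stats import chi2

df, alpha = 6, 0.05
critical_value = chi2.ppf(1 - alpha, df)   # reject H0 if the statistic exceeds this
p_value = chi2.sf(14.2, df)                # p-value for a hypothetical statistic of 14.2
print(round(critical_value, 2))            # ≈ 12.59
print(round(p_value, 3))                   # ≈ 0.027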

The Chi-Square Distribution

[Figure: the chi-square distribution. Image by Geek3, Wikimedia Commons, GFDL.]

The chi-square distribution (also called the chi-squared distribution) is a special case of the gamma distribution; a chi-square distribution with n degrees of freedom is equal to a gamma distribution with shape a = n/2 and rate b = 0.5 (or, equivalently, scale β = 2).
Let’s say you have a random sample taken from a normal distribution. The chi square distribution is the distribution of the sum of
these random samples squared . The degrees of freedom (k) are equal to the number of samplesbeing summed. For example, if
you have taken 10 samples from the normal distribution, then df = 10. The degrees of freedom in a chi square distribution is also
its mean. In this example, the mean of this particular distribution will be 10. Chi square distributions are always right skewed.
However, the greater the degrees of freedom, the more the chi square distribution looks like a normal distribution.
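
To see this "sum of squared standard normals" idea in action, here is a minimal simulation sketch in Python (assuming NumPy is installed):

# A minimal sketch (assuming NumPy is available): simulate a chi-square with 10 df
# as a sum of 10 squared standard normal draws and check that the mean is close to 10.
import numpy as np

rng = np.random.default_rng(0)
df, n_sims = 10, 100_000

samples = (rng.standard_normal((n_sims, df)) ** 2).sum(axis=1)
print(round(samples.mean(), 2))   # close to 10, the df (and mean) of the distribution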

Uses
The chi-squared distribution has many uses in statistics, including:

 Confidence interval estimation for a population standard deviation of a normal distribution from a sample standard
deviation.
 Independence of two criteria of classification of qualitative variables.
 Relationships between categorical variables (contingency tables).
 Sample variance study when the underlying distribution is normal.
 Tests of deviations of differences between expected and observed frequencies (one-way tables).
 The chi-square test (a goodness of fit test).

Chi Distribution
A similar distribution is the chi distribution. This distribution describes the square root of a variable distributed according to a chi-square distribution. With df = n > 0 degrees of freedom, it has the probability density function:

f(x) = 2^(1 - n/2) x^(n - 1) e^(-x²/2) / Γ(n/2)

for values where x is positive.


The cdf for this function does not have a closed form, but it can be approximated with a series of integrals, using calculus.
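
As a quick sanity check on the density above, here is a minimal Python sketch (assuming NumPy and SciPy are installed) that evaluates it directly and compares the result with SciPy's built-in chi distribution:

# A minimal sketch (assuming NumPy and SciPy are available) that evaluates the chi
# distribution density from the formula above and compares it with scipy.stats.chi.
import numpy as np
from scipy.special import gamma
from scipy.stats import chi

def chi_pdf(x, n):
    """2**(1 - n/2) * x**(n - 1) * exp(-x**2 / 2) / Gamma(n/2), for x > 0."""
    return 2 ** (1 - n / 2) * x ** (n - 1) * np.exp(-x ** 2 / 2) / gamma(n / 2)

x, df = 1.5, 3
print(round(chi_pdf(x, df), 3), round(chi.pdf(x, df), 3))   # both ≈ 0.583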

How to Calculate a Chi Square Statistic


A chi-square statistic is used for testing hypotheses. The steps below show how to calculate one by hand.

The chi-square formula: Χ² = Σ [ (O - E)² / E ]

The chi-square formula is a difficult formula to deal with. That’s mostly because you’re expected to add a
large amount of numbers. The easiest way to solve the formula is by making a table.
Sample question: 256 visual artists were surveyed to find out their zodiac sign. The results were: Aries
(29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23),
Capricorn (18), Aquarius (20), Pisces (23). Test the hypothesis that zodiac signs are evenly distributed
across visual artists.
Step 1: Make a table with columns for "Categories," "Observed," "Expected," "Residual (Obs - Exp)", "(Obs - Exp)²" and "Component (Obs - Exp)² / Exp." Don't worry what these mean right now; we'll cover that in the following steps.
Step 2: Fill in your categories. Categories should be given to you in the question. There are 12 zodiac signs,
so:

Step 3: Write your counts. Counts are the number of items in each category; write them in column 2. You're given the counts in the question:

Step 4: Calculate your expected value for column 3. In this question, we would expect the 12 zodiac signs
to be evenly distributed for all 256 people, so 256/12=21.333. Write this in column 3.
Step 5: Subtract the expected value (Step 4) from the Observed value (Step 3) and place the result in the
“Residual” column. For example, the first row is Aries: 29-21.333=7.667.

Step 6: Square your results from Step 5 and place the amounts in the (Obs - Exp)² column.

Step 7: Divide the amounts in Step 6 by the expected value (Step 4) and place those results in the final
column.
Step 8: Add up (sum) all the values in the last column.

This is the chi-square statistic: 5.094.
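
If you would rather let software do the arithmetic, here is a minimal Python sketch (assuming SciPy is installed) that reproduces the zodiac example:

# A minimal sketch (assuming SciPy is available) that reproduces the zodiac example.
# With no expected frequencies supplied, scipy.stats.chisquare assumes equal expected
# counts (256/12 ≈ 21.333 per sign), just like Step 4 above.
from scipy.stats import chisquare

observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]
statistic, p_value = chisquare(observed)
print(round(statistic, 3), round(p_value, 3))   # ≈ 5.094 and p ≈ 0.926 (11 df)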


Pearson's Correlation Coefficient
Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for
example, age and blood pressure. Pearson's correlation coefficient (r) is a measure of the strength of the
association between the two variables.
The first step in studying the relationship between two continuous variables is to draw a scatter plot of the variables
to check for linearity. The correlation coefficient should not be calculated if the relationship is not linear. For correlation purposes alone, it does not really matter which axis each variable is plotted on. However, conventionally,
the independent (or explanatory) variable is plotted on the x-axis (horizontally) and the dependent (or response)
variable is plotted on the y-axis (vertically).
The nearer the scatter of points is to a straight line, the higher the strength of association between the variables.
Also, it does not matter what measurement units are used.

Values of Pearson's correlation coefficient


Pearson's correlation coefficient (r) for continuous (interval level) data ranges from -1 to +1:

r = -1 data lie on a perfect straight line with a negative slope

r = 0 no linear relationship between the variables

r = +1 data lie on a perfect straight line with a positive slope

Positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa.
Tip: the square of the correlation coefficient indicates the proportion of the variation in one variable 'explained' by the other (see Campbell & Machin, 1999 for more details).
Statistical significance of r
The t-test is used to establish if the correlation coefficient is significantly different from zero, and hence that there is evidence of an association between the two variables. The underlying assumption is that the data are a random sample from a normal distribution. If this is not true, the conclusions may well be invalidated, and it is then better to use Spearman's coefficient of rank correlation (for non-parametric data). See Campbell & Machin (1999) appendix A12 for calculations and more discussion of this.
It is interesting to note that with larger samples, a low strength of correlation, for example r = 0.3, can be highly
statistically significant (ie p < 0.01). However, is this an indication of a meaningful strength of association?
NB Just because two variables are related, it does not necessarily mean that one directly causes the other!

Worked example
Nine students held their breath, once after breathing normally and relaxing for one minute, and once after
hyperventilating for one minute. The table indicates how long (in sec) they were able to hold their breath. Is there an
association between the two variables?

Subject     A   B   C   D   E   F   G    H   I
Normal     56  56  65  65  50  25  87   44  35
Hypervent  87  91  85  91  75  28  122  66  58

The chart shows the scatter plot (drawn in MS Excel) of the data, indicating the reasonableness of assuming a linear
association between the variables.
Hyperventilating times are considered to be the dependent variable, so are plotted on the vertical axis.
Output from SPSS and Minitab are shown below:
SPSS
Select Analyze > Correlate > Bivariate

Minitab
Correlations: Normal, Hypervent
Pearson correlation of Normal and Hypervent = 0.966
P-Value = 0.000
In conclusion, the printouts indicate that the strength of association between the variables is very high (r = 0.966), and that the correlation coefficient is very highly significantly different from zero (P < 0.001). Also, we can say that 93% (0.966²) of the variation in hyperventilating times is explained by normal breathing times.
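
Here is a minimal Python sketch (assuming SciPy is installed) that reproduces the breath-holding correlation from the table above:

# A minimal sketch (assuming SciPy is available) that reproduces the breath-holding
# correlation; the two lists are the times from the table above.
from scipy.stats import pearsonr

normal    = [56, 56, 65, 65, 50, 25, 87, 44, 35]
hypervent = [87, 91, 85, 91, 75, 28, 122, 66, 58]

r, p_value = pearsonr(normal, hypervent)
print(round(r, 3))        # ≈ 0.966
print(round(r ** 2, 2))   # ≈ 0.93, the proportion of variation explained
print(p_value < 0.001)    # True: the correlation is significantly different from zero
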
Regression
Regression computations are usually handled by a software package or a graphing calculator. For this example,
however, we will do the computations "manually", since the gory details have educational value.

Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The
Statistics Department has three questions.

 What linear regression equation best predicts statistics performance, based on math aptitude scores?
 If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
 How well does the regression equation fit the data?

 How to Find the Regression Equation


 In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column shows statistics
grades. The last two columns show deviation scores, the difference between the student's score and the
average score on each test. The last two rows show sums and mean scores that we will use to conduct the
regression analysis.

Student   xi   yi   (xi - x̄)   (yi - ȳ)
1         95   85    17          8
2         85   95     7         18
3         80   70     2         -7
4         70   65    -8        -12
5         60   70   -18         -7
Sum      390  385
Mean      78   77

 And for each student, we also need to compute the squares of the deviation scores (the last two columns in
the table below).

Student   xi   yi   (xi - x̄)²   (yi - ȳ)²
1         95   85    289          64
2         85   95     49         324
3         80   70      4          49
4         70   65     64         144
5         60   70    324          49
Sum      390  385    730         630
Mean      78   77

 And finally, for each student, we need to compute the product of the deviation scores.

Student   xi   yi   (xi - x̄)(yi - ȳ)
1         95   85    136
2         85   95    126
3         80   70    -14
4         70   65     96
5         60   70    126
Sum      390  385    470
Mean      78   77

 The regression equation is a linear equation of the form: ŷ = b0 + b1x . To conduct a regression analysis, we
need to solve for b0 and b1. Computations are shown below. Notice that all of our inputs for the regression
analysis come from the above three tables.
 First, we solve for the regression coefficient (b1):
 b1 = Σ [ (xi - x̄)(yi - ȳ) ] / Σ [ (xi - x̄)² ]
 b1 = 470/730
 b1 = 0.644
 Once we know the value of the slope (b1), we can solve for the intercept (b0):
 b0 = ȳ - b1 * x̄
 b0 = 77 - (0.644)(78)
 b0 = 26.768
 Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
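
As a check on the arithmetic above, here is a minimal Python sketch (assuming NumPy is installed) that computes the slope and intercept from the same deviation scores:

# A minimal sketch (assuming NumPy is available) that reproduces the slope and intercept
# for the aptitude example using the same deviation-score formulas as above.
import numpy as np

x = np.array([95, 85, 80, 70, 60])   # math aptitude scores
y = np.array([85, 95, 70, 65, 70])   # statistics grades

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(round(b1, 3), round(b0, 3))    # ≈ 0.644 and ≈ 26.781 (26.768 above comes from rounding b1 first)
print(round(b0 + b1 * 80, 3))        # ≈ 78.288, the predicted grade for an aptitude score of 80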

 How to Use the Regression Equation

 Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x),
perform the computation, and you have an estimated value (ŷ) for the dependent variable.

 In our example, the independent variable is the student's score on the aptitude test. The dependent variable
is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade
(ŷ) would be:
 ŷ = b0 + b1x

 ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80

 ŷ = 26.768 + 51.52 = 78.288

 Warning: When you use a regression equation, do not use values for the independent variable that are
outside the range of values used to create the equation. That is called extrapolation, and it can produce
unreasonable estimates.

 In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95.
Therefore, only use values inside that range to estimate statistics grades. Using values outside that range
(less than 60 or greater than 95) is problematic.

 How to Find the Coefficient of Determination


 Whenever you use a regression equation, you should ask how well the equation fits the data. One way to
assess fit is to check the coefficient of determination, which can be computed from the following formula.
 R² = { (1/N) * Σ [ (xi - x̄)(yi - ȳ) ] / (σx * σy) }²
 where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.
 Computations for the sample problem of this lesson are shown below. We begin by computing the standard
deviation of x (σx):
 σx = sqrt [ Σ ( xi - x̄ )² / N ]
 σx = sqrt( 730/5 ) = sqrt(146) = 12.083
 Next, we find the standard deviation of y, (σy):
 σy = sqrt [ Σ ( yi - ȳ )² / N ]
 σy = sqrt( 630/5 ) = sqrt(126) = 11.225
 And finally, we compute the coefficient of determination (R2):
 R² = { (1/N) * Σ [ (xi - x̄)(yi - ȳ) ] / (σx * σy) }²
 R² = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]²
 R² = ( 94 / 135.632 )² = ( 0.693 )² = 0.48
 A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). This would be considered a good fit to the data, in the sense that it would substantially improve an educator's ability to predict student performance in statistics class.
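
Here is a minimal Python sketch (assuming NumPy is installed) that reproduces this R² calculation and confirms it equals the square of the correlation coefficient:

# A minimal sketch (assuming NumPy is available) that computes the coefficient of
# determination two ways for the aptitude data: from the formula above, and as the
# square of the Pearson correlation coefficient.
import numpy as np

x = np.array([95, 85, 80, 70, 60])
y = np.array([85, 95, 70, 65, 70])
n = len(x)

sx = np.sqrt(np.sum((x - x.mean()) ** 2) / n)   # population standard deviation of x
sy = np.sqrt(np.sum((y - y.mean()) ** 2) / n)   # population standard deviation of y
r2 = ((1 / n) * np.sum((x - x.mean()) * (y - y.mean())) / (sx * sy)) ** 2
print(round(r2, 2))                             # ≈ 0.48

print(round(np.corrcoef(x, y)[0, 1] ** 2, 2))   # ≈ 0.48 as well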
