ANOVA is a statistical technique that assesses potential differences in a scale-level dependent variable by a
nominal-level variable having two or more categories. For example, an ANOVA can examine potential differences in IQ
scores by country (US vs. Canada vs. Italy vs. Spain). ANOVA, developed by Ronald Fisher in 1918, extends the t and
z tests, which have the problem of only allowing the nominal-level variable to have two categories. This test is also
called the Fisher analysis of variance.
ANOVA Example
Calcium is an essential mineral that regulates the heart, is important for blood clotting and for building healthy bones. The
National Osteoporosis Foundation recommends a daily calcium intake of 1000-1200 mg/day for adult men and women. While
calcium is contained in some foods, most adults do not get enough calcium in their diets and so take supplements.
Unfortunately, some of the supplements have side effects, such as gastric distress, making them difficult for some
patients to take on a regular basis.
A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults
with osteopenia (a low bone density which may lead to osteoporosis) and adults with osteoporosis. Adults 60 years of age with
normal bone density, osteopenia and osteoporosis are selected at random from hospital records and invited to participate in the
study. Each participant's daily calcium intake is measured based on reported food intake and supplements. The data are shown
below.
Is there a statistically significant difference in mean calcium intake in patients with normal bone density as compared to patients
with osteopenia and osteoporosis? We will run the ANOVA using the five-step approach.
Step 1. Set up hypotheses and determine the level of significance.
H0: μ1 = μ2 = μ3
H1: Means are not all equal
α = 0.05
Step 2. Select the appropriate test statistic and set up the decision rule. In order to determine the critical value of F
we need the degrees of freedom, df1 = k - 1 and df2 = N - k. In this example, df1 = k - 1 = 3 - 1 = 2 and
df2 = N - k = 18 - 3 = 15. The critical value is 3.68 and the decision rule is as follows: Reject H0 if F > 3.68.
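The critical value can be reproduced in Python with SciPy (a sketch; `scipy.stats.f.ppf` returns the quantile of the F distribution):

```python
from scipy.stats import f

# Degrees of freedom from the example: df1 = k - 1 = 2, df2 = N - k = 15
df1, df2 = 2, 15
alpha = 0.05

# Critical value: the point with 5% of the F(2, 15) distribution above it
f_crit = f.ppf(1 - alpha, df1, df2)
print(round(f_crit, 2))  # 3.68
```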
Steps 3 and 4. Compute the test statistic. The test statistic is F = MSB/MSE, where MSB = SSB/df1 and
MSE = SSE/df2. SSB is found from the squared differences between each group mean and the overall mean, weighted by the
group sample sizes. SSE requires computing the squared differences between each observation and its group mean; we
compute SSE in parts, one part each for the participants with normal bone density, the participants with osteopenia,
and the participants with osteoporosis, and then sum them. Carrying these computations through for the sample data
gives F = 1.395.
Step 5. Conclusion.
We do not reject H0 because 1.395 < 3.68. We do not have statistically significant evidence at α = 0.05 to show that there is a
difference in mean calcium intake in patients with normal bone density as compared to patients with osteopenia and
osteoporosis. Are the differences in mean calcium intake clinically meaningful? If so, what might account for the lack
of statistical significance?
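In practice this kind of one-way ANOVA is usually run with software. Here is a minimal sketch using `scipy.stats.f_oneway`; since the study's data table is not reproduced above, the intake values below are hypothetical, for illustration only:

```python
from scipy.stats import f_oneway

# Hypothetical daily calcium intakes (mg) for three groups of six adults.
# Illustrative values only, not the study's actual data.
normal = [1200, 1000, 980, 900, 750, 800]
osteopenia = [1000, 1100, 700, 800, 500, 700]
osteoporosis = [890, 650, 1100, 900, 400, 350]

# One-way ANOVA: returns the F statistic and its p-value
f_stat, p_value = f_oneway(normal, osteopenia, osteoporosis)
print(f_stat, p_value)

# Decision rule: reject H0 only if the F statistic exceeds the critical value
```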
Chi-Square Statistic
The chi-square statistic is calculated as χ²c = Σ [ (O - E)² / E ]. The subscript "c" gives the degrees of freedom, "O" is your
observed value and "E" is your expected value. It's very rare that you'll want to actually use this formula to find a critical
chi-square value by hand. The summation symbol means that you'll have to perform a calculation for every single data item in
your data set. As you can probably imagine, the calculations can get very lengthy and tedious. Instead, you'll probably want
to use technology:
Chi Square Test in SPSS.
Chi Square P-Value in Excel.
A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of
variables: numerical (countable) variables and non-numerical (categorical) variables. The chi-squared statistic is a single number
that tells you how much difference exists between your observed counts and the counts you would expect if there were no
relationship at all in the population.
There are a few variations on the chi-square statistic. Which one you use depends upon how you collected the data and which
hypothesis is being tested. However, all of the variations use the same idea, which is that you are comparing your expected
values with the values you actually collect. One of the most common forms can be used for contingency tables:
χ² = Σ [ (Oi - Ei)² / Ei ]
where O is the observed value, E is the expected value and "i" is the "ith" position in the contingency table.
A low value for chi-square means your observed counts are close to your expected counts. In theory, if your observed and
expected values were equal ("no difference") then chi-square would be zero, an event that is unlikely to happen in real life.
Deciding whether a chi-square test statistic is large enough to indicate a statistically significant difference isn’t as easy as it seems.
It would be nice if we could say a chi-square test statistic >10 means a difference, but unfortunately that isn’t the case.
You could take your calculated chi-square value and compare it to a critical value from a chi-square table. If the chi-square value
is more than the critical value, then there is a significant difference.
You could also use a p-value. First state the null hypothesis and the alternate hypothesis. Then generate a chi-square curve for
your results along with a p-value (See: Calculate a chi-square p-value Excel). Small p-values (under 5%) usually indicate that
the difference is significant.
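Both approaches can also be sketched in Python with SciPy (`chi2.ppf` for the table's critical value, `chi2.sf` for the p-value); the test statistic and degrees of freedom below are illustrative:

```python
from scipy.stats import chi2

df = 11        # e.g. 12 categories - 1
alpha = 0.05
chi_sq = 5.09  # an illustrative test statistic

# Critical value you would otherwise look up in a chi-square table
crit = chi2.ppf(1 - alpha, df)

# p-value: area of the chi-square curve to the right of the statistic
p = chi2.sf(chi_sq, df)

print(round(crit, 3))  # 19.675
print(p > alpha)       # True -> fail to reject the null hypothesis
```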
Tip: The chi-square statistic can only be used on counts. It can’t be used for percentages, proportions, means or similar
statistical values. For example, if you have 10 percent of 200 people, you would need to convert that to a count (20) before you
can run the test statistic.
The chi-square distribution (also called the chi-squared distribution) is a special case of the gamma distribution: a chi-square
distribution with n degrees of freedom is equal to a gamma distribution with shape α = n / 2 and rate β = 1/2 (equivalently,
scale θ = 2).
Let’s say you have random samples taken from a standard normal distribution. The chi-square distribution is the distribution of
the sum of these random samples squared. The degrees of freedom (k) are equal to the number of samples being summed. For
example, if you have taken 10 samples from the standard normal distribution, then df = 10. The degrees of freedom of a
chi-square distribution also equal its mean; in this example, the mean of this particular distribution will be 10. Chi-square
distributions are always right skewed. However, the greater the degrees of freedom, the more the chi-square distribution looks
like a normal distribution.
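The sum-of-squared-normals definition is easy to check by simulation. A sketch with NumPy (the number of draws is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
df = 10            # sum df = 10 squared standard normals per draw
n_draws = 100_000

# Each row: 10 samples from the standard normal; sum their squares
z = rng.standard_normal((n_draws, df))
chi_sq_draws = (z ** 2).sum(axis=1)

# The mean of a chi-square distribution equals its degrees of freedom,
# so the sample mean should land very close to 10
print(chi_sq_draws.mean())
```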
Uses
The chi-squared distribution has many uses in statistics, including:
Confidence interval estimation for a population standard deviation of a normal distribution from a sample standard
deviation.
Independence of two criteria of classification of qualitative variables.
Relationships between categorical variables (contingency tables).
Sample variance study when the underlying distribution is normal.
Tests of deviations of differences between expected and observed frequencies (one-way tables).
The chi-square test (a goodness of fit test).
Chi Distribution
A similar distribution is the chi distribution, which describes the square root of a variable distributed according to a
chi-square distribution. With df = n > 0 degrees of freedom, it has the probability density function:
f(x) = 2^(1 - n/2) x^(n - 1) e^(-x²/2) / Γ(n/2)
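The density above can be written out directly and checked against SciPy's `scipy.stats.chi` (a sketch):

```python
import math
from scipy.stats import chi

def chi_pdf(x, n):
    """Density of the chi distribution with n degrees of freedom:
    2^(1 - n/2) * x^(n - 1) * exp(-x^2 / 2) / gamma(n / 2)."""
    return (2 ** (1 - n / 2) * x ** (n - 1)
            * math.exp(-x ** 2 / 2) / math.gamma(n / 2))

# Compare the hand-written density with SciPy's at a few points (df = 3)
for x in (0.5, 1.0, 2.0):
    assert abs(chi_pdf(x, 3) - chi.pdf(x, 3)) < 1e-9
print("densities agree")
```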
The chi-square formula is a difficult formula to deal with. That’s mostly because you’re expected to add up a
large number of values. The easiest way to solve the formula is by making a table.
Sample question: 256 visual artists were surveyed to find out their zodiac sign. The results were: Aries
(29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23),
Capricorn (18), Aquarius (20), Pisces (23). Test the hypothesis that zodiac signs are evenly distributed
across visual artists.
Step 1: Make a table with columns for “Categories,” “Observed,” “Expected,” “Residual (Obs - Exp),”
“(Obs - Exp)²” and “Component (Obs - Exp)² / Exp.” Don’t worry what these mean right now; we’ll cover that in the
following steps.
Step 2: Fill in your categories. Categories should be given to you in the question. There are 12 zodiac signs,
so:
Step 3: Write your counts. Counts are the number of items in each category; write them in column 2. You’re
given the counts in the question:
Step 4: Calculate your expected value for column 3. In this question, we would expect the 12 zodiac signs
to be evenly distributed for all 256 people, so 256/12=21.333. Write this in column 3.
Step 5: Subtract the expected value (Step 4) from the Observed value (Step 3) and place the result in the
“Residual” column. For example, the first row is Aries: 29-21.333=7.667.
Step 6: Square your results from Step 5 and place the amounts in the (Obs - Exp)² column.
Step 7: Divide the amounts in Step 6 by the expected value (Step 4) and place those results in the final
column.
Step 8: Add up (sum) all the values in the last column. This sum is the chi-square statistic; here it works out to 5.094.
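The table in Steps 1-8 can be sketched in a few lines of Python, and `scipy.stats.chisquare` performs the same computation directly (with equal expected counts assumed when none are given):

```python
from scipy.stats import chisquare

# Observed counts for the 12 zodiac signs, in order from the question
observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]
expected = sum(observed) / len(observed)  # 256 / 12 = 21.333...

# Steps 5-8: sum (Obs - Exp)^2 / Exp over all 12 categories
chi_sq = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_sq, 3))  # 5.094

# scipy.stats.chisquare defaults to equal expected frequencies
stat, p = chisquare(observed)
print(round(stat, 3))    # 5.094
```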
Positive correlation indicates that both variables increase or decrease together, whereas negative correlation
indicates that as one variable increases, the other decreases, and vice versa.
Tip: the square of the correlation coefficient indicates the proportion of variation of one variable ‘explained’ by the
other (see Campbell & Machin, 1999 for more details).
Statistical significance of r
The t-test is used to establish whether the correlation coefficient is significantly different from zero, and hence
whether there is evidence of an association between the two variables. The underlying assumption is that the data are
randomly sampled from a normal distribution. If this is not true, the conclusions may well be invalidated, and it is
better to use Spearman's coefficient of rank correlation (a non-parametric method). See Campbell & Machin (1999),
appendix A12, for calculations and more discussion of this.
It is interesting to note that with larger samples, a low strength of correlation, for example r = 0.3, can be highly
statistically significant (ie p < 0.01). However, is this an indication of a meaningful strength of association?
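The large-sample claim can be checked with the usual t statistic for a correlation, t = r√(n - 2) / √(1 - r²). A sketch in Python; the sample size of 100 is illustrative:

```python
import math
from scipy.stats import t as t_dist

r, n = 0.3, 100  # a weak correlation in a fairly large sample

# t statistic for testing H0: the population correlation is zero
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p = 2 * t_dist.sf(abs(t_stat), n - 2)

print(round(t_stat, 2))  # 3.11
print(p < 0.01)          # True: significant despite the low r
```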
NB Just because two variables are related, it does not necessarily mean that one directly causes the other!
Worked example
Nine students held their breath, once after breathing normally and relaxing for one minute, and once after
hyperventilating for one minute. The table indicates how long (in sec) they were able to hold their breath. Is there an
association between the two variables?
Subject A B C D E F G H I
Normal 56 56 65 65 50 25 87 44 35
Hypervent 87 91 85 91 75 28 122 66 58
The chart shows the scatter plot (drawn in MS Excel) of the data, indicating the reasonableness of assuming a linear
association between the variables.
Hyperventilating times are considered to be the dependent variable, so are plotted on the vertical axis.
Output from SPSS and Minitab are shown below:
SPSS
Select Analyze > Correlate > Bivariate
Minitab
Correlations: Normal, Hypervent
Pearson correlation of Normal and Hypervent = 0.966
P-Value = 0.000
In conclusion, the printouts indicate that the strength of association between the variables is very high (r = 0.966),
and that the correlation coefficient is very highly significantly different from zero (P < 0.001). Also, we can say that
93% (0.966²) of the variation in hyperventilating times is explained by normal breathing times.
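The Minitab printout can be reproduced with SciPy's `pearsonr` (a sketch):

```python
from scipy.stats import pearsonr

# Breath-holding times (sec) for the nine subjects A-I
normal = [56, 56, 65, 65, 50, 25, 87, 44, 58 - 23]  # 35 for subject I
hypervent = [87, 91, 85, 91, 75, 28, 122, 66, 58]

r, p = pearsonr(normal, hypervent)
print(round(r, 3))       # 0.966
print(round(r ** 2, 2))  # 0.93 -> 93% of variation explained
print(p < 0.001)         # True
```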
Regression
Regression computations are usually handled by a software package or a graphing calculator. For this example,
however, we will do the computations "manually", since the gory details have educational value.
Problem Statement
Last year, five randomly selected students took a math aptitude test before they began their statistics course. The
Statistics Department has three questions.
What linear regression equation best predicts statistics performance, based on math aptitude scores?
If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
How well does the regression equation fit the data?
For each student, we also need to compute the squares of the deviation scores and the product of the deviation
scores (the last three columns in the table below).

Student  xi   yi   (xi-x)²  (yi-y)²  (xi-x)(yi-y)
1        95   85   289      64       136
2        85   95   49       324      126
3        80   70   4        49       -14
4        70   65   64       144      96
5        60   70   324      49       126
Sum      390  385  730      630      470
Mean     78   77
The regression equation is a linear equation of the form: ŷ = b0 + b1x . To conduct a regression analysis, we
need to solve for b0 and b1. Computations are shown below. Notice that all of our inputs for the regression
analysis come from the table above.
First, we solve for the regression slope (b1):
b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)² ]
b1 = 470/730
b1 = 0.644
Once we know the value of the slope (b1), we can solve for the intercept (b0):
b0 = y - b1 * x
b0 = 77 - (0.644)(78)
b0 = 26.768
Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x),
perform the computation, and you have an estimated value (ŷ) for the dependent variable.
In our example, the independent variable is the student's score on the aptitude test. The dependent variable
is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade
(ŷ) would be:
ŷ = b0 + b1x = 26.768 + (0.644)(80) = 78.288
Warning: When you use a regression equation, do not use values for the independent variable that are
outside the range of values used to create the equation. That is called extrapolation, and it can produce
unreasonable estimates.
In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95.
Therefore, only use values inside that range to estimate statistics grades. Using values outside that range
(less than 60 or greater than 95) is problematic.
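The whole computation, including the prediction for an aptitude score of 80, can be sketched in Python. Note that the worked example rounds b1 to 0.644 before computing b0, so the unrounded intercept below comes out slightly higher (26.78 rather than 26.768):

```python
# Deviation-score computation of the least-squares line y-hat = b0 + b1*x
x = [95, 85, 80, 70, 60]  # aptitude scores
y = [85, 95, 70, 65, 70]  # statistics grades

x_bar = sum(x) / len(x)  # 78
y_bar = sum(y) / len(y)  # 77

# Sum of products of deviations and sum of squared x-deviations
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 470
sxx = sum((xi - x_bar) ** 2 for xi in x)                        # 730

b1 = sxy / sxx           # slope, ~0.644
b0 = y_bar - b1 * x_bar  # intercept, ~26.78

# Predicted statistics grade for an aptitude score of 80
# (80 is inside the 60-95 range, so this is not extrapolation)
y_hat = b0 + b1 * 80
print(round(b1, 3), round(b0, 2), round(y_hat, 1))  # 0.644 26.78 78.3
```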