Professional Documents
Culture Documents
Estimation and Hypothesis Testing II: Data Analysis Using R 1 / 25
Estimation and Hypothesis Testing II: Data Analysis Using R 1 / 25
4 ANOVA
5 Kruskal-Wallis test
In R: wilcox.test(x, y, alternative=”two.sided”,”less”,”greater”,
paired=T,F)
Example
A study is designed in order to test the efficiency of vitamin E supplements
for Alzeihmer disease prevention. 20 subjects aged over 65 are randomly
distributed to two groups. The first group (10 pleople) receives 400Ul/day
of vitamin E, while the second group receives a placebo treatment. The
initial vitamin E levels are measured:
Group 1 : 7.5; 12.6; 3.8; 20.2; 6.8; 403.3; 2.9; 7.2; 10.5; 205.4
Group 2 : 8.2; 13.3; 102.0; 12.7; 6.3; 4.8; 19.5; 8.3; 407.1; 10.2
Test if there are a significant difference between the two groups initially.
Hypothesis:
H0 : p = p 0
6 p0
Ha : p =
p < p0
p > p0
The test statistic:
p̂ − p0
Z=q
p0 (1−p0 )
n
Example
We study the effect of birth weight on the cognitive abilities of babies. In
order to do this the IQ score of 33 randomly chosen children that were
underweight at birth (< 1500gr ) is measured and it is found that 8 of
them have a score less than 70. In normal children, this proportion is
3.2%. Test if the low birth weight has a significant effect on the cognitive
abilities of children.
Confidence intervals for proportions:
The (1 − α)% confidence interval for the proportion of a population is
r r
p̂(1 − p̂) p̂(1 − p̂)
(p̂ − zα/2 · , p̂ + zα/2 · ),
n n
where p̂ = x/n, x number of succes in the sample, n is the sample size,
zα/2 is the corresponding quantile of the standard normal distribution.
Example
Compute the 95% and 99% confidence intervals for the proportion of
underweight newborns that have an IQ score less than 70.
Hypothesis:
H0 : p = p 0
6 p0
Ha : p =
p < p0
p > p0
Test statistic:
(O1 − E1 )2 (O2 − E2 )2
χ2 = +
E1 E2
has a χ2 (1) distribution.
In R: prop.test(x, n, p, alternative = ”two.sided”/”less”/”greater ”) where
x is the number of individuals in the sample exhibiting the studied
characteristic
n is the sample size
Example
The breast cancer incidence rate in the US female population in the 50-54
group age is aproximately 2%. We would like to test if the proportion of
breast cancer patients whose mothers where disgnosed with the same
condition is larger than that of the general population. A random sample
of 10000 women is chosen in the age group 50-54 whose mothers had
breast cancer at some point in their lives and we observe that 400 of the
women also have this condition. At the 0.05 level of significance, what is
your conclusion?
Hypothesis:
H0 : p1 = p2 = . . . = pk
Ha : at least one proportions is different
Test statistic:
k
X (Oi − Ei )2
χ2 =
Ei
i=1
has a χ2 (k − 1) distribution.
In R: prop.test(x, n, alternative = ”two.sided”/”less”/”greater ”) where
x is the vector containing the number of individuals in each sample
exhibiting the studied characteristic
n is the vector of sample sizes
Example
A study is conducted in order to analyse the effects of oral contraception
(OC) on heart conditions of women aged 40-44. The researchers find that
among 5000 women which participated in the study and took OC, 13
suffered a heart attack in the next 3 years, while among 10000 women
that did not take OC, 7 suffered a heart attack in the next 3 years. What
is the conclusion of the study (α = 0.05)?
Ipotezele statistice:
H0 : p1 = p2
Ha : p1 6 p2
=
p1 < p2
p1 > p2
has approximately the N(0, 1) distribution, where p̂1 = nx11 , p̂2 = nx22
x1 +x2
(proportion of succes in the two samples), n1 , n2 sample sizes, p̂ = n1 +n2 .
Example
A study is conducted to find out the factors that help spread tuberculosis among
drug users. Two samples of 97 drug users that admittedly shared needles and 161
that didn’t are chosen and tested. The reports show that 34 individuals in the
first group and 28 in the second have tuberculosis. Test whether the proportion of
TBC infected people is larger among drug users that share needles, with a 0.05
level of significance.
Ha : σ 2 =
6 σ02 → W = (0, χ2α/2 ) ∪ (χ21−α/2 , ∞)
σ 2 < σ02 → W = (0, χ2α/2 )
σ 2 > σ02 → W = (χ21−α/2 , ∞)
Example
Test whether the variance of the data in the ”weight” variable of the
”PlantGrowth” data frame is equal to 0.5 of larger, α = 0.05.
(n − 1)s 2 2 (n − 1)s 2
≤ σ ≤ (varianţă)
χ21−α/2 χ2α/2
s s
(n − 1)s 2 (n − 1)s 2
≤σ≤
χ21−α/2 χ2α/2
Example
Compute the 95% confidence interval for the variance of the weight of
plants in the ”PlantGrowth” data frame.
Ha : σ12 =
6 σ22 → W = (0, Fα/2 ) ∪ (F1−α/2 , ∞)
σ1 < σ22 → W = (0, Fα/2 )
2
Example
The ”ToothGrowth” data frame contains the length of odontoblasts (cells
responsible for tooth growth) in 60 guinea pigs. Each animal received one
of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two
delivery methods, orange juice or ascorbic acid (a form of vitamin C and
coded as VC). Test if the variance of the length of cells is the same in the
two groups determined by the delivery method. Test if the standard
deviation of the length of cells is the same depending on the dose given
(Bartlett test).
For the non-normal populations and for comparing more then two groups,
there is the Levene test.
In R: leveneTest(var∼groups) in the ”car” package.
In R: summary(aov(var∼group))
Example
The data set ”WeightLoss” in the ”car” package contains data about the
weight loss of 34 subjects included in a study, belonging to three groups -
control, diet and diet+sport. We would like to compare the weight loss
after two months (”wl2” variable) between these groups.
Notations:
xij is the jth observation (value) in group i
ni the number of observations in group i
x̄i the mean for group i
The ANOVA model:
xij = µ + αi + εij
where
Data Analysis Using R 21 / 25
ANOVA II
µ is a constant
Pk
αi is a constant specific to the ith group; αi =0
εij is an error term with the N(0, σ 2 ) distribution.
Equivalent hypethesis:
H0 : α1 = α2 = . . . = αk = 0
Ha : there exists an αi 6= 0
We define:
between groups variance:
k
X
SSB = ni (x̄i − x̄)2
i=1
total variance:
ni
k X
X
SST = (xij − x̄)2
i=1 j=1
We have that
SST = SSB + SSW
The principle behind ANOVA: if the means of the groups are significantly
different, the SSB will be larger than SSW .
The test statistic:
SSB/(k − 1)
F = .
SSW /(n − k)
Data Analysis Using R 23 / 25
ANOVA IV
If the assumptions ii) or iii) of the ANOVA test are not met, then we use
the Kruskal-Wallis test instead.
In R: kruskal.test(var∼group)