Statistics
Chapter 7: Tests
Pierre Alquier
x                      1      2      3      4
frequency fx = nx/n    0.27   0.28   0.21   0.24
p(x)                   0.25   0.25   0.25   0.25
fx − p(x)              0.02   0.03   −0.04  −0.01
We could measure the distance between the frequencies and the probabilities by:

Σ_{x=1}^{4} (fx − p(x))² = 0.02² + 0.03² + (−0.04)² + (−0.01)² = 0.003.
It appears that it is better to weight the terms (fx − p(x))², that is, we put:
T = n Σ_{x=1}^{4} (fx − p(x))²/p(x) = 100 × [0.02²/0.25 + 0.03²/0.25 + (−0.04)²/0.25 + (−0.01)²/0.25] = 1.2.
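This computation is easy to check in R (a quick sketch; the vectors fx and p are copied from the table above):

```r
# frequencies and hypothesized probabilities from the table above
fx <- c(0.27, 0.28, 0.21, 0.24)
p  <- rep(0.25, 4)
n  <- 100
T  <- n * sum((fx - p)^2 / p)  # weighted chi-square statistic
T                              # 1.2
```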
[Figure: histogram of simulated values of T under H0, produced by hist(Tdist, breaks=30)]
Example 1:

x                      1      2      3      4
frequency fx = nx/n    0.27   0.28   0.21   0.24
p(x)                   0.25   0.25   0.25   0.25
fx − p(x)              0.02   0.03   −0.04  −0.01

T = n Σ_{x=1}^{4} (fx − p(x))²/p(x) = 100 × [0.02²/0.25 + 0.03²/0.25 + (−0.04)²/0.25 + (−0.01)²/0.25] = 1.2.
Example 2:

x                      1      2      3      4
frequency fx = nx/n    0.29   0.36   0.20   0.15
p(x)                   0.25   0.25   0.25   0.25
fx − p(x)              0.04   0.11   −0.05  −0.10

T = n Σ_{x=1}^{4} (fx − p(x))²/p(x) = 100 × [0.04²/0.25 + 0.11²/0.25 + (−0.05)²/0.25 + (−0.10)²/0.25] = 10.48.
In other words: the data in Example 1 look quite typical of a uniform distribution on {1, 2, 3, 4}, while the data in Example 2 do not.
x                  1      . . .   K
freq. fx = nx/n    f1     . . .   fK
p(x)               p(1)   . . .   p(K)
Theorem
If H0 is true, then as n → ∞, the distribution of T converges to the χ²(K − 1) distribution (the chi-square distribution with K − 1 degrees of freedom).
Heuristically, by the central limit theorem,

fk ≃ N(p(k), Var(fk)),

and thus

fk − p(k) ≃ N(0, Var(fk)).

Recall from Chapter 3 that a χ²(K − 1) random variable is defined as a sum of squares of Gaussian random variables...
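The theorem can also be checked by simulation. A sketch, assuming H0 is the uniform distribution on {1, 2, 3, 4} and n = 100 (Tdist collects simulated values of T under H0):

```r
# Simulate the distribution of T under H0 (uniform on {1,2,3,4}, n = 100)
set.seed(1)
p     <- rep(0.25, 4)
Tdist <- replicate(10000, {
  x  <- sample(1:4, 100, replace = TRUE)   # a sample drawn under H0
  fx <- tabulate(x, nbins = 4) / 100       # observed frequencies
  100 * sum((fx - p)^2 / p)                # the statistic T
})
# the proportion of T values above the 0.95 quantile of chi^2(3)
# should be close to 5%
mean(Tdist > qchisq(0.95, 3))
```

Drawing hist(Tdist, breaks=30) then shows a histogram close to the χ²(3) density.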
Let q be the quantile of order 0.95 of the χ²(K − 1) distribution, so that P(T ≤ q) = 0.95.
Test procedure
Compute T .
If T > q, reject H0 .
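The whole procedure can be sketched in R from the raw counts (here the counts nx of Example 1, at level 0.95):

```r
nx <- c(27, 28, 21, 24)                 # observed counts
n  <- sum(nx)                           # sample size (100)
p  <- rep(0.25, 4)                      # probabilities under H0
T  <- n * sum((nx / n - p)^2 / p)       # test statistic
q  <- qchisq(0.95, df = length(p) - 1)  # 0.95 quantile of chi^2(K - 1)
T > q                                   # FALSE: H0 is not rejected
```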
Example 1:

T = n Σ_{x=1}^{4} (fx − p(x))²/p(x) = 100 × [0.02²/0.25 + 0.03²/0.25 + (−0.04)²/0.25 + (−0.01)²/0.25] = 1.2.
Example 2:

T = n Σ_{x=1}^{4} (fx − p(x))²/p(x) = 100 × [0.04²/0.25 + 0.11²/0.25 + (−0.05)²/0.25 + (−0.10)²/0.25] = 10.48.
> qchisq(0.95,3)
[1] 7.814728

So q = 7.8147: in Example 1, T = 1.2 ≤ q and H0 is not rejected; in Example 2, T = 10.48 > q and H0 is rejected.
The corresponding p-values, with T1 = 1.2 and T2 = 10.48:

> p1 = 1-pchisq(T1,3)
> p1
[1] 0.7530043
> p2 = 1-pchisq(T2,3)
> p2
[1] 0.01489718
Another session: testing the adequation of a sample x to a fixed distribution whose probabilities are stored in hypo, with K = 5 cells (the last two values of x are grouped into one cell):

> fx = table(x)[1:5]/n
> fx[5] = sum(table(x)[5:6]/n)
> fx
[1] 0.22 0.36 0.27 0.10 0.05
> T = n*sum((hypo-fx)^2/hypo)
> T
[1] 17.49376
> qchisq(0.95,4)
[1] 9.487729

Since T = 17.49 > 9.49, the hypothesized distribution is rejected.
In R, the Kolmogorov-Smirnov test is performed with ks.test(x, p, a), where
x is the sample,
p is the CDF of the distribution you want to test the adequation to (e.g. pnorm, pexp),
a is the parameter(s) of p.
> ks.test(y,pnorm,0,1)
One-sample Kolmogorov-Smirnov test
data: y
D = 0.50093, p-value < 2.2e-16
alternative hypothesis: two-sided
> ks.test(y,pexp,2)
One-sample Kolmogorov-Smirnov test
data: y
D = 0.085252, p-value = 0.4615
alternative hypothesis: two-sided
> ks.test(y,pexp,2.5)
One-sample Kolmogorov-Smirnov test
data: y
D = 0.16583, p-value = 0.008172
alternative hypothesis: two-sided

Thus the sample y is compatible with the Exp(2) distribution (p-value 0.4615), while the hypotheses N(0, 1) and Exp(2.5) are both rejected at level 5%.
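Sessions of this kind can be reproduced with simulated data (a sketch; y is drawn from Exp(2), so only the wrong hypotheses should be rejected):

```r
set.seed(42)
y <- rexp(200, rate = 2)         # a sample that is truly Exp(2)
ks.test(y, pexp, 2)$p.value      # typically large: do not reject Exp(2)
ks.test(y, pnorm, 0, 1)$p.value  # tiny: reject N(0, 1)
```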
3 Test of independence
Example :
In the previous section, we tested the adequation of the
following sample to P(1).
> x
[1] 3 2 0 2 2 1 0 0 1 1 2 3 1 2 2 1 1 0 2 0 0 1 1 2 0 4 0 1 1 2 3 1 1 5
0 2 2 2 1 0 1 0 2 1 2 0 0 1 1 3 2 0 0 1 1 2 2 2 1 1 1 1 2 0 1 0 1 0
2 3 2 1 2 3 3 3 2 1 4 4 0 1 1 4 1 1 3 2 1 1 2 1 3 1 2 0 0 1 0 2
T = 0.8278685
Let us compare T to the quantile q of order 0.95 of χ²(K − 1 − d); here K = 5 and d = 1 (one estimated parameter).
> qchisq(0.95,3)
[1] 7.814728
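As a sketch, the whole adequation-to-a-model procedure for the Poisson model can be written as follows (illustrative simulated sample; the K = 5 cells {0}, {1}, {2}, {3}, {4 or more} and the variable names are assumptions, not the lecture's exact session):

```r
set.seed(7)
x      <- rpois(100, 1)                    # illustrative sample
n      <- length(x)
lambda <- mean(x)                          # estimated Poisson parameter (d = 1)
# cell probabilities under P(lambda): {0}, {1}, {2}, {3}, {4 or more}
p      <- c(dpois(0:3, lambda), 1 - ppois(3, lambda))
nx     <- c(sum(x == 0), sum(x == 1), sum(x == 2), sum(x == 3), sum(x >= 4))
T      <- n * sum((nx / n - p)^2 / p)      # test statistic
q      <- qchisq(0.95, df = 5 - 1 - 1)     # quantile of chi^2(K - 1 - d)
T > q                                      # TRUE would mean: reject the model
```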
Example: n = 10000 individuals, classified as V or non-V, and by flu severity:

             V      non-V
no flu       4850   4741
mild flu     145    225
severe flu   5      34
General case
The probability distribution is not constrained:

             V                    non-V
no flu       p(V, no flu)         p(non-V, no flu)
mild flu     p(V, mild flu)       p(non-V, mild flu)
severe flu   p(V, severe flu)     p(non-V, severe flu)
Independent case

             V                    non-V
no flu       p(V)p(no flu)        p(non-V)p(no flu)
mild flu     p(V)p(mild flu)      p(non-V)p(mild flu)
severe flu   p(V)p(severe flu)    p(non-V)p(severe flu)
Writing α = p(V), β = p(no flu) and γ = p(mild flu), this becomes:

             V                 non-V
no flu       αβ                (1 − α)β
mild flu     αγ                (1 − α)γ
severe flu   α(1 − β − γ)      (1 − α)(1 − β − γ)
Observed frequencies

             V        non-V
no flu       0.4850   0.4741     β̂ = 0.4850 + 0.4741 = 0.9591
mild flu     0.0145   0.0225     γ̂ = 0.0145 + 0.0225 = 0.0370
severe flu   0.0005   0.0034

α̂ = 0.4850 + 0.0145 + 0.0005 = 0.5
Plugging these estimates into the chi-square statistic gives T = 40.1.
T is to be compared to the quantile q of order 0.95 of χ²(K − 1 − d), where K = 6 (cells) and d = 3 (the estimated parameter θ = (α, β, γ) has 3 entries).
> qchisq(0.95,2)
[1] 5.991465

Since T = 40.1 > 5.99, independence is rejected.
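The value T = 40.1 can be recovered by hand in R (a sketch; expected counts under independence are products of the estimated marginals):

```r
# observed counts: rows = no flu / mild flu / severe flu, cols = V / non-V
O <- matrix(c(4850, 145, 5, 4741, 225, 34), nrow = 3)
n <- sum(O)                             # 10000
E <- outer(rowSums(O), colSums(O)) / n  # expected counts under independence
T <- sum((O - E)^2 / E)                 # about 40.1
T > qchisq(0.95, 2)                     # TRUE: reject independence
```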
> M = matrix(data=c(4850,145,5,4741,225,34),3,2)
> M
[,1] [,2]
[1,] 4850 4741
[2,] 145 225
[3,] 5 34
> chisq.test(M)
        Pearson's Chi-squared test
data: M
X-squared = 40.1, df = 2, p-value = 1.96e-09
Exercise 3:
Load the data from the file Credit.csv.
What are the variables in the dataset?
You can obtain the contingency table of two variables x and y using table(x,y). Show the contingency table of the variables Married and Student.
Test the independence of these two variables.
Show the contingency table of the variables Cards and Gender. Test the independence of these two variables. What do you observe? How can you fix this?
A nice way to visualize contingency tables is to use the gplots library. First, install this library. Then, try the following:
> library(gplots)
> balloonplot(table(cr$Cards,cr$Student))