
Test of adequation to a fixed distribution

Test of adequation to a model


Test of independence

Statistics
Chapter 7: Tests

Pierre Alquier

Pierre Alquier, ESSEC Statistics - Chapter 7



In Chapter 6, we assumed that X1, ..., Xn were i.i.d. from some distribution of the form p(·|θ), and we studied how to estimate θ. Today, we address the question: do the data support this assumption?
1 First, we address a simpler question: given a distribution p(·) and a sample X1, ..., Xn, can we test the hypothesis “p is the distribution of the Xi”? → “test of adequation to a fixed distribution”
    We start with discrete variables,
    then we briefly discuss continuous variables.
2 Second, we address the initial question: is one of the p(·|θ) the distribution of the Xi? → “test of adequation to a model”
    We cover only discrete variables in this course.
3 Finally, we see another application of these tests: testing the independence of two variables.

1 Test of adequation to a fixed distribution

2 Test of adequation to a model

3 Test of independence


1 Test of adequation to a fixed distribution


Let us start with the following question: let X1, ..., Xn be i.i.d. random variables on the finite set {1, ..., K}, and let p(·) be a PMF on {1, ..., K}. How can we test

H0: “the distribution of the Xi’s is p(·)”?


Example 1: Xi ∈ {1, 2, 3, 4}, sample size n = 100, we observe:

    value x                1    2    3    4
    number of obs. n_x    27   28   21   24

Is it compatible with the uniform distribution p(x) = 1/4?

Example 2: Xi ∈ {1, 2, 3, 4}, sample size n = 100, we observe:

    value x                1    2    3    4
    number of obs. n_x    29   36   20   15

Is it compatible with the uniform distribution p(x) = 1/4?

Idea: compare the estimated probabilities to the actual ones.

Example 1:

    x                         1       2       3       4
    frequency f_x = n_x/n     0.27    0.28    0.21    0.24
    p(x)                      0.25    0.25    0.25    0.25
    f_x − p(x)                0.02    0.03   −0.04   −0.01

We could measure the distance between the frequencies and the probabilities by:

\[ \sum_{x=1}^{4} (f_x - p(x))^2 = 0.02^2 + 0.03^2 + (-0.04)^2 + (-0.01)^2 = 0.003. \]

It turns out to be better to weight the terms (f_x − p(x))^2, that is, we put:

\[ T = n \sum_{x=1}^{4} \frac{(f_x - p(x))^2}{p(x)} = 100 \times \left( \frac{0.02^2}{0.25} + \frac{0.03^2}{0.25} + \frac{(-0.04)^2}{0.25} + \frac{(-0.01)^2}{0.25} \right) = 1.2. \]

How do we interpret this value?



Let us sample 25000 times data “as they should be if the hypothesis were true”, that is, Y1, ..., Yn are uniform random variables on {1, 2, 3, 4}. For each sample, we compute the T statistic.
> n = 100                             # sample size
> hypo = c(1/4,1/4,1/4,1/4)           # PMF under H0
> Tdist = numeric(25000)              # will store the simulated values of T
> for (i in 1:25000)
+ {
+   X = sample(x=c(1,2,3,4),size=n,replace=TRUE,prob=hypo)
+   f = table(factor(X,levels=1:4))/n  # frequencies, keeping empty cells
+   T = n*sum((hypo-f)^2/hypo)         # chi-square statistic of this sample
+   Tdist[i] = T
+ }
> hist(Tdist,breaks=30)


Example 1: T = 1.2, as computed above.

Example 2:

    x                         1       2       3       4
    frequency f_x = n_x/n     0.29    0.36    0.20    0.15
    p(x)                      0.25    0.25    0.25    0.25
    f_x − p(x)                0.04    0.11   −0.05   −0.10

\[ T = n \sum_{x=1}^{4} \frac{(f_x - p(x))^2}{p(x)} = 100 \times \left( \frac{0.04^2}{0.25} + \frac{0.11^2}{0.25} + \frac{(-0.05)^2}{0.25} + \frac{(-0.10)^2}{0.25} \right) = 10.48. \]


In other words, compared with the simulated values of T:
    the data in Example 1 are typical of a sample from the uniform distribution on {1, 2, 3, 4};
    the data in Example 2 are not typical of a sample from the uniform distribution.
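The comparison with the simulated distribution can also be made numerically. The following is a minimal sketch, not the course's own code: it redraws the 25000 samples with rmultinom (a vectorised alternative to the loop), and the seed and the names counts and Tsim are arbitrary choices made here. It reports the fraction of simulated samples whose T is at least as large as in each example.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
n    <- 100                        # sample size, as in the examples
hypo <- rep(1/4, 4)                # uniform PMF on {1, 2, 3, 4}

# Each column of 'counts' holds the cell counts of one sample drawn under H0
counts <- rmultinom(25000, size = n, prob = hypo)

# Chi-square statistic of every simulated sample
Tsim <- n * colSums((counts / n - hypo)^2 / hypo)

# Fraction of simulated samples at least as extreme as each example
frac1 <- mean(Tsim >= 1.2)         # Example 1
frac2 <- mean(Tsim >= 10.48)       # Example 2
print(c(frac1, frac2))
```

A large fraction for Example 1 and a small one for Example 2 would match the conclusion stated here.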


In a general setting, assume we observe random variables X1, ..., Xn on {1, ..., K}:

    x                     1      ...   K
    freq. f_x = n_x/n     f_1    ...   f_K
    p(x)                  p(1)   ...   p(K)

We want to test the hypothesis:

H0: “the distribution of the Xi’s is p(·)”

The chi-square (or χ²) test statistic is defined as:

\[ T = n \sum_{k=1}^{K} \frac{(f_k - p(k))^2}{p(k)}. \]

Theorem
If H0 is true, then when n → ∞, T is distributed as a χ²(K − 1) random variable (chi-square random variable with K − 1 degrees of freedom).

Idea of the proof: by the CLT, for large n,

\[ f_k \simeq \mathcal{N}(p(k), \mathrm{Var}(f_k)) \]

and thus

\[ f_k - p(k) \simeq \mathcal{N}(0, \mathrm{Var}(f_k)). \]

Recall from Chapter 3 that a χ²(K − 1) random variable is defined as a sum of squares of Gaussian variables...
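To see where the χ² distribution comes from, it may help to work out the simplest case K = 2. This is a worked illustration added here, not part of the original proof sketch; it takes T with the factor n, as in the examples, and uses f_2 − p(2) = −(f_1 − p(1)) and p(2) = 1 − p(1):

```latex
\begin{align*}
T &= n\left[\frac{(f_1-p(1))^2}{p(1)} + \frac{(f_2-p(2))^2}{p(2)}\right]
   = n\,(f_1-p(1))^2\left[\frac{1}{p(1)}+\frac{1}{1-p(1)}\right] \\
  &= \frac{n\,(f_1-p(1))^2}{p(1)\,(1-p(1))}
   = \left(\frac{f_1-p(1)}{\sqrt{\mathrm{Var}(f_1)}}\right)^{2},
  \qquad \text{where, under } H_0,\quad \mathrm{Var}(f_1)=\frac{p(1)\,(1-p(1))}{n}.
\end{align*}
```

By the CLT the ratio inside the square is approximately N(0, 1), so T is approximately the square of one standard Gaussian variable, that is, χ²(1) = χ²(K − 1).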



Let q be the quantile of order 0.95 of the χ²(K − 1) distribution. By definition, if H0 is true, then for large n,

    P(T ≤ q) ≈ 0.95.

Test procedure
    Compute T.
    If T > q, reject H0.
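The test procedure can be sketched as a small R function. This is only an illustration under stated assumptions: the function name chisq_fixed and its arguments are hypothetical names introduced here, and the counts are those of Examples 1 and 2.

```r
# Chi-square test of adequation to a fixed PMF 'p' (hypothetical helper).
# 'counts' are the observed numbers n_x; 'level' is the order of the quantile.
chisq_fixed <- function(counts, p, level = 0.95) {
  n     <- sum(counts)                    # sample size
  f     <- counts / n                     # observed frequencies f_x
  Tstat <- n * sum((f - p)^2 / p)         # chi-square statistic T
  q     <- qchisq(level, df = length(p) - 1)
  list(stat = Tstat, q = q, reject = Tstat > q)
}

ex1 <- chisq_fixed(c(27, 28, 21, 24), rep(1/4, 4))   # Example 1
ex2 <- chisq_fixed(c(29, 36, 20, 15), rep(1/4, 4))   # Example 2
print(c(ex1$stat, ex2$stat, ex1$q))
print(c(ex1$reject, ex2$reject))
```

The design choice is simply to return the statistic, the threshold and the decision together, so that the comparison T > q is visible in the output.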


Example 1: T = 1.2.

Example 2: T = 10.48.

> qchisq(0.95,3)
[1] 7.814728

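Here K = 4, so q = 7.8147 is the quantile of order 0.95 of χ²(3). In Example 1, T = 1.2 < q: we do not reject H0. In Example 2, T = 10.48 > q: we reject H0.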


Example 2: T = 10.48. If we sample Y1, ..., Yn from the uniform distribution, what is the probability of getting T ≥ 10.48?

    P(T ≥ 10.48) = 1 − P(T ≤ 10.48) = 1 − F(10.48)

where F is the CDF of χ²(K − 1). This quantity is defined as the p-value of the test.
> T1
[1] 1.2
> T2
[1] 10.48

> p1 = 1-pchisq(T1,3)
> p1
[1] 0.7530043

> p2 = 1-pchisq(T2,3)
> p2
[1] 0.01489718


Test procedure - alternative formulation
    Compute T.
    Compute the p-value p = 1 − F(T), where F is the CDF of χ²(K − 1).
    If p < 0.05, reject H0. The smaller p is, the more confident we are that H0 is not true.

Note: there are many test procedures in R, and some are quite different. However, for all of them, R will return a p-value. The interpretation of the p-value is always as above.
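As an illustration of this note, R's built-in chisq.test can run the test of adequation to a fixed distribution directly when it is given the counts and the hypothesised probabilities through its p argument. The sketch below applies it to the counts of Examples 1 and 2; the variable names res1 and res2 are arbitrary.

```r
hypo <- rep(1/4, 4)                              # uniform PMF under H0

res1 <- chisq.test(c(27, 28, 21, 24), p = hypo)  # Example 1
res2 <- chisq.test(c(29, 36, 20, 15), p = hypo)  # Example 2

# Each result stores the statistic, the degrees of freedom and the p-value
print(c(res1$statistic, res1$parameter, res1$p.value))
print(c(res2$statistic, res2$parameter, res2$p.value))

# Decision at the 0.05 threshold: reject H0 when the p-value is below 0.05
print(c(res1$p.value < 0.05, res2$p.value < 0.05))
```

The p-values returned this way should agree with the ones computed with 1-pchisq(T,3).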


Assume now we observe a sample X1, ..., Xn of i.i.d. discrete random variables that can take any integer value. We want to test the adequation to a distribution on the integers.
For example, we want to test whether the following sample was drawn from a Poisson distribution P(1).
> x
[1] 3 2 0 2 2 1 0 0 1 1 2 3 1 2 2 1 1 0 2 0 0 1 1 2 0 4 0 1 1 2 3 1 1 5
0 2 2 2 1 0 1 0 2 1 2 0 0 1 1 3 2 0 0 1 1 2 2 2 1 1 1 1 2 0 1 0 1 0
2 3 2 1 2 3 3 3 2 1 4 4 0 1 1 4 1 1 3 2 1 1 2 1 3 1 2 0 0 1 0 2

Problem: there are now infinitely many f_x and p(x)...

    x              0       1       2       3       4       5       6       ...
    f_x = n_x/n    0.22    0.36    0.27    0.10    0.04    0.01    0.00    ...
    p(x)           0.368   0.368   0.184   0.061   0.015   0.003   0.001   ...


To bypass this problem, we can merge events. Here we merge all the values x ≥ 4 into a single cell:

    x              0       1       2       3       ≥4
    f_x = n_x/n    0.22    0.36    0.27    0.10    0.05
    p(x)           0.368   0.368   0.184   0.061   0.019

    As a general rule, it is recommended to merge cells so that no cell has fewer than 5 observations.
    We can then perform a standard χ² test.


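Detailed code for the computation of T:
> table(x)
x
 0  1  2  3  4  5
22 36 27 10  4  1

> n = length(x)                        # sample size, here n = 100
> fx = as.vector(table(x))[1:5]/n      # frequencies of the values 0, 1, 2, 3, 4
> fx[5] = sum(table(x)[5:6])/n         # merge the values 4 and 5 into the cell ">= 4"
> fx
[1] 0.22 0.36 0.27 0.10 0.05

> hypo = dpois(c(0,1,2,3),lambda=1)    # P(X = 0), ..., P(X = 3) under P(1)
> hypo = c(hypo,1-sum(hypo))           # add P(X >= 4)
> hypo
[1] 0.36787944 0.36787944 0.18393972 0.06131324 0.01898816

> T = n*sum((hypo-fx)^2/hypo)
> T
[1] 17.49376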


First version of the test: we compare T to the quantile of order 0.95 of the χ²(4) distribution (K = 5 cells, hence K − 1 = 4 degrees of freedom):
> T
[1] 17.49376

> qchisq(0.95,4)
[1] 9.487729

Second version of the test: we compare the p-value to 0.05:
> p = 1-pchisq(T,4)
> p
[1] 0.001549331

In both cases, we reject the hypothesis that the data were sampled from P(1).
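The same test can be cross-checked with chisq.test. The sketch below is an illustration under stated assumptions: the sample is rebuilt from the counts printed by table(x) (the order of the observations does not matter for the test), the cells are merged with pmin rather than by indexing, and the names merged, counts and res are arbitrary. R usually warns here that the χ² approximation may be inaccurate because the expected count of the last cell is small; the warning is silenced in the sketch.

```r
# Rebuild the sample from the counts 22, 36, 27, 10, 4, 1 of the values 0..5
x <- rep(0:5, times = c(22, 36, 27, 10, 4, 1))

# Merge all values >= 4 into one cell, keeping the five cells 0, 1, 2, 3, >= 4
merged <- factor(pmin(x, 4), levels = 0:4)
counts <- as.vector(table(merged))

# Cell probabilities under P(1); the last one is P(X >= 4)
hypo <- c(dpois(0:3, lambda = 1), ppois(3, lambda = 1, lower.tail = FALSE))

res <- suppressWarnings(chisq.test(counts, p = hypo))
print(c(res$statistic, res$parameter, res$p.value))
```

The statistic, degrees of freedom and p-value should agree with T, K − 1 = 4 and p computed by hand.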


Assume now we observe a sample X1, ..., Xn of i.i.d. continuous random variables. We want to test the adequation to a continuous distribution.
There are many tests to do this. We will not study how they are built, but you can still use them in practice. R returns a p-value that can be interpreted as in the χ² test (reject the hypothesis if p < 0.05).
For example, the Kolmogorov-Smirnov test can be used as follows:
ks.test(x, p, a)

where
    x is the sample,
    p is the CDF of the distribution you want to test the adequation to (for example pnorm or pexp),
    a is the parameter(s) of p.

As an example, I sampled Y1, ..., Y100 from E(2) and tested its adequation to N(0, 1), E(2) and E(2.5) respectively.
> y = rexp(100,2)

> ks.test(y,pnorm,0,1)
One-sample Kolmogorov-Smirnov test

data: y
D = 0.50093, p-value < 2.2e-16
alternative hypothesis: two-sided

> ks.test(y,pexp,2)
One-sample Kolmogorov-Smirnov test

data: y
D = 0.085252, p-value = 0.4615
alternative hypothesis: two-sided

> ks.test(y,pexp,2.5)
One-sample Kolmogorov-Smirnov test

data: y
D = 0.16583, p-value = 0.008172
alternative hypothesis: two-sided
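The p-value can also be extracted from the object returned by ks.test and compared to 0.05 in code. This is a minimal sketch: the seed is an arbitrary choice made here for reproducibility, so the numbers it produces will differ from the output shown in the transcript.

```r
set.seed(42)                        # arbitrary seed, not the one behind the output shown
y <- rexp(100, rate = 2)            # sample from E(2)

res_norm <- ks.test(y, pnorm, 0, 1) # adequation to N(0, 1)
res_exp  <- ks.test(y, pexp, 2)     # adequation to E(2), the true distribution

# Decision rule: reject the hypothesis when the p-value is below 0.05
reject_norm <- res_norm$p.value < 0.05
reject_exp  <- res_exp$p.value < 0.05
print(c(res_norm$p.value, res_exp$p.value))
print(c(reject_norm, reject_exp))
```

Since an exponential sample is entirely positive, the adequation to N(0, 1) is rejected whatever the seed; the result for E(2) depends on the sample drawn.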


2 Test of adequation to a model


We now turn to the question: let X1, ..., Xn be i.i.d. random variables on the finite set {1, ..., K}, and let p(·|θ) be a family of PMFs on {1, ..., K} indexed by a parameter θ. How can we test

H0: “the distribution of the Xi’s is one of the p(·|θ)’s”?


Test procedure - adequation to a model
1 Compute an estimator θ̂ of θ (using the method of moments of Chapter 6).
2 Compute the χ² test statistic T as if you were testing the adequation to p(·|θ̂):

\[ T = n \sum_{k=1}^{K} \frac{(f_k - p(k|\hat{\theta}))^2}{p(k|\hat{\theta})}. \]

3 Under H0, when n → ∞, T is distributed as a χ²(K − 1 − d) random variable, where d is the dimension of θ.


Example :
In the previous section, we tested the adequation of the
following sample to P(1).
> x
[1] 3 2 0 2 2 1 0 0 1 1 2 3 1 2 2 1 1 0 2 0 0 1 1 2 0 4 0 1 1 2 3 1 1 5
0 2 2 2 1 0 1 0 2 1 2 0 0 1 1 3 2 0 0 1 1 2 2 2 1 1 1 1 2 0 1 0 1 0
2 3 2 1 2 3 3 3 2 1 4 4 0 1 1 4 1 1 3 2 1 1 2 1 3 1 2 0 0 1 0 2

Now, we test its adequation to the Poisson model P(λ).
First, we compute λ̂ = X̄n = 1.41.
Let us now compute T:

    x              0       1       2       3       ≥4
    f_x = n_x/n    0.22    0.36    0.27    0.10    0.05
    p(x|1.41)      0.244   0.344   0.243   0.114   0.055

T = 0.8278685

T = 0.8278685
Let us compare T to the quantile q of order 0.95 of
χ2 (K − 1 − d), here : K = 5 and d = 1.
> qchisq(0.95,3)
[1] 7.814728

As T < q, we do not reject H0.

p-value:
> p=1-pchisq(0.8278685,3)
> p
[1] 0.8427903

As p > 0.05, we do not reject H0.
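The three steps of the procedure can be sketched in R for this Poisson example. This is an illustration under stated assumptions, not the code used for the slides: the sample is rebuilt from its counts (22, 36, 27, 10, 4, 1 for the values 0 to 5), and the names lambda_hat, p_hat and Tstat are arbitrary. Because the sketch uses unrounded probabilities, its T can differ slightly from a value computed with the rounded probabilities of the table.

```r
# Sample rebuilt from the counts of the values 0..5
x <- rep(0:5, times = c(22, 36, 27, 10, 4, 1))
n <- length(x)

# Step 1: estimate lambda by the empirical mean
lambda_hat <- mean(x)

# Step 2: chi-square statistic against P(lambda_hat), cells 0, 1, 2, 3, >= 4
fx    <- as.vector(table(factor(pmin(x, 4), levels = 0:4))) / n
p_hat <- c(dpois(0:3, lambda_hat), ppois(3, lambda_hat, lower.tail = FALSE))
Tstat <- n * sum((fx - p_hat)^2 / p_hat)

# Step 3: compare to chi-square(K - 1 - d), with K = 5 cells and d = 1 parameter
df      <- length(fx) - 1 - 1
q       <- qchisq(0.95, df)
p_value <- pchisq(Tstat, df, lower.tail = FALSE)
print(c(lambda_hat, Tstat, q, p_value))
```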


3 Test of independence


Assume we observe a sample (X1, Y1), ..., (Xn, Yn) of n i.i.d. pairs of discrete random variables.
We want to test the hypothesis:

H0: Xi and Yi are independent.

It turns out that we can use the χ² test to test H0! I will describe this through an example.


Example: flu vaccination (note: these are not real data).
For n = 10,000 patients, we collect two variables:
    X, the vaccination status: X = “V” or “non-V”;
    Y, the flu status: Y = “no flu”, “mild flu” or “severe flu”.
We obtain the following counts:

                  V       non-V
    no flu        4850    4741
    mild flu       145     225
    severe flu       5      34

(this is called a contingency table).


General case
The probability distribution is not constrained:

                  V                    non-V
    no flu        p(V, no flu)         p(non-V, no flu)
    mild flu      p(V, mild flu)       p(non-V, mild flu)
    severe flu    p(V, severe flu)     p(non-V, severe flu)

Independent case

                  V                    non-V
    no flu        p(V)p(no flu)        p(non-V)p(no flu)
    mild flu      p(V)p(mild flu)      p(non-V)p(mild flu)
    severe flu    p(V)p(severe flu)    p(non-V)p(severe flu)



Put α = p(V), then p(non-V) = 1 − α.
Put β = p(no flu) and γ = p(mild flu), then p(severe flu) = 1 − β − γ.

                  V                 non-V
    no flu        αβ                (1 − α)β
    mild flu      αγ                (1 − α)γ
    severe flu    α(1 − β − γ)      (1 − α)(1 − β − γ)


The independence hypothesis is exactly equivalent to the statistical model p(·|θ), with θ = (α, β, γ), given by the table above.

Thus, we can use the test of adequation to a statistical model seen in the previous section!


Observed frequencies, with the estimates of α, β and γ obtained from the row and column sums:

                  V         non-V
    no flu        0.4850    0.4741    β̂ = 0.4850 + 0.4741 = 0.9591
    mild flu      0.0145    0.0225    γ̂ = 0.0145 + 0.0225 = 0.0370
    severe flu    0.0005    0.0034
                  α̂ = 0.4850 + 0.0145 + 0.0005 = 0.5


Observed frequencies

                  V         non-V
    no flu        0.4850    0.4741
    mild flu      0.0145    0.0225
    severe flu    0.0005    0.0034

Estimated probabilities p(·|α̂, β̂, γ̂) under independence

                  V          non-V
    no flu        0.47955    0.47955
    mild flu      0.0185     0.0185
    severe flu    0.00195    0.00195

\[ T = n \times \left( \frac{(0.47955 - 0.4850)^2}{0.47955} + \cdots + \frac{(0.00195 - 0.0034)^2}{0.00195} \right) = 40.1. \]
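The computation of T for the flu example can be reproduced step by step in R. The following is a sketch under stated assumptions: the names row_hat, col_hat, p_hat and Tstat are arbitrary, and the products of the estimated margins are formed with outer.

```r
# Contingency table of the flu example (rows: flu status, columns: vaccination)
M <- matrix(c(4850, 145, 5, 4741, 225, 34), nrow = 3, ncol = 2,
            dimnames = list(c("no flu", "mild flu", "severe flu"),
                            c("V", "non-V")))
n <- sum(M)
f <- M / n                                 # observed frequencies

# Estimates of the parameters from the margins
alpha_hat <- sum(f[, "V"])                 # estimate of p(V)
row_hat   <- rowSums(f)                    # beta_hat, gamma_hat, 1 - beta_hat - gamma_hat
col_hat   <- c(alpha_hat, 1 - alpha_hat)   # p(V), p(non-V)

# Estimated probabilities under independence: products of the margins
p_hat <- outer(row_hat, col_hat)

# Chi-square statistic, then comparison to chi-square(K - 1 - d)
Tstat   <- n * sum((f - p_hat)^2 / p_hat)
df      <- length(M) - 1 - 3               # K = 6 cells, d = 3 parameters
p_value <- pchisq(Tstat, df, lower.tail = FALSE)
print(c(alpha_hat, Tstat, df, p_value))
```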


T = 40.1
T is to be compared to the quantile q of order 0.95 of
χ2 (K − 1 − d) where K = 6 cells and d = 3 as θ has 3 entries.
> qchisq(0.95,2)
[1] 5.991465

As T > q, we reject the independence hypothesis in this case.


With the p-value :
> p = 1-pchisq(40.1,2)
> p
[1] 1.96063e-09

as p < 0.05, we indeed reject the independence hypothesis.


The whole procedure might seem a little complicated. Good news: there is a function that will do it for you!

> M = matrix(data=c(4850,145,5,4741,225,34),3,2)
> M
     [,1] [,2]
[1,] 4850 4741
[2,]  145  225
[3,]    5   34

> chisq.test(M)

Pearson’s Chi-squared test

data: M
X-squared = 40.1, df = 2, p-value = 1.96e-09


Exercise 1: more generally, we can use this procedure to test the independence of two variables X ∈ {1, 2, ..., MX} and Y ∈ {1, 2, ..., MY}. In this case, what would be the distribution of T?
Exercise 2: the file exercise2.csv contains two variables I simulated myself (but you don’t know from which distributions).
1 Load the data.
2 Test the adequation of the variable cont to the distribution N(0, 1).
3 Test the adequation of the variable disc to the distribution Bin(5, 0.5).
4 Test the adequation of the variable disc to the model Bin(5, θ).

Exercise 3:
    Load the data from the file Credit.csv.
    What are the variables in the dataset?
    You can obtain the contingency table of two variables x and y using table(x,y). Show the contingency table of the variables Married and Student. Test the independence of these two variables.
    Show the contingency table of the variables Cards and Gender. Test the independence of these two variables. What do you observe? How can you fix this?
    A nice way to visualize contingency tables is to use the gplots library. First, install this library. Then, try the following (cr denotes the data frame loaded from Credit.csv):

> library(gplots)
> balloonplot(table(cr$Cards,cr$Student))

> balloonplot(table(cr$Cards,cr$Student), main="Contingency table",
+             xlab ="N. credit cards", ylab="Student")

> balloonplot(table(cr$Cards,cr$Student), main="Contingency table",
+             xlab ="N. credit cards", ylab="Student", label = FALSE)

> balloonplot(table(cr$Cards,cr$Student), main="Contingency table",
+             xlab ="N. credit cards", ylab="Student", show.margins = FALSE)
