Chapter 7 Moodle

Statistics
Chapter 7: Tests
Pierre Alquier
Pierre Alquier, ESSEC Statistics - Chapter 7

In Chapter 6, we assumed that X1, ..., Xn were i.i.d. from some distribution of the form p(·|θ), and we studied how to estimate θ. Today, we address the question: do the data support this assumption?
1 First, we address a simpler question: given a distribution p(·) and a sample X1, ..., Xn, can we test the hypothesis “p is the distribution of the Xi”? → “test of adequation to a fixed distribution” (also known as a goodness-of-fit test).
   We start with discrete variables,
   then we briefly discuss continuous variables.
2 Second, we return to the initial question: is one of the p(·|θ) the distribution of the Xi? → “test of adequation to a model”.
   We cover only discrete variables in this course.
3 Finally, we see another application of these tests: testing the independence of two variables.
1 Test of adequation to a fixed distribution
2 Test of adequation to a model
3 Test of independence


Let us start with the following question: let X1, ..., Xn be i.i.d. random variables on the finite set {1, ..., K}, and let p(·) be a PMF on {1, ..., K}. How can we test
H0: “the distribution of the Xi’s is p(·)”?

Example 1: Xi ∈ {1, 2, 3, 4}, sample size n = 100. We observe:
  value x              1    2    3    4
  number of obs. nx   27   28   21   24
Is this compatible with the uniform distribution p(x) = 1/4?
Example 2: Xi ∈ {1, 2, 3, 4}, sample size n = 100. We observe:
  value x              1    2    3    4
  number of obs. nx   29   36   20   15
Is this compatible with the uniform distribution p(x) = 1/4?
Idea: compare the observed frequencies to the probabilities specified by the hypothesis.

Example 1:
  x                      1      2      3      4
  frequency fx = nx/n    0.27   0.28   0.21   0.24
  p(x)                   0.25   0.25   0.25   0.25
  fx − p(x)              0.02   0.03  −0.04  −0.01
We could measure the distance between the frequencies and the probabilities by:
\[ \sum_{x=1}^{4} (f_x - p(x))^2 = 0.02^2 + 0.03^2 + (-0.04)^2 + (-0.01)^2 = 0.003. \]
It turns out to be better to weight the terms (fx − p(x))², that is, we put:
\[ T = n \sum_{x=1}^{4} \frac{(f_x - p(x))^2}{p(x)} = 100 \times \left( \frac{0.02^2}{0.25} + \frac{0.03^2}{0.25} + \frac{(-0.04)^2}{0.25} + \frac{(-0.01)^2}{0.25} \right) = 1.2. \]
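As a minimal sketch, the two quantities for Example 1 can be recomputed in R from the observed counts. The variable names (counts, p, dist2, T_stat) are illustrative choices; T_stat is used instead of T to avoid masking R's shorthand for TRUE.

```r
# Observed counts from Example 1 and hypothesized uniform PMF
counts <- c(27, 28, 21, 24)
p <- rep(1/4, 4)

n <- sum(counts)          # sample size, here 100
f <- counts / n           # observed frequencies f_x = n_x / n

# Unweighted squared distance, then the weighted (chi-square) statistic
dist2  <- sum((f - p)^2)
T_stat <- n * sum((f - p)^2 / p)

print(dist2)
print(T_stat)
```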
How do we interpret this value?

Let us simulate 25000 samples “as they would be if the hypothesis were true”, that is, Y1, ..., Yn are uniform random variables on {1, 2, 3, 4}. For each sample, we compute the T statistic.
> hypo = c(1/4,1/4,1/4,1/4)
> n = 100                                # sample size
> Tdist = c()
> for (i in 1:25000)
> {
>   X = sample(x=c(1,2,3,4),size=n,replace=TRUE,prob=hypo)
>   f = table(factor(X,levels=1:4))/n    # frequencies, keeping empty cells
>   T = n*sum((hypo-f)^2/hypo)           # chi-square statistic
>   Tdist = c(Tdist,T)
> }
> hist(Tdist,breaks=30)
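A possible vectorized variant of this simulation is sketched below, assuming it is acceptable to draw the cell counts directly with rmultinom (which has the same distribution as tabulating the output of sample). It also estimates how often the simulated statistic is at least as large as the values 1.2 and 10.48 obtained for Examples 1 and 2. The seed and the variable names are arbitrary choices.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
hypo <- rep(1/4, 4)
n    <- 100
nsim <- 25000

# Each column of 'counts' holds the cell counts of one simulated sample
counts <- rmultinom(nsim, size = n, prob = hypo)
f      <- counts / n
# hypo is recycled down each column, so row k is compared with hypo[k]
Tdist  <- n * colSums((f - hypo)^2 / hypo)

# Proportion of simulated samples with a statistic at least as large
prop_ex1 <- mean(Tdist >= 1.2 - 1e-9)     # small tolerance for rounding ties
prop_ex2 <- mean(Tdist >= 10.48 - 1e-9)
print(c(prop_ex1, prop_ex2))
```

A large first proportion and a small second one would indicate that a value like 1.2 is ordinary under the hypothesis while a value like 10.48 is unusual.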


Example 1: as computed above, T = 1.2.
Example 2:
  x                      1      2      3      4
  frequency fx = nx/n    0.29   0.36   0.20   0.15
  p(x)                   0.25   0.25   0.25   0.25
  fx − p(x)              0.04   0.11  −0.05  −0.10
\[ T = n \sum_{x=1}^{4} \frac{(f_x - p(x))^2}{p(x)} = 100 \times \left( \frac{0.04^2}{0.25} + \frac{0.11^2}{0.25} + \frac{(-0.05)^2}{0.25} + \frac{(-0.10)^2}{0.25} \right) = 10.48. \]


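In other words:
The data in Example 1 are typical of a uniform distribution on {1, 2, 3, 4}: values of T around 1.2 are common among the simulated samples.
The data in Example 2 are not typical of a uniform distribution: values of T as large as 10.48 are rare among the simulated samples.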

In a general setting, assume we observe random variables X1, ..., Xn on {1, ..., K}:
  x                  1      ...   K
  freq. fx = nx/n    f1     ...   fK
  p(x)               p(1)   ...   p(K)
We want to test the hypothesis:
H0: “the distribution of the Xi’s is p(·)”
The chi-square (or χ²) test statistic is defined as:
\[ T = n \sum_{k=1}^{K} \frac{(f_k - p(k))^2}{p(k)}. \]
Theorem
If H0 is true, then when n → ∞, the distribution of T converges to that of a χ²(K − 1) random variable (chi-square random variable with K − 1 degrees of freedom).
Idea of the proof: by the CLT, for large n,
\[ f_k \approx \mathcal{N}(p(k), \mathrm{Var}(f_k)) \]
and thus
\[ f_k - p(k) \approx \mathcal{N}(0, \mathrm{Var}(f_k)). \]
Recall from Chapter 3 that a χ²(K − 1) random variable is defined as a sum of squares of K − 1 independent standard Gaussian variables...
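To see where the K − 1 comes from, here is a short worked derivation in the simplest case K = 2 (a sketch, not the full proof). Write p = p(1), so that p(2) = 1 − p and f_2 − p(2) = −(f_1 − p).

```latex
\begin{aligned}
T &= n\,\frac{(f_1 - p)^2}{p} + n\,\frac{(f_2 - (1-p))^2}{1-p}
   = n\,(f_1 - p)^2\left(\frac{1}{p} + \frac{1}{1-p}\right) \\
  &= \frac{n\,(f_1 - p)^2}{p(1-p)}
   = \left(\frac{f_1 - p}{\sqrt{p(1-p)/n}}\right)^2 .
\end{aligned}
```

Since Var(f_1) = p(1 − p)/n, the quantity inside the square is approximately N(0, 1) by the CLT. So T is approximately the square of a single standard Gaussian variable, that is, χ²(1) = χ²(K − 1): the two cells carry only one free deviation because the frequencies sum to 1.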

Let q be the quantile of order 0.95 of the χ²(K − 1) distribution. By definition, if H0 is true, then for large n,
P(T ≤ q) ≈ 0.95.
Test procedure
Compute T.
If T > q, reject H0 (if H0 is true, this happens with probability about 0.05).
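The procedure can be wrapped in a small helper function. This is only a sketch: the function name chisq_fixed_test and its arguments are hypothetical (it is not an existing R function), and the counts used in the usage lines are those of Examples 1 and 2.

```r
# Chi-square test of adequation to a fixed PMF p, from observed cell counts.
# 'level' is the level of the test: 0.05 corresponds to the quantile of order 0.95.
chisq_fixed_test <- function(counts, p, level = 0.05) {
  stopifnot(length(counts) == length(p), abs(sum(p) - 1) < 1e-8)
  n <- sum(counts)
  f <- counts / n                              # observed frequencies
  T_stat <- n * sum((f - p)^2 / p)             # chi-square statistic
  q <- qchisq(1 - level, df = length(p) - 1)   # quantile of chi-square(K - 1)
  list(statistic = T_stat, quantile = q, reject = T_stat > q)
}

ex1 <- chisq_fixed_test(c(27, 28, 21, 24), rep(1/4, 4))
ex2 <- chisq_fixed_test(c(29, 36, 20, 15), rep(1/4, 4))
print(ex1$reject)   # decision for Example 1
print(ex2$reject)   # decision for Example 2
```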

Example 1: T = 1.2.
Example 2: T = 10.48.
Here K = 4, so we use the quantile of order 0.95 of χ²(3):
> qchisq(0.95,3)
[1] 7.814728
q = 7.8147
In Example 1, T = 1.2 ≤ q: we do not reject H0. In Example 2, T = 10.48 > q: we reject H0.


Example 2: T = 10.48. If we sample Y1, ..., Yn from the uniform distribution, what is the probability of getting T ≥ 10.48?
P(T ≥ 10.48) = 1 − P(T < 10.48) ≈ 1 − F(10.48)
where F is the CDF of χ²(K − 1). This quantity is called the p-value of the test.
> T1
[1] 1.2
> T2
[1] 10.48
> p1 = 1-pchisq(T1,3)
> p1
[1] 0.7530043
> p2 = 1-pchisq(T2,3)
> p2
[1] 0.01489718

Test procedure - alternative formulation

Compute T.
Compute the p-value p = 1 − F(T), where F is the CDF of χ²(K − 1).
If p < 0.05, reject H0. The smaller p is, the more confident we are that H0 is not true.
Note: there are many test procedures in R, and some are quite different. However, for all of them, R will return a p-value. The interpretation of the p-value is always as above.
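For instance, the built-in function chisq.test accepts a vector of counts together with a vector of hypothesized probabilities (argument p) and returns the statistic and the p-value directly. The sketch below applies it to the counts of Examples 1 and 2; the object names res1 and res2 are arbitrary.

```r
res1 <- chisq.test(c(27, 28, 21, 24), p = rep(1/4, 4))
res2 <- chisq.test(c(29, 36, 20, 15), p = rep(1/4, 4))

# The statistic and the p-value are stored in the returned object
print(res1$statistic); print(res1$p.value)
print(res2$statistic); print(res2$p.value)

# Decision at level 0.05: reject H0 when the p-value is below 0.05
print(res2$p.value < 0.05)
```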


Assume now we observe a sample X1, ..., Xn of i.i.d. discrete random variables that can take any integer value.
We want to test the adequation to a distribution on the integers.
For example, we want to test whether the following sample was drawn from a Poisson distribution P(1).
> x
[1] 3 2 0 2 2 1 0 0 1 1 2 3 1 2 2 1 1 0 2 0 0 1 1 2 0 4 0 1 1 2 3 1 1 5
0 2 2 2 1 0 1 0 2 1 2 0 0 1 1 3 2 0 0 1 1 2 2 2 1 1 1 1 2 0 1 0 1 0
2 3 2 1 2 3 3 3 2 1 4 4 0 1 1 4 1 1 3 2 1 1 2 1 3 1 2 0 0 1 0 2
Problem: there are now infinitely many fx and p(x)...

  x            0       1       2       3       4       5       6       ...
  fx = nx/n    0.22    0.36    0.27    0.10    0.04    0.01    0.00    ...
  p(x)         0.368   0.368   0.184   0.061   0.015   0.003   0.001   ...

To bypass this problem, we can merge events:

  x            0       1       2       3       ≥4
  fx = nx/n    0.22    0.36    0.27    0.10    0.05
  p(x)         0.368   0.368   0.184   0.061   0.019

As a general rule, it is recommended to merge cells so that no cell has fewer than 5 observations.
We can then perform a standard χ² test.

Detailed code for the computation of T :

> table(x)
x
0 1 2 3 4 5
22 36 27 10 4 1
> n = length(x)
> fx = table(x)[1:5]/n
> fx[5] = sum(table(x)[5:6]/n)
> fx
[1] 0.22 0.36 0.27 0.10 0.05
> hypo = dpois(c(0,1,2,3),lambda=1)

> hypo = c(hypo,1-sum(hypo))
> hypo
[1] 0.36787944 0.36787944 0.18393972 0.06131324 0.01898816
> T = n*sum((hypo-fx)^2/hypo)
> T
[1] 17.49376
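A more compact way to obtain the same statistic is sketched below, under two assumptions: the sample is rebuilt from the counts shown by table(x) (the order of the observations does not matter for T), and the cells are merged with pmin, which caps every value at 4. The variable names are illustrative.

```r
# Rebuild a sample with the same counts as table(x)
x <- rep(0:5, times = c(22, 36, 27, 10, 4, 1))
n <- length(x)

# Merge the cells 4, 5, ... into a single cell ">= 4"
x_merged <- pmin(x, 4)
fx <- as.numeric(table(factor(x_merged, levels = 0:4))) / n

# Poisson(1) probabilities of 0, 1, 2, 3 and of the merged cell
hypo <- c(dpois(0:3, lambda = 1), ppois(3, lambda = 1, lower.tail = FALSE))

T_stat <- n * sum((fx - hypo)^2 / hypo)
print(T_stat)
```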

First version of the test: we compare T to the quantile of order 0.95 of the χ²(4) distribution (K = 5 cells after merging, so K − 1 = 4):
> T
[1] 17.49376
> qchisq(0.95,4)
[1] 9.487729
Second version of the test: we compare the p-value to 0.05:

> p = 1-pchisq(T,4)
> p
[1] 0.001549331
In both cases, we reject the hypothesis that the data were sampled from P(1).

Assume now we observe a sample X1, ..., Xn of i.i.d. continuous random variables. We want to test the adequation to a continuous distribution.
There are many tests to do this. We will not study how they are built, but you can still use them in practice. R returns a p-value that can be interpreted as in the χ² test (reject the hypothesis if p < 0.05).
For example, the Kolmogorov-Smirnov test can be used as follows:
ks.test(x, p, a)
where
x is the sample,
p is the CDF of the distribution you want to test the adequation to (for example pnorm or pexp),
a is the parameter(s) of p.
As an example, I sampled Y1, ..., Y100 from E(2) and tested its adequation to N(0, 1), E(2) and E(2.5) respectively.
> y = rexp(100,2)
> ks.test(y,pnorm,0,1)
One-sample Kolmogorov-Smirnov test
data: y
D = 0.50093, p-value < 2.2e-16
alternative hypothesis: two-sided
> ks.test(y,pexp,2)
data: y
D = 0.085252, p-value = 0.4615
> ks.test(y,pexp,2.5)
data: y
D = 0.16583, p-value = 0.008172
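When the decision has to be used in a script, the result of ks.test can be stored and its components read directly. A minimal sketch follows; the seed is arbitrary, so the numbers will differ from the output shown above, and the object names are illustrative.

```r
set.seed(42)                     # arbitrary seed
y <- rexp(100, rate = 2)

res_norm <- ks.test(y, "pnorm", 0, 1)   # adequation to N(0,1)
res_exp2 <- ks.test(y, "pexp", 2)       # adequation to E(2), the sampling distribution

print(res_norm$statistic)        # the distance D
print(res_norm$p.value < 0.05)   # TRUE means: reject at level 0.05
print(res_exp2$p.value)
```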


We now turn to the question: let X1, ..., Xn be i.i.d. random variables on the finite set {1, ..., K}. Let p(·|θ) be a family of PMFs on {1, ..., K} indexed by a parameter θ. How can we test
H0: “the distribution of the Xi’s is one of the p(·|θ)’s”?

Test procedure - adequation to a model

1 Compute an estimator θ̂ of θ (using the method of moments of Chapter 6).
2 Compute the χ² test statistic T as if you were testing the adequation to p(·|θ̂):
\[ T = n \sum_{k=1}^{K} \frac{(f_k - p(k|\hat\theta))^2}{p(k|\hat\theta)}. \]
3 Under H0, when n → ∞, T is distributed as a χ²(K − 1 − d) random variable, where d is the dimension of θ.

Example:
In the previous section, we tested the adequation of the sample x (n = 100 values) to P(1).
Now, we test its adequation to the Poisson model P(λ).

First, we compute λ̂ = X̄n = 1.41.
Let us now compute T:
  x            0       1       2       3       ≥4
  fx = nx/n    0.22    0.36    0.27    0.10    0.05
  p(x|1.41)    0.244   0.344   0.243   0.114   0.055
T = 0.8278685 (computed from the rounded probabilities in the table).
Let us compare T to the quantile q of order 0.95 of χ²(K − 1 − d); here K = 5 and d = 1.
> qchisq(0.95,3)
[1] 7.814728
As T < q, we do not reject H0.

p-value :
> p=1-pchisq(0.8278685,3)
> p
[1] 0.8427903
as p > 0.05, we don’t reject H0 .
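The whole test of adequation to the Poisson model can be sketched in a few lines. Two assumptions are made here: the sample is rebuilt from its counts, and the Poisson probabilities are not rounded, so the statistic comes out slightly different from the value obtained from the rounded table. The conclusion is the same. The variable names are illustrative.

```r
x <- rep(0:5, times = c(22, 36, 27, 10, 4, 1))   # same counts as the sample x
n <- length(x)

lambda_hat <- mean(x)                             # estimator of lambda

# Observed frequencies and model probabilities on the cells 0, 1, 2, 3, >= 4
fx   <- as.numeric(table(factor(pmin(x, 4), levels = 0:4))) / n
hypo <- c(dpois(0:3, lambda_hat),
          ppois(3, lambda_hat, lower.tail = FALSE))

T_stat <- n * sum((fx - hypo)^2 / hypo)
df     <- length(hypo) - 1 - 1                    # K - 1 - d with K = 5, d = 1
p_val  <- 1 - pchisq(T_stat, df)

print(c(lambda_hat, T_stat, p_val))
```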


Assume we observe a sample (X1, Y1), ..., (Xn, Yn) of n i.i.d. pairs of discrete random variables.
We want to test the hypothesis:
H0: Xi and Yi are independent.
It turns out that we can use the χ² test to test H0!

I will describe this on an example.

Example: flu vaccination (note: these are not real data).

For n = 10,000 patients, we collect two variables:
X, the vaccination status: X = “V” or “non-V”.
Y, the outcome: Y = “no flu”, “mild flu” or “severe flu”.
We obtain the following counts:
                V      non-V
  no flu        4850   4741
  mild flu       145    225
  severe flu       5     34
(this is called a contingency table).

General case
The probability distribution is not constrained:
                V                  non-V
  no flu        p(V, no flu)       p(non-V, no flu)
  mild flu      p(V, mild flu)     p(non-V, mild flu)
  severe flu    p(V, severe flu)   p(non-V, severe flu)
Independent case
                V                  non-V
  no flu        p(V)p(no flu)      p(non-V)p(no flu)
  mild flu      p(V)p(mild flu)    p(non-V)p(mild flu)
  severe flu    p(V)p(severe flu)  p(non-V)p(severe flu)
Put α = p(V), then p(non-V) = 1 − α.
Put β = p(no flu) and γ = p(mild flu), then p(severe flu) = 1 − β − γ.
                V                  non-V
  no flu        αβ                 (1 − α)β
  mild flu      αγ                 (1 − α)γ
  severe flu    α(1 − β − γ)       (1 − α)(1 − β − γ)

The independence hypothesis is exactly equivalent to the following statistical model p(·|θ), where θ = (α, β, γ):
                V                  non-V
  no flu        αβ                 (1 − α)β
  mild flu      αγ                 (1 − α)γ
  severe flu    α(1 − β − γ)       (1 − α)(1 − β − γ)
Thus, we can use the test of adequation to a statistical model seen in the previous section!

Observed frequencies
                V        non-V
  no flu        0.4850   0.4741
  mild flu      0.0145   0.0225
  severe flu    0.0005   0.0034
We estimate α, β and γ from the margins of this table:
  α̂ = 0.4850 + 0.0145 + 0.0005 = 0.5
  β̂ = 0.4850 + 0.4741 = 0.9591
  γ̂ = 0.0145 + 0.0225 = 0.0370
Estimated probabilities p(·|α̂, β̂, γ̂) under independence
                V         non-V
  no flu        0.47955   0.47955
  mild flu      0.0185    0.0185
  severe flu    0.00195   0.00195
\[ T = n \times \left( \frac{(0.47955 - 0.4850)^2}{0.47955} + \cdots + \frac{(0.00195 - 0.0034)^2}{0.00195} \right) = 40.1. \]

T = 40.1 is to be compared to the quantile q of order 0.95 of χ²(K − 1 − d), where K = 6 cells and d = 3 as θ has 3 entries.
> qchisq(0.95,2)
[1] 5.991465
As T > q, we reject the independence hypothesis in this case.

With the p-value :
> p = 1-pchisq(40.1,2)
> p
[1] 1.96063e-09
as p < 0.05, we indeed reject the independence hypothesis.
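The steps of this example can be reproduced from the counts with outer(), which builds the table of products of the marginal frequencies. This is a sketch; the object names are illustrative.

```r
counts <- matrix(c(4850, 145, 5, 4741, 225, 34), nrow = 3, ncol = 2,
                 dimnames = list(c("no flu", "mild flu", "severe flu"),
                                 c("V", "non-V")))
n <- sum(counts)
f <- counts / n                        # observed frequencies

# Estimated marginals: rows give (beta, gamma, 1 - beta - gamma),
# columns give (alpha, 1 - alpha)
row_marg <- rowSums(f)
col_marg <- colSums(f)
p_indep  <- outer(row_marg, col_marg)  # estimated probabilities under independence

T_stat <- n * sum((f - p_indep)^2 / p_indep)
df     <- length(counts) - 1 - 3       # K - 1 - d with K = 6, d = 3
p_val  <- 1 - pchisq(T_stat, df)
print(c(T_stat, p_val))
```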

The whole procedure might seem a little complicated.

Good news: there is a function that will do it for you!
> M = matrix(data=c(4850,145,5,4741,225,34),3,2)
> M
     [,1] [,2]
[1,] 4850 4741
[2,]  145  225
[3,]    5   34
> chisq.test(M)
        Pearson's Chi-squared test
data:  M
X-squared = 40.1, df = 2, p-value = 1.96e-09
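The object returned by chisq.test also contains the intermediate quantities that were computed by hand in this example. A short sketch (the name res is arbitrary):

```r
M <- matrix(data = c(4850, 145, 5, 4741, 225, 34), 3, 2)
res <- chisq.test(M)

print(res$expected)    # expected counts under independence (n times the estimated probabilities)
print(res$statistic)   # the statistic T
print(res$parameter)   # degrees of freedom
print(res$p.value)     # p-value
```

Dividing res$expected by the total count gives back the table of estimated probabilities under independence.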

Exercise 1: more generally, we can use this procedure to test the independence of two variables X ∈ {1, 2, ..., MX} and Y ∈ {1, 2, ..., MY}. In this case, what would be the distribution of T?
Exercise 2: the file exercise2.csv contains two variables I simulated myself (but you don’t know from which distributions).
1 Load the data.
2 Test the adequation of the variable cont to the distribution N(0, 1).
3 Test the adequation of the variable disc to the distribution Bin(5, 0.5).
4 Test the adequation of the variable disc to the model Bin(5, θ).
Exercise 3:
Load the data from the file Credit.csv.
What are the variables in the dataset?
You can obtain the contingency table of two variables x and y using table(x,y). Show the contingency table of the variables Married and Student. Test the independence of these two variables.
Show the contingency table of the variables Cards and Gender. Test the independence of these two variables. What do you observe? How can you fix this?
A nice way to visualize contingency tables is to use the gplots library. First, install this library. Then, try the following (assuming the data were loaded into a data frame named cr):
> library(gplots)
> balloonplot(table(cr$Cards,cr$Student))
> balloonplot(table(cr$Cards,cr$Student), main="Contingency table",
    xlab ="N. credit cards", ylab="Student")
> balloonplot(table(cr$Cards,cr$Student), main="Contingency table",
    xlab ="N. credit cards", ylab="Student", label = FALSE)
> balloonplot(table(cr$Cards,cr$Student), main="Contingency table",
    xlab ="N. credit cards", ylab="Student", show.margins = FALSE)
