Hypothesis Testing

Statistical foundations of machine
learning
INFO-F-422
Gianluca Bontempi
Département d'Informatique
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di
Machine learning p. 1/45
Testing hypothesis
Hypothesis testing is the second major area of statistical inference.
A statistical hypothesis is an assertion or conjecture about the distribution
of one or more random variables.
A test of a statistical hypothesis is a rule or procedure for deciding
whether to reject the assertion on the basis of the observed data.
The basic idea is to formulate some statistical hypothesis and look to see if
the data provide any evidence to reject the hypothesis.
A hypothesis testing problem

Consider the model of the traffic in the boulevard.
Suppose that the measures of the inter-arrival times are
$D_N = \{10, 11, 1, 21, 2, \dots\}$ seconds.
Can we say that the mean inter-arrival time is different from 10?
Consider the grades of two different school sections.
Section A had {15, 10, 12, 19, 5, 7}.
Section B had {14, 11, 11, 12, 6, 7}.
Can we say that Section A had better grades than Section B?
Consider two protein coding genes and their expression levels in a cell.
Are the two genes differentially expressed?
A statistical test is a procedure that aims to answer such questions.
Types of hypothesis
We start by declaring the working (basic, null) hypothesis $H$ to be tested, in the
form $\theta = \theta_0$ or $\theta \in \omega$, where $\theta_0$ or $\omega$ are given.
The hypothesis can be
Simple.
It fully specifies the distribution of $z$.
Composite.
It partially specifies the distribution of $z$.
Example:
if $D_N$ constitutes a random sample of size $N$ from $\mathcal{N}(\mu, \sigma^2)$, the
hypothesis $H: \mu = \mu_0, \sigma = \sigma_0$ (with $\mu_0$ and $\sigma_0$ known values) is simple, while
the hypothesis $H: \mu = \mu_0$ is composite since it leaves open the value of $\sigma$ in
$(0, \infty)$.
Types of statistical test

Suppose we have collected $N$ samples $D_N = \{z_1, \dots, z_N\}$ from a distribution
$F_z$ and we have declared a null hypothesis $H$ about $F$.
The three most common types of statistical test are:
Pure significance test:
data $D_N$ are used to assess the inferential evidence
against $H$.
Significance test:
the inferential evidence against $H$ is used to judge whether
$H$ is inappropriate. In other words it is a rule for rejecting $H$.
Hypothesis test:
data $D_N$ are used to assess the hypothesis $H$ against a
specific alternative hypothesis $\bar{H}$. In other words this is a rule for
rejecting $H$ in favour of $\bar{H}$.
Pure significance test

Suppose that the null hypothesis $H$ is simple.
Let $t(D_N)$ be a statistic such that the larger its value, the more it casts
doubt on $H$.
The quantity $t(D_N)$ is called test statistic or discrepancy measure.
Let $t_N = t(D_N)$ be the value of $t$ calculated on the basis of the sample data
$D_N$.
Let us consider the p-value quantity
$p = \mathrm{Prob}\{t(D_N) > t_N \mid H\}$
If $p$ is small, the sample data $D_N$ are highly inconsistent with $H$, and $p$
(significance probability or significance level) is the measure of such
inconsistency.
Some considerations
$p$ is the proportion of situations under the hypothesis $H$ where we would
observe a degree of inconsistency at least to the extent represented by
$t_N$.
$t_N$ is the observed value of the statistic for a given $D_N$. Different $D_N$
yield different values of $p \in (0, 1)$.
It is essential that the distribution of $t(D_N)$ under $H$ is known.
We cannot say that $p$ is the probability that $H$ is true. Rather, $p$ is the
probability, computed assuming that $H$ is true, of observing a discrepancy at
least as large as the one measured on the dataset $D_N$.
Open issues
1. What if $H$ is composite?
2. How to choose $t(D_N)$?
Tests of significance
Suppose that the value $p$ is known. If $p$ is small, either a rare event has
occurred or perhaps $H$ is not true.
Idea: if $p$ is less than some stated value $\alpha$, we reject $H$.
We choose a critical level $\alpha$, we observe $D_N$ and we reject $H$ at level $\alpha$ if
$\mathrm{Prob}\{t(D_N) > t_N \mid H\} \le \alpha$
This is equivalent to choosing some critical value $t_\alpha$ and rejecting $H$ if
$t_N > t_\alpha$.
We obtain two regions in the space of sample data:
critical region
$S_0$ where if $D_N \in S_0$ we reject $H$.
non-critical region
$S_1$ where the sample data $D_N$ give us no reason to
reject $H$ on the basis of the level-$\alpha$ test.
Some considerations
The principle is that we will accept $H$ unless we witness some event that
has sufficiently small probability of arising when $H$ is true.
If $H$ were true we could still obtain data in $S_0$ and consequently wrongly
reject $H$ with probability
$\mathrm{Prob}\{D_N \in S_0 \mid H\} = \mathrm{Prob}\{t(D_N) > t_\alpha \mid H\} = \alpha$
The significance level $\alpha$ provides an upper bound to the maximum
probability of incorrectly rejecting $H$.
The p-value is the probability that the test statistic is more extreme than
its observed value. The p-value changes with the observed data (i.e. it is
a random variable) while $\alpha$ is a level fixed by the user.
Standard normal distribution

[Figure: normal distribution function and normal density function ($\mu = 0$, $\sigma = 1$).]
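Remember that $z_{0.05} \approx 1.645$.
This means that, if $z \sim \mathcal{N}(0, 1)$, then $\mathrm{Prob}\{z \ge z_{0.05}\} = 0.05$ and also that
$\mathrm{Prob}\{|z| \ge z_{0.05}\} = 2 \cdot 0.05 = 0.1$
For a generic $z \sim \mathcal{N}(\mu, \sigma^2)$
$\mathrm{Prob}\{|z - \mu| / \sigma \ge z_{0.05}\} = 2 \cdot 0.05 = 0.1$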
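The quantile quoted above can be checked numerically. The following is a minimal sketch using Python's standard library rather than the R commands used elsewhere in these slides; the variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Upper 5% point: the value z_0.05 such that Prob{z > z_0.05} = 0.05
z_005 = std_normal.inv_cdf(1 - 0.05)

# Recompute the one-tail and two-tail probabilities from the quantile
upper_tail = 1 - std_normal.cdf(z_005)
two_tails = 2 * upper_tail

print(round(z_005, 3), round(upper_tail, 3), round(two_tails, 3))
```

The two-tail probability is simply twice the upper-tail one because the standard normal density is symmetric around zero.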
TP: example
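Let $D_N$ consist of $N$ independent observations of $x \sim \mathcal{N}(\mu, \sigma^2)$, with
known variance $\sigma^2$.
We want to test the hypothesis $H: \mu = \mu_0$ with $\mu_0$ known.
Consider as test statistic $t(D_N)$ the quantity $|\hat{\mu} - \mu_0|$ where $\hat{\mu}$ is the
sample average estimator. If $H$ is true we know that $\hat{\mu} \sim \mathcal{N}(\mu_0, \sigma^2/N)$.
Let us calculate the value $t(D_N) = |\hat{\mu} - \mu_0|$ and assume that the
rejection region is $S_0 = \{D_N : |\hat{\mu} - \mu_0| > t_\alpha\}$.
Let us put a significance level $\alpha = 10\% = 0.1$. This means that $t_\alpha$ should
satisfy
$\mathrm{Prob}\{t(D_N) > t_\alpha \mid H\} = \mathrm{Prob}\{|\hat{\mu} - \mu_0| > t_\alpha \mid H\} = \mathrm{Prob}\{(\hat{\mu} - \mu_0 > t_\alpha) \text{ OR } (\hat{\mu} - \mu_0 < -t_\alpha) \mid H\} = 0.1$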
TP: example (II)

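For a normal variable $x \sim \mathcal{N}(\mu, \sigma^2)$
$\mathrm{Prob}\{x - \mu > 1.645\sigma\} = 1 - \Phi(1.645) = 0.05$
where $\Phi$ denotes the standard normal distribution function, and consequently
$\mathrm{Prob}\{x - \mu > 1.645\sigma \text{ OR } x - \mu < -1.645\sigma\} = 0.05 + 0.05 = 0.1$
It follows that, being $\hat{\mu} \sim \mathcal{N}(\mu_0, \sigma^2/N)$ (i.e. $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}} \sim \mathcal{N}(0, 1)$), once we put
$t_\alpha = 1.645\,\sigma/\sqrt{N}$
we have
$\mathrm{Prob}\{|\hat{\mu} - \mu_0| > t_\alpha \mid H\} = 0.1$
and that the critical region is
$S_0 = \{D_N : |\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}\}$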
TP: example (III)

Suppose that $\sigma = 0.1$ and that we want to test if $\mu = \mu_0 = 10$ with a
significance level 10%.
After $N = 6$ observations we have $D_N = \{10, 11, 12, 13, 14, 15\}$.
On the basis of the dataset we compute
$\hat{\mu} = \frac{10 + 11 + 12 + 13 + 14 + 15}{6} = 12.5$
and
$t(D_N) = |\hat{\mu} - \mu_0| = 2.5$
Since $t_\alpha = 1.645 \cdot 0.1/\sqrt{6} = 0.0672$, and $t(D_N) > t_\alpha$, the observations
$D_N$ are in the critical region.
The hypothesis is rejected.
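The arithmetic of this example can be reproduced with a few lines of code. This is a sketch in Python (standard library only), using the numbers stated in the example; the variable names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

data = [10, 11, 12, 13, 14, 15]
mu0, sigma, alpha = 10.0, 0.1, 0.1   # values stated in the example

n = len(data)
mu_hat = mean(data)                  # sample average
t_obs = abs(mu_hat - mu0)            # observed test statistic |mu_hat - mu0|

# Two-sided test: alpha/2 in each tail of the standard normal
z = NormalDist().inv_cdf(1 - alpha / 2)
t_alpha = z * sigma / sqrt(n)        # critical value

reject = t_obs > t_alpha
print(mu_hat, t_obs, round(t_alpha, 4), reject)
```

The observed discrepancy (2.5) is far larger than the critical value, so the data fall in the critical region.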
Hypothesis testing: types of error

So far we considered a single hypothesis. Let us now consider two
alternative hypotheses: $H$ and $\bar{H}$.
Type I error.
It is the error we make when we reject $H$ if it is true.
The significance level represents the probability of making the type I error.
Type II error.
It is the error we make when we accept $H$ if it is false.
In order to define this error, we are forced to declare an alternative
hypothesis $\bar{H}$ as a formal definition of what is meant by $H$ being false.
The probability of type II error is the probability that the test leads to
acceptance of $H$ when in fact $\bar{H}$ prevails.
When the alternative hypothesis is composite, there is no unique Type II
error.
An analogy
Consider the analogy with a murder trial, where the suspect is Mr.
Bean.
The null hypothesis $H$ is "Mr. Bean is innocent".
The dataset is the amount of evidence collected by the police against
Mr. Bean.
The Type I error is the error that we make if, Mr. Bean being innocent,
we sentence him to death.
The Type II error is the error that we make if, Mr. Bean being guilty, we
acquit him.
Hypothesis testing
Suppose we have some data $\{z_1, \dots, z_N\} \sim F$ from a distribution $F$.
$H$ and $\bar{H}$ represent two hypotheses about $F$.
On the basis of the data, one is accepted and one is rejected.
Note that the two hypotheses have different philosophical status
(asymmetry).
$H$ is a conservative hypothesis, not to be rejected unless evidence is
clear. This means that a type I error is more serious than a type II error
(benefit of the doubt).
It is often assumed that $F$ belongs to a parametric family $F(z, \theta)$. The
test on $F$ becomes a test on $\theta$.
A particular example of hypothesis test is the goodness of fit test where
we test $H: F = F_0$ against $\bar{H}: F \neq F_0$.
The five steps of hypothesis testing

1. Declare the null (e.g. $H$: honest student) and the alternative hypothesis
(e.g. $\bar{H}$: cheating student).
2. Choose the numeric value $\alpha$ of the type I error (e.g. the risk I want to run).
3. Choose a procedure to obtain the test statistic (e.g. number of similar lines).
4. Determine the critical value of the test statistic (e.g. 4 identical lines) that
leads to a rejection of $H$. This is done in order to ensure the Type I error
defined in Step 2.
5. Obtain the data and determine whether the observed value of the test
statistic leads to an acceptance or rejection of $H$.
Quality of the test

Suppose that
$N$ students took part in the exam,
$N_N$ did not copy,
$N_P$ copied,
$\hat{N}_N$ were considered not guilty and passed the exam,
$\hat{N}_P$ were considered guilty and rejected,
$F_P$ honest students were refused,
$F_N$ cheating students passed.
Confusion matrix
Then we have

                                   Not refused    Refused
$H$: Not guilty student (-)        $T_N$          $F_P$          $N_N$
$\bar{H}$: Guilty student (+)      $F_N$          $T_P$          $N_P$
                                   $\hat{N}_N$    $\hat{N}_P$    $N$

$F_P$ is the number of False Positives and the ratio $F_P/N_N$ represents the
type I error.
$F_N$ is the number of False Negatives and the ratio $F_N/N_P$ represents
the type II error.
Specificity and sensitivity

Specificity:
the ratio (to be maximized)
$SP = \frac{T_N}{F_P + T_N} = \frac{T_N}{N_N} = \frac{N_N - F_P}{N_N} = 1 - \frac{F_P}{N_N}, \quad 0 \le SP \le 1$
It increases by reducing the number of false positives.

Sensitivity:
the ratio (to be maximized)
$SE = \frac{T_P}{T_P + F_N} = \frac{T_P}{N_P} = \frac{N_P - F_N}{N_P} = 1 - \frac{F_N}{N_P}, \quad 0 \le SE \le 1$
It increases by reducing the number of false negatives and corresponds
to the power of the test (i.e. it estimates the quantity 1 - Type II error).
Specificity and sensitivity (II)

There exists a trade-off between these two quantities.
In the case of a test that always returns $H$ (e.g. very kind professor) we
have $\hat{N}_P = 0$, $\hat{N}_N = N$, $F_P = 0$, $T_N = N_N$ and $SP = 1$ but $SE = 0$.
In the case of a test that always returns $\bar{H}$ (e.g. very suspicious
professor) we have $\hat{N}_P = N$, $\hat{N}_N = 0$, $F_N = 0$, $T_P = N_P$ and $SE = 1$ but
$SP = 0$.
False Positive and False Negative Rate

False Positive Rate:
$FPR = 1 - SP = 1 - \frac{T_N}{F_P + T_N} = \frac{F_P}{F_P + T_N} = \frac{F_P}{N_N}, \quad 0 \le FPR \le 1$
It decreases by reducing the number of false positives and estimates the Type I error.
False Negative Rate:
$FNR = 1 - SE = 1 - \frac{T_P}{T_P + F_N} = \frac{F_N}{T_P + F_N} = \frac{F_N}{N_P}, \quad 0 \le FNR \le 1$
It decreases by reducing the number of false negatives.
Predictive value
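Positive Predictive value:
the ratio (to be maximized)
$PPV = \frac{T_P}{T_P + F_P} = \frac{T_P}{\hat{N}_P}, \quad 0 \le PPV \le 1$
Negative Predictive value:
the ratio (to be maximized)
$NPV = \frac{T_N}{T_N + F_N} = \frac{T_N}{\hat{N}_N}, \quad 0 \le NPV \le 1$
False Discovery Rate:
the ratio (to be minimized)
$FDR = \frac{F_P}{T_P + F_P} = \frac{F_P}{\hat{N}_P} = 1 - PPV, \quad 0 \le FDR \le 1$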
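The quality measures introduced in the last slides all derive from the four counts of the confusion matrix. The sketch below, in Python, computes them together; the function name `quality_measures` and the counts in the usage example are hypothetical and serve only as an illustration.

```python
def quality_measures(tn, fp, fn, tp):
    """Return the test-quality ratios computed from the four counts."""
    n_neg = tn + fp        # N_N: cases where H is true (negatives)
    n_pos = fn + tp        # N_P: cases where H is false (positives)
    pred_neg = tn + fn     # cases classified as negative
    pred_pos = tp + fp     # cases classified as positive
    return {
        "SP": tn / n_neg,       # specificity
        "SE": tp / n_pos,       # sensitivity (power)
        "FPR": fp / n_neg,      # false positive rate = 1 - SP
        "FNR": fn / n_pos,      # false negative rate = 1 - SE
        "PPV": tp / pred_pos,   # positive predictive value
        "NPV": tn / pred_neg,   # negative predictive value
        "FDR": fp / pred_pos,   # false discovery rate = 1 - PPV
    }

# Hypothetical exam: 100 honest students and 20 cheating students
m = quality_measures(tn=90, fp=10, fn=5, tp=15)
for name, value in m.items():
    print(f"{name} = {value:.3f}")
```

Note that SP and SE are normalized by the true class sizes, while PPV, NPV and FDR are normalized by the number of cases assigned to each class by the test.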
Receiver Operating Characteristic curve

The Receiver Operating Characteristic (also known as ROC curve) is a plot
of the true positive rate (i.e. sensitivity or power) against the false positive
rate (Type I error) for the different possible decision thresholds of a test.
Consider an example where $t^+ \sim \mathcal{N}(1, 1)$ and $t^- \sim \mathcal{N}(-1, 1)$. Suppose that
the examples are classed as positive if $t > THR$ and negative if $t < THR$,
where $THR$ is a threshold.
If $THR = -\infty$, all the examples are classed as positive: $T_N = F_N = 0$
which implies $SE = \frac{T_P}{N_P} = 1$ and $FPR = \frac{F_P}{F_P + T_N} = 1$.
If $THR = \infty$, all the examples are classed as negative: $T_P = F_P = 0$
which implies $SE = 0$ and $FPR = 0$.
[Figure: ROC curve, SE (vertical axis) against FPR (horizontal axis).]
R script roc.R
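For the Gaussian example above, each point of the ROC curve can be obtained in closed form from the two normal distribution functions. The following Python sketch does this; it is not the `roc.R` script mentioned in the slides, and the names `roc_point` and `thresholds` are assumptions made for illustration.

```python
from statistics import NormalDist

pos = NormalDist(mu=1.0, sigma=1.0)    # distribution of t for positive examples
neg = NormalDist(mu=-1.0, sigma=1.0)   # distribution of t for negative examples

def roc_point(thr):
    """Return (FPR, SE) when examples with t > thr are classed as positive."""
    se = 1 - pos.cdf(thr)    # fraction of positives above the threshold
    fpr = 1 - neg.cdf(thr)   # fraction of negatives above the threshold
    return fpr, se

# Sweep the threshold from high to low: the curve goes from (0, 0) towards (1, 1)
thresholds = [x / 2 for x in range(10, -11, -1)]
curve = [roc_point(thr) for thr in thresholds]

for thr, (fpr, se) in zip(thresholds, curve):
    print(f"THR = {thr:5.1f}   FPR = {fpr:.3f}   SE = {se:.3f}")
```

With real data the two rates would be estimated from counts at each threshold; the closed form is used here only because the two score distributions are assumed known.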
Choice of test
The choice of test and consequently the choice of the partition $\{S_0, S_1\}$ is
based on two steps
1. Define a significance level $\alpha$, that is the probability of type I error
$\mathrm{Prob}\{\text{reject } H \mid H\} = \mathrm{Prob}\{D_N \in S_0 \mid H\} \le \alpha$
that is the probability of incorrectly rejecting $H$
2. Among the set of tests $\{S_0, S_1\}$ of level $\alpha$, choose the test that minimizes
the probability of type II error
$\mathrm{Prob}\{\text{accept } H \mid \bar{H}\} = \mathrm{Prob}\{D_N \in S_1 \mid \bar{H}\}$
that is the probability of incorrectly accepting $H$. This is equivalent to
maximizing the power of the test
$\mathrm{Prob}\{\text{reject } H \mid \bar{H}\} = \mathrm{Prob}\{D_N \in S_0 \mid \bar{H}\} = 1 - \mathrm{Prob}\{D_N \in S_1 \mid \bar{H}\}$
which is the probability of correctly rejecting $H$. The higher the power,
the better!
TP example
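Consider a r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is known and a set of $N$ iid
observations are given.
We want to test the null hypothesis $\mu = \mu_0 = 0$, with $\alpha = 0.1$
Consider the 3 critical regions $S_0$
1. $|\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}$
2. $\hat{\mu} - \mu_0 > 1.282\,\sigma/\sqrt{N}$
3. $|\hat{\mu} - \mu_0| < 0.126\,\sigma/\sqrt{N}$
For all these tests $\mathrm{Prob}\{D_N \in S_0 \mid H\} \le \alpha$ (up to rounding of the critical values), hence the significance level
is the same.
However if $\bar{H}: \mu = 10$ the type II error of the three tests is significantly
different.
What is the best one?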
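One way to explore this question is to compute, for each critical region, the probability of rejecting $H$ both under $H$ (type I error) and under $\bar{H}$ (power). The Python sketch below does so; the slide does not give $\sigma$ and $N$, so the values `sigma = 10` and `n = 4` are assumptions chosen only to make the comparison visible.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1 = 0.0, 10.0     # means under H and under the alternative
sigma, n = 10.0, 4       # ASSUMED values, not given in the example
se = sigma / sqrt(n)     # standard deviation of the sample average

def prob_reject(region, mu):
    """Probability that the sample average falls in S0 when the true mean is mu."""
    dist = NormalDist(mu=mu, sigma=se)
    if region == 1:      # |mu_hat - mu0| > 1.645 * se
        c = 1.645 * se
        return dist.cdf(mu0 - c) + (1 - dist.cdf(mu0 + c))
    if region == 2:      # mu_hat - mu0 > 1.282 * se
        return 1 - dist.cdf(mu0 + 1.282 * se)
    if region == 3:      # |mu_hat - mu0| < 0.126 * se
        c = 0.126 * se
        return dist.cdf(mu0 + c) - dist.cdf(mu0 - c)
    raise ValueError("region must be 1, 2 or 3")

for region in (1, 2, 3):
    size = prob_reject(region, mu0)     # type I error
    power = prob_reject(region, mu1)    # 1 - type II error
    print(f"region {region}: type I = {size:.3f}, power = {power:.3f}, "
          f"type II = {1 - power:.3f}")
```

Under these assumed values the three regions share (approximately) the same type I error, while the one-sided region 2 has the largest power against $\mu = 10$ and region 3 has almost none.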
TP example (II)
[Figure: densities of the test statistic $\hat{\mu}$ under $H$ (centred at 0) and under $\bar{H}$ (centred at 10), with the regions $S_1$ and $S_0$ marked on the horizontal axis.]
On the left: distribution of the test statistic $\hat{\mu}$ if $H: \mu_0 = 0$ is true. On the right: distribution of the
test statistic $\hat{\mu}$ if $\bar{H}: \mu_1 = 10$ is true. The interval marked by $S_1$ denotes the set of observed
$\hat{\mu}$ values for which $H$ is accepted (non-critical region). The interval marked by $S_0$ denotes the set
of observed $\hat{\mu}$ values for which $H$ is rejected (critical region). The area of the black pattern
region on the right equals $\mathrm{Prob}\{D_N \in S_0 \mid H\}$, i.e. the probability of rejecting $H$ when $H$ is true
(Type I error). The area of the grey shaded region on the left equals the probability of accepting
$H$ when $H$ is false (Type II error).
TP example (III)
[Figure: densities of the test statistic $\hat{\mu}$ under $H$ (centred at 0) and under $\bar{H}$ (centred at 10), with two intervals $S_1$ and one interval $S_0$ marked on the horizontal axis.]
On the left: distribution of the test statistic $\hat{\mu}$ if $H: \mu_0 = 0$ is true. On the right: distribution of the
test statistic $\hat{\mu}$ if $\bar{H}: \mu_1 = 10$ is true. The two intervals marked by $S_1$ denote the set of observed
$\hat{\mu}$ values for which $H$ is accepted (non-critical region). The interval marked by $S_0$ denotes the set
of observed $\hat{\mu}$ values for which $H$ is rejected (critical region). The area of the pattern region
equals $\mathrm{Prob}\{D_N \in S_0 \mid H\}$, i.e. the probability of rejecting $H$ when $H$ is true (Type I error).
Which area corresponds to the probability of the Type II error?
Types of parametric tests

Consider random variables with a parametric distribution $F(\cdot, \theta)$.
One-sample vs. two-sample:
in the one-sample test we consider a single r.v.
and we formulate hypotheses about its distribution. In the two-sample
test we consider 2 r.v. $z_1$ and $z_2$ and we formulate hypotheses about their
differences/similarities.
Simple vs composite:
the test is simple if $H$ describes completely the
distributions of the involved r.v., otherwise it is composite.
Single-sided (or one-tailed) vs Two-sided (or two-tailed):
in the single-sided test the
region of rejection concerns only one tail of the null
distribution. This means that $\bar{H}$ indicates the predicted direction of the
difference (e.g. $\bar{H}: \theta > \theta_0$). In the two-sided test, the region of rejection
concerns both tails of the null distribution. This means that $\bar{H}$ does not
indicate the predicted direction of the difference (e.g. $\bar{H}: \theta \neq \theta_0$).
Example of parametric test

Consider a parametric test on the distribution of a Gaussian r.v., and
suppose that the null hypothesis is $H: \theta = \theta_0$ where $\theta_0$ is given and $\theta$
represents the mean.
The test is one-sample and composite.
In order to know whether it is one- or two-sided we have to define the
alternative configuration: if $\bar{H}: \theta < \theta_0$ the test is one-sided down, if
$\bar{H}: \theta > \theta_0$ the test is one-sided up, if $\bar{H}: \theta \neq \theta_0$ the test is double-sided.
z-test (one-sample and one-sided)

Consider a random sample $D_N$ drawn from $x \sim \mathcal{N}(\mu, \sigma^2)$ with $\mu$ unknown and $\sigma^2$ known.
STEP 1:
Consider the null hypothesis and the alternative (composite and one-sided)
$H: \mu = \mu_0; \quad \bar{H}: \mu > \mu_0$
STEP 2:
fix the value $\alpha$ of the type I error.
STEP 3: choose a test statistic:
If $H$ is true then the distribution of $\hat{\mu}$ is $\mathcal{N}(\mu_0, \sigma^2/N)$. This means that the
variable $z$ is
$z = \frac{(\hat{\mu} - \mu_0)\sqrt{N}}{\sigma} \sim \mathcal{N}(0, 1)$
It is convenient to rephrase the test in terms of the test statistic $z$.
z-test (one-sample and one-sided) (II)

STEP 4:
determine the critical value for $z$.
We reject the hypothesis $H$ if $z_N > z_\alpha$ where $z_\alpha$ is such that
$\mathrm{Prob}\{\mathcal{N}(0, 1) > z_\alpha\} = \alpha$.
Ex: for $\alpha = 0.05$ we would take $z_\alpha = 1.645$ since 5% of the standard normal
distribution lies to the right of 1.645.
R command: z_alpha = qnorm(alpha, lower.tail=FALSE)
STEP 5:
Once the dataset $D_N$ is measured, the value of the test statistic is
$z_N = \frac{(\hat{\mu} - \mu_0)\sqrt{N}}{\sigma}$
TP: example z-test

Consider a r.v. $z \sim \mathcal{N}(\mu, 1)$.
We want to test $H: \mu = 5$ against $\bar{H}: \mu > 5$ with significance level 0.05.
Suppose that the data is $D_N = \{5.1, 5.5, 4.9, 5.3\}$.
Then $\hat{\mu} = 5.2$ and $z_N = (5.2 - 5) \cdot 2/1 = 0.4$.
Since this is less than $z_\alpha = 1.645$, we do not reject the null hypothesis.
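The five steps of the one-sided z-test can be traced in code on this example. The sketch below uses Python's standard library in place of the R command quoted earlier; the variable names are illustrative, and the p-value line is an addition not computed in the slide.

```python
from math import sqrt
from statistics import NormalDist, mean

data = [5.1, 5.5, 4.9, 5.3]
mu0, sigma, alpha = 5.0, 1.0, 0.05    # H: mu = 5, known sigma, type I error

mu_hat = mean(data)                                  # sample average
z_n = (mu_hat - mu0) * sqrt(len(data)) / sigma       # observed test statistic
z_alpha = NormalDist().inv_cdf(1 - alpha)            # critical value (upper tail)

# One-sided p-value: probability of a standard normal exceeding z_n
p_value = 1 - NormalDist().cdf(z_n)

reject = z_n > z_alpha
print(round(mu_hat, 2), round(z_n, 3), round(z_alpha, 3),
      round(p_value, 3), reject)
```

Comparing `z_n` with `z_alpha` and comparing `p_value` with `alpha` are two equivalent ways of reaching the same decision.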
Two-sided parametric tests

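Assumption:
all the variables are normal!

Name             one/two sample   known                       $H$                         $\bar{H}$
z-test           one              $\sigma^2$                  $\mu = \mu_0$               $\mu \neq \mu_0$
z-test           two              $\sigma_1^2 = \sigma_2^2$   $\mu_1 = \mu_2$             $\mu_1 \neq \mu_2$
t-test           one                                          $\mu = \mu_0$               $\mu \neq \mu_0$
t-test           two                                          $\mu_1 = \mu_2$             $\mu_1 \neq \mu_2$
$\chi^2$-test    one              $\mu$                       $\sigma^2 = \sigma_0^2$     $\sigma^2 \neq \sigma_0^2$
$\chi^2$-test    one                                          $\sigma^2 = \sigma_0^2$     $\sigma^2 \neq \sigma_0^2$
F-test           two                                          $\sigma_1^2 = \sigma_2^2$   $\sigma_1^2 \neq \sigma_2^2$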
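Student's t-distribution
If $x \sim \mathcal{N}(0, 1)$ and $y \sim \chi^2_N$ are independent then the Student's t-distribution
with $N$ degrees of freedom is the distribution of the r.v.
$z = \frac{x}{\sqrt{y/N}}$
We denote this with $z \sim \mathcal{T}_N$.

If $z_1, \dots, z_N$ are i.i.d. $\mathcal{N}(\mu, \sigma^2)$ then
$\frac{\sqrt{N}(\hat{\mu} - \mu)}{\hat{\sigma}} = \frac{\sqrt{N}(\hat{\mu} - \mu)}{\sqrt{\widehat{SS}/(N - 1)}} \sim \mathcal{T}_{N-1}$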
t-test: one-sample and two-sided

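Consider a random sample from $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ unknown. Let
$H: \mu = \mu_0; \quad \bar{H}: \mu \neq \mu_0$
Let
$t(D_N) = T = \frac{\sqrt{N}(\hat{\mu} - \mu_0)}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(z_i - \hat{\mu})^2}} = \frac{\hat{\mu} - \mu_0}{\sqrt{\hat{\sigma}^2/N}}$
a statistic computed using the data set $D_N$.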
t-test: one-sample and two-sided (II)

It can be shown that if the hypothesis $H$ holds, $T \sim \mathcal{T}_{N-1}$ is a r.v. with a
Student distribution with $N - 1$ degrees of freedom.
The size $\alpha$ t-test consists in rejecting $H$ if
$|T| > k = t_{\alpha/2, N-1}$
where $t_{\alpha/2, N-1}$ is the upper $\alpha/2$ point of a $\mathcal{T}$-distribution on $N - 1$ degrees
of freedom, i.e.
$\mathrm{Prob}\{t_{N-1} > t_{\alpha/2, N-1}\} = \alpha/2$
where $t_{N-1} \sim \mathcal{T}_{N-1}$.
In other terms $H$ is rejected when $|T|$ is large.
R command: t_{alpha/2,N-1} = qt(alpha/2, N-1, lower.tail=FALSE)
TP example
Does jogging lead to a reduction in pulse rate? Eight non-jogging volunteers
engaged in a one-month jogging programme. Their pulses were taken before
and after the programme

pulse rate before    74   86   98  102   78   84   79   70
pulse rate after     70   85   90  110   71   80   69   74
decrease              4    1    8   -8    7    4   10   -4

Suppose that the decreases are samples from $\mathcal{N}(\mu, \sigma^2)$ for some unknown
$\sigma^2$.
We want to test $H: \mu = \mu_0 = 0$ against $\bar{H}: \mu \neq 0$ with a significance $\alpha = 0.05$.
We have $N = 8$, $\hat{\mu} = 2.75$, $T = 1.263$, $t_{\alpha/2, N-1} = 2.365$
Since $|T| \le t_{\alpha/2, N-1}$, the data is not sufficient to reject the hypothesis $H$. In
other terms we do not have enough evidence to show that there is a reduction in
pulse rate.
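The numbers of the jogging example can be recomputed from the raw pulse rates. This Python sketch uses only the standard library, so the critical value 2.365 is taken from the text rather than computed (in R it would come from `qt`); variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

before = [74, 86, 98, 102, 78, 84, 79, 70]
after = [70, 85, 90, 110, 71, 80, 69, 74]

# Paired differences: a positive value is a decrease in pulse rate
decrease = [b - a for b, a in zip(before, after)]

n = len(decrease)
mu0 = 0.0
mu_hat = mean(decrease)
sigma_hat = stdev(decrease)          # sample standard deviation (N - 1 denominator)
t_stat = sqrt(n) * (mu_hat - mu0) / sigma_hat

t_crit = 2.365                       # t_{alpha/2, N-1} for alpha = 0.05, N = 8, as quoted in the text
reject = abs(t_stat) > t_crit
print(decrease, mu_hat, round(t_stat, 3), reject)
```

Working on the differences turns the before/after comparison into a one-sample test on the mean decrease.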
The chi-squared distribution

For $N$ a positive integer, a r.v. $z$ has a $\chi^2_N$ distribution if
$z = x_1^2 + \dots + x_N^2$
where $x_1, x_2, \dots, x_N$ are i.i.d. random variables $\mathcal{N}(0, 1)$.
The probability distribution is a gamma distribution with parameters
$(\frac{1}{2}N, \frac{1}{2})$.
$E[z] = N$ and $\mathrm{Var}[z] = 2N$.
The distribution is called a chi-squared distribution with $N$ degrees of
freedom.
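$\chi^2$-test: one-sample and two-sided

Consider a random sample from $\mathcal{N}(\mu, \sigma^2)$ with $\mu$ known.
Let
$H: \sigma^2 = \sigma_0^2; \quad \bar{H}: \sigma^2 \neq \sigma_0^2$
Let $\widehat{SS} = \sum_i (z_i - \mu)^2$.
It can be shown that if $H$ is true then $\widehat{SS}/\sigma_0^2 \sim \chi^2_N$
The size $\alpha$ $\chi^2$-test rejects $H$ if $\widehat{SS}/\sigma_0^2 < a_1$ or $\widehat{SS}/\sigma_0^2 > a_2$ where
$\mathrm{Prob}\left\{\frac{\widehat{SS}}{\sigma_0^2} < a_1\right\} + \mathrm{Prob}\left\{\frac{\widehat{SS}}{\sigma_0^2} > a_2\right\} = \alpha$
If $\mu$ is unknown, you must
1. replace $\mu$ with $\hat{\mu}$ in the quantity $\widehat{SS}$
2. use a $\chi^2_{N-1}$ distribution.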
t-test: two-samples, two-sided

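Consider two r.v.s $x \sim \mathcal{N}(\mu_1, \sigma^2)$ and $y \sim \mathcal{N}(\mu_2, \sigma^2)$ with the same variance.
Let $D_N^x$ and $D_M^y$ be two independent sets of samples.
We want to test $H: \mu_1 = \mu_2$ against $\bar{H}: \mu_1 \neq \mu_2$.
Let
$\hat{\mu}_x = \frac{\sum_{i=1}^{N} x_i}{N}, \quad SS_x = \sum_{i=1}^{N}(x_i - \hat{\mu}_x)^2, \quad \hat{\mu}_y = \frac{\sum_{i=1}^{M} y_i}{M}, \quad SS_y = \sum_{i=1}^{M}(y_i - \hat{\mu}_y)^2$
Once defined the statistic
$T = \frac{\hat{\mu}_x - \hat{\mu}_y}{\sqrt{\left(\frac{1}{M} + \frac{1}{N}\right)\frac{SS_x + SS_y}{M + N - 2}}} \sim \mathcal{T}_{M+N-2}$
it can be shown that a test of size $\alpha$ rejects $H$ if
$|T| > t_{\alpha/2, M+N-2}$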
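As an illustration of the two-sample statistic, the sketch below applies it to the grades of Section A and Section B given at the beginning of these slides. It is written in Python with the standard library only, so it computes $T$ and the degrees of freedom but not the critical value $t_{\alpha/2, M+N-2}$, which has to be read from a table or obtained with R's `qt`. Applying the equal-variance t-test to these grades is an assumption made for the sake of the example.

```python
from math import sqrt
from statistics import mean

x = [15, 10, 12, 19, 5, 7]     # Section A grades
y = [14, 11, 11, 12, 6, 7]     # Section B grades
n, m = len(x), len(y)

mu_x, mu_y = mean(x), mean(y)
ss_x = sum((xi - mu_x) ** 2 for xi in x)   # sum of squared deviations of x
ss_y = sum((yi - mu_y) ** 2 for yi in y)   # sum of squared deviations of y

# Pooled variance estimate under the equal-variance assumption
pooled_var = (ss_x + ss_y) / (m + n - 2)
t_stat = (mu_x - mu_y) / sqrt((1 / m + 1 / n) * pooled_var)
dof = m + n - 2

print(round(t_stat, 3), dof)
```

The hypothesis $H$ would then be rejected at size $\alpha$ only if `abs(t_stat)` exceeds the upper $\alpha/2$ point of the t-distribution with `dof` degrees of freedom.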
F-distribution
Let $x \sim \chi^2_M$ and $y \sim \chi^2_N$ be two independent r.v. A r.v. $z$ has an F-distribution
$F_{M,N}$ with $M$ and $N$ degrees of freedom if
$z = \frac{x/M}{y/N}$
If $z \sim F_{M,N}$ then $1/z \sim F_{N,M}$.
If $z \sim \mathcal{T}_N$ then $z^2 \sim F_{1,N}$.
F-distribution
[Figure: $F_{M,N}$ density and $F_{M,N}$ cumulative distribution, for M=20 and N=10.]
R script s_f.R.
F-test: two-samples, two-sided

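Consider a random sample $x_1, \dots, x_M$ from $\mathcal{N}(\mu_1, \sigma_1^2)$ and a random sample
$y_1, \dots, y_N$ from $\mathcal{N}(\mu_2, \sigma_2^2)$ with $\mu_1$ and $\mu_2$ unknown. Suppose we want to test
$H: \sigma_1^2 = \sigma_2^2; \quad \bar{H}: \sigma_1^2 \neq \sigma_2^2$
Let us consider the statistic
$f = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} = \frac{\widehat{SS}_1/(M - 1)}{\widehat{SS}_2/(N - 1)} \sim \frac{\sigma_1^2 \chi^2_{M-1}/(M - 1)}{\sigma_2^2 \chi^2_{N-1}/(N - 1)} = \frac{\sigma_1^2}{\sigma_2^2} F_{M-1,N-1}$
It can be shown that if $H$ is true, the ratio $f$ has an F-distribution $F_{M-1,N-1}$
We reject $H$ if the ratio $f$ is large, i.e. $f > F_{\alpha, M-1, N-1}$ where
$\mathrm{Prob}\{z > F_{\alpha, M-1, N-1}\} = \alpha$
if $z \sim F_{M-1,N-1}$.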
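The ratio $f$ is just the ratio of the two sample variances. As a sketch, the Python code below computes it for the grades of Section A and Section B given at the beginning of these slides; using these data for an F-test is an assumption made for illustration, and the critical value $F_{\alpha, M-1, N-1}$ is not computed here because the standard library has no F quantile function.

```python
from statistics import variance

x = [15, 10, 12, 19, 5, 7]     # Section A grades (sample of size M)
y = [14, 11, 11, 12, 6, 7]     # Section B grades (sample of size N)

# statistics.variance uses the (size - 1) denominator, i.e. SS / (size - 1)
var_x = variance(x)
var_y = variance(y)

f_stat = var_x / var_y
dof = (len(x) - 1, len(y) - 1)   # degrees of freedom (M - 1, N - 1)

print(round(var_x, 3), round(var_y, 3), round(f_stat, 3), dof)
```

The value of `f_stat` would then be compared with the upper $\alpha$ point of the $F_{M-1,N-1}$ distribution.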
