
MAT3601

Introduction to data analysis

Supervised classification
Bayes classifier and discriminant analysis

1/39
Machine Learning

2/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

3/39
Supervised Learning

Supervised Learning Framework


+ Input measurement X ∈ 𝒳 (often 𝒳 ⊂ ℝ^d), output measurement Y ∈ 𝒴.
+ The joint distribution of (X, Y) is unknown.
+ Y ∈ {−1, 1} (classification) or Y ∈ ℝ^m (regression).
+ A predictor is a measurable function in ℱ = {f : 𝒳 → 𝒴}.

Training data
+ Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. with the same distribution as (X, Y ).

Goal
+ Construct a good predictor $\hat{f}_n$ from the training data.
+ Need to specify the meaning of good.

4/39
Loss and Probabilistic Framework

Loss function
+ $\ell(Y, f(X))$ measures the quality of the prediction of Y by f(X).
+ Prediction loss: $\ell(Y, f(X)) = \mathbf{1}_{Y \neq f(X)}$.
+ Quadratic loss: $\ell(Y, f(X)) = \|Y - f(X)\|^2$.

Risk function
+ Risk measured as the average loss:
$$R(f) = \mathbb{E}\left[\ell(Y, f(X))\right].$$
+ Prediction loss: $\mathbb{E}\left[\ell(Y, f(X))\right] = \mathbb{P}(Y \neq f(X))$.
+ Quadratic loss: $\mathbb{E}\left[\ell(Y, f(X))\right] = \mathbb{E}\left[\|Y - f(X)\|^2\right]$.
+ Beware: as $\hat{f}_n$ depends on $D_n$, $R(\hat{f}_n)$ is a random variable!

5/39
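To make the empirical counterparts of these risks concrete, here is a minimal R sketch on simulated data (the data, the predictors f_reg and f_class, and all variable names are illustrative choices, not part of the course):

```r
set.seed(1)

# Simulated toy data: X uniform on [0, 1], Y depends on X plus noise
n <- 200
x <- runif(n)
y_reg   <- 2 * x + rnorm(n, sd = 0.3)                    # regression target
y_class <- ifelse(x + rnorm(n, sd = 0.3) > 0.5, 1, -1)   # binary label

# Two hand-picked predictors (placeholders, not learned from data)
f_reg   <- function(x) 2 * x                    # regression predictor
f_class <- function(x) ifelse(x > 0.5, 1, -1)   # classification predictor

# Empirical risks: averages of the losses over the sample
risk_quadratic <- mean((y_reg - f_reg(x))^2)    # quadratic loss
risk_01        <- mean(y_class != f_class(x))   # 0-1 (prediction) loss

c(quadratic = risk_quadratic, zero_one = risk_01)
```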
A robot that learns

A robot endowed with a set of sensors and an online learning algorithm.

+ Task: play football.


+ Performance: score.
+ Experience: current environment and outcome, past games...

6/39
Object recognition in an image

+ Task: say if an object is present or not in the image.


+ Performance: number of errors.
+ Experience: set of previously seen labeled images.

7/39
Number

+ Task: Read a ZIP code from an envelope.


+ Performance: correctness of the number read from the image.
+ Prediction problem with X: image and Y : corresponding number.

8/39
Applications in biology

+ Task: protein interaction network prediction.


+ Goal: predict (unknown) interactions between proteins.
+ Prediction problem with X: a pair of proteins and Y: existence or not of an interaction.

9/39
Detection

+ Goal: detect the position of faces in an image.


+ X: a mask in the image and Y: presence or not of a face...

10/39
Classification

Setting
+ Historical data about individuals i = 1, . . . , n.
+ Feature vector Xi ∈ ℝ^d for each individual i.
+ For each i, the individual belongs to a group (Yi = 1) or not (Yi = −1).
+ Yi ∈ {−1, 1} is the label of i.

Aim
+ Given a new X (with no corresponding label), predict a label in {−1, 1}.
+ Use data Dn = {(x1 , y1 ), . . . , (xn , yn )} to construct a classifier.

11/39
Classification

Geometrically

Learn a boundary to separate two “groups” of points.

12/39
Classification

...many ways to separate points!

13/39
Supervised learning methods

Support Vector Machine

Linear Discriminant Analysis

Logistic Regression

Trees/ Random Forests

Kernel methods

Neural Networks

Many more...

14/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

15/39
Best Solution

The best solution f ∗ (which is independent of Dn ) is

$$f^* = \arg\min_{f \in \mathcal{F}} R(f) = \arg\min_{f \in \mathcal{F}} \mathbb{E}\left[\ell(Y, f(X))\right].$$

Bayes Predictor (explicit solution)


+ Binary classification with 0-1 loss:
$$f^*(X) = \begin{cases} +1 & \text{if } \mathbb{P}(Y = 1 \mid X) > \mathbb{P}(Y = -1 \mid X) \iff \mathbb{P}(Y = 1 \mid X) > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$
+ Regression with the quadratic loss:
$$f^*(X) = \mathbb{E}\left[Y \mid X\right].$$

The explicit solution requires knowing E[Y|X]...

16/39
Plugin Classifier

+ In many cases, the conditional law of Y given X is not known... or relies on parameters to be estimated.
+ An empirical surrogate of the Bayes classifier is obtained from a (possibly nonparametric) estimator $\hat\eta_n$ of
$$\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$$
built using the training dataset.


+ This surrogate is then plugged into the Bayes classifier.

Plugin Bayes Classifier

+ Binary classification with 0-1 loss:
$$\hat{f}_n(X) = \begin{cases} +1 & \text{if } \hat\eta_n(X) > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$

17/39
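To illustrate the plug-in principle, here is a minimal R sketch on simulated data. The choice of logistic regression as the estimate of η is only one possible surrogate for $\hat\eta_n$; the slides do not prescribe a particular estimator, and all names below are illustrative:

```r
set.seed(42)

# Simulated training data (illustrative only, not from the course)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
eta_true <- plogis(2 * x1 - x2)            # true eta(x) = P(Y = 1 | X = x)
y  <- ifelse(runif(n) < eta_true, 1, -1)   # labels in {-1, 1}
train <- data.frame(x1 = x1, x2 = x2, y01 = as.numeric(y == 1))

# Step 1: estimate eta; logistic regression is one possible choice of estimator
fit <- glm(y01 ~ x1 + x2, data = train, family = binomial)

# Step 2: plug the estimate into the Bayes rule (threshold at 1/2)
plugin_classifier <- function(newdata) {
  eta_hat <- predict(fit, newdata = newdata, type = "response")
  ifelse(eta_hat > 1/2, 1, -1)
}

# Empirical misclassification rate on the training sample
mean(plugin_classifier(train) != y)
```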
Plugin Classifier

Input: a data set Dn .


Learn the distribution of Y given X (using the data set) and plug this estimate into the Bayes classifier.

Output: a classifier $\hat{f}_n : \mathbb{R}^d \to \{-1, 1\}$,
$$\hat{f}_n(X) = \begin{cases} +1 & \text{if } \hat\eta_n(X) > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$

+ Can we certify that the plug-in classifier is good?

18/39
Classification Risk Analysis

The misclassification error satisfies (see exercises):
$$0 \le \mathbb{P}(\hat{f}_n(X) \neq Y) - L^\star \le 2\,\mathbb{E}\left[|\eta(X) - \hat\eta_n(X)|^2\right]^{1/2},$$
where
$$L^\star = \mathbb{P}(f^\star(X) \neq Y)$$
and $\hat\eta_n(x)$ is an empirical estimate, based on the training dataset, of
$$\eta(x) = \mathbb{P}(Y = 1 \mid X = x).$$

19/39
How to estimate the conditional law of Y ?

Fully parametric modeling.


Estimate the law of (X, Y ) and use the Bayes formula to deduce an estimate
of the conditional law of Y : LDA/QDA, Naive Bayes...

Parametric conditional modeling.


Estimate the conditional law of Y by a parametric law: linear regression,
logistic regression, Feed Forward Neural Networks...

Nonparametric conditional modeling.


Estimate the conditional law of Y by a nonparametric estimate: kernel methods, nearest neighbors...

20/39
Fully Generative Modeling

If the law of (X, Y) is known, everything can be easy!


Bayes formula
With a slight abuse of notation, if the law of X has a density g with respect to a reference measure,
$$\mathbb{P}(Y = k \mid X) = \frac{g_k(X)\,\mathbb{P}(Y = k)}{g(X)},$$
where $g_k$ is the density of the distribution of X given {Y = k}.

Generative Modeling
Propose a model for (X, Y ).
Plug the conditional law of Y given X in the Bayes classifier.

Remark: this requires modeling the joint law of (X, Y) rather than only the conditional law of Y.
Great flexibility in the model design, but may lead to complex computations.

21/39
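As a small numerical illustration of this formula, assuming made-up one-dimensional Gaussian class densities g_k and priors π_k (values chosen only for the example), the posterior can be computed directly in R:

```r
# Hypothetical generative model in dimension 1 (illustrative values):
# class priors and Gaussian class-conditional densities g_k
pi_k <- c("-1" = 0.6, "1" = 0.4)
mu_k <- c("-1" = 0.0, "1" = 2.0)
sd_k <- c("-1" = 1.0, "1" = 1.5)

# Posterior P(Y = k | X = x) = g_k(x) * P(Y = k) / g(x), g being the mixture density
posterior <- function(x) {
  num <- pi_k * dnorm(x, mean = mu_k, sd = sd_k)  # g_k(x) * P(Y = k)
  num / sum(num)                                  # divide by g(x)
}

posterior(1.0)   # posterior probabilities of the two classes at x = 1
```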
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

22/39
Naive Bayes

Naive Bayes
Classical algorithm using a crude model of P(X|Y):
+ Feature independence assumption:
$$\mathbb{P}(X \mid Y) = \prod_{i=1}^{d} \mathbb{P}\left(X^{(i)} \mid Y\right).$$

+ Simple featurewise model: binomial if binary, multinomial if finite, and Gaussian if continuous.

If all features are continuous, the law of X given Y is Gaussian with a diagonal
covariance matrix!

Very simple learning even in very high dimension!

23/39
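A quick usage sketch with the naive_bayes function from the naivebayes package listed at the end of the slides; the data frame d is simulated and purely illustrative. With all-numeric features this fits the Gaussian model with diagonal covariance described on the next slides:

```r
# install.packages("naivebayes")   # if needed
library(naivebayes)

set.seed(7)

# Simulated data frame: two continuous features, labels in {-1, 1}
n <- 300
y_num <- sample(c(-1, 1), n, replace = TRUE, prob = c(0.6, 0.4))
x1 <- rnorm(n, mean = ifelse(y_num == 1, 2, 0))
x2 <- rnorm(n, mean = ifelse(y_num == 1, -1, 1))
d  <- data.frame(y = factor(y_num), x1 = x1, x2 = x2)

# Gaussian naive Bayes: one univariate Gaussian per feature and per class
nb_fit <- naive_bayes(y ~ x1 + x2, data = d)

# Predicted classes and posterior probabilities for the training points
head(predict(nb_fit, newdata = d[c("x1", "x2")], type = "class"))
head(predict(nb_fit, newdata = d[c("x1", "x2")], type = "prob"))
```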
Gaussian Naive Bayes

+ Feature independence assumption:
$$\mathbb{P}(X \mid Y) = \prod_{j=1}^{d} \mathbb{P}\left(X^{(j)} \mid Y\right).$$
+ For k ∈ {−1, 1}, $\mathbb{P}(Y = k) = \pi_k$ and the conditional density of $X^{(j)}$ given {Y = k} is
$$g_k(x^{(j)}) = (2\pi\sigma_{j,k}^2)^{-1/2} \exp\left(-\frac{(x^{(j)} - \mu_{j,k})^2}{2\sigma_{j,k}^2}\right).$$
+ The conditional distribution of X given {Y = k} is then
$$g_k(x) = \left(\det(2\pi\Sigma_k)\right)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right),$$
where $\Sigma_k = \mathrm{diag}(\sigma_{1,k}^2, \ldots, \sigma_{d,k}^2)$ and $\mu_k = (\mu_{1,k}, \ldots, \mu_{d,k})^T$.

24/39
Gaussian Naive Bayes

+ In a two-class problem, the optimal classifier is (see linear discriminant analysis below):
$$f^* : X \mapsto 2\,\mathbf{1}\{\mathbb{P}(Y = 1 \mid X) > \mathbb{P}(Y = -1 \mid X)\} - 1.$$
+ When the parameters are unknown, they may be replaced by their maximum likelihood estimates.
This yields, for k ∈ {−1, 1},
$$\hat\pi_k^n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}, \qquad
\hat\mu_k^n = \frac{1}{\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}\, X_i,$$
$$\hat\Sigma_k^n = \mathrm{diag}\!\left(\frac{1}{\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}\,(X_i - \hat\mu_k^n)(X_i - \hat\mu_k^n)^T\right).$$

25/39
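These estimates can also be computed by hand in a few lines of R; the sketch below uses simulated data and hypothetical helper names (nb_estimates, log_score) of my own choosing:

```r
set.seed(3)

# Simulated sample: d = 2 features, labels in {-1, 1}
n <- 400
y <- sample(c(-1, 1), n, replace = TRUE)
X <- cbind(rnorm(n, mean = ifelse(y == 1, 1, 0)),
           rnorm(n, mean = ifelse(y == 1, 0, 1)))

# Per-class maximum likelihood estimates for Gaussian naive Bayes
nb_estimates <- function(X, y, k) {
  Xk <- X[y == k, , drop = FALSE]
  list(
    pi     = mean(y == k),    # class proportion
    mu     = colMeans(Xk),    # per-feature means
    # per-feature variances (MLE: divides by the class size n_k)
    sigma2 = apply(Xk, 2, function(col) mean((col - mean(col))^2))
  )
}

est_pos <- nb_estimates(X, y, k = 1)
est_neg <- nb_estimates(X, y, k = -1)

# Classify a new point by comparing log g_k(x) + log pi_k across classes
log_score <- function(x, est) {
  sum(dnorm(x, mean = est$mu, sd = sqrt(est$sigma2), log = TRUE)) + log(est$pi)
}
x_new <- c(0.5, 0.5)
ifelse(log_score(x_new, est_pos) > log_score(x_new, est_neg), 1, -1)
```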
Gaussian Naive Bayes

[Figures: Gaussian naive Bayes examples, slides 26–29]

29/39
Kernel density estimate based Naive Bayes

[Figure: kernel density estimate based naive Bayes example]

30/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

31/39
Discriminant Analysis

Discriminant Analysis (Gaussian model)

The conditional densities are modeled as multivariate normal: for each class k, conditionally on {Y = k},
$$X \sim \mathcal{N}(\mu_k, \Sigma_k).$$

Discriminant functions:
$$g_k(X) = \ln(\mathbb{P}(X \mid Y = k)) + \ln(\mathbb{P}(Y = k)).$$

In a two-class problem, the optimal classifier is (see exercises):
$$f^* : x \mapsto 2\,\mathbf{1}\{g_1(x) > g_{-1}(x)\} - 1.$$

QDA (different Σ_k in each class) and LDA (Σ_k = Σ for all k).

Remark: this model can be wrong, but the methodology remains valid!

32/39
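A minimal R sketch of this decision rule, assuming simulated two-class data and plug-in estimates of the class means, covariances and priors (the multivariate normal log-density is written out by hand; per-class covariances give the QDA variant):

```r
set.seed(11)

# Simulated two-class data in R^2 (illustrative)
n <- 200
y <- sample(c(-1, 1), n, replace = TRUE)
X <- t(sapply(y, function(k) rnorm(2, mean = c(k, -k))))

# Hand-written multivariate normal log-density
log_dmvnorm <- function(x, mu, Sigma) {
  d <- length(mu)
  -0.5 * drop(d * log(2 * pi) + log(det(Sigma)) +
              t(x - mu) %*% solve(Sigma) %*% (x - mu))
}

# Discriminant function g_k(x) = log density + log prior, with plug-in estimates
g_k <- function(x, k) {
  Xk <- X[y == k, , drop = FALSE]
  mu_k    <- colMeans(Xk)
  Sigma_k <- cov(Xk)   # per-class sample covariance (unbiased version; the MLE divides by n_k)
  log_dmvnorm(x, mu_k, Sigma_k) + log(mean(y == k))
}

# QDA-style decision for a new point
x_new <- c(0.5, -0.5)
ifelse(g_k(x_new, 1) > g_k(x_new, -1), 1, -1)
```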
Discriminant Analysis

Estimation
In practice, μ_k, Σ_k and π_k := P(Y = k) have to be estimated.
+ Estimated proportions: $\hat\pi_k = \frac{n_k}{n} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Y_i = k\}}$.
+ Maximum likelihood estimates $\hat\mu_k$ and $\hat\Sigma_k$ (explicit formulas).

The DA classifier then becomes
$$\hat{f}_n(X) = \begin{cases} +1 & \text{if } \hat{g}_1(X) \ge \hat{g}_{-1}(X), \\ -1 & \text{otherwise}. \end{cases}$$

If $\Sigma_{-1} = \Sigma_1 = \Sigma$, then the decision boundary is an affine hyperplane.

33/39
The log-likelihood of the observations is given by
$$\log \mathbb{P}_\theta(X_{1:n}, Y_{1:n}) = \sum_{i=1}^{n} \log \mathbb{P}_\theta(X_i, Y_i)$$
$$= -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log\det(\Sigma)
+ \left(\sum_{i=1}^{n} \mathbf{1}_{Y_i = 1}\right)\log \pi_1
+ \left(\sum_{i=1}^{n} \mathbf{1}_{Y_i = -1}\right)\log(1 - \pi_1)$$
$$- \frac{1}{2}\sum_{i=1}^{n} \mathbf{1}_{Y_i = 1}(X_i - \mu_1)^T \Sigma^{-1}(X_i - \mu_1)
- \frac{1}{2}\sum_{i=1}^{n} \mathbf{1}_{Y_i = -1}(X_i - \mu_{-1})^T \Sigma^{-1}(X_i - \mu_{-1}).$$

This yields, for k ∈ {−1, 1},
$$\hat\pi_k^n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}, \qquad
\hat\mu_k^n = \frac{1}{\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}\, X_i,$$
$$\hat\Sigma^n = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \hat\mu_{Y_i}^n\right)\left(X_i - \hat\mu_{Y_i}^n\right)^T.$$

It remains to plug these estimates into the classification boundary.


34/39
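A direct transcription of these estimators into R might look as follows (simulated data; variable names are mine, not from the slides):

```r
set.seed(5)

# Simulated two-class sample with a common covariance (LDA setting)
n <- 300
y <- sample(c(-1, 1), n, replace = TRUE)
X <- t(sapply(y, function(k) rnorm(2, mean = c(2 * k, 0))))

# Maximum likelihood estimates from the slide
pi_hat <- mean(y == 1)                          # estimate of pi_1
mu_hat <- list("1"  = colMeans(X[y == 1, ]),
               "-1" = colMeans(X[y == -1, ]))   # class means

# Pooled covariance: (1/n) sum of outer products of class-centered observations
centered  <- X - t(sapply(y, function(k) mu_hat[[as.character(k)]]))
Sigma_hat <- crossprod(centered) / n            # (1/n) sum (X_i - mu_{Y_i})(X_i - mu_{Y_i})^T

pi_hat
mu_hat
Sigma_hat
```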
Example: LDA

[Figures: LDA examples, slides 35–36]

36/39
Example: QDA

[Figures: QDA examples, slides 37–38]

38/39
Packages in R

Function svm in package e1071.

Functions lda and qda in package MASS.

Function naive_bayes in package naivebayes.
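For reference, a minimal call pattern for these functions, assuming a data frame d with a factor response y and numeric features x1, x2 (for instance as simulated in the naive Bayes sketch earlier):

```r
library(MASS)
library(e1071)
library(naivebayes)

# Linear and quadratic discriminant analysis
lda_fit <- lda(y ~ x1 + x2, data = d)
qda_fit <- qda(y ~ x1 + x2, data = d)
head(predict(lda_fit, newdata = d)$class)

# Naive Bayes
nb_fit <- naive_bayes(y ~ x1 + x2, data = d)

# Support vector machine (listed on the slide for completeness)
svm_fit <- svm(y ~ x1 + x2, data = d)
head(predict(svm_fit, newdata = d))
```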

39/39
