
MAT3601

Introduction to data analysis

Supervised classification
Bayes classifier and discriminant analysis

1/39
Machine Learning

2/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

3/39
Supervised Learning

Supervised Learning Framework


+ Input measurement X ∈ 𝒳 (often 𝒳 ⊂ ℝ^d), output measurement Y ∈ 𝒴.
+ The joint distribution of (X, Y) is unknown.
+ Y ∈ {−1, 1} (classification) or Y ∈ ℝ^m (regression).
+ A predictor is a measurable function in ℱ = {f : 𝒳 → 𝒴}.

Training data
+ Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. with the same distribution as (X, Y ).

Goal
+ Construct a good predictor $\hat{f}_n$ from the training data.
+ Need to specify the meaning of good.

4/39
Loss and Probabilistic Framework

Loss function
+ $\ell(Y, f(X))$ measures the quality of the prediction of Y by f(X).
+ Prediction loss: $\ell(Y, f(X)) = \mathbf{1}_{Y \neq f(X)}$.
+ Quadratic loss: $\ell(Y, f(X)) = \|Y - f(X)\|^2$.

Risk function
+ Risk measured as the average loss:
$$R(f) = \mathbb{E}\left[\ell(Y, f(X))\right].$$
+ Prediction loss: $\mathbb{E}\left[\ell(Y, f(X))\right] = \mathbb{P}(Y \neq f(X))$.
+ Quadratic loss: $\mathbb{E}\left[\ell(Y, f(X))\right] = \mathbb{E}\left[\|Y - f(X)\|^2\right]$.
+ Beware: as $\hat{f}_n$ depends on $D_n$, $R(\hat{f}_n)$ is a random variable!

5/39
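To make the empirical counterparts of these risks concrete, here is a minimal R sketch on simulated data (the data, the predictors f_reg and f_class, and all variable names are illustrative choices, not part of the course):

```r
set.seed(1)

# Simulated toy data: X uniform on [0, 1], Y depends on X plus noise
n <- 200
x <- runif(n)
y_reg   <- 2 * x + rnorm(n, sd = 0.3)                    # regression target
y_class <- ifelse(x + rnorm(n, sd = 0.3) > 0.5, 1, -1)   # binary label

# Two hand-picked predictors (placeholders, not learned from data)
f_reg   <- function(x) 2 * x                    # regression predictor
f_class <- function(x) ifelse(x > 0.5, 1, -1)   # classification predictor

# Empirical risks: averages of the losses over the sample
risk_quadratic <- mean((y_reg - f_reg(x))^2)    # quadratic loss
risk_01        <- mean(y_class != f_class(x))   # 0-1 (prediction) loss

c(quadratic = risk_quadratic, zero_one = risk_01)
```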
A robot that learns

A robot endowed with a set of sensors and an online learning algorithm.

+ Task: play football.


+ Performance: score.
+ Experience: current environment and outcome, past games...

6/39
Object recognition in an image

+ Task: say if an object is present or not in the image.


+ Performance: number of errors.
+ Experience: set of previously seen labeled images.

7/39
Number

+ Task: Read a ZIP code from an envelope.


+ Performance: correctness of the number read from the image.
+ Prediction problem with X: image and Y : corresponding number.

8/39
Applications in biology

+ Task: protein interaction network prediction.


+ Goal: predict (unknown) interactions between proteins.
+ Prediction problem with X: a pair of proteins and Y: existence or not of an interaction.

9/39
Detection

+ Goal: detect the position of faces in an image.


+ X: a mask in the image and Y: presence or not of a face...

10/39
Classification

Setting
+ Historical data about individuals i = 1, . . . , n.
+ Feature vector Xi ∈ ℝ^d for each individual i.
+ For each i, the individual belongs to a group (Yi = 1) or not (Yi = −1).
+ Yi ∈ {−1, 1} is the label of i.

Aim
+ Given a new X (with no corresponding label), predict a label in {−1, 1}.
+ Use data Dn = {(x1 , y1 ), . . . , (xn , yn )} to construct a classifier.

11/39
Classification

Geometrically

Learn a boundary to separate two “groups” of points.

12/39
Classification

...many ways to separate points!

13/39
Supervised learning methods

Support Vector Machine

Linear Discriminant Analysis

Logistic Regression

Trees/ Random Forests

Kernel methods

Neural Networks

Many more...

14/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

15/39
Best Solution

The best solution f ∗ (which is independent of Dn ) is

$$f^* = \arg\min_{f \in \mathcal{F}} R(f) = \arg\min_{f \in \mathcal{F}} \mathbb{E}\left[\ell(Y, f(X))\right].$$

Bayes Predictor (explicit solution)


+ Binary classification with 0-1 loss:
$$f^*(X) = \begin{cases} +1 & \text{if } \mathbb{P}(Y = 1 \mid X) > \mathbb{P}(Y = -1 \mid X) \iff \mathbb{P}(Y = 1 \mid X) > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$
+ Regression with the quadratic loss:
$$f^*(X) = \mathbb{E}\left[Y \mid X\right].$$

The explicit solution requires knowing E[Y|X]...

16/39
Plugin Classifier

+ In many cases, the conditional law of Y given X is not known... or relies on parameters to be estimated.
+ An empirical surrogate of the Bayes classifier is obtained from a (possibly nonparametric) estimator $\hat\eta_n$ of
$$\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$$
built using the training dataset.


+ This surrogate is then plugged into the Bayes classifier.

Plugin Bayes Classifier

+ Binary classification with 0-1 loss:
$$\hat{f}_n(X) = \begin{cases} +1 & \text{if } \hat\eta_n(X) > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$

17/39
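To illustrate the plug-in principle, here is a minimal R sketch on simulated data. The choice of logistic regression as the estimate of η is only one possible surrogate for $\hat\eta_n$; the slides do not prescribe a particular estimator, and all names below are illustrative:

```r
set.seed(42)

# Simulated training data (illustrative only, not from the course)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
eta_true <- plogis(2 * x1 - x2)            # true eta(x) = P(Y = 1 | X = x)
y  <- ifelse(runif(n) < eta_true, 1, -1)   # labels in {-1, 1}
train <- data.frame(x1 = x1, x2 = x2, y01 = as.numeric(y == 1))

# Step 1: estimate eta; logistic regression is one possible choice of estimator
fit <- glm(y01 ~ x1 + x2, data = train, family = binomial)

# Step 2: plug the estimate into the Bayes rule (threshold at 1/2)
plugin_classifier <- function(newdata) {
  eta_hat <- predict(fit, newdata = newdata, type = "response")
  ifelse(eta_hat > 1/2, 1, -1)
}

# Empirical misclassification rate on the training sample
mean(plugin_classifier(train) != y)
```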
Plugin Classifier

Input: a data set Dn .


Learn the distribution of Y given X (using the data set) and plug this estimate into the Bayes classifier.

Output: a classifier $\hat{f}_n : \mathbb{R}^d \to \{-1, 1\}$,
$$\hat{f}_n(X) = \begin{cases} +1 & \text{if } \hat\eta_n(X) > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$

+ Can we certify that the plug-in classifier is good?

18/39
Classification Risk Analysis

The misclassification error satisfies (see exercises):
$$0 \le \mathbb{P}(\hat{f}_n(X) \neq Y) - L^\star \le 2\,\mathbb{E}\left[|\eta(X) - \hat\eta_n(X)|^2\right]^{1/2},$$
where
$$L^\star = \mathbb{P}(f^\star(X) \neq Y)$$
and $\hat\eta_n(x)$ is an empirical estimate, based on the training dataset, of
$$\eta(x) = \mathbb{P}(Y = 1 \mid X = x).$$

19/39
How to estimate the conditional law of Y ?

Fully parametric modeling.


Estimate the law of (X, Y ) and use the Bayes formula to deduce an estimate
of the conditional law of Y : LDA/QDA, Naive Bayes...

Parametric conditional modeling.


Estimate the conditional law of Y by a parametric law: linear regression,
logistic regression, Feed Forward Neural Networks...

Nonparametric conditional modeling.


Estimate the conditional law of Y by a nonparametric estimate: kernel methods, nearest neighbors...

20/39
Fully Generative Modeling

If the law of (X, Y) is known, everything can be easy!


Bayes formula
With a slight abuse of notation, if the law of X has a density g with respect to a reference measure,
$$\mathbb{P}(Y = k \mid X) = \frac{g_k(X)\,\mathbb{P}(Y = k)}{g(X)},$$
where $g_k$ is the density of the distribution of X given {Y = k}.

Generative Modeling
Propose a model for (X, Y ).
Plug the conditional law of Y given X in the Bayes classifier.

Remark: this requires modeling the joint law of (X, Y) rather than only the conditional law of Y.
Great flexibility in the model design, but may lead to complex computations.

21/39
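As a small numerical illustration of this formula, assuming made-up one-dimensional Gaussian class densities g_k and priors π_k (values chosen only for the example), the posterior can be computed directly in R:

```r
# Hypothetical generative model in dimension 1 (illustrative values):
# class priors and Gaussian class-conditional densities g_k
pi_k <- c("-1" = 0.6, "1" = 0.4)
mu_k <- c("-1" = 0.0, "1" = 2.0)
sd_k <- c("-1" = 1.0, "1" = 1.5)

# Posterior P(Y = k | X = x) = g_k(x) * P(Y = k) / g(x), g being the mixture density
posterior <- function(x) {
  num <- pi_k * dnorm(x, mean = mu_k, sd = sd_k)  # g_k(x) * P(Y = k)
  num / sum(num)                                  # divide by g(x)
}

posterior(1.0)   # posterior probabilities of the two classes at x = 1
```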
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

22/39
Naive Bayes

Naive Bayes
Classical algorithm using a crude model of P(X|Y):
+ Feature independence assumption:
$$\mathbb{P}(X \mid Y) = \prod_{i=1}^{d} \mathbb{P}\left(X^{(i)} \mid Y\right).$$

+ Simple featurewise model: binomial if binary, multinomial if finite, and Gaussian if continuous.

If all features are continuous, the law of X given Y is Gaussian with a diagonal
covariance matrix!

Very simple learning even in very high dimension!

23/39
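A quick usage sketch with the naive_bayes function from the naivebayes package listed at the end of the slides; the data frame d is simulated and purely illustrative. With all-numeric features this fits the Gaussian model with diagonal covariance described on the next slides:

```r
# install.packages("naivebayes")   # if needed
library(naivebayes)

set.seed(7)

# Simulated data frame: two continuous features, labels in {-1, 1}
n <- 300
y_num <- sample(c(-1, 1), n, replace = TRUE, prob = c(0.6, 0.4))
x1 <- rnorm(n, mean = ifelse(y_num == 1, 2, 0))
x2 <- rnorm(n, mean = ifelse(y_num == 1, -1, 1))
d  <- data.frame(y = factor(y_num), x1 = x1, x2 = x2)

# Gaussian naive Bayes: one univariate Gaussian per feature and per class
nb_fit <- naive_bayes(y ~ x1 + x2, data = d)

# Predicted classes and posterior probabilities for the training points
head(predict(nb_fit, newdata = d[c("x1", "x2")], type = "class"))
head(predict(nb_fit, newdata = d[c("x1", "x2")], type = "prob"))
```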
Gaussian Naive Bayes

+ Feature independence assumption:
$$\mathbb{P}(X \mid Y) = \prod_{j=1}^{d} \mathbb{P}\left(X^{(j)} \mid Y\right).$$
+ For k ∈ {−1, 1}, $\mathbb{P}(Y = k) = \pi_k$ and the conditional density of $X^{(j)}$ given {Y = k} is
$$g_k(x^{(j)}) = (2\pi\sigma_{j,k}^2)^{-1/2} \exp\left(-\frac{(x^{(j)} - \mu_{j,k})^2}{2\sigma_{j,k}^2}\right).$$
+ The conditional distribution of X given {Y = k} is then
$$g_k(x) = \left(\det(2\pi\Sigma_k)\right)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right),$$
where $\Sigma_k = \mathrm{diag}(\sigma_{1,k}^2, \ldots, \sigma_{d,k}^2)$ and $\mu_k = (\mu_{1,k}, \ldots, \mu_{d,k})^T$.

24/39
Gaussian Naive Bayes

+ In a two-class problem, the optimal classifier is (see linear discriminant analysis below):
$$f^* : X \mapsto 2\,\mathbf{1}\{\mathbb{P}(Y = 1 \mid X) > \mathbb{P}(Y = -1 \mid X)\} - 1.$$
+ When the parameters are unknown, they may be replaced by their maximum likelihood estimates.
This yields, for k ∈ {−1, 1},
$$\hat\pi_k^n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}, \qquad
\hat\mu_k^n = \frac{1}{\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}\, X_i,$$
$$\hat\Sigma_k^n = \mathrm{diag}\!\left(\frac{1}{\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}\,(X_i - \hat\mu_k^n)(X_i - \hat\mu_k^n)^T\right).$$

25/39
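These estimates can also be computed by hand in a few lines of R; the sketch below uses simulated data and hypothetical helper names (nb_estimates, log_score) of my own choosing:

```r
set.seed(3)

# Simulated sample: d = 2 features, labels in {-1, 1}
n <- 400
y <- sample(c(-1, 1), n, replace = TRUE)
X <- cbind(rnorm(n, mean = ifelse(y == 1, 1, 0)),
           rnorm(n, mean = ifelse(y == 1, 0, 1)))

# Per-class maximum likelihood estimates for Gaussian naive Bayes
nb_estimates <- function(X, y, k) {
  Xk <- X[y == k, , drop = FALSE]
  list(
    pi     = mean(y == k),    # class proportion
    mu     = colMeans(Xk),    # per-feature means
    # per-feature variances (MLE: divides by the class size n_k)
    sigma2 = apply(Xk, 2, function(col) mean((col - mean(col))^2))
  )
}

est_pos <- nb_estimates(X, y, k = 1)
est_neg <- nb_estimates(X, y, k = -1)

# Classify a new point by comparing log g_k(x) + log pi_k across classes
log_score <- function(x, est) {
  sum(dnorm(x, mean = est$mu, sd = sqrt(est$sigma2), log = TRUE)) + log(est$pi)
}
x_new <- c(0.5, 0.5)
ifelse(log_score(x_new, est_pos) > log_score(x_new, est_neg), 1, -1)
```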
Gaussian Naive Bayes

[Figures: Gaussian naive Bayes examples, slides 26–29]

29/39
Kernel density estimate based Naive Bayes

[Figure: kernel density estimate based naive Bayes example]

30/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

31/39
Discriminant Analysis

Discriminant Analysis (Gaussian model)

The conditional densities are modeled as multivariate normal: for each class k, conditionally on {Y = k},
$$X \sim \mathcal{N}(\mu_k, \Sigma_k).$$

Discriminant functions:
$$g_k(X) = \ln(\mathbb{P}(X \mid Y = k)) + \ln(\mathbb{P}(Y = k)).$$

In a two-class problem, the optimal classifier is (see exercises):
$$f^* : x \mapsto 2\,\mathbf{1}\{g_1(x) > g_{-1}(x)\} - 1.$$

QDA (different Σ_k in each class) and LDA (Σ_k = Σ for all k).

Remark: this model can be wrong, but the methodology remains valid!

32/39
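A minimal R sketch of this decision rule, assuming simulated two-class data and plug-in estimates of the class means, covariances and priors (the multivariate normal log-density is written out by hand; per-class covariances give the QDA variant):

```r
set.seed(11)

# Simulated two-class data in R^2 (illustrative)
n <- 200
y <- sample(c(-1, 1), n, replace = TRUE)
X <- t(sapply(y, function(k) rnorm(2, mean = c(k, -k))))

# Hand-written multivariate normal log-density
log_dmvnorm <- function(x, mu, Sigma) {
  d <- length(mu)
  -0.5 * drop(d * log(2 * pi) + log(det(Sigma)) +
              t(x - mu) %*% solve(Sigma) %*% (x - mu))
}

# Discriminant function g_k(x) = log density + log prior, with plug-in estimates
g_k <- function(x, k) {
  Xk <- X[y == k, , drop = FALSE]
  mu_k    <- colMeans(Xk)
  Sigma_k <- cov(Xk)   # per-class sample covariance (unbiased version; the MLE divides by n_k)
  log_dmvnorm(x, mu_k, Sigma_k) + log(mean(y == k))
}

# QDA-style decision for a new point
x_new <- c(0.5, -0.5)
ifelse(g_k(x_new, 1) > g_k(x_new, -1), 1, -1)
```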
Discriminant Analysis

Estimation
In practice, μ_k, Σ_k and π_k := P(Y = k) have to be estimated.
+ Estimated proportions: $\hat\pi_k = \frac{n_k}{n} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Y_i = k\}}$.
+ Maximum likelihood estimates $\hat\mu_k$ and $\hat\Sigma_k$ (explicit formulas).

The DA classifier then becomes
$$\hat{f}_n(X) = \begin{cases} +1 & \text{if } \hat{g}_1(X) \ge \hat{g}_{-1}(X), \\ -1 & \text{otherwise}. \end{cases}$$

If $\Sigma_{-1} = \Sigma_1 = \Sigma$, then the decision boundary is an affine hyperplane.

33/39
The log-likelihood of the observations is given by
$$\log \mathbb{P}_\theta(X_{1:n}, Y_{1:n}) = \sum_{i=1}^{n} \log \mathbb{P}_\theta(X_i, Y_i)$$
$$= -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log\det(\Sigma)
+ \left(\sum_{i=1}^{n} \mathbf{1}_{Y_i = 1}\right)\log \pi_1
+ \left(\sum_{i=1}^{n} \mathbf{1}_{Y_i = -1}\right)\log(1 - \pi_1)$$
$$- \frac{1}{2}\sum_{i=1}^{n} \mathbf{1}_{Y_i = 1}(X_i - \mu_1)^T \Sigma^{-1}(X_i - \mu_1)
- \frac{1}{2}\sum_{i=1}^{n} \mathbf{1}_{Y_i = -1}(X_i - \mu_{-1})^T \Sigma^{-1}(X_i - \mu_{-1}).$$

This yields, for k ∈ {−1, 1},
$$\hat\pi_k^n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}, \qquad
\hat\mu_k^n = \frac{1}{\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}}\sum_{i=1}^{n} \mathbf{1}_{Y_i = k}\, X_i,$$
$$\hat\Sigma^n = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \hat\mu_{Y_i}^n\right)\left(X_i - \hat\mu_{Y_i}^n\right)^T.$$

It remains to plug these estimates into the classification boundary.


34/39
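A direct transcription of these estimators into R might look as follows (simulated data; variable names are mine, not from the slides):

```r
set.seed(5)

# Simulated two-class sample with a common covariance (LDA setting)
n <- 300
y <- sample(c(-1, 1), n, replace = TRUE)
X <- t(sapply(y, function(k) rnorm(2, mean = c(2 * k, 0))))

# Maximum likelihood estimates from the slide
pi_hat <- mean(y == 1)                          # estimate of pi_1
mu_hat <- list("1"  = colMeans(X[y == 1, ]),
               "-1" = colMeans(X[y == -1, ]))   # class means

# Pooled covariance: (1/n) sum of outer products of class-centered observations
centered  <- X - t(sapply(y, function(k) mu_hat[[as.character(k)]]))
Sigma_hat <- crossprod(centered) / n            # (1/n) sum (X_i - mu_{Y_i})(X_i - mu_{Y_i})^T

pi_hat
mu_hat
Sigma_hat
```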
Example: LDA

[Figures: LDA examples, slides 35–36]

36/39
Example: QDA

[Figures: QDA examples, slides 37–38]

38/39
Packages in R

Function svm in package e1071.

Functions lda and qda in package MASS.

Function naive_bayes in package naivebayes.
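For reference, a minimal call pattern for these functions, assuming a data frame d with a factor response y and numeric features x1, x2 (for instance as simulated in the naive Bayes sketch earlier):

```r
library(MASS)
library(e1071)
library(naivebayes)

# Linear and quadratic discriminant analysis
lda_fit <- lda(y ~ x1 + x2, data = d)
qda_fit <- qda(y ~ x1 + x2, data = d)
head(predict(lda_fit, newdata = d)$class)

# Naive Bayes
nb_fit <- naive_bayes(y ~ x1 + x2, data = d)

# Support vector machine (listed on the slide for completeness)
svm_fit <- svm(y ~ x1 + x2, data = d)
head(predict(svm_fit, newdata = d))
```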

39/39
