Supervised classification
Bayes classifier and discriminant analysis
Machine Learning
Outline
Naive Bayes
Supervised Learning
Training data
+ D_n = {(X_1, Y_1), . . . , (X_n, Y_n)} i.i.d. with the same distribution as (X, Y).
Goal
+ Construct a good predictor f̂_n from the training data.
+ Need to specify the meaning of good.
Loss and Probabilistic Framework
Loss function
+ ℓ(Y, f(X)) measures the cost of predicting Y by f(X).
+ Prediction loss: ℓ(Y, f(X)) = 1_{Y ≠ f(X)}.
+ Quadratic loss: ℓ(Y, f(X)) = ‖Y − f(X)‖².
Risk function
+ Risk measured as the average loss:
    R(f) = E[ℓ(Y, f(X))].
+ Beware: as f̂_n depends on D_n, R(f̂_n) is a random variable!
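As a concrete illustration, here is a minimal R sketch (hypothetical data and classifier, not from the slides) that estimates the risk of a fixed classifier by the average prediction loss over a sample:

# Estimate R(f) = P(Y != f(X)) by the empirical average of the 0-1 loss.
set.seed(1)
n <- 1000
X <- rnorm(n)                          # one-dimensional features
Y <- ifelse(X + rnorm(n) > 0, 1, -1)   # labels correlated with X
f <- function(x) ifelse(x > 0, 1, -1)  # a fixed classifier
mean(Y != f(X))                        # empirical 0-1 risk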
Example applications (figure slides): a robot that learns, object recognition in an image, number recognition, applications in biology, detection.
Classification
Setting
+ Historical data about individuals i = 1, . . . , n.
+ Feature vector X_i ∈ R^d for each individual i.
+ For each i, the individual belongs to a group (Y_i = 1) or not (Y_i = −1).
+ Y_i ∈ {−1, 1} is the label of i.
Aim
+ Given a new X (with no corresponding label), predict a label in {−1, 1}.
+ Use the data D_n = {(x_1, y_1), . . . , (x_n, y_n)} to construct a classifier.
Classification, geometrically (figure slides).
Supervised learning methods
Logistic Regression
Kernel methods
Neural Networks
Many more...
Best Solution
The best possible classifier is the Bayes classifier f*, defined from the regression function η(X) = P(Y = 1 | X):
    f*(X) = +1 if η(X) > 1/2, −1 otherwise.
No classifier has a smaller risk: L* := R(f*) is the Bayes risk.
Plugin Classifier
Estimate η(X) = P(Y = 1 | X) by η̂_n(X), and plug the estimate into the Bayes classifier.
Output: a classifier f̂_n : R^d → {−1, 1},
    f̂_n(X) = +1 if η̂_n(X) > 1/2, −1 otherwise.
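As a concrete illustration, a minimal R sketch of the plugin rule, with a logistic regression standing in as one possible (hypothetical) estimator η̂_n:

# Turn any estimate eta_hat of P(Y = 1 | X = x) into a plugin classifier
# by thresholding at 1/2.
plugin_classifier <- function(eta_hat) {
  function(x) ifelse(eta_hat(x) > 1/2, 1, -1)
}
# Example with simulated data and a logistic-regression estimate of eta:
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))       # P(Y = 1 | X = x) = plogis(2x)
fit <- glm(y ~ x, family = binomial)   # eta estimated by logistic regression
eta_hat <- function(x) predict(fit, data.frame(x = x), type = "response")
f_hat <- plugin_classifier(eta_hat)
f_hat(c(-1, 0.2, 3))                   # predicted labels in {-1, 1}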
Classification Risk Analysis
    0 ≤ P(f̂_n(X) ≠ Y) − L* ≤ 2 E[ |η(X) − η̂_n(X)|² ]^{1/2},
where
    L* = P(f*(X) ≠ Y)
is the Bayes risk. In words: a good estimate of η yields a good plugin classifier.
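A standard argument behind this bound (a sketch; when f̂_n(x) ≠ f*(x), the values η(x) and η̂_n(x) lie on opposite sides of 1/2, so |η(x) − 1/2| ≤ |η(x) − η̂_n(x)|):

    P(f̂_n(X) ≠ Y) − L* = E[ |2η(X) − 1| 1_{f̂_n(X) ≠ f*(X)} ]
                        ≤ 2 E[ |η(X) − η̂_n(X)| ]
                        ≤ 2 E[ |η(X) − η̂_n(X)|² ]^{1/2},

where the last step is Jensen's (or Cauchy–Schwarz) inequality.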
How to estimate the conditional law of Y?
Fully Generative Modeling
Propose a model for the pair (X, Y).
Plug the conditional law of Y given X into the Bayes classifier.
Remark: this requires modeling the joint law of (X, Y) rather than only the conditional law of Y given X.
Great flexibility in the model design, but it may lead to complex computations.
Naive Bayes
Classical algorithm relying on a crude model for P(X|Y):
+ Feature independence assumption:
    P(X|Y) = ∏_{i=1}^d P(X^{(i)} | Y).
+ If all features are continuous, a classical choice models each X^{(i)} given Y as Gaussian, so that the law of X given Y is Gaussian with a diagonal covariance matrix.
Gaussian Naive Bayes
The density of the j-th feature given Y = k is then
    g_k(x^{(j)}) = (2π σ²_{j,k})^{−1/2} exp( −(x^{(j)} − µ_{j,k})² / (2σ²_{j,k}) ),
where Σ_k = diag(σ²_{1,k}, . . . , σ²_{d,k}) and µ_k = (µ_{1,k}, . . . , µ_{d,k})^T.
Gaussian Naive Bayes
+ When the parameters are unknown, they may be replaced by their maximum likelihood estimates.
This yields, for k ∈ {−1, 1}, with n_k = Σ_{i=1}^n 1_{Y_i = k}:
    π̂_k^n = n_k / n,
    µ̂_k^n = (1/n_k) Σ_{i=1}^n 1_{Y_i = k} X_i,
    Σ̂_k^n = diag( (1/n_k) Σ_{i=1}^n 1_{Y_i = k} (X_i − µ̂_k^n)(X_i − µ̂_k^n)^T ).
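A from-scratch R sketch of these estimators and the resulting classifier (a minimal illustration on simulated data, not an optimized implementation):

# Gaussian naive Bayes: class-wise estimates of pi_k, mu_k and the diagonal
# of Sigma_k, then classification by the largest pi_k * prod_j g_k(x^(j)).
gnb_fit <- function(X, y) {
  lapply(unique(y), function(k) {
    Xk <- X[y == k, , drop = FALSE]
    list(k   = k,
         pi  = mean(y == k),        # pi_hat_k = n_k / n
         mu  = colMeans(Xk),        # mu_hat_k
         var = apply(Xk, 2, var))   # diag of Sigma_hat_k (sample variance;
  })                                # the exact MLE divides by n_k, not n_k - 1)
}
gnb_predict <- function(fit, x) {
  scores <- sapply(fit, function(p)
    log(p$pi) + sum(dnorm(x, p$mu, sqrt(p$var), log = TRUE)))
  fit[[which.max(scores)]]$k        # class with the largest posterior score
}
# Usage on simulated data:
set.seed(3)
n <- 200
y <- sample(c(-1, 1), n, replace = TRUE)
X <- cbind(rnorm(n, mean = y), rnorm(n, mean = -y))
fit <- gnb_fit(X, y)
gnb_predict(fit, c(1.5, -1.2))      # predicted label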
Gaussian Naive Bayes: illustration figures.
Kernel density estimate based Naive Bayes
The Gaussian model for each feature can be replaced by a kernel density estimate of its conditional law given the class.
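A minimal R sketch of this per-feature ingredient, using stats::density for the kernel density estimate and linear interpolation to evaluate it at new points:

# Kernel density estimate of one feature within one class, returned as a
# function that can be evaluated at new points (a simple stand-in for g_k).
kde_density <- function(x_train) {
  d <- density(x_train)   # Gaussian kernel, automatic bandwidth
  function(x_new) approx(d$x, d$y, xout = x_new, rule = 2)$y
}
# Usage: density of feature 1 within class +1, evaluated at three points.
set.seed(4)
g1_pos <- kde_density(rnorm(100, mean = 1))
g1_pos(c(0, 1, 2))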
Discriminant Analysis
Model: the law of X given Y = k is Gaussian, X | Y = k ∼ N(µ_k, Σ_k).
Discriminant functions:
    δ_k(x) = log π_k − (1/2) log det(Σ_k) − (1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k)
(up to a constant common to all classes), and the predicted class is the one with the largest δ_k(x).
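A minimal R sketch of such a discriminant function, using stats::mahalanobis for the quadratic term (parameters assumed already estimated; names are illustrative):

# Discriminant score delta_k(x) for one class from estimated parameters;
# classify by computing the score of each class and taking the largest.
delta_k <- function(x, pi_k, mu_k, Sigma_k) {
  log(pi_k) -
    0.5 * determinant(Sigma_k, logarithm = TRUE)$modulus -  # log det(Sigma_k)
    0.5 * mahalanobis(x, center = mu_k, cov = Sigma_k)      # quadratic term
}
# Usage with toy parameters:
delta_k(c(0, 0), pi_k = 0.5, mu_k = c(1, -1), Sigma_k = diag(2))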
Discriminant Analysis
Estimation
In practice, µ_k, Σ_k and π_k := P(Y = k) have to be estimated.
+ Estimated proportions: π̂_k = n_k/n = (1/n) Σ_{i=1}^n 1_{Y_i = k}.
In the model with a common covariance matrix Σ (LDA), the loglikelihood of the observations is given by
    log P_θ(X_{1:n}, Y_{1:n}) = Σ_{i=1}^n log P_θ(X_i, Y_i)
      = −(nd/2) log(2π) − (n/2) log det(Σ)
        + ( Σ_{i=1}^n 1_{Y_i = 1} ) log π_1 + ( Σ_{i=1}^n 1_{Y_i = −1} ) log(1 − π_1)
        − (1/2) Σ_{i=1}^n 1_{Y_i = 1} (X_i − µ_1)^T Σ^{−1} (X_i − µ_1)
        − (1/2) Σ_{i=1}^n 1_{Y_i = −1} (X_i − µ_{−1})^T Σ^{−1} (X_i − µ_{−1}).
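For concreteness, a direct R transcription of this loglikelihood (a sketch; variable names are illustrative):

# Loglikelihood of (X_i, Y_i), i = 1..n, under the common-covariance model:
# proportions (pi1, 1 - pi1), class means mu1 and mum1, shared covariance Sigma.
lda_loglik <- function(X, y, pi1, mu1, mum1, Sigma) {
  n <- nrow(X); d <- ncol(X)
  Sinv <- solve(Sigma)
  quad <- function(Xc, mu) {            # sum over rows of (x - mu)^T Sinv (x - mu)
    D <- sweep(Xc, 2, mu)
    sum(rowSums((D %*% Sinv) * D))
  }
  -(n * d / 2) * log(2 * pi) -
    (n / 2) * determinant(Sigma, logarithm = TRUE)$modulus +
    sum(y == 1) * log(pi1) + sum(y == -1) * log(1 - pi1) -
    0.5 * quad(X[y == 1, , drop = FALSE], mu1) -
    0.5 * quad(X[y == -1, , drop = FALSE], mum1)
}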
Example figures: LDA and QDA.
Packages in R
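Classical R implementations of these methods include MASS (functions lda and qda) and e1071 (function naiveBayes); whether these are the packages listed on the original slide is an assumption. A minimal usage sketch on the built-in iris data:

# Assumed package choices: MASS::lda, MASS::qda and e1071::naiveBayes.
library(MASS)
library(e1071)
fit_lda <- lda(Species ~ ., data = iris)
fit_qda <- qda(Species ~ ., data = iris)
fit_nb  <- naiveBayes(Species ~ ., data = iris)   # Gaussian naive Bayes
predict(fit_lda, iris[1:3, ])$class               # LDA predictions
predict(fit_qda, iris[1:3, ])$class               # QDA predictions
predict(fit_nb,  iris[1:3, ])                     # naive Bayes predictions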