
INT3405 Machine Learning

Lecture 3 - Classification

Ta Viet Cuong, Le Duc Trong, Tran Quoc Long


TA: Le Bang Giang, Tran Truong Thuy

VNU-UET

2022

1 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

2 / 33
Recap: Logistic Regression - Binary Classification
Data
x ∈ R^d, y ∈ {0, 1}
Example: image S × S → d = S^2 = 784 (MNIST)

D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}

Model
f(x) = w^T x + w_0
Y | X = x ∼ Ber(y | σ(f(x)))
Sigmoid Function
σ(z) = 1 / (1 + e^{−z})
Parameter
θ = (w, w_0)

3 / 33
Recap: Logistic Regression - Binary cross entropy loss

Training
Using the MLE principle, compute the likelihood

L(w, w_0) = P(D) = ∏_{i=1}^{n} P(y_i | x_i) = ∏_{i=1}^{n} μ_i^{y_i} (1 − μ_i)^{1 − y_i}

where μ_i = σ(f(x_i)).

Negative log-likelihood (NLL)

ℓ(w, w_0) = −log L(w, w_0) = ∑_{i=1}^{n} [ −y_i log μ_i − (1 − y_i) log(1 − μ_i) ]

⇒ binary cross-entropy (BCE) loss function

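A minimal NumPy sketch of the BCE loss for a batch of predicted probabilities; the epsilon clipping is an implementation detail added here to avoid log(0), not something from the slides.

    import numpy as np

    def bce_loss(y, mu, eps=1e-12):
        # ell(w, w_0) = sum_i [ -y_i log mu_i - (1 - y_i) log(1 - mu_i) ]
        mu = np.clip(mu, eps, 1 - eps)
        return np.sum(-y * np.log(mu) - (1 - y) * np.log(1 - mu))

    # Example: bce_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6]))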
4 / 33
Recap: Training Algorithm - Gradient Descent

TrainLR-GD(D, λ) → (w, w_0):
1. Initialize: w = 0 ∈ R^d, w_0 = 0
2. Loop epoch = 1, 2, . . .
a. Calculate ℓ(w, w_0)
b. Calculate the gradients ∇_w ℓ, ∇_{w_0} ℓ
c. Update the parameters

w ← w − λ ∇_w ℓ(w, w_0)
w_0 ← w_0 − λ ∇_{w_0} ℓ(w, w_0)

3. Stop when:
a. the epoch count is large enough,
b. the loss decreases negligibly, or
c. the gradients ∥∇_w ℓ∥, |∇_{w_0} ℓ| are small enough
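A minimal NumPy sketch of this training loop for binary logistic regression; the learning rate, epoch cap, and tolerance are illustrative choices, not from the slides, and the gradients here are averaged over the batch rather than summed.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_lr_gd(X, y, lr=0.1, epochs=1000, tol=1e-6):
        # Gradient descent for binary logistic regression with the BCE loss.
        n, d = X.shape
        w, w0 = np.zeros(d), 0.0
        prev_loss = np.inf
        for epoch in range(epochs):
            mu = sigmoid(X @ w + w0)                      # mu_i = sigma(f(x_i))
            loss = np.mean(-y * np.log(mu + 1e-12) - (1 - y) * np.log(1 - mu + 1e-12))
            grad_w = X.T @ (mu - y) / n                   # averaged gradient w.r.t. w
            grad_w0 = np.mean(mu - y)                     # averaged gradient w.r.t. w_0
            w -= lr * grad_w                              # update step
            w0 -= lr * grad_w0
            if abs(prev_loss - loss) < tol:               # stop: negligible decrease
                break
            prev_loss = loss
        return w, w0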

5 / 33
Multi-class logistic regression model

Given a label set Y = {1, 2, . . . , C }


Categorical distribution
A random variable Y ∼ Cat(y | θ_1, θ_2, . . . , θ_C) means that

P(Y = c) = θ_c,   c = 1, 2, . . . , C

where θ_c is the probability of category c and ∑_c θ_c = 1.

Example: a six-sided die with C = 6, θ_c = 1/6, ∀c

P(Y = y) = ∏_{c=1}^{C} θ_c^{I(c = y)}

where I(c = y) is the indicator function denoting whether c = y or not.
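A tiny NumPy illustration of the fair-die example above; the variable names and sample size are purely illustrative.

    import numpy as np

    theta = np.full(6, 1 / 6)                                       # fair six-sided die, C = 6
    samples = np.random.choice(np.arange(1, 7), size=10, p=theta)   # draws from Cat(theta)
    print(theta[2])                                                 # P(Y = 3) = theta_3 = 1/6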

6 / 33
Example Dataset with Categorical labels
Iris dataset:
▶ Number of Instances: 150
▶ Number of Attributes: 4 (sepal length/width in centimeters,
petal length/width in centimeters)
▶ Number of classes: 3
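As a quick illustration (not part of the slides), the Iris dataset can be loaded with scikit-learn, assuming it is installed:

    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target   # X: (150, 4) feature matrix, y: labels in {0, 1, 2}
    print(X.shape, set(y))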

7 / 33
Example Dataset with Categorical labels
MNIST dataset:
▶ Number of Instances: 60,000 training images and 10,000
testing images.
▶ Number of Attributes: 28 × 28
▶ Number of classes: 10

Figure: Sample images from MNIST test dataset (source: Wikipedia)

8 / 33
Multi-class logistic regression model
Model
Linear functions with parameters w_c ∈ R^d, w_{c0} ∈ R, c = 1, 2, . . . , C:

f_c(x) = w_c^T x + w_{c0}

Apply the softmax function to convert the logits to probabilities:

S(z_1, z_2, . . . , z_C) = [ e^{z_c} / ∑_{c′=1}^{C} e^{z_{c′}} ]_{c=1,2,...,C}

Probability model

Y | X = x ∼ Cat(y | S(f_1(x), f_2(x), . . . , f_C(x)))

where f_c(x) is called a logit. In logistic regression the f_c(x) are
(simple) linear functions; more sophisticated methods use neural
networks.
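A minimal NumPy sketch of the softmax and the resulting class probabilities for linear logits; subtracting the row maximum is a standard numerical-stability trick, not part of the slides, and the shapes are illustrative assumptions.

    import numpy as np

    def softmax(Z):
        # Map logits (shape (C,) or (n, C)) to probabilities that sum to 1.
        Z = Z - np.max(Z, axis=-1, keepdims=True)   # numerical stability
        E = np.exp(Z)
        return E / np.sum(E, axis=-1, keepdims=True)

    def predict_proba(X, W, w0):
        # mu[i, c] = P(Y = c | x_i) with logits f_c(x) = w_c^T x + w_c0; W: (C, d), w0: (C,).
        return softmax(X @ W.T + w0)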
9 / 33
Multi-class logistic regression model

Inference
h_LR(x): choose the class c with the largest predicted probability, h_LR(x) = argmax_c P(Y = c | X = x).

Figure: Multi-class logistic regression with Softmax

10 / 33
Multi-class logistic regression model
Training
The likelihood of the parameters with respect to the dataset
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is

L(W) = P(D) = ∏_{i=1}^{n} P(Y = y_i | x_i) = ∏_{i=1}^{n} ∏_{c=1}^{C} μ_{ic}^{y_{ic}}

where μ_{ic} is computed as

μ_{ic} = e^{f_c(x_i)} / ∑_{c′=1}^{C} e^{f_{c′}(x_i)}

and y_{ic} = I(c = y_i) is the one-hot encoding of the labels.
11 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = −log L(W) = −∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log μ_{ic}

This is also called the cross-entropy loss (CE loss) function; it measures
the Kullback-Leibler (KL) divergence between the two distributions μ_{ic}
and y_{ic}.
Recall that the KL divergence between two distributions P and Q is

D_KL(P ∥ Q) = ∑_{x∈X} P(x) log( P(x) / Q(x) )
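A minimal NumPy sketch of the CE loss with one-hot labels, plus the KL divergence it is related to; the epsilon terms are implementation details added here, not part of the slides.

    import numpy as np

    def cross_entropy(Y_onehot, Mu, eps=1e-12):
        # ell(W) = -sum_i sum_c y_ic log mu_ic
        return -np.sum(Y_onehot * np.log(Mu + eps))

    def kl_divergence(P, Q, eps=1e-12):
        # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); terms with P(x) = 0 contribute 0.
        P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
        mask = P > 0
        return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))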

12 / 33
Multi-class logistic regression model

13 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = −log L(W) = −∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log μ_{ic}

This is also called the cross-entropy loss (CE loss) function; it measures
the Kullback-Leibler (KL) divergence between the two distributions μ_{ic}
and y_{ic}.
Question: Which form of KL divergence is equivalent to the NLL:
E[D_KL(y ∥ μ)] or E[D_KL(μ ∥ y)]? Can we reverse the order? Why?

14 / 33
Multi-class logistic regression model

Gradient descent
The gradients of the loss function w.r.t. the parameters are

∇_{w_c} ℓ(W) = ∑_{i=1}^{n} (μ_{ic} − y_{ic}) x_i

∇_{w_{c0}} ℓ(W) = ∑_{i=1}^{n} (μ_{ic} − y_{ic})
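A vectorized NumPy sketch of these two gradients over the whole dataset; the array shapes and names are illustrative assumptions.

    import numpy as np

    def softmax(Z):
        Z = Z - np.max(Z, axis=-1, keepdims=True)
        E = np.exp(Z)
        return E / np.sum(E, axis=-1, keepdims=True)

    def ce_gradients(X, Y_onehot, W, w0):
        # X: (n, d), Y_onehot: (n, C), W: (C, d), w0: (C,)
        Mu = softmax(X @ W.T + w0)        # mu_ic = P(Y = c | x_i)
        Err = Mu - Y_onehot               # mu_ic - y_ic
        grad_W = Err.T @ X                # (C, d): rows are sum_i (mu_ic - y_ic) x_i
        grad_w0 = Err.sum(axis=0)         # (C,):   sum_i (mu_ic - y_ic)
        return grad_W, grad_w0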

15 / 33
Multi-class classification

Problem statement
Given a dataset D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}, x_i ∈ X and
y_i ∈ Y = {1, 2, . . . , C}, we want to find a classifier h : X → Y.

Statistical probability perspective

The dataset D is sampled from an unknown distribution P(x, y).
A 'good' classifier h has a low probability of making an error under P.
Given a sample (X, Y) ∼ P, the event that X is misclassified by h is

{h(X) ≠ Y}

and the error probability (error rate) of h is

err_P(h) = P_{(X,Y)∼P}{h(X) ≠ Y}    (1)

16 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

17 / 33
Multi-class classification
Estimate the error rate
Since P is unknown, the true error (1) is not directly available.
However, we have the dataset D as a realization of P. The
empirical error rate on the training data D is then
êrr_D(h) = (1/n) ∑_{i=1}^{n} I(h(x_i) ≠ y_i)    (2)

Expectation of empirical error rate


The empirical error rate (2) is an unbiased estimator of the true
error rate (1) under P since
E_P[êrr_D(h)] = (1/n) ∑_{i=1}^{n} E_{(x_i, y_i)∼P}[ I(h(x_i) ≠ y_i) ]
             = P{h(X) ≠ Y} = err_P(h)
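A one-line NumPy sketch of the empirical error rate (2); h is assumed to be a vectorized predictor, which is an illustrative convention rather than something from the slides.

    import numpy as np

    def empirical_error(h, X, y):
        # (1/n) sum_i I(h(x_i) != y_i) for a vectorized predictor h
        return np.mean(h(X) != y)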

18 / 33
Bounding Empirical Error Rate

Concentration bound
Reminder (Hoeffding's inequality for the empirical mean): Let
X_1, X_2, . . . , X_n be i.i.d. samples of a random variable X,
bounded a.s. in [a, b]. Then

P{ | (1/n) ∑_{i=1}^{n} X_i − E[X] | ≤ ε } ≥ 1 − 2 exp( −2nε² / (b − a)² )

Connection to confidence intervals: the bound relates the error range ε,
the significance level α (the chance of a wrong estimate), and the
required number of samples n.

19 / 33
Bounding Empirical Error Rate

Concentration bound
Applying the Hoeffding inequality to the empirical error rate, with
X_i = I(h(x_i) ≠ y_i) being indicator random variables taking values
in {0, 1}, we obtain a concentration bound on the difference between
the empirical and true error rates:

P{ |êrr_D(h) − err_P(h)| ≤ ε } ≥ 1 − 2e^{−2nε²}

▶ Estimates how close êrr_D(h) is to err_P(h), given ε and the
number of samples n
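A small Python sketch of the two quantities this bound connects: the confidence guaranteed for a given n and ε, and the sample size needed for a target ε and significance level α; the example numbers are illustrative.

    import math

    def confidence(n, eps):
        # Lower bound on P{ |empirical error - true error| <= eps } from the Hoeffding bound.
        return 1 - 2 * math.exp(-2 * n * eps ** 2)

    def samples_needed(eps, alpha):
        # Smallest n with 2 exp(-2 n eps^2) <= alpha, i.e. n >= ln(2 / alpha) / (2 eps^2).
        return math.ceil(math.log(2 / alpha) / (2 * eps ** 2))

    # Example: samples_needed(0.05, 0.05) is about 738.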

20 / 33
Bounding Empirical Error Rate

21 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

22 / 33
The optimal classifier
For a fixed X = x, the probability of the error event is

P{h(X) ≠ Y | X = x} = 1 − P(Y = h(x) | X = x)

Question: if X = x is fixed, which value of h(x) among 1, . . . , C
minimizes the probability of error, given the posterior vector

[ P(Y = 1 | X = x), P(Y = 2 | X = x), . . . , P(Y = C | X = x) ]?

Bayes optimal classifier: a probabilistic model that makes the most
probable classification,

h⋆(x) = argmax_c P(Y = c | X = x)
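If the posterior vector were known, the Bayes optimal prediction is just an argmax; the probabilities below are made up for illustration.

    import numpy as np

    posterior = np.array([0.2, 0.5, 0.3])   # hypothetical P(Y = c | X = x) for c = 1, 2, 3
    h_star = np.argmax(posterior) + 1       # most probable class (1-indexed): 2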

23 / 33
The optimal classifier

Optimal error probability

E⋆ = err_P(h⋆) ≤ err_P(h), ∀h

Example: a logistic regression model uses the formula of the Bayes
optimal classifier, with P(Y = c | X = x) given by the categorical
distribution Cat(y | S(f_1(x), f_2(x), . . . , f_C(x))).

Question: Is it possible to find the Bayes optimal classifier?

▶ Can we find the distribution P?


▶ Can we approximate P(Y = c|X = x) to approximate h⋆ ?

24 / 33
The optimal classifier

Two approaches in machine learning


▶ Discriminative models: learn a predictor given the
observations
▶ Approximate P(Y = c|X = x)
▶ Approximate the classifier h⋆
▶ Generative models: learn a joint distribution over all the
variables.
▶ Approximate P(Y = c, X = x)

h⋆(x) = argmax_c P(Y = c | X = x)
      = argmax_c P(Y = c | X = x) P(X = x)      (P(X = x) does not depend on c)
      = argmax_c P(Y = c, X = x)

▶ Can solve a variety of problems, not only classification

25 / 33
Discriminative vs generative models

26 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

27 / 33
K nearest neighbor - KNN

Estimate P(Y = c|X = x) in the neighborhood of x


▶ Find k samples from D that are closest to x
▶ Approximate P(Y = c | X = x) by the proportion of label c among
the k samples:

P̂(Y = c | X = x) = (1/k) ∑_{i=1}^{n} I(x_i ∈ V_k(x)) I(y_i = c)

where V_k(x) is a neighborhood of x containing k data samples
from D. The numerator counts the data samples in V_k(x) that
are labeled c.

h_KNN(x): select the label c that occurs most often among the k
data samples closest to x in D.

28 / 33
K nearest neighbor - KNN

Pseudocode of KNN
1. Calculate d(x, x_i), i = 1, 2, . . . , n, where d is a distance metric.
2. Take the k data points with the smallest distances in the
calculated distance list, along with their labels, denoted
(x_j, y_j), j = 1, . . . , k.
3. Return the label which has the most votes.

Question: What data structure and algorithm should we use in
step 2 (sorting, heap, k-d tree)?
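A minimal NumPy sketch of these three steps, using Euclidean distance and a partial selection (np.argpartition) for step 2; the function name and default k are illustrative.

    import numpy as np

    def knn_predict(x, X_train, y_train, k=5):
        # Step 1: distances d(x, x_i) under the Euclidean metric.
        d = np.linalg.norm(X_train - x, axis=1)
        # Step 2: indices of the k smallest distances (partial selection, no full sort).
        nearest = np.argpartition(d, k)[:k]
        # Step 3: majority vote among the k nearest labels.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]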

29 / 33
K nearest neighbor - KNN

Theorem (upper bound on the error rate of KNN with k = 1)

When the number of training samples n → ∞, we have

R⋆ ≤ err_P(h_KNN) ≤ R⋆ ( 2 − (C / (C − 1)) R⋆ )

where R⋆ is the Bayes optimal error probability.

30 / 33
K nearest neighbor - KNN
Proof: Let x_(1) be the nearest neighbor of x in D, let y_(1) be the
label of this data sample, and let y_true be the label of x.
Suppose that as the number of samples n → ∞,

x_(1) → x
P(y | x_(1)) → P(y | x),   ∀y = 1, 2, . . . , C

An error of h_KNN on x occurs when y_(1) ≠ y_true. This probability
is equal to

err(h_KNN, x) = P(y_true ≠ y_(1) | x, x_(1))
             = 1 − ∑_{y=1}^{C} P(y_true = y | x) P(y_(1) = y | x_(1))
             → 1 − ∑_{y=1}^{C} P²(y | x)   (as n → ∞)

31 / 33
K nearest neighbor - KNN
If y⋆ = h⋆(x) is the output of the Bayes optimal classifier, then

P(y⋆ | x) = max_y P(y | x) = 1 − err(h⋆, x)

∑_{y=1}^{C} P²(y | x) = P²(y⋆ | x) + ∑_{y≠y⋆} P²(y | x)
                     ≥ P²(y⋆ | x) + ( ∑_{y≠y⋆} P(y | x) )² / (C − 1)   (Bunyakovsky inequality)
                     = (1 − err(h⋆, x))² + err(h⋆, x)² / (C − 1)

So

1 − ∑_{y=1}^{C} P²(y | x) ≤ 2 err(h⋆, x) − (C / (C − 1)) err(h⋆, x)²   (3)
32 / 33
K nearest neighbor - KNN
Taking the expectation of both sides of (3), with R⋆ = E_x err(h⋆, x),
we have

err(h_KNN) = E_x err(h_KNN, x) ≤ 2R⋆ − (C / (C − 1)) ∫_x err(h⋆, x)² p(x) dx   (4)

Also,

0 ≤ ∫_x (err(h⋆, x) − R⋆)² p(x) dx
  = ∫_x [ err(h⋆, x)² − 2R⋆ err(h⋆, x) + (R⋆)² ] p(x) dx
  = ∫_x err(h⋆, x)² p(x) dx − (R⋆)²

Substituting into (4):

err(h_KNN) ≤ 2R⋆ − (C / (C − 1)) (R⋆)² = R⋆ ( 2 − (C / (C − 1)) R⋆ ).
33 / 33
