Lecture 3 - Classification
VNU-UET
2022
Table of Contents
Multi-class classification
Recap: Logistic Regression - Binary Classification
Data
$x \in \mathbb{R}^d$, $y \in \{0, 1\}$
Example: an $S \times S$ image is flattened into $d = S^2 = 784$ features (MNIST, $S = 28$)
Model
$f(x) = w^\top x + w_0$
$Y \mid X = x \sim \operatorname{Ber}(\sigma(f(x)))$
Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Parameter
$\theta = (w, w_0)$
Recap: Logistic Regression - Binary cross entropy loss
Training
Following the MLE principle, compute the likelihood:
$$L(w, w_0) = P(D) = \prod_{i=1}^{n} P(y_i \mid x_i) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$
where $\mu_i = \sigma(w^\top x_i + w_0)$. Taking the negative logarithm gives the binary cross entropy loss:
$$\ell(w, w_0) = -\sum_{i=1}^{n} \big[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \big]$$
Recap: Training Algorithm - Gradient Descent
TrainLR-GD(D, λ) → (w, w0):
1. Initialize: $w = 0 \in \mathbb{R}^d$, $w_0 = 0$
2. Loop epoch = 1, 2, . . .
   a. Compute the loss $\ell(w, w_0)$
   b. Compute the gradients $\nabla_w \ell$, $\nabla_{w_0} \ell$
   c. Update the parameters:
      $w \leftarrow w - \lambda \nabla_w \ell(w, w_0)$
      $w_0 \leftarrow w_0 - \lambda \nabla_{w_0} \ell(w, w_0)$
3. Stop when:
   a. the number of epochs is large enough,
   b. the decrease in the loss function is negligible, or
   c. the gradient norms $\|\nabla_w \ell\|$ and $|\nabla_{w_0} \ell|$ are small enough.
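A minimal NumPy sketch of this procedure (the function name and hyperparameters are illustrative, and only the loss-decrease stopping criterion is implemented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_lr_gd(X, y, lr=0.1, max_epochs=1000, tol=1e-8):
    """Gradient descent for binary logistic regression.

    X: (n, d) data matrix, y: (n,) labels in {0, 1}.
    """
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0                    # 1. initialize
    prev_loss = np.inf
    for epoch in range(max_epochs):             # 2. loop over epochs
        mu = sigmoid(X @ w + w0)                # mu_i = sigma(w^T x_i + w0)
        # a. binary cross entropy loss (averaged over the dataset)
        loss = -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
        # b. gradients, averaged over n (this only rescales the learning rate)
        grad_w = X.T @ (mu - y) / n
        grad_w0 = np.mean(mu - y)
        # c. parameter updates
        w -= lr * grad_w
        w0 -= lr * grad_w0
        if prev_loss - loss < tol:              # 3b. negligible loss decrease
            break
        prev_loss = loss
    return w, w0
```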
Multi-class logistic regression model
$$P(Y = c) = \theta_c, \quad c = 1, 2, \dots, C$$
where $\theta_c$ is the probability of category $c$ and $\sum_c \theta_c = 1$
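As a quick illustration, here is how one might sample from such a categorical distribution in NumPy (the probabilities below are made up):

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.3])            # theta_c for C = 3, sums to 1
rng = np.random.default_rng(0)
y = rng.choice([1, 2, 3], size=10, p=theta)  # draws Y ~ Categorical(theta)
print(y)                                     # ten labels from {1, 2, 3}
```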
Example Dataset with Categorical labels
Iris dataset:
▶ Number of Instances: 150
▶ Number of Attributes: 4 (sepal length/width and petal length/width, in centimeters)
▶ Number of classes: 3
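A quick way to inspect this dataset, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 instances, 4 attributes
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']: 3 classes
```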
Example Dataset with Categorical labels
MNIST dataset:
▶ Number of Instances: 60,000 training images and 10,000 testing images
▶ Number of Attributes: 28 × 28
▶ Number of classes: 10
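One common way to load it, assuming TensorFlow/Keras is available (other loaders such as torchvision work just as well):

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)  # (60000, 28, 28); flatten each image to d = 784
print(x_test.shape)   # (10000, 28, 28)
```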
Multi-class logistic regression model
Model
Linear functions with parameters $w_c \in \mathbb{R}^d$, $w_{c0} \in \mathbb{R}$, $c = 1, 2, \dots, C$:
$$f_c(x) = w_c^\top x + w_{c0}$$
Probability model
$$P(Y = c \mid x) = \mu_c(x) = \frac{e^{f_c(x)}}{\sum_{c'=1}^{C} e^{f_{c'}(x)}} \quad \text{(softmax)}$$
Inference
$h_{LR}(x)$: choose the class $c$ that maximizes $\mu_c(x)$, i.e. $h_{LR}(x) = \arg\max_c P(Y = c \mid x)$
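A minimal sketch of this inference rule, with the $w_c$ stacked as the rows of a matrix W (all names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()      # shift scores for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(x, W, w0):
    """Return h_LR(x) = argmax_c P(Y = c | x) for the softmax model."""
    scores = W @ x + w0  # f_c(x) = w_c^T x + w_c0, one score per class
    # softmax is monotone, so the argmax over probabilities equals
    # the argmax over raw scores
    return int(np.argmax(softmax(scores)))
```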
Multi-class logistic regression model
Training
The likelihood of the parameters with respect to the dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$:
$$L(W) = P(D) = \prod_{i=1}^{n} P(Y = y_i \mid x_i) = \prod_{i=1}^{n} \prod_{c=1}^{C} \mu_{ic}^{y_{ic}}$$
where $\mu_{ic} = P(Y = c \mid x_i)$ and $y_{ic} = \mathbb{I}(y_i = c)$ is the one-hot encoding of $y_i$. Taking the negative logarithm gives the cross entropy loss:
$$\ell(W) = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log \mu_{ic}$$
Multi-class logistic regression model
Gradient descent
The gradients of the loss function w.r.t. the parameters are
$$\nabla_{w_c} \ell(W) = \sum_{i=1}^{n} (\mu_{ic} - y_{ic}) x_i$$
$$\nabla_{w_{c0}} \ell(W) = \sum_{i=1}^{n} (\mu_{ic} - y_{ic})$$
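A compact NumPy sketch of gradient descent with these formulas (names and hyperparameters are illustrative; the gradients are averaged over n rather than summed, which only rescales the learning rate):

```python
import numpy as np

def train_softmax_gd(X, y, C, lr=0.5, epochs=500):
    """Gradient descent for multi-class logistic regression.

    X: (n, d) data matrix, y: (n,) labels in {0, ..., C-1}.
    Returns W of shape (C, d) and w0 of shape (C,).
    """
    n, d = X.shape
    Y = np.eye(C)[y]                          # one-hot targets y_ic
    W, w0 = np.zeros((C, d)), np.zeros(C)
    for _ in range(epochs):
        scores = X @ W.T + w0                 # f_c(x_i) for all i, c
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        Mu = np.exp(scores)
        Mu /= Mu.sum(axis=1, keepdims=True)   # mu_ic = softmax probabilities
        G = Mu - Y                            # the (mu_ic - y_ic) factor
        W -= lr * (G.T @ X) / n               # gradient step for each w_c
        w0 -= lr * G.mean(axis=0)             # gradient step for each w_c0
    return W, w0
```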
Multi-class classification
Problem statement
Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, $x_i \in X$ and $y_i \in Y = \{1, 2, \dots, C\}$, we want to find a classifier $h(x) : X \to Y$ that minimizes the true error rate
$$\operatorname{err}(h) = P\{h(X) \neq Y\} \tag{1}$$
Multi-class classification
Estimate the error rate
Since P is unknown, the true error (1) is not directly available.
However, we have the dataset D as a realization of P. The
empirical error rate on the training data D is then
$$\widehat{\operatorname{err}}_D(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(h(x_i) \neq y_i) \tag{2}$$
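A one-liner computes this quantity in NumPy (h is assumed to be a hypothetical classifier that maps a batch of inputs to a vector of predicted labels):

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical error rate (2): fraction of misclassified samples."""
    return np.mean(h(X) != y)  # mean of I(h(x_i) != y_i) over the dataset
```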
Bounding Empirical Error Rate
Concentration bound
Reminder: Hoeffding's inequality for the empirical mean. Let $X_1, X_2, \dots, X_n$ be i.i.d. samples of a random variable $X$ bounded a.s. in $[a, b]$. Then
$$P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - E[X] \right| \leq \epsilon \right) \geq 1 - 2 \exp\left( \frac{-2 n \epsilon^2}{(b - a)^2} \right)$$
Applied with $X_i = \mathbb{I}(h(x_i) \neq y_i) \in [0, 1]$, this shows that the empirical error rate (2) concentrates around the true error (1).
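A small simulation, under an assumed Bernoulli(0.3) setup so that $[a, b] = [0, 1]$, illustrating that the empirical coverage respects the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 1000, 0.05, 10_000
# empirical means of n Bernoulli(0.3) samples, repeated over many trials
means = rng.binomial(1, 0.3, size=(trials, n)).mean(axis=1)
coverage = np.mean(np.abs(means - 0.3) <= eps)
bound = 1 - 2 * np.exp(-2 * n * eps**2 / (1 - 0) ** 2)  # Hoeffding bound
print(coverage, ">=", bound)  # e.g. ~0.999 >= ~0.987
```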
The optimal classifier
For fixed $X = x$, the error event is $\{h(x) \neq Y\}$, which occurs with probability $1 - P(Y = h(x) \mid X = x)$. Minimizing this over the choice of $h(x)$ gives the Bayes optimal classifier
$$h^\star(x) = \arg\max_{c} P(Y = c \mid X = x)$$
Discriminative vs generative models
K nearest neighbor - KNN
$h_{KNN}(x)$: select the label $c$ that occurs most often among the $k$ data samples in $D$ closest to $x$.
28 / 33
K nearest neighbor - KNN
Pseudocode of KNN
1. Compute $d(x, x_i)$, $i = 1, 2, \dots, n$, where $d$ is a distance metric.
2. Take the $k$ data points with the smallest distances, along with their labels, denoted $(x_{(j)}, y_{(j)})_{j=1}^{k}$.
3. Return the label that receives the most votes among $y_{(1)}, \dots, y_{(k)}$.
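A minimal NumPy implementation of this pseudocode, assuming Euclidean distance as the metric:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest neighbors."""
    # 1. distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # 3. majority vote over the neighbors' labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```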
K nearest neighbor - KNN
Proof: Let $x_{(1)}$ be the nearest neighbor of $x$ in $D$, $y_{(1)}$ the label of this data sample, and $y^{\text{true}}$ the true label of $x$. Suppose that as the number of samples $n \to \infty$,
$$x_{(1)} \to x, \qquad P(y \mid x_{(1)}) \to P(y \mid x), \quad \forall y = 1, 2, \dots, C$$
K nearest neighbor - KNN
If $y^\star = h^\star(x)$ is the output of the Bayes optimal classifier, then asymptotically
$$P(y_{(1)} \neq y^{\text{true}} \mid x) \to 1 - \sum_{c=1}^{C} P(c \mid x)^2 \leq 2\big(1 - P(y^\star \mid x)\big),$$
i.e. the 1-NN error is at most twice the Bayes error (the classical Cover and Hart result).