
INT3405 Machine Learning

Lecture 3 - Classification

Ta Viet Cuong, Le Duc Trong, Tran Quoc Long


TA: Le Bang Giang, Tran Truong Thuy

VNU-UET

2022

1 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

2 / 33
Recap: Logistic Regression - Binary Classification
Data
x ∈ R^d, y ∈ {0, 1}
Example: image S × S → d = S^2 = 784 (MNIST)

D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}

Model
f(x) = w^T x + w_0
Y | X = x ∼ Ber(y | σ(f(x)))
Sigmoid Function
σ(z) = 1 / (1 + e^{−z})
Parameter
θ = (w, w_0)

3 / 33
Recap: Logistic Regression - Binary cross entropy loss

Training
Using the MLE principle, compute the likelihood

L(w, w_0) = P(D) = ∏_{i=1}^{n} P(y_i | x_i) = ∏_{i=1}^{n} μ_i^{y_i} (1 − μ_i)^{1 − y_i}

where μ_i = σ(f(x_i)).

Negative log-likelihood (NLL)

ℓ(w, w_0) = −log L(w, w_0) = ∑_{i=1}^{n} [ −y_i log μ_i − (1 − y_i) log(1 − μ_i) ]

⇒ binary cross-entropy (BCE) loss function

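A minimal NumPy sketch of the BCE loss for a batch of predicted probabilities; the epsilon clipping is an implementation detail added here to avoid log(0), not something from the slides.

    import numpy as np

    def bce_loss(y, mu, eps=1e-12):
        # ell(w, w_0) = sum_i [ -y_i log mu_i - (1 - y_i) log(1 - mu_i) ]
        mu = np.clip(mu, eps, 1 - eps)
        return np.sum(-y * np.log(mu) - (1 - y) * np.log(1 - mu))

    # Example: bce_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6]))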
4 / 33
Recap: Training Algorithm - Gradient Descent

TrainLR-GD(D, λ) → (w, w_0):
1. Initialize: w = 0 ∈ R^d, w_0 = 0
2. Loop epoch = 1, 2, . . .
a. Calculate ℓ(w, w_0)
b. Calculate the gradients ∇_w ℓ, ∇_{w_0} ℓ
c. Update the parameters

w ← w − λ ∇_w ℓ(w, w_0)
w_0 ← w_0 − λ ∇_{w_0} ℓ(w, w_0)

3. Stop when:
a. the epoch count is large enough,
b. the loss decreases negligibly, or
c. the gradients ∥∇_w ℓ∥, |∇_{w_0} ℓ| are small enough
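A minimal NumPy sketch of this training loop for binary logistic regression; the learning rate, epoch cap, and tolerance are illustrative choices, not from the slides, and the gradients here are averaged over the batch rather than summed.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_lr_gd(X, y, lr=0.1, epochs=1000, tol=1e-6):
        # Gradient descent for binary logistic regression with the BCE loss.
        n, d = X.shape
        w, w0 = np.zeros(d), 0.0
        prev_loss = np.inf
        for epoch in range(epochs):
            mu = sigmoid(X @ w + w0)                      # mu_i = sigma(f(x_i))
            loss = np.mean(-y * np.log(mu + 1e-12) - (1 - y) * np.log(1 - mu + 1e-12))
            grad_w = X.T @ (mu - y) / n                   # averaged gradient w.r.t. w
            grad_w0 = np.mean(mu - y)                     # averaged gradient w.r.t. w_0
            w -= lr * grad_w                              # update step
            w0 -= lr * grad_w0
            if abs(prev_loss - loss) < tol:               # stop: negligible decrease
                break
            prev_loss = loss
        return w, w0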

5 / 33
Multi-class logistic regression model

Given a label set Y = {1, 2, . . . , C }


Categorical distribution
A random variable Y ∼ Cat(y | θ_1, θ_2, . . . , θ_C) means that

P(Y = c) = θ_c,   c = 1, 2, . . . , C

where θ_c is the probability of category c and ∑_c θ_c = 1.

Example: a six-sided die with C = 6, θ_c = 1/6, ∀c

P(Y = y) = ∏_{c=1}^{C} θ_c^{I(c = y)}

where I(c = y) is the indicator function denoting whether c = y or not.
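A tiny NumPy illustration of the fair-die example above; the variable names and sample size are purely illustrative.

    import numpy as np

    theta = np.full(6, 1 / 6)                                       # fair six-sided die, C = 6
    samples = np.random.choice(np.arange(1, 7), size=10, p=theta)   # draws from Cat(theta)
    print(theta[2])                                                 # P(Y = 3) = theta_3 = 1/6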

6 / 33
Example Dataset with Categorical labels
Iris dataset:
▶ Number of Instances: 150
▶ Number of Attributes: 4 (sepal length/width in centimeters,
petal length/width in centimeters)
▶ Number of classes: 3
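As a quick illustration (not part of the slides), the Iris dataset can be loaded with scikit-learn, assuming it is installed:

    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target   # X: (150, 4) feature matrix, y: labels in {0, 1, 2}
    print(X.shape, set(y))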

7 / 33
Example Dataset with Categorical labels
MNIST dataset:
▶ Number of Instances: 60,000 training images and 10,000
testing images.
▶ Number of Attributes: 28 × 28
▶ Number of classes: 10

Figure: Sample images from MNIST test dataset (source: Wikipedia)

8 / 33
Multi-class logistic regression model
Model
Linear functions with parameters w_c ∈ R^d, w_{c0} ∈ R, c = 1, 2, . . . , C:

f_c(x) = w_c^T x + w_{c0}

Apply the softmax function to convert the logits to probabilities:

S(z_1, z_2, . . . , z_C) = [ e^{z_c} / ∑_{c′=1}^{C} e^{z_{c′}} ]_{c=1,2,...,C}

Probability model

Y | X = x ∼ Cat(y | S(f_1(x), f_2(x), . . . , f_C(x)))

where f_c(x) is called a logit. In logistic regression the f_c(x) are
(simple) linear functions; more sophisticated methods use neural
networks.
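A minimal NumPy sketch of the softmax and the resulting class probabilities for linear logits; subtracting the row maximum is a standard numerical-stability trick, not part of the slides, and the shapes are illustrative assumptions.

    import numpy as np

    def softmax(Z):
        # Map logits (shape (C,) or (n, C)) to probabilities that sum to 1.
        Z = Z - np.max(Z, axis=-1, keepdims=True)   # numerical stability
        E = np.exp(Z)
        return E / np.sum(E, axis=-1, keepdims=True)

    def predict_proba(X, W, w0):
        # mu[i, c] = P(Y = c | x_i) with logits f_c(x) = w_c^T x + w_c0; W: (C, d), w0: (C,).
        return softmax(X @ W.T + w0)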
9 / 33
Multi-class logistic regression model

Inference
h_LR(x): choose the class c with the largest predicted probability, h_LR(x) = argmax_c P(Y = c | X = x).

Figure: Multi-class logistic regression with Softmax

10 / 33
Multi-class logistic regression model
Training
The likelihood of the parameters with respect to the dataset
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is

L(W) = P(D) = ∏_{i=1}^{n} P(Y = y_i | x_i) = ∏_{i=1}^{n} ∏_{c=1}^{C} μ_{ic}^{y_{ic}}

where μ_{ic} is computed as

μ_{ic} = e^{f_c(x_i)} / ∑_{c′=1}^{C} e^{f_{c′}(x_i)}

and y_{ic} = I(c = y_i) is the one-hot encoding of the labels.
11 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = −log L(W) = −∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log μ_{ic}

This is also called the cross-entropy loss (CE loss) function; it measures
the Kullback-Leibler (KL) divergence between the two distributions μ_{ic}
and y_{ic}.
Recall that the KL divergence between two distributions P and Q is

D_KL(P ∥ Q) = ∑_{x∈X} P(x) log( P(x) / Q(x) )
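A minimal NumPy sketch of the CE loss with one-hot labels, plus the KL divergence it is related to; the epsilon terms are implementation details added here, not part of the slides.

    import numpy as np

    def cross_entropy(Y_onehot, Mu, eps=1e-12):
        # ell(W) = -sum_i sum_c y_ic log mu_ic
        return -np.sum(Y_onehot * np.log(Mu + eps))

    def kl_divergence(P, Q, eps=1e-12):
        # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); terms with P(x) = 0 contribute 0.
        P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
        mask = P > 0
        return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))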

12 / 33
Multi-class logistic regression model

13 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = −log L(W) = −∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log μ_{ic}

This is also called the cross-entropy loss (CE loss) function; it measures
the Kullback-Leibler (KL) divergence between the two distributions μ_{ic}
and y_{ic}.
Question: Which form of KL divergence is equivalent to the NLL:
E[D_KL(y ∥ μ)] or E[D_KL(μ ∥ y)]? Can we reverse the order? Why?

14 / 33
Multi-class logistic regression model

Gradient descent
The gradients of the loss function w.r.t. the parameters are

∇_{w_c} ℓ(W) = ∑_{i=1}^{n} (μ_{ic} − y_{ic}) x_i

∇_{w_{c0}} ℓ(W) = ∑_{i=1}^{n} (μ_{ic} − y_{ic})
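A vectorized NumPy sketch of these two gradients over the whole dataset; the array shapes and names are illustrative assumptions.

    import numpy as np

    def softmax(Z):
        Z = Z - np.max(Z, axis=-1, keepdims=True)
        E = np.exp(Z)
        return E / np.sum(E, axis=-1, keepdims=True)

    def ce_gradients(X, Y_onehot, W, w0):
        # X: (n, d), Y_onehot: (n, C), W: (C, d), w0: (C,)
        Mu = softmax(X @ W.T + w0)        # mu_ic = P(Y = c | x_i)
        Err = Mu - Y_onehot               # mu_ic - y_ic
        grad_W = Err.T @ X                # (C, d): rows are sum_i (mu_ic - y_ic) x_i
        grad_w0 = Err.sum(axis=0)         # (C,):   sum_i (mu_ic - y_ic)
        return grad_W, grad_w0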

15 / 33
Multi-class classification

Problem statement
Given a dataset D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}, x_i ∈ X and
y_i ∈ Y = {1, 2, . . . , C}, we want to find a classifier h : X → Y.

Statistical probability perspective

The dataset D is sampled from an unknown distribution P(x, y).
A 'good' classifier h has a low probability of making an error under P.
Given a sample (X, Y) ∼ P, the event that X is misclassified by h is

{h(X) ≠ Y}

and the error probability (error rate) of h is

err_P(h) = P_{(X,Y)∼P}{h(X) ≠ Y}    (1)

16 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

17 / 33
Multi-class classification
Estimate the error rate
Since P is unknown, the true error (1) is not directly available.
However, we have the dataset D as a realization of P. The
empirical error rate on the training data D is then
êrr_D(h) = (1/n) ∑_{i=1}^{n} I(h(x_i) ≠ y_i)    (2)

Expectation of empirical error rate


The empirical error rate (2) is an unbiased estimator of the true
error rate (1) under P since
E_P[êrr_D(h)] = (1/n) ∑_{i=1}^{n} E_{(x_i, y_i)∼P}[ I(h(x_i) ≠ y_i) ]
             = P{h(X) ≠ Y} = err_P(h)
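A one-line NumPy sketch of the empirical error rate (2); h is assumed to be a vectorized predictor, which is an illustrative convention rather than something from the slides.

    import numpy as np

    def empirical_error(h, X, y):
        # (1/n) sum_i I(h(x_i) != y_i) for a vectorized predictor h
        return np.mean(h(X) != y)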

18 / 33
Bounding Empirical Error Rate

Concentration bound
Reminder (Hoeffding's inequality for the empirical mean): Let
X_1, X_2, . . . , X_n be i.i.d. samples of a random variable X,
bounded a.s. in [a, b]. Then

P{ | (1/n) ∑_{i=1}^{n} X_i − E[X] | ≤ ε } ≥ 1 − 2 exp( −2nε² / (b − a)² )

Connection to confidence intervals: the bound relates the error range ε,
the significance level α (the chance of a wrong estimate), and the
required number of samples n.

19 / 33
Bounding Empirical Error Rate

Concentration bound
Applying the Hoeffding inequality to the empirical error rate, with
X_i = I(h(x_i) ≠ y_i) being indicator random variables taking values
in {0, 1}, we obtain a concentration bound on the difference between
the empirical and true error rates:

P{ |êrr_D(h) − err_P(h)| ≤ ε } ≥ 1 − 2e^{−2nε²}

▶ Estimates how close êrr_D(h) is to err_P(h), given ε and the
number of samples n
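A small Python sketch of the two quantities this bound connects: the confidence guaranteed for a given n and ε, and the sample size needed for a target ε and significance level α; the example numbers are illustrative.

    import math

    def confidence(n, eps):
        # Lower bound on P{ |empirical error - true error| <= eps } from the Hoeffding bound.
        return 1 - 2 * math.exp(-2 * n * eps ** 2)

    def samples_needed(eps, alpha):
        # Smallest n with 2 exp(-2 n eps^2) <= alpha, i.e. n >= ln(2 / alpha) / (2 eps^2).
        return math.ceil(math.log(2 / alpha) / (2 * eps ** 2))

    # Example: samples_needed(0.05, 0.05) is about 738.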

20 / 33
Bounding Empirical Error Rate

21 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

22 / 33
The optimal classifier
For a fixed X = x, the probability of the error event is

P{h(X) ≠ Y | X = x} = 1 − P(Y = h(x) | X = x)

Question: if X = x is fixed, which value of h(x) among 1, . . . , C
minimizes the probability of error, given the posterior vector

[ P(Y = 1 | X = x), P(Y = 2 | X = x), . . . , P(Y = C | X = x) ]?

Bayes optimal classifier: a probabilistic model that makes the most
probable classification,

h⋆(x) = argmax_c P(Y = c | X = x)
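If the posterior vector were known, the Bayes optimal prediction is just an argmax; the probabilities below are made up for illustration.

    import numpy as np

    posterior = np.array([0.2, 0.5, 0.3])   # hypothetical P(Y = c | X = x) for c = 1, 2, 3
    h_star = np.argmax(posterior) + 1       # most probable class (1-indexed): 2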

23 / 33
The optimal classifier

Optimal error probability

E⋆ = err_P(h⋆) ≤ err_P(h), ∀h

Example: a logistic regression model uses the formula of the Bayes
optimal classifier, with P(Y = c | X = x) given by the categorical
distribution Cat(y | S(f_1(x), f_2(x), . . . , f_C(x))).

Question: Is it possible to find the Bayes optimal classifier?

▶ Can we find the distribution P?


▶ Can we approximate P(Y = c|X = x) to approximate h⋆ ?

24 / 33
The optimal classifier

Two approaches in machine learning


▶ Discriminative models: learn a predictor given the
observations
▶ Approximate P(Y = c|X = x)
▶ Approximate the classifier h⋆
▶ Generative models: learn a joint distribution over all the
variables.
▶ Approximate P(Y = c, X = x)

h⋆(x) = argmax_c P(Y = c | X = x)
      = argmax_c P(Y = c | X = x) P(X = x)      (P(X = x) does not depend on c)
      = argmax_c P(Y = c, X = x)

▶ Can solve a variety of problems, not only classification

25 / 33
Discriminative vs generative models

26 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

27 / 33
K nearest neighbor - KNN

Estimate P(Y = c|X = x) in the neighborhood of x


▶ Find k samples from D that are closest to x
▶ Approximate P(Y = c | X = x) by the proportion of label c among
the k samples:

P̂(Y = c | X = x) = (1/k) ∑_{i=1}^{n} I(x_i ∈ V_k(x)) I(y_i = c)

where V_k(x) is a neighborhood of x containing k data samples
from D. The numerator counts the data samples in V_k(x) that
are labeled c.

h_KNN(x): select the label c that occurs most often among the k
data samples closest to x in D.

28 / 33
K nearest neighbor - KNN

Pseudocode of KNN
1. Calculate d(x, x_i), i = 1, 2, . . . , n, where d is a distance metric.
2. Take the k data points with the smallest distances in the
calculated distance list, along with their labels, denoted
(x_j, y_j), j = 1, . . . , k.
3. Return the label which has the most votes.

Question: What data structure and algorithm should we use in
step 2 (sorting, heap, k-d tree)?
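A minimal NumPy sketch of these three steps, using Euclidean distance and a partial selection (np.argpartition) for step 2; the function name and default k are illustrative.

    import numpy as np

    def knn_predict(x, X_train, y_train, k=5):
        # Step 1: distances d(x, x_i) under the Euclidean metric.
        d = np.linalg.norm(X_train - x, axis=1)
        # Step 2: indices of the k smallest distances (partial selection, no full sort).
        nearest = np.argpartition(d, k)[:k]
        # Step 3: majority vote among the k nearest labels.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]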

29 / 33
K nearest neighbor - KNN

Theorem (upper bound on the error rate of KNN with k = 1)

When the number of training samples n → ∞, we have

R⋆ ≤ err_P(h_KNN) ≤ R⋆ ( 2 − (C / (C − 1)) R⋆ )

where R⋆ is the Bayes optimal error probability.

30 / 33
K nearest neighbor - KNN
Proof: Let x_(1) be the nearest neighbor of x in D, let y_(1) be the
label of this data sample, and let y_true be the label of x.
Suppose that as the number of samples n → ∞,

x_(1) → x
P(y | x_(1)) → P(y | x),   ∀y = 1, 2, . . . , C

An error of h_KNN on x occurs when y_(1) ≠ y_true. This probability
is equal to

err(h_KNN, x) = P(y_true ≠ y_(1) | x, x_(1))
             = 1 − ∑_{y=1}^{C} P(y_true = y | x) P(y_(1) = y | x_(1))
             → 1 − ∑_{y=1}^{C} P²(y | x)   (as n → ∞)

31 / 33
K nearest neighbor - KNN
If y⋆ = h⋆(x) is the output of the Bayes optimal classifier, then

P(y⋆ | x) = max_y P(y | x) = 1 − err(h⋆, x)

∑_{y=1}^{C} P²(y | x) = P²(y⋆ | x) + ∑_{y≠y⋆} P²(y | x)
                     ≥ P²(y⋆ | x) + ( ∑_{y≠y⋆} P(y | x) )² / (C − 1)   (Bunyakovsky inequality)
                     = (1 − err(h⋆, x))² + err(h⋆, x)² / (C − 1)

So

1 − ∑_{y=1}^{C} P²(y | x) ≤ 2 err(h⋆, x) − (C / (C − 1)) err(h⋆, x)²   (3)
32 / 33
K nearest neighbor - KNN
Taking the expectation of both sides of (3), with R⋆ = E_x err(h⋆, x),
we have

err(h_KNN) = E_x err(h_KNN, x) ≤ 2R⋆ − (C / (C − 1)) ∫_x err(h⋆, x)² p(x) dx   (4)

Also,

0 ≤ ∫_x (err(h⋆, x) − R⋆)² p(x) dx
  = ∫_x [ err(h⋆, x)² − 2R⋆ err(h⋆, x) + (R⋆)² ] p(x) dx
  = ∫_x err(h⋆, x)² p(x) dx − (R⋆)²

Substituting into (4):

err(h_KNN) ≤ 2R⋆ − (C / (C − 1)) (R⋆)² = R⋆ ( 2 − (C / (C − 1)) R⋆ ).
33 / 33
