
Refresh your knowledge

Theoretical Exercises
January 25th, 2023
Exercise 1: Bayesian Classifier
a) What is the difference between discriminative and generative
modeling?


• Generative modeling: modeling and estimation of p(y) and p(x|y).
• Discriminative modeling: estimation of p(y|x) directly.

Generative: The probability distribution of the data is learned.
Discriminative: A decision boundary is learned.
b) What is the decision rule of the Bayesian classifier?

Bayes' rule:

p(x, y) = p(y) · p(x|y)        (joint pdf = prior · class-conditional pdf)
        = p(x) · p(y|x)        (joint pdf = evidence · posterior)

The posterior probability is then given as

p(y|x) = p(y) · p(x|y) / p(x)


Bayesian decision rule:

y* = arg max_y p(y|x)
   = arg max_y p(y) · p(x|y) / p(x)
   = arg max_y p(y) · p(x|y)
   = arg max_y [log p(y) + log p(x|y)]
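As a quick illustration (not part of the original slides), a minimal numpy/scipy sketch of this decision rule with Gaussian class-conditional densities; the priors, means and covariances are made-up toy values:

```python
# Minimal sketch (not from the slides): arg max_y [log p(y) + log p(x|y)]
# with Gaussian class-conditional densities and made-up toy parameters.
import numpy as np
from scipy.stats import multivariate_normal

priors = np.array([0.3, 0.7])                          # p(y) for classes 0 and 1
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]   # class-conditional means
covs = [np.eye(2), np.eye(2)]                          # class-conditional covariances

def bayes_decide(x):
    # the evidence p(x) is dropped, it does not depend on y
    scores = [np.log(priors[y]) + multivariate_normal.logpdf(x, means[y], covs[y])
              for y in range(len(priors))]
    return int(np.argmax(scores))

print(bayes_decide(np.array([1.8, 1.5])))              # -> 1
```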

c) Simplify the decision rule if no prior knowledge about the occurrence of the
classes is available.

y* = arg max_y p(x|y)

d) Show the optimality of the Bayesian classifier for the (0,1)-loss
function.

l(y_1, y_2) is the loss if a feature vector belonging to class y_2 is assigned to class y_1.
The (0,1)-loss function is defined by

l(y_1, y_2) = 0, if y_1 = y_2
              1, otherwise

The best decision rule minimizes the average loss

AL(x, y) = ∑_{y′} l(y, y′) p(y′|x)


Using the (0,1)-loss function, the class decision is based on

y* = arg min_y AL(x, y)
   = arg min_y ∑_{y′} l(y, y′) p(y′|x)
   = arg min_y (1 − p(y|x))
   = arg max_y p(y|x)

The optimal classifier according to the (0,1)-loss function applies the Bayesian
decision rule. This classifier is called the Bayesian classifier.
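A tiny numeric check of this equivalence (illustrative only, with made-up posterior values): minimizing the average (0,1)-loss selects the same class as maximizing the posterior.

```python
# Tiny numeric check with made-up posterior values (illustrative only):
# arg min_y AL(x, y) under the (0,1)-loss equals arg max_y p(y|x).
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])               # p(y'|x) for three classes
loss = 1.0 - np.eye(3)                              # (0,1)-loss matrix l(y, y')

avg_loss = loss @ posterior                         # AL(x, y) = 1 - p(y|x)
print(avg_loss)                                     # [0.8 0.5 0.7]
assert np.argmin(avg_loss) == np.argmax(posterior)  # both select class 1
```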

Exercise 2: Discriminant Analysis
a) Write down the objective function for PCA.


We search a mapping Φ that maximizes the spread of the features:

Φ* = arg max_Φ ∑_{i,j} (Φx_i − Φx_j)^T (Φx_i − Φx_j) + λ (||Φ||₂² − 1)

Optimizing using the method of Lagrangian multipliers yields:

Σ Φ^T = λ′ Φ^T

where the covariance matrix is given by:

Σ = (1/m) ∑_i (x_i − µ)(x_i − µ)^T
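For illustration (not part of the original slides), a minimal numpy sketch of PCA via the eigendecomposition of this covariance matrix; the toy data and variable names are assumptions:

```python
# Minimal numpy sketch of PCA: eigenvectors of the sample covariance matrix
# give the rows of the projection Phi; the data is random toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])   # toy data, 3 features

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / X.shape[0]           # (1/m) sum_i (x_i - mu)(x_i - mu)^T
eigval, eigvec = np.linalg.eigh(Sigma)               # solves Sigma Phi^T = lambda' Phi^T

Phi = eigvec[:, ::-1].T                              # rows sorted by decreasing variance
X_proj = (X - mu) @ Phi[:2].T                        # project onto the first 2 components
print(eigval[::-1])                                  # eigenvalues, largest first
```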

b) Write down the objective function for LDA.


We search a mapping Φ that maximizes the spread of the L-dimensional projection
of the (K − 1)-dimensional projection of the centroids:

Φ* = arg max_Φ (1/K) ∑_{y=1}^{K} (Φµ′_y − Φµ̄′)^T (Φµ′_y − Φµ̄′) + ∑_{i=1}^{L} λ_i (||Φ_i||₂² − 1)

Optimizing using the method of Lagrangian multipliers yields:

Σ_inter Φ^T = λ′ Φ^T

where the inter-class covariance matrix is given by:

Σ_inter = (1/m) ∑_i (x_i − µ_{y_i})(x_i − µ_{y_i})^T
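Again for illustration only (not the slides' code), a short numpy sketch that builds the matrix defined above from labelled toy data and solves its eigenvalue/eigenvector problem; the data and names are made up:

```python
# Illustrative sketch: form the matrix defined above from labelled toy data
# and compute its eigenvectors as projection directions.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(50, 2)),
               rng.normal(loc=[3.0, 3.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

mu = {c: X[y == c].mean(axis=0) for c in np.unique(y)}       # class means mu_y
D = np.stack([X[i] - mu[y[i]] for i in range(len(X))])
S = D.T @ D / len(X)                                 # (1/m) sum_i (x_i - mu_{y_i})(...)^T

eigval, eigvec = np.linalg.eigh(S)                   # S Phi^T = lambda' Phi^T
Phi = eigvec[:, ::-1].T                              # directions sorted by eigenvalue
print(eigval[::-1])
```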

c) Describe the differences between PCA and LDA.

• Unlike LDA, PCA does not require a labelled (classified) set of feature vectors.
• PCA-transformed features are approximately normally distributed (central limit theorem).
• PCA uses the covariance matrix, whereas LDA uses the inter-class covariance matrix, to solve the respective eigenvalue/eigenvector problem.

Exercise 3: Gaussian Mixture Model and EM
a) Write down the general form of a Gaussian mixture model.

p(x) = ∑_{k=1}^{K} p_k N(x; µ_k, Σ_k)

b) Which parameters of the GMM can be estimated using the EM
algorithm?


• µ_k: the K means
• Σ_k: the K covariance matrices of size d × d
• p_k: the fraction of all features in component k
• p(k|x_i) ≡ p_ik: the K probabilities for each of the m feature vectors x_i

Additional estimates:

• p(x): the probability distribution of observing a feature vector x
• L: the overall log-likelihood function of the estimated parameter set

c) How do you initialize the EM algorithm?
Use k-means to find an initial guess for µ_k^(0).
Compute p_k^(0) and Σ_k^(0) based on the K clusters.

d) Describe the basic steps of the EM algorithm for GMMs.

E-step:

  p_ik ≡ p(k|i) = p_k N(x_i | µ_k, Σ_k) / p(x_i)
  L = ∑_{i=1}^{m} log p(x_i)

M-step:

  µ̂_k = ∑_i p_ik x_i / ∑_i p_ik
  Σ̂_k = ∑_i p_ik (x_i − µ̂_k)(x_i − µ̂_k)^T / ∑_i p_ik
  p̂_k = (1/m) ∑_{i=1}^{m} p_ik
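For illustration (not from the slides), a compact numpy/scipy sketch of one EM iteration; the variable names mirror the quantities above, everything else is an assumption:

```python
# Compact sketch of one EM iteration for a GMM (illustrative only).
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, p, mu, Sigma):
    m, K = X.shape[0], len(p)
    # E-step: responsibilities p_ik = p_k N(x_i | mu_k, Sigma_k) / p(x_i)
    dens = np.stack([p[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                     for k in range(K)], axis=1)     # shape (m, K)
    p_x = dens.sum(axis=1, keepdims=True)            # p(x_i)
    p_ik = dens / p_x
    log_lik = np.log(p_x).sum()                      # L = sum_i log p(x_i)
    # M-step: responsibility-weighted means, covariances and mixture weights
    Nk = p_ik.sum(axis=0)
    mu_new = [(p_ik[:, k:k + 1] * X).sum(axis=0) / Nk[k] for k in range(K)]
    Sigma_new = [(p_ik[:, k:k + 1] * (X - mu_new[k])).T @ (X - mu_new[k]) / Nk[k]
                 for k in range(K)]
    p_new = Nk / m
    return p_new, mu_new, Sigma_new, log_lik

# iterate until log_lik stops increasing, e.g.:
# p, mu, Sigma, log_lik = em_step(X, p, mu, Sigma)
```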

Exercise 4: Kernel PCA
a) Describe the key idea of Kernel PCA.

Features are virtually transformed into a higher-dimensional space in which they
can be linearly separated.

b) Explain the kernel trick and give the corresponding equation.

The Kernel Trick


For any algorithm that is formulated in terms of a positive semidefinite kernel k, we
can derive an alternative algorithm by replacing the kernel function k with another
positive semidefinite kernel k′.
Example:
For the PCA eigenvalue/eigenvector problem:

Σ e_i = λ_i e_i

where:

Σ = (1/m) ∑_{i=1}^{m} x_i x_i^T ∈ R^{d×d}

A given eigenvector can be written as a linear combination of the feature vectors:

e_i = ∑_k α_{i,k} x_k


The eigenvalue/eigenvector problem of PCA can be rewritten as

Σ e_i = λ_i e_i
(1/m) ∑_{j=1}^{m} x_j x_j^T · ∑_k α_{i,k} x_k = λ_i ∑_k α_{i,k} x_k
∑_{j,k} α_{i,k} x_j x_j^T x_k = m · λ_i ∑_k α_{i,k} x_k

These equations have to be fulfilled for all projections onto x_l, i.e. for all indices l:

∑_{j,k} α_{i,k} x_l^T x_j x_j^T x_k = m · λ_i ∑_k α_{i,k} x_l^T x_k

All feature vectors now appear only in terms of inner products, so the kernel trick
can be applied:

∑_{j,k} α_{i,k} k(x_l, x_j) · k(x_j, x_k) = m · λ_i ∑_k α_{i,k} k(x_l, x_k)
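As an illustrative sketch only (not part of the slides), kernel PCA with an RBF kernel in numpy; centering of the kernel matrix is included although the derivation above omits it, and normalization of the α's is skipped for brevity:

```python
# Illustrative kernel PCA sketch with an RBF kernel and toy data.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                        # toy data

K = rbf_kernel(X)                                    # K[l, j] = k(x_l, x_j)
m = K.shape[0]
ones = np.full((m, m), 1.0 / m)
K_c = K - ones @ K - K @ ones + ones @ K @ ones      # center in feature space

eigval, alpha = np.linalg.eigh(K_c)                  # columns hold the alpha_{i,k}
proj = K_c @ alpha[:, ::-1][:, :2]                   # projection onto top 2 components
print(eigval[::-1][:2])
```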

Exercise 5: Maximum Likelihood Estimation
a) Write down the log-likelihood function to estimate the
parameters µ and Σ of a Gaussian probability density
N (x ; µ, Σ) from training data x1 . . . xm .


L(x_1, ..., x_m; µ, Σ) = ∑_{i=1}^{m} log N(x_i; µ, Σ)

                       = ∑_{i=1}^{m} [ −(1/2) log(|2πΣ|) − (1/2) (x_i − µ)^T Σ^{-1} (x_i − µ) ]

b) Write down the ML estimators for µ and Σ.

µ = (1/m) ∑_{i=1}^{m} x_i

Σ = (1/m) ∑_{i=1}^{m} (x_i − µ) (x_i − µ)^T
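A short numpy check with random toy data (illustrative only): the ML estimators above are simply the sample mean and the 1/m-normalized sample covariance:

```python
# Illustrative check of the ML estimators on random toy data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=5000)

mu_hat = X.mean(axis=0)                              # (1/m) sum_i x_i
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)   # (1/m) sum_i (x_i - mu)(x_i - mu)^T
print(mu_hat)
print(Sigma_hat)
```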

Exercise 6: Naive Bayes
a) Which independence assumption is used for naive Bayes?

All d components of the feature vector x are assumed to be mutually independent:

p(x|y) = ∏_{i=1}^{d} p(x_i|y)

b) What is the decision rule of naive Bayes?

The decision rule of naive Bayes is

y* = arg max_y p(y|x)
   = arg max_y p(y) · p(x|y)
   = arg max_y p(y) ∏_{i=1}^{d} p(x_i|y)
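For illustration (not from the slides), a minimal sketch of this decision rule with per-dimension Gaussian class conditionals p(x_i|y); all parameter values are made up:

```python
# Minimal naive Bayes decision sketch with made-up Gaussian class conditionals.
import numpy as np
from scipy.stats import norm

priors = {0: 0.4, 1: 0.6}                            # p(y)
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
stds = {0: np.array([1.0, 1.0]), 1: np.array([1.0, 2.0])}

def naive_bayes_decide(x):
    # log p(y) + sum_i log p(x_i|y) for each class
    scores = {y: np.log(priors[y]) + norm.logpdf(x, means[y], stds[y]).sum()
              for y in priors}
    return max(scores, key=scores.get)

print(naive_bayes_decide(np.array([1.5, 0.5])))      # -> 1
```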

c) What is the structure of the covariance matrix of
normally distributed classes in naive Bayes?

Diagonal matrix (the variances are the diagonal elements).

Exercise 7: Sigmoid Function
a) Write down the Sigmoid function g (x ).

The sigmoid function (also called logistic function) is defined by

g(x) = 1 / (1 + e^{−x})

with x ∈ R.

b) Write down the posteriors for a two class problem (y = ±1) for
a given decision boundary F (x) in terms of a logistic function.


Logistic regression models the posterior probabilities directly:

p(y|x) = 1 / (1 + e^{y·F(x)})
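A tiny illustrative sketch of this posterior, following the sign convention of the formula above; the linear decision boundary F and its weights are made up:

```python
# Illustrative two-class posterior via a logistic function of a toy boundary F.
import numpy as np

def F(x):
    w, b = np.array([1.0, -0.5]), 0.2                # assumed decision boundary
    return w @ x + b

def posterior(y, x):
    return 1.0 / (1.0 + np.exp(y * F(x)))            # p(y|x) for y in {-1, +1}

x = np.array([0.3, 1.0])
print(posterior(+1, x) + posterior(-1, x))           # the two posteriors sum to 1
```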

Exercise 8: Support Vector Machine
a) Write down the objective function for Rosenblatt’s Perceptron.


The decision boundary is a linear function:

y* = sgn(α^T x + α_0)

The parameters α_0 and α are chosen according to the optimization problem

minimize D(α_0, α) = − ∑_{x_i ∈ M} y_i · (α^T x_i + α_0)

where M is the set of misclassified feature vectors. The parameters α, α_0 are
computed by gradient descent and are updated after each misclassification.
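For illustration only, a minimal numpy sketch of this update scheme on a small linearly separable toy set; the learning rate and data are assumptions:

```python
# Minimal perceptron training sketch: gradient steps on D(alpha_0, alpha)
# after each misclassification, using made-up toy data.
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    alpha, alpha0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):                     # labels y_i in {-1, +1}
            if yi * (alpha @ xi + alpha0) <= 0:      # misclassified
                alpha += lr * yi * xi                # step on the gradient of D
                alpha0 += lr * yi
                errors += 1
        if errors == 0:                              # converged (separable data)
            break
    return alpha, alpha0

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```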

b) Write down the optimization problem for SVM.


In an SVM, the distance between the following two margins has to be maximized:

α^T x_i + α_0 ≤ −1, if y_i = −1
α^T x_i + α_0 ≥ +1, if y_i = +1

This leads to the constrained optimization problem:

minimize (1/2) ||α||₂²
subject to y_i · (α^T x_i + α_0) − 1 ≥ 0   ∀i
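Illustrative sketch only: instead of solving this constrained problem with a QP solver, the equivalent soft-margin hinge-loss formulation can be minimized by subgradient descent; C, the learning rate and the toy data below are assumptions:

```python
# Sketch of a linear soft-margin SVM via the hinge-loss objective
# (1/2)||alpha||^2 + C * sum_i max(0, 1 - y_i(alpha^T x_i + alpha_0)),
# minimized by subgradient descent on toy data.
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    alpha, alpha0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ alpha + alpha0)
        viol = margins < 1                           # margin-violating samples
        grad_a = alpha - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        alpha -= lr * grad_a
        alpha0 -= lr * grad_b
    return alpha, alpha0

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(train_linear_svm(X, y))
```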

c) Explain the difference between Rosenblatt’s Perceptron and
SVM.

• SVM training is a convex optimization problem (Lagrange method, unique solution).
• SVM maximizes the distance of the samples to the separating hyperplane.
• Rosenblatt's Perceptron fits a plane using only the misclassified feature vectors (a mixed continuous and discrete problem).
• The Perceptron cannot classify patterns whose classes are not linearly separable, but the soft-margin SVM can.
