
Refresh your knowledge

Theoretical Exercises
January 25th, 2023
Exercise 1: Bayesian Classifier
a) What is the difference between discriminative and generative
modeling?


• Generative modeling: modeling and estimation of p(y) and p(x|y).
• Discriminative modeling: estimation of p(y|x) directly.

Generative: The probability distribution of the data is learned.
Discriminative: A decision boundary is learned.
b) What is the decision rule of the Bayesian classifier?

Bayes' rule:

p(x, y) = p(y) · p(x|y)        (joint pdf = prior · class-conditional pdf)
        = p(x) · p(y|x)        (joint pdf = evidence · posterior)

The posterior probability is then given as

p(y|x) = p(y) · p(x|y) / p(x)


Bayesian decision rule:

y* = arg max_y p(y|x)
   = arg max_y p(y) · p(x|y) / p(x)
   = arg max_y p(y) · p(x|y)
   = arg max_y [log p(y) + log p(x|y)]
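As a quick illustration (not part of the original slides), a minimal numpy/scipy sketch of this decision rule with Gaussian class-conditional densities; the priors, means and covariances are made-up toy values:

```python
# Minimal sketch (not from the slides): arg max_y [log p(y) + log p(x|y)]
# with Gaussian class-conditional densities and made-up toy parameters.
import numpy as np
from scipy.stats import multivariate_normal

priors = np.array([0.3, 0.7])                          # p(y) for classes 0 and 1
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]   # class-conditional means
covs = [np.eye(2), np.eye(2)]                          # class-conditional covariances

def bayes_decide(x):
    # the evidence p(x) is dropped, it does not depend on y
    scores = [np.log(priors[y]) + multivariate_normal.logpdf(x, means[y], covs[y])
              for y in range(len(priors))]
    return int(np.argmax(scores))

print(bayes_decide(np.array([1.8, 1.5])))              # -> 1
```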

c) Simplify the decision rule if no prior knowledge about the occurrence of the
classes is available.

y* = arg max_y p(x|y)

d) Show the optimality of the Bayesian classifier for the (0,1)-loss
function.

l(y_1, y_2) is the loss if a feature vector belonging to class y_2 is assigned to class y_1.
The (0,1)-loss function is defined by

l(y_1, y_2) = 0, if y_1 = y_2
              1, otherwise

The best decision rule minimizes the average loss

AL(x, y) = ∑_{y′} l(y, y′) p(y′|x)


Using the (0,1)-loss function, the class decision is based on

y* = arg min_y AL(x, y)
   = arg min_y ∑_{y′} l(y, y′) p(y′|x)
   = arg min_y (1 − p(y|x))
   = arg max_y p(y|x)

The optimal classifier according to the (0,1)-loss function applies the Bayesian
decision rule. This classifier is called the Bayesian classifier.
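A tiny numeric check of this equivalence (illustrative only, with made-up posterior values): minimizing the average (0,1)-loss selects the same class as maximizing the posterior.

```python
# Tiny numeric check with made-up posterior values (illustrative only):
# arg min_y AL(x, y) under the (0,1)-loss equals arg max_y p(y|x).
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])               # p(y'|x) for three classes
loss = 1.0 - np.eye(3)                              # (0,1)-loss matrix l(y, y')

avg_loss = loss @ posterior                         # AL(x, y) = 1 - p(y|x)
print(avg_loss)                                     # [0.8 0.5 0.7]
assert np.argmin(avg_loss) == np.argmax(posterior)  # both select class 1
```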

Exercise 2: Discriminant Analysis
a) Write down the objective function for PCA.


We search a mapping Φ that maximizes the spread of the features:

Φ* = arg max_Φ ∑_{i,j} (Φx_i − Φx_j)^T (Φx_i − Φx_j) + λ (||Φ||₂² − 1)

Optimizing using the method of Lagrangian multipliers yields:

Σ Φ^T = λ′ Φ^T

where the covariance matrix is given by:

Σ = (1/m) ∑_i (x_i − µ)(x_i − µ)^T
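For illustration (not part of the original slides), a minimal numpy sketch of PCA via the eigendecomposition of this covariance matrix; the toy data and variable names are assumptions:

```python
# Minimal numpy sketch of PCA: eigenvectors of the sample covariance matrix
# give the rows of the projection Phi; the data is random toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])   # toy data, 3 features

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / X.shape[0]           # (1/m) sum_i (x_i - mu)(x_i - mu)^T
eigval, eigvec = np.linalg.eigh(Sigma)               # solves Sigma Phi^T = lambda' Phi^T

Phi = eigvec[:, ::-1].T                              # rows sorted by decreasing variance
X_proj = (X - mu) @ Phi[:2].T                        # project onto the first 2 components
print(eigval[::-1])                                  # eigenvalues, largest first
```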

b) Write down the objective function for LDA.


We search a mapping Φ that maximizes the spread of the L-dimensional projection
of the (K − 1)-dimensional projection of the centroids:

Φ* = arg max_Φ (1/K) ∑_{y=1}^{K} (Φµ′_y − Φµ̄′)^T (Φµ′_y − Φµ̄′) + ∑_{i=1}^{L} λ_i (||Φ_i||₂² − 1)

Optimizing using the method of Lagrangian multipliers yields:

Σ_inter Φ^T = λ′ Φ^T

where the inter-class covariance matrix is given by:

Σ_inter = (1/m) ∑_i (x_i − µ_{y_i})(x_i − µ_{y_i})^T
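Again for illustration only (not the slides' code), a short numpy sketch that builds the matrix defined above from labelled toy data and solves its eigenvalue/eigenvector problem; the data and names are made up:

```python
# Illustrative sketch: form the matrix defined above from labelled toy data
# and compute its eigenvectors as projection directions.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(50, 2)),
               rng.normal(loc=[3.0, 3.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

mu = {c: X[y == c].mean(axis=0) for c in np.unique(y)}       # class means mu_y
D = np.stack([X[i] - mu[y[i]] for i in range(len(X))])
S = D.T @ D / len(X)                                 # (1/m) sum_i (x_i - mu_{y_i})(...)^T

eigval, eigvec = np.linalg.eigh(S)                   # S Phi^T = lambda' Phi^T
Phi = eigvec[:, ::-1].T                              # directions sorted by eigenvalue
print(eigval[::-1])
```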

c) Describe the differences between PCA and LDA.

• Unlike LDA, PCA does not require a labelled (classified) set of feature vectors.
• PCA-transformed features are approximately normally distributed (central limit theorem).
• PCA uses the covariance matrix, whereas LDA uses the inter-class covariance matrix, to solve the respective eigenvalue/eigenvector problem.

Exercise 3: Gaussian Mixture Model and EM
a) Write down the general form of a Gaussian mixture model.

p(x) = ∑_{k=1}^{K} p_k N(x; µ_k, Σ_k)

b) Which parameters of the GMM can be estimated using the EM
algorithm?


• µ_k: the K means
• Σ_k: the K covariance matrices of size d × d
• p_k: the fraction of all features in component k
• p(k|x_i) ≡ p_ik: the K probabilities for each of the m feature vectors x_i

Additional estimates:

• p(x): the probability distribution of observing a feature vector x
• L: the overall log-likelihood function of the estimated parameter set

c) How do you initialize the EM algorithm?
Use k-means to find an initial guess for µ_k^(0).
Compute p_k^(0) and Σ_k^(0) based on the K clusters.

d) Describe the basic steps of the EM algorithm for GMMs.

E-step:

  p_ik ≡ p(k|i) = p_k N(x_i | µ_k, Σ_k) / p(x_i)
  L = ∑_{i=1}^{m} log p(x_i)

M-step:

  µ̂_k = ∑_i p_ik x_i / ∑_i p_ik
  Σ̂_k = ∑_i p_ik (x_i − µ̂_k)(x_i − µ̂_k)^T / ∑_i p_ik
  p̂_k = (1/m) ∑_{i=1}^{m} p_ik
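For illustration (not from the slides), a compact numpy/scipy sketch of one EM iteration; the variable names mirror the quantities above, everything else is an assumption:

```python
# Compact sketch of one EM iteration for a GMM (illustrative only).
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, p, mu, Sigma):
    m, K = X.shape[0], len(p)
    # E-step: responsibilities p_ik = p_k N(x_i | mu_k, Sigma_k) / p(x_i)
    dens = np.stack([p[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                     for k in range(K)], axis=1)     # shape (m, K)
    p_x = dens.sum(axis=1, keepdims=True)            # p(x_i)
    p_ik = dens / p_x
    log_lik = np.log(p_x).sum()                      # L = sum_i log p(x_i)
    # M-step: responsibility-weighted means, covariances and mixture weights
    Nk = p_ik.sum(axis=0)
    mu_new = [(p_ik[:, k:k + 1] * X).sum(axis=0) / Nk[k] for k in range(K)]
    Sigma_new = [(p_ik[:, k:k + 1] * (X - mu_new[k])).T @ (X - mu_new[k]) / Nk[k]
                 for k in range(K)]
    p_new = Nk / m
    return p_new, mu_new, Sigma_new, log_lik

# iterate until log_lik stops increasing, e.g.:
# p, mu, Sigma, log_lik = em_step(X, p, mu, Sigma)
```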

Exercise 4: Kernel PCA
a) Describe the key idea of Kernel PCA.

Features are virtually transformed into a higher-dimensional space in which they
can be linearly separated.

b) Explain the kernel trick and give the corresponding equation.

The Kernel Trick


For any algorithm that is formulated in terms of a positive semidefinite kernel k, we
can derive an alternative algorithm by replacing the kernel function k with another
positive semidefinite kernel k′.
Example:
For the PCA eigenvalue/eigenvector problem:

Σ e_i = λ_i e_i

where:

Σ = (1/m) ∑_{i=1}^{m} x_i x_i^T ∈ R^{d×d}

A given eigenvector can be written as a linear combination of the feature vectors:

e_i = ∑_k α_{i,k} x_k


The eigenvalue/eigenvector problem of PCA can be rewritten as

Σ e_i = λ_i e_i
(1/m) ∑_{j=1}^{m} x_j x_j^T · ∑_k α_{i,k} x_k = λ_i ∑_k α_{i,k} x_k
∑_{j,k} α_{i,k} x_j x_j^T x_k = m · λ_i ∑_k α_{i,k} x_k

These equations have to be fulfilled for all projections onto x_l, i.e. for all indices l:

∑_{j,k} α_{i,k} x_l^T x_j x_j^T x_k = m · λ_i ∑_k α_{i,k} x_l^T x_k

All feature vectors now appear only in terms of inner products, so the kernel trick
can be applied:

∑_{j,k} α_{i,k} k(x_l, x_j) · k(x_j, x_k) = m · λ_i ∑_k α_{i,k} k(x_l, x_k)
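As an illustrative sketch only (not part of the slides), kernel PCA with an RBF kernel in numpy; centering of the kernel matrix is included although the derivation above omits it, and normalization of the α's is skipped for brevity:

```python
# Illustrative kernel PCA sketch with an RBF kernel and toy data.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                        # toy data

K = rbf_kernel(X)                                    # K[l, j] = k(x_l, x_j)
m = K.shape[0]
ones = np.full((m, m), 1.0 / m)
K_c = K - ones @ K - K @ ones + ones @ K @ ones      # center in feature space

eigval, alpha = np.linalg.eigh(K_c)                  # columns hold the alpha_{i,k}
proj = K_c @ alpha[:, ::-1][:, :2]                   # projection onto top 2 components
print(eigval[::-1][:2])
```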

Exercise 5: Maximum Likelihood Estimation
a) Write down the log-likelihood function to estimate the
parameters µ and Σ of a Gaussian probability density
N (x ; µ, Σ) from training data x1 . . . xm .


L(x_1, ..., x_m; µ, Σ) = ∑_{i=1}^{m} log N(x_i; µ, Σ)

                       = ∑_{i=1}^{m} [ −(1/2) log(|2πΣ|) − (1/2) (x_i − µ)^T Σ^{-1} (x_i − µ) ]

b) Write down the ML estimators for µ and Σ.

µ = (1/m) ∑_{i=1}^{m} x_i

Σ = (1/m) ∑_{i=1}^{m} (x_i − µ) (x_i − µ)^T
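A short numpy check with random toy data (illustrative only): the ML estimators above are simply the sample mean and the 1/m-normalized sample covariance:

```python
# Illustrative check of the ML estimators on random toy data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=5000)

mu_hat = X.mean(axis=0)                              # (1/m) sum_i x_i
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)   # (1/m) sum_i (x_i - mu)(x_i - mu)^T
print(mu_hat)
print(Sigma_hat)
```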

Exercise 6: Naive Bayes
a) Which independence assumption is used for naive Bayes?

All d components of the feature vector x are assumed to be mutually independent:

p(x|y) = ∏_{i=1}^{d} p(x_i|y)

b) What is the decision rule of naive Bayes?

The decision rule of naive Bayes is

y* = arg max_y p(y|x)
   = arg max_y p(y) · p(x|y)
   = arg max_y p(y) ∏_{i=1}^{d} p(x_i|y)
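For illustration (not from the slides), a minimal sketch of this decision rule with per-dimension Gaussian class conditionals p(x_i|y); all parameter values are made up:

```python
# Minimal naive Bayes decision sketch with made-up Gaussian class conditionals.
import numpy as np
from scipy.stats import norm

priors = {0: 0.4, 1: 0.6}                            # p(y)
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
stds = {0: np.array([1.0, 1.0]), 1: np.array([1.0, 2.0])}

def naive_bayes_decide(x):
    # log p(y) + sum_i log p(x_i|y) for each class
    scores = {y: np.log(priors[y]) + norm.logpdf(x, means[y], stds[y]).sum()
              for y in priors}
    return max(scores, key=scores.get)

print(naive_bayes_decide(np.array([1.5, 0.5])))      # -> 1
```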

c) What is the structure of the covariance matrix of
normally distributed classes in naive Bayes?

Diagonal matrix (the variances are the diagonal elements).

Exercise 7: Sigmoid Function
a) Write down the Sigmoid function g (x ).

The sigmoid function (also called logistic function) is defined by

g(x) = 1 / (1 + e^{−x})

with x ∈ R.

b) Write down the posteriors for a two class problem (y = ±1) for
a given decision boundary F (x) in terms of a logistic function.


Logistic regression models the posterior probabilities directly:

p(y|x) = 1 / (1 + e^{y·F(x)})
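A tiny illustrative sketch of this posterior, following the sign convention of the formula above; the linear decision boundary F and its weights are made up:

```python
# Illustrative two-class posterior via a logistic function of a toy boundary F.
import numpy as np

def F(x):
    w, b = np.array([1.0, -0.5]), 0.2                # assumed decision boundary
    return w @ x + b

def posterior(y, x):
    return 1.0 / (1.0 + np.exp(y * F(x)))            # p(y|x) for y in {-1, +1}

x = np.array([0.3, 1.0])
print(posterior(+1, x) + posterior(-1, x))           # the two posteriors sum to 1
```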

Exercise 8: Support Vector Machine
a) Write down the objective function for Rosenblatt’s Perceptron.


The decision boundary is a linear function:

y* = sgn(α^T x + α_0)

The parameters α_0 and α are chosen according to the optimization problem

minimize D(α_0, α) = − ∑_{x_i ∈ M} y_i · (α^T x_i + α_0)

where M is the set of misclassified feature vectors. The parameters α, α_0 are
computed by gradient descent and are updated after each misclassification.
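For illustration only, a minimal numpy sketch of this update scheme on a small linearly separable toy set; the learning rate and data are assumptions:

```python
# Minimal perceptron training sketch: gradient steps on D(alpha_0, alpha)
# after each misclassification, using made-up toy data.
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    alpha, alpha0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):                     # labels y_i in {-1, +1}
            if yi * (alpha @ xi + alpha0) <= 0:      # misclassified
                alpha += lr * yi * xi                # step on the gradient of D
                alpha0 += lr * yi
                errors += 1
        if errors == 0:                              # converged (separable data)
            break
    return alpha, alpha0

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```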

b) Write down the optimization problem for SVM.


In an SVM, the distance between the following two margins has to be maximized:

α^T x_i + α_0 ≤ −1, if y_i = −1
α^T x_i + α_0 ≥ +1, if y_i = +1

This leads to the constrained optimization problem:

minimize (1/2) ||α||₂²
subject to y_i · (α^T x_i + α_0) − 1 ≥ 0   ∀i
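Illustrative sketch only: instead of solving this constrained problem with a QP solver, the equivalent soft-margin hinge-loss formulation can be minimized by subgradient descent; C, the learning rate and the toy data below are assumptions:

```python
# Sketch of a linear soft-margin SVM via the hinge-loss objective
# (1/2)||alpha||^2 + C * sum_i max(0, 1 - y_i(alpha^T x_i + alpha_0)),
# minimized by subgradient descent on toy data.
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    alpha, alpha0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ alpha + alpha0)
        viol = margins < 1                           # margin-violating samples
        grad_a = alpha - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        alpha -= lr * grad_a
        alpha0 -= lr * grad_b
    return alpha, alpha0

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(train_linear_svm(X, y))
```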

c) Explain the difference between Rosenblatt’s Perceptron and
SVM.

• SVM training is a convex optimization problem (Lagrange method, unique solution).
• SVM maximizes the distance of the samples to the separating hyperplane.
• Rosenblatt's Perceptron fits a plane using only the misclassified feature vectors (a mixed continuous and discrete problem).
• The Perceptron cannot classify patterns whose classes are not linearly separable, but the soft-margin SVM can.
