
Machine Learning Exercise Sheet 05

Linear Classification

Exercise sheets consist of two parts: homework and in-class exercises. You solve the homework exercises on your own or with your registered group and upload them to Moodle for a possible grade bonus. Upload a single PDF file with your homework solution to Moodle by 09.12.2020, 23:59 CET. We recommend typesetting your solution (using LaTeX or Word), but handwritten solutions are also accepted; if your handwritten solution is illegible, it will not be graded and you waive your right to dispute that. The in-class exercises will be solved and explained during the tutorial. You do not have to upload any solutions of the in-class exercises.

In-class Exercises
Multi-Class Classification
Problem 1: Consider a generative classification model for $C$ classes defined by class probabilities $p(y = c) = \pi_c$ and general class-conditional densities $p(x \mid y = c, \theta_c)$, where $x \in \mathbb{R}^D$ is the input feature vector and $\theta = \{\theta_c\}_{c=1}^C$ are further model parameters. Suppose we are given a training set $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$ where $y^{(n)}$ is a binary target vector of length $C$ that uses the 1-of-$C$ (one-hot) encoding scheme, so that it has components $y_c^{(n)} = \delta_{ck}$ if pattern $n$ is from class $y = k$. Assuming that the data points are i.i.d., show that the maximum-likelihood solution for the class probabilities $\pi$ is given by
\[
\pi_c = \frac{N_c}{N},
\]
where $N_c$ is the number of data points assigned to class $c$.

The data likelihood given the parameters $\{\pi_c, \theta_c\}_{c=1}^C$ is
\[
p(\mathcal{D} \mid \{\pi_c, \theta_c\}_{c=1}^C) = \prod_{n=1}^{N} \prod_{c=1}^{C} \left( p(x^{(n)} \mid \theta_c)\, \pi_c \right)^{y_c^{(n)}}
\]

and so the data log-likelihood is given by


\[
\log p(\mathcal{D} \mid \{\pi_c, \theta_c\}_{c=1}^C) = \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \log \pi_c + \text{const w.r.t. } \pi_c.
\]

In order to maximize the log-likelihood with respect to $\pi_c$ we need to preserve the constraint $\sum_c \pi_c = 1$. For this we use the method of Lagrange multipliers: we introduce $\lambda$ as an additional unconstrained parameter and instead find a local extremum of the unconstrained function
\[
\sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \log \pi_c - \lambda \left( \sum_{c=1}^{C} \pi_c - 1 \right).
\]
See the Wikipedia article on Lagrange multipliers for an intuition of why this works. This function is a sum of terms that are concave in $\pi_c$ as well as in $\lambda$, and is therefore itself concave in these variables.


We can find the extremum by finding the root of the derivatives. Setting the derivative with respect to $\pi_c$ equal to zero, we obtain
\[
\pi_c = \frac{1}{\lambda} \sum_{n=1}^{N} y_c^{(n)} = \frac{N_c}{\lambda}.
\]

Setting the derivative with respect to $\lambda$ equal to zero, we recover the original constraint
\[
\sum_{c=1}^{C} \pi_c = 1,
\]
into which we can plug the previous result $\pi_c = \frac{N_c}{\lambda}$ to obtain $\lambda = \sum_c N_c = N$. Plugging this in turn into the expression for $\pi_c$ yields
\[
\pi_c = \frac{N_c}{N},
\]
which is what we wanted to show.
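As a quick numerical sanity check (not part of the original solution; a minimal NumPy sketch with made-up data), the estimator $\pi_c = N_c / N$ is just the empirical class frequency of one-hot labels:

```python
import numpy as np

# Hypothetical one-hot label matrix Y of shape (N, C); rows are data points.
rng = np.random.default_rng(0)
classes = rng.integers(0, 3, size=100)          # 100 toy points, C = 3 classes
Y = np.eye(3)[classes]                          # one-hot encoding y^(n)

# Maximum-likelihood class priors: pi_c = N_c / N
N = Y.shape[0]
N_c = Y.sum(axis=0)                             # counts per class
pi = N_c / N

# Sanity check against the direct class frequencies
assert np.allclose(pi, np.bincount(classes, minlength=3) / N)
print(pi)
```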

Linear Discriminant Analysis


Problem 2: Using the same classification model as in the previous question, now suppose that the
class-conditional densities are given by Gaussian distributions with a shared covariance matrix, so that

p(x | y = c, θ) = p(x | θc ) = N (x | µc , Σ).

Show that the maximum likelihood estimate for the mean of the Gaussian distribution for class $c$ is given by
\[
\mu_c = \frac{1}{N_c} \sum_{\substack{n=1 \\ y^{(n)} = c}}^{N} x^{(n)},
\]

which represents the mean of the observations assigned to class c.


Similarly, show that the maximum likelihood estimate for the shared covariance matrix is given by
\[
\Sigma = \sum_{c=1}^{C} \frac{N_c}{N} S_c \quad \text{where} \quad S_c = \frac{1}{N_c} \sum_{\substack{n=1 \\ y^{(n)} = c}}^{N} (x^{(n)} - \mu_c)(x^{(n)} - \mu_c)^T.
\]

Thus Σ is given by a weighted average of the sample covariances of the data associated with each class,
in which the weighting coefficients Nc /N are the prior probabilities of the classes.

We begin by writing out the data log-likelihood.

\[
\log p(\mathcal{D} \mid \{\pi_c, \theta_c\}_{c=1}^C) = \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \log\left( \pi_c \cdot p(x^{(n)} \mid \mu_c, \Sigma) \right)
\]


Then we plug in the definition of the multivariate Gaussian
\[
= \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \left[ \log\left( (2\pi)^{-\frac{D}{2}} \det(\Sigma)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (x^{(n)} - \mu_c)^T \Sigma^{-1} (x^{(n)} - \mu_c) \right) \right) + \log \pi_c \right]
\]
and simplify:
\[
= -\frac{1}{2} \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \left( D \log 2\pi + \log \det(\Sigma) + (x^{(n)} - \mu_c)^T \Sigma^{-1} (x^{(n)} - \mu_c) - 2 \log \pi_c \right)
\]

This expression is concave in $\mu_c$, so we can obtain the maximizer by finding the root of the derivative. With the help of the Matrix Cookbook, we identify the derivative with respect to $\mu_c$ as
\[
\sum_{n=1}^{N} y_c^{(n)} \Sigma^{-1} (x^{(n)} - \mu_c),
\]

which we can set to 0 and solve for µc to obtain


\[
\mu_c = \frac{1}{\sum_{n=1}^{N} y_c^{(n)}} \sum_{n=1}^{N} y_c^{(n)} x^{(n)} = \frac{1}{N_c} \sum_{\substack{n=1 \\ y^{(n)} = c}}^{N} x^{(n)}.
\]

To find the optimal $\Sigma$, we need the trace trick
\[
a = \operatorname{Tr}(a) \ \text{for all } a \in \mathbb{R} \quad \text{and} \quad \operatorname{Tr}(ABC) = \operatorname{Tr}(BCA).
\]
With this we can rewrite
\[
(x^{(n)} - \mu_c)^T \Sigma^{-1} (x^{(n)} - \mu_c) = \operatorname{Tr}\left( \Sigma^{-1} (x^{(n)} - \mu_c)(x^{(n)} - \mu_c)^T \right)
\]
and use the matrix-trace derivative rule $\frac{\partial}{\partial A} \operatorname{Tr}(AB) = B^T$ to find the derivative of the data log-likelihood with respect to $\Sigma$. Because the log-likelihood contains both $\Sigma$ and $\Sigma^{-1}$, we convert one into the other with $\log \det A = -\log \det A^{-1}$ to obtain
\[
-\frac{1}{2} \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \left( -\log \det \Sigma^{-1} + \operatorname{Tr}\left( \Sigma^{-1} (x^{(n)} - \mu_c)(x^{(n)} - \mu_c)^T \right) \right) + \text{const w.r.t. } \Sigma.
\]
Finally, we use rule (57) from the Matrix Cookbook, $\frac{\partial \log |\det X|}{\partial X} = (X^{-1})^T$, and compute the derivative of the log-likelihood with respect to $\Sigma^{-1}$ as
\[
-\frac{1}{2} \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} \left( -\Sigma^T + (x^{(n)} - \mu_c)(x^{(n)} - \mu_c)^T \right).
\]

Setting this derivative to zero and solving for $\Sigma$, we find
\[
\Sigma = \frac{1}{\sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)}} \left( \sum_{n=1}^{N} \sum_{c=1}^{C} y_c^{(n)} (x^{(n)} - \mu_c)(x^{(n)} - \mu_c)^T \right)^T = \frac{1}{N} \sum_{c=1}^{C} \sum_{\substack{n=1 \\ y^{(n)} = c}}^{N} (x^{(n)} - \mu_c)(x^{(n)} - \mu_c)^T,
\]
which we can immediately break apart into the representation given in the problem statement.
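The two estimators translate directly into code. Below is a minimal sketch (my own illustration with synthetic data, not from the sheet) that computes the per-class means and the pooled covariance $\Sigma = \sum_c \frac{N_c}{N} S_c$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                   # toy features, D = 2
y = rng.integers(0, 3, size=200)                # toy labels, C = 3
N, D = X.shape
C = y.max() + 1

mu = np.zeros((C, D))
Sigma = np.zeros((D, D))
for c in range(C):
    Xc = X[y == c]
    N_c = len(Xc)
    mu[c] = Xc.mean(axis=0)                     # mu_c: mean of points with y^(n) = c
    S_c = (Xc - mu[c]).T @ (Xc - mu[c]) / N_c   # per-class sample covariance S_c
    Sigma += (N_c / N) * S_c                    # weighted average with weights N_c / N

print(mu, Sigma, sep="\n")
```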


Homework
Linear classification
Problem 3: We want to create a generative binary classification model for classifying non-negative one-dimensional data. This means that the labels are binary ($y \in \{0, 1\}$) and the samples satisfy $x \in [0, \infty)$. We assume uniform class probabilities
\[
p(y = 0) = p(y = 1) = \frac{1}{2}.
\]
As our samples $x$ are non-negative, we use exponential distributions (and not Gaussians) as class conditionals:
\[
p(x \mid y = 0) = \operatorname{Expo}(x \mid \lambda_0) \quad \text{and} \quad p(x \mid y = 1) = \operatorname{Expo}(x \mid \lambda_1),
\]
where $\lambda_0 \neq \lambda_1$. Assume that the parameters $\lambda_0$ and $\lambda_1$ are known and fixed.
a) What is the name of the posterior distribution p(y | x)? You only need to provide the name of the
distribution (e.g., “normal”, “gamma”, etc.), not estimate its parameters.

Bernoulli.
Remark: $y$ can only take values in $\{0, 1\}$, so the Bernoulli distribution is the only possible answer.

b) What values of $x$ are classified as class 1? (As usual, we assume that the classification decision is $\hat{y} = \arg\max_k p(y = k \mid x)$.)

Sample $x$ is classified as class 1 if $p(y = 1 \mid x) > p(y = 0 \mid x)$. This is the same as saying
\[
\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \overset{!}{>} 1 \quad \text{or equivalently} \quad \log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \overset{!}{>} 0.
\]

We begin by simplifying the left-hand side, using in the second step that the class priors are equal:
\begin{align*}
\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} &= \log \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 0)\, p(y = 0)} \\
&= \log \frac{p(x \mid y = 1)}{p(x \mid y = 0)} \\
&= \log \frac{\lambda_1 \exp(-\lambda_1 x)}{\lambda_0 \exp(-\lambda_0 x)} \\
&= \log \frac{\lambda_1}{\lambda_0} + \lambda_0 x - \lambda_1 x = \log \frac{\lambda_1}{\lambda_0} + (\lambda_0 - \lambda_1) x
\end{align*}
To figure out which $x$ are classified as class 1, we need to solve for $x$:
\[
\log \frac{\lambda_1}{\lambda_0} + (\lambda_0 - \lambda_1) x > \log 1 = 0 \quad \Leftrightarrow \quad (\lambda_0 - \lambda_1) x > -\log \frac{\lambda_1}{\lambda_0} = \log \lambda_0 - \log \lambda_1
\]


We have to be careful, because if $(\lambda_0 - \lambda_1) < 0$, dividing by it will flip the inequality sign. Hence the answer is
\[
x \in \left( \frac{\log \lambda_0 - \log \lambda_1}{\lambda_0 - \lambda_1},\ \infty \right) \ \text{if } \lambda_0 > \lambda_1, \qquad x \in \left[ 0,\ \frac{\log \lambda_0 - \log \lambda_1}{\lambda_0 - \lambda_1} \right) \ \text{otherwise}.
\]
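For concreteness, here is a small sketch with arbitrarily chosen rates $\lambda_0 > \lambda_1$ (the values are assumptions, not given in the exercise) that checks the closed-form threshold against the log-posterior-ratio decision rule:

```python
import numpy as np

lam0, lam1 = 2.0, 0.5                            # assumed example rates, lam0 != lam1

def log_posterior_ratio(x):
    # log p(y=1|x) - log p(y=0|x) with uniform priors and Expo class conditionals
    return np.log(lam1) - np.log(lam0) + (lam0 - lam1) * x

threshold = (np.log(lam0) - np.log(lam1)) / (lam0 - lam1)

x = np.linspace(0, 5, 11)
pred = (log_posterior_ratio(x) > 0).astype(int)  # decision rule y_hat = argmax_k p(y=k|x)
# With lam0 > lam1 the derivation says class 1 is chosen exactly for x > threshold
assert np.array_equal(pred, (x > threshold).astype(int))
print(threshold, pred)
```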

Problem 4: Let $\mathcal{D} = \{(x_i, y_i)\}$ be a linearly separable dataset for 2-class classification, i.e. there exists a vector $w$ such that $\operatorname{sign}(w^T x)$ separates the classes. Show that the maximum likelihood parameter $w$ of a logistic regression model has $\|w\| \to \infty$. Assume that $w$ contains the bias term.
How can we modify the training process to prefer a $w$ of finite magnitude?

In logistic regression, we model the posterior distribution as
\[
y_i \mid x_i \sim \operatorname{Bernoulli}(\sigma(w^T x_i)) \quad \text{where} \quad \sigma(a) = \frac{1}{1 + \exp(-a)}.
\]
We fit the logistic regression model by choosing the parameter $w$ that maximizes the data log-likelihood, or equivalently minimizes the negative log-likelihood, which expands to
\[
E(w) = -\log p(y \mid w, X) = -\sum_{i=1}^{N} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \right].
\]

We assumed that the dataset is linearly separable, so by definition there is a $\tilde{w}$ such that
\[
\tilde{w}^T x_i > 0 \ \text{if } y_i = 1 \quad \text{and} \quad \tilde{w}^T x_i < 0 \ \text{if } y_i = 0.
\]

Scaling this separator $\tilde{w}$ by a factor $\lambda \gg 0$ makes the negative log-likelihood smaller and smaller. To see this, we compute the limit
\[
\lim_{\lambda \to \infty} E(\lambda \tilde{w}) = -\sum_{\substack{i=1 \\ y_i = 1}}^{N} \log \Big( \lim_{\lambda \to \infty} \sigma(\underbrace{\lambda \tilde{w}^T x_i}_{> 0}) \Big) - \sum_{\substack{i=1 \\ y_i = 0}}^{N} \log \Big( 1 - \lim_{\lambda \to \infty} \sigma(\underbrace{\lambda \tilde{w}^T x_i}_{< 0}) \Big) = 0,
\]
which equals the smallest achievable value ($E$ is the negative log of a probability, so $E(w) \in [0, \infty)$ and thus $E(w) \geq 0$).


We can see that $E$ is a convex function of $w$: both $\log \sigma(a) = -\log(1 + \exp(-a))$ and $\log(1 - \sigma(a)) = -a - \log(1 + \exp(-a))$ are concave in $a$ (their second derivative is $-\sigma(a)(1 - \sigma(a)) \leq 0$), concavity is preserved under the linear map $a = w^T x_i$, and $E$ is the negative of a sum of such concave functions, hence convex.
Moreover, $E(w) > 0$ for every finite $w$, because $\sigma(w^T x_i)$ never reaches exactly $0$ or $1$. Since $E(\lambda \tilde{w})$ approaches the infimum $0$ only as $\lambda \to \infty$, $E$ has no finite minimizer; its minimum is only attained in the limit. It follows that any solution to the loss minimization problem has infinite norm.
Because $E$ is convex and only approaches $0$ in some directions without attaining it, we can move the minimizer back into the space of finite vectors by adding any convex term that attains its minimum, such as the weight regularizer $w^T w$.
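The following toy experiment (my own sketch; the dataset, step size and step count are assumptions) illustrates both points: on separable data, gradient descent on the unregularized negative log-likelihood keeps inflating $\|w\|$, while an added L2 penalty yields a finite solution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Linearly separable 1-D toy data with a bias feature appended as a second column
x = np.concatenate([rng.uniform(1, 2, 20), rng.uniform(-2, -1, 20)])
y = np.concatenate([np.ones(20), np.zeros(20)])
X = np.stack([x, np.ones_like(x)], axis=1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit(l2=0.0, steps=20000, lr=0.1):
    w = np.zeros(2)
    for _ in range(steps):
        # gradient of the negative log-likelihood plus an L2 penalty term
        grad = X.T @ (sigmoid(X @ w) - y) + l2 * w
        w -= lr * grad / len(y)
    return w

print(np.linalg.norm(fit(l2=0.0)))   # keeps growing as steps increase (diverges)
print(np.linalg.norm(fit(l2=1.0)))   # stays finite thanks to the regularizer
```

Re-running `fit` with more steps makes the unregularized norm grow further, whereas the regularized norm stabilizes.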

Problem 5: Show that the softmax function is equivalent to a sigmoid in the 2-class case.

\begin{align*}
\frac{\exp(w_1^T x)}{\exp(w_1^T x) + \exp(w_0^T x)} &= \frac{1}{1 + \exp(w_0^T x) / \exp(w_1^T x)} \\
&= \frac{1}{1 + \exp(w_0^T x - w_1^T x)} \\
&= \frac{1}{1 + \exp(-(w_1 - w_0)^T x)} \\
&= \sigma(\hat{w}^T x)
\end{align*}
where $\hat{w} = w_1 - w_0$.
One conclusion we can draw from this is that if we have $C$ parameter vectors $w_c$ for $C$ classes, the softmax model is unidentifiable: adding the same constant vector $\tau \in \mathbb{R}^D$ to every $w_c$, i.e. $w_c := w_c + \tau$, leads to the same model. We can fix this issue by constraining one of the vectors, e.g. $w_0 = 0$, which is what is done implicitly when we use the sigmoid (instead of the 2-class softmax) in binary classification.
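A quick numerical check of this identity (a sketch with arbitrary parameters, not part of the sheet):

```python
import numpy as np

rng = np.random.default_rng(0)
w0, w1, x = rng.normal(size=(3, 4))              # arbitrary parameters and input, D = 4

# First component of the 2-class softmax vs. sigmoid of (w1 - w0)^T x
softmax_p1 = np.exp(w1 @ x) / (np.exp(w1 @ x) + np.exp(w0 @ x))
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(w1 - w0) @ x))
assert np.isclose(softmax_p1, sigmoid_p1)
```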

Problem 6: Show that the derivative of the sigmoid function $\sigma(a) = (1 + e^{-a})^{-1}$ can be written as
\[
\frac{\partial \sigma(a)}{\partial a} = \sigma(a)\,(1 - \sigma(a)).
\]

\[
\frac{\partial \sigma(a)}{\partial a} = -\frac{1}{(1 + e^{-a})^2} \cdot e^{-a} \cdot (-1) = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\, \frac{1 + e^{-a} - 1}{1 + e^{-a}} = \sigma(a)\,(1 - \sigma(a))
\]
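As a sanity check (my own sketch), the identity can be verified against a central finite difference:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                      # derived identity
assert np.allclose(numeric, analytic, atol=1e-8)
```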

Problem 7: Give a basis function $\phi(x_1, x_2)$ that makes the data in the example below linearly separable (crosses in one class, circles in the other).

[Figure not reproduced: a scatter plot of circles and crosses in the $(x_1, x_2)$ plane.]


One example is $\phi(x) = x_1 x_2$, which makes the data separable with the (one-dimensional) weight $w = 1$: the circles are mapped to positive real numbers while the crosses are mapped to negative ones, i.e. $w\,\phi(x) > 0$ if $x$ is a circle and $w\,\phi(x) < 0$ otherwise.
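A small sketch of this construction, assuming (as the omitted figure suggests) that circles lie in the quadrants where $x_1$ and $x_2$ share a sign and crosses where they do not:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed layout: circles in quadrants 1 and 3, crosses in quadrants 2 and 4
circles = rng.uniform(0.5, 2, size=(20, 2)) * rng.choice([-1, 1], size=(20, 1))
crosses = rng.uniform(0.5, 2, size=(20, 2)) * np.array([1, -1]) * rng.choice([-1, 1], size=(20, 1))

def phi(X):
    return X[:, 0] * X[:, 1]          # basis function phi(x) = x1 * x2

assert np.all(phi(circles) > 0)       # circles map to positive values
assert np.all(phi(crosses) < 0)       # crosses map to negative values
```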

Naive Bayes
Problem 8: In 2-class classification the decision boundary Γ is the set of points where both classes are
assigned equal probability,
Γ = {x | p(y = 1 | x) = p(y = 0 | x)} .
Show that Naive Bayes with Gaussian class likelihoods produces a quadratic decision boundary in the 2-class case, i.e. that $\Gamma$ can be written with a quadratic equation of $x$,
\[
\Gamma = \left\{ x \mid x^T A x + b^T x + c = 0 \right\},
\]
for some $A$, $b$ and $c$.


As a reminder, in Naive Bayes we assume class prior probabilities

p(y = 0) = π0 and p(y = 1) = π1

and class likelihoods


p(x | y = c) = N (x | µc , Σc )
with per-class means µc and diagonal (because of the feature independence) covariances Σc .

Because $p(y = 1 \mid x) + p(y = 0 \mid x) = 1$ and we want them to be equal, we can assume that $p(y = 0 \mid x) > 0$ and rewrite the defining equation as
\[
\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = 1.
\]


Now apply the logarithm to both sides and simplify.
\begin{align*}
\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} &= \log \left( \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)} \cdot \frac{p(x)}{p(x \mid y = 0)\, p(y = 0)} \right) \\
&= \log \left( p(x \mid y = 1)\, p(y = 1) \right) - \log \left( p(x \mid y = 0)\, p(y = 0) \right) \\
&= \log \mathcal{N}(x \mid \mu_1, S_1) - \log \mathcal{N}(x \mid \mu_0, S_0) + \log \frac{\pi_1}{\pi_0} \\
&= -\frac{1}{2} \log\left( (2\pi)^D |S_1| \right) - \frac{1}{2} (x - \mu_1)^T S_1^{-1} (x - \mu_1) \\
&\quad + \frac{1}{2} \log\left( (2\pi)^D |S_0| \right) + \frac{1}{2} (x - \mu_0)^T S_0^{-1} (x - \mu_0) + \log \frac{\pi_1}{\pi_0} \\
&= -\frac{1}{2} x^T S_1^{-1} x + x^T S_1^{-1} \mu_1 - \frac{1}{2} \mu_1^T S_1^{-1} \mu_1 \\
&\quad + \frac{1}{2} x^T S_0^{-1} x - x^T S_0^{-1} \mu_0 + \frac{1}{2} \mu_0^T S_0^{-1} \mu_0 + \frac{1}{2} \log \frac{|S_0|}{|S_1|} + \log \frac{\pi_1}{\pi_0} \\
&= \frac{1}{2} x^T \left[ S_0^{-1} - S_1^{-1} \right] x + x^T \left[ S_1^{-1} \mu_1 - S_0^{-1} \mu_0 \right] \\
&\quad - \frac{1}{2} \mu_1^T S_1^{-1} \mu_1 + \frac{1}{2} \mu_0^T S_0^{-1} \mu_0 + \log \frac{\pi_1}{\pi_0} + \frac{1}{2} \log \frac{|S_0|}{|S_1|}
\end{align*}

This shows that $\Gamma$ is quadratic and can alternatively be written as
\[
\Gamma = \left\{ x \mid x^T A x + b^T x + c = 0 \right\}
\]
where
\[
A = \frac{1}{2} \left[ S_0^{-1} - S_1^{-1} \right], \qquad b = S_1^{-1} \mu_1 - S_0^{-1} \mu_0,
\]
\[
c = -\frac{1}{2} \mu_1^T S_1^{-1} \mu_1 + \frac{1}{2} \mu_0^T S_0^{-1} \mu_0 + \log \frac{\pi_1}{\pi_0} + \frac{1}{2} \log \frac{|S_0|}{|S_1|}.
\]

If both classes had the same covariance matrix ($S_0 = S_1$), $A$ would be the zero matrix and we would recover a linear decision boundary, as in the lecture (also, $\log \frac{|S_0|}{|S_1|} = 0$).
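To tie the formulas together, the following sketch (my own, with arbitrary diagonal covariances and priors) checks that $x^T A x + b^T x + c$ agrees with the log-posterior ratio computed directly from the two Gaussians:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 2
mu0, mu1 = rng.normal(size=(2, D))
S0 = np.diag(rng.uniform(0.5, 2.0, D))           # diagonal covariances (naive Bayes)
S1 = np.diag(rng.uniform(0.5, 2.0, D))
pi0, pi1 = 0.4, 0.6                              # assumed class priors

A = 0.5 * (np.linalg.inv(S0) - np.linalg.inv(S1))
b = np.linalg.inv(S1) @ mu1 - np.linalg.inv(S0) @ mu0
c = (-0.5 * mu1 @ np.linalg.inv(S1) @ mu1 + 0.5 * mu0 @ np.linalg.inv(S0) @ mu0
     + np.log(pi1 / pi0) + 0.5 * np.log(np.linalg.det(S0) / np.linalg.det(S1)))

x = rng.normal(size=D)
quadratic = x @ A @ x + b @ x + c
log_ratio = (multivariate_normal.logpdf(x, mu1, S1) + np.log(pi1)
             - multivariate_normal.logpdf(x, mu0, S0) - np.log(pi0))
assert np.isclose(quadratic, log_ratio)          # the two expressions agree for any x
```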
