
CS 189 Introduction to Machine Learning

Spring 2018 DIS9

1 Quadratic Discriminant Analysis (QDA)

We have training data for a two class classification problem as laid out in Figure 1. The black
dots are examples of the positive class (y = +1) and the white dots examples of the negative class
(y = −1).

Figure 1: Draw your answers to the QDA problem.

(a) Draw on Figure 1 the position of the class centroids µ(+) and µ(−) for the positive and negative
class respectively, and indicate them as circled (+) and (−). Give their coordinates:
" # " #
µ(+) = µ(−) =

Solution:
$$\mu_{(+)} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \qquad \mu_{(-)} = \begin{bmatrix} 3 \\ 0 \end{bmatrix}$$

(b) Compute the covariance matrices for each class:


" # " #
Σ(+) = Σ(−) =

Solution:
$$\Sigma_{(+)} = \begin{bmatrix} 1/2 & 0 \\ 0 & 2 \end{bmatrix}, \qquad \Sigma_{(-)} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$
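As a quick sanity check, here is a minimal NumPy sketch of this computation. The actual points from Figure 1 are not reproduced here, so `X_pos` and `X_neg` below are hypothetical stand-ins chosen to match the stated means and covariances; QDA uses the maximum-likelihood (1/n) covariance estimate, hence `bias=True`.

```python
import numpy as np

# Hypothetical stand-ins for the Figure 1 points (one row per sample),
# chosen so the class statistics match the solution above.
X_pos = np.array([[3.0, 1.0], [2.0, 3.0], [4.0, 3.0], [3.0, 5.0]])
X_neg = np.array([[1.0, 0.0], [5.0, 0.0], [3.0, 2.0], [3.0, -2.0]])

mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
# rowvar=False: columns are the two features; bias=True gives the 1/n (MLE) estimate.
Sigma_pos = np.cov(X_pos, rowvar=False, bias=True)
Sigma_neg = np.cov(X_neg, rowvar=False, bias=True)

print(mu_pos, mu_neg)   # [3. 3.] [3. 0.]
print(Sigma_pos)        # [[0.5 0. ]  [0.  2. ]]
print(Sigma_neg)        # [[2. 0.]  [0. 2.]]
```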

(c) Assume each class has data distributed according to a bi-variate Gaussian, centered on the
class centroids computed in question (a). Draw on Figure 1 the contour of equal likelihood
p(X = x|Y = y) going through the data samples, for each class. Indicate with light lines the
principal axes of the data distribution for each class.
(d) Compute the determinant and the inverse of $\Sigma_{(+)}$ and $\Sigma_{(-)}$:
$$|\Sigma_{(+)}| = \qquad\qquad |\Sigma_{(-)}| =$$
$$\Sigma_{(+)}^{-1} = \begin{bmatrix} \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} \end{bmatrix} \qquad\qquad \Sigma_{(-)}^{-1} = \begin{bmatrix} \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} \end{bmatrix}$$

Solution:
$$|\Sigma_{(+)}| = 1, \qquad |\Sigma_{(-)}| = 4$$
$$\Sigma_{(+)}^{-1} = \begin{bmatrix} 2 & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad \Sigma_{(-)}^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix}$$
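These values are easy to verify numerically, e.g.:

```python
import numpy as np

Sigma_pos = np.array([[0.5, 0.0], [0.0, 2.0]])
Sigma_neg = np.array([[2.0, 0.0], [0.0, 2.0]])

print(np.linalg.det(Sigma_pos), np.linalg.det(Sigma_neg))  # ~1.0  ~4.0
print(np.linalg.inv(Sigma_pos))                            # [[2.  0. ]  [0.  0.5]]
print(np.linalg.inv(Sigma_neg))                            # [[0.5 0. ]  [0.  0.5]]
```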

(e) The likelihood of examples of the positive class is given by:
$$p(X = x \mid Y = +1) = \frac{1}{2\pi |\Sigma_{(+)}|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_{(+)})^\top \Sigma_{(+)}^{-1} (x - \mu_{(+)}) \right)$$
and there is a similar formula for $p(X = x \mid Y = -1)$.

Compute $f_{(+)}(x) = \log\big(p(X = x \mid Y = +1)\big)$ and $f_{(-)}(x) = \log\big(p(X = x \mid Y = -1)\big)$. Then compute the discriminant function $f(x) = f_{(+)}(x) - f_{(-)}(x)$:
$$f_{(+)}(x) =$$
$$f_{(-)}(x) =$$
$$f(x) =$$

Solution:
$$f_{(+)}(x) = -\frac{1}{2}(x - \mu_{(+)})^\top \Sigma_{(+)}^{-1} (x - \mu_{(+)}) - \log\big(2\pi |\Sigma_{(+)}|^{1/2}\big) = -(x_1 - 3)^2 - \frac{1}{4}(x_2 - 3)^2 - \log(2\pi)$$
$$f_{(-)}(x) = -\frac{1}{2}(x - \mu_{(-)})^\top \Sigma_{(-)}^{-1} (x - \mu_{(-)}) - \log\big(2\pi |\Sigma_{(-)}|^{1/2}\big) = -\frac{1}{4}(x_1 - 3)^2 - \frac{1}{4}x_2^2 - \log(4\pi)$$
$$f(x) = -\frac{1}{2}\Big( (x - \mu_{(+)})^\top \Sigma_{(+)}^{-1} (x - \mu_{(+)}) - (x - \mu_{(-)})^\top \Sigma_{(-)}^{-1} (x - \mu_{(-)}) \Big) - \frac{1}{2}\log\frac{|\Sigma_{(+)}|}{|\Sigma_{(-)}|} = -\frac{3}{4}(x_1 - 3)^2 + \frac{3}{2}x_2 - \frac{9}{4} + \log(2)$$
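A small sketch that checks the closed-form discriminant against the Gaussian log-densities (using `scipy.stats.multivariate_normal`; the test point is an arbitrary choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu_pos, Sigma_pos = np.array([3.0, 3.0]), np.array([[0.5, 0.0], [0.0, 2.0]])
mu_neg, Sigma_neg = np.array([3.0, 0.0]), np.array([[2.0, 0.0], [0.0, 2.0]])

def f_closed_form(x1, x2):
    # f(x) = -3/4 (x1 - 3)^2 + 3/2 x2 - 9/4 + log 2, as derived above
    return -0.75 * (x1 - 3) ** 2 + 1.5 * x2 - 2.25 + np.log(2)

x = np.array([4.0, 1.0])  # arbitrary test point
f_direct = (multivariate_normal.logpdf(x, mu_pos, Sigma_pos)
            - multivariate_normal.logpdf(x, mu_neg, Sigma_neg))
print(np.isclose(f_direct, f_closed_form(*x)))  # True
```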

(f) Draw on Figure 1, for each class, contours of increasing equal likelihood. Geometrically construct the Bayes optimal decision boundary. Compare to the formula obtained with $f(x) = 0$ after expressing $x_2$ as a function of $x_1$:
$$x_2 =$$
What type of function is it?


Solution: We set f(x) = 0, so:
$$x_2 = \frac{1}{2}(x_1 - 3)^2 + \frac{3}{2} - \frac{2}{3}\log(2)$$
This is a quadratic function of $x_1$, so the decision boundary is a parabola.
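To visualize this, one could overlay the parabola on Figure 1 with a short matplotlib sketch (the $x_1$ range is an arbitrary choice meant to cover the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = np.linspace(0.0, 6.0, 200)  # arbitrary range covering the figure
x2 = 0.5 * (x1 - 3) ** 2 + 1.5 - (2.0 / 3.0) * np.log(2)

plt.plot(x1, x2, label="f(x) = 0 (QDA boundary)")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.legend()
plt.show()
```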

(g) Now assume $p(Y = -1) \neq p(Y = +1)$. How does this change the decision boundary?
Solution: We get that:
$$\log\big(p(X = x, Y = +1)\big) = f_{(+)}(x) + \log\big(p(Y = +1)\big)$$
$$\log\big(p(X = x, Y = -1)\big) = f_{(-)}(x) + \log\big(p(Y = -1)\big)$$
To find the boundary we need to find where $\log\big(p(X = x, Y = +1)\big) = \log\big(p(X = x, Y = -1)\big)$:
$$f_{(+)}(x) + \log\big(p(Y = +1)\big) = f_{(-)}(x) + \log\big(p(Y = -1)\big)$$
$$f(x) + \log\frac{p(Y = +1)}{p(Y = -1)} = 0$$
$$x_2 = \frac{1}{2}(x_1 - 3)^2 + \frac{3}{2} - \frac{2}{3}\log(2) + \frac{2}{3}\log\frac{p(Y = -1)}{p(Y = +1)}$$
The boundary is shifted by $\frac{2}{3}\log\frac{p(Y = -1)}{p(Y = +1)}$.
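For a concrete (hypothetical) example: if $p(Y = -1) = 3\,p(Y = +1)$, the parabola shifts upward by $\frac{2}{3}\log 3 \approx 0.73$, enlarging the region classified as negative; with equal priors the extra term vanishes and we recover the boundary from part (f).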

2 Logistic Regression
In this problem, we will explore logistic regression and derive some insights.

(a) You are given the following datasets:

Assume you are using (regularized) least squares for classification. Draw the decision boundary for the dataset above. Recall that the optimization problem has the following form:
$$\arg\min_{w} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$$

Solution:

During the optimization process the magnitude of $w^\top x$ matters (it enters the squared loss), but we classify a point based only on $\operatorname{sign}(w^\top x)$. Notice that even though the decision boundary fit on the first dataset would still be valid for the others, the cluster of points in the bottom-right corner incurs larger squared errors, which pulls the decision boundary down toward those points.
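A minimal sketch of this classifier, under the assumption that the data sits in hypothetical arrays `X` (one row per point, with a constant bias feature appended) and `y` with labels ±1; `lam` is an arbitrary regularization strength:

```python
import numpy as np

def fit_ls_classifier(X, y, lam=0.1):
    """Ridge-regularized least squares: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predict(w, X):
    # Classification uses only the sign of w^T x, even though the
    # squared loss penalizes its magnitude during training.
    return np.sign(X @ w)
```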
(b) Draw the ideal decision boundary for the dataset above.
Solution:

Ideally we wouldn’t want the decision boundary to change once we add the points in the bottom-right corner.
(c) Assume your data comes from two classes and the prior for class $k$ is $p(y = k) = \pi_k$. Also, the conditional probability distribution for each class $k$ is Gaussian, $x \mid y = k \sim \mathcal{N}(\mu_k, \Sigma)$, that is,
$$f_k(x) = f(x \mid y = k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right\}.$$
Assume that $\{\mu_k\}_{k=0}^{1}$ and $\Sigma$ were estimated from the training data.
Show that $P(y = 1 \mid x) = s(w^\top x)$, where $s(\zeta) = \frac{1}{1 + e^{-\zeta}}$ is the sigmoid function.
Solution: Let us denote $Q_k(x) = \log\big( (\sqrt{2\pi})^d \, \pi_k f_k(x) \big)$, so we get that
$$p(y = 1 \mid x) = \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)} = \left( 1 + \frac{\pi_0 f_0(x)}{\pi_1 f_1(x)} \right)^{-1} = \left( 1 + \frac{e^{Q_0(x)}}{e^{Q_1(x)}} \right)^{-1} = s\big( Q_1(x) - Q_0(x) \big)$$
Now let's look at the expression $Q_1(x) - Q_0(x)$:
$$\begin{aligned}
Q_1(x) - Q_0(x) &= \log\big( (\sqrt{2\pi})^d \pi_1 f_1(x) \big) - \log\big( (\sqrt{2\pi})^d \pi_0 f_0(x) \big) \\
&= \log\frac{\pi_1}{1 - \pi_1} - \frac{1}{2}(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \\
&= \log\frac{\pi_1}{1 - \pi_1} + x^\top \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^\top \Sigma^{-1} \mu_0
\end{aligned}$$
Notice that we can write this out as:
$$Q_1(x) - Q_0(x) = \log\frac{\pi_1}{1 - \pi_1} + \frac{1}{2}\mu_0^\top \Sigma^{-1} \mu_0 - \frac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1 + (\mu_1 - \mu_0)^\top \Sigma^{-1} x = w_0 + w^\top x$$
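The identity can be checked numerically. A sketch, with arbitrary (hypothetical) values for $\mu_0$, $\mu_1$, $\Sigma$, and $\pi_1$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def lda_sigmoid_params(mu0, mu1, Sigma, pi1):
    """w0, w such that p(y=1|x) = s(w0 + w^T x) under the shared-covariance Gaussian model."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    w0 = (np.log(pi1 / (1 - pi1))
          + 0.5 * mu0 @ Sinv @ mu0
          - 0.5 * mu1 @ Sinv @ mu1)
    return w0, w

# Hypothetical parameters, just to verify the identity.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma, pi1 = np.array([[2.0, 0.5], [0.5, 1.0]]), 0.3
w0, w = lda_sigmoid_params(mu0, mu1, Sigma, pi1)

x = np.array([0.7, -0.4])
num = pi1 * multivariate_normal.pdf(x, mu1, Sigma)
den = num + (1 - pi1) * multivariate_normal.pdf(x, mu0, Sigma)
print(np.isclose(num / den, 1.0 / (1.0 + np.exp(-(w0 + w @ x)))))  # True
```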

(d) In the previous part we saw that the posterior probability for each class is the sigmoid function under the LDA model assumptions. Notice that LDA is a generative model. In this part we are going to look at the discriminative model. We will assume that the posterior probability has a Bernoulli distribution and the probability for each class is the sigmoid function, i.e. $p(y \mid x; w) = q^y (1 - q)^{1 - y}$, where $q = s(w^\top x)$, and try to find the $w$ that maximizes the likelihood function. Can you find a closed-form maximum-likelihood estimate of $w$?
Solution:
Assume that our dataset is of size $n$. So we get that the likelihood is:
$$L(w) = \prod_{i=1}^{n} p(y = y_i \mid x_i) = \prod_{i=1}^{n} q_i^{y_i} (1 - q_i)^{1 - y_i}.$$

Now, maximizing the likelihood of the training data as a function of the parameters $w$, we get:
$$\begin{aligned}
\hat{w} = \arg\max_{w} L(w) &= \arg\max_{w} \prod_{i=1}^{n} q_i^{y_i} (1 - q_i)^{1 - y_i} \\
&= \arg\max_{w} \sum_{i=1}^{n} y_i \log(q_i) + (1 - y_i)\log(1 - q_i) \\
&= \arg\max_{w} \sum_{i=1}^{n} y_i \log\left( \frac{q_i}{1 - q_i} \right) + \log(1 - q_i)
\end{aligned}$$

Since $q_i$ is the sigmoid function, we get that:
$$\hat{w} = \arg\min_{w} \; -\sum_{i=1}^{n} \left[ y_i w^\top x_i - \log\left( 1 + \exp\left( w^\top x_i \right) \right) \right].$$
Let us denote $J(w) = -\sum_{i=1}^{n} \left[ y_i w^\top x_i - \log\left( 1 + \exp\left( w^\top x_i \right) \right) \right]$. Notice that $J(w)$ is convex in $w$, so a global minimum can be found. Recall that $s'(\zeta) = s(\zeta)(1 - s(\zeta))$. Now let us take the derivative of $J(w)$ with respect to $w$:
$$\frac{\partial J}{\partial w} = -\sum_{i=1}^{n} \left( y_i - \frac{\exp(w^\top x_i)}{1 + \exp(w^\top x_i)} \right) x_i = \sum_{i=1}^{n} \big( s(w^\top x_i) - y_i \big) x_i = \sum_{i=1}^{n} (s_i - y_i) x_i = X^\top (s - y)$$
where $s_i = s(w^\top x_i)$, $s = (s_1, \ldots, s_n)^\top$, $y = (y_1, \ldots, y_n)^\top$, and $X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}$.
We cannot get a closed-form estimate for $w$ by setting the derivative to zero.
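A minimal sketch of $J(w)$ and its gradient as derived above (hypothetical arrays `X` of shape (n, d) and 0/1 labels `y`; `np.logaddexp(0, z)` computes $\log(1 + e^z)$ stably):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_and_grad(w, X, y):
    """J(w) = -sum_i [y_i w^T x_i - log(1 + exp(w^T x_i))] and its gradient X^T (s - y)."""
    z = X @ w
    J = -np.sum(y * z - np.logaddexp(0.0, z))
    grad = X.T @ (sigmoid(z) - y)
    return J, grad
```

Since no closed form exists, this pair is what one would hand to gradient descent or, as in the next part, to Newton's method.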

(e) In this section we are going to use Newton's method to find the optimal solution for $w$. Write out the update step of Newton's method. What other method does this resemble?
Solution: In the previous part we saw that we could not find a closed-form solution for $\hat{w}$, so we are going to use Newton's method instead. Newton's method is an iterative method for finding successively better approximations to the roots (or zeroes) of a real-valued function. The $k$-th iterative step of Newton's method for some function $f(x)$ is:
$$x^{(k+1)} = x^{(k)} - \big( \nabla f(x^{(k)}) \big)^{-1} f(x^{(k)})$$
In our case we want to find the zeros of $\nabla J(w)$. The update step will be:
$$w^{(k+1)} = w^{(k)} - \big( H J(w^{(k)}) \big)^{-1} \nabla_w J(w^{(k)})$$

where
$$\nabla_w J(w) = X^\top (s - y)$$
$$H J(w) = \nabla_w^2 J(w) = \nabla_w \, X^\top (s - y) = \sum_{i=1}^{n} s_i (1 - s_i) \, x_i x_i^\top = X^\top \Omega X$$
with $\Omega = \operatorname{diag}\big( s_1(1 - s_1), \ldots, s_n(1 - s_n) \big)$.
Finding the inverse of the Hessian in high dimensions can be an expensive operation. Instead of directly inverting the Hessian, we calculate the vector $\Delta_{k+1} = w^{(k+1)} - w^{(k)}$ as the solution to the system of linear equations:
$$H J(w^{(k)}) \, \Delta_{k+1} = -\nabla J(w^{(k)})$$
$$X^\top \Omega X \, \Delta_{k+1} = X^\top (y - s)$$

This looks similar to weighted least squares, where $\Omega$ and $(y - s)$ change at every iteration. Specifically, points where $s(w^\top x_i)$ is close to 0.5 get the highest weight $\Omega_{ii}$, while points where $s(w^\top x_i)$ is close to 0 or 1 get much lower weight. Points where $s(w^\top x_i) \approx 0.5$ are close to the decision boundary and the model is least sure about their class, so they move the decision boundary the most in the next iteration.
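A sketch of the resulting iteratively reweighted least squares (IRLS) loop, again with hypothetical `X` (shape (n, d)) and 0/1 labels `y`; a real implementation would also guard against an ill-conditioned $X^\top \Omega X$, e.g. with a small ridge term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=10):
    """Newton / IRLS updates: solve (X^T Omega X) delta = X^T (y - s) at each step."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        s = sigmoid(X @ w)
        Omega = np.diag(s * (1 - s))   # per-point weights, recomputed every iteration
        delta = np.linalg.solve(X.T @ Omega @ X, X.T @ (y - s))
        w = w + delta
    return w
```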

DIS9, ©UCB CS 189, Spring 2018. All Rights Reserved. This may not be publicly shared without explicit permission.
