
CSCI3320

FUNDAMENTALS OF MACHINE LEARNING   TA Names: Runsong Zhu, Harden Huang

ASSIGNMENT 1
Deadline: 11:59 pm, February 19, 2023

Submit via Blackboard with VeriGuide receipt

Please follow the course policy and the school’s academic honesty policy.
Each question is worth ten points.

1. Express each of the following tasks in the framework of machine learning by specifying the input space
X , output space Y, target function f : X → Y, and the specifics of the data set that we will learn from.
(1) Breast cancer diagnosis: A patient walks in with a medical history and some symptoms, and you
want to identify whether she has breast cancer.
(2) Handwritten Chinese character recognition: recognize a signature on an attendance book.
(3) Predicting how a natural gas load varies with price, temperature, and day of the week.

Solution: (1) Input space X: the patient's medical history, symptoms, personal health information, etc.
Output space Y: whether she has breast cancer or not. Target function f: X → Y: the ideal formula that
identifies a patient's breast cancer status. Data set: all available patients' information and their
corresponding correct breast cancer diagnoses.
(2) Input space X: handwritten Chinese characters (digitized). Output space Y: the corresponding
Chinese character. Target function f: X → Y: the ideal formula matching a handwritten Chinese
character to the correct character. Data set: handwritten Chinese characters and their corresponding
correct matches.
(3) Input space X: prices, temperatures, and days of the week. Output space Y: the natural gas load.
Target function f: X → Y: the ideal (unknown) relation mapping price, temperature, and day of the
week to the gas load, which we can approximate with linear regression or other machine learning
methods. Data set: records of prices, temperatures, and days of the week with the corresponding
observed gas loads.

2. Suppose that we use a perceptron to detect mobile short message (SMS) spam. Let's say that each mobile
short message is represented by the frequency of occurrence of keywords, and the output is +1 if the
message is considered spam.
(1) Can you think of some keywords that will end up with a large positive weight in the perceptron?
(2) How about keywords that will get a negative weight?
(3) How can we change the parameter b (the bias term) in the perceptron to make the spam detection system
more sensitive (e.g., so that more spam messages are detected)?

Solution: (1) Keywords likely to get a large positive weight: free, cheap, earn, concessional rate.
(2) Keywords likely to get a negative weight: a person's name, hi, thanks.
(3) The output is +1 when the message is classified as spam. Given the perceptron function
h(x) = sign(wᵀx + b), making b larger lets wᵀx + b exceed 0 more easily, so more messages are flagged
as spam and the detection system becomes more sensitive.
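For illustration only (not part of the required answer), here is a small Python sketch of a toy perceptron spam filter; the keyword list and weight values are made up for the demonstration:

# Toy perceptron spam filter with made-up weights (illustration only).
# Increasing the bias b pushes w.x + b above 0 more often, so more messages
# are classified as spam (+1), i.e. the detector becomes more sensitive.
import numpy as np

keywords = ["free", "cheap", "earn", "hi", "thanks"]   # hypothetical vocabulary
w = np.array([2.0, 1.5, 1.8, -1.0, -1.2])              # hypothetical weights

def classify(counts, b):
    # counts: keyword-frequency vector; returns +1 (spam) or -1 (not spam)
    return 1 if w @ counts + b > 0 else -1

msg = np.array([1, 0, 0, 1, 1])    # one "free" plus two polite words
print(classify(msg, b=-1.0))       # -1: treated as a normal message
print(classify(msg, b=+1.0))       # +1: a larger bias flags it as spam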


3. In the weight update rule of the Perceptron Learning Algorithm, we see that the weights move in the
direction of classifying x∗ correctly.
The rule can be written as: w(t + 1) = w(t) + y∗x∗.
Show that y∗wᵀ(t + 1)x∗ > y∗wᵀ(t)x∗. (Hint: Use the weight update rule.)

Solution:
According to the update rule:

y∗wᵀ(t + 1)x∗ = y∗(w(t) + y∗x∗)ᵀx∗
             = y∗wᵀ(t)x∗ + y∗²x∗ᵀx∗
             > y∗wᵀ(t)x∗,

because the last term y∗²x∗ᵀx∗ = ∥x∗∥² is strictly positive (y∗² = 1 and x∗ ≠ 0, since its first coordinate is fixed at 1).
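A quick numerical check of this inequality on arbitrary made-up numbers (not required for the proof):

# Numerical check of the PLA update on one made-up misclassified example.
import numpy as np

w_t = np.array([0.5, -1.0, 2.0])
x_star, y_star = np.array([1.0, 3.0, -0.5]), 1.0   # misclassified: y*(w.x) < 0
w_next = w_t + y_star * x_star                      # PLA update rule

before = y_star * (w_t @ x_star)
after = y_star * (w_next @ x_star)
print(before, after, after > before)                # after = before + ||x*||^2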

4. Consider the perceptron in three dimensions: h(x) = sign(wᵀx), where w = [1, 2, 3, 4]ᵀ and
x = [1, x1, x2, x3]ᵀ. Technically, x has four coordinates, but we call this perceptron three-dimensional
because the first coordinate is fixed at 1. Show that the regions in space where h(x) = +1 and
h(x) = −1 are separated by a plane. If we express this plane by the equation x3 = w′ᵀx′, where
w′ is a three-dimensional vector [a, b, c]ᵀ and x′ is a three-dimensional vector [x1, x2, 1]ᵀ, what are the
parameters a, b, c?

Solution: For h(x) = +1 we need wᵀx > 0, and for h(x) = −1 we need wᵀx < 0. The plane separating
the two regions therefore satisfies wᵀx = 1 + 2x1 + 3x2 + 4x3 = 0, i.e., x3 = −(1 + 2x1 + 3x2)/4.
Comparing with x3 = ax1 + bx2 + c gives a = −1/2, b = −3/4, and c = −1/4.

5. Please tell us whether the following problems are suited for the machine learning approach. Deciding
this is an important step before applying machine learning methods to one's own tasks.
Briefly give your reason in one or two sentences.
(1) Determining recovering time of chemotherapy.
(2) Considering whether partial differential equations are solvable or not.
(3) Detecting potential fraud in housing loan issuance.
(4) Determining how far a paper plane can fly when the release speed, angle, and height are set.
(5) Determining the fastest route among multiple points in a city for the postman.
The answers to this question may change as machine learning develops (e.g., more problems may
become addressable by a learning approach). Remember your answers and see whether your
perspectives change after a few years.

Solution: (1) Learning. This should be derived from data, although human experience may be used as a
reference.
(2) Design. This currently tends to be a design problem and is hard for ML; things may change in the
future.
(3) Learning. Such detection can be learned from data, like other fraud-detection tasks.


(4) Design. The outcome follows from known physics, although learning physical systems is an active topic in machine learning.
(5) Learning. The function is unknown and we need to learn from data.

6. For each of the following tasks, choose one most suitable type of learning (supervised, reinforcement,
or unsupervised) and briefly give your reason for not choosing other types.
(1) Recommending videos on YouTube or TikTok.
(2) Playing classic Tetris.
(3) Discovering cluster patterns of individuals in a crowd.
(4) Determining the interest rate of a housing loan, given the credit history of the borrower.

Solution: 1. Supervised Learning. Logged user feedback (e.g., clicks and watch time) provides labels for
training.
2. Reinforcement Learning. Game playing is a typical RL problem.
3. Unsupervised Learning. No annotations are available, and clustering is a standard unsupervised
learning problem.
4. Supervised Learning. Since labeled credit histories are given, supervised learning can be used to
predict the interest rate.

7. We look into the Hoeffding Inequality, namely, for any ϵ > 0,

P[|v − µ| > ϵ] ≤ 2e^{−2ϵ²N}.

Referring to week 2's slide on the 29th page, µ is the probability of picking a red marble, v is the fraction of
red marbles in the sample, and N is the sample size. In addition, we assume P[|v − µ| > ϵ] = δ and
set δ = 0.03.
(a) How many examples do we need to make ϵ ≤ 0.1?
(b) How many examples do we need to make ϵ ≤ 0.05?
(c) How many examples do we need to make ϵ ≤ 0.01?

Solution: For ϵ ≤ k, we need N ≥ (1/(2k²)) ln(2/δ). Results:
(1) 210.
(2) 840.
(3) 20999.
>>> import math
>>> M = 1
>>> delta = 0.03
>>> k = 0.1
>>> 1 / (2 * k**2) * math.log(2 * M / delta)
209.98525389399634
>>> k = 0.05
>>> 1 / (2 * k**2) * math.log(2 * M / delta)
839.9410155759854
>>> k = 0.01
>>> 1 / (2 * k**2) * math.log(2 * M / delta)
20998.525389399638


8. Given the 2D perceptron case: (1) How many possible dichotomies can be generated when there
are N = 5 inputs? (2) How many possible dichotomies can be generated when there are N = 4
inputs? (3) Draw all possible dichotomies when N = 4; you can mark the positive label as "+" and the
negative one as "-" in your drawing.

Solution: Hint: the perceptron is a linear classifier that can only separate the sample points with a line,
so the dichotomies counted here are only those that are linearly separable.
(1) Five of the figures are ("+", "-", "-", "-", "-"), five are ("-", "+", "+", "+", "+"), five are
("+", "+", "-", "-", "-"), five are ("-", "-", "+", "+", "+"), and two are all positive / all negative.
The total number is 2 × (5 + 5 + 1) = 22.
(2) 14.
(3) Four of the figures are ("+", "-", "-", "-"), four are ("-", "+", "+", "+"), two are ("+", "+", "-", "-"),
two are ("-", "-", "+", "+"), and two are all positive / all negative. In total 14 figures.

CSCI3320
FUNDAMENTALS OF MACHINE LEARNING   TA Names: Qinze Yu, Ziyuan Hu

ASSIGNMENT 2
Deadline: 11:59 pm, March 5, 2023

Submit via Blackboard with VeriGuide receipt.

Please follow the course policy and the school’s academic honesty policy.
Each question is worth ten points.

1. Please briefly answer the following questions.


(1) What is the difference between the in-sample error Ein and the out-of-sample error Eout ? (Week3
slides Page6)
(2) What is the relation between the VC-dimension dvc and the break point?
(3) What methods have we learned in class that can be used to analyze approximation and generaliza-
tion tradeoff?
(4) What is the smallest break point of the Perceptron in R^d (d ≥ 2)? (Week5 slides Page11)
2. Compute the maximum number of dichotomies, mH(N), for these learning models, and consequently
compute dvc, the VC dimension. Refer to the textbook for the definitions.
(a) Positive or negative ray: H contains the functions which are +1 on [a, +∞) (for some a) together with
those that are +1 on (−∞, a] (for some a).
(b) Positive or negative interval: H contains the functions which are +1 on an interval [a, b] and −1
elsewhere, or −1 on an interval [a, b] and +1 elsewhere.
3. Referring to Week4 slides Page26: for any tolerance δ > 0,

Eout(g) ≤ Ein(g) + √( (8/N) ln(4mH(2N)/δ) )

with probability ≥ 1 − δ. Suppose we have a simple learning model whose growth function is mH(N) =
N + 1, hence dvc = 1. Use the VC bound to estimate the probability that Eout will be within 0.2 of Ein
given 200 training examples. For this question, all your results should be truncated to 4 decimal places.
Hint: Let ϵ = √( (8/N) ln(4mH(2N)/δ) ) = 0.2 and compute δ.

4. Denote the number of examples as N. Set δ = 0.02 and let

ϵ(M, N, δ) = √( (1/(2N)) ln(2M/δ) ).

Please round your answer up to the closest integer.
(a) For M = 1, how many examples do we need to make ϵ ≤ 0.05?
(b) For M = 100, how many examples do we need to make ϵ ≤ 0.05?
(c) For M = 10,000, how many examples do we need to make ϵ ≤ 0.05?
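As a rough illustration (not an official solution), the required N can be obtained by solving ϵ(M, N, δ) ≤ 0.05 for N; the sketch below simply prints whatever the inverted formula gives, to be checked against your own derivation.

# Sketch: invert eps(M, N, delta) = sqrt(ln(2M/delta) / (2N)) <= eps_target for N.
import math

def required_n(M, delta=0.02, eps_target=0.05):
    return math.ceil(math.log(2 * M / delta) / (2 * eps_target**2))

for M in (1, 100, 10_000):
    print(M, required_n(M))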


5. Consider a simplified learning scenario. Assume that the input dimension is one. Assume that the
input variable x is uniformly distributed in the interval [−1, 1]. The data set consists of 2 points {x1, x2}
and assume that the target function is f(x) = x². Thus, the full data set is D = {(x1, x1²), (x2, x2²)}.
The learning algorithm returns the line fitting these two points as g (H consists of functions of the form
h(x) = ax + b). We are interested in the test performance (E[Eout]) of our learning system with respect
to the squared error measure, the bias and the var. (Week5 slides Page27 to Page29)
(a) Give the analytic expression for the average function ḡ(x). You are not required to compute E_D[x1]
and E_D[x2] inside the formula of ḡ(x).
Hint: Use g(x) = ax + b with g(x1) = x1² and g(x2) = x2² to obtain ḡ(x) = E_D[g^(D)(x)].

(b) Compute analytically what E[Eout], bias and var should be.
Hint: E_D[x] = 0, E_D[x²] = 1/3, E_D[x³] = 0, E_D[x⁴] = 1/5.
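For intuition (this is not the requested analytic derivation), a Monte Carlo sketch can estimate ḡ(x), the bias and the var empirically; the number of simulated data sets and the evaluation grid below are arbitrary choices.

# Monte Carlo sketch: estimate g_bar(x), bias and var for fitting a line
# through two points sampled from f(x) = x^2 with x ~ Uniform[-1, 1].
import numpy as np

rng = np.random.default_rng(0)
runs = 20_000
x1, x2 = rng.uniform(-1, 1, runs), rng.uniform(-1, 1, runs)
a = x1 + x2                   # slope of the line through (x1, x1^2), (x2, x2^2)
b = -x1 * x2                  # intercept of that line

xs = np.linspace(-1, 1, 401)                       # evaluation grid
g = np.outer(a, xs) + b[:, None]                   # g^(D)(x) for every data set
g_bar = g.mean(axis=0)                             # average hypothesis
bias = np.mean((g_bar - xs**2) ** 2)               # E_x[(g_bar(x) - f(x))^2]
var = np.mean(((g - g_bar) ** 2).mean(axis=0))     # E_x[E_D[(g(x) - g_bar(x))^2]]
print(round(bias, 3), round(var, 3), round(bias + var, 3))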

*** END ***

CSCI3320
FUNDAMENTALS OF MACHINE LEARNING   TA Names: Runze Zhu, Yongfeng Huang

ASSIGNMENT 3
Deadline: 11:59 pm, March 26, 2023

Submit via Blackboard with VeriGuide receipt.

Please follow the course policy and the school’s academic honesty policy.
Each question is worth ten points.

1. (a) How to understand the trade-off between approximation and generalization?


(b) How to understand the learning curve?

Solution: open questions.
(a) Refer to Week5's slides, pages 20 to 38.
(b) Refer to Week6's slides, pages 1 to 10.

2. In this problem, we will walk through the computation of the Linear Regression Algorithm to help you
better understand it. Here we want to fit a linear regression model h(x) = wx + b to the dataset D.
D = {(0.8, 1.2), (2.2, 2.4), (4.5, 3.0)} has only 3 data points and each data point (x, y) ∈ D contains only
a scalar input x and a scalar target y, e.g., 0.8 is the input x of the first data point and 1.2 is the target
y of the first data point for h(x) to approximate. (For all questions below, all your results should be
truncated to 2 decimal places.)
(a) What are X and y according to the Linear Regression Algorithm?
(b) What is the result of X T X?
(c) What is the result of (X T X)−1 ?
Hint: you can use some software to compute the inverse.
(d) What is the result of (X T X)−1 X T ?
(e) What is the result of (X T X)−1 X T y?
(f) Given a new input 2.0, what will our fitted model predict?
 
Hint: [b, w]ᵀ = (XᵀX)⁻¹Xᵀy

Solution:
(a)
X = [[1.0, 0.8], [1.0, 2.2], [1.0, 4.5]] (one row per data point, with a leading 1 for the bias term),
y = [1.2, 2.4, 3.0]ᵀ

(b)
XᵀX = [[3.0, 7.5], [7.5, 25.73]]

(c)
(XᵀX)⁻¹ = [[1.22, −0.35], [−0.35, 0.14]]

(d)
(XᵀX)⁻¹Xᵀ = [[0.94, 0.45, −0.35], [−0.23, −0.04, 0.28]]

(e)
(XᵀX)⁻¹Xᵀy = [1.15, 0.46]ᵀ

(f)
Since [b, w]ᵀ = (XᵀX)⁻¹Xᵀy = [1.15, 0.46]ᵀ, we have h(x = 2.0) = 0.46 × 2.0 + 1.15 = 2.07.
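A short numpy sketch for checking the arithmetic (optional). Note that the solution above truncates every intermediate matrix to two decimals, so the exact values printed here can differ slightly, especially for the intercept.

# Sketch: compute the least-squares solution for the 3-point data set directly.
import numpy as np

X = np.array([[1.0, 0.8], [1.0, 2.2], [1.0, 4.5]])
y = np.array([1.2, 2.4, 3.0])

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
w = XtX_inv @ X.T @ y                 # [b, w] = (X^T X)^{-1} X^T y
print(np.round(XtX, 2))
print(np.round(XtX_inv, 2))
print(np.round(w, 2))
print(round(w[0] + w[1] * 2.0, 2))    # prediction at x = 2.0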

3. After we find the optimal w∗ = (XᵀX)⁻¹Xᵀy by the Linear Regression Algorithm, where X is an N by
(d + 1) matrix, we obtain the predicted values ŷ = Hy for all inputs X in our dataset D, where
H = X(XᵀX)⁻¹Xᵀ.
(a) Prove that H is symmetric, i.e., Hᵀ = H.
Hint: (Mᵀ)⁻¹ = (M⁻¹)ᵀ.
(b) Prove that Hᵏ = H for any positive integer k.
Hint: Try proving HH = H first.
(c) Prove that (I − H)ᵏ = I − H for any positive integer k, where I is the identity matrix.
(d) Show that trace(H) = d + 1, where the trace is the sum of the diagonal elements.
Hint: trace(AB) = trace(BA).

Solution:
(a)

Hᵀ = (X(XᵀX)⁻¹Xᵀ)ᵀ
   = X ((XᵀX)⁻¹)ᵀ Xᵀ        (using (M⁻¹)ᵀ = (Mᵀ)⁻¹)
   = X ((XᵀX)ᵀ)⁻¹ Xᵀ
   = X (XᵀX)⁻¹ Xᵀ
   = H

PAGE 2 OF 6
CSCI3320
F UNDAMENTALS OF M ACHINE L EARNING TA N AMES : R UNZE Z HU , Y ONGFENG H UANG

(b) Since

HH = X(XᵀX)⁻¹Xᵀ · X(XᵀX)⁻¹Xᵀ
   = X (XᵀX)⁻¹ (XᵀX) (XᵀX)⁻¹ Xᵀ
   = X(XᵀX)⁻¹Xᵀ = H,

it follows by induction that Hᵏ = H for any positive integer k.


(c) Since Hⁿ = H for every n ≥ 1, the binomial expansion gives

(I − H)ᵏ = Σ_{n=0}^{k} C(k, n) (−1)ⁿ Hⁿ
         = I + Σ_{n=1}^{k} C(k, n) (−1)ⁿ H
         = I − H + H Σ_{n=0}^{k} C(k, n) (−1)ⁿ
         = I − H + H (1 − 1)ᵏ
         = I − H,

which proves the statement.


(d) Since

trace(H) = trace(X(XᵀX)⁻¹Xᵀ) = trace((XᵀX)⁻¹XᵀX) = trace(I),

and XᵀX is a (d + 1) × (d + 1) matrix, we get trace(H) = trace(I_{d+1}) = d + 1.
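These properties are easy to check numerically; the sketch below uses an arbitrary random full-rank data matrix (illustration only):

# Sketch: verify H = H^T, HH = H and trace(H) = d + 1 on random data.
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # N x (d+1) data matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))       # symmetry
print(np.allclose(H @ H, H))     # idempotence, hence H^k = H
print(round(np.trace(H), 6))     # equals d + 1 = 4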

4. Calculate the partial derivatives of f(x, y) = √(3x² + 2y³). Show your work.

Solution:

∂f(x, y)/∂x = 1/(2√(3x² + 2y³)) · ∂(3x² + 2y³)/∂x
            = 1/(2√(3x² + 2y³)) · 6x
            = 3x / √(3x² + 2y³)

∂f(x, y)/∂y = 1/(2√(3x² + 2y³)) · ∂(3x² + 2y³)/∂y
            = 1/(2√(3x² + 2y³)) · 6y²
            = 3y² / √(3x² + 2y³)
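The result can also be checked symbolically (optional; this assumes the sympy package is available):

# Sketch: confirm the partial derivatives of f(x, y) = sqrt(3x^2 + 2y^3) with sympy.
import sympy as sp

x, y = sp.symbols("x y", positive=True)
f = sp.sqrt(3 * x**2 + 2 * y**3)
print(sp.simplify(sp.diff(f, x)))   # expected: 3*x/sqrt(3*x**2 + 2*y**3)
print(sp.simplify(sp.diff(f, y)))   # expected: 3*y**2/sqrt(3*x**2 + 2*y**3)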

5. We sample 4 data points from an unknown target function: (0.0, 1.8), (1.0, 4.2), (2.0, 6.0), (3.0, 8.2).
A one-dimensional linear regression model f(x) = wx + b is used to estimate this target function based
on the sampled data. The loss function is

L = (1/(2n)) Σ_{i=1}^{n} (y^i − (wx^i + b))²,

where (x^i, y^i) is the i-th data point and n is the number of data points. Please estimate the parameters
w and b with the gradient descent algorithm.
(1) Assuming the initial parameters w = 0, b = 0 and the learning rate η = 0.01, calculate the updated
parameters after one step of gradient descent on the 4 collected samples.
(2) Assuming the current parameters w = 0, b = 0 and the learning rate η = 1.0, calculate the current
loss value and the loss value after one step of gradient descent. Is this learning rate reasonable?

Hint: grad_w = ∂L/∂w = (1/n) Σ_{i=1}^{n} (f(x^i) − y^i) x^i, grad_b = ∂L/∂b = (1/n) Σ_{i=1}^{n} (f(x^i) − y^i),
w ← w − η·grad_w, b ← b − η·grad_b.

Solution:
(1)

grad_w = (1/n) Σ_{i=1}^{n} (f(x^i) − y^i) x^i
       = (1/4)(−1.8·0 − 4.2·1 − 6.0·2 − 8.2·3)
       = −10.2

w = w − η·grad_w = 0 + 0.01·10.2 = 0.102

grad_b = (1/n) Σ_{i=1}^{n} (f(x^i) − y^i)
       = (1/4)(−1.8 − 4.2 − 6.0 − 8.2)
       = −5.05

b = b − η·grad_b = 0 + 0.01·5.05 = 0.0505

(2)

L(w=0, b=0) = (1/(2·4))((1.8 − 0)² + (4.2 − 0)² + (6.0 − 0)² + (8.2 − 0)²) = 15.515

w = w − η·grad_w = 0 + 1.0·10.2 = 10.2
b = b − η·grad_b = 0 + 1.0·5.05 = 5.05

L(w=10.2, b=5.05) = (1/(2·4))((1.8 − 5.05)² + (4.2 − 15.25)² + (6.0 − 25.45)² + (8.2 − 35.65)²) = 158.05875

The learning rate is too large.
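A small numpy sketch of the same one-step update (illustration only), which should agree with the η = 0.01 hand computation above:

# Sketch: one gradient-descent step on the four sample points (eta = 0.01).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.8, 4.2, 6.0, 8.2])
w, b, eta = 0.0, 0.0, 0.01

pred = w * x + b
grad_w = np.mean((pred - y) * x)    # -10.2
grad_b = np.mean(pred - y)          # -5.05
w, b = w - eta * grad_w, b - eta * grad_b
print(w, b)                         # matches the hand computation: 0.102 and 0.0505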

6. Given a logistic regression model h(x) = θ(wx + b), where θ is the so-called logistic function
θ(s) = e^s/(1 + e^s), whose output is between 0 and 1. The predictions of the model at x = 0 and x = 1 are
h(x = 0) = 0.2026 and h(x = 1) = 0.7701, respectively; please calculate the prediction at x = 2. (For this
question, all your results should be truncated to 4 decimal places.)

Solution:

Since h(x = 0) = e^b/(1 + e^b) = 0.2026, we get e^b = 0.2026 + 0.2026·e^b, so e^b = 0.2540 and b = −1.3704.
Since h(x = 1) = e^{w − 1.3704}/(1 + e^{w − 1.3704}) = 0.7701, we get e^{w − 1.3704} = 3.3497, so w = 2.5792.
Therefore h(x) = θ(2.5792x − 1.3704) and h(x = 2) = θ(3.7880) = 0.9778.
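A quick Python check of the same computation (optional):

# Sketch: recover w and b from the two given predictions and evaluate at x = 2.
import math

def logit(p):                  # inverse of the logistic function
    return math.log(p / (1 - p))

b = logit(0.2026)              # equals wx + b at x = 0
w = logit(0.7701) - b          # wx + b at x = 1, minus b
s = 2 * w + b
print(round(b, 4), round(w, 4), round(math.exp(s) / (1 + math.exp(s)), 4))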


7. We have gathered data on the occurrence of hypertension, represented by the data pairs (30, −1),
(30, +1), (30, +1), (40, −1), (40, +1), (50, −1), and (50, +1). In each pair, the first value denotes the
age of the patient, while the second value, y, is either +1 or −1, indicating whether the patient has the
disease or not. We aim to analyze the correlation between the prevalence of hypertension and age using
a logistic regression model, which takes the form h(x) = θ(wx + b), where θ(s) = e^s/(1 + e^s) is the logistic
function. The parameters w and b are initialized to 0.1 and −4.0, respectively. Please calculate the in-
sample error Ein and the gradient of Ein with respect to w at this time. All results should be truncated
to 4 decimal places for this question.
Hint: Ein(w, b) = (1/N) Σ_{i=1}^{N} ln(1 + e^{−y^i(wx^i + b)}),
∂Ein(w, b)/∂w = (1/N) Σ_{i=1}^{N} −y^i x^i θ(−y^i(wx^i + b)), where (x^i, y^i) is the i-th data point.

Solution:

Ein(w, b) = (1/N) Σ_{i=1}^{N} ln(1 + e^{−y^i(wx^i + b)})
          = (1/7)(ln(1 + e^{−1}) + 2·ln(1 + e^{1}) + ln(1 + e^{0}) + ln(1 + e^{0}) + ln(1 + e^{1}) + ln(1 + e^{−1}))
          = (1/7)(0.3132 + 2·1.3132 + 0.6931 + 0.6931 + 1.3132 + 0.3132)
          = 0.8503

∂Ein(w, b)/∂w = (1/N) Σ_{i=1}^{N} −y^i x^i θ(−y^i(wx^i + b))
              = (1/7)(30·θ(−1) − 2·30·θ(1) + 40·θ(0) − 40·θ(0) + 50·θ(1) − 50·θ(−1))
              = (1/7)(−20·θ(−1) − 10·θ(1))
              = (1/7)(−20·0.2689 − 10·0.7310)
              = −1.8125
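The same quantities can be computed with a few lines of numpy (illustration only; because the hand computation above truncates θ(±1) to four decimals, the gradient printed here may differ in the last digit):

# Sketch: compute E_in and its gradient w.r.t. w for the hypertension data.
import numpy as np

x = np.array([30, 30, 30, 40, 40, 50, 50], dtype=float)
y = np.array([-1, 1, 1, -1, 1, -1, 1], dtype=float)
w, b = 0.1, -4.0

theta = lambda s: np.exp(s) / (1 + np.exp(s))      # logistic function
E_in = np.mean(np.log(1 + np.exp(-y * (w * x + b))))
grad_w = np.mean(-y * x * theta(-y * (w * x + b)))
print(round(E_in, 4), round(grad_w, 4))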

CSCI3320
FUNDAMENTALS OF MACHINE LEARNING   TA Names: Qinze Yu, Ziyuan Hu

ASSIGNMENT 4
Deadline: 11:59 pm, April 9, 2023

Submit via Blackboard with VeriGuide receipt

Please follow the course policy and the school’s academic honesty policy.
Each question is worth ten points.

1. Soft-order constraints that regularize polynomial models can


[a] be written as hard-order constraints
[b] be determined from the value of the VC dimension
[c] always be used to decrease both Ein and Eout
[d] be translated into augmented error
[e] none of the above is true
Please select the correct option and briefly discuss your reason.

2. The regularized weight wreg is a solution to:

minimize (1/N) Σ_{n=1}^{N} (wᵀxn − yn)²  subject to  wᵀΓᵀΓw ≤ C,

where Γ is a matrix. If wlinᵀΓᵀΓwlin ≤ C, where wlin is the linear regression solution, then what is wreg?
[a] wreg = wlin
[b] wreg = Cwlin
[c] wreg = wlin
[d] wreg = C wlin
|
[e] wreg = wlin
Please select the correct option and briefly discuss your reason.

3. What is overfitting? What are the causes? What are the tools to fight it?

4. Deterministic noise depends on H, as some models approximate f better than others.


(a) Assume H is fixed and we increase the complexity of f . Will deterministic noise in general go up or
down? Is there a higher or lower tendency to overfit?
(b) Assume f is fixed and we decrease the complexity of H. Will deterministic noise in general go up
or down? Is there a higher or lower tendency to overfit? [Hint: There is a race between two factors that
affect overfitting in opposite ways, but one wins.]


5. Let Z = [z1, …, zN]ᵀ be the data matrix (assume Z has full column rank); let wlin = (ZᵀZ)⁻¹Zᵀy; and
let H = Z(ZᵀZ)⁻¹Zᵀ (the hat matrix of Exercise 3.3). Show that

Ein(w) = [ (w − wlin)ᵀ ZᵀZ (w − wlin) + yᵀ(I − H)y ] / N,

where I is the identity matrix.
(a) What value of w minimizes Ein?
(b) What is the minimum in-sample error?

6. In the augmented error minimization with Γ = I and λ > 0, assume that Ein is differentiable and use
gradient descent to minimize Eaug:

w(t + 1) ← w(t) − η∇Eaug(w(t)).

Show that the update rule above is the same as

w(t + 1) ← (1 − 2ηλ)w(t) − η∇Ein(w(t)).

Note: This is the origin of the name 'weight decay': w(t) decays before being updated by the gradient
of Ein.

CSCI3320
FUNDAMENTALS OF MACHINE LEARNING   TA Names: Runze Zhu, Yongfeng Huang

ASSIGNMENT 5
Deadline: 11:59 pm, April 16, 2023

Submit via Blackboard with VeriGuide receipt

Please follow the course policy and the school’s academic honesty policy.
Each question is worth ten points.

1. (Bayes Rule) In a city, 80% of taxi cabs are red and 20% are green. Since taxis move fast, a witness
reports the color of a taxi correctly with probability 80%. If a witness claims he saw a green cab, what is
the probability that the cab is actually green?

Solution: P(true color green) = 0.2,
P(report green | true color green) = 0.8, P(report red | true color red) = 0.8.

P(report green) = P(true color green)·P(report green | true color green)
                + P(true color red)·P(report green | true color red) = 0.2 · 0.8 + 0.8 · 0.2 = 0.32

P(true color green | report green) = P(true color green)·P(report green | true color green) / P(report green)
                                   = (0.2 · 0.8) / 0.32 = 0.5

2. The Poisson distribution is one of the most popular distributions in statistics. You can learn more about
it on Wikipedia. The probability mass function of the Poisson distribution is given as:

P(X = x) = λ^x e^{−λ} / x!

Given a sample of n measured values xi ∈ {0, 1, ...}, for i = 1, ..., n, we wish to estimate the value
of the parameter λ of the Poisson population from which the sample was drawn. Please derive the
maximum likelihood estimate λ̂. [Hint: you could calculate the log-likelihood function first and use the
derivative condition for a maximum to obtain the result.]

Solution: Given the probability mass function P(X = x) = λ^x e^{−λ}/x!, the likelihood function is the
product of the PMF over the observed values x1, ..., xn:

L(λ; x1, ..., xn) = ∏_{j=1}^{n} λ^{x_j} e^{−λ} / x_j!

Thus, the log-likelihood function is

ln ∏_{j=1}^{n} (λ^{x_j} e^{−λ} / x_j!) = Σ_{j=1}^{n} [x_j ln(λ) − λ − ln(x_j!)] = −nλ + ln(λ) Σ_{j=1}^{n} x_j − Σ_{j=1}^{n} ln(x_j!)

Calculate the derivative of the log-likelihood with respect to λ and set it to 0:

d/dλ [−nλ + ln(λ) Σ_{j=1}^{n} x_j − Σ_{j=1}^{n} ln(x_j!)] = −n + (1/λ) Σ_{j=1}^{n} x_j = 0

Thus, λ̂ = (1/n) Σ_{i=1}^{n} x_i.
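As a sanity check (not part of the derivation), one can verify numerically that the sample mean maximizes the Poisson log-likelihood; the sketch below assumes scipy is available and uses an arbitrary synthetic sample with λ = 3.7.

# Sketch: the numerical maximizer of the Poisson log-likelihood equals the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.7, size=1000)                  # synthetic sample

def neg_log_lik(lam):
    return -np.sum(x * np.log(lam) - lam - gammaln(x + 1))   # gammaln(x+1) = ln(x!)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())                               # the two agree up to tolerance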

3. In the lecture, you have learned Gaussian Discriminant Analysis (GDA) for handling binary classification
tasks. Considering the two datasets in Figure 1, you need to tell us which one is more suitable for GDA
and explain the reason.

Figure 1: (a) Dataset A; (b) Dataset B.

Solution:
Dataset A. Note that GDA is not well suited to Dataset B, as its data do not follow a Gaussian
distribution.

4. We have discussed how to fit a GDA model using maximum likelihood estimation. The log-likelihood
is given by
"N C # C
" #
XX X X
logp(D|θ) = I(yn = c)logπc + logN (xn |µc , Σc )
n=1 c=1 c=1 n:yn =c

The indicator function I(yn = c) takes the value 1 if yn = c; otherwise it equals to 0. Please write down
the details of how to optimize (πc , µc , Σc ) and eventually get the following results:

π̂c = Nc / N


µ̂c = (1/Nc) Σ_{n: yn = c} xn

Σ̂c = (1/Nc) Σ_{n: yn = c} (xn − µ̂c)(xn − µ̂c)ᵀ

Solution:
(1) For πc:
Following the process in Section 4.2.3, to compute the MLE we minimize the NLL, NLL(π) = −Σ_c Nc log πc,
subject to the constraint Σ_{c=1}^{C} πc = 1. To do this, we use the method of Lagrange multipliers:

L(π, λ) ≜ −Σ_c Nc log πc − λ(1 − Σ_c πc)

Taking the derivative with respect to λ yields the original constraint:

∂L/∂λ = 1 − Σ_c πc = 0

Taking the derivative with respect to πc yields

∂L/∂πc = −Nc/πc + λ = 0  ⟹  Nc = λπc

We can solve for λ using the sum-to-one constraint:

Σ_c Nc = λ Σ_c πc = λ,  so λ = N.

Thus the MLE is given by

π̂c = Nc/λ = Nc/N
(2) For µc, Σc:
Following the process in Section 4.2.6, with the substitution zn = xn − µc and ∂zn/∂µc = −I, we have

∂/∂µc (xn − µc)ᵀΣc⁻¹(xn − µc) = (∂/∂zn znᵀΣc⁻¹zn) · ∂zn/∂µc = −(Σc⁻¹ + Σc⁻ᵀ)zn

Hence

∂ log p(D|θ)/∂µc = −(1/2) Σ_{n: yn = c} (−2Σc⁻¹)(xn − µc) = Σ_{n: yn = c} Σc⁻¹(xn − µc) = 0

∴ µ̂c = (1/Nc) Σ_{n: yn = c} xn

A similar procedure can be applied to ∂ log p(D|θ)/∂Σc.
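For illustration (not part of the derivation), the closed-form MLE estimates can be computed with numpy on made-up two-class data:

# Sketch: GDA maximum-likelihood estimates (pi_c, mu_c, Sigma_c) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 0.5, (70, 2))])
y = np.array([0] * 50 + [1] * 70)

for c in np.unique(y):
    Xc = X[y == c]
    pi_c = len(Xc) / len(X)                            # N_c / N
    mu_c = Xc.mean(axis=0)                             # class mean
    Sigma_c = (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)    # class covariance
    print(c, round(pi_c, 3), np.round(mu_c, 3), np.round(Sigma_c, 3))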

5. Softmax is very useful for multi-class classification problems and has been widely adopted. It can
convert your model output to a probability distribution over classes. The c-th element in the output of
softmax is defined as

f(a)_c = e^{a_c} / Σ_{c'=1}^{C} e^{a_{c'}},

where a ∈ R^C is the output of your model, C is the number of classes and a_c denotes the c-th element
of a. What is the gradient of the i-th element in the softmax output f(a) w.r.t. the j-th element in the
softmax input a, i.e., what is ∂f(a)_i/∂a_j? Show your work.
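Before working through the derivation in the solution below, the claimed Jacobian can be sanity-checked numerically; the logits in this sketch are arbitrary made-up values.

# Sketch: compare the analytic softmax Jacobian with a finite-difference estimate.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # shift for numerical stability
    return e / e.sum()

a = np.array([0.2, -1.0, 0.7, 0.1])    # arbitrary logits
f = softmax(a)
J = np.diag(f) - np.outer(f, f)        # J[i, j] = f_i * (1 if i == j else 0) - f_i * f_j

eps = 1e-6
J_num = np.array([(softmax(a + eps * np.eye(len(a))[j]) - f) / eps
                  for j in range(len(a))]).T
print(np.allclose(J, J_num, atol=1e-4))    # True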

Solution:
For simplicity, we denote Σ_{c'=1}^{C} e^{a_{c'}} as Σ. By the quotient rule,

∂f(a)_i/∂a_j = ∂(e^{a_i}/Σ)/∂a_j
             = [ (∂e^{a_i}/∂a_j)·Σ − e^{a_i}·(∂Σ/∂a_j) ] / Σ²     (only e^{a_j} in Σ depends on a_j)
             = [ (∂e^{a_i}/∂a_j)·Σ − e^{a_i}·e^{a_j} ] / Σ²
             = (∂e^{a_i}/∂a_j)/Σ − (e^{a_i}/Σ)(e^{a_j}/Σ)
             = (∂e^{a_i}/∂a_j)/Σ − f(a)_i f(a)_j

Based on the above, if i ≠ j then the first term (∂e^{a_i}/∂a_j)/Σ = 0; otherwise it equals e^{a_i}/Σ = f(a)_i.
In the end, we have

∂f(a)_i/∂a_j = −f(a)_i f(a)_j            if i ≠ j,
∂f(a)_i/∂a_j = f(a)_i − f(a)_i f(a)_j    otherwise.

6. Given the GDA model's discriminant function

log p(y = c | x, θ) = log πc − (1/2) log |2πΣc| − (1/2)(x − µc)ᵀΣc⁻¹(x − µc) + const.     (1)

Please describe why the decision boundary between any two classes will be a quadratic function of x.
In addition, in which case will LDA (Figure 2, right) be a special case of GDA? Please explain that
according to the decision boundaries that can be derived from Equation 1.

Solution: The boundary can be understood as the set of points where the probabilities of the two classes
are equal; it can be determined by setting the difference of the two class log-probabilities to zero. Since
each class has its own covariance Σc, the quadratic terms xᵀΣc⁻¹x in Equation 1 do not cancel, so the
boundary separating the two classes is a quadratic function of x. The special case is when the covariance
matrices are tied or shared across classes, so Σc = Σ. Since Σ is then independent of c, we can simplify
Equation 1 as follows:

log p(y = c | x, θ) = log πc − (1/2)(x − µc)ᵀΣ⁻¹(x − µc) + const
                    = [log πc − (1/2)µcᵀΣ⁻¹µc] + xᵀΣ⁻¹µc + [const − (1/2)xᵀΣ⁻¹x]
                    = γc + xᵀβc + κ,

with βc = Σ⁻¹µc, γc = log πc − (1/2)µcᵀΣ⁻¹µc, and κ = const − (1/2)xᵀΣ⁻¹x.


Figure 2

The final term is independent of c, and hence is an irrelevant additive constant that can be dropped.
Hence we see that the discriminant function is a linear function of x, so the decision boundaries will
be linear. Hence this method is called linear discriminant analysis or LDA.

CSCI3320
FUNDAMENTALS OF MACHINE LEARNING   TA Names: Qinze Yu, Ziyuan Hu

ASSIGNMENT 6
Deadline: 11:59 pm, April 30, 2023

Submit via Blackboard with VeriGuide receipt.

Please follow the course policy and the school’s academic honesty policy.
Each question is worth ten points.

1. Please briefly answer the following questions.


(a) What is the softmax function in neural networks? Please write its function formula.
(b) What is the purpose of using the softmax function?
Hint: Referring to Week12’s Slides Page16.

Solution: Referring to Week12’s Slides Page16.

2. Which of the following statements is true about Fisher’s LDA for two classes? Only one choice is
correct.
[a] The goal of LDA is to minimize the distance between classes and to maximize the distance within
classes.
[b] The goal of LDA is to maximize the distance between classes and to minimize the distance within
classes.
[c] The goal of LDA is to minimize the distance between classes and within classes.
[d] The goal of LDA is to maximize the distance between classes and within classes.

Solution: b

3. Which of the following statements is Wrong about the LDA for two classes? Only one choice is correct.
[a] Fisher's linear discriminant attempts to find the vector that maximizes the separation between
classes of the projected data.
[b] LDA takes into account the categories in the data.
[c] LDA assumes linear decision boundary and variance-covariance homogeneity.
[d] The desired LDA transformation is in the direction of the within-class covariance matrix.
Solution: d
Note: option d is wrong because the desired LDA transformation w is proportional to the inverse of the
within-class covariance matrix times the difference of the class means, i.e., w ∝ S_W⁻¹(µ2 − µ1), not to
the within-class covariance matrix itself.


4. Referring to Week13's Slides Page8, we have learned about Fisher's LDA for two classes. Prove that w
reaches the maximum of J(w) = (wᵀS_B w)/(wᵀS_W w) when w satisfies S_B w = λS_W w, where
λ = (wᵀS_B w)/(wᵀS_W w).
Hint: For clarity, we first define f(w) = wᵀS_B w and g(w) = wᵀS_W w.
Recall that ∂[f(x)/g(x)]/∂x = (f′g − fg′)/g², where f′ = ∂f(x)/∂x and g′ = ∂g(x)/∂x. Recall that
∂(xᵀAx)/∂x = (A + Aᵀ)x. S_B and S_W are symmetric, i.e., S_Bᵀ = S_B and S_Wᵀ = S_W.

Solution:
We first define f(w) = wᵀS_B w and g(w) = wᵀS_W w; then we have

f′(w) = (S_B + S_Bᵀ)w
g′(w) = (S_W + S_Wᵀ)w

If w maximizes J(w), then

∂J(w)/∂w = ∂[f(w)/g(w)]/∂w = [f′(w)g(w) − f(w)g′(w)] / g²(w) = 0

To satisfy the above equation, we only need

f′(w)g(w) − f(w)g′(w) = 0
f′(w)g(w) = f(w)g′(w)
(S_B + S_Bᵀ)w · (wᵀS_W w) = (wᵀS_B w) · (S_W + S_Wᵀ)w     (wᵀS_W w and wᵀS_B w are scalars)
(S_B + S_Bᵀ)w = (wᵀS_B w / wᵀS_W w)(S_W + S_Wᵀ)w
(S_B + S_Bᵀ)w = λ(S_W + S_Wᵀ)w

By their definitions, S_B and S_W are symmetric:

S_Bᵀ = ((µ2 − µ1)(µ2 − µ1)ᵀ)ᵀ = (µ2 − µ1)(µ2 − µ1)ᵀ = S_B

S_Wᵀ = ( Σ_{n: yn=1} (xn − µ1)(xn − µ1)ᵀ + Σ_{n: yn=2} (xn − µ2)(xn − µ2)ᵀ )ᵀ
     = Σ_{n: yn=1} ((xn − µ1)(xn − µ1)ᵀ)ᵀ + Σ_{n: yn=2} ((xn − µ2)(xn − µ2)ᵀ)ᵀ
     = Σ_{n: yn=1} (xn − µ1)(xn − µ1)ᵀ + Σ_{n: yn=2} (xn − µ2)(xn − µ2)ᵀ
     = S_W

Therefore we have

(S_B + S_Bᵀ)w = λ(S_W + S_Wᵀ)w
(S_B + S_B)w = λ(S_W + S_W)w
2 S_B w = 2λ S_W w
S_B w = λ S_W w

5. We have a dataset for 2 classes. For class 1, we have these samples {(1.4, 1.3, 0.8), (0.3, -0.4, -0.3), (0, -1.1,
-2), (1.3, -0.5, -0.6)}. For class 2, we have these samples {(-1, -0.5, -1), (-0.5, -0.9, -0.2), (-1.4, 0.5, -1.2), (-0.8,
-0.9, -1.3), (0.4, -0.1, 0.9), (1.1, -0.4, -0.3)}. Now we want to fit a Fisher’s LDA model to this dataset. For
this question, all your results should be rounded to 3 decimal places.
(a) Calculate the mean of the two classes, µ 1 and µ 2 .
(b) Calculate the between-class scatter matrix SB .
(c) Calculate the within-class scatter matrix SW .
(d) Calculate the optimal w ∗ based on SB and SW .
Hint: You can use some software to solve the eigenvalue problem.

Solution:
(a)
µ1 = (1/4)[(1.4, 1.3, 0.8)ᵀ + (0.3, −0.4, −0.3)ᵀ + (0, −1.1, −2)ᵀ + (1.3, −0.5, −0.6)ᵀ]
   = (0.75, −0.175, −0.525)ᵀ

µ2 = (1/6)[(−1, −0.5, −1)ᵀ + (−0.5, −0.9, −0.2)ᵀ + (−1.4, 0.5, −1.2)ᵀ + (−0.8, −0.9, −1.3)ᵀ
   + (0.4, −0.1, 0.9)ᵀ + (1.1, −0.4, −0.3)ᵀ]
   = (−0.367, −0.383, −0.517)ᵀ

(b)
S_B = (µ2 − µ1)(µ2 − µ1)ᵀ, with µ2 − µ1 = (−1.117, −0.208, 0.008)ᵀ:

S_B = [[1.248, 0.232, −0.009], [0.232, 0.043, −0.002], [−0.009, −0.002, 0]]


(c)
S1 = Σ_x (x − µ1)(x − µ1)ᵀ, summed over the four class-1 samples (1.4, 1.3, 0.8), (0.3, −0.4, −0.3),
(0, −1.1, −2), (1.3, −0.5, −0.6):

S1 = [[1.49, 1.575, 1.825], [1.575, 3.188, 3.293], [1.825, 3.293, 3.988]]

S2 = Σ_x (x − µ2)(x − µ2)ᵀ, summed over the six class-2 samples (−1, −0.5, −1), (−0.5, −0.9, −0.2),
(−1.4, 0.5, −1.2), (−0.8, −0.9, −1.3), (0.4, −0.1, 0.9), (1.1, −0.4, −0.3):

S2 = [[4.413, −0.353, 2.713], [−0.353, 1.408, 0.092], [2.713, 0.092, 3.468]]

S_W = S1 + S2 = [[5.903, 1.222, 4.538], [1.222, 4.596, 3.385], [4.538, 3.385, 7.456]]

(d) Referring to Week13 Slides Page18, w∗ ∝ S_W⁻¹(µ2 − µ1).
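A numpy sketch (optional) that reproduces the quantities above and the direction S_W⁻¹(µ2 − µ1):

# Sketch: class means, S_B, S_W and w* ~ S_W^{-1}(mu2 - mu1) for the given samples.
import numpy as np

X1 = np.array([[1.4, 1.3, 0.8], [0.3, -0.4, -0.3], [0.0, -1.1, -2.0], [1.3, -0.5, -0.6]])
X2 = np.array([[-1.0, -0.5, -1.0], [-0.5, -0.9, -0.2], [-1.4, 0.5, -1.2],
               [-0.8, -0.9, -1.3], [0.4, -0.1, 0.9], [1.1, -0.4, -0.3]])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
d = (mu2 - mu1).reshape(-1, 1)
S_B = d @ d.T                                                 # between-class scatter
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)   # within-class scatter
w_star = np.linalg.solve(S_W, mu2 - mu1)                      # direction of the optimal w
print(np.round(mu1, 3), np.round(mu2, 3))
print(np.round(S_B, 3))
print(np.round(S_W, 3))
print(w_star)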
