Written Assignment 2
Question 1
Build a classification tree model using the following table, which records restaurant details.
The class variable, OK, is whether you’ve enjoyed it or not. Build the tree using the entropy
impurity function up to depth 2 (two splits along each path). Draw the resulting tree. What is
the classification error of this tree?
Solution:
The classification tree:
Root split: "Price = $$$ ?"
  'Yes' → node $t_1$, labeled '0'
  'No'  → node $t_2$, split on "Restriction = Gluten Free ?"
      'Yes' → node $t_{21}$, labeled '0'
      'No'  → node $t_{22}$, labeled '1'
Step 1:
The root node contains all 11 training samples: '0' – 6 obs., '1' – 5 obs.
Impurity measure: $Imp(t_{\text{root}}) = -\frac{6}{11}\log_2\!\left(\frac{6}{11}\right) - \frac{5}{11}\log_2\!\left(\frac{5}{11}\right) = 0.994$
Step 2:
We found that the "best" split (maximum impurity decrease $\Delta Imp$) at the root is the question "Price = $$$ ?". It sends 2 samples to node $t_1$ ('Yes') and 9 samples to node $t_2$ ('No'):
$\pi(t_1) = \frac{2}{11}, \qquad \pi(t_2) = \frac{9}{11}$
Node $t_1$: 2 observations: '0' – 2 obs., '1' – 0 obs., so $Imp(t_1) = 0$ (the node is pure).
Node $t_2$: 9 observations: '0' – 4 obs., '1' – 5 obs.
Impurity measure: $Imp(t_2) = -\frac{4}{9}\log_2\!\left(\frac{4}{9}\right) - \frac{5}{9}\log_2\!\left(\frac{5}{9}\right) = 0.9911$
Step 3: Split $t_2$ (we don't split $t_1$ because all of its samples are of class '0'). The best split of $t_2$ is the question "Restriction = Gluten Free ?":
Node $t_{21}$ ('Yes'): 3 observations: '0' – 2 obs., '1' – 1 obs.
Impurity measure: $Imp(t_{21}) = -\frac{2}{3}\log_2\!\left(\frac{2}{3}\right) - \frac{1}{3}\log_2\!\left(\frac{1}{3}\right) = 0.9183$
Node $t_{22}$ ('No'): 6 observations: '0' – 2 obs., '1' – 4 obs.
Impurity measure: $Imp(t_{22}) = -\frac{1}{3}\log_2\!\left(\frac{1}{3}\right) - \frac{2}{3}\log_2\!\left(\frac{2}{3}\right) = 0.9183$
Each leaf is labeled by its majority class: $t_1 \to$ '0', $t_{21} \to$ '0', $t_{22} \to$ '1'.
Classification error: the tree misclassifies $0 + 1 + 2 = 3$ of the 11 training samples, so the error is $\frac{3}{11} \approx 0.27$.
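As a quick numeric check of the impurity values above, here is a minimal Python sketch (the helper name `entropy_impurity` is our own, not part of the assignment):

```python
import math

def entropy_impurity(counts):
    """Entropy impurity of a node, given its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Class counts ('0', '1') of the nodes in the tree above
print(entropy_impurity([6, 5]))  # root ~ 0.994
print(entropy_impurity([2, 0]))  # t1   = 0.0
print(entropy_impurity([4, 5]))  # t2   ~ 0.9911
print(entropy_impurity([2, 1]))  # t21  ~ 0.9183
print(entropy_impurity([2, 4]))  # t22  ~ 0.9183
```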
Question 2
Here are six different datasets A, B, C, D, E, F. Each dataset is clustered using two different methods, one of which is K-means. Decide which result is more likely to be the output of the K-means algorithm. The distance measure used here is the Euclidean distance. Provide a very short explanation.
Solution:
Since the distance measure is the Euclidean distance, the K-means algorithm assigns each data sample according to this specific distance: the minimum Euclidean distance between a sample point and a centroid determines which cluster the sample belongs to (each centroid represents one cluster).
Consequently, the decision boundaries of K-means are straight lines: each boundary is the perpendicular bisector of the segment between two centroids, so the clusters form convex (Voronoi) cells.
For instance: the 'X's are centroids, the black lines are the decision boundaries (the bisectors), and the red, green and blue colors mark the three different clusters.
[Figure: centroids marked 'X' with the black bisector lines separating the red, green and blue clusters.]
In this question, in order to figure out which result is the K-means one, we need to look for these bisector boundaries dividing the data. In each dataset, the K-means result is the one whose clusters are separated by such straight-line boundaries (marked with black lines).
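For completeness, a minimal NumPy sketch of the two K-means steps described above (the function name and initialization are our own choices, not part of the assignment):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean
        # distance), which is exactly the bisector-based partition described above.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```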
Question 3
Consider 4 data points in the 2D space: (−1, −1), (0.5, −0.5), (1, 1), (−0.5, 0.5).
1. What is the first principal component vector?
2. If we project all points onto the 1D subspace spanned by the first principal component, what is the variance of the projected data?
3. For a given dataset X, all the eigenvalues of its covariance matrix C are {2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001}.
• How many features does each point in X have? (If it's impossible to know, say "impossible to know".)
• How many points are there in X? (If it's impossible to know, say "impossible to know".)
• If we want to explain more than 90% of the total variance using the first k principal components, what is the least value of k?
Solution:
1. What is the first principal component vector?
In this case we have a 2-D feature space.
Step 1: calculation of the covariance matrix (the means are zero in this case):
$S_X = \frac{1}{4-1} X^{\top} X = \begin{pmatrix} \frac{5}{6} & \frac{1}{2} \\ \frac{1}{2} & \frac{5}{6} \end{pmatrix}$
As we learned in class, the first principal component vector is the eigenvector corresponding to the largest eigenvalue. Diagonalizing,
$S_X = V D V^{\top}, \qquad V = \begin{pmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{pmatrix}, \qquad D = \begin{pmatrix} \frac{4}{3} & 0 \\ 0 & \frac{1}{3} \end{pmatrix}$
Hence, the first principal component is $v_1 = \begin{pmatrix} 0.7071 \\ 0.7071 \end{pmatrix}$.
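A short NumPy check of this eigendecomposition (a sketch; note that `np.linalg.eigh` returns the eigenvalues in ascending order and the eigenvectors only up to sign):

```python
import numpy as np

# The four 2D points from the question (their means are already zero)
X = np.array([[-1.0, -1.0], [0.5, -0.5], [1.0, 1.0], [-0.5, 0.5]])

S = X.T @ X / (len(X) - 1)            # covariance matrix = [[5/6, 1/2], [1/2, 5/6]]
eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalues

print(eigvals)         # [0.333..., 1.333...] = [1/3, 4/3]
print(eigvecs[:, -1])  # first principal component ≈ [0.7071, 0.7071] (up to sign)
```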
2. If we project all points onto the 1D subspace spanned by the first principal component, what is the variance of the projected data?
The projection of all points onto the 1D subspace spanned by the first principal component is $\tilde{X} = X v_1$.
The variance of the projected data equals the largest eigenvalue; in our case $\lambda_1 = \frac{4}{3}$.
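A one-off numeric check of this value (a self-contained sketch; the variable names are ours):

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.5, -0.5], [1.0, 1.0], [-0.5, 0.5]])
v1 = np.array([1.0, 1.0]) / np.sqrt(2)   # first principal component from part 1

projected = X @ v1                       # 1D coordinates along v1
print(np.var(projected, ddof=1))         # 1.333... = 4/3 = lambda_1
```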
3. For a given dataset X, all the eigenvalues of its covariance matrix C are {2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001}.
If $X$ is the data-set matrix of size $n \times p$ (points × features), then the covariance matrix is defined by:
$\mathrm{COV}(X) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^{\top}_{\{p\times 1\}}\,(X_i - \bar{X})_{\{1\times p\}} = C_{\{p\times p\}}$
where $\bar{X}$ is the mean vector of the features in the data set, of size $1 \times p$, and $X_i$ is the $i$-th feature vector (the $i$-th row of the matrix $X$), also of size $1 \times p$.
How many features does each point in X have? How many points are there in X?
The number of eigenvalues of $C$ equals the number of features $p$, so each point has $p = 9$ features. The eigenvalues tell us nothing about the number of points $n$ in $X$: impossible to know.
If we want to explain more than 90% of the total variance using the first k principal components, what is the least value of k?
The proportion of variance explained by the first $k$ principal components is $\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{p}\lambda_i}$.

k (PCs):                    1      2      3      4      5      6      7      8      9
variance explained [%]:  32.02  56.76  77.14  88.78  94.60  97.51  99.69  99.99  100.00

The least value of k that explains more than 90% of the total variance is k = 5.
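The cumulative percentages in the table can be reproduced with a few lines of NumPy (a sketch; the array name is ours):

```python
import numpy as np

eigvals = np.array([2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001])

explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative proportion of variance
print(np.round(100 * explained, 2))              # 32.02, 56.76, ..., 100.0
print(np.argmax(explained > 0.90) + 1)           # least k with more than 90%: 5
```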
Question 4
1. Write down the cross-entropy loss function. Explain why it makes more sense for training
a classification problem than MSE.
2. Prove that the minus-log-likelihood of a Logistic Regression (LR), given weights 𝑤, is the
cross-entropy loss at 𝑤.
3. Prove that the decision surface of a LR is linear in the input 𝑥.
Solution:
1. Write down the cross-entropy loss function. Explain why it makes more sense for
training a classification problem than MSE.
Cross entropy: Let $p$ and $q$ be two probability distributions of a discrete random variable. The cross entropy of $p$ and $q$ is defined as:
$H(p, q) = -\sum_{k} p(k)\,\log\big(q(k)\big)$
Classification problem:
Let $\{(X_i, Y_i)\}_{i=1}^{n}$ be the conditionally independent samples in the training data set, where
$p(Y_i = k \mid X_i) = \mathbb{1}_{\{k = y_i\}} = \begin{cases} 1, & k = y_i \\ 0, & \text{else} \end{cases}$
is the true probability of the sample and $y_i \in \{1, \dots, K\}$ is the ground-truth label of the sample $X_i$.
Let the estimated conditional probability of the output $Y$ given $X$ and a model with parameters $w$ be $q_w(Y \mid X)$. Then
$L_{\text{cross-entropy}}(w) = \frac{1}{n}\sum_{i=1}^{n} H\big(p(Y_i \mid X_i),\, q_w(Y_i \mid X_i)\big) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} p(Y_i = k \mid X_i)\,\log\big(q_w(Y_i = k \mid X_i)\big)$
Why it makes sense to use cross entropy loss function for training a classification
problem:
Note that
$\arg\min_w L_{\text{cross-entropy}}(w) = \arg\min_w \left(-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} p(Y_i = k \mid X_i)\,\log\big(q_w(Y_i = k \mid X_i)\big)\right)$
$= \arg\min_w \left(-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \mathbb{1}_{\{k = y_i\}}\,\log\big(q_w(Y_i = k \mid X_i)\big)\right)$
$= \arg\min_w \left(-\frac{1}{n}\sum_{i=1}^{n}\log\big(q_w(Y_i = y_i \mid X_i)\big)\right) = \arg\max_w \left(\frac{1}{n}\sum_{i=1}^{n}\log\big(q_w(Y_i = y_i \mid X_i)\big)\right)$
$= \arg\max_w \left(\log\prod_{i=1}^{n} q_w(Y_i = y_i \mid X_i)\right) = \arg\max_w \left(\prod_{i=1}^{n} q_w(Y_i = y_i \mid X_i)\right)$
Hence,
$\arg\min_w L_{\text{cross-entropy}}(w) = \arg\max_w \mathcal{L}(w)$, where $\mathcal{L}(w) = \prod_{i=1}^{n} q_w(Y_i = y_i \mid X_i)$ is the likelihood of the training data.
That is, by minimizing the cross-entropy loss we obtain the maximum-likelihood parameters, and hence it makes sense to use this loss function.
As we learned in class, the parameters that minimize the MSE loss function are the maximum-likelihood parameters only when $Y \mid X$ is generated from a normal distribution. In a classification problem, however, $Y \mid X$ is generated from a discrete distribution, so the normality assumption behind MSE is wrong.
2. Prove that the minus-log-likelihood of a Logistic Regression (LR), given weights $w$, is the cross-entropy loss at $w$.
In LR the labels are binary, $Y_i \in \{0, 1\}$, so the minus-log-likelihood is
$-\log \mathcal{L}(w) = -\log\prod_{i=1}^{n} q_w(Y_i = y_i \mid X_i)$
$= -\log\prod_{i=1}^{n} \big(q_w(Y_i = 1 \mid X_i)\big)^{Y_i}\,\big(q_w(Y_i = 0 \mid X_i)\big)^{1 - Y_i}$
$= -\sum_{i=1}^{n} \Big[(1 - Y_i)\,\log\big(q_w(Y_i = 0 \mid X_i)\big) + Y_i\,\log\big(q_w(Y_i = 1 \mid X_i)\big)\Big]$
$= -\sum_{i=1}^{n}\sum_{k=0}^{1} \mathbb{1}_{\{k = y_i\}}\,\log\big(q_w(Y_i = k \mid X_i)\big) = n\,L_{\text{cross-entropy}}(w)$
Hence,
$-\log \mathcal{L}(w) = n\,L_{\text{cross-entropy}}(w)$.
That is, the optimization problem of minimizing the minus-log-likelihood is the same as the optimization problem of minimizing the cross-entropy loss (they differ only by the constant factor $n$).
3. Prove that the decision surface of a LR is linear in the input $x$.
Note that, writing the bias inside $w$ (a constant 1 appended to $X_i$),
$q_w(Y_i = 1 \mid X_i) > \tau \;\Longleftrightarrow\; \frac{1}{1 + e^{-X_i^{\top} w}} > \tau \;\Longleftrightarrow\; e^{-X_i^{\top} w} < \frac{1}{\tau} - 1 \;\Longleftrightarrow\; X_i^{\top} w > \log\!\left(\frac{1}{\frac{1}{\tau} - 1}\right) = \log\!\left(\frac{\tau}{1-\tau}\right)$
Hence, the decision surface is the linear surface $X_i^{\top} w = \log\!\left(\frac{\tau}{1-\tau}\right)$, which is linear in the input $x$.
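A quick numeric sanity check of this equivalence (a sketch with random weights and points; the names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=3)                            # weights, bias absorbed as last entry
X = np.c_[rng.normal(size=(5, 2)), np.ones(5)]    # 5 points with a constant-1 feature
tau = 0.7

lhs = sigmoid(X @ w) > tau                        # q_w(Y=1 | X) > tau
rhs = X @ w > np.log(tau / (1 - tau))             # the equivalent linear condition
print(np.array_equal(lhs, rhs))                   # True: the boundary is linear in x
```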
Question 5
1. Explain why a perceptron cannot implement the XOR function, x1 xor x2.
2. Design a NN that computes the XOR function using AND, NOT and OR gates that we saw in
class.
Solution:
1. Explain why a perceptron cannot implement the XOR function, x1 xor x2.
A perceptron with 2 input features maps the 2-dimensional input vector to one of 2 classes, in our case 0 or 1:
$\mathrm{perceptron}(x_1, x_2) = \begin{cases} 1, & w_1 x_1 + w_2 x_2 + b \ge 0 \\ 0, & \text{else} \end{cases}$
So, by this rule, classifying an input $(x_1, x_2)$ amounts to checking whether the value $w_1 x_1 + w_2 x_2 + b$ is non-negative or negative.
Thus, the perceptron's decision boundary is a single straight line that divides the plane into the two classes.
[Figure: the four XOR inputs on the unit square, with (0,0) and (1,1) labeled as value 0 and (0,1) and (1,0) labeled as value 1.]
We cannot find a single straight line that separates the value-1 points from the value-0 points, i.e. XOR is not linearly separable, so one perceptron cannot classify it correctly.
We can see that with 2 lines we can create this division, which means we need more than one perceptron.
2. Design a NN that computes the XOR function using AND, NOT and OR gates that
we saw in class.
As we saw in class, the OR and AND gates can each be implemented by a single perceptron, and a NOT gate by a perceptron with a negative weight. Combining them in two layers as $x_1 \;\mathrm{xor}\; x_2 = (x_1 \;\mathrm{or}\; x_2) \;\mathrm{and}\; \mathrm{not}(x_1 \;\mathrm{and}\; x_2)$ gives a NN that computes XOR, as sketched below.
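A minimal Python sketch of this construction, with one possible choice of gate weights (the exact weights shown in class may differ):

```python
def perceptron(x, w, b):
    """Single perceptron: outputs 1 if w . x + b >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def OR(x1, x2):  return perceptron((x1, x2), (1, 1), -0.5)   # 1 iff x1 + x2 >= 0.5
def AND(x1, x2): return perceptron((x1, x2), (1, 1), -1.5)   # 1 iff x1 + x2 >= 1.5
def NOT(x):      return perceptron((x,), (-1,), 0.5)         # 1 iff x <= 0.5

def XOR(x1, x2):
    # Two-layer network: x1 xor x2 = (x1 or x2) and not(x1 and x2)
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))   # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```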
Question 6
What ML algorithm could have produced the following decision surfaces? Give only one method per drawing. Do not repeat the same method twice (different hyper-parameters are considered different methods). No explanation needed.
Solution: