
Statistical Inference and Data Mining, 371-2-1721

Written Assignment 2

Winter semester 2021-2022

Team: Orel Ben Zaken 315628313


Omer Luxembourg 205500390

Question 1

Build a classification tree model using the following table, which records restaurant details.
The class variable, OK, is whether you’ve enjoyed it or not. Build the tree using the entropy
impurity function up to depth 2 (two splits along each path). Draw the resulting tree. What is
the classification error of this tree?

Solution:
The classification tree:

- Root: "Price = $$$ ?"
  - 'Yes' → $t_1$: leaf, class '0'
  - 'No' → $t_2$: "Restriction = Gluten Free ?"
    - 'Yes' → $t_{21}$: leaf, class '0'
    - 'No' → $t_{22}$: leaf, class '1'

The building process of the tree:

Step 1:

Root ($t_0$): 11 observations: '0' – 6 obs., '1' – 5 obs.


Proportion of observations: $\hat{p}(0|t_0) = \frac{6}{11},\ \hat{p}(1|t_0) = \frac{5}{11}$

Impurity measure: $Imp(t_0) = -\left(\frac{5}{11}\right)\log_2\left(\frac{5}{11}\right) - \left(\frac{6}{11}\right)\log_2\left(\frac{6}{11}\right) = 0.994$

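As a quick sanity check, here is a minimal Python sketch (ours, not part of the original solution) that reproduces the impurity value above from the class counts at the root:

```python
import math

def entropy(counts):
    """Entropy impurity (base-2) of a node, given its class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Root node t0: 6 observations of class '0' and 5 of class '1'
print(round(entropy([6, 5]), 3))  # -> 0.994
```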
Step 2:

We found that the "best" split (maximum $\Delta Imp(t_0)$) is the question "Price = $$$ ?":

$$\pi(t_1) = \frac{2}{11}, \qquad \pi(t_2) = \frac{9}{11}$$
Node ($t_1$): 2 observations: '0' – 2 obs., '1' – 0 obs.

Proportion of observations: $\hat{p}(0|t_1) = 1,\ \hat{p}(1|t_1) = 0$

Impurity measure: $Imp(t_1) = 0$

Node ($t_2$): 9 observations: '0' – 4 obs., '1' – 5 obs.

Proportion of observations: $\hat{p}(0|t_2) = \frac{4}{9},\ \hat{p}(1|t_2) = \frac{5}{9}$

Impurity measure: $Imp(t_2) = -\left(\frac{4}{9}\right)\log_2\left(\frac{4}{9}\right) - \left(\frac{5}{9}\right)\log_2\left(\frac{5}{9}\right) = 0.9911$

The decrease in Impurity measure:

$$\Delta Imp(t_0) = Imp(t_0) - \big[\pi(t_1)\,Imp(t_1) + \pi(t_2)\,Imp(t_2)\big] = 0.1831$$

The tree after this split:

- Root: "Price = $$$ ?"
  - 'Yes' → $t_1$: '0' – 2, '1' – 0
  - 'No' → $t_2$: '0' – 4, '1' – 5

Step 3: Split $t_2$ (we don't split $t_1$ because all of its samples belong to class '0'):

We found that the "best" split (maximum $\Delta Imp(t_2)$) is the question "Restriction = Gluten Free ?":

$$\pi(t_{21}) = \frac{3}{9}, \qquad \pi(t_{22}) = \frac{6}{9}$$
Node ($t_{21}$): 3 observations: '0' – 2 obs., '1' – 1 obs.

Proportion of observations: $\hat{p}(0|t_{21}) = \frac{2}{3},\ \hat{p}(1|t_{21}) = \frac{1}{3}$

Impurity measure: $Imp(t_{21}) = -\left(\frac{2}{3}\right)\log_2\left(\frac{2}{3}\right) - \left(\frac{1}{3}\right)\log_2\left(\frac{1}{3}\right) = 0.9183$

Node ($t_{22}$): 6 observations: '0' – 2 obs., '1' – 4 obs.

Proportion of observations: $\hat{p}(0|t_{22}) = \frac{1}{3},\ \hat{p}(1|t_{22}) = \frac{2}{3}$

Impurity measure: $Imp(t_{22}) = -\left(\frac{1}{3}\right)\log_2\left(\frac{1}{3}\right) - \left(\frac{2}{3}\right)\log_2\left(\frac{2}{3}\right) = 0.9183$

The decrease in Impurity measure:

$$\Delta Imp(t_2) = Imp(t_2) - \big[\pi(t_{21})\,Imp(t_{21}) + \pi(t_{22})\,Imp(t_{22})\big] = 0.0728$$


The tree after this split:

- Root: "Price = $$$ ?"
  - 'Yes' → $t_1$: '0' – 2, '1' – 0
  - 'No' → "Restriction = Gluten Free ?"
    - 'Yes' → $t_{21}$: '0' – 2, '1' – 1
    - 'No' → $t_{22}$: '0' – 2, '1' – 4

Step 4: Assign to each terminal node $t$ the class $Y(t) = \arg\max_{k \in \{0,1\}} \hat{p}(k|t)$, so $t_1 \to$ '0', $t_{21} \to$ '0', $t_{22} \to$ '1'.

The classification error of this tree on the training data is the fraction of misclassified observations in the terminal nodes: $\frac{0 + 1 + 2}{11} = \frac{3}{11} \approx 0.27$.
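The same error can be checked with a short Python sketch (the leaf counts are those derived above; the helper itself is ours):

```python
# Class counts ('0', '1') at the terminal nodes of the depth-2 tree
leaves = {"t1": (2, 0), "t21": (2, 1), "t22": (2, 4)}

errors = sum(min(c0, c1) for c0, c1 in leaves.values())  # misclassified per leaf
total = sum(c0 + c1 for c0, c1 in leaves.values())
print(f"{errors}/{total} = {errors / total:.3f}")        # -> 3/11 = 0.273
```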
Question 2

Here are six different datasets A, B, C, D, E, F. Each dataset is clustered using two different
methods, one of which is K-means. Decide which result is more likely to be the output of the
K-means algorithm. The distance measure used here is the Euclidean distance. Provide a
very short explanation.

Solution:
Since the distance measure is the Euclidean distance, the K-means algorithm must assign the data samples according to this specific distance.

Concretely, each sample point is assigned to the class of the centroid nearest to it (each mean represents a class).

Consequently, the decision boundaries of K-means are straight lines (hyperplanes): each boundary is the perpendicular bisector of the segment connecting two centroids, so the clusters form a Voronoi partition of the space.

For instance, in the sketch below the 'X's are centroids, the black lines are the decision boundaries (the perpendicular bisectors), and the red, green and blue colors mark the three different clusters.

[Sketch: centroids marked 'X', separated by black perpendicular-bisector boundaries into colored clusters.]

In this question, in order to figure out which result comes from K-means, we need to check whether the clusters in each figure could have been produced by such boundaries (perpendicular bisectors between centroids). The decision boundaries are marked with black lines in the figures.
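To make the assignment rule concrete, here is a minimal sketch (the centroids and points are toy values assumed for illustration, not taken from the assignment's datasets):

```python
import numpy as np

def assign_clusters(points, centroids):
    """Assign each point to its nearest centroid (Euclidean distance)."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
points = np.array([[1.0, 1.0], [3.0, -1.0]])
print(assign_clusters(points, centroids))  # [0 1]; the boundary is the bisector x = 2
```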

A: Figure A2 is more likely to be the output of the K-means algorithm.


B: Figure B2 is more likely to be the output of the K-means algorithm.

C: Figure C2 is more likely to be the output of the K-means algorithm.

D: Figure D1 is more likely to be the output of the K-means algorithm.


E: Figure E2 is more likely to be the output of the K-means algorithm.

F: Figure F2 is more likely to be the output of the K-means algorithm.


Question 3

Consider 4 data points in the 2D space: (−1, −1), (0.5, −0.5), (1, 1), (−0.5, 0.5).

1. What is the first principal component vector?

2. If we project all points into the 1D subspace spanned by the first principal components.
What is the variance of the projected data?

3. For a given dataset of 𝑋, all the eigenvalues of its covariance matrix 𝐶 are {2.2, 1.7, 1.4,
0.8, 0.4, 0.2, 0.15, 0.02, 0.001}.

• How many features does each point in X have? (If it’s impossible to know,
say “impossible to know”).
• How many points are there in X? (If it's impossible to know, say
“impossible to know”).
• If we want to explain more than 90% of the total variance using the first 𝑘
principal components, what is the least value of 𝑘?

Solution:
1. What is the first principal component vector?
In this case the "feature space" is 2-dimensional.

Step 1: calculation of the covariance matrix (the means are zero in this case)

$$S_X = \frac{1}{4-1} X^T X = \begin{pmatrix} \frac{5}{6} & \frac{1}{2} \\ \frac{1}{2} & \frac{5}{6} \end{pmatrix}$$

where $X$ is the data-set matrix of size $4 \times 2$ (samples × features).

As we learned in class, the first principal component vector is the eigenvector corresponding to the largest eigenvalue.

Step 2: the eigendecomposition of $S_X$ is:

$$S_X = V D V^T, \qquad V = \begin{pmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{pmatrix}, \qquad D = \begin{pmatrix} \frac{4}{3} & 0 \\ 0 & \frac{1}{3} \end{pmatrix}$$

Hence, the first principal component is $\boldsymbol{v}_1 = \begin{pmatrix} 0.7071 \\ 0.7071 \end{pmatrix}$.
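A short numerical check of this result (a sketch we added, using NumPy; the sign of the eigenvector returned by the solver may be flipped):

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.5, -0.5], [1.0, 1.0], [-0.5, 0.5]])  # 4 x 2, zero-mean

S = X.T @ X / (X.shape[0] - 1)        # sample covariance (the means are zero)
eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalues of a symmetric matrix
print(S)                              # ~[[0.833 0.5], [0.5 0.833]] = [[5/6, 1/2], [1/2, 5/6]]
print(eigvals)                        # ~[0.333 1.333] -> 1/3 and 4/3
print(eigvecs[:, -1])                 # first PC: ~[0.7071 0.7071] (up to sign)
```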
2. If we project all points into the 1D subspace spanned by the first principal
components. What is the variance of the projected data?

The projection of all points onto the 1D subspace spanned by the first principal component is $\tilde{X} = X\boldsymbol{v}_1$, where $X$ is the data-set matrix of size $4 \times 2$ (samples × features) and $\boldsymbol{v}_1$ is the first principal component (a vector of size $2 \times 1$).

The variance calculation: the mean of the projected data is zero because the data are centered,

$$E[X\boldsymbol{v}_1] = E[X]\,\boldsymbol{v}_1 = \boldsymbol{0}$$

$$COV\big(\tilde{X}\big) = E\big[(X\boldsymbol{v}_1)^T (X\boldsymbol{v}_1)\big] = \boldsymbol{v}_1^T E\big[X^T X\big]\boldsymbol{v}_1 = \boldsymbol{v}_1^T S_X \boldsymbol{v}_1$$

The vector $\boldsymbol{v}_1$ is an eigenvector of the matrix $S_X$, hence

$$COV\big(\tilde{X}\big) = \boldsymbol{v}_1^T S_X \boldsymbol{v}_1 = \{S_X \boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1\} = \lambda_1 \boldsymbol{v}_1^T \boldsymbol{v}_1 = \{\boldsymbol{v}_1 \text{ is a unit vector}\} = \lambda_1$$

In our case, the variance is $\lambda_1 = \frac{4}{3}$.
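A short numerical check of this value (again a sketch we added, not part of the original solution):

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.5, -0.5], [1.0, 1.0], [-0.5, 0.5]])
v1 = np.array([1.0, 1.0]) / np.sqrt(2)        # first principal component
X_proj = X @ v1                                # 1-D projection of the four points
var = X_proj @ X_proj / (X.shape[0] - 1)       # sample variance (projected mean is zero)
print(var)                                     # 1.3333... = 4/3 = lambda_1
```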

3. For a given dataset of 𝑋, all the eigenvalues of its covariance matrix 𝐶 are {2.2,
1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001}.

If $X$ is a data-set matrix of size $n \times p$ (points × features), then the covariance matrix is defined by:

$$COV(X) = \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \bar{X}\big)^T_{\{p \times 1\}}\big(X_i - \bar{X}\big)_{\{1 \times p\}} = C_{\{p \times p\}}$$

where $\bar{X}$ is the mean vector of the features in the data set, of size $1 \times p$, and $X_i$ is the $i$-th point (feature vector) of the data set, of size $1 \times p$ (the $i$-th row of the matrix $X$).

To get the eigenvalues of the matrix $C$ we compute its eigenvalue decomposition (EVD): $C = V_{\{p \times p\}} D_{\{p \times p\}} V_{\{p \times p\}}^T$, where $D$ is a diagonal matrix and the eigenvalues of $C$ are the $p$ values on the diagonal of $D$.

How many features does each point in X have? (If it’s impossible to know, say
“impossible to know”).

The number of features $p$ of each point in $X$ is equal to the number of eigenvalues of the covariance matrix $C$; in this case, $p = 9$.

How many points are there in X? (If it's impossible to know, say "impossible to know").

From the number of eigenvalues of $C$ we can determine only the number of features $p$, not the number of points $n$ in $X$.

Impossible to know.

If we want to explain more than 90% of the total variance using the first 𝑘 principal
components, what is the least value of 𝑘?

The proportion of variance explained by the first $k$ principal components is: $\dfrac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{p}\lambda_i}$.

k (number of PCs)           1      2      3      4      5      6      7      8       9
variance explained [%]  32.02  56.76  77.14  88.78  94.60  97.51  99.69  99.99  100.00

The least value of $k$ that explains more than 90% of the total variance is $k = 5$.
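A short sketch (ours, not part of the original solution) that reproduces the table and the answer:

```python
import numpy as np

eigvals = np.array([2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001])
explained = np.cumsum(eigvals) / eigvals.sum() * 100   # cumulative % of total variance
print(np.round(explained, 2))
# [ 32.02  56.76  77.14  88.78  94.6   97.51  99.69  99.99 100.  ]
k = int(np.argmax(explained > 90)) + 1                 # first k exceeding 90%
print(k)                                               # 5
```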
Question 4

1. Write down the cross-entropy loss function. Explain why it makes more sense for training
a classification problem than MSE.
2. Prove that the minus-log-likelihood of a Logistic Regression (LR), given weights 𝑤, is the
cross-entropy loss at 𝑤.
3. Prove that the decision surface of a LR is linear in the input 𝑥.

Solution:
1. Write down the cross-entropy loss function. Explain why it makes more sense for
training a classification problem than MSE.

Cross entropy: let $p$ and $q$ be two probability distributions of a random variable. The cross entropy of $p$ and $q$ is defined as:

$$H(p, q) = -E_p[\log(q)]$$

where $E_p[\cdot]$ is the expected-value operator with respect to the distribution $p$.

Classification problem:

• In a classification problem, the task is to approximate a mapping function from input variables ($x$) to a discrete output variable $y$.
• It is common for classification models to predict a continuous value as the
probability of a given sample belonging to each output label. The probabilities
can be interpreted as the likelihood or confidence of a given example belonging
to each class. A predicted probability can be converted into a class value by
selecting the class label that has the highest probability.

Let $\{(X_i, Y_i)\}_{i=1}^{n}$ be the conditionally independent samples in the training data set, where $p(Y_i = k|X_i) = 1_{\{k = y_i\}} = \begin{cases} 1, & k = y_i \\ 0, & \text{else} \end{cases}$ is the true probability of the sample and $y_i \in \{1, \ldots, K\}$ is the ground-truth label for the sample $X_i$.

Let the estimated conditional probability of the output $Y$ given $X$, under the model with parameters $w$, be $q_w(Y|X)$.

The cross-entropy loss function is defined as:

$$L_{CrossEntropy}(w) = \frac{1}{n}\sum_{i=1}^{n} H\big(p(Y_i|X_i),\, q_w(Y_i|X_i)\big) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} p(Y_i = k|X_i)\,\log\big(q_w(Y_i = k|X_i)\big)$$
Why it makes sense to use the cross-entropy loss function for training a classification problem:

In a classification problem we want the model parameters $w$ that maximize the likelihood of the model $q_w(Y|X)$ on the training set:

$$w = \arg\max_{w}\ \mathbb{P}(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1, \ldots, X_n; w)$$

where $\mathbb{P}(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n; w)$ is the conditional probability of $(Y_1, \ldots, Y_n)$ given $(X_1, \ldots, X_n)$ and a model with parameters $w$.

For $n$ conditionally independent samples:

$$w = \arg\max_{w}\ \mathbb{P}(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1, \ldots, X_n; w) = \arg\max_{w} \prod_{i=1}^{n} \mathbb{P}(Y_i = y_i \mid X_i; w)$$

Since the estimated conditional probability of the output $Y$ given $X$, under the model with parameters $w$, is $q_w(Y|X)$:

$$\arg\max_{w}\ \mathbb{P}(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1, \ldots, X_n; w) = \arg\max_{w} \prod_{i=1}^{n} q_w(Y_i = y_i \mid X_i)$$

Note that

$$\arg\min_{w}\ L_{CrossEntropy}(w) = \arg\min_{w}\left\{-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} p(Y_i = k|X_i)\log\big(q_w(Y_i = k|X_i)\big)\right\}$$

$$= \arg\min_{w}\left\{-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} 1_{\{k = y_i\}}\log\big(q_w(Y_i = k|X_i)\big)\right\}$$

$$= \arg\min_{w}\left\{-\frac{1}{n}\sum_{i=1}^{n}\log\big(q_w(Y_i = y_i|X_i)\big)\right\} = \arg\max_{w}\left\{\frac{1}{n}\sum_{i=1}^{n}\log\big(q_w(Y_i = y_i|X_i)\big)\right\}$$

$$= \arg\max_{w}\left\{\log\prod_{i=1}^{n} q_w(Y_i = y_i|X_i)\right\} = \arg\max_{w}\left\{\prod_{i=1}^{n} q_w(Y_i = y_i|X_i)\right\}$$

Hence,

$$\arg\min_{w}\ L_{CrossEntropy}(w) = \arg\max_{w}\ \mathbb{P}(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1, \ldots, X_n; w)$$

That is, by minimizing the cross-entropy loss we obtain the maximum-likelihood parameters, and hence it makes sense to use this loss function.

Why it makes more sense than MSE:

As we learned in class, the parameters of the estimation model that minimize the MSE loss function are the maximum-likelihood parameters when $Y|X$ is generated from a normal distribution. In a classification problem, however, $Y|X$ is generated from a discrete distribution, so the normality assumption is wrong; the cross-entropy loss corresponds to the correct (categorical) likelihood.
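As an illustration, a minimal implementation sketch of the loss defined above (the helper name and toy numbers are ours):

```python
import numpy as np

def cross_entropy_loss(y_true, q_pred):
    """Average cross-entropy for integer labels y_true (values 0..K-1) and
    predicted class probabilities q_pred of shape (n, K)."""
    n = len(y_true)
    # p(Y_i = k | X_i) is one-hot, so only the probability of the true class
    # contributes to the inner sum over k.
    return -np.mean(np.log(q_pred[np.arange(n), y_true]))

y = np.array([0, 1, 1])
q = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
print(cross_entropy_loss(y, q))  # ~0.280
```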
2. Prove that the minus-log-likelihood of a Logistic Regression (LR), given weights 𝑤,
is the cross-entropy loss at 𝑤.

LR is a binary classification problem ($y_i \in \{0, 1\}$) with an estimation model defined by:

$$q_w(Y_i = 1|X_i) = \frac{1}{1 + e^{-X_i^T w}}, \qquad q_w(Y_i = 0|X_i) = 1 - q_w(Y_i = 1|X_i)$$

In this case the minus-log-likelihood, given weights (model's parameters) 𝑤 is:


$$-\log\big(\mathbb{P}(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n; w)\big) = \{\text{conditionally independent samples}\} = -\log\prod_{i=1}^{n} q_w(Y_i|X_i)$$

$$= -\log\prod_{i=1}^{n}\big(q_w(Y_i = 1|X_i)\big)^{Y_i}\big(q_w(Y_i = 0|X_i)\big)^{1 - Y_i}$$

$$= -\sum_{i=1}^{n}\Big[(1 - Y_i)\log\big(q_w(Y_i = 0|X_i)\big) + Y_i\log\big(q_w(Y_i = 1|X_i)\big)\Big]$$

$$= -\sum_{i=1}^{n}\sum_{k \in \{0,1\}} 1_{\{k = y_i\}}\log\big(q_w(Y_i = k|X_i)\big) = n\,L_{CrossEntropy}(w)$$

Hence,

$$\arg\min_{w}\Big\{-\log\big(\mathbb{P}(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n; w)\big)\Big\} = \arg\min_{w}\big\{L_{CrossEntropy}(w)\big\}$$

That is, the optimization problem of minimizing the minus-log-likelihood is equivalent to the optimization problem of minimizing the cross-entropy loss.
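A quick numerical check of this identity (toy data and weights assumed for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -2.0]])   # toy inputs
y = np.array([1, 0, 0])                                # toy labels
w = np.array([0.3, -0.7])                              # arbitrary weights

q1 = sigmoid(X @ w)                                    # q_w(Y_i = 1 | X_i)
nll = -np.sum(y * np.log(q1) + (1 - y) * np.log(1 - q1))   # minus log-likelihood
ce = -np.mean(y * np.log(q1) + (1 - y) * np.log(1 - q1))   # cross-entropy loss
print(np.isclose(nll, len(y) * ce))                    # True: NLL = n * L_CE(w)
```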

3. Prove that the decision surface of a LR is linear in the input 𝑥.


The decision rule of a LR is defined by:

$$\hat{Y}_i = \begin{cases} 1, & q_w(Y_i = 1|X_i) > \tau \\ 0, & \text{else} \end{cases}$$

where $\tau \in (0, 1)$ is a threshold.

Note that

$$q_w(Y_i = 1|X_i) > \tau \iff \frac{1}{1 + e^{-X_i^T w}} > \tau \iff e^{-X_i^T w} < \frac{1}{\tau} - 1 \iff X_i^T w > \log\left(\frac{\tau}{1 - \tau}\right)$$

Hence, the decision surface is the linear surface $x^T w = \log\left(\frac{\tau}{1 - \tau}\right)$, i.e., a hyperplane that is linear in the input $x$.
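A tiny check that the probability-threshold rule and the linear rule agree (weights, threshold and input are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0])
tau = 0.7
offset = np.log(tau / (1 - tau))      # decision surface: x^T w = log(tau / (1 - tau))

x = np.array([0.8, 0.3])              # an arbitrary input
rule_prob = sigmoid(x @ w) > tau      # threshold on the predicted probability
rule_linear = (x @ w) > offset        # equivalent linear rule
print(rule_prob == rule_linear)       # True
```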
Question 5

1. Explain why a perceptron cannot implement the XOR function, x1 xor x2.
2. Design a NN that computes the XOR function using AND, NOT and OR gates that we saw in
class.

Solution:
1. Explain why a perceptron cannot implement the XOR function, x1 xor x2.
A perceptron with 2 input features maps a 2-dimensional input vector to one of 2 classes, in our case 0 or 1:

$$perceptron(x_1, x_2) = \begin{cases} 1, & w_1 x_1 + w_2 x_2 + b \ge 0 \\ 0, & \text{else} \end{cases}$$

The sign change of $w_1 x_1 + w_2 x_2 + b$, from negative to positive, is the change in class from 0 to 1.

So, by this rule, in order to classify an input $(x_1, x_2)$ we only need to check whether the value $w_1 x_1 + w_2 x_2 + b$ is positive or negative, i.e., whether $w_1 x_1 + w_2 x_2 + b \ge 0$.

Thus, the decision boundary of a single perceptron is a single straight line that divides the input space into two classes.

However, XOR is not separable by a linear surface:

[Figure: the four XOR inputs on the unit square; (0,0) and (1,1) have value 0, while (0,1) and (1,0) have value 1.]

so we cannot find a single straight line that separates the inputs into their correct classes. We can see that two lines can create this division, which means we need more than one perceptron.
2. Design a NN that computes the XOR function using AND, NOT and OR gates that
we saw in class.

$$XOR(x_1, x_2) = x_1\bar{x}_2 + \bar{x}_1 x_2 = AND\big(x_1, NOT(x_2)\big) + AND\big(x_2, NOT(x_1)\big)$$

where $+$ denotes the OR gate.

As we saw in class, the OR and AND gates are each implemented by a single perceptron, and a NOT gate is implemented similarly, where the activation (step) unit can be omitted if $x_1, x_2 \in \{0, 1\}$.

Therefore, we construct a NN with the following architecture: a first layer of perceptrons computing $AND(x_1, NOT(x_2))$ and $AND(x_2, NOT(x_1))$, followed by an output OR perceptron.
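A runnable sketch of this architecture (the specific gate weights below are one standard choice we assume, since the class slides are not reproduced here):

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: 1 if w^T x + b >= 0, else 0."""
    return int(np.dot(w, x) + b >= 0)

def AND(a, b): return perceptron(np.array([a, b]), np.array([1.0, 1.0]), -1.5)
def OR(a, b):  return perceptron(np.array([a, b]), np.array([1.0, 1.0]), -0.5)
def NOT(a):    return perceptron(np.array([a]),    np.array([-1.0]),      0.5)

def XOR(x1, x2):
    # First layer: AND(x1, NOT(x2)) and AND(x2, NOT(x1)); output layer: OR
    return OR(AND(x1, NOT(x2)), AND(x2, NOT(x1)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", XOR(x1, x2))   # prints the XOR truth table: 0, 1, 1, 0
```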


Question 6

What ML algorithm could have produced the following decision surfaces? Give only one
method per drawing. Do not repeat the same method twice (different hyper-parameters are
considered different methods).

No explanation needed.

Solution:

• Classification decision tree
• Logistic regression
• Neural network
• SVM
• SVM with quadratic kernel
