
ISHAN CHAWLA (0029031621) chawla7@purdue.edu

CS573 DATA MINING HW 1

(Q0)

(1) C
(2) B
(a) http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

____________________________________________________________________________

(Q1)

Because the logistic regression is regularised, all features should be penalised
proportionately. I therefore standardise each feature by subtracting its mean and dividing by
its standard deviation before using it in the regression formulation. I also remove the
‘time’ field before processing.
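The preprocessing described above can be sketched in plain Python. This is only an illustration, not the code used for the homework: the field names `v1` and `amount` and the helper names are hypothetical, and the population standard deviation is assumed.

```python
from statistics import fmean, pstdev

def drop_field(records, field="time"):
    """Remove one named field from each record (dict) before building features."""
    return [{k: v for k, v in rec.items() if k != field} for rec in records]

def standardise_columns(rows):
    """Return rows with each column shifted to mean 0 and scaled to std 1."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical toy records; only 'time' is a field named in the text.
records = [
    {"time": 0.0, "v1": 1.0, "amount": 10.0},
    {"time": 1.0, "v1": 3.0, "amount": 30.0},
    {"time": 2.0, "v1": 5.0, "amount": 20.0},
]
features = [list(rec.values()) for rec in drop_field(records)]
scaled = standardise_columns(features)
```

After this step every column of `scaled` has mean 0 and standard deviation 1, so an L2 penalty treats all coefficients on the same scale.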

(1)

While training we maximise

$$M = \sum_{c \,\in\, \text{all examples}} \left[ t_c \log y(\phi_c) + (1 - t_c) \log\bigl(1 - y(\phi_c)\bigr) \right]$$

Because there are far more negative examples than positive ones, the sum is dominated by
negative-class terms, so training learns parameters that classify the negative class correctly,
even at the cost of misclassifying some positive samples as negative. Hence the accuracy
for the negative class (99%) is much better than that for the positive class (42%).
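A small sketch of how per-class accuracy can differ sharply from overall accuracy on imbalanced data. The labels below are made up for illustration (they are not the homework data), and `per_class_accuracy` is a hypothetical helper.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of examples of each class that were predicted correctly."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {label: correct[label] / total[label] for label in total}

# Hypothetical labels: 8 negatives, 2 positives, and a classifier that always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the overall accuracy looks acceptable even though no positive example is classified correctly, which is why the two classes are reported separately.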

(2)

Two effective solutions are:

1. Resample the data so that both classes have comparable populations, or collect
more data for the under-represented class.
2. Change the objective function to weight each example's term by the inverse of its class
probability, so that even if one class occurs more often, the two classes carry nearly
equal total weight and positive and negative samples are given equal importance in
training.
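The first option (resampling) could look roughly like the sketch below, which randomly duplicates minority-class rows until the class counts match. It assumes binary labels 0/1 with both classes present; the function name and toy data are hypothetical, and undersampling the majority class would be an equally valid variant.

```python
import random

def oversample_minority(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(X, y):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    deficit = len(by_class[majority]) - len(by_class[minority])
    # Sample with replacement from the minority class to fill the gap.
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(X) + extra, list(y) + [minority] * deficit

# Hypothetical toy data: 8 negatives and 2 positives.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = oversample_minority(X, y)
```

Resampling should be applied to the training split only, so that the test accuracies still reflect the original class proportions.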
(3)

Increasing or decreasing C does not change the accuracy for the negative class much, but it
strongly affects the accuracy for the positive class (the one with fewer samples).

Bayes' theorem says that

P(y | data) ∝ P(data | y) P(y)

The LHS is the posterior, which is proportional to the product of the likelihood and the prior,
so maximising P(y | data) involves a trade-off between the likelihood and the prior.
Thinking of the P(y) term as a regulariser, it matters little when the LHS is almost entirely
decided by the likelihood P(data | y). The negative class has a large amount of data, so its
likelihood term dominates and the regulariser has little effect on it. In contrast, for the
class with fewer samples the regulariser largely controls the MAP estimate.

(4)

To fix this issue, while training we maximise

$$M = \sum_{c \,\in\, \text{all examples}} \frac{1}{p_c} \left[ t_c \log y(\phi_c) + (1 - t_c) \log\bigl(1 - y(\phi_c)\bigr) \right]$$

where $p_c$ is the probability of the class of example $c$. That is, we calculate the probabilities of the two classes and maximise

$$\sum_{c:\, t_c = 1} \frac{1}{p_1} \log y(\phi_c) \;+\; \sum_{c:\, t_c = 0} \frac{1}{p_0} \log\bigl(1 - y(\phi_c)\bigr)$$

$p_1$ = total examples with class 1 / total examples

$p_0$ = total examples with class 0 / total examples
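As a sanity check on the weighted objective, the sketch below evaluates its two class terms on made-up targets and predicted probabilities. It assumes both classes are present and that every predicted probability lies strictly between 0 and 1; the function name is hypothetical.

```python
import math

def weighted_log_likelihood_terms(targets, probs):
    """Return (class-1 term, class-0 term) of the inverse-frequency-weighted log-likelihood."""
    n = len(targets)
    p1 = sum(targets) / n  # fraction of examples with class 1
    p0 = 1.0 - p1          # fraction of examples with class 0
    term1 = sum(math.log(y) for t, y in zip(targets, probs) if t == 1) / p1
    term0 = sum(math.log(1.0 - y) for t, y in zip(targets, probs) if t == 0) / p0
    return term1, term0

# Hypothetical data: three negatives, one positive, and an uninformative model (y = 0.5).
targets = [0, 0, 0, 1]
probs = [0.5, 0.5, 0.5, 0.5]
term1, term0 = weighted_log_likelihood_terms(targets, probs)
M = term1 + term0
```

With the 1/p weighting, the single positive example contributes as much to M as the three negative examples together, whereas in the unweighted objective the negative class would contribute three times as much.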

(5)

I calculated the weight of each class as the inverse of its probability of occurring.

Specifically, with total, pos and neg the numbers of all, positive and negative training examples, I used

pos_weight = total / pos
neg_weight = total / neg

And then used

from sklearn import linear_model
logreg = linear_model.LogisticRegression(penalty='l2', C=1e-8, class_weight={1: pos_weight, 0: neg_weight})

New accuracies: 97% (negative class), 76% (positive class)
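The weight computation can be sketched without scikit-learn as below. The helper name and the toy labels are hypothetical. For comparison, scikit-learn's `class_weight='balanced'` option computes `n_samples / (n_classes * np.bincount(y))`, which for two classes differs from the weights here only by a constant factor of 1/2.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each class label to total / (number of examples with that label)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / count for label, count in counts.items()}

# Hypothetical labels: 8 negatives and 2 positives.
labels = [0] * 8 + [1] * 2
class_weight = inverse_frequency_weights(labels)  # {0: 1.25, 1: 5.0}
```

The resulting dictionary has the same shape as the `class_weight` argument passed to `LogisticRegression` above.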

(6)

The original objective function for SVM is as follows. We are minimising the function

$$L(\mathbf{w}, b, \mathbf{a}) = \tfrac{1}{2}\,\|\mathbf{w}\|^2 - \sum_n a_n \left( t_n \bigl(\mathbf{w}^T \phi(x_n) + b\bigr) - 1 \right)$$

If examples of one class, for example the negative class, are more prevalent, the
formulation will learn parameters that maximise the second term, i.e. that always correctly
classify the negative class. We would instead modify this function as follows:

$$L(\mathbf{w}, b, \mathbf{a}) = \tfrac{1}{2}\,\|\mathbf{w}\|^2 - \sum_n a_n \left( \frac{t_n}{p_n} \bigl(\mathbf{w}^T \phi(x_n) + b\bigr) - 1 \right)$$

where $p_n$ is the probability of the class of example $n$, calculated as (samples of that class)
/ (total samples).

____________________________________________________________________________

(Q2)

(1)

𝝐 ~ N(0,𝞴I)

P(y_i | x_i, 𝛃) = ?

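We have $y_i = x_i^T \beta + \epsilon_i$, where each component $\epsilon_i$ of $\epsilon$ has variance $\lambda$.

When we are given $x_i$ and $\beta$, the term $x_i^T \beta$ is a constant.

Hence $P(y_i \mid x_i, \beta) = \mathcal{N}(x_i^T \beta, \lambda)$, since $c + \mathcal{N}(\mu, \sigma^2) \sim \mathcal{N}(\mu + c, \sigma^2)$.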

(2)

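If $\beta$ is normally distributed, we can use this knowledge as a prior over $\beta$.

$$\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta}\; P\bigl(\{y_i, x_i\}_{i=1}^{n} \mid \beta\bigr)\, P(\beta)$$

Assuming each $(y_i, x_i)$ is independent of every other $(y_j, x_j)$, we can write

$$\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta}\; \Bigl[\prod_{i} P(y_i, x_i \mid \beta)\Bigr] P(\beta)$$

$$= \arg\max_{\beta}\; \Bigl[\prod_{i} P(y_i \mid \beta, x_i)\, P(x_i)\Bigr] P(\beta)$$

$$= \arg\max_{\beta}\; \Bigl[\prod_{i} P(y_i \mid \beta, x_i)\Bigr] P(\beta)$$

[The last step drops $P(x_i)$, which does not depend on $\beta$, e.g. assuming the $x_i$ are uniformly sampled.]

Now, writing $X$ for the matrix whose columns are the $x_i$ and using noise variance $\lambda$ from $\epsilon \sim \mathcal{N}(0, \lambda I)$,

$$\prod_{i} P(y_i \mid \beta, x_i) = \prod_{i} \mathcal{N}(x_i^T \beta, \lambda) \;\propto\; \lambda^{-n/2} \exp\Bigl(-\tfrac{1}{2\lambda}\,(y - X^T\beta)^T (y - X^T\beta)\Bigr)$$

Taking the prior as $\beta \sim \mathcal{N}(0, \sigma^2 I)$ with $d$ the dimension of $\beta$, our MAP estimate is

$$\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta}\; \lambda^{-n/2} \exp\Bigl(-\tfrac{1}{2\lambda}\,(y - X^T\beta)^T (y - X^T\beta)\Bigr) \cdot \sigma^{-d} \exp\Bigl(-\tfrac{\beta^T\beta}{2\sigma^2}\Bigr)$$

$$= \arg\max_{\beta}\; \lambda^{-n/2}\, \sigma^{-d} \exp\Bigl(-\tfrac{\beta^T\beta}{2\sigma^2} - \tfrac{1}{2\lambda}\,(y - X^T\beta)^T (y - X^T\beta)\Bigr)$$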

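Taking logs turns the MAP problem into a penalised least-squares problem. The following is a sketch under stated assumptions rather than part of the submitted answer: the noise variance is taken to be $\lambda$ (as in $\epsilon \sim \mathcal{N}(0, \lambda I)$), the prior is $\beta \sim \mathcal{N}(0, \sigma^2 I)$, and $X$ is the $d \times n$ matrix whose columns are the $x_i$.

```latex
\begin{aligned}
\hat{\beta}_{\mathrm{MAP}}
  &= \arg\max_{\beta}\; \left[ -\frac{1}{2\lambda}\,(y - X^{T}\beta)^{T}(y - X^{T}\beta)
       - \frac{1}{2\sigma^{2}}\,\beta^{T}\beta \right] \\
  &= \arg\min_{\beta}\; \left[ \|y - X^{T}\beta\|^{2}
       + \frac{\lambda}{\sigma^{2}}\,\|\beta\|^{2} \right] \\
  &= \left( X X^{T} + \frac{\lambda}{\sigma^{2}}\, I \right)^{-1} X\, y
\end{aligned}
```

The second line follows by multiplying the log-posterior by $-2\lambda$, and the third by setting the gradient with respect to $\beta$ to zero. Under these assumptions the MAP estimate is ridge regression with penalty strength $\lambda / \sigma^2$.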
____________________________________________________________________________

(Q3)

At the decision boundary, the Euclidean distances from $x$ to $\mu_+$ and $\mu_-$ are equal.

$$y = \|x - \mu_+\|^2 - \|x - \mu_-\|^2 = 0 \quad \text{at the decision boundary}$$

Expanding the squares, the $\|x\|^2$ terms cancel:

$$y = \bigl(\|\mu_+\|^2 - \|\mu_-\|^2\bigr) + 2\,(\mu_- - \mu_+)^T x$$

So we have

$$w^T = 2\,(\mu_- - \mu_+)^T \;\Rightarrow\; w = 2\,(\mu_- - \mu_+)$$

$$b = \|\mu_+\|^2 - \|\mu_-\|^2$$
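The result can be checked numerically: the linear form w·x + b should equal the difference of squared distances for any x. The sketch below uses made-up class means and hypothetical helper names.

```python
def linear_rule(mu_pos, mu_neg):
    """Return (w, b) with w = 2(mu_neg - mu_pos) and b = ||mu_pos||^2 - ||mu_neg||^2."""
    w = [2.0 * (n - p) for p, n in zip(mu_pos, mu_neg)]
    b = sum(p * p for p in mu_pos) - sum(n * n for n in mu_neg)
    return w, b

def squared_distance(x, mu):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))

def y_direct(x, mu_pos, mu_neg):
    """Difference of squared distances to the two class means."""
    return squared_distance(x, mu_pos) - squared_distance(x, mu_neg)

def y_linear(x, w, b):
    """Linear form w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical class means and test points.
mu_pos, mu_neg = [1.0, 2.0], [-1.0, 0.0]
w, b = linear_rule(mu_pos, mu_neg)
x = [0.5, -1.0]
midpoint = [0.0, 1.0]  # equidistant from both means, so it lies on the boundary
```

With this sign convention, y < 0 means x is closer to the positive mean and y > 0 means it is closer to the negative mean.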

____________________________________________________________________________