
Q3.a) Relation between P(T_t | x_1, x_2, ..., x_n) and P(x_i | T_t) using Bayes' theorem under the assumption of conditional independence given the class label

In [223]: from IPython.display import Image

Image("b.jpg")

Out[223]:
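For readability, the relation asked for in Q3.a (which the attached derivation in b.jpg works out) is the standard Naive Bayes factorisation: applying Bayes' theorem and then conditional independence of the attributes given the class label,

P(T_t | x_1, x_2, ..., x_n) = P(T_t) * P(x_1, x_2, ..., x_n | T_t) / P(x_1, x_2, ..., x_n)
                            = P(T_t) * ∏_{i=1}^{n} P(x_i | T_t) / P(x_1, x_2, ..., x_n)
                            ∝ P(T_t) * ∏_{i=1}^{n} P(x_i | T_t),

i.e. the posterior of tumor type T_t is proportional to its prior times the product of the per-attribute likelihoods.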

Q3.b) NAIVE BAYES CLASSIFIER


Using just the two attributes/features, namely 'concave points_worst' and 'radius_mean'

In [195]: from scipy.stats import norm

import numpy as np

import math

from math import log,pi

import matplotlib.pyplot as plt

import pandas as pd

Importing and reading the CSV file containing the training dataset

In [196]: trainy = pd.read_csv("Cancer_train.csv")

trainy.dropna(inplace=True)

trainy.drop_duplicates(inplace=True)

Computing the priors for the two tumor types: Benign "B" and Malignant "M"

In [214]: total_count = len(trainy)

prior_B = len(trainy[trainy['diagnosis'] == 'B'])/total_count

prior_M = 1-prior_B

priors = [prior_B,prior_M]

print("Prior of class B: ", prior_B)

print("Prior of class M: ", prior_M)

Prior of class B: 0.6244979919678715

Prior of class M: 0.3755020080321285
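As an optional cross-check (not part of the original cell), the same priors can be read directly from pandas:

priors_check = trainy['diagnosis'].value_counts(normalize=True)
print(priors_check)  # should match prior_B and prior_M computed above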

Using the groupby method to compute the statistics associated with each class

In [198]: df_by_tumortype = trainy.groupby('diagnosis')

Obtaining the mean and standard deviation of each attribute within each class

In [199]: mean_cpw_B, mean_cpw_M = df_by_tumortype.mean()['concave points_worst']

mean_rm_B, mean_rm_M = df_by_tumortype.mean()['radius_mean']

std_cpw_B, std_cpw_M = df_by_tumortype.std()['concave points_worst']

std_rm_B, std_rm_M = df_by_tumortype.std()['radius_mean']
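Equivalently, a single aggregation call would return both statistics for both features at once; this is only an alternative sketch, not the cell that was run:

stats = df_by_tumortype[['radius_mean', 'concave points_worst']].agg(['mean', 'std'])
print(stats)  # rows indexed by diagnosis ('B', 'M'); columns are (feature, statistic) pairs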

In [200]: testy = pd.read_csv("Cancer_test.csv")

testy.dropna(inplace=True)

testy.drop_duplicates(inplace=True)

Log-likelihood without scaling

In [201]: loggy = lambda x,mu,sigma: np.sum((-1/2)*(((x-mu)/sigma)**2 + log(2*pi)) - log(sigma))

cpw_B = lambda x: loggy(x, mean_cpw_B, std_cpw_B)

cpw_M = lambda x: loggy(x, mean_cpw_M, std_cpw_M)

rm_B = lambda x: loggy(x, mean_rm_B, std_rm_B)

rm_M = lambda x: loggy(x, mean_rm_M, std_rm_M)
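loggy is the univariate Gaussian log-density written out by hand; since scipy.stats.norm is already imported, it can be sanity-checked against norm.logpdf (a quick check, not part of the graded cells):

x0 = trainy['radius_mean'].iloc[0]
assert np.isclose(loggy(x0, mean_rm_B, std_rm_B),
                  norm.logpdf(x0, loc=mean_rm_B, scale=std_rm_B))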

In [202]: pL_B = lambda x : log(prior_B) + rm_B(x[0]) + cpw_B(x[1])

pL_M = lambda x : log(prior_M) + rm_M(x[0]) + cpw_M(x[1])

In [203]: prediction = []
          for _, row in testy.iterrows():
              x = [row['radius_mean'], row['concave points_worst']]
              if pL_B(x) > pL_M(x):
                  prediction.append('B')
              else:
                  prediction.append('M')

In [204]: accuracy = sum(prediction[i] == testy['diagnosis'].iloc[i] for i in range(len(testy)))/len(testy)

print("The accuracy obtained using two given attributes :", accuracy*100)

The accuracy obtained using two given attributes : 92.85714285714286
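For reference only, the hand-rolled classifier can be compared against scikit-learn's GaussianNB, which makes the same per-feature Gaussian assumption (this assumes scikit-learn is available; it was not used in the original notebook, and its accuracy may differ slightly because it uses maximum-likelihood variance estimates):

from sklearn.naive_bayes import GaussianNB

features = ['radius_mean', 'concave points_worst']
gnb = GaussianNB().fit(trainy[features], trainy['diagnosis'])
print("sklearn GaussianNB accuracy:", gnb.score(testy[features], testy['diagnosis'])*100)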

Log-likelihood with scaling

In [205]: loggy_std = lambda x,mu,sigma: np.sum((-1/2)*(((x-mu)/sigma)**2 + log(2*pi)))

cpw_B_scaled = lambda x: loggy_std((x-mean_cpw_B)/std_cpw_B, 0,1)

cpw_M_scaled = lambda x: loggy_std((x-mean_cpw_M)/std_cpw_M, 0,1)

rm_B_scaled = lambda x: loggy_std((x-mean_rm_B)/std_rm_B, 0,1)

rm_M_scaled = lambda x: loggy_std((x-mean_rm_M)/std_rm_M, 0,1)

In [206]: pL_B_scaled = lambda x : log(prior_B) + rm_B_scaled(x[0]) + cpw_B_scaled(x[1])

pL_M_scaled = lambda x : log(prior_M) + rm_M_scaled(x[0]) + cpw_M_scaled(x[1])

In [207]: prediction_scaled = []
          for _, row in testy.iterrows():
              x = [row['radius_mean'], row['concave points_worst']]
              if pL_B_scaled(x) > pL_M_scaled(x):
                  prediction_scaled.append('B')
              else:
                  prediction_scaled.append('M')

In [208]: accuracy_scaled = sum(prediction_scaled[i] == testy['diagnosis'].iloc[i] for i in range(len(testy)))/len(testy)

print("The accuracy obtained after scaling:", accuracy_scaled*100)

The accuracy obtained after scaling: 94.28571428571428
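The two formulations differ only by the per-class -log(sigma) terms that the scaled version drops; because the standard deviations of 'radius_mean' and 'concave points_worst' differ between 'B' and 'M', dropping these constants shifts the decision boundary, which is why the two accuracies are not identical. A quick illustrative check of this identity on one test point:

x = [testy['radius_mean'].iloc[0], testy['concave points_worst'].iloc[0]]
assert np.isclose(pL_B(x), pL_B_scaled(x) - log(std_rm_B) - log(std_cpw_B))
assert np.isclose(pL_M(x), pL_M_scaled(x) - log(std_rm_M) - log(std_cpw_M))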

Q3.c)

The priors of the two classes "B" (Benign) and "M" (Malignant) clearly show the class imbalance in the given dataset: nearly 62.45% of the training examples are labelled "B" while only 37.55% are labelled "M". Although this is not a severe imbalance, it does have a significant impact on the performance of the Naive Bayes classifier, for the following reason. The class priors (as is evident from the derivation in Q3.a) play a crucial role in the prediction task, because we assume that the observed priors represent the natural frequency at which the classes of interest occur. Priors estimated from an imbalanced dataset therefore introduce bias into the prediction: a large prior term can mask the effect of even larger P(x_i | T_t) likelihood terms, pushing the classifier towards the majority class. So when there is class imbalance, the reliability of the classifier's predictions decreases.
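
One way to see this effect in the code above is to re-run the unscaled classifier with uniform priors (0.5 for each class) and count how many test predictions change; the points that flip are exactly the borderline cases where the empirical prior was decisive. A minimal sketch reusing the likelihood lambdas defined earlier:

pL_B_uniform = lambda x: log(0.5) + rm_B(x[0]) + cpw_B(x[1])
pL_M_uniform = lambda x: log(0.5) + rm_M(x[0]) + cpw_M(x[1])
prediction_uniform = ['B' if pL_B_uniform(x) > pL_M_uniform(x) else 'M'
                      for x in testy[['radius_mean', 'concave points_worst']].values]
flipped = sum(p != q for p, q in zip(prediction_uniform, prediction))
print("Predictions changed by replacing the empirical priors with 0.5/0.5:", flipped)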
