
Q3.a) Relation between P(T_t | x_1, x_2, ..., x_n) and P(x_i | T_t) using Bayes' theorem under the assumption of conditional independence given the class label

In [223]: from IPython.display import Image

Image("b.jpg")

Out[223]:
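For readability, the relation asked for in Q3.a (which the attached derivation in b.jpg works out) is the standard Naive Bayes factorisation: applying Bayes' theorem and then conditional independence of the attributes given the class label,

P(T_t | x_1, x_2, ..., x_n) = P(T_t) * P(x_1, x_2, ..., x_n | T_t) / P(x_1, x_2, ..., x_n)
                            = P(T_t) * ∏_{i=1}^{n} P(x_i | T_t) / P(x_1, x_2, ..., x_n)
                            ∝ P(T_t) * ∏_{i=1}^{n} P(x_i | T_t),

i.e. the posterior of tumor type T_t is proportional to its prior times the product of the per-attribute likelihoods.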

Q3.b) NAIVE BAYES CLASSIFIER


Using just the two attributes/features, namely 'concave points_worst' and 'radius_mean'

In [195]: from scipy.stats import norm

import numpy as np

import math

from math import log,pi

import matplotlib.pyplot as plt

import pandas as pd

Importing and reading the CSV file containing the training dataset

In [196]: trainy = pd.read_csv("Cancer_train.csv")

trainy.dropna(inplace=True)

trainy.drop_duplicates(inplace=True)

Computing the priors for the two tumor types: Benign "B" and Malignant "M"

In [214]: total_count = len(trainy)

prior_B = len(trainy[trainy['diagnosis'] == 'B'])/total_count

prior_M = 1-prior_B

priors = [prior_B,prior_M]

print("Prior of class B: ", prior_B)

print("Prior of class M: ", prior_M)

Prior of class B: 0.6244979919678715

Prior of class M: 0.3755020080321285
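As an optional cross-check (not part of the original cell), the same priors can be read directly from pandas:

priors_check = trainy['diagnosis'].value_counts(normalize=True)
print(priors_check)  # should match prior_B and prior_M computed above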

Using the groupby method to compute the statistics associated with each class

In [198]: df_by_tumortype = trainy.groupby('diagnosis')

Obtaining the mean and standard deviation of each attribute within each class

In [199]: mean_cpw_B, mean_cpw_M = df_by_tumortype.mean()['concave points_worst']

mean_rm_B, mean_rm_M = df_by_tumortype.mean()['radius_mean']

std_cpw_B, std_cpw_M = df_by_tumortype.std()['concave points_worst']

std_rm_B, std_rm_M = df_by_tumortype.std()['radius_mean']
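Equivalently, a single aggregation call would return both statistics for both features at once; this is only an alternative sketch, not the cell that was run:

stats = df_by_tumortype[['radius_mean', 'concave points_worst']].agg(['mean', 'std'])
print(stats)  # rows indexed by diagnosis ('B', 'M'); columns are (feature, statistic) pairs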

In [200]: testy = pd.read_csv("Cancer_test.csv")

testy.dropna(inplace=True)

testy.drop_duplicates(inplace=True)

Log-likelihood without scaling

In [201]: loggy = lambda x,mu,sigma: np.sum((-1/2)*(((x-mu)/sigma)**2 + log(2*pi)) - log(sigma))

cpw_B = lambda x: loggy(x, mean_cpw_B, std_cpw_B)

cpw_M = lambda x: loggy(x, mean_cpw_M, std_cpw_M)

rm_B = lambda x: loggy(x, mean_rm_B, std_rm_B)

rm_M = lambda x: loggy(x, mean_rm_M, std_rm_M)
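loggy is the univariate Gaussian log-density written out by hand; since scipy.stats.norm is already imported, it can be sanity-checked against norm.logpdf (a quick check, not part of the graded cells):

x0 = trainy['radius_mean'].iloc[0]
assert np.isclose(loggy(x0, mean_rm_B, std_rm_B),
                  norm.logpdf(x0, loc=mean_rm_B, scale=std_rm_B))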

In [202]: pL_B = lambda x : log(prior_B) + rm_B(x[0]) + cpw_B(x[1])

pL_M = lambda x : log(prior_M) + rm_M(x[0]) + cpw_M(x[1])

In [203]: prediction = []
          for _, row in testy.iterrows():
              x = [row['radius_mean'], row['concave points_worst']]
              if pL_B(x) > pL_M(x):
                  prediction.append('B')
              else:
                  prediction.append('M')

In [204]: accuracy = sum(prediction[i] == testy['diagnosis'].iloc[i] for i in range(len(testy)))/len(testy)

print("The accuracy obtained using two given attributes :", accuracy*100)

The accuracy obtained using two given attributes : 92.85714285714286
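For reference only, the hand-rolled classifier can be compared against scikit-learn's GaussianNB, which makes the same per-feature Gaussian assumption (this assumes scikit-learn is available; it was not used in the original notebook, and its accuracy may differ slightly because it uses maximum-likelihood variance estimates):

from sklearn.naive_bayes import GaussianNB

features = ['radius_mean', 'concave points_worst']
gnb = GaussianNB().fit(trainy[features], trainy['diagnosis'])
print("sklearn GaussianNB accuracy:", gnb.score(testy[features], testy['diagnosis'])*100)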

Log-likelihood with scaling

In [205]: loggy_std = lambda x,mu,sigma: np.sum((-1/2)*(((x-mu)/sigma)**2 + log(2*pi)))

cpw_B_scaled = lambda x: loggy_std((x-mean_cpw_B)/std_cpw_B, 0,1)

cpw_M_scaled = lambda x: loggy_std((x-mean_cpw_M)/std_cpw_M, 0,1)

rm_B_scaled = lambda x: loggy_std((x-mean_rm_B)/std_rm_B, 0,1)

rm_M_scaled = lambda x: loggy_std((x-mean_rm_M)/std_rm_M, 0,1)

In [206]: pL_B_scaled = lambda x : log(prior_B) + rm_B_scaled(x[0]) + cpw_B_scaled(x[1])

pL_M_scaled = lambda x : log(prior_M) + rm_M_scaled(x[0]) + cpw_M_scaled(x[1])

In [207]: prediction_scaled = []
          for _, row in testy.iterrows():
              x = [row['radius_mean'], row['concave points_worst']]
              if pL_B_scaled(x) > pL_M_scaled(x):
                  prediction_scaled.append('B')
              else:
                  prediction_scaled.append('M')

In [208]: accuracy_scaled = sum(prediction_scaled[i] == testy['diagnosis'].iloc[i] for i in range(len(testy)))/len(testy)

print("The accuracy obtained after scaling:", accuracy_scaled*100)

The accuracy obtained after scaling: 94.28571428571428
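The two formulations differ only by the per-class -log(sigma) terms that the scaled version drops; because the standard deviations of 'radius_mean' and 'concave points_worst' differ between 'B' and 'M', dropping these constants shifts the decision boundary, which is why the two accuracies are not identical. A quick illustrative check of this identity on one test point:

x = [testy['radius_mean'].iloc[0], testy['concave points_worst'].iloc[0]]
assert np.isclose(pL_B(x), pL_B_scaled(x) - log(std_rm_B) - log(std_cpw_B))
assert np.isclose(pL_M(x), pL_M_scaled(x) - log(std_rm_M) - log(std_cpw_M))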

Q3.c)

The priors of the two classes "B" (Benign) and "M" (Malignant) clearly show the class imbalance in the given dataset: nearly 62.45% of the training examples are labelled "B" while only 37.55% are labelled "M". Although this is not a severe imbalance, it does have a significant impact on the performance of the Naive Bayes classifier, for the following reason. The class priors (as is evident from the derivation in Q3.a) play a crucial role in the prediction task, because we assume that the observed priors represent the natural frequency at which the classes of interest occur. Priors estimated from an imbalanced dataset therefore introduce bias into the prediction: a large prior term can mask the effect of even larger P(x_i | T_t) likelihood terms, pushing the classifier towards the majority class. So when there is class imbalance, the reliability of the classifier's predictions decreases.
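
One way to see this effect in the code above is to re-run the unscaled classifier with uniform priors (0.5 for each class) and count how many test predictions change; the points that flip are exactly the borderline cases where the empirical prior was decisive. A minimal sketch reusing the likelihood lambdas defined earlier:

pL_B_uniform = lambda x: log(0.5) + rm_B(x[0]) + cpw_B(x[1])
pL_M_uniform = lambda x: log(0.5) + rm_M(x[0]) + cpw_M(x[1])
prediction_uniform = ['B' if pL_B_uniform(x) > pL_M_uniform(x) else 'M'
                      for x in testy[['radius_mean', 'concave points_worst']].values]
flipped = sum(p != q for p, q in zip(prediction_uniform, prediction))
print("Predictions changed by replacing the empirical priors with 0.5/0.5:", flipped)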
