
Artificial Intelligence for Business

Naïve Bayes Classifier

Submitted by:

Paras Nath Munda

PGP/25/160

Naïve Bayes Classifier - Predict whether a person earns more than $50,000 per year.

● A Naïve Bayes classifier is used in a Jupyter Notebook to predict whether a person
earns more than $50,000 per year.
● Imported pandas for data analysis, NumPy for numerics, and matplotlib for
visualization.
● pd.read_csv reads the data from the CSV file, and head() shows the first rows.
● There are 32561 instances and 15 attributes in the data set.

● The number of rows and columns is displayed.
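
The loading step might look like the following minimal sketch. It assumes the file has no header row, consistent with the columns being labelled 0, 1, 2, …; the in-memory sample with made-up rows stands in for the real CSV path, which the report does not give.

```python
import io

import pandas as pd

# Made-up rows in a 15-column, header-less layout. In the notebook,
# pd.read_csv would be given the path of the actual CSV file instead.
sample = io.StringIO(
    "25, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "47, ?, 150000, Masters, 14, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 50, United-States, >50K\n"
    "33, State-gov, 120000, HS-grad, 9, Divorced, Tech-support, "
    "Unmarried, Black, Female, 0, 0, 38, ?, <=50K\n"
)

# header=None keeps the first row as data and labels the columns 0..14;
# skipinitialspace=True drops the blank that follows each comma.
df = pd.read_csv(sample, header=None, skipinitialspace=True)

print(df.head())   # first rows of the data frame
print(df.shape)    # (number of rows, number of columns)
```

On the full data set described here, `df.shape` would report 32561 rows and 15 columns.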


● Since the attributes were labelled with numbers, that is, 0, 1, 2, …, they were renamed
to descriptive names such as “age”, “workclass”, etc.
● The code was run to find the categorical variables, which are fundamental to the
classifier. It identified 9 categorical variables, and the data frame was viewed for these.
● Checked for missing values by looking for nulls.
● Examined the frequency of values in each categorical variable, as integer counts and as
floating-point proportions, to identify the missing values.
● Since missing values in the data set were coded as ? rather than NaN, each ? was replaced
with NaN so that Python could identify the missing values in each categorical variable.
● Data was split into training and test sets. The test set is 30% of the data, while the
training set is 70%.
● After exploring the missing values, the data was further cleaned.
● Missing values were filled in with replacement values in order to
remove the null values from the set.
● Data was encoded in numerical format. Initially we had 14 columns, but after encoding
we have 113 columns.
● The data set is fed to the Naïve Bayes Classifier to train the model. The type utilized
here is the Gaussian classifier.
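
The preparation and training steps above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual code: the toy data is made up, `pd.get_dummies` stands in for whichever encoder the notebook used, missing categories are filled with the most frequent training value (the report does not say which replacement values were chosen), and scikit-learn's `GaussianNB` is assumed to be the Gaussian classifier referred to.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Made-up toy data with numbered columns; values are illustrative only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    0: rng.integers(18, 70, size=n),
    1: rng.choice(["Private", "State-gov", "?"], size=n),
    2: rng.choice(["Sales", "Tech-support", "?"], size=n),
    3: rng.integers(10, 60, size=n),
    4: rng.choice(["<=50K", ">50K"], size=n),
})

# Rename the numbered columns to descriptive names.
df.columns = ["age", "workclass", "occupation", "hours_per_week", "income"]

# Recode the "?" placeholder as NaN so pandas counts it as missing.
df = df.replace("?", np.nan)
print(df.isnull().sum())

X = df.drop(columns="income")
y = df["income"]

# Categorical variables are the non-numeric feature columns.
categorical = X.select_dtypes(exclude="number").columns.tolist()

# 70% training data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# Assumed imputation rule: fill missing categories with the most frequent
# value seen in the training set, applied to both splits.
for col in categorical:
    most_frequent = X_train[col].mode()[0]
    X_train[col] = X_train[col].fillna(most_frequent)
    X_test[col] = X_test[col].fillna(most_frequent)

# One-hot encode the categorical columns; the test set is aligned to the
# training columns so both have the same layout.
X_train_enc = pd.get_dummies(X_train, columns=categorical, dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=categorical, dtype=float).reindex(
    columns=X_train_enc.columns, fill_value=0.0
)
print(X_train.shape[1], "columns before encoding,",
      X_train_enc.shape[1], "after")

# Train the Gaussian Naive Bayes model and predict on the test set.
model = GaussianNB()
model.fit(X_train_enc, y_train)
y_pred = model.predict(X_test_enc)
```

Because the toy labels are random, the predictions here carry no meaning; the sketch only shows the order of the steps and why one-hot encoding increases the column count.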

● The predicted results are compared with the actual labels to measure the accuracy of the model.


● The classifier's predictions can be used to calculate the accuracy score and the total
True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
● The same counts can be visualised as a confusion matrix with a seaborn
heatmap.
● The values of the performance parameters, that is, accuracy, classification error, precision,
and recall (sensitivity), are also calculated using their formulas.
● The ROC curve (True Positive Rate vs False Positive Rate) generated from the model
shows that the AUC is greater than 0.7, which indicates that the model performs well.
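
The evaluation measures listed above can be sketched as follows. The labels and predicted probabilities are made up for illustration (1 stands for earning more than $50,000, 0 otherwise), the 0.5 decision threshold is an assumption, and the scikit-learn helpers shown are one common way to obtain these numbers rather than necessarily the calls used in the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.8, 0.45, 0.9])

# Assumed threshold: predict class 1 when the probability is at least 0.5.
y_pred = (y_prob >= 0.5).astype(int)

# For labels [0, 1] the matrix is laid out as [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
classification_error = (fp + fn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity

# Points of the ROC curve (FPR on the x axis, TPR on the y axis) and the
# area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy)
print("classification error:", classification_error)
print("precision:", precision)
print("recall:", recall)
print("AUC:", auc)
```

The `cm` array can be passed to `seaborn.heatmap` for the visual form, and `fpr` and `tpr` can be plotted with matplotlib to draw the curve.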
