
Program 06.

Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write the
program. Calculate the accuracy, precision, and recall for your data set.

Explanation:

For the theory of the naive Bayesian classifier, refer to Experiment No. 5. The theory of
performance analysis is elaborated here.

Analysis of Document Classification

For classification tasks, the terms true positives, true negatives, false positives, and false negatives
compare the results of the classifier under test with trusted external judgments. The terms positive
and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms
true and false refer to whether that prediction corresponds to the external judgment (sometimes
known as the observation).

• Precision - Precision is the ratio of correctly predicted positive documents to the total predicted
positive documents. High precision means few false positives among the predicted positives.
Precision = (Σ True positive ) / ( Σ True positive + Σ False positive)

• Recall (Sensitivity) - Recall is the ratio of correctly predicted positive documents to all
documents that actually belong to the positive class.
Recall = (Σ True positive ) / ( Σ True positive + Σ False negative)

• Accuracy - Accuracy is the most intuitive performance measure: it is the ratio of correctly
predicted observations to the total number of observations. A high accuracy does not by itself
mean that the model is the best one. Accuracy is a reliable measure only for symmetric datasets,
where the numbers of false positives and false negatives are almost the same. Otherwise, other
measures such as precision and recall must also be considered when evaluating the model. For
example, an accuracy of 0.803 means that the model classifies approximately 80% of the
observations correctly.
Accuracy = (Σ True positive + Σ True negative) / Σ Total population
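
As a minimal sketch, the three formulas above can be written as one small Python helper. The function name classification_metrics and the counts passed to it are illustrative assumptions and are not part of the lab program.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) computed from confusion counts."""
    # Assumption: report 0.0 when a denominator is zero instead of raising an error
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts, chosen only for illustration
precision, recall, accuracy = classification_metrics(tp=1, fp=1, fn=2, tn=1)
print(round(precision, 2), round(recall, 2), round(accuracy, 2))  # prints 0.5 0.33 0.4
```

Note that precision and recall differ only in the second term of the denominator: false positives for precision, false negatives for recall.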

Scikit-learn (sklearn) is a widely used Python library for machine learning. It contains
efficient tools for machine learning and statistical modeling, including classification,
regression, clustering and dimensionality reduction.

Please note that sklearn is intended for building machine learning models. For reading,
manipulating and summarizing data, libraries such as NumPy and pandas are better suited.

● sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets.

Parameters
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

Returns
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
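
The following sketch illustrates the call on placeholder data. When no test_size is given, train_test_split holds out 25% of the samples, which for 18 instances gives the 13/5 split seen in the outputs of this program. The placeholder documents, the labels and random_state=42 are assumptions made for the illustration; the lab program itself does not fix random_state, which is why its two runs produce different splits.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 18 dummy documents with alternating 0/1 labels
docs = [f"document {i}" for i in range(18)]
labels = [i % 2 for i in range(18)]

# random_state fixes the shuffle so that repeated runs give the same split
x_train, x_test, y_train, y_test = train_test_split(docs, labels, random_state=42)

print(len(x_train), len(x_test))  # prints 13 5
```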

● sklearn.feature_extraction.text.CountVectorizer
Convert a collection of text documents to a matrix of token counts
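
A small sketch of what CountVectorizer produces, using two sentences borrowed from the data set purely as an illustration. With the default settings the text is lowercased and tokens shorter than two characters (such as "I") are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love this sandwich", "This is an amazing place"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index (columns are in sorted token order)
print(sorted(vectorizer.vocabulary_))
# One row per document, one column per token
print(dtm.toarray())
```

Each row of the matrix is the count vector of one document; these vectors are the input to the classifier.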

● sklearn.naive_bayes.MultinomialNB
class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
Naive Bayes classifier for multinomial models

The multinomial Naive Bayes classifier is suitable for classification with discrete
features (e.g., word counts for text classification). The multinomial distribution
normally requires integer feature counts. However, in practice, fractional counts such
as tf-idf may also work.


fit(X, y[, sample_weight])    Fit Naive Bayes classifier according to X, y
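
As a hedged illustration of fit and predict, the sketch below trains MultinomialNB on a toy count matrix. The two columns and their meaning (counts of a "good" word and a "bad" word) are assumptions invented for this example.

```python
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: column 0 = count of a "good" word, column 1 = count of a "bad" word
X_train = [[2, 0], [3, 0], [0, 2], [0, 3]]
y_train = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace (add-one) smoothing
clf.fit(X_train, y_train)

# A document with one "good" word, and a document with one "bad" word
predictions = clf.predict([[1, 0], [0, 1]])
print(predictions)
```

The first test document is assigned to class 1 and the second to class 0, because each word was seen only in documents of the corresponding class.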

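● sklearn.metrics.confusion_matrix
Compute confusion matrix to evaluate the accuracy of a classification.
By definition, a confusion matrix C is such that C[i,j] is the number of observations known to be
in group i and predicted to be in group j. Thus in binary classification, the count of true
negatives is C[0,0], false negatives is C[1,0], true positives is C[1,1] and false positives
is C[0,1].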
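
To make the row/column convention concrete, the sketch below builds the same 2x2 layout by hand in plain Python. The helper confusion_counts is a hypothetical illustration of the convention, not the sklearn implementation, and the label lists are made-up examples.

```python
def confusion_counts(y_true, y_pred):
    """Build a 2x2 matrix C where C[i][j] counts samples of true class i predicted as class j."""
    C = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        C[actual][predicted] += 1
    return C


# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 0]

C = confusion_counts(y_true, y_pred)
tn, fp = C[0]  # row 0: actual negatives
fn, tp = C[1]  # row 1: actual positives

print(C)               # prints [[1, 1], [2, 1]]
print(tn, fp, fn, tp)  # prints 1 1 2 1
```

Rows correspond to the actual class and columns to the predicted class, so the off-diagonal entries are the errors.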

PROCEDURE / PROGRAMME :

import pandas as pd
msg=pd.read_csv('data6.csv',names=['message','label'])

print('\n Total instances in the dataset: ',msg.shape[0])

msg['labelnum']=msg.label.map({'pos':1,'neg':0})
x=msg.message
y=msg.labelnum

from sklearn.model_selection import train_test_split

#Splitting the dataset into train and test data randomly


xtrain,xtest,ytrain,ytest = train_test_split(x,y)

print('\n Dataset is Split into Training and Testing Samples')


print('\n Training Instances: ',xtrain.shape[0])
print(xtrain)
print('\n Testing Instances :',xtest.shape[0])
print(xtest)

# CountVectorizer performs feature extraction: it converts each document
# into a vector of token counts. Its output is a sparse document-term matrix.

from sklearn.feature_extraction.text import CountVectorizer


count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)
print('\n Total features extracted using CountVectorizer: ',xtrain_dtm.shape[1])

print('\n Features for first 5 training instances are listed below')


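# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
df = pd.DataFrame(xtrain_dtm.toarray(),columns=count_vect.get_feature_names_out())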
print(df[0:5] ) #tabular representation

# Training Naive Bayes (NB) classifier on training data


from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB().fit(xtrain_dtm,ytrain)
predicted=clf.predict(xtest_dtm)

print('\nClassification Results of Test Dataset are:\n')


for doc, p in zip(xtest,predicted):
    pred = 'pos' if p==1 else 'neg'
    print('%s --> %s '%(doc,pred))

#printing accuracy metrics


from sklearn import metrics
print('\nAccuracy of the classifier is',metrics.accuracy_score(ytest,predicted))
print('\nConfusion Matrix')
print(metrics.confusion_matrix(ytest,predicted))
print('\nRecall and Precision')
print(metrics.recall_score(ytest,predicted))
print(metrics.precision_score(ytest,predicted))

Output 1:

C:\Users\admin\AppData\Local\Programs\Python\Python37\python.exe
C:/Users/admin/Downloads/pgm6.py

Total instances in the dataset: 18

Dataset is Split into Training and Testing Samples

Training Instances: 13
10 This is an awesome place
13 I am sick and tired of this place
15 That is a bad locality to stay
5 I do not like this restaurant
16 We will have good fun tomorrow
11 I do not like the taste of this juice
9 My boss is horrible
14 What a great holiday
12 I love to dance
8 He is my sworn enemy
17 I went to my enemey's house today
2 I feel very good about these places
3 This is my best work
Name: message, dtype: object

Testing Instances : 5
6 I am tired of this stuff
1 This is an amazing place
7 I can't deal with this
4 What an awesome view
0 I love this sandwich
Name: message, dtype: object

Total features extracted using CountVectorizer: 50

Features for first 5 training instances are listed below


about am an and awesome bad ... very we went what will work
0 0 0 1 0 1 0 ... 0 0 0 0 0 0
1 0 1 0 1 0 0 ... 0 0 0 0 0 0
2 0 0 0 0 0 1 ... 0 0 0 0 0 0
3 0 0 0 0 0 0 ... 0 0 0 0 0 0
4 0 0 0 0 0 0 ... 0 1 0 0 1 0

[5 rows x 50 columns]

Classification Results of Test Dataset are:

I am tired of this stuff --> neg
This is an amazing place --> pos
I can't deal with this --> neg
What an awesome view --> pos
I love this sandwich --> pos

Accuracy of the classifier is 1.0

Confusion Matrix
[[2 0]
[0 3]]

Recall and Precision


1.0
1.0

Process finished with exit code 0


Output 2:

C:\Users\admin\AppData\Local\Programs\Python\Python37\python.exe
C:/Users/admin/Downloads/pgm6.py

Total instances in the dataset: 18

Dataset is Split into Training and Testing Samples

Training Instances: 13
8 He is my sworn enemy
13 I am sick and tired of this place
1 This is an amazing place
12 I love to dance
15 That is a bad locality to stay
17 I went to my enemey's house today
4 What an awesome view
3 This is my best work
0 I love this sandwich
11 I do not like the taste of this juice
10 This is an awesome place
6 I am tired of this stuff
7 I can't deal with this
Name: message, dtype: object
Testing Instances : 5
2 I feel very good about these places
14 What a great holiday
9 My boss is horrible
5 I do not like this restaurant
16 We will have good fun tomorrow
Name: message, dtype: object

Total features extracted using CountVectorizer: 41

Features for first 5 training instances are listed below


am amazing an and awesome bad ... today view went what with work
0 0 0 0 0 0 0 ... 0 0 0 0 0 0
1 1 0 0 1 0 0 ... 0 0 0 0 0 0
2 0 1 1 0 0 0 ... 0 0 0 0 0 0
3 0 0 0 0 0 0 ... 0 0 0 0 0 0
4 0 0 0 0 0 1 ... 0 0 0 0 0 0

[5 rows x 41 columns]

Classification Results of Test Dataset are:

I feel very good about these places --> neg --------false negative
What a great holiday --> pos -------------------------true positive
My boss is horrible --> pos ------------------------------false positive
I do not like this restaurant --> neg ---------------------true negative
We will have good fun tomorrow --> neg ------------false negative
------------------------------------------------------------------Precision = TP/(TP+FP) = 1/(1+1) = 1/2 = 0.5
------------------------------------------------------------------Recall = TP/(TP+FN) = 1/(1+2) = 1/3 = 0.33
Accuracy of the classifier is 0.4

Confusion Matrix
[[1 1]
[2 1]]

Recall and Precision


0.3333333333333333
0.5

Process finished with exit code 0
