CS695A
Sayan Maity
CSE 3B
Roll-05
12017009001193
Shamik Basu
CSE 3B
Roll-06
12017009001193
ACKNOWLEDGEMENT
The success and final outcome of this project required a great deal of guidance
and assistance from many people, and we are extremely privileged to
have received it throughout the completion of our project. All that we
have done is owed to such supervision and assistance, and we sincerely
thank them.
Sayan Maity
Shamik Basu
TITLE
COMPARATIVE STUDY OF THREE
DIFFERENT CLASSIFIERS USED
IN PREDICTING LIVER DISEASE
IN PATIENTS
Introduction
Terminologies-
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Supervised Learning
In supervised learning, the goal is to learn the mapping (the rules)
between a set of inputs and outputs.
For example, the inputs could be the daily temperature, and the outputs
could be the number of visitors to the beach. The goal in supervised
learning would be to learn the mapping that describes the relationship
between temperature and the number of beach visitors.
Classification
It predicts discrete responses. Classification models are trained to
classify data into categories.
Regression
It predicts continuous responses. The difference between classification
and regression is that regression outputs a number rather than a class.
Therefore, regression is useful for number-based problems such as stock
market prices, the temperature for a given day, or the probability of an
event.
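To make the distinction concrete, here is a minimal sketch contrasting the two, using scikit-learn and hypothetical toy values for the beach example above:
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
#Hypothetical toy data: daily temperature (input) vs. beach attendance
temps = np.array([[15], [20], [25], [30], [35]])
visitors = np.array([120, 340, 510, 780, 950])   #continuous target (regression)
crowded = np.array([0, 0, 1, 1, 1])              #discrete target (classification)
reg = LinearRegression().fit(temps, visitors)
clf = LogisticRegression().fit(temps, crowded)
print(reg.predict([[28]]))   #outputs a number (an estimated visitor count)
print(clf.predict([[28]]))   #outputs a class label (0 or 1)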
Unsupervised Learning
In unsupervised learning, only input data is provided in the examples.
There are no labelled example outputs to aim for. But it may be
surprising to know that it is still possible to find many interesting and
complex patterns hidden within data without any labels.
Clustering
Unsupervised learning is mostly used for clustering. Clustering is the act
of creating groups with differing characteristics. Clustering attempts
to find various subgroups within a dataset. As this is unsupervised
learning, we are not restricted to any set of labels and are free to choose
how many clusters to create. This is both a blessing and a curse. Picking
a model that has the correct number of clusters (complexity) has to be
conducted via an empirical model selection process.
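As an illustration, here is a minimal k-means clustering sketch with scikit-learn; the data is hypothetical, and the choice of two clusters stands in for the empirical model selection mentioned above:
import numpy as np
from sklearn.cluster import KMeans
#Hypothetical unlabelled data: two features per sample
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
#n_clusters is our free choice; in practice it is tuned empirically
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           #cluster assignment for each sample
print(kmeans.cluster_centers_)  #learned cluster centres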
Reinforcement Learning
The final type of machine learning is less common and much more
complex, but it has generated incredible results. It does not use labels as
such; instead, it uses rewards to learn.
Dataset: https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)
Solution-
Since this is a classification problem, we ought to use classifiers. The
most common classifiers available to us through the scikit-learn package
are listed below. Since a single classifier may not reach the required
target on its own, we use three of them.
Classifiers used-
1) Support Vector Machine
2) K Nearest Neighbour
3) Random Forest
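For reference, the three classifiers can be imported and instantiated from scikit-learn as below; n_neighbors matches the value used later in the source code, while the other hyperparameters shown are assumptions:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
svc_clf = SVC()                                   #Support Vector Machine
knn = KNeighborsClassifier(n_neighbors = 5)       #K Nearest Neighbour
rfc = RandomForestClassifier(n_estimators = 100)  #Random Forest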
The first task is to import the dataset into the program; for that we use
the pandas package.
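A minimal loading sketch is shown below; the CSV file name is an assumption, and since the ILPD file ships without a header row, the column names are supplied manually:
import pandas as pd
features = ["Age", "Gender", "Total Bilirubin", "Direct Bilirubin",
            "Alkphos Alkaline Phosphotase", "Sgpt Alamine Aminotransferase",
            "Sgot Aspartate Aminotransferase", "Total Protiens", "Albumin",
            "Albumin-Globulin Ratio", "Selector"]
#File name is an assumption; adjust it to your local copy of the dataset
data = pd.read_csv('Indian Liver Patient Dataset (ILPD).csv', names=features)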
The challenging part is to find and remove the gaps and redundancies.
miss = data.isnull().sum()/len(data)
miss.sort_values(inplace=True)
But here, since a very small fraction of values are missing, we choose to
replace them with the column mean.
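A one-line sketch of this mean imputation, assuming the data frame loaded above:
#Fill the few missing values with each numeric column's mean
data = data.fillna(data.mean(numeric_only=True))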
Scaling
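Distance-based classifiers such as KNN and SVM are sensitive to features with very different numeric ranges, so we also prepare a scaled copy of the features. A minimal sketch with scikit-learn's MinMaxScaler, assuming the train/test split used in the source code; the scaler is fitted on the training set only and then applied to the test set:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  #fit on training data only
X_test_scaled = scaler.transform(X_test)        #reuse the fitted scaler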
Next, we plot the data based on age. Looking at the age vs. frequency
graph, we can observe that middle-aged people are the worst affected.
Elderly people also suffer from liver ailments, as seen from the bar sizes
at ages 60-80.
3) Random Forests
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks that
operate by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes (classification) or
mean prediction (regression) of the individual trees. Random decision
forests correct for decision trees' habit of overfitting to their training set.
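A minimal sketch of this idea, using hypothetical synthetic data in place of the real training set:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
#Hypothetical synthetic data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
#An ensemble of 100 trees; each tree is trained on a bootstrap sample of the rows
rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
#The forest's prediction is the mode of the individual trees' predictions
print(rfc.predict(X[:3]))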
Accuracy=(TP+TN)/(TP+TN+FP+FN)
Precision=TP/(TP+FP)
Recall=TP/(TP+FN)
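Here TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives. Since the comparison below also reports the F1 score, its formula completes the set:
F1 Score=2*(Precision*Recall)/(Precision+Recall)
A minimal sketch computing all four metrics from a confusion matrix with scikit-learn, using hypothetical label vectors (in this dataset, Selector 1 marks a liver patient and 2 a non-patient):
from sklearn.metrics import confusion_matrix
y_true = [1, 1, 2, 2, 1, 2]   #hypothetical ground truth
y_pred = [1, 2, 2, 2, 1, 1]   #hypothetical predictions
#With labels ordered [1, 2], rows are truth and columns are prediction
tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, 2]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)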
Source Code
#Import required libraries
#(only pandas and numpy were imported in the original listing; the rest are required by the code below)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
#File does not contain headers so we need to load the headers manually
features = ["Age", "Gender", "Total Bilirubin", "Direct Bilirubin",
            "Alkphos Alkaline Phosphotase", "Sgpt Alamine Aminotransferase",
            "Sgot Aspartate Aminotransferase", "Total Protiens", "Albumin",
            "Albumin-Globulin Ratio", "Selector"]
#Load the dataset; the CSV file name is an assumption, adjust it to your local copy
data = pd.read_csv('Indian Liver Patient Dataset (ILPD).csv', names=features)
data.head()
#Overview of data
data.info()
#Fraction of missing values in each column
miss = data.isnull().sum()/len(data)
miss = miss[miss > 0]
miss.sort_values(inplace=True)
print("MISSING VALUES=", miss)
#Replace the few missing values with the column mean, as discussed above
data = data.fillna(data.mean(numeric_only=True))
#Encode Gender as a numeric feature
le = preprocessing.LabelEncoder()
le.fit(['Male','Female'])
data.loc[:,'Gender'] = le.transform(data['Gender'])
#Removing duplicates; drop_duplicates() is assumed, as the original listing showed only this comment
data = data.drop_duplicates()
#Overview of data
data.info()
data.head()
data.describe()
#Split into features and target, then into train and test sets
#(the 70/30 split and random_state are assumptions; the original split was not shown)
X = data.drop('Selector', axis=1)
y = data['Selector']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#Scale features to the [0, 1] range; fit on the training set only, then transform the test set
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
#Age distribution of affected patients (Selector == 1)
plt.figure(figsize=(12, 10))
plt.hist(data[data['Selector'] == 1]['Age'], bins = 16, align = 'mid', rwidth = 0.5, color = 'black', alpha =
0.8)
plt.xlabel('Age')
plt.ylabel('Number of Patients')
plt.title('Frequency-Age Distribution')
plt.grid(True)
plt.savefig('fig1')
plt.show()
#correlation-matrix (the heatmap call was missing; reconstructed with matplotlib's imshow)
corr = data.corr()
plt.subplots(figsize=(12, 10))
plt.imshow(corr, cmap='coolwarm')
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.savefig('fig2')
plt.show()
#KNN
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
knn_scaled = KNeighborsClassifier(n_neighbors = 5)
knn_scaled.fit(X_train_scaled, y_train)
#SVM (the SVC definitions were missing from the listing; default hyperparameters are an assumption)
svc_clf = SVC()
svc_clf.fit(X_train, y_train)
svc_clf_scaled = SVC()
svc_clf_scaled.fit(X_train_scaled, y_train)
#Random Forest (n_estimators = 100 is an assumption)
rfc = RandomForestClassifier(n_estimators = 100)
rfc.fit(X_train, y_train)
rfc_scaled = RandomForestClassifier(n_estimators = 100)
rfc_scaled.fit(X_train_scaled, y_train)
#Metric comparison on unscaled data
barWidth = 0.25
knn_pred = knn.predict(X_test)
svc_pred = svc_clf.predict(X_test)
rfc_pred = rfc.predict(X_test)
bars1 = [accuracy_score(y_test, knn_pred), precision_score(y_test, knn_pred),
         recall_score(y_test, knn_pred), f1_score(y_test, knn_pred)]
bars2 = [accuracy_score(y_test, svc_pred), precision_score(y_test, svc_pred),
         recall_score(y_test, svc_pred), f1_score(y_test, svc_pred)]
bars3 = [accuracy_score(y_test, rfc_pred), precision_score(y_test, rfc_pred),
         recall_score(y_test, rfc_pred), f1_score(y_test, rfc_pred)]
#Grouped bars; the plt.bar calls were missing here and are reconstructed
#(colours for SVM and Random Forest are assumptions)
r1 = np.arange(len(bars1))
r2 = r1 + barWidth
r3 = r2 + barWidth
plt.bar(r1, bars1, color='#AA0505', width=barWidth, edgecolor='white', label='KNN')
plt.bar(r2, bars2, color='#0504AA', width=barWidth, edgecolor='white', label='SVM')
plt.bar(r3, bars3, color='#05AA05', width=barWidth, edgecolor='white', label='Random Forest')
plt.xticks(r2, ['Accuracy', 'Precision', 'Recall', 'F1'])
plt.legend()
plt.show()
#Metric comparison on scaled data
barWidth = 0.25
knn_pred_s = knn_scaled.predict(X_test_scaled)
svc_pred_s = svc_clf_scaled.predict(X_test_scaled)
rfc_pred_s = rfc_scaled.predict(X_test_scaled)
bars1 = [accuracy_score(y_test, knn_pred_s), precision_score(y_test, knn_pred_s),
         recall_score(y_test, knn_pred_s), f1_score(y_test, knn_pred_s)]
bars2 = [accuracy_score(y_test, svc_pred_s), precision_score(y_test, svc_pred_s),
         recall_score(y_test, svc_pred_s), f1_score(y_test, svc_pred_s)]
bars3 = [accuracy_score(y_test, rfc_pred_s), precision_score(y_test, rfc_pred_s),
         recall_score(y_test, rfc_pred_s), f1_score(y_test, rfc_pred_s)]
r1 = np.arange(len(bars1))
r2 = r1 + barWidth
r3 = r2 + barWidth
plt.ylim(.4577, 1)
plt.bar(r1, bars1, color='#AA0505', width=barWidth, edgecolor='white', label='KNN')
#Bars for SVM and Random Forest were missing in the original; colours are assumptions
plt.bar(r2, bars2, color='#0504AA', width=barWidth, edgecolor='white', label='SVM')
plt.bar(r3, bars3, color='#05AA05', width=barWidth, edgecolor='white', label='Random Forest')
plt.xticks(r2, ['Accuracy', 'Precision', 'Recall', 'F1'])
plt.legend()
plt.show()
Output
(Output screenshot: data header, i.e. the first rows of the dataset shown by data.head())