
Data Science and Data Analytics Lab

CS695A
Sayan Maity
CSE 3B
Roll-05
12017009001193

Shamik Basu
CSE 3B
Roll-06
12017009001193
INDEX

Sno Title


1 Acknowledgement
2 Title
3 Introduction
4 Specification
5 Solution
6 Source Code
7 Output
8 Conclusion
ACKNOWLEDGEMENT

The success and final outcome of this project required a lot of guidance
and assistance from many people, and we are extremely privileged to
have received it throughout the completion of our project. All that we
have done is only due to such supervision and assistance, and we would
not forget to thank them.

We respect and thank Prof. Moumita Basu for providing us the
opportunity to do this project and for giving us all the support and
guidance that enabled us to complete it duly. We are extremely thankful
to her for providing such kind support and guidance.

We owe our deep gratitude to our project guide, Prof. Shankhadeep
Chatterjee, who took keen interest in our project work and guided us all
along, till the completion of our project work, by providing all the
necessary information for developing a good system.

We are thankful to, and fortunate enough to have received, constant
encouragement, support and guidance from all the teaching staff of CSE,
which helped us in successfully completing our project work.

Sayan Maity

Shamik Basu
TITLE

COMPARATIVE STUDY OF 3
DIFFERENT CLASSIFIERS USED
IN PREDICTING LIVER
PATIENTS
Introduction

Machine Learning is the field of study that gives computers the
capability to learn without being explicitly programmed. ML is one of the
most exciting technologies one could come across. As is evident from
the name, it gives the computer the quality that makes it more similar to
humans: the ability to learn. Machine learning is actively being used
today, perhaps in many more places than one would expect.

Machine learning is a tool for turning information into knowledge. In the
past 50 years, there has been an explosion of data. This mass of data is
useless unless we analyse it and find the patterns hidden within. Machine
learning techniques are used to automatically find the valuable
underlying patterns within complex data that we would otherwise struggle
to discover. The hidden patterns and knowledge about a problem can be
used to predict future events and perform all kinds of complex decision
making.

There are multiple forms of Machine Learning: supervised, unsupervised,
semi-supervised and reinforcement learning. Each form of Machine
Learning has a differing approach, but they all follow the same
underlying process and theory.

Terminologies-

 Dataset: A set of data examples that contain features important to
solving the problem.

 Features: Important pieces of data that help us understand a
problem. These are fed into a Machine Learning algorithm to help it
learn.
 Model: The representation (internal model) of a phenomenon that
a Machine Learning algorithm has learnt. It learns this from the data
it is shown during training. The model is the output you get after
training an algorithm. For example, a decision tree algorithm would
be trained and produce a decision tree model.

Machine Learning Approaches

 Supervised Learning

 Unsupervised Learning

 Semi-supervised Learning

 Reinforcement Learning
Supervised Learning
In supervised learning, the goal is to learn the mapping (the rules)
between a set of inputs and outputs.

For example, the inputs could be the weather forecast, and the outputs
would be the visitors to the beach. The goal in supervised learning would
be to learn the mapping that describes the relationship between
temperature and number of beach visitors.

Labelled examples of past input-output pairs are provided during the
learning process to teach the model how it should behave; hence,
‘supervised’ learning. For the beach example, new inputs of forecast
temperature can then be fed in, and the machine learning algorithm will
output a prediction of the number of visitors.

There are two types of Supervised Learning techniques: Regression and
Classification.

Classification separates the data

Regression fits the data.

Classification
It predicts discrete responses. Classification models are trained to
classify data into categories.

E.g. speech recognition, medical imaging, spam detection

Regression
It predicts continuous responses. The difference between classification
and regression is that regression outputs a number rather than a class.
Therefore, regression is useful for predicting number-based quantities
such as stock market prices, the temperature on a given day, or the
probability of an event.
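
As a quick illustration of the difference (a minimal sketch, not part of the project code, with invented beach data), the classifier below returns a discrete class while the regressor returns a number:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

temps = [[15], [18], [22], [27], [31], [35]]   # forecast temperature (input)
crowded = [0, 0, 0, 1, 1, 1]                   # class label: beach crowded or not
visitors = [40, 60, 120, 400, 650, 900]        # continuous target: visitor count

clf = DecisionTreeClassifier().fit(temps, crowded)   # classification: discrete output
reg = DecisionTreeRegressor().fit(temps, visitors)   # regression: continuous output

print(clf.predict([[29]]))   # a class, e.g. [1]
print(reg.predict([[29]]))   # a number, e.g. [400.]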
Unsupervised Learning
In unsupervised learning, only input data is provided in the examples.
There are no labelled example outputs to aim for. But it may be
surprising to know that it is still possible to find many interesting and
complex patterns hidden within data without any labels.

An example of unsupervised learning in real life would be sorting coins of
different colours into separate piles. Nobody taught you how to separate
them, but just by looking at features such as colour, you can see which
coins belong together and cluster them into their correct groups.

Clustering
Unsupervised learning is mostly used for clustering. Clustering is the act
of creating groups with differing characteristics. Clustering attempts
to find various subgroups within a dataset. As this is unsupervised
learning, we are not restricted to any set of labels and are free to choose
how many clusters to create. This is both a blessing and a curse. Picking
a model that has the correct number of clusters (complexity) has to be
conducted via an empirical model selection process.
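
A minimal sketch of this idea (not from the project; the feature values are invented): each coin is described by two features, and k-means groups them without any labels.

import numpy as np
from sklearn.cluster import KMeans

# Each row is one coin: [colour value, diameter in mm] -- invented numbers
coins = np.array([[0.10, 20], [0.12, 21], [0.90, 24], [0.88, 25], [0.50, 30], [0.52, 31]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coins)
print(kmeans.labels_)   # cluster index assigned to each coin, e.g. [0 0 1 1 2 2]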
Reinforcement Learning
The final type of machine learning is less common and much more
complex, but it has generated incredible results. It does not use labels as
such; instead, it uses rewards to learn.

If you are familiar with psychology, you will have heard of reinforcement
learning. If not, you will already know the concept from how we learn in
everyday life. In this approach, occasional positive and negative
feedback is used to reinforce behaviours. Think of it like training a dog:
good behaviours are rewarded with a treat and become more common,
while bad behaviours are punished and become less common. This
reward-motivated behaviour is key in reinforcement learning.
Process

1. Data Collection: Collect the data that the algorithm will learn from.

2. Data Preparation: Format and engineer the data into the optimal
form, extracting important features and performing dimensionality
reduction.

3. Training: Also known as the fitting stage, this is where the Machine
Learning algorithm actually learns, by showing it the data that has
been collected and prepared.

4. Evaluation: Test the model to see how well it performs.

5. Tuning: Fine-tune the model to maximise its performance.


SPECIFICATIONS
Problem Statement: Comparative study of different classifiers on the
ILPD dataset

Dataset: https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

Programming Language: Python 3

Solution-
Since this is a classification problem, we ought to use classifiers. The
most common classifiers available to us through the scikit-learn package
are:

1. Multilayer Perceptron Feed-Forward Network
2. Random Forest
3. Support Vector Machine
4. Naïve Bayes Classifier
5. K-Nearest Neighbour

Since a single classifier may not meet the required target, we will be
using three of them.

Classifiers used-
1) Support Vector Machine
2) K Nearest Neighbour
3) Random Forest

Dataset- Indian Liver Patient Dataset (ILPD)

The first task is to import the dataset into the program; for that we are
using the pandas package.

data = pd.read_csv(r'D:\BOOKS\6TH SEM\DATA SCIENCE AND DATA ANALYTICS\Indian Liver Patient Dataset (ILPD).csv', names = features)
Data Preprocessing and Cleansing

The challenging part is to find and remove the gaps and redundancies.

We are checking the number of missing values through,

data.info() #this shows the missing value count

miss = data.isnull().sum()/len(data)

miss = miss[miss > 0]

miss.sort_values(inplace=True)

print("MISSING VALUES =", miss) #this reinforces data.info() in case it is skipped

The Albumin-Globulin Ratio feature has four missing values, as seen
above. One option is simply to drop the rows that contain missing data.
Alternatively, we could fill those places with values of our own, using
options like:

1. A constant value that has meaning within the domain, such as 0,
distinct from all other values.
2. A value from another randomly selected record, or from the
immediately next or previous record.
3. The mean, median or mode value of the column.
4. A value estimated by another predictive model.

But here, since only a very small fraction of values is missing, we choose
to replace them with the column mean.

data = data.groupby(data.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

If we chose to remove those rows instead, we would have opted for

data = data.dropna(how = 'any', axis = 0)
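
For completeness, here is a small sketch of options 1 and 3 from the list above (not used in this project, which keeps the mean); it assumes the same data DataFrame and the column name from the features list.

from sklearn.impute import SimpleImputer

col = 'Albumin-Globulin Ratio'

# Option 1: fill with a constant that has domain meaning (0 here is only illustrative)
filled_const = data[col].fillna(0)

# Option 3: fill with the column median (often more robust than the mean)
filled_median = SimpleImputer(strategy='median').fit_transform(data[[col]])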

Next, the gender values are converted to numeric values, and the
selector variable is mapped to the usual 0/1 convention. For this we use
LabelEncoder(), transform() and map(), as shown in the short sketch
below.
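
In isolation, that step looks like this (it mirrors the corresponding lines in the Source Code section):

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['Male', 'Female'])                             # classes are sorted, so Female -> 0, Male -> 1
data.loc[:, 'Gender'] = le.transform(data['Gender'])

#Map the selector to the usual 0/1 convention (2 -> 0, 1 -> 1)
data['Selector'] = data['Selector'].map({2: 0, 1: 1})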
Finally, we remove any duplicate rows that might be present in the
dataset.

data = data[~data.duplicated(subset = None, keep = 'first')]

Testing and Training

We use train_test_split(), keeping 70% of the data for training:

X_train, X_test, y_train, y_test = train_test_split(data, data['Selector'], test_size=0.3, random_state = 50)

It returns a list of length 2 * len(arrays), containing the train-test split of
the inputs.

As of scikit-learn version 0.16, if the input is sparse, the output will be
a scipy.sparse.csr_matrix; otherwise, the output type is the same as the
input type.

Scaling

Min-max scaling, or min-max normalization, is the simplest method; it
consists of rescaling the range of a feature to [0, 1] or [−1, 1]. Selecting
the target range depends on the nature of the data. The general formula
for min-max scaling to [0, 1] is:

x' = (x − min(x)) / (max(x) − min(x))

For example, suppose that we have students' weight data, and the
weights span [160 pounds, 200 pounds]. To rescale this data, we first
subtract 160 from each student's weight and divide the result by 40 (the
difference between the maximum and minimum weights).

We use the MinMaxScaler() from scikit-learn.
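
A short sketch of the weight example above, comparing the hand-computed formula with MinMaxScaler (the individual weights are invented within the stated range):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

weights = np.array([[160.0], [170.0], [185.0], [200.0]])   # pounds, spanning [160, 200]

# Hand-computed: x' = (x - min(x)) / (max(x) - min(x))
manual = (weights - weights.min()) / (weights.max() - weights.min())

scaled = MinMaxScaler().fit_transform(weights)   # same result via scikit-learn
print(np.allclose(manual, scaled))               # True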

Exploratory Data Analysis


We are finding out the positive and negative records

print("Positive records:", data['Selector'].value_counts().iloc[0])

print("Negative records:", data['Selector'].value_counts().iloc[1])

We plot the data based on age. Looking at the age vs. frequency graph,
we can observe that middle-aged people are the worst affected. Elderly
people also suffer from liver ailments, as seen from the bar sizes at ages
60-80.

The correlation matrix gives us the relationship between pairs of
features. From the correlation heatmap, the following pairs of features
appear to be closely related, as indicated by their high correlation
coefficients:

1. Total Bilirubin and Direct Bilirubin (0.87)
2. Sgpt Alamine Aminotransferase and Sgot Aspartate Aminotransferase (0.79)
3. Albumin and Total Proteins (0.78)
4. Albumin and Albumin-Globulin Ratio (0.69)
Using Classification Algorithms
Let us now evaluate the performance of various classifiers on this
dataset. For the sake of understanding as to how feature scaling affects
classifier performance, we will train models using both scaled and
unscaled data. Since we are interested in capturing records of people
who have been tested positive, we will base our classifier evaluation
metric on precision and recall instead of accuracy. We could also use F1
score, since it takes into account both precision and recall.

The classifiers we would be using are

1) K Nearest Neighbor “Birds of a feather flock together.”


The k-nearest neighbors algorithm (k-NN) is a non-parametric method
used for classification and regression. In both cases, the input consists
of the k closest training examples in the feature space. The output
depends on whether k-NN is used for classification or regression:

 In k-NN classification, the output is a class membership. An
object is classified by a plurality vote of its neighbors, with the
object being assigned to the class most common among
its k nearest neighbors (k is a positive integer, typically small).
If k = 1, then the object is simply assigned to the class of that
single nearest neighbor.
 In k-NN regression, the output is the property value for the
object. This value is the average of the values of its k nearest
neighbors.
k-NN is a type of instance-based learning, or lazy learning, where
the function is only approximated locally and all computation is
deferred until function evaluation.
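
To make the "plurality vote" concrete, here is a tiny from-scratch sketch (illustrative only; the project itself uses scikit-learn's KNeighborsClassifier):

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training example
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    # Labels of the k closest training examples
    nearest = np.asarray(y_train)[np.argsort(dists)[:k]]
    # Plurality vote among those k neighbours
    return Counter(nearest).most_common(1)[0][0]

X = [[1, 1], [1, 2], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
print(knn_predict(X, y, [5, 6], k=3))   # -> 1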

Advantages

1. The algorithm is simple and easy to implement.

2. There is no need to build a model, tune several parameters, or
make additional assumptions.

3. The algorithm is versatile. It can be used for classification,
regression, and search.

Disadvantages

1. The algorithm gets significantly slower as the number of examples
and/or predictors/independent variables increases.

2) Support Vector Machine


In machine learning, support-vector machines (SVMs, also support-vector
networks) are supervised learning models with associated learning
algorithms that analyze data used for classification and regression
analysis. Given a set of training examples, each marked as belonging to
one or the other of two categories, an SVM training algorithm builds a
model that assigns new examples to one category or the other, making it
a non-probabilistic binary linear classifier (although methods such as
Platt scaling exist to use SVM in a probabilistic classification setting). An
SVM model is a representation of the examples as points in space,
mapped so that the examples of the separate categories are divided by a
clear gap that is as wide as possible. New examples are then mapped
into that same space and predicted to belong to a category based on the
side of the gap on which they fall.
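
A minimal sketch of an SVM as a wide-margin separator on invented 2-D points (the project itself applies SVC with an RBF kernel to the ILPD features):

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 0], [8, 8], [9, 9], [8, 10]]   # two well-separated groups
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel='linear', C=1.0).fit(X, y)
print(svm.support_vectors_)            # the boundary points that define the margin
print(svm.predict([[3, 3], [7, 9]]))   # new points classified by the side of the gap, e.g. [0 1]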

3) Random Forests
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks that
operate by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes (classification) or
mean prediction (regression) of the individual trees. Random decision
forests correct for decision trees' habit of overfitting to their training set.
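
A minimal sketch of the ensemble idea on synthetic data (not the ILPD features): 20 trees are grown and their predictions are aggregated.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
print(len(rf.estimators_))   # 20 individual decision trees behind the forest
print(rf.predict(X[:3]))     # aggregated prediction across the trees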

All these classifiers are available in the sklearn package, and we will be
using those implementations. Accuracy, precision and recall will primarily
be used to judge them.

Accuracy of prediction measures how often the classifier predicts
correctly.

Precision is a measure that tells us what proportion of the samples
predicted positive are actually positive.

Recall tells us what proportion of the actual positive samples are
correctly identified.

Accuracy=(TP+TN)/(TP+TN+FP+FN)

Precision=TP/(TP+FP)

Recall=TP/(TP+FN)
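
A small sketch checking these formulas against scikit-learn's metric functions on an invented prediction vector (TP = 2, FN = 2, FP = 1, TN = 3):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]   # TP=2, FN=2, FP=1, TN=3

TP, FN, FP, TN = 2, 2, 1, 3
print((TP + TN) / (TP + TN + FP + FN), accuracy_score(y_true, y_pred))   # 0.625
print(TP / (TP + FP), precision_score(y_true, y_pred))                   # 0.666...
print(TP / (TP + FN), recall_score(y_true, y_pred))                      # 0.5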
Source Code
#Import required libraries

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

import seaborn as sns

from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_score

from sklearn.metrics import f1_score

from sklearn.metrics import recall_score

from sklearn.metrics import accuracy_score

#File does not contain headers so we need to load the headers manually

features = ["Age", "Gender", "Total Bilirubin", "Direct Bilirubin", "Alkphos Alkaline Phosphotase", "Sgpt Alamine Aminotransferase", "Sgot Aspartate Aminotransferase", "Total Proteins", "Albumin", "Albumin-Globulin Ratio", "Selector"]

data = pd.read_csv(r'D:\BOOKS\6TH SEM\DATA SCIENCE AND DATA ANALYTICS\Indian Liver Patient Dataset (ILPD).csv', names = features)

data.head()

#Overview of data

data.info()

miss = data.isnull().sum()/len(data)
miss = miss[miss > 0]

miss.sort_values(inplace=True)

print("MISSING VALUES =", miss)

#we are choosing to fill missing values with the mean

data = data.groupby(data.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

#Transform Gender strings into numeric values

le = preprocessing.LabelEncoder()

le.fit(['Male','Female'])

data.loc[:,'Gender'] = le.transform(data['Gender'])

#Also transform Selector variable into usual conventions followed

data['Selector'] = data['Selector'].map({2:0, 1:1})

#removing duplicates

data = data[~data.duplicated(subset = None, keep = 'first')]

#Overview of data

data.info()

data.head()

#features characteristics to determine if feature scaling is necessary

data.describe()

#Train-test splitting and scaling

X_train, X_test, y_train, y_test = train_test_split(data, data['Selector'], test_size=0.3, random_state = 50)

scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)
#Determining the healthy-affected split

print("Positive records:", data['Selector'].value_counts().iloc[0])

print("Negative records:", data['Selector'].value_counts().iloc[1])

#Determine statistics based on age

plt.figure(figsize=(12, 10))

plt.hist(data[data['Selector'] == 1]['Age'], bins = 16, align = 'mid', rwidth = 0.5, color = 'black', alpha = 0.8)

plt.xlabel('Age')

plt.ylabel('Number of Patients')

plt.title('Frequency-Age Distribution')

plt.grid(True)

plt.savefig('fig1')

plt.show()

#correlation-matrix

plt.subplots(figsize=(12, 10))

plt.title('Pearson Correlation of Features')

# Draw the heatmap using seaborn

sns.heatmap(data.corr(),linewidths=0.25, vmax=1.0, square=True,annot=True)

plt.savefig('fig2')

plt.show()

#Using normal data

#SVM Classifier with RBF kernel

svc_clf = SVC(C = 0.1, kernel = 'rbf').fit(X_train, y_train)

print("SVM Classifier on unscaled test data:")

print("Accuracy:", accuracy_score(y_test, svc_clf.predict(X_test)))

print("Precision:", precision_score(y_test, svc_clf.predict(X_test)))

print("Recall:", recall_score(y_test, svc_clf.predict(X_test)))


print("F-1 score:", f1_score(y_test, svc_clf.predict(X_test)))

#Using scaled data

svc_clf_scaled = SVC(C = 0.1, kernel = 'rbf').fit(X_train_scaled, y_train)

print("SVM Classifier on scaled test data:")

print("Accuracy:", accuracy_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

print("Precision:", precision_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

print("Recall:", recall_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

print("F-1 score:", f1_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

#KNN

knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X_train, y_train)

print("k-NN Classifier on unscaled test data:")

print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))

print("Precision:", precision_score(y_test, knn.predict(X_test)))

print("Recall:", recall_score(y_test, knn.predict(X_test)))

print("F-1 score:", f1_score(y_test, knn.predict(X_test)))

#Using scaled data

knn_scaled = KNeighborsClassifier(n_neighbors = 5)

knn_scaled.fit(X_train_scaled, y_train)

print("k-NN Classifier on scaled test data:")

print("Accuracy:", accuracy_score(y_test, knn_scaled.predict(X_test_scaled)))

print("Precision:", precision_score(y_test, knn_scaled.predict(X_test_scaled)))

print("Recall:", recall_score(y_test, knn_scaled.predict(X_test_scaled)))

print("F-1 score:", f1_score(y_test, knn_scaled.predict(X_test_scaled)))

#using normal data


#RANDOM FOREST

rfc = RandomForestClassifier(n_estimators = 20)

rfc.fit(X_train, y_train)

print("Random Forest Classifier on unscaled test data:")

print("Accuracy:", accuracy_score(y_test, rfc.predict(X_test)))

print("Precision:", precision_score(y_test, rfc.predict(X_test)))

print("Recall:", recall_score(y_test, rfc.predict(X_test)))

print("F-1 score:", f1_score(y_test, rfc.predict(X_test)))

#using scaled data

rfc_scaled = RandomForestClassifier(n_estimators = 20)

rfc_scaled.fit(X_train_scaled, y_train)

print("Random Forest Classifier on scaled test data:")

print("Accuracy:", accuracy_score(y_test, rfc_scaled.predict(X_test_scaled)))

print("Precision:", precision_score(y_test, rfc_scaled.predict(X_test_scaled)))

print("Recall:", recall_score(y_test, rfc_scaled.predict(X_test_scaled)))

print("F-1 score:", f1_score(y_test, rfc_scaled.predict(X_test_scaled)))

#Plotting the value for detailed analysis

barWidth = 0.25

bars1 = [accuracy_score(y_test, knn.predict(X_test)), precision_score(y_test, knn.predict(X_test)), recall_score(y_test, knn.predict(X_test)), f1_score(y_test, knn.predict(X_test))]

bars2 = [accuracy_score(y_test, svc_clf.predict(X_test)), precision_score(y_test, svc_clf.predict(X_test)), recall_score(y_test, svc_clf.predict(X_test)), f1_score(y_test, svc_clf.predict(X_test))]

bars3 = [accuracy_score(y_test, rfc.predict(X_test)), precision_score(y_test, rfc.predict(X_test)), recall_score(y_test, rfc.predict(X_test)), f1_score(y_test, rfc.predict(X_test))]

# Set position of bar on X axis

r1 = np.arange(len(bars1))

r2 = [x + barWidth for x in r1]

r3 = [x + barWidth for x in r2]


plt.ylim(.4577, 1)

plt.bar(r1, bars1, color='#AA0505', width=barWidth, edgecolor='white', label='KNN')

plt.bar(r2, bars2, color='#B97D10', width=barWidth, edgecolor='white', label='SVM')

plt.bar(r3, bars3, color='#FBCA03', width=barWidth, edgecolor='white', label='RF')

plt.xlabel('NORMAL DATA <STARK INDUSTRIES>', fontweight='bold')

plt.xticks([r + barWidth for r in range(len(bars1))], ['ACCURACY', 'PRECISION', 'RECALL', 'F-SCORE'])

plt.legend()

plt.show()

barWidth = 0.25

bars1 = [accuracy_score(y_test, knn_scaled.predict(X_test_scaled)), precision_score(y_test, knn_scaled.predict(X_test_scaled)), recall_score(y_test, knn_scaled.predict(X_test_scaled)), f1_score(y_test, knn_scaled.predict(X_test_scaled))]

bars2 = [accuracy_score(y_test, svc_clf_scaled.predict(X_test_scaled)), precision_score(y_test, svc_clf_scaled.predict(X_test_scaled)), recall_score(y_test, svc_clf_scaled.predict(X_test_scaled)), f1_score(y_test, svc_clf_scaled.predict(X_test_scaled))]

bars3 = [accuracy_score(y_test, rfc_scaled.predict(X_test_scaled)), precision_score(y_test, rfc_scaled.predict(X_test_scaled)), recall_score(y_test, rfc_scaled.predict(X_test_scaled)), f1_score(y_test, rfc_scaled.predict(X_test_scaled))]

# Set position of bar on X axis

r1 = np.arange(len(bars1))

r2 = [x + barWidth for x in r1]

r3 = [x + barWidth for x in r2]

plt.ylim(.4577, 1)
plt.bar(r1, bars1, color='#AA0505', width=barWidth, edgecolor='white', label='KNN')

plt.bar(r2, bars2, color='#B97D10', width=barWidth, edgecolor='white', label='SVM')

plt.bar(r3, bars3, color='#FBCA03', width=barWidth, edgecolor='white', label='RF')

plt.xlabel('SCALED DATA <STARK INDUSTRIES>', fontweight='bold')

plt.xticks([r + barWidth for r in range(len(bars1))], ['ACCURACY', 'PRECISION', 'RECALL', 'F-SCORE'])

plt.legend()

plt.show()

Output

Data header

We can see that 4 values are missing in the Albumin-Globulin Ratio
column.

After replacing the missing values with the mean and cleansing the data,
we see that no values are missing.

The accuracy, recall, precision and F-score of all 3 classifiers, for both
scaled and normal data.
Comparison
Conclusion
The Random Forest Classifier works best for both scaled and normal
data, as we can see from the various metrics presented in the graphs.

Feature scaling was definitely helpful in improving model performance.
While the Random Forest Classifier performed equally well with and
without feature scaling, this may not always be the case, because the
effect of feature scaling on performance depends on several factors:

1. Understanding the data distribution is important. It is quite possible
that some features are almost constant except for a small, noise-driven
variation. This noise would then be greatly amplified by the
normalization (a small sketch of this effect follows below).
2. The regularization parameter C is also a very important factor in
classifier performance. This is crucial either way, whether or not
feature scaling is done.
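
As an assumed illustration of point 1 (not from the project data), a nearly constant feature has its noise stretched across the full [0, 1] range by min-max normalization:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
nearly_constant = 100.0 + 0.001 * rng.standard_normal((10, 1))   # tiny noise around 100

scaled = MinMaxScaler().fit_transform(nearly_constant)
print(np.ptp(nearly_constant))   # raw spread: on the order of 0.003
print(np.ptp(scaled))            # after scaling: exactly 1.0 -- the noise now fills the range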
Further applications
We do not expect the same level of performance on bigger and denser
datasets. The main reasons for this are:

1. The dataset we worked on was very small, consisting of only 583
observations, of which 4 had missing values imputed.
2. The dataset was highly unbalanced, the positive records being
about three times as numerous as the negative ones.

Hence, even though we have obtained perfect scores on this dataset,
the performance of the same models on similar but bigger datasets is
expected to worsen.
