
Data Science and Data Analytics Lab

CS695A
Sayan Maity
CSE 3B
Roll-05
12017009001193

Shamik Basu
CSE 3B
Roll-06
12017009001193
INDEX

Sno Title


1 Acknowledgement
2 Title
3 Introduction
4 Specification
5 Solution
6 Source Code
7 Output
8 Conclusion
ACKNOWLEDGEMENT

The success and final outcome of this project required a lot of guidance
and assistance from many people, and we are extremely privileged to
have received it throughout the completion of our project. All that we
have done is only due to such supervision and assistance, and we would
not forget to thank them.

We respect and thank Prof. Moumita Basu for providing us the
opportunity to do this project and for giving us all the support and
guidance that enabled us to complete it duly. We are extremely thankful
to her for providing such kind support and guidance.

We owe our deep gratitude to our project guide, Prof. Shankhadeep
Chatterjee, who took keen interest in our project work and guided us all
along, till the completion of our project work, by providing all the
necessary information for developing a good system.

We are thankful to, and fortunate enough to have received, constant
encouragement, support and guidance from all the teaching staff of CSE,
which helped us in successfully completing our project work.

Sayan Maity

Shamik Basu
TITLE

COMPARATIVE STUDY OF 3
DIFFERENT CLASSIFIERS USED
IN PREDICTING LIVER
PATIENTS
Introduction

Machine Learning is the field of study that gives computers the
capability to learn without being explicitly programmed. ML is one of the
most exciting technologies one could come across. As is evident from
the name, it gives the computer the quality that makes it more similar to
humans: the ability to learn. Machine learning is actively being used
today, perhaps in many more places than one would expect.

Machine learning is a tool for turning information into knowledge. In the
past 50 years, there has been an explosion of data. This mass of data is
useless unless we analyse it and find the patterns hidden within. Machine
learning techniques are used to automatically find the valuable
underlying patterns within complex data that we would otherwise struggle
to discover. The hidden patterns and knowledge about a problem can be
used to predict future events and perform all kinds of complex decision
making.

There are multiple forms of Machine Learning: supervised, unsupervised,
semi-supervised and reinforcement learning. Each form of Machine
Learning has a differing approach, but they all follow the same
underlying process and theory.

Terminologies-

 Dataset: A set of data examples that contain features important to
solving the problem.

 Features: Important pieces of data that help us understand a
problem. These are fed into a Machine Learning algorithm to help it
learn.
 Model: The representation (internal model) of a phenomenon that
a Machine Learning algorithm has learnt. It learns this from the data
it is shown during training. The model is the output you get after
training an algorithm. For example, a decision tree algorithm would
be trained and produce a decision tree model.

Machine Learning Approaches

 Supervised Learning

 Unsupervised Learning

 Semi-supervised Learning

 Reinforcement Learning
Supervised Learning
In supervised learning, the goal is to learn the mapping (the rules)
between a set of inputs and outputs.

For example, the inputs could be the weather forecast, and the outputs
would be the visitors to the beach. The goal in supervised learning would
be to learn the mapping that describes the relationship between
temperature and number of beach visitors.

Labelled examples of past input-output pairs are provided during the
learning process to teach the model how it should behave; hence,
‘supervised’ learning. For the beach example, new inputs of forecast
temperature can then be fed in, and the machine learning algorithm will
output a prediction of the number of visitors.

There are two types of Supervised Learning techniques: Regression and
Classification.

Classification separates the data

Regression fits the data.

Classification
It predicts discrete responses. Classification models are trained to
classify data into categories.

E.g. speech recognition, medical imaging, spam detection

Regression
It predicts continuous responses. The difference between classification
and regression is that regression outputs a number rather than a class.
Therefore, regression is useful for predicting number-based quantities
such as stock market prices, the temperature on a given day, or the
probability of an event.
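
As a quick illustration of the difference (a minimal sketch, not part of the project code, with invented beach data), the classifier below returns a discrete class while the regressor returns a number:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

temps = [[15], [18], [22], [27], [31], [35]]   # forecast temperature (input)
crowded = [0, 0, 0, 1, 1, 1]                   # class label: beach crowded or not
visitors = [40, 60, 120, 400, 650, 900]        # continuous target: visitor count

clf = DecisionTreeClassifier().fit(temps, crowded)   # classification: discrete output
reg = DecisionTreeRegressor().fit(temps, visitors)   # regression: continuous output

print(clf.predict([[29]]))   # a class, e.g. [1]
print(reg.predict([[29]]))   # a number, e.g. [400.]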
Unsupervised Learning
In unsupervised learning, only input data is provided in the examples.
There are no labelled example outputs to aim for. But it may be
surprising to know that it is still possible to find many interesting and
complex patterns hidden within data without any labels.

An example of unsupervised learning in real life would be sorting coins of
different colours into separate piles. Nobody taught you how to separate
them, but just by looking at features such as colour, you can see which
coins belong together and cluster them into their correct groups.

Clustering
Unsupervised learning is mostly used for clustering. Clustering is the act
of creating groups with differing characteristics. Clustering attempts
to find various subgroups within a dataset. As this is unsupervised
learning, we are not restricted to any set of labels and are free to choose
how many clusters to create. This is both a blessing and a curse. Picking
a model that has the correct number of clusters (complexity) has to be
conducted via an empirical model selection process.
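
A minimal sketch of this idea (not from the project; the feature values are invented): each coin is described by two features, and k-means groups them without any labels.

import numpy as np
from sklearn.cluster import KMeans

# Each row is one coin: [colour value, diameter in mm] -- invented numbers
coins = np.array([[0.10, 20], [0.12, 21], [0.90, 24], [0.88, 25], [0.50, 30], [0.52, 31]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coins)
print(kmeans.labels_)   # cluster index assigned to each coin, e.g. [0 0 1 1 2 2]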
Reinforcement Learning
The final type of machine learning is less common and much more
complex, but it has generated incredible results. It does not use labels as
such; instead, it uses rewards to learn.

If you are familiar with psychology, you will have heard of reinforcement
learning. If not, you will already know the concept from how we learn in
everyday life. In this approach, occasional positive and negative
feedback is used to reinforce behaviours. Think of it like training a dog:
good behaviours are rewarded with a treat and become more common,
while bad behaviours are punished and become less common. This
reward-motivated behaviour is key in reinforcement learning.
Process

1. Data Collection: Collect the data that the algorithm will learn from.

2. Data Preparation: Format and engineer the data into the optimal
form, extracting important features and performing dimensionality
reduction.

3. Training: Also known as the fitting stage, this is where the Machine
Learning algorithm actually learns, by showing it the data that has
been collected and prepared.

4. Evaluation: Test the model to see how well it performs.

5. Tuning: Fine-tune the model to maximise its performance.


SPECIFICATIONS
Problem Statement: Comparative study of different classifiers on the
ILPD dataset

Dataset: https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

Programming Language: Python 3

Solution-
Since this is a classification problem, we ought to use classifiers. The
most common classifiers available to us through the scikit-learn package
are:

1. Multilayer Perceptron Feed-Forward Network
2. Random Forest
3. Support Vector Machine
4. Naïve Bayes Classifier
5. K-Nearest Neighbour

Since a single classifier may not meet the required target, we will be
using three of them.

Classifiers used-
1) Support Vector Machine
2) K Nearest Neighbour
3) Random Forest

Dataset- Indian Liver Patient Dataset (ILPD)

The first task is to import the dataset into the program; for that we are
using the pandas package.

data = pd.read_csv(r'D:\BOOKS\6TH SEM\DATA SCIENCE AND DATA ANALYTICS\Indian Liver Patient Dataset (ILPD).csv', names = features)
Data Preprocessing and Cleansing

The challenging part is to find and remove the gaps and redundancies.

We are checking the number of missing values through,

data.info() #this shows the missing value count

miss = data.isnull().sum()/len(data)

miss = miss[miss > 0]

miss.sort_values(inplace=True)

print("MISSING VALUES =", miss) #this reinforces data.info() in case it is skipped

The Albumin-Globulin Ratio feature has four missing values, as seen
above. One option is simply to drop the rows that contain missing data.
Alternatively, we could fill those places with values of our own, using
options like:

1. A constant value that has meaning within the domain, such as 0,
distinct from all other values.
2. A value from another randomly selected record, or from the
immediately next or previous record.
3. The mean, median or mode value of the column.
4. A value estimated by another predictive model.

But here, since only a very small fraction of values is missing, we choose
to replace them with the column mean.

data = data.groupby(data.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

If we chose to remove those rows instead, we would have opted for

data = data.dropna(how = 'any', axis = 0)
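
For completeness, here is a small sketch of options 1 and 3 from the list above (not used in this project, which keeps the mean); it assumes the same data DataFrame and the column name from the features list.

from sklearn.impute import SimpleImputer

col = 'Albumin-Globulin Ratio'

# Option 1: fill with a constant that has domain meaning (0 here is only illustrative)
filled_const = data[col].fillna(0)

# Option 3: fill with the column median (often more robust than the mean)
filled_median = SimpleImputer(strategy='median').fit_transform(data[[col]])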

Next, the gender values are converted to numeric values, and the
selector variable is mapped to the usual 0/1 convention. For this we use
LabelEncoder(), transform() and map(), as shown in the short sketch
below.
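
In isolation, that step looks like this (it mirrors the corresponding lines in the Source Code section):

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['Male', 'Female'])                             # classes are sorted, so Female -> 0, Male -> 1
data.loc[:, 'Gender'] = le.transform(data['Gender'])

#Map the selector to the usual 0/1 convention (2 -> 0, 1 -> 1)
data['Selector'] = data['Selector'].map({2: 0, 1: 1})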
Finally, we remove any duplicate rows that might be present in the
dataset.

data = data[~data.duplicated(subset = None, keep = 'first')]

Testing and Training

We use train_test_split(), keeping 70% of the data for training:

X_train, X_test, y_train, y_test = train_test_split(data, data['Selector'], test_size=0.3, random_state = 50)

It returns a list of length 2 * len(arrays), containing the train-test split of
the inputs.

As of scikit-learn version 0.16, if the input is sparse, the output will be
a scipy.sparse.csr_matrix; otherwise, the output type is the same as the
input type.

Scaling

Min-max scaling, or min-max normalization, is the simplest method; it
consists of rescaling the range of a feature to [0, 1] or [−1, 1]. Selecting
the target range depends on the nature of the data. The general formula
for min-max scaling to [0, 1] is:

x' = (x − min(x)) / (max(x) − min(x))

For example, suppose that we have students' weight data, and the
weights span [160 pounds, 200 pounds]. To rescale this data, we first
subtract 160 from each student's weight and divide the result by 40 (the
difference between the maximum and minimum weights).

We use the MinMaxScaler() from scikit-learn.
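
A short sketch of the weight example above, comparing the hand-computed formula with MinMaxScaler (the individual weights are invented within the stated range):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

weights = np.array([[160.0], [170.0], [185.0], [200.0]])   # pounds, spanning [160, 200]

# Hand-computed: x' = (x - min(x)) / (max(x) - min(x))
manual = (weights - weights.min()) / (weights.max() - weights.min())

scaled = MinMaxScaler().fit_transform(weights)   # same result via scikit-learn
print(np.allclose(manual, scaled))               # True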

Exploratory Data Analysis


We are finding out the positive and negative records

print("Positive records:", data['Selector'].value_counts().iloc[0])

print("Negative records:", data['Selector'].value_counts().iloc[1])

We plot the data based on age. Looking at the age vs. frequency graph,
we can observe that middle-aged people are the worst affected. Elderly
people also suffer from liver ailments, as seen from the bar sizes at ages
60-80.

The correlation matrix gives us the relationship between pairs of
features. From the correlation heatmap, the following pairs of features
appear to be closely related, as indicated by their high correlation
coefficients:

1. Total Bilirubin and Direct Bilirubin (0.87)
2. Sgpt Alamine Aminotransferase and Sgot Aspartate Aminotransferase (0.79)
3. Albumin and Total Proteins (0.78)
4. Albumin and Albumin-Globulin Ratio (0.69)
Using Classification Algorithms
Let us now evaluate the performance of various classifiers on this
dataset. For the sake of understanding as to how feature scaling affects
classifier performance, we will train models using both scaled and
unscaled data. Since we are interested in capturing records of people
who have been tested positive, we will base our classifier evaluation
metric on precision and recall instead of accuracy. We could also use F1
score, since it takes into account both precision and recall.

The classifiers we would be using are

1) K Nearest Neighbor “Birds of a feather flock together.”


The k-nearest neighbors algorithm (k-NN) is a non-parametric method
used for classification and regression. In both cases, the input consists
of the k closest training examples in the feature space. The output
depends on whether k-NN is used for classification or regression:

 In k-NN classification, the output is a class membership. An
object is classified by a plurality vote of its neighbors, with the
object being assigned to the class most common among
its k nearest neighbors (k is a positive integer, typically small).
If k = 1, then the object is simply assigned to the class of that
single nearest neighbor.
 In k-NN regression, the output is the property value for the
object. This value is the average of the values of its k nearest
neighbors.
k-NN is a type of instance-based learning, or lazy learning, where
the function is only approximated locally and all computation is
deferred until function evaluation.
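
To make the "plurality vote" concrete, here is a tiny from-scratch sketch (illustrative only; the project itself uses scikit-learn's KNeighborsClassifier):

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training example
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    # Labels of the k closest training examples
    nearest = np.asarray(y_train)[np.argsort(dists)[:k]]
    # Plurality vote among those k neighbours
    return Counter(nearest).most_common(1)[0][0]

X = [[1, 1], [1, 2], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
print(knn_predict(X, y, [5, 6], k=3))   # -> 1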

Advantages

1. The algorithm is simple and easy to implement.

2. There is no need to build a model, tune several parameters, or
make additional assumptions.

3. The algorithm is versatile. It can be used for classification,
regression, and search.

Disadvantages

1. The algorithm gets significantly slower as the number of examples
and/or predictors/independent variables increases.

2) Support Vector Machine


In machine learning, support-vector machines (SVMs, also support-vector
networks) are supervised learning models with associated learning
algorithms that analyze data used for classification and regression
analysis. Given a set of training examples, each marked as belonging to
one or the other of two categories, an SVM training algorithm builds a
model that assigns new examples to one category or the other, making it
a non-probabilistic binary linear classifier (although methods such as
Platt scaling exist to use SVM in a probabilistic classification setting). An
SVM model is a representation of the examples as points in space,
mapped so that the examples of the separate categories are divided by a
clear gap that is as wide as possible. New examples are then mapped
into that same space and predicted to belong to a category based on the
side of the gap on which they fall.
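
A minimal sketch of an SVM as a wide-margin separator on invented 2-D points (the project itself applies SVC with an RBF kernel to the ILPD features):

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 0], [8, 8], [9, 9], [8, 10]]   # two well-separated groups
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel='linear', C=1.0).fit(X, y)
print(svm.support_vectors_)            # the boundary points that define the margin
print(svm.predict([[3, 3], [7, 9]]))   # new points classified by the side of the gap, e.g. [0 1]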

3) Random Forests
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks that
operate by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes (classification) or
mean prediction (regression) of the individual trees. Random decision
forests correct for decision trees' habit of overfitting to their training set.
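
A minimal sketch of the ensemble idea on synthetic data (not the ILPD features): 20 trees are grown and their predictions are aggregated.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
print(len(rf.estimators_))   # 20 individual decision trees behind the forest
print(rf.predict(X[:3]))     # aggregated prediction across the trees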

All these classifiers are available in the sklearn package, and we will be
using those implementations. Accuracy, precision and recall will primarily
be used to judge them.

Accuracy of prediction measures how often the classifier predicts
correctly.

Precision is a measure that tells us what proportion of the samples
predicted positive are actually positive.

Recall tells us what proportion of the actual positive samples are
correctly identified.

Accuracy=(TP+TN)/(TP+TN+FP+FN)

Precision=TP/(TP+FP)

Recall=TP/(TP+FN)
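
A small sketch checking these formulas against scikit-learn's metric functions on an invented prediction vector (TP = 2, FN = 2, FP = 1, TN = 3):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]   # TP=2, FN=2, FP=1, TN=3

TP, FN, FP, TN = 2, 2, 1, 3
print((TP + TN) / (TP + TN + FP + FN), accuracy_score(y_true, y_pred))   # 0.625
print(TP / (TP + FP), precision_score(y_true, y_pred))                   # 0.666...
print(TP / (TP + FN), recall_score(y_true, y_pred))                      # 0.5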
Source Code
#Import required libraries

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

import seaborn as sns

from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_score

from sklearn.metrics import f1_score

from sklearn.metrics import recall_score

from sklearn.metrics import accuracy_score

#File does not contain headers so we need to load the headers manually

features = ["Age", "Gender", "Total Bilirubin", "Direct Bilirubin", "Alkphos Alkaline Phosphotase", "Sgpt Alamine Aminotransferase", "Sgot Aspartate Aminotransferase", "Total Proteins", "Albumin", "Albumin-Globulin Ratio", "Selector"]

data = pd.read_csv(r'D:\BOOKS\6TH SEM\DATA SCIENCE AND DATA ANALYTICS\Indian Liver Patient Dataset (ILPD).csv', names = features)

data.head()

#Overview of data

data.info()

miss = data.isnull().sum()/len(data)
miss = miss[miss > 0]

miss.sort_values(inplace=True)

print("MISSING VALUES =", miss)

#we are choosing to fill missing values with the mean

data = data.groupby(data.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

#Transform Gender strings into numeric values

le = preprocessing.LabelEncoder()

le.fit(['Male','Female'])

data.loc[:,'Gender'] = le.transform(data['Gender'])

#Also transform Selector variable into usual conventions followed

data['Selector'] = data['Selector'].map({2:0, 1:1})

#removing duplicates

data = data[~data.duplicated(subset = None, keep = 'first')]

#Overview of data

data.info()

data.head()

#features characteristics to determine if feature scaling is necessary

data.describe()

#Train-test splitting and scaling

X_train, X_test, y_train, y_test = train_test_split(data, data['Selector'], test_size=0.3, random_state = 50)

scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)
#Determining the healthy-affected split

print("Positive records:", data['Selector'].value_counts().iloc[0])

print("Negative records:", data['Selector'].value_counts().iloc[1])

#Determine statistics based on age

plt.figure(figsize=(12, 10))

plt.hist(data[data['Selector'] == 1]['Age'], bins = 16, align = 'mid', rwidth = 0.5, color = 'black', alpha = 0.8)

plt.xlabel('Age')

plt.ylabel('Number of Patients')

plt.title('Frequency-Age Distribution')

plt.grid(True)

plt.savefig('fig1')

plt.show()

#correlation-matrix

plt.subplots(figsize=(12, 10))

plt.title('Pearson Correlation of Features')

# Draw the heatmap using seaborn

sns.heatmap(data.corr(),linewidths=0.25, vmax=1.0, square=True,annot=True)

plt.savefig('fig2')

plt.show()

#Using normal data

#SVM Classifier with RBF kernel

svc_clf = SVC(C = 0.1, kernel = 'rbf').fit(X_train, y_train)

print("SVM Classifier on unscaled test data:")

print("Accuracy:", accuracy_score(y_test, svc_clf.predict(X_test)))

print("Precision:", precision_score(y_test, svc_clf.predict(X_test)))

print("Recall:", recall_score(y_test, svc_clf.predict(X_test)))


print("F-1 score:", f1_score(y_test, svc_clf.predict(X_test)))

#Using scaled data

svc_clf_scaled = SVC(C = 0.1, kernel = 'rbf').fit(X_train_scaled, y_train)

print("SVM Classifier on scaled test data:")

print("Accuracy:", accuracy_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

print("Precision:", precision_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

print("Recall:", recall_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

print("F-1 score:", f1_score(y_test, svc_clf_scaled.predict(X_test_scaled)))

#KNN

knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X_train, y_train)

print("k-NN Classifier on unscaled test data:")

print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))

print("Precision:", precision_score(y_test, knn.predict(X_test)))

print("Recall:", recall_score(y_test, knn.predict(X_test)))

print("F-1 score:", f1_score(y_test, knn.predict(X_test)))

#Using scaled data

knn_scaled = KNeighborsClassifier(n_neighbors = 5)

knn_scaled.fit(X_train_scaled, y_train)

print("k-NN Classifier on scaled test data:")

print("Accuracy:", accuracy_score(y_test, knn_scaled.predict(X_test_scaled)))

print("Precision:", precision_score(y_test, knn_scaled.predict(X_test_scaled)))

print("Recall:", recall_score(y_test, knn_scaled.predict(X_test_scaled)))

print("F-1 score:", f1_score(y_test, knn_scaled.predict(X_test_scaled)))

#using normal data


#RANDOM FOREST

rfc = RandomForestClassifier(n_estimators = 20)

rfc.fit(X_train, y_train)

print("Random Forest Classifier on unscaled test data:")

print("Accuracy:", accuracy_score(y_test, rfc.predict(X_test)))

print("Precision:", precision_score(y_test, rfc.predict(X_test)))

print("Recall:", recall_score(y_test, rfc.predict(X_test)))

print("F-1 score:", f1_score(y_test, rfc.predict(X_test)))

#using scaled data

rfc_scaled = RandomForestClassifier(n_estimators = 20)

rfc_scaled.fit(X_train_scaled, y_train)

print("Random Forest Classifier on scaled test data:")

print("Accuracy:", accuracy_score(y_test, rfc_scaled.predict(X_test_scaled)))

print("Precision:", precision_score(y_test, rfc_scaled.predict(X_test_scaled)))

print("Recall:", recall_score(y_test, rfc_scaled.predict(X_test_scaled)))

print("F-1 score:", f1_score(y_test, rfc_scaled.predict(X_test_scaled)))

#Plotting the value for detailed analysis

barWidth = 0.25

bars1 = [accuracy_score(y_test, knn.predict(X_test)), precision_score(y_test, knn.predict(X_test)), recall_score(y_test, knn.predict(X_test)), f1_score(y_test, knn.predict(X_test))]

bars2 = [accuracy_score(y_test, svc_clf.predict(X_test)), precision_score(y_test, svc_clf.predict(X_test)), recall_score(y_test, svc_clf.predict(X_test)), f1_score(y_test, svc_clf.predict(X_test))]

bars3 = [accuracy_score(y_test, rfc.predict(X_test)), precision_score(y_test, rfc.predict(X_test)), recall_score(y_test, rfc.predict(X_test)), f1_score(y_test, rfc.predict(X_test))]

# Set position of bar on X axis

r1 = np.arange(len(bars1))

r2 = [x + barWidth for x in r1]

r3 = [x + barWidth for x in r2]


plt.ylim(.4577, 1)

plt.bar(r1, bars1, color='#AA0505', width=barWidth, edgecolor='white', label='KNN')

plt.bar(r2, bars2, color='#B97D10', width=barWidth, edgecolor='white', label='SVM')

plt.bar(r3, bars3, color='#FBCA03', width=barWidth, edgecolor='white', label='RF')

plt.xlabel('NORMAL DATA <STARK INDUSTRIES>', fontweight='bold')

plt.xticks([r + barWidth for r in range(len(bars1))], ['ACCURACY', 'PRECISION', 'RECALL', 'F-SCORE'])

plt.legend()

plt.show()

barWidth = 0.25

bars1 = [accuracy_score(y_test, knn_scaled.predict(X_test_scaled)), precision_score(y_test, knn_scaled.predict(X_test_scaled)), recall_score(y_test, knn_scaled.predict(X_test_scaled)), f1_score(y_test, knn_scaled.predict(X_test_scaled))]

bars2 = [accuracy_score(y_test, svc_clf_scaled.predict(X_test_scaled)), precision_score(y_test, svc_clf_scaled.predict(X_test_scaled)), recall_score(y_test, svc_clf_scaled.predict(X_test_scaled)), f1_score(y_test, svc_clf_scaled.predict(X_test_scaled))]

bars3 = [accuracy_score(y_test, rfc_scaled.predict(X_test_scaled)), precision_score(y_test, rfc_scaled.predict(X_test_scaled)), recall_score(y_test, rfc_scaled.predict(X_test_scaled)), f1_score(y_test, rfc_scaled.predict(X_test_scaled))]

# Set position of bar on X axis

r1 = np.arange(len(bars1))

r2 = [x + barWidth for x in r1]

r3 = [x + barWidth for x in r2]

plt.ylim(.4577, 1)
plt.bar(r1, bars1, color='#AA0505', width=barWidth, edgecolor='white', label='KNN')

plt.bar(r2, bars2, color='#B97D10', width=barWidth, edgecolor='white', label='SVM')

plt.bar(r3, bars3, color='#FBCA03', width=barWidth, edgecolor='white', label='RF')

plt.xlabel('SCALED DATA <STARK INDUSTRIES>', fontweight='bold')

plt.xticks([r + barWidth for r in range(len(bars1))], ['ACCURACY', 'PRECISION', 'RECALL', 'F-SCORE'])

plt.legend()

plt.show()

Output

Data header

We can see that 4 values are missing in the Albumin-Globulin Ratio
column.

After replacing the missing values with the mean and cleansing the data,
we see that no values are missing.

The accuracy, recall, precision and F-score of all 3 classifiers, for both
scaled and normal data.
Comparison
Conclusion
The Random Forest Classifier works best for both scaled and normal
data, as we can see from the various metrics presented in the graphs.

Feature scaling was definitely helpful in improving model performance.
While the Random Forest Classifier performed equally well with and
without feature scaling, this may not always be the case, because the
effect of feature scaling on performance depends on several factors:

1. Understanding the data distribution is important. It is quite possible
that some features are almost constant except for a small, noise-driven
variation. This noise would then be greatly amplified by the
normalization (a small sketch of this effect follows below).
2. The regularization parameter C is also a very important factor in
classifier performance. This is crucial either way, whether or not
feature scaling is done.
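
As an assumed illustration of point 1 (not from the project data), a nearly constant feature has its noise stretched across the full [0, 1] range by min-max normalization:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
nearly_constant = 100.0 + 0.001 * rng.standard_normal((10, 1))   # tiny noise around 100

scaled = MinMaxScaler().fit_transform(nearly_constant)
print(np.ptp(nearly_constant))   # raw spread: on the order of 0.003
print(np.ptp(scaled))            # after scaling: exactly 1.0 -- the noise now fills the range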
Further applications
We do not expect the same level of performance on bigger and denser
datasets. The main reasons for this are:

1. The dataset we worked on was very small, consisting of only 583
observations, of which 4 had missing values imputed.
2. The dataset was highly unbalanced, the positive records being
about three times as numerous as the negative ones.

Hence, even though we have obtained perfect scores on this dataset,
the performance of the same models on similar but bigger datasets is
expected to worsen.
