
2018 4th International Conference on Computing Communication and Automation (ICCCA)

Prediction of Liver Disease using Classification Algorithms
Thirunavukkarasu K.+, Ajay S. Singh*, Md Irfan, Abhishek Chowdhury#
School of Computer Science and Engineering, Galgotias University, Greater Noida, India
+prof.thiru@gmail.com, *drajay.cse@gmail.com, #xavierchowdhury44@gmail.com

Abstract— The healthcare industry produces a large amount of data. Machine learning algorithms can be used to find hidden information in this data for diagnosis and effective decision making. In recent years, liver disorders have increased rapidly, and liver disease is considered highly fatal in many countries, such as Egypt and Moldova. The main aim of this paper is to predict liver disease using different classification algorithms. The algorithms used are Logistic Regression, K-Nearest Neighbour and Support Vector Machines. Accuracy score and confusion matrix are used to compare these classification algorithms.

Keywords— Machine Learning, K-Nearest Neighbour, Logistic Regression, Support Vector Machines, Classification.

I. INTRODUCTION
Machine learning techniques have become very important in the healthcare sector for predicting disease from medical databases. Many researchers and companies are leveraging machine learning to improve medical diagnostics. Among the different machine learning techniques, classification algorithms are widely used in predicting diseases. In this paper, Logistic Regression, K-Nearest Neighbour and Support Vector Machines are used for the prediction of liver disease. The liver is the body's largest internal organ and performs very important functions, including making blood clotting factors and proteins, manufacturing triglycerides and cholesterol, glycogen synthesis and bile production. Usually, more than 75% of liver tissue needs to be affected before a decrease in function occurs.¹ It is therefore important to detect the disease at an early stage so that it can be treated before it becomes severe.

II. DATASET
The dataset is the Indian Liver Patient Dataset (ILPD), downloaded from the UCI Machine Learning Repository. This dataset has 567 instances and 10 attributes. The attributes are Age, Gender, DB, TB, ALB, SGOT, SGPT, TP, ALP and A/G ratio.

III. ALGORITHMS
For this work, we have used three classification algorithms.

A. Logistic Regression
Logistic Regression is a classification algorithm used to predict a binary outcome from a given set of independent variables. In logistic regression, we only look at the probability of the outcome of the dependent variable.² The odds of success are given by $\text{odds} = P/(1-P)$. In logistic regression, the dependent variable is a logit, that is, the natural log of the odds, given by $\log(\text{odds}) = \operatorname{logit}(P) = \ln\left(P/(1-P)\right)$.

In logistic regression, we find

$$ \operatorname{logit}(P) = a + bX = y $$
$$ \ln\left(\frac{P}{1-P}\right) = y $$
$$ \frac{P}{1-P} = e^{y} $$
$$ P = \frac{e^{y}}{1 + e^{y}} $$

Logistic regression equation:

$$ P = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} $$

where $y = b_0 + b_1 x$.

A typical logistic function is shown in Fig. 1.

Fig. 1: Logistic Function.

B. K-nearest neighbours
K-nearest neighbour is a non-parametric classification algorithm that stores all available cases and classifies new cases based on a similarity measure. It is non-parametric as it doesn't make any assumption on

¹ Anatomy and function of the liver. [Online]. Available: https://www.medicinenet.com/liver_anatomy_and_function/article.htm
² Abhishek Chowdhury, Thirunavukkarasu K, Sidhyant Tejas (2017), Predicting whether song will be hit using Logistic Regression. Volume 6 Issue 9 September 2017.

978-1-5386-6947-1/18/$31.00 ©2018 IEEE 1


the underlying data distribution. KNN uses Euclidean distance to predict the class, defined as follows:

$$ d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} $$

It works like this: a case is classified by a majority vote of its neighbours, so the case is assigned to the class most common amongst its K nearest neighbours, as measured by Euclidean distance.³

C. Support Vector Machines
Support Vector Machines is a classification algorithm based on the idea of finding a hyperplane that divides the dataset into two classes. Support vectors are the data points nearest to the hyperplane. Support Vector Machine is non-probabilistic, so it assigns each data point outright to one class rather than giving a probability. The basic idea is shown in Fig. 2.

Fig. 2: Data Classification.

The maths behind Support Vector Machine to solve the optimization problem is as follows:

$$ \max_{\alpha} Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j d_i d_j x_i^{T} x_j $$

where $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, n$.

$$ f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i d_i K(x, x_i) + b \right) $$

where $K(x, x_i)$ is the kernel function.⁴

IV. METHODOLOGY USED
The proposed method compares the classification accuracy of Logistic Regression, K-nearest neighbour and Support Vector Machine. The first step is to clean the data: missing values are filled in, and nominal attributes are transformed to binary attributes. The next step is feature selection, which selects the best subset of attributes. In this work, the relationship between the attributes and the predicted variable was examined using a pivot table, and attributes were selected based on the findings. The third step is data transformation: the data is standardized to have a mean of 0 and a standard deviation of 1. In the fourth step, the classification model is trained to predict results on unseen data. In the last step, based on the accuracy of the different classification models, the best model is selected for the prediction of liver disease.

[Flowchart: Train Data → Data Preprocessing → Feature Selection → Model Selection → Measure Accuracy]

IV. EXPERIMENT RESULT
In this section, the results given by the three classification algorithms, Logistic Regression (LR), K-nearest neighbour (KNN) and Support Vector Machine (SVM), are analysed. In the experiment, the dataset is divided into a training set and a testing set, in a ratio of 70% to 30% respectively. In this work, 10-fold cross-validation is used to train and test the machine learning model. The experiment is conducted in the Python programming language, and the libraries used are pandas and scikit-learn.

A. Performance Measure
We measure the performance of the model by the following measures:

• Confusion Matrix is the tabular representation of actual versus predicted values. Accuracy is calculated by (true positive (TP) + true negative (TN)) / (true positive (TP) + true negative (TN) + false positive (FP) + false negative (FN)).
• Sensitivity, also called the true positive rate, states how many of all positive values have been correctly predicted. Sensitivity = TP/(TP + FN)
• Specificity, also known as the true negative rate, states how many of all negative values have been correctly predicted. Specificity = TN/(TN + FP)

B. Results
All three classification algorithms have been tested.
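The three performance measures defined in the Performance Measure subsection can be computed directly from the four confusion-matrix counts. The following is a minimal Python sketch; the function name `confusion_metrics` and the counts in the example are hypothetical and chosen only for illustration, so they are not the values behind the paper's figures.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total      # (TP + TN) / (TP + TN + FP + FN)
    sensitivity = tp / (tp + fn)      # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not taken from the paper).
acc, sens, spec = confusion_metrics(tp=90, tn=20, fp=30, fn=10)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

With counts like these, a model can reach a high sensitivity while its specificity stays low, which is the pattern the reported results show.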

Fig. 3: Confusion matrix of K-nearest neighbour.

³ K-Nearest Neighbours. [Online]. Available: https://www.saedsayad.com/k_nearest_neighbors.htm
⁴ A. Charleonnan, T. Fufaung, T. Niyomwong, W. Chokchueypattanakit, S. Suwannawach, N. Ninchawee, "Predictive Analytics for Chronic Kidney Disease Using Machine Learning Techniques," MITiCON-2016.
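The logistic regression relations from Section III-A (odds, logit, and the logistic function) can be checked numerically with a short Python sketch. The function names and the coefficients `b0` and `b1` below are hypothetical illustration values; they are not fitted to the liver dataset.

```python
import math


def odds(p):
    """Odds of success: P / (1 - P)."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds: ln(P / (1 - P))."""
    return math.log(odds(p))


def logistic(y):
    """Inverse of the logit: P = e^y / (1 + e^y)."""
    return math.exp(y) / (1.0 + math.exp(y))


def predict_probability(x, b0, b1):
    """Probability for one feature x, with y = b0 + b1 * x."""
    return logistic(b0 + b1 * x)


# Hypothetical coefficients, for illustration only.
b0, b1 = -1.0, 0.5
p = predict_probability(2.0, b0, b1)  # here y = -1.0 + 0.5 * 2.0 = 0
print(p)
```

Because `logistic` is the inverse of `logit`, applying one after the other returns the original value, which is a quick way to confirm the algebra in the derivation.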

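The K-nearest neighbour rule from Section III-B (Euclidean distance followed by a majority vote) can be sketched in plain Python as follows. This is an illustrative from-scratch version, not the scikit-learn implementation used in the experiment; the function names and the toy two-feature data are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data for illustration only (not from the ILPD dataset).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
labels = [0, 0, 0, 1, 1, 1]
pred = knn_predict(points, labels, (1.2, 1.4), k=3)
print(pred)
```

An odd `k` avoids tied votes in a two-class problem. Since the vote depends on raw distances, features on larger scales dominate unless the data is standardized first, as the methodology describes.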
From the confusion matrix of the K-nearest neighbour model, the accuracy is calculated to be 73.97%. The sensitivity is 0.904 and the specificity is 0.317.

Fig. 4: Confusion matrix of Logistic Regression.

From the confusion matrix of the Logistic Regression model, the accuracy is calculated to be 73.97%. The sensitivity is 0.952 and the specificity is 0.195.

Fig. 5: Confusion Matrix of Support Vector Machine.

Finally, from the confusion matrix of the Support Vector Machine model, the accuracy is calculated to be 71.97%. The sensitivity is 0.952 and the specificity is 0.195.

Fig. 6: Comparison of the accuracy of all the algorithms in percentages.

Fig. 7: Comparison of sensitivity and specificity of different models.

From the experiment, it is found that Logistic Regression and K-Nearest Neighbour have equal accuracy. In medical terms, test sensitivity is the ability of a test to correctly identify those with the disease. Since Logistic Regression has the higher sensitivity of the two, it is the best model for predicting liver disease.

V. CONCLUSION
In this research work, different classification algorithms, namely Logistic Regression, Support Vector Machine and K-Nearest Neighbour, have been used for liver disease prediction. These algorithms were compared based on classification accuracy, which is found through the confusion matrix. From the experiment, Logistic Regression and K-Nearest Neighbour have the highest accuracy, but Logistic Regression has the highest sensitivity. Therefore, it can be concluded that Logistic Regression is appropriate for predicting liver disease.

REFERENCE
[1] A. Charleonnan, T. Fufaung, T. Niyomwong, W. Chokchueypattanakit, S. Suwannawach, N. Ninchawee, "Predictive Analytics for Chronic Kidney Disease Using Machine Learning Techniques," MITiCON-2016.
[2] Anatomy and function of the liver. [Online]. Available: https://www.medicinenet.com/liver_anatomy_and_function/article.htm
[3] Abhishek Chowdhury, Thirunavukkarasu K, Sidhyant Tejas (2017), Predicting whether song will be hit using Logistic Regression. Volume 6 Issue 9 September 2017.
[4] K-Nearest Neighbours. [Online]. Available: https://www.saedsayad.com/k_nearest_neighbors.htm
[5] P. Mazaheri, A. Narouzi and A. Karimi (2015), Using Algorithms to Predict Liver Disease Classification, Electronics Information and Planning. 3:255-259.
[6] H. Jin, S. Kim and J. Kim (2014), Decision Factors on Effective Liver Patient Data Prediction, International Journal of Bio-Science and Bio-Technology.
[7] Available: http://www.ics.uci.edu/~mlearn/databases/
[8] Comparative Study of Artificial Neural Network based Classification of Liver Patient, Journal of Information Engineering and Application. 3(4).
[9] David Diez, Christopher Barr and Mine Cetinkaya-Rundel. OpenIntro Statistics. 3rd Edition. OpenIntro.org
[10] Reetu and N. Kumar (2015), Medical Diagnosis for Liver Cancer Using Classification Techniques, International Journal of Recent Scientific. Volume 6, Issue 6, pp. 4809-4813.
[11] Tina R. Patil, Mrs. S. S. Sherekar. Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification. International Journal of Computer Science and Applications, Vol. 6, No. 2, Apr 2013.
