Parul Institute of Engineering and Technology Faculty of Engineering and Technology Department of Information Technology

IT DEPT Machine Learning
LIVER DISEASE PREDICTION
A course completion REPORT
Submitted by
MONI KUMARI
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
PARUL INSTITUTE OF ENGINEERING AND TECHNOLOGY, PARUL UNIVERSITY,
VADODARA, GUJARAT
[2022-2023]
PREFACE
Health is a state of complete physical, mental and social well-being and not merely the
absence of disease or infirmity. ‘Health is wealth’ is a world-famous proverb concerning
health. A healthy body is defined by its overall ability to function well, and health includes
the physical, mental, emotional, and social well-being of every individual. Maintaining good
health is a key to happiness.
Maintaining health depends on multiple factors such as drinking water regularly,
exercising, eating healthy foods, and sleeping on time. A healthy life is also shaped by
everything from the people you spend time with to the air you breathe.
In India, delayed diagnosis of disease is a fundamental problem. The liver is an essential organ
of our body. There is a great need for early detection of liver disease so as to prevent
complete liver failure, which can result in the patient's death. For a proper diagnosis, it is
necessary to evaluate some of the main attributes of the liver patient dataset. These
attributes include “Total_bilirubin, direct_bilirubin, alkaline_phosphotas, total_protein,
albumin and globulin_ratio.” Figure 1 below shows the various functions that are performed
by the liver. It is a challenging task for doctors to predict liver disease accurately. Various
classification techniques are used to classify the data and predict liver disease from datasets
of liver patients. Having access to classification algorithms with a large amount of data will
help clinicians make better decisions and ultimately improve patient outcomes through
accurate prediction of liver disease. This paper surveys the classification techniques that can
be used for the prediction of liver disease and gives an idea, for future work, of which
classification technique can be utilised further for diagnosis of liver disease.
1. INTRODUCTION
The liver is the largest solid organ of the body. It removes toxins from the body’s blood
supply, maintains healthy blood sugar levels, regulates blood clotting, and performs hundreds
of other vital functions. Viruses and alcohol can damage the liver and lead to
life-threatening conditions. There are many types of liver disease, such as hepatitis,
cirrhosis, liver tumours, liver cancer, and many more. Among them, liver disease and
cirrhosis are a leading cause of death. Therefore, liver disease is one of the major health
problems in the world.
Fig. 1(a). Functionalities of Liver
1.1 Data Description
A total of 583 records are taken from the ILPD (Indian Liver Patient Dataset) to solve the
problem addressed in this report. The ILPD contains information about 583 patients, of
which 416 are liver patient records and 167 are non-liver patient records. The data set was
collected from the north east of Andhra Pradesh, India. Selector is the class label used to
divide the records into two groups (liver patient or not).
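The class counts above imply an imbalanced dataset. The short calculation below, a sketch that uses only the two counts stated in this section, shows the share of liver patient records, which is also the accuracy a trivial model would reach by always predicting "liver patient". The variable names are illustrative.

```python
# Class counts as stated in the data description (ILPD, 583 records in total).
liver_patients = 416
non_liver_patients = 167

total = liver_patients + non_liver_patients
majority_share = liver_patients / total

# A classifier that always answers "liver patient" would reach this accuracy,
# so it is a useful baseline when judging the trained models.
print(f"total records: {total}")
print(f"share of liver patient records: {majority_share:.1%}")
```

Roughly 71% of the records belong to the liver patient class, so an accuracy figure is only informative when it is compared against that baseline.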
2. Description of Solution Designed / Implemented
2.1. Planning and Analysis Phase
The planning phase includes the creation of ideas to support the healthcare and technical
teams through the prediction of liver diseases.
The main objective of the planning phase is to plan the steps involved in developing the
prediction system using the software engineering life cycle.
In addition, a challenging task is to close the gap between the software development
team and health-care specialists. In the analysis phase, the concern is to gather the prediction
system's requirements and environmental considerations. The requirements involve people
from different backgrounds, such as informaticists, physicians, and patients.
2.2. Design and Build Phase
In the design phase, the architecture model of the liver disease prediction system is
established. The architecture defines the user interface, segments, actions and behaviours of
the ILDP Software. The design document defines the technical plan for building the system as
per the requirements. The details of packages, programming language, platform, environment,
and other technical/non-technical matters are established.
2.3. Implementation Phase
In the implementation phase, the ILDPS is developed as identified in the design phase.
The main challenge of this phase is to implement the prediction system as per the
requirements, planning, and design. In the implementation phase, the ILDPS deals with
problems related to performance, quality and debugging.
3. Problem Statement
In recent years, liver disorders have increased sharply, and liver disease is becoming one
of the most fatal diseases in several countries. To address this problem, I choose and train
a machine learning algorithm to predict liver disease in patients.
4. Solution Description
4.1 Data Collection
The Indian liver patient dataset was collected for this problem. It is a multivariate dataset
containing ten attributes: age, gender, total bilirubin, direct bilirubin, total proteins,
albumin, a/g ratio, sgpt, sgot, and alkphos. The dataset contains 416 liver patient records
and 167 non-liver patient records.
Fig. 4(a). Type of Data
4.2 Importing Required Libraries
Fig. 4(b). Listed Libraries
Pandas is a software library written for the Python programming language for data
manipulation and analysis. We have imported pandas as pd for ease of use in the rest of
our program.
NumPy is a Python library for multi-dimensional arrays and matrices, along with a large
collection of high-level mathematical functions that operate on these arrays. We have imported
numpy as np.
The os module in Python provides functions for interacting with the operating system. The
warnings module decides whether a warning message is issued; this is controlled by its
warnings filter.
Scikit-learn is a free machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms. We have imported
some of the important utilities from sklearn that help us build our model, which is why they
are termed “common model helpers”.
Visualization is a powerful technique for seeing what our data looks like.
One of its greatest benefits is that it gives us access to huge amounts of data in easily
digestible visuals. To perform visualization we use the library Matplotlib, a multi-platform
data visualization library built on NumPy arrays. Matplotlib offers several kinds of plot,
such as line, bar, scatter, and histogram.
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.
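A minimal sketch of imports consistent with the libraries described above. The specific scikit-learn helpers chosen here are assumptions for illustration and are not necessarily the exact list shown in Fig. 4(b).

```python
import os
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# "Common model helpers" (assumed selection, for illustration only).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Adjust the warnings filter so that warning messages are not printed.
warnings.filterwarnings("ignore")
```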
4.3 Data Cleaning
Data cleaning is the procedure of correcting or removing inaccurate and corrupt data. This
process is crucial because wrong data can drive a business to wrong decisions, wrong
conclusions, and poor analysis, especially when large quantities of data are involved.
The columns are renamed for ease of understanding, and the rows with null values present in
the dataset are then deleted. Next, we subtracted one from the existing values of the
"Liver_disease" column so that they are categorised as 0 and 1. With the help of the label
encoder, we converted the categorical gender column into numeric labels. In the
"Liver_disease" column, 0 indicates that the person has some kind of liver disease, that is,
the patient's liver is unhealthy, and 1 indicates that the person's liver is healthy.
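The cleaning steps described above can be sketched as follows. The four-row table is made up, and the raw column names ("Dataset", "Albumin_and_Globulin_Ratio") as well as the new names are assumptions; only the sequence of operations follows the text.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up sample standing in for the real data (column names assumed).
raw = pd.DataFrame({
    "Age": [65, 62, 58, 72],
    "Gender": ["Female", "Male", "Male", "Male"],
    "Albumin_and_Globulin_Ratio": [0.90, 0.74, np.nan, 1.00],
    "Dataset": [1, 1, 2, 2],  # raw labels are assumed to be 1 and 2
})

# 1. Rename columns for readability.
df = raw.rename(columns={
    "Albumin_and_Globulin_Ratio": "globulin_ratio",
    "Dataset": "Liver_disease",
})

# 2. Drop rows that contain null values.
df = df.dropna().reset_index(drop=True)

# 3. Shift the label from {1, 2} to {0, 1}: 0 = liver disease, 1 = healthy.
df["Liver_disease"] = df["Liver_disease"] - 1

# 4. Encode gender as integers (classes are sorted: Female -> 0, Male -> 1).
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

print(df)
```

Because the label is shifted rather than remapped, the class that was 1 in the raw data becomes 0, which matches the convention stated above that 0 marks a diseased liver.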
Fig. 4(c). Operations
4.4 Data Visualization
Data visualization is the act of taking information (data) and placing it into a visual context,
such as a map or graph. Data visualizations make big and small data easier for the human
brain to understand, and visualization also makes it easier to detect patterns, trends, and
outliers in groups of data.
A boxplot is a standardized way of displaying the distribution of data based on a five-number
summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can
tell you about your outliers and what their values are.
Median (Q2/50th Percentile): The middle value of the dataset.
First quartile (Q1/25th Percentile): The middle number between the smallest number (not the
“minimum”) and the median of the dataset.
Third quartile (Q3/75th Percentile): The middle value between the median and the highest
value (not the “maximum”) of the dataset.
Interquartile range (IQR): The range from the 25th to the 75th percentile (Q3 - Q1).
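As an illustration of the definitions above, the sketch below computes the quartiles and the IQR for a small made-up sample with NumPy. The 1.5 x IQR fences used to flag outliers are the common boxplot convention and are an assumption here, since the report does not state which rule its plots use.

```python
import numpy as np

# Made-up sample; 50 is deliberately far from the rest.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 50])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Common convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The boxplot "minimum" and "maximum" are the extreme non-outlier values.
inside = values[(values >= lower_fence) & (values <= upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("whiskers:", inside.min(), inside.max())
print("outliers:", outliers)
```

For this sample, Q1 = 3, the median is 5, Q3 = 7 and the IQR is 4, so the only value outside the fences is 50.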
Fig. 4(d). Visualization
Fig. 4(e). Boxplot
Fig. 4(f). Histogram
4.5 Machine Learning Models
1. Logistic Regression: Logistic regression is a statistical analysis method that predicts a
binary outcome, such as yes or no, based on prior observations in a data set.
Fig. 4(g). LR Score
The AUC value is 0.5, which is no better than random guessing, so this algorithm is not giving
us accurate results. We therefore try a few more algorithms.
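A minimal sketch of fitting a logistic regression and computing ROC-AUC with scikit-learn. It uses synthetic data from `make_classification` as a stand-in for the liver dataset, so the numbers it prints say nothing about the results reported here; the split ratio and `max_iter` are assumptions. The last lines show that a score carrying no information, such as a constant, yields an AUC of 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the liver data (583 rows, 10 features).
X, y = make_classification(n_samples=583, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"ROC-AUC (logistic regression): {auc:.3f}")

# A constant score cannot rank any case above another, so its AUC is 0.5.
chance_auc = roc_auc_score(y_test, np.zeros(len(y_test)))
print(f"ROC-AUC (constant score): {chance_auc:.3f}")
```

When an AUC comes out at exactly 0.5, one thing worth checking is whether the scores passed to the metric are constant, for example hard predictions that all fall into a single class.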
2. Naive Bayes: It is a probabilistic classifier based on applying Bayes' theorem with a strong
independence assumption. The assumption made here is that the features are independent, which
means the presence of one particular feature does not affect the others.
Fig. 4(h). NB Score
We applied ROC-AUC to this algorithm and found an AUC of 0.653, which is above 0.5 and
somewhat closer to 1, so it performs a little better than the Logistic Regression
algorithm. However, this algorithm is still not the best for our dataset.
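A sketch of the same evaluation with a Naive Bayes classifier. The Gaussian variant (`GaussianNB`) is an assumption, since the report does not say which variant was used, and the data is again synthetic, so the printed AUC is not the 0.653 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the liver data (583 rows, 10 features).
X, y = make_classification(n_samples=583, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# GaussianNB models each feature as an independent normal distribution
# within each class, which is the "naive" independence assumption.
nb = GaussianNB()
nb.fit(X_train, y_train)

nb_auc = roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1])
print(f"ROC-AUC (naive Bayes): {nb_auc:.3f}")
```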
3. KNN: K-Nearest Neighbours is an approach to data classification that estimates how likely a
data point is to be a member of one group or another, depending on which group the data
points nearest to it belong to.
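To make the idea concrete, the sketch below implements the nearest-neighbour vote from scratch on six made-up points. The function name `knn_predict`, the choice of k = 3 and the Euclidean distance are illustrative assumptions; the report does not state the settings it used.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated made-up groups, labelled 0 and 1.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

near_first_group = knn_predict(X_train, y_train, np.array([1.1, 1.0]))
near_second_group = knn_predict(X_train, y_train, np.array([5.1, 5.0]))
print(near_first_group, near_second_group)
```

Because the method relies on distances, features measured on very different scales are usually standardised before it is applied.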
Fig. 4(i). KNN Score
4. Decision Tree: It is called a decision tree because, similar to a tree, it starts with the
root node, which expands into further branches and constructs a tree-like structure. A decision
tree simply asks a question and, based on the answer (Yes/No), further splits into
subtrees. Decision trees mimic the way humans reason when making a decision, so they are
easy to understand.
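A sketch of training a decision tree and reading its accuracy and confusion matrix with scikit-learn. The data is synthetic and `max_depth=3` is an assumed setting, so the output does not reproduce the scores shown in the figures.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the liver data (583 rows, 10 features).
X, y = make_classification(n_samples=583, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A shallow tree keeps the learned yes/no questions easy to inspect.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class

print(f"accuracy: {accuracy:.4f}")
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy equals the sum of the diagonal divided by the total number of test records.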
Fig. 4(j). DT Score
Fig. 4(k). DT Confusion Matrix
5. Conclusion
We applied all of the classification algorithms above and observed that the Decision Tree
algorithm performs best among them, with an accuracy of 75.17% and an AUC of 0.657. This
indicates some ability to separate the two classes, though still well short of the ideal
value of 1.
Fig. 4(l). Actual vs Predicted Values
After completing all the procedures and testing all the algorithms, we are finally ready with
our model.