
DISEASE PREDICTION USING MACHINE LEARNING
A Mini Project Report

Submitted by

RAHUL MAZUMDER (380116010069)
RAJU KUMAR SHARMA (380116010070)
SAKSHI (380116020081)
SAURAV CHOWDHURY (380116010086)
KUMAR SOURABH ANAND (380117114001)

of
COMPUTER SCIENCE AND ENGINEERING
Under the Guidance of

Dr. Dharmpal Singh

JIS College of Engineering

(An Autonomous Institute)

Block “A” Phase III, Kalyani Nadia-741235

December, 2018
DECLARATION

We hereby declare that the project entitled “DISEASE PREDICTION USING
MACHINE LEARNING” submitted for the B. Tech. (CSE) degree is our original
work and the project has not formed the basis for the award of any other degree,
diploma, fellowship or any other similar titles.
Signature of the Students

Place: KALYANI

Date:
CERTIFICATE
This is to certify that RAHUL MAZUMDER (380116010069), RAJU KUMAR
SHARMA (380116010070), SAKSHI (380116020081), SAURAV
CHOWDHURY (380116010086) and KUMAR SOURABH ANAND
(380117114001) have completed their project entitled DISEASE PREDICTION
USING MACHINE LEARNING under the guidance of Dr. Dharmpal Singh,
in partial fulfillment of the requirements for the award of the Bachelor of
Technology in Computer Science and Engineering from JIS College of
Engineering (An Autonomous Institute). The project is an authentic record of
their own work carried out during the academic year 2018-19 and, to the best of
our knowledge, this work has not been submitted elsewhere as part of the process
of obtaining a degree, diploma, fellowship or any other similar title.

--------------------------------- -------------------------------

Signature of the Supervisor Signature of the HOD

_________________________________

Signature of the External Expert

Place:

Date:
ABSTRACT

Disease prediction is one of the critical tasks in designing
medical diagnosis software. Artificial intelligence and
neural networks are two major techniques that are already
used to solve this type of medical diagnosis problem.
Recently, machine learning techniques have been
successfully applied in a range of applications, including
assisting in medical diagnosis. They make it quick and easy
for patients to assess a disease from clinical and
laboratory symptoms, given appropriate data, and they give
more efficient results for a specific disease. The decision
tree remains one of the effective data mining methods to
date.
In this project, we first surveyed the current state of
medical diagnosis systems that use different data mining
techniques, and then proposed an approach to predict heart
disease, liver disease and diabetes in women based on
several attributes. The algorithms used in this project are
Decision Tree, Naïve Bayes, Support Vector Machine (SVM),
k-nearest neighbours (KNN), Logistic Regression and
Random Forest.
ACKNOWLEDGEMENT
We wish to express our gratitude to Dr. Dharmpal Singh for his support and
for providing effective guidance in the development of this project work. His
framing of the topic and all the helpful hints he provided contributed greatly
to the successful development of this work, without being pedagogic or
overbearing.

We also express our sincere gratitude to Dr. Dharmpal Singh, HOD of the
Department of Computer Science and Engineering of JIS College of Engineering,
and all the respected faculty members of the Department of CSE for giving us
the scope to carry out the project work successfully.

Finally, we take this opportunity to thank Dr. Malay R Dave, Principal of
JIS College of Engineering, and Dr. Deepak Ranjan Jana, Vice Principal, JISCE,
for giving us the scope to carry out the project work.

Date:
RAHUL MAZUMDER
B.TECH in Computer Science and Engineering
3rd YEAR/6th SEMESTER
Univ Roll—380116010069

RAJU KUMAR SHARMA
B.TECH in Computer Science and Engineering
3rd YEAR/6th SEMESTER
Univ Roll—380116010070

SAKSHI
B.TECH in Computer Science and Engineering
3rd YEAR/6th SEMESTER
Univ Roll--380116020081

SAURAV CHOWDHURY
B.TECH in Computer Science and Engineering
3rd YEAR/6th SEMESTER
Univ Roll—380116010086

KUMAR SOURABH ANAND
B.TECH in Computer Science and Engineering
3rd YEAR/6th SEMESTER
Univ Roll—38011711400
Table of Contents

Title Page
Declaration of the Student
Certificate of the Guide
Abstract
Acknowledgement

1. INTRODUCTION
1.1 Problem Definition
1.2 Project Overview/Specifications
1.3 Hardware Specification
1.4 Software Specification

2. LITERATURE SURVEY
2.1 Existing System
2.2 Proposed System
2.3 Feasibility Study

3. SYSTEM ANALYSIS & DESIGN
3.1 Requirement Specification
3.2 Flowcharts / DFDs / ERDs
3.3 Design and Test Steps / Criteria
3.4 Testing Process

4. RESULTS / OUTPUTS
5. CONCLUSIONS & FUTURE SCOPE
6. REFERENCES
1. INTRODUCTION

1.1 Problem Definition

The healthcare sector is being transformed by the ability
to record massive amounts of information about
individual patients, but the enormous volume of data being
collected is impossible for human beings to analyse.
Moreover, many people living in remote areas cannot
easily visit a doctor and hence have to wait a long time
for medical treatment.
To overcome this problem, there should be a solution
through which people can themselves predict the
presence of a disease by supplying some of their symptoms
as data.
1.2 Project Overview/Specifications
Disease prediction is a web-based machine learning application trained
on a UCI dataset. The user enters their medical details to get a
prediction of heart disease. The algorithm calculates the probability
that the disease is present, and the result is displayed on the webpage
itself, which minimizes the cost and time required to predict the
disease. The format of the data plays a crucial part in this
application: when user data is uploaded, the application checks the
file format, and if it is not as required, an ERROR dialog box is
shown. Our system implements the following three algorithms:

• Support Vector Machine (SVM)

• Logistic Regression

• Random Forest
Furthermore, some steps are taken to optimize the algorithms and
thereby improve their accuracy. These steps include cleaning the dataset
and data pre-processing. The algorithms were judged on their accuracy,
and Logistic Regression was observed to be the most accurate of the
three, with 83.11% accuracy. Hence, it was selected for the main
application. The main application is a web application that accepts the
various parameters from the user as input and computes the result. The
result is displayed along with the accuracy of the prediction.
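The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: it uses synthetic data from scikit-learn's `make_classification` as a stand-in for the UCI heart dataset, and the 75/25 split, the scaler and the model settings are all assumptions. It will not reproduce the 83.11% figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart dataset (assumption: 303 rows, 13 features).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Hold out 25% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The three algorithms named in the overview, with assumed settings.
models = {
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train each model and record its accuracy on the held-out rows.
accuracies = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test_s))

# Pick the model with the highest test accuracy.
best_model = max(accuracies, key=accuracies.get)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2%}")
print("Selected:", best_model)
```

Which model comes out on top depends on the data and the split, so the selection step is written generically rather than hard-coding Logistic Regression.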
1.3 Hardware Specification

1.3.1 Operating System: Windows 7 or above
1.3.2 RAM: 1GB
1.3.3 Hard Disk: 40GB

1.4 Software Specification

1.4.1 COLAB
1.4.2 JUPYTER
1.4.3 PANDAS LIBRARY
1.4.4 NUMPY LIBRARY
1.4.5 SKLEARN LIBRARY
1.4.6 FLASK
1.4.7 HTML
1.4.8 CSS
1.4.9 BOOTSTRAP
2. LITERATURE SURVEY
2.1 Existing System
The existing systems have various problems, which are
listed below:

• They use only one dataset for validation, which is not
sufficient to produce reliable outcomes.
• They report only the overall predictive performance of
their models, without considering the F-score and
precision as measures.
• Most studies do not provide statistical test results to
demonstrate the level of significance of their experimental
results.
• Most studies on ensemble classifiers do not compare
the performance of the individual classifiers with that of
the ensemble classifier built from those individual
classifiers.
2.2 Proposed System

The proposed system has the following features:

Heart disease prediction is a web-based
machine learning application trained on a
UCI dataset. The user enters their medical
details to get a prediction of heart disease.
The algorithm calculates the probability
that heart disease is present, and the result
is displayed on the webpage itself, which
minimizes the cost and time required to
predict the disease. The format of the data
plays a crucial part in this application:
when user data is uploaded, the application
checks the file format, and if it is not as
required, an ERROR dialog box is shown.
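The file-format check mentioned above could look something like the sketch below. The report does not describe the upload schema, so the `.csv` requirement, the `validate_upload` helper and the use of the heart features from section 3.1.2 as required columns are all assumptions made for illustration.

```python
import csv
import io

# Assumed schema: the heart features listed in section 3.1.2.
REQUIRED_COLUMNS = ["thalach", "slope", "cp", "restecg"]


def validate_upload(filename, content, required=REQUIRED_COLUMNS):
    """Return a list of error messages; an empty list means the file is acceptable."""
    errors = []

    # Reject anything that is not a CSV file (assumed accepted format).
    if not filename.lower().endswith(".csv"):
        errors.append("ERROR: only .csv files are accepted")
        return errors

    reader = csv.DictReader(io.StringIO(content))
    header = reader.fieldnames or []

    # Every required column must appear in the header row.
    missing = [col for col in required if col not in header]
    if missing:
        errors.append("ERROR: missing columns: " + ", ".join(missing))
        return errors

    # Every required cell must parse as a number; the header is line 1.
    for line_no, row in enumerate(reader, start=2):
        for col in required:
            try:
                float(row[col])
            except (TypeError, ValueError):
                errors.append(f"ERROR: line {line_no}, column '{col}' is not numeric")
    return errors


print(validate_upload("patient.csv", "thalach,slope,cp,restecg\n150,1,2,0\n"))
print(validate_upload("patient.txt", ""))
```

In a web application, a non-empty list would be what triggers the ERROR dialog; how the dialog is rendered is left to the front end.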
2.3 Feasibility Study
In the feasibility study phase, we went through the
steps described below:
1. Technical Feasibility: The project entitled “DISEASE
PREDICTION USING MACHINE LEARNING” is
technically feasible because it provides a high level
of reliability, availability and compatibility.
2. Economic Feasibility: The computerized system will
help automate the selection process, leading to gains in the
profits and records of the organization. With this software,
machine and manpower utilization are expected to go up
by approximately 80-90%. The cost of not creating the
system is expected to be high, because precious time
would be wasted on manual work.
3. SYSTEM ANALYSIS & DESIGN
3.1 REQUIREMENT SPECIFICATION
3.1.1 FUNCTIONAL REQUIREMENT

• The heart dataset is sourced from the UCI repository.

3.1.2 USER CHARACTERISTICS

• Heart disease prediction:
The user needs to submit the values of these features:
thalach, slope, cp, restecg

• Liver disease prediction:
The user needs to submit the values of these features:
Alkaline_Phosphotase, Age, Alamine_Aminotransferase,
Albumin, Aspartate_Aminotransferase
• Diabetes prediction:
The user needs to submit the values of these features:
Glucose, BMI, Age, DiabetesPedigreeFunction,
BloodPressure
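One way to turn the submitted values into model input is sketched below. The feature names come from the lists above; the `FEATURES` table, the `build_feature_vector` helper and the idea that form values arrive as strings are hypothetical and only illustrate the mapping.

```python
# Features each predictor expects, in the order listed in section 3.1.2.
FEATURES = {
    "heart": ["thalach", "slope", "cp", "restecg"],
    "liver": [
        "Alkaline_Phosphotase",
        "Age",
        "Alamine_Aminotransferase",
        "Albumin",
        "Aspartate_Aminotransferase",
    ],
    "diabetes": ["Glucose", "BMI", "Age", "DiabetesPedigreeFunction", "BloodPressure"],
}


def build_feature_vector(disease, form):
    """Convert submitted form values into a list of floats in the expected order."""
    if disease not in FEATURES:
        raise ValueError(f"unknown disease: {disease!r}")

    # Refuse to predict when any required field was not submitted.
    missing = [name for name in FEATURES[disease] if name not in form]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))

    return [float(form[name]) for name in FEATURES[disease]]


# Example: values as they might arrive from an HTML form (all strings).
vector = build_feature_vector(
    "heart", {"thalach": "150", "slope": "1", "cp": "2", "restecg": "0"}
)
print(vector)  # [150.0, 1.0, 2.0, 0.0]
```

Keeping the feature order in one table means the web form and the trained model cannot silently disagree about column positions.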
COLAB:
Heart disease prediction is a web-based machine learning application trained on a UCI
dataset.

Fig.1: Reading the dataset

Fig.2: Checking for null values

Fig.3: Dataset

Fig.4: Correlation between features

Fig.5: Heatmap to visualize the correlation (thalach, slope, cp and restecg are all
positively correlated with the target)
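Since the figures themselves are not reproduced here, the correlation step in Figs. 4 and 5 can be sketched as follows. The data frame is synthetic and built so that one column correlates positively and one negatively with the target; the column names are borrowed from the heart dataset only for readability, and the 0.1 threshold is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic columns (assumption): not real patient data.
df = pd.DataFrame(
    {
        "thalach": 140 + 15 * target + rng.normal(0, 10, n),    # built to rise with target
        "oldpeak": 1.5 - 0.8 * target + rng.normal(0, 0.5, n),  # built to fall with target
        "chol": rng.normal(240, 40, n),                         # independent of target
        "target": target,
    }
)

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()["target"].drop("target").sort_values(ascending=False)

# Keep the features whose correlation exceeds the assumed threshold.
positive_features = corr_with_target[corr_with_target > 0.1].index.tolist()

print(corr_with_target)
print(positive_features)
```

A heatmap like the one in Fig.5 is commonly drawn by passing `df.corr()` to a plotting library such as seaborn; that step is omitted here to keep the sketch dependency-light.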

Fig.6: Feature scaling with MinMaxScaler

Fig.7: Cross-validation (splitting the dataset) and creating the models

Fig.8: Accuracy check

Fig.9: Checking SVM classifier accuracy using the correlation-selected features

Fig.10: Random Forest classifier model built and accuracy check

Fig.11: Count plot of the target attribute

Fig.12: The accuracy is very low, so we move on to feature importance

Fig.13: oldpeak, thal, thalach and cp are all positively related to the
target according to the decision tree, random forest, etc.
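The feature-importance step in Figs. 12 and 13 might be approximated as below. This is a sketch on synthetic data: the generic `feature_i` names, the forest size and the choice to keep the top four features are assumptions, not the project's recorded settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first three columns are the informative ones.
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

# Fit a random forest and read its impurity-based importances.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank the features from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

# Keep the four highest-ranked features (assumed cut-off).
top_four = [name for name, _ in ranked[:4]]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
print(top_four)
```

Importances computed on the training data can favour features that merely help the forest memorise it, which is one plausible reason a feature set chosen this way could overfit.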
Fig.14: Logistic regression built; new accuracy 83.11%

Fig.14: Confusion matrix (16 predictions are incorrect)

Fig.15: Moving to GridSearch for hyperparameter tuning

Best parameters
Fig.17: Final accuracy with logistic regression

Fig.18: Final confusion matrix with minimum error

Result:
Because the new features selected by feature importance lead to
overfitting, we go with the correlation-based features, which give us
better predictions.
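The tuning and evaluation steps in Figs. 14 to 18 can be sketched along these lines. The data is synthetic, and the parameter grid, the 5-fold setting and the split ratio are assumptions; the report does not state which values were searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with four features, mirroring the four correlation-selected ones.
X, y = make_classification(
    n_samples=303, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed search space: only the regularisation strength C is tuned.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validated grid search over the training rows.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out rows.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Off-diagonal cells are the incorrect predictions.
errors = cm[0, 1] + cm[1, 0]

print("Best parameters:", search.best_params_)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(cm)
print("Incorrect predictions:", errors)
```

The count of incorrect predictions is read directly from the off-diagonal cells, which is how a statement such as "16 predictions are incorrect" would be derived from a confusion matrix.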

BLOCK DIAGRAM OF HEART DISEASE PREDICTION
Website
Methodology
As per the data and information we have gathered, we found that
the following tasks must be carried out in order to get more
accurate predictions. The tasks are as follows.
• Data Preprocessing: The dataset we obtained is not
completely accurate and error free. Hence, we first carry out
the following operations on it.
• Data Cleaning: NA values in the dataset are a major setback
for us, as they reduce the accuracy of the prediction
considerably. We therefore do not leave any field without a
value: each missing value is substituted with the mean value of
its column. This way, we remove all the missing values from the
dataset.
• Feature Scaling: Since the range of values of raw data varies
widely, in some machine learning algorithms the objective
functions will not work properly without feature scaling. For
example, many classifiers calculate the distance between two
points by the Euclidean distance. If one of the features has a
broad range of values, the distance will be governed by this
particular feature. Therefore, the range of all features should
be scaled so that each feature contributes approximately
proportionately to the final distance. So we scale the various
fields in order to bring them closer in terms of values. For
example, sex has just two values, 0 and 1, whereas cholesterol
has much higher values, of the order of 100 or more. So, in
order to bring them closer to each other, we need to scale them.
• Factorization: In this step, we assign a meaning to categorical
values so that the algorithm does not confuse them with
quantities. For example, the 0 and 1 codes in the sex field are
treated as category labels, so that the algorithm does not
consider 1 as greater than 0 in that field.
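The three preprocessing operations listed above can be illustrated on a toy table. This is a minimal sketch using pandas; the column names and values are made up, min-max scaling is written out by hand to show the formula (x - min) / (max - min), and one-hot encoding is used here as one common way to realise the factorization step, which is an assumption about what the project did.

```python
import numpy as np
import pandas as pd

# Toy records (assumption): one missing cholesterol value, one categorical column.
df = pd.DataFrame(
    {
        "sex": [0, 1, 1, 0],
        "chol": [200.0, np.nan, 300.0, 250.0],
        "cp": [0, 2, 1, 2],
    }
)

# Data cleaning: replace the missing value with the column mean (250.0 here).
df["chol"] = df["chol"].fillna(df["chol"].mean())

# Feature scaling: min-max scaling maps the column onto the range [0, 1].
chol_min, chol_max = df["chol"].min(), df["chol"].max()
df["chol"] = (df["chol"] - chol_min) / (chol_max - chol_min)

# Factorization: one-hot encode the categorical column so that its codes
# are not treated as ordered quantities.
df = pd.get_dummies(df, columns=["cp"], prefix="cp", dtype=int)

print(df)
```

After these steps the table has no missing values, the cholesterol column lies between 0 and 1 like the binary column, and each chest-pain code has its own indicator column.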
5. CONCLUSIONS
Research in ML methods for medical applications to-date
remains centered on technological issues and is mostly
application driven. However, in order to answer
fundamental questions and acquire useful insight in the
performance and behavior of ML methods in a medical
context, it is important to enhance our understanding of
ML algorithms, as well as to provide mathematical
justifications for their properties. Furthermore, we have
to cope with a number of difficulties which concern the
process of learning knowledge in practice, such as
visualization of the learned knowledge, extraction of
understandable rules from neural networks and
identification of noise and outliers in the data.

FUTURE SCOPE
The proposed work will be developed further to automate the
prediction of other diseases more accurately.
6. REFERENCES

Sellappan Palaniappan, Rafiah Awang, “Intelligent Heart Disease
Prediction System Using Data Mining Techniques”, IEEE, July 2015.
M. Raihan, Saikat Mondal, Arun More, Md. Omar Faruqe Sagor, Gopal
Sikder, Mahbub Arab Majumder, Mohammad Abdullah Al Manjur and
Kushal Ghosh, “Smartphone Based Ischemic Heart Disease (Heart
Attack) Risk Prediction using Clinical Data and Data Mining
Approaches, a Prototype Design”, September 2014.
Marjia Sultana, Afrin Haider and Mohammad Shorif Uddin, “Analysis
of Data Mining Techniques for Heart Disease Prediction”, May 2015.
Soodeh Nikan, Femida Gwadry-Sridhar and Michael Bauer, “Machine
Learning Application to Predict the Risk of Coronary Artery
Atherosclerosis”, IEEE, August 2016.
Sanjay Kumar Sen, Asst. Professor, Computer Science & Engg., Orissa
Engineering College, Bhubaneswar, Odisha, India, “Predicting and
Diagnosing of Heart Disease Using Machine Learning Algorithms”,
International Journal of Engineering and Computer Science, Volume 6,
Issue 6, June 2017.
V.V. Ramalingam, Ayantan Dandapath, M Karthik Raja, “Heart disease
prediction using machine learning tech: A survey”, International Journal
of Engineering & Technology, 7 (2.8), April 2018.
Heart Disease Dataset, https://www.kaggle.com/c/heart-disease, dated
Sept 2018.
K. Srinivas, B. Kavihta Rani, A. Govrdhan, “Applications of Data
Mining Techniques in Healthcare and Prediction of Heart Attack”,
(IJCSE) International Journal on Computer Science and Engineering, Vol.
02, No. 02, 2010.
Heart Disease Data Set,
https://archive.ics.uci.edu/ml/datasets/heart+Disease
