

DISEASE PREDICTION SYSTEM USING MACHINE


LEARNING

A PROJECT REPORT

Submitted by

PRANAY CHOPRA [RA1611003030097]


VISHESH KAPOOR [RA1611003030108]
MANAS TYAGI [RA1611003030113]
HARSHIT RAJWAR [RA1611003030114]

Under the guidance of


Mr. RAKESH YADAV
(Assistant Professor, Department of Computer Science & Engineering)

in partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE & ENGINEERING


of

FACULTY OF ENGINEERING AND TECHNOLOGY

S.R.M. Nagar, Kattankulathur, Kancheepuram District


MAY 2020


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY


(Under Section 3 of UGC Act, 1956)

BONAFIDE CERTIFICATE

Certified that this project report titled “DISEASE PREDICTION SYSTEM USING MACHINE LEARNING” is the bonafide work of “PRANAY CHOPRA [RA1611003030097], VISHESH KAPOOR [RA1611003030108], MANAS TYAGI [RA1611003030113], HARSHIT RAJWAR [RA1611003030114]”, who carried out the project work under my supervision. Certified further, that to the best of my knowledge the work reported herein does not form part of any other project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

SIGNATURE

Mr. RAKESH YADAV
GUIDE
Assistant Professor
Dept. of Computer Science & Engineering

SIGNATURE

Dr. R. P. Mahapatra
HEAD OF THE DEPARTMENT
Dept. of Computer Science & Engineering

Signature of the Internal Examiner

Signature of the External Examiner


ABSTRACT

The Disease Prediction System is based on prediction models that predict a user's disease from the symptoms the user enters as input to the system. Predictive models built with machine learning classification algorithms analyze the symptoms provided by the user and give the name and probability of the disease as output. Disease prediction is done by implementing the Naive Bayes classifier, the Decision Tree algorithm and the Random Forest algorithm. Naive Bayes is used to calculate the probability of the predicted disease, and an average prediction accuracy of 87% is obtained. The model uses a dataset of 132 symptoms from which the user can select those they are experiencing. The user does not need a medical report to use this system, as the prediction is based on symptoms alone, which saves money. The system also has a very easy-to-use interface, so any user can use it to predict common diseases.


ACKNOWLEDGEMENT

I would like to express my deepest gratitude to my guide, Mr. Rakesh Yadav, for his valuable guidance, consistent encouragement, personal care, timely help and for providing me with an excellent atmosphere for doing research. Throughout the work, in spite of his busy schedule, he extended cheerful and cordial support to me for completing this research work.

Author


TABLE OF CONTENTS

ABSTRACT iii

ACKNOWLEDGEMENT iv

LIST OF TABLES vii

LIST OF FIGURES viii

ABBREVIATIONS ix

1 INTRODUCTION 1
1.1 About . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data Mining and Machine Learning Algorithm . . . . . . . . . . . . . . . 3
1.2.1 Data Analysis and Data Mining . . . . . . . . . . . . . . . . . . . 3
1.2.2 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . 4
1.3 Django Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Advantages of Django . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Django Working . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Amazon Web Services(AWS) . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.1 Benefits of AWS . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.2 AWS Services Used . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 LITERATURE SURVEY 12
2.1 Introduction to Old Models . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Identification of Research Gap and Problems . . . . . . . . . . . . . . . . 15

3 SYSTEM ANALYSIS 18
3.1 Major Inputs Required . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Feasibility Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


3.2.1 Technical Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 19


3.2.2 Economic Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 Operational Feasibility . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Algorithms Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 SYSTEM DESIGN 24
4.1 Machine Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Module Description . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 CODING 35
5.1 Machine Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6 CONCLUSION 42

7 FUTURE ENHANCEMENT 45


LIST OF TABLES

3.1 Prediction Using Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 21


LIST OF FIGURES

1.1 Components Of MVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Accuracy Of The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


2.2 Accuracy Of The Breast Cancer Dataset . . . . . . . . . . . . . . . . . . . 16
2.3 Dataset For [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Sigmoid Activation Function . . . . . . . . . . . . . . . . . . . . . . . . . 20


3.2 Decision Tree Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Random Forest Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1 Detailed Design Of Model . . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.2 AWS Console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Ubuntu AMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4 EC2 Instance Is Live . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Connecting Through SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.6 Postgres And Other DB Options In AWS . . . . . . . . . . . . . . . . . . . 30
4.7 Postgres Master Username . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 Database Name And Details . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.9 Pgadmin Details To Access Instance . . . . . . . . . . . . . . . . . . . . . 32
4.10 Postgres Instance In Pgadmin . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.11 Doctor Consultation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.12 Patient Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.13 Doctor’s Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

6.1 Symptoms Selection List . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


6.2 Doctor Consultation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Predicted Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


ABBREVIATIONS

AI Artificial Intelligence

ARN Amazon Resource Name

AWS Amazon Web Services

CDF Cumulative Distribution Function

CNN Convolutional Neural Network

EC2 Elastic Compute Cloud

KNN K-Nearest Neighbour

ML Machine Learning

ORM Object-Relational Mapping

PDF Probability Density Function

RDS Relational Database Service

SCP Secure Copy

SSH Secure Shell

SVM Support Vector Machine

UI User Interface


CHAPTER 1

INTRODUCTION

1.1 About

There are times when we need a doctor all of a sudden, but doctors are sometimes unavailable and we are left in trouble. The proposed system is a user-friendly online healthcare system through which help and advice on health issues can be obtained immediately. Nowadays, with the help of statistics and posterior distributions, such problems can be solved swiftly and easily. Just as Bayesian statistics has had great success in economics, the social sciences and a few other fields, in medicine it has been used to solve, through classification, various problems that are tiresome to settle with classical statistics. Naive Bayes is among the most common basic classification techniques; it is based on the theorem formulated by Reverend Thomas Bayes. The classification rules that drive the disease prediction are generated from the training samples themselves, which helps in solving the problem easily.

It is estimated that more than 70% of people in India suffer from common body ailments such as viral fever, flu, cough and cold at intervals of about two months. Because many people do not realize that these general ailments can be symptoms of something more harmful, 25% of this population dies or develops a serious medical problem as a result of ignoring the early general symptoms. This is a serious situation and can become alarming if people continue to ignore these illnesses. Hence, identifying or predicting the disease at a very early stage is important to avoid unwanted complications and deaths. The systems available today are either dedicated to a particular disease or are still at the development or research stage when it comes to generalized disease prediction.

The main motive of the proposed system is the prediction of commonly occurring diseases at an early phase, because when they are not checked or examined they can turn into more dangerous diseases and can even cause death. The system applies data mining techniques,

the Decision Tree algorithm, the Naive Bayes algorithm and the Random Forest algorithm. The system will predict the most probable disease based on the symptoms given by the user, along with the precautionary measures required to avoid aggravation of the disease; it will also help doctors analyze the patterns of diseases in society. This project is dedicated to a Disease Prediction System that uses data mining techniques for the initial stages of dataset preparation, while the main model is trained using Machine Learning (ML) algorithms and helps in the prediction of general diseases.


1.2 Data Mining and Machine Learning Algorithm

Data mining and machine learning algorithms are used for the prediction of disease in this project. Different data mining and machine learning techniques are used to correct and evaluate the dataset, and the ML model is then assessed on the basis of its train and test scores.

1.2.1 Data Analysis and Data Mining

Data mining is a process in which raw, unstructured data is prepared and structured so that meaningful information can be extracted from it and used in the project. The task of organizing the data is to understand what information the data does and does not contain. There are many different methods of data analysis, and it is easy to use the data during the analysis phase to reach certain conclusions. Data analysis is a process of inspecting, cleaning, transforming and modeling data with the objective of highlighting useful information, suggesting conclusions and supporting decision making in ways that are helpful to the user. Data analysis has multiple facets and approaches, encompassing diverse techniques under an array of names, in different business, science and social science domains.

Data mining is the discovery of previously unknown information in databases. Data mining functions include methods for clustering, classification, prediction and association. An important application of data mining is mining association rules; association rules were first introduced in 1993 and are used to identify relationships among a set of items in databases. These relationships are not based on the properties of the data themselves, but rather on the co-occurrence of the data items. Data mining gives new and different perspectives for data analysis; its main role is to extract and discover new knowledge from data. In the past few years, data collection and generation capabilities have grown, and data collection tools have provided us with huge amounts of data. Data mining processes integrate techniques from multiple disciplines such as statistics, machine learning, database technology, pattern recognition, neural


networks, information retrieval and spatial data analysis. Data mining techniques have been used in many different fields such as business management, science, engineering, banking, data management, administration and many other applications.
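As a small illustration of the inspect, clean and transform steps described above, the sketch below uses pandas (which this project also uses for reading CSV files); the file name "symptom_dataset.csv" is a hypothetical stand-in, not the project's actual dataset path.

import pandas as pd

# Hypothetical file name; the real dataset path may differ.
df = pd.read_csv("symptom_dataset.csv")

# Inspect: shape, column types and a preview of the records.
print(df.shape)
print(df.dtypes)
print(df.head())

# Clean: drop exact duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Transform: summary statistics of the cleaned data.
print(df.describe())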

1.2.2 Machine Learning Algorithms

ML is a subfield of Artificial Intelligence (AI) that carries out the computational and analytical work within AI. ML algorithms are used to find patterns and structure in the dataset provided to the model; they give the system large computational capabilities, in which a large amount of data is fed to the model for training and testing. ML algorithms are also used in decision-making processes: a model prepared using ML is backed by a large amount of data, which makes it well suited to decision making. ML algorithms have very high computational power and have proven very helpful in today's world.

Different types of ML algorithms are organized in different ways, based on the desired outcome of the algorithm. Common algorithm types include:

\ Supervised learning — A supervised learning algorithm applies what has been learned in the past to new data, using labelled examples to predict future events, starting from the analysis of a known training dataset. This type of algorithm is used to provide targets for new values after a sufficient amount of training of the model.

\ Unsupervised learning — Unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labeled. This algorithm shows how the
system can infer a function to describe a hidden structure from unlabeled data.

\ Semi-supervised learning — This category of ML algorithms falls somewhere between supervised and unsupervised learning; it combines both labeled and unlabeled examples to generate an appropriate function or classifier, which is used to build a model for prediction or classification.

\ Reinforcement learning — This is the algorithm where the algorithm learns a policy
of how to act given an observation of the world. Every action has some impact in the
environment, and the environment provides feedback that guides the learning algorithm.

\ Transduction — This algorithm is similar to supervised learning, but does not explicitly construct a function; instead, it tries to predict new outputs based on training inputs, training outputs and new inputs.


\ Learning to learn — This method is where the algorithm learns its own inductive bias
based on previous experience.

The performance and computational analysis of ML algorithms is a branch of statistics known as computational learning theory.

Machine learning is about designing algorithms that allow a computer to learn. Learning does not necessarily involve consciousness; rather, it is a matter of finding statistical regularities or other patterns in the data. Thus, many machine learning algorithms will barely resemble how a human might approach a learning task. However, learning algorithms can give insight into the relative difficulty of learning in different environments.

Machine learning is made up of three parts:

\ The computational algorithm at the core of making determinations.

\ Variables and features that make up the decision.

\ Base knowledge for which the answer is known that enables (trains) the system to learn.

Initially, the model is fed parameter data for which the answer is known. The algorithm
is then run, and adjustments are made until the algorithm’s output (learning) agrees with the
known answer. At this point, increasing amounts of data are input to help the system learn and
process higher computational decisions.
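A minimal sketch of this train-then-evaluate loop with scikit-learn is shown below; the toy arrays X and y stand in for the real symptom matrix and labels, and the decision tree could be swapped for any of the other classifiers used later in the project.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Base knowledge: feature vectors with known answers (toy stand-in data).
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

# Hold out part of the data so the learned model can be checked against known answers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))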


1.3 Django Framework

Django is a high-level Python framework that encourages rapid development and clean, pragmatic design; it makes it easier to build applications quickly, efficiently and with less code. Django is used for creating the User Interface (UI) of the application. The UI created with Django is easy to use, so that people from non-technical fields can also use the application to predict diseases without going anywhere, saving both time and money.

1.3.1 Advantages of Django

Here are few advantages of using Django which can be listed out here:

\ Object-Relational Mapping (ORM) Support: The Django framework provides a connection between the data model and the database engine, and supports a large set of database engines including MySQL, Oracle, Postgres, etc. Django also supports NoSQL databases through the Django-nonrel fork. For now, the only NoSQL databases supported are MongoDB and Google App Engine.

\ Multilingual Support: Django supports multilingual websites through its built-in internationalization system, so you can develop websites that support multiple languages.

\ Framework Support: Django has built-in support for Ajax, RSS, caching and various other frameworks.

\ Administration GUI: Django provides a nice ready-to-use user interface for administrative activities.

\ Development Environment: Django comes with a lightweight web server to facilitate end-to-end application development and testing.

1.3.2 Django Working

As you already know, Django is a Python web framework and, like most modern frameworks, Django supports the MVC pattern. First let's see what the Model-View-Controller (MVC)


pattern is, and then we will look at Django's specific take on it, the Model-View-Template (MVT) pattern.

MVC Pattern: When talking about applications that provide a UI (web or desktop), we usually talk about the MVC architecture. As the name suggests, the MVC pattern is based on three components: Model, View and Controller.

DJANGO MVC - MVT Pattern: It is slightly different from MVC. In fact, the main difference between the two patterns is that Django itself takes care of the Controller part (the software code that controls the interactions between the Model and View), leaving us with the template. The template is an HTML file mixed with the Django Template Language (DTL). Figure 1.1 illustrates how the components of the MVT pattern interact with each other to serve a user request.

Figure 1.1: Components Of MVT

This is how Django works during the development of the application: migrations are used in Django to provide the schema to the database, and Django has an integrated web server so that the application does not face any system configuration problems. All of this is controlled by the manage.py file in the Django project, which also helps in starting the web server.
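As a minimal sketch of the MVT flow in figure 1.1 (the app, view and template names here are hypothetical, not the project's actual code), a URL pattern routes the request to a view, which consults the model layer and renders a template:

# urls.py — maps an incoming URL to a view.
from django.urls import path
from . import views

urlpatterns = [
    path("predict/", views.predict, name="predict"),
]

# views.py — the view talks to the model layer and renders a template.
from django.shortcuts import render
from .models import Symptom   # hypothetical model class

def predict(request):
    symptoms = Symptom.objects.all()
    return render(request, "predict.html", {"symptoms": symptoms})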


1.4 PostgreSQL

PostgreSQL is an object-relational database management system (ORDBMS) developed at the University of California at Berkeley Computer Science Department. In this project, PostgreSQL is used to store information such as the login credentials (username and password) of the users, the doctors and the superuser. Postgres pioneered many concepts that only became available in some commercial database systems much later. PostgreSQL is an open-source descendant of this original Berkeley code. It supports a large part of the SQL standard and offers many modern features:

\ Complex queries

\ Foreign keys

\ Triggers

\ Updatable views

\ Transactional integrity

\ Multiversion concurrency control

Also, PostgreSQL can be extended by the user in many ways, for example by adding new

features like:

\ Data types

\ Functions

\ Operators

\ Aggregate functions

\ Index methods

\ Procedural languages

The PostgreSQL engine used in the project is provisioned through the Relational Database Service (RDS) of Amazon Web Services (AWS), which makes the database remote so that it can be accessed from anywhere with the help of its ARN.
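In Django, pointing the project at such a remote RDS PostgreSQL instance comes down to the DATABASES setting; the endpoint, database name and credentials below are placeholders for illustration, not the project's actual values.

import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "diseasedb",                      # placeholder database name
        "USER": "master_user",                    # placeholder RDS master username
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": "mydb.xxxx.ap-south-1.rds.amazonaws.com",  # placeholder RDS endpoint
        "PORT": "5432",
    }
}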


1.5 Amazon Web Services (AWS)

Amazon Web Services offers a broad set of global cloud-based products, including compute, storage, databases, analytics, networking, mobile, developer tools, management tools, IoT, security and enterprise applications, available on demand in seconds with pay-as-you-go pricing. From data warehousing to deployment tools, and from directories to content delivery, over 140 services are available on AWS today, and new services can be provisioned quickly without upfront capital expense. This allows enterprises, start-ups, small and medium-sized businesses, and customers in the public sector to access the building blocks they need to respond quickly to changing business requirements. This section provides an overview of the benefits of the AWS Cloud and introduces the services that make up the platform.

In 2006, AWS began offering IT infrastructure services to businesses as web services—now


commonly known as cloud computing. One of the key benefits of cloud computing is the
opportunity to replace upfront capital infrastructure expenses with low variable costs that scale
with your business. With the cloud, businesses no longer need to plan for and procure servers
and other IT infrastructure weeks or months in advance. Instead, they can instantly spin up
hundreds or thousands of servers in minutes and deliver results faster. Today, AWS provides a
highly reliable, scalable, low-cost infrastructure platform in the cloud that powers hundreds of
thousands of businesses in 190 countries around the world.

1.5.1 Benefits of AWS

The key benefits of using AWS are as follows:

\ Keep Your Data Safe: The AWS infrastructure puts strong safeguards in place to help
protect your privacy. All data is stored in highly secure AWS data centers.

\ Meet Compliance Requirements: AWS manages dozens of compliance programs in its infrastructure. This means that segments of your compliance have already been completed.

\ Save Money: Cut costs by using AWS data centers. Maintain the highest standard of security without having to manage your own facility.


\ Scale Quickly: Security scales with your AWS Cloud usage. No matter the size of your
business, the AWS infrastructure is designed to keep your data safe.

1.5.2 AWS Services Used

Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides
secure, re-sizable compute capacity in the cloud. It is designed to make web-scale computing
easier for developers.

The Amazon EC2 simple web service interface allows you to obtain and configure capacity
with minimal friction. It provides you with complete control of your computing resources
and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time
required to obtain and boot new server instances (called Amazon EC2 instances) to minutes,
allowing you to quickly scale capacity, both up and down, as your computing requirements
change. Amazon EC2 changes the economics of computing by allowing you to pay only for
capacity that you actually use. Amazon EC2 provides developers and system administrators
the tools to build failure resilient applications and isolate themselves from common failure
scenarios.

Instance types

Amazon EC2 passes on to you the financial benefits of Amazon’s scale. You pay a very low

rate for the compute capacity you actually consume.

\ On-Demand Instances—With On-Demand instances, you pay for compute capacity by


the hour with no long-term commitments. You can increase or decrease your compute
capacity depending on the demands of your application and only pay the specified hourly
rate for the instances you use. The use of On-Demand instances frees you from the costs
and complexities of planning, purchasing, and maintaining hardware and transforms what
are commonly large fixed costs into much smaller variable costs. On-Demand instances
also remove the need to buy “safety net” capacity to handle periodic traffic spikes.

\ Reserved Instances—Reserved Instances provide you with a significant discount (up


to 75%) compared to On-Demand instance pricing. You have the flexibility to change
families, operating system types, and tenancies while benefiting from Reserved Instance
pricing when you use Convertible Reserved Instances.


\ Spot Instances—Spot Instances are available at up to a 90% discount compared to On-


Demand prices and let you take advantage of unused EC2 capacity in the AWS Cloud.
You can significantly reduce the cost of running your applications, grow your applica-
tion’s compute capacity and throughput for the same budget, and enable new types of
cloud computing applications.

Amazon RDS: Amazon Relational Database Service (Amazon RDS) makes it easy to set
up, operate, and scale a relational database in the cloud. It provides cost-efficient and re-sizable
capacity while automating time-consuming administration tasks such as hardware provisioning,
database setup, patching and backups. It frees you to focus on your applications so you can
give them the fast performance, high availability, security and compatibility they need.

Amazon RDS is available on several database instance types - optimized for memory, per-
formance or I/O - and provides you with six familiar database engines to choose from, including
Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. You can
use the AWS Database Migration Service to easily migrate or replicate your existing databases
to Amazon RDS.

The PostgreSQL database used in the project is created on AWS using RDS, so the data can be accessed remotely as per the requirements of the administrator.


CHAPTER 2

LITERATURE SURVEY

2.1 Introduction to Old Models

• The model proposed by [1] demonstrated important ML approaches to disease prediction, but it works on the K-Nearest Neighbour (KNN) and Convolutional Neural Network (CNN) machine learning approaches. Both KNN and CNN are used in that system, which differs from the approach used in our project. The CNN uses both structured and unstructured data for the prediction of the disease, which makes it more time consuming. The accuracy of the system proposed by [1] is very high, above 95% for the KNN algorithm and 100% for the CNN algorithm, which is unusually high for an ML model; in such cases the model is said to be overfitting.

Figure 2.1: Accuracy Of The Model

• The model proposed by [2] is used for disease prediction and uses different ML techniques such as iForest for correcting problems in the dataset and SMOTET for balancing the dataset, followed by an ensemble learning technique. The input to the ML model is


taken only from the electronic reports produced by a blood examination of the patient or user. Some of the inputs taken in this model are glucose level, cholesterol, lipoprotein, blood pressure and other values that can only be obtained through a physical examination of the user or patient.

• The model proposed by [3] uses big data analytics and deep learning models for the prediction of disease. Because the dataset is large, big data analytics such as MapReduce are used, and deep learning models are then applied on top for the prediction of the disease, which makes the overall process large and very time consuming. This model needs a full medical examination of the user or patient for the prediction of the disease. The full medical history of the patient or user is taken as input, stored with the help of big data tools and then used by the deep learning models to predict the disease. The model also needs all the medical records of the patient, such as the medications the patient or user was taking and the list of doctors he or she has visited, which help in proper analysis of the patient's problem.

• The model proposed by [4] uses different ML algorithms such as Random Forest, Logistic Regression and Decision Tree for disease prediction, and is used for the prediction of heart disease, breast cancer and diabetes. Each algorithm used in the system has its own way of predicting the disease and is used accordingly. Different datasets are used in this model for the different diseases; for example, heart disease has one dataset and breast cancer has another. For the different datasets the algorithms have different accuracies and are used as per their accuracy.

• The model proposed by [5] uses different data mining and classification algorithms for the prediction of disease. This model is mainly used for the prediction of heart disease, and the algorithms used are the Decision Tree and Naive Bayes algorithms; various data mining techniques are also used for correcting and balancing the dataset so that the system can work correctly and predict the correct disease. This model also needs the blood report of the patient or user. Some of the inputs used in this system are cholesterol level, blood pressure, glucose level in the body, etc. The model has an accuracy of 91% for the decision tree algorithm and 87% for the Naive Bayes algorithm, but has a very limited scope as it can only predict diseases related to the heart, diabetes and breast cancer and cannot predict general diseases.

• The model proposed by [6] makes use of the Support Vector Machine (SVM) technique of machine learning for the prediction of diseases. The dataset used in this model has some general lifestyle attributes such as eating habits and physical activity, rated between 1 and 5, where 1 is excellent and 5 is very bad. This model helps to predict whether a person's lifestyle is healthy or not and whether he or she has any


disease, but it does not predict the name of the disease or any specific problem the patient is facing. The data is collected from the user by means of a form and then used by the SVM for the prediction. The model is focused on the lifestyle of the user: whether the user is active or not, how much physical work he or she does in day-to-day life and how much stress he or she has, and on that basis the health of the user is predicted.

• The model proposed by [7] uses big data techniques and helps in the prediction of disorders such as thyroid and other chronic diseases. This model uses the Mahout and Hadoop stack of big data analytics for the prediction; Mahout provides the data mining techniques, which makes the system efficient and powerful. In this model, the Mahout part of the Hadoop system helps in the analysis of the data stored in HBase, and on that basis the diseases are predicted. The size of the dataset is very large, hence the overall system becomes very time consuming, and the system requirements to run it are also very high, so it needs a very fast processing environment for its functioning and disease prediction.


2.2 Identification of Research Gap and Problems

• The model given by [1] uses the KNN and CNN algorithms, which is more time consuming as it involves both structured and unstructured data, so the time taken to process the data is greater than for a dataset containing only structured data, as in the proposed project; the classification algorithms used in the proposed project are Decision Tree, Naive Bayes and Random Forest. The accuracy of the model given by [1] is above 90%, which is not good for an ML model as it is said to be overfitting, whereas the proposed model has an accuracy of about 86%, which is good enough for a disease prediction model.

• The model given by [2] has a very limited scope, as it is only meant for the prediction of diabetes and hypertension, whereas the proposed model is used for the prediction of basic general diseases. The model given by [2] needs the blood report of the patient or user for the prediction of diabetes or hypertension, and the algorithms used are ensemble learning techniques, whereas the proposed model does not need any blood report or physical presence of the user or patient. The proposed system contains a list of symptoms from which the user can select the symptoms he or she is facing and predict the disease very easily, and the algorithms used are different from the given model. The inputs required by the given model are based on the medical report of the user, such as cholesterol and blood glucose, whereas the proposed system does not require any type of blood report for the prediction of the disease.

• The model given by [3] uses a very big dataset, and to manage that dataset big data analytics are used, which makes the system slow and demanding in terms of system requirements; the deep learning algorithms used in that project are FISM, NAIS and DeepICF. This is different from the proposed model, which uses classification algorithms that are lightweight for a PC and run faster compared to heavy deep learning techniques and big data analytics, which take more time and space.

• The model given by [4] is best for the prediction of diseases related to breast cancer, diabetes and heart problems, and has a different dataset for each of the three kinds of disease. This is different from the proposed model, which helps in the prediction of general diseases with the help of symptoms and has a single dataset for all the diseases. The accuracy in some cases of the model given by [4] is very high, as shown in figure 2.2.


Figure 2.2: Accuracy Of The Breast Cancer Dataset

So the model given by [4] can be said to be overfitting, as the accuracy is too high, whereas the accuracy of the proposed model is about 86%, which is good for a disease prediction model.

• The model given by [5] uses the KNN algorithm for the prediction of heart-related diseases and uses parameters such as high cholesterol, high blood sugar, diabetes, smoking habits and excessive alcohol consumption as inputs. This model also gives information about cardiovascular diseases, cardiac arrest and many other heart-related problems; the efficiency of the system is highest for the decision tree algorithm, at about 91%. The proposed system, in contrast, has the capability to predict general diseases and is more helpful than a heart-disease-only prediction system, as it is useful for the prediction of many more diseases.

• The model given by [6] uses the SVM algorithm for prediction. This model is used to assess lifestyle and whether a person is suffering from any disease or not. The input to the model is given as a rating from 1 to 5, where 1 is excellent and 5 is very bad, and the attributes rated include lack of physical activity, obesity, stress, smoking, etc., as shown in figure 2.3.


Figure 2.3: Dataset For [6]

In contrast, the proposed model's dataset is encoded as 0s and 1s indicating whether a symptom is present or not, which helps in predicting the disease in a better way. The given model only predicts the lifestyle of a person, whether he or she is physically active or not, and the chances that the person is prone to a disease, whereas the proposed system predicts the disease itself.

• The model given by [7] uses a very big dataset and uses big data analytics for the prediction of disorders. This model uses Mahout on the Hadoop file system for the prediction, as Mahout contains the data mining and analysis techniques required to predict the disorders; but since there is a huge amount of data associated with this model, it is hard to process all the data for the predictions and the overall speed of the system becomes slow. This model helps in the prediction of chronic disorders like thyroid and also needs a medical examination of the user, whereas the proposed system is fast, as it has lightweight algorithms and higher efficiency, and helps in the prediction of the more commonly occurring diseases which can otherwise result in bigger problems later for the user or patient. The proposed system also helps in getting medical advice from a doctor, as doctors are also registered with the system, which helps in better diagnosis of the disease and in getting medical treatment.


CHAPTER 3

SYSTEM ANALYSIS

3.1 Major Inputs Required

The inputs required for the project are:

• Software Inputs:

– Jupyter Notebook
– Python version 3
– Pip version 3
– Pip virtual environment
– Django version 2
– Postgresql version 10
– Pgadmin version 3

• Hardware Inputs:

– Windows/Linux/Mac OS
– At least 2 GB RAM
– At least 512 GB ROM
– At least an integrated graphics card

• User Inputs:

– Basic Details
– Symptoms


3.2 Feasibility Analysis

3.2.1 Technical Feasibility

The project is technically feasible, as it can be built using existing, available technologies. It is a web-based application that uses the Django framework. The technology required by the Disease Predictor is available, hence it is technically feasible.

3.2.2 Economic Feasibility

The project is economically feasible, as the only cost involved is in hosting the project. As the number of data samples increases, more time and processing power are consumed; in that case a better processor might be needed.

3.2.3 Operational Feasibility

The project is operationally feasible, as it only requires the user to have basic knowledge of computers and the Internet. The Disease Predictor is based on a client-server architecture, where the clients are the users and the server is the machine where the dataset and project are stored.


3.3 Algorithms Used

In this project different ML algorithms are used, and several data mining techniques are also used to check whether the dataset is balanced and whether the data is structured for disease prediction. The various ML algorithms used in this project are:

• Logistic Regression: Logistic regression is used to find the probability of an event, i.e. whether the event is going to occur or not. It is used in statistics to find the probability of occurrence of an event; for example, the probability that a school will open is either 1 or 0, where 1 means the school will open and 0 means the school will remain closed. Logistic regression uses the sigmoid function:
\sigma(t) = \dfrac{1}{1 + e^{-t}} \qquad (3.1)

Figure 3.1: Sigmoid Activation Function

There are different types of logistic regression in machine learning, each with specific uses:

– Binary Logistic Regression: In binary logistic regression there are only two cases, i.e. the event either happens or it does not.
– Multinomial Logistic Regression: In multinomial logistic regression there can be three or more possibilities; for example, a person can be purely vegetarian, purely non-vegetarian, or may consume both. This type of situation comes under multinomial logistic regression.


– Ordinal Logistic Regression: Ordinal logistic regression works when there are 3 or more ordered categories on which the logistic regression is to be applied, for example rating the food of a hotel from 1 to 10, where 1 is best and 10 is worst.
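As a small illustration (a sketch with toy data, not the project's code), a binary logistic regression can be fitted with scikit-learn; its predict_proba returns the probability of the event, which for the binary case is the sigmoid of equation 3.1 applied to the learned linear score.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy binary data: one feature, label 1 means the event occurs.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Probability of the event for a new observation, computed via the sigmoid.
print(clf.predict_proba([[2.0]])[0][1])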

• Naive Bayes Classifier: Naive Bayes is used as a classification technique in ML; it classifies items and gives an answer on the basis of that classification. The working of the Naive Bayes classifier is based on Bayes' theorem from statistics, which states that:

P(X \mid Y) = \dfrac{P(Y \mid X)\,P(X)}{P(Y)} \qquad (3.2)

Using Bayes' theorem we can find the probability of X occurring when the probability of Y's occurrence is given. For example:

Table 3.1: Prediction Using Naive Bayes

S.No.  Temperature  Weather   Rain  Humid  Play Tennis
1      Hot          Sunny     No    High   Yes
2      Mild         Overcast  Yes   Low    Yes
3      Mild         Rainy     Yes   High   No
4      Cold         Rainy     Yes   High   No
5      Hot          Sunny     No    Low    Yes

This is the kind of input given to the Naive Bayes classifier for prediction. On this kind of tabular input, Bayes' theorem is used by the ML algorithm; with this table, the prediction can be made, for example, as the probability of playing tennis when the temperature is mild, the weather is rainy, rain is yes and humidity is high. The prediction is done in the same way in the ML model. There are three different types of Naive Bayes classifier:

– Multinomial Naive Bayes: This is mainly used for the classification of large files and documents, dividing them into different categories, e.g. whether a file or document belongs to the sports, politics, technology or some other category. This method of classification is very widely used in machine learning.
– Bernoulli Naive Bayes: This method of classification is very similar to multinomial Naive Bayes, but it operates on binary features, i.e. each feature is either yes or no.
– Gaussian Naive Bayes: This type of Naive Bayes is used for prediction with continuous values, while the other two types are used for discrete values.

In this way the Naive Bayes Classifier is used in different ways in the ML models for the
process of prediction and analysis.
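A minimal sketch of the multinomial variant used in this project is shown below; the 0/1 symptom vectors, symptom names and disease labels are hypothetical toy data, not the project's dataset.

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Hypothetical 0/1 symptom vectors (columns: fever, cough, headache, fatigue).
X = np.array([
    [1, 1, 0, 0],   # fever, cough
    [1, 0, 1, 0],   # fever, headache
    [0, 1, 0, 1],   # cough, fatigue
    [1, 1, 1, 0],
])
y = np.array(["Flu", "Migraine", "Cold", "Flu"])

nb = MultinomialNB()
nb.fit(X, y)

# Predicted disease and its probability for a new symptom vector.
sample = np.array([[1, 1, 0, 0]])
print(nb.predict(sample)[0])
print(nb.predict_proba(sample).max())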


• Decision Tree Algorithm: The decision tree algorithm is a type of supervised learning algorithm that can be used for classification as well as regression; it is capable of solving both kinds of problems. In a decision tree, to predict a value we start from the root of the tree, follow a sub-tree from that root, and finally arrive at a conclusion, which is nothing but the predicted value. A detailed overview of the decision tree is given below:

Figure 3.2: Decision Tree Overview

A decision tree helps in classification as it sorts the values while traversing down the tree and predicts the right value. The decision tree starts from the root node; at each step of moving down the tree, the information gain and entropy are calculated, the branch with the lowest entropy (highest information gain) is selected, and the information gain and entropy are calculated again.
\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (3.3)

\text{Information gain} = \text{entropy(parent)} - [\text{average entropy(children)}] \qquad (3.4)


When the entropy is high, the randomness of the dataset is high and it is hard to predict the answer, whereas the information gain tells us how well structured the dataset is; if the information gain is high, it is easy to predict the answer. These steps are repeated until the decision tree reaches a leaf node, i.e. the final value predicted by the tree. The different types of decision trees are:

– Categorical Variable Decision Tree: This type of decision tree works on discrete, whole values, for example when we have to choose between 0 and 1 or some other categorical value.

22

Downloaded by YAGESH PANDEY (yageshpandey88@gmail.com)


lOMoARcPSD|25553714

– Continuous Variable Decision Tree: This type of decision tree works on continuous values, which is why it is known as a continuous variable decision tree.

In this way the decision tree algorithm is used in classification and regression problems.
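The entropy and information-gain calculations of equations 3.3 and 3.4 can be sketched as follows; the parent/children split below is a toy example, not data from the project.

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the classes present (equation 3.3).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    # Information gain = entropy(parent) minus the weighted average entropy
    # of the children produced by the split (equation 3.4).
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy example: a split that separates the two classes completely.
parent = ["Yes", "Yes", "Yes", "No", "No"]
children = [["Yes", "Yes", "Yes"], ["No", "No"]]
print(entropy(parent))                      # about 0.971
print(information_gain(parent, children))   # 0.971, since both children are pure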

• Random Forest Algorithm: The random forest algorithm is a type of supervised learning algorithm used for both classification and regression. The basic idea behind it is the decision tree algorithm: the algorithm creates multiple decision trees and then combines their predictions, as shown in figure 3.3.

Figure 3.3: Random Forest Overview

As shown in the figure, there are multiple decision trees in a random forest. Each tree is trained on a random subset of the data and features, and the forest combines the predictions of all the trees (by majority vote for classification) to produce the final prediction.
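A minimal random forest sketch with scikit-learn is shown below, assuming the same kind of 0/1 symptom matrix as in the earlier sketches (toy data, not the project's dataset); the forest aggregates the trees' predictions to return the final class.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Toy 0/1 symptom vectors and disease labels.
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]])
y = np.array(["Flu", "Cold", "Flu", "Cold", "Flu", "Cold"])

# 100 decision trees, each trained on a bootstrap sample of the data.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.predict([[1, 0, 1]])[0])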


CHAPTER 4

SYSTEM DESIGN

The whole project can be divided into two parts, the machine learning model and the user interface, which are elaborated below:

4.1 Machine Learning Model

Figure 4.1: Detailed Design Of Model (blocks: symptoms, dataset, data pre-processing, machine learning algorithm, disease, test data, test data pre-processing, predictive model, predicted disease)

Data mining techniques are used in the project to see whether the dataset is good for prediction or not. The various data mining libraries used in the project are:

1. Scipy: This is used for scientific computing in the Python programming language. It is a collection of mathematical algorithms and convenience functions built on NumPy. Some of the functionality it provides includes Special Functions (special),


Integration (integrate), Optimization (optimize), Interpolation (interpolate), Fourier Transforms (fft), Signal Processing (signal), Linear Algebra (linalg), Statistics (stats), File IO (io), etc. In this project the stats (Statistics) library of this package is primarily used.

2. Sklearn: This stands for scikit-learn and is built on the Scipy package. It is the primary package used in this project. It provides an interface for supervised and unsupervised learning algorithms. The following groups of models are provided by sklearn: Clustering, Cross Validation, Datasets, Dimensionality Reduction, Ensemble methods, Feature extraction, Feature selection, Parameter Tuning, Manifold Learning and Supervised Models.

3. Numpy : It is a library for the Python programming language, adding support for multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays. It provides functions for Array Objects, Routines,
Constants, Universal Functions, Packaging etc. In this project it is used for performing
multi-dimensional array operations.

4. Pandas: This library provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It provides functionality such as table manipulation, creating plots, calculating summary statistics, reshaping tables, combining data from tables, handling time series data and manipulating textual data. In this project it is used for reading CSV files, comparing the null and alternative hypotheses, etc.

5. Matplotlib: It is a library for creating static, animated, and interactive visualizations in the Python programming language. In this project it is used for creating simple plots and sub-plots, and its object is used alongside the seaborn object to employ certain functions such as show, grid, etc. The %matplotlib inline directive is also used to display plots right below the cells that create them.

6. Seaborn: It provides an interface for making graphs that are more attractive and interactive in nature. It is based on the matplotlib module. These graphs can be dynamic and are much more informative and easier to interpret. It provides different presentation formats for data, such as Relational, Categorical, Distribution, Regression and Multiples, along with the style and color of all these types. In this project it is used for creating complex plots that use various attributes.

7. Warnings: The warnings module is used for handling any warnings that may arise when the program is running. The Warning class is a subclass of Exception.

8. Stats: This library incorporates statistics functionality in the Python programming language. It is included in the scipy package. The library is not imported as a whole; rather, the required functions are imported as and when required, e.g. for measures of central tendency and measures of variability. The functions used range from simple concepts like mean, median, mode, variance, percentiles, skew, kurtosis, range, Cumulative


Distribution Function (CDF), Probability Density Function (PDF) and stats (used for returning mean, variance, skew, kurtosis), to complex hypothesis tests like chi2_contingency (used for the chi-square test), ttest_ind (used for performing the t-test) and ks_2samp (used for performing the Kolmogorov-Smirnov test), etc.

9. Model selection: This library helps in choosing the best model. It is present in the sklearn package. It is used for functions like train_test_split, which splits the data into train and test datasets and helps in improving the accuracy of the model, and cross_val_score, which is used for computing the accuracy of a model. It also includes functions for techniques that improve the accuracy of a model, like the K-Fold cross-validation used in this project. It works together with linear_model, SVM, etc.

10. Naive Bayes: It is a library built for implementing the Naive Bayes algorithm. It is also defined in the sklearn package. In this project the multinomial variant is used, and it is one of the most crucial algorithms in the project.

11. Tree: This library comprises the functionality and concepts associated with trees. It is included in the sklearn package. The most important algorithm in this library is the Decision Tree Classifier, which gives very high accuracy and is one of the most used algorithms for projects like this.

12. Linear model : It is the library that implements the Regression algorithms. It is also in-
cluded in the sklearn package. It is used in this project to implement Logistic Regression.

13. Ensemble: This library includes the ensemble methods. It is defined in the sklearn package. As ensemble techniques are used to improve the accuracy of models, the Gradient Boosting Classifier and the Random Forest Classifier (a very important algorithm that provides very high accuracy) are used in this project.

14. Metrics: This library is used for presenting the accuracy of the model. The function accuracy_score is the most basic of them all. It is included in the sklearn package.

15. Joblib: It is a package that provides lightweight pipelining in the Python programming language. It is included in the externals package within the main sklearn package. It is used for transparent disk-caching of output values, simple parallel computing, and logging and tracing of execution. In this project it is used to persist the model so that the user interface can interact with it and perform operations accordingly.

All these libraries are used to create a model with the help of the dataset. The model created by applying these data mining techniques is stored as a binary file, so that the model is protected from modification and other security threats to the system. The binary file of the model cannot be opened directly, as it does not have any extension. This binary file is created with the help of the joblib library and is then used while creating the UI.
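A minimal sketch of how a trained model might be persisted as the extension-less binary file described above is shown below; the file name, training data and use of the standalone joblib import are illustrative assumptions, not the project's exact code.

from sklearn.naive_bayes import MultinomialNB
import joblib
import numpy as np

# Toy training data standing in for the real symptom matrix and labels.
X_train = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_train = np.array(["Flu", "Cold", "Flu"])

model = MultinomialNB().fit(X_train, y_train)

# Serialise the trained model to an extension-less binary file.
joblib.dump(model, "trained_model")

# Later (for example inside the Django application) the file can be loaded back.
restored = joblib.load("trained_model")
print(restored.predict([[1, 0, 1]])[0])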


4.2 User Interface

The UI is developed with the help of Django. The ML model developed using the data mining techniques and ML algorithms is used in the UI. The same version of the joblib library used for the data mining work is installed in the Python environment in which Django is running; after that, the binary file of the model can be used for the prediction of disease.
The various modules included in the project are:
• Module 1: Creating a Linux machine Elastic Compute Cloud (EC2) instance on AWS.

• Module 2: Creating a postgres RDS instance for the database on AWS.

• Module 3: Downloading the project on EC2 instance and creating a suitable python3
environment with django installed in it.

• Module 4: Downloading and installing pgadmin and accessing the remote RDS database on the local machine.

• Module 5: Creating the migrations and launching the project.

• Module 6: Accessing the project in a web browser via the public IP of the EC2 instance.

• Module 7: Creating a user account and predicting the disease.

All these modules are needed to make the project work properly and fluently; if any of these modules is missed, the project will not work properly. These module steps cover the tasks to be done to access the UI of the project, and the ML model is implemented in this UI for the prediction of disease.
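A minimal sketch of how the UI layer might load the joblib binary and return a prediction from the symptoms selected by the user is shown below; the file path, form field name, template name and the 132-symptom vector layout are hypothetical assumptions, not the project's exact code.

import joblib
import numpy as np
from django.shortcuts import render

# Load the serialised model once when the module is imported (hypothetical path).
MODEL = joblib.load("trained_model")

def predict_disease(request):
    if request.method == "POST":
        # Hypothetical form field: a list of selected symptom indices out of 132.
        selected = request.POST.getlist("symptoms")
        vector = np.zeros((1, 132))
        for index in selected:
            vector[0, int(index)] = 1
        disease = MODEL.predict(vector)[0]
        probability = MODEL.predict_proba(vector).max()
        return render(request, "predict.html",
                      {"disease": disease, "probability": probability})
    return render(request, "predict.html")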


4.2.1 Module Description

• Module 1: In this module we create a Linux machine on EC2. For that, the following steps are to be followed:

– We create an AWS account on https://aws.amazon.com, or log in if the account is already created.
– Access the console of the AWS account as in figure 4.2, find EC2 in the Services tab and click on it.

Figure 4.2: AWS Console

– Select the Ubuntu 64-bit AMI as in figure 4.3, follow the instructions to create an EC2 instance, and check whether it is live as in figure 4.4.

Figure 4.3: Ubuntu AMI


Figure 4.4: EC2 Instance Is Live

– Connect to the EC2 instance with the help of Secure Shell (SSH) as shown in figure
4.5.

Figure 4.5: Connecting Through SSH


• Module 2: In Module 2 we use the RDS service of AWS to create a PostgreSQL database.
The steps required to create a Postgres database instance for storing the project login
details and to launch it on AWS are as follows:

– In the AWS console look for the RDS service and click on it.
– After selecting RDS, choose the PostgreSQL option on the screen as shown in
fig:4.6; the additional settings related to Postgres will then appear on the screen.

Figure 4.6: Postgres And Other DB Options In AWS

– Fill in the essential details such as the master username and password for Postgres as
in fig:4.7, and select the same security group as that of the EC2 instance.

Figure 4.7: Postgres Master Username


– Enter the name of the database you want to create in Postgres as in fig:4.8; this is the
database in which the data related to the project will be saved. Then click on create
instance, which creates the Postgres instance.

Figure 4.8: Database Name And Details

– The Postgres instance provides an Amazon Resource Name (ARN), which is used
to access the instance.

• Module 3: Uploading the project with the Secure Copy (SCP) command takes a lot of
time, so the project is first uploaded to a version control system such as Git or to a storage
service such as AWS S3 or Dropbox. We uploaded the project to Dropbox, from where a
few basic commands download it to the EC2 instance: the very simple "wget" command
downloads the project as a zip file, and the "unzip" command then extracts it.
The project runs in a Python environment, so we installed python3, pip and venv, the
last of which is used to create a separate virtual environment for the project. All the
required libraries are installed in that virtual environment, and Django is installed as well
so that the project can run without any error.

• Module 4: Module 4 is about accessing the database on the local machine by installing
pgAdmin. pgAdmin is used for managing the Postgres database: once the Postgres
instance is running on AWS it can be altered through pgAdmin using the ARN provided
by the running RDS Postgres instance. To access the remote database on the local machine
the following steps are followed:


– Download pgAdmin from the official Postgres website; in our case the local system
also runs Ubuntu, so pgAdmin 3 is available in the app store and can be installed
directly.
– Open pgAdmin and click the add-connection button in the left corner.
– On clicking add connection a new window appears, as shown in fig:4.9, which asks
for a name (enter the master username), a host (enter the ARN of the instance) and
the password.

Figure 4.9: Pgadmin Details To Access Instance

– After filling in all the details, connect to the instance; the DB instance then opens as
shown in fig:4.10 and the user can alter the details as per his or her choice.

Figure 4.10: Postgres Instance In Pgadmin


– The Postgres instance accessed through pgAdmin also contains the pre-created
database whose name was given while creating the DB instance on AWS.

• Module 5: This module is about creating the migrations. Migrations create the schema
of the tables required in the database. Before starting the project the migrations are made
(with manage.py makemigrations and migrate) so that the schema of all the tables required
by the project appears in the database automatically. These migrations are based on the
models.py file of the project: in models.py a class is created for each table and its fields
are defined as attributes of that class, an approach known as the Object-Relational Mapper
(ORM). The ORM is a very helpful feature of Django that makes creating tables in the
database simple and easy, and to push data into the database a model object is created and
saved (a small sketch follows the models.py listing in Chapter 5).
For running the project the manage.py file is used; Django has its own built-in web server
for running the project. As soon as the web server is started the project is up and running,
and the port number of the Django web server is 8000.

• Module 6: When the project is running on the Django web server it can be accessed
globally through the public IP of the EC2 instance, which is given in the instance details.
Simply open a browser on any mobile phone or PC and type that IP followed by the port
number 8000, since the Django web server runs on that particular port only.

• Module 7: On accessing the project in a web browser the following tasks can be
performed:

– The user can register on the system by entering some basic details.
– After the prediction of disease, users can consult doctors regarding their problems,
as in fig:4.11.

Figure 4.11: Doctor Consultation

– Doctors can register themselves on the system, entering their details and qualifications,
in order to help the patients.
– The patient dashboard has options for disease prediction, consultation history and
feedback, as shown in fig:4.12.


Figure 4.12: Patient Dashboard

– The doctor's dashboard has options for consultation history and feedback, as shown
in fig:4.13; a small sketch of how such role-based dashboards can be routed in Django
follows the figure.

Figure 4.13: Doctor’s Dashboard
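To illustrate how the two dashboards can be separated, a minimal sketch of the role-based routing
is given below. The view name and template paths are assumptions for illustration only; the
is_patient and is_doctor flags come from the models shown in Chapter 5.

from django.shortcuts import render

def dashboard(request):
    # illustrative sketch: send a logged-in user to the dashboard matching their role,
    # using the reverse one-to-one accessors created by the models in Chapter 5
    if hasattr(request.user, 'patient') and request.user.patient.is_patient:
        return render(request, 'patient/dashboard.html')    # assumed template path
    elif hasattr(request.user, 'doctor') and request.user.doctor.is_doctor:
        return render(request, 'doctor/dashboard.html')     # assumed template path
    return render(request, 'homepage/index.html')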


CHAPTER 5

CODING

The coding part of the project can be divided into two phases: the machine learning model and
the coding of the UI.

5.1 Machine Learning Model

• Libraries Used:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, KFold   # used in the later snippets
from sklearn.tree import DecisionTreeClassifier                # decision tree model (dt)
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

• Reading the dataset and finding its shape:

df = pd.read_csv('Training.csv')
df.shape

• Finding the null values and checking whether the dataset is balanced or not:

df.isnull().sum().sort_values(ascending=False)
df['prognosis'].value_counts(normalize=True)

• Checking the correlation between the symptoms:

corr = df.corr()
mask = np.array(corr)
mask[np.tril_indices_from(mask)] = False
plt.figure(figsize=(16, 20))   # large canvas so the 132 symptom labels stay readable
sns.heatmap(corr, mask=mask, vmax=.9, square=True, annot=True, cmap="YlGnBu")

• Dividing into training and testing:

x = df.drop(["prognosis"], axis=1)   # 132 symptom columns as features
y = df["prognosis"]                  # disease label as target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

• Cross-validation of the algorithms:

def evaluate(train_data, kmax, algo):
    test_scores = {}
    train_scores = {}
    for i in range(2, kmax, 2):
        kf = KFold(n_splits=i)
        sum_train = 0
        sum_test = 0
        data = train_data
        for train, test in kf.split(data):
            train_fold = data.iloc[train, :]
            test_fold = data.iloc[test, :]
            x_train = train_fold.drop(["prognosis"], axis=1)
            y_train = train_fold["prognosis"]
            x_test = test_fold.drop(["prognosis"], axis=1)
            y_test = test_fold["prognosis"]
            algo_model = algo.fit(x_train, y_train)
            sum_train += algo_model.score(x_train, y_train)
            y_pred = algo_model.predict(x_test)
            sum_test += accuracy_score(y_test, y_pred)
        average_test = sum_test / i
        average_train = sum_train / i
        test_scores[i] = average_test
        train_scores[i] = average_train
        print("k value:", i)
    return train_scores, test_scores

• Finding the test and train scores of the algorithms:

max_kfold = 11
algo_dict = {"decision_tree": DecisionTreeClassifier(), "naive_bayes": MultinomialNB()}  # assumed candidates
algo_train_scores = {}
algo_test_scores = {}
for algo_name in algo_dict.keys():
    print(algo_name)
    tr_score, tst_score = evaluate(df, max_kfold, algo_dict[algo_name])
    algo_train_scores[algo_name] = tr_score
    algo_test_scores[algo_name] = tst_score

print(algo_train_scores)
print(algo_test_scores)

• Creating the model for the best value of K:

dt = DecisionTreeClassifier()   # decision tree chosen as the final model
test_scores = {}
train_scores = {}
for i in range(2, 4, 2):
    kf = KFold(n_splits=i)
    sum_train = 0
    sum_test = 0
    data = df
    for train, test in kf.split(data):
        train_data = data.iloc[train, :]
        test_data = data.iloc[test, :]
        x_train = train_data.drop(["prognosis"], axis=1)
        y_train = train_data["prognosis"]
        x_test = test_data.drop(["prognosis"], axis=1)
        y_test = test_data["prognosis"]
        dt.fit(x_train, y_train)
        sum_train += dt.score(x_train, y_train)
        y_pred = dt.predict(x_test)
        sum_test += accuracy_score(y_test, y_pred)
    average_test = sum_test / i
    average_train = sum_train / i
    test_scores[i] = average_test
    train_scores[i] = average_train
    print("k value:", i)

• Saving the model in binary form:

from sklearn.externals import joblib   # in newer scikit-learn versions: import joblib

joblib.dump(dt, 'my_model_for_healthcare')

• Testing the model in the ipynb file:

nb = MultinomialNB().fit(x_train, y_train)   # assumed: Naive Bayes fitted for the confidence score

a = list(range(2, 134))          # one slot per symptom, serial numbers 2 to 133
i_name = input('Enter your name :')
i_age = int(input('Enter your age:'))
for i in range(len(x.columns)):
    print(str(i + 2) + ":", x.columns[i])
choices = input('Enter the serial nos. of the symptoms you have: ')
b = [int(s) for s in choices.split()]
count = 0
while count < len(b):
    item_to_replace = b[count]
    replacement_value = 1
    indices_to_replace = [i for i, v in enumerate(a) if v == item_to_replace]
    count += 1
    for i in indices_to_replace:
        a[i] = replacement_value
a = [0 if v != 1 else v for v in a]
print(a)
y_diagnosis = dt.predict([a])
y_pred_2 = nb.predict_proba([a])
print(('Name of the infection = %s , confidence score of : = %s'
       % (y_diagnosis[0], y_pred_2.max() * 100)), '%')
print('Name = %s , Age : = %s' % (i_name, i_age))


5.2 User Interface

• Installing Django:

$sudo apt-get update
$sudo apt-get install python3
$sudo apt-get install python3-pip
$sudo apt-get install python3-venv
$mkdir project
$cd project
$python3 -m venv env
$source env/bin/activate
$pip install django

• Settings for PostgreSQL in settings.py:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'predico',
        'USER': 'postgres',
        'PASSWORD': '1234',
        'HOST': 'localhost/AWS ARN',
    }
}

• urls.py routing requests to the app in which the ML model is used:

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path("", include("main_app.urls")),
    path("accounts/", include("accounts.urls")),
    path("", include("chats.urls")),
]

• Loading the trained model in views.py:

import joblib as jb
model = jb.load('trained_model')

• Basic function to render a page:


from django.shortcuts import render

def home(request):
    if request.method == 'GET':
        if request.user.is_authenticated:
            return render(request, 'homepage/index.html')
        else:
            return render(request, 'homepage/index.html')

• Function to predict the disease:

diseaselist = [ALL THE DISEASES]      # placeholder for the full list of disease names
symptomslist = [ALL THE SYMPTOMS]     # placeholder for the full list of 132 symptom names

# psymptoms is assumed to hold the symptom names the user selected in the form
testingsymptoms = []

# append zero in all column fields...
for x in range(0, len(symptomslist)):
    testingsymptoms.append(0)

# update 1 where a symptom gets matched...
for k in range(0, len(symptomslist)):
    for z in psymptoms:
        if z == symptomslist[k]:
            testingsymptoms[k] = 1

inputtest = [testingsymptoms]
print(inputtest)

predicted = model.predict(inputtest)
print("predicted disease is : ")
print(predicted)

y_pred_2 = model.predict_proba(inputtest)
confidencescore = y_pred_2.max() * 100
print(" confidence score of : = {0}".format(confidencescore))

confidencescore = format(confidencescore, '.0f')
predicted_disease = predicted[0]
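The predicted disease and confidence score computed above are then typically handed to a
template for display. A minimal sketch of that last step is given below; the helper name, template
path and context keys are assumptions for illustration, not taken from the project code.

from django.shortcuts import render

def show_prediction(request, predicted_disease, confidencescore):
    # hypothetical helper: renders the prediction result page with the computed values
    context = {
        'predicted_disease': predicted_disease,
        'confidencescore': confidencescore,
    }
    return render(request, 'patient/predict_result.html', context)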

• models.py for creating the table schema in Postgres:

class patient(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE, primary_key=True)
    is_patient = models.BooleanField(default=True)
    is_doctor = models.BooleanField(default=False)

    name = models.CharField(max_length=50)
    dob = models.DateField()
    address = models.CharField(max_length=100)
    mobile_no = models.CharField(max_length=15)
    gender = models.CharField(max_length=10)


class doctor(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE, primary_key=True)
    is_patient = models.BooleanField(default=False)
    is_doctor = models.BooleanField(default=True)

    name = models.CharField(max_length=50)
    dob = models.DateField()
    address = models.CharField(max_length=100)
    mobile_no = models.CharField(max_length=15)
    gender = models.CharField(max_length=10)

    registration_no = models.CharField(max_length=20)
    year_of_registration = models.DateField()
    qualification = models.CharField(max_length=20)
    State_Medical_Council = models.CharField(max_length=30)
    specialization = models.CharField(max_length=30)
    rating = models.IntegerField(default=0)
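As noted in Module 5, rows are pushed into these tables by creating a model object and saving
it through the ORM. A small sketch is shown below; the field values and username are
hypothetical, and the code assumes it is run inside the configured Django project (for example
from the manage.py shell).

import datetime
from django.contrib.auth.models import User

# hypothetical example data: create the linked auth user first, then the patient row
u = User.objects.create_user(username='demo_patient', password='demo_pass')
p = patient(user=u, name='Demo Patient', dob=datetime.date(1995, 1, 1),
            address='Chennai', mobile_no='9999999999', gender='Male')
p.save()   # issues an INSERT against the Postgres database configured in settings.py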


CHAPTER 6

CONCLUSION

• The expected outcome of the project is delivered by the machine learning models used to
predict the disease from the input provided by the user, namely the symptoms selected
from the list of symptoms shown to the user as in fig:6.1.

Figure 6.1: Symptoms Selection List

• The system also provides a UI, which makes it easier, more understandable and more
interactive for a user to operate the system and predict the disease from the input provided.

• The UI is made in such a manner that when the project is accessed in a mobile web
browser the functions and tabs on the page adapt to the size of the screen, so the site does
not create any problem while being accessed on a mobile browser.

• The system also lets the user contact or consult a doctor once the predicted result, which
is based purely on the input provided by the user, is shown as in fig:6.2; the user can then
take the help of a specific doctor who is a specialist in the predicted disease and is
registered with the disease prediction system.


Figure 6.2: Doctor Consultation

• Doctors are registered with the system on the basis of their working area and address the
patients whose problems are related to their department.

• Upon signing in to their account, doctors can see how many patients have approached
them along with the patients' details; they can interact with the users, respond to their
specific queries and also suggest medications that may help.

• The project is designed to be user friendly, and essential machine learning models such as
Naive Bayes, Random Forest and Decision Tree are applied, which help in predicting the
disease in a better and easier way.

• The dataset used in the project contains 132 symptoms covering almost every kind of
general disease, with the data encoded as 0 and 1, where 1 means the symptom is present
and 0 means it is not.

• The system also shows the probability of the disease, that is, how likely it is that the user
is suffering from the predicted disease, which helps in better identification of the disease
and a better diagnosis, as in figure 6.3.

Figure 6.3: Predicted Disease


• Since the result obtained is specific and depends on the symptoms the user provides,
which should be entered accurately and with prior knowledge, the risk of misinterpreting
the disease is low.

• The accuracy of the disease prediction is about 86%, so we can say that the model is not
over-fitting and predicts the correct output.


CHAPTER 7

FUTURE ENHANCEMENT

In today's world most data is computerised and distributed, yet it is not utilised properly. By
analysing the data that is already available we can also uncover unknown patterns. The primary
motive of this project is the prediction of diseases with a high rate of accuracy. For predicting
the disease we use machine learning algorithms such as logistic regression and naive Bayes
through scikit-learn. The future scope of this work is the prediction of diseases using more
advanced techniques and algorithms with lower time complexity.

A technology called Computer-Aided Diagnosis (CAD) is also beneficial, as such systems are
sometimes better diagnosticians than doctors. Machine learning and its different branches are
used in cancer detection as well, where they assist in making decisions on critical cases and
therapies. Artificial intelligence plays an important role in the development of many health-
related procedures and methods and is now common in surgery, for example robotic surgery.
Since the population keeps growing, we need technology that can help us meet the expectations
of patients: their flawless cure, their better health, and smooth, easily approachable access to
healthcare so that they heal and get well soon.

