
CHAPTER 1

INTRODUCTION

Disease Prediction using Machine Learning is a system that predicts a disease based on the information provided by the user. From the symptoms that the patient or user enters, the system provides an accurate prediction of the disease he or she may have, which is useful when the condition is not serious and the user simply wants to know what it is. The system also gives the user tips for maintaining good health alongside the prediction itself. Nowadays the health industry plays a major role in curing patients' diseases, so such a system benefits both sides. A user who does not want to travel to a hospital or clinic can simply enter the symptoms and any other useful information to learn which disease he or she is affected by, while the health industry can ask patients for their symptoms, enter them into the system, and obtain a reasonably accurate diagnosis within a few seconds.

The Centers for Medicare and Medicaid Services reported that 50% of Americans have multiple chronic diseases, which led US health care to spend around $3.3 trillion in 2016, amounting to $10,348 per person. Moreover, the World Health Organization and the World Economic Forum reported that India lost a huge sum of $236.6 billion by 2015 to fatal diseases caused by malnutrition and unhealthy lifestyles. Such expenditures reveal how prone people are to a spectrum of diseases, and how vital it is to detect diseases early in order to reduce the fatality of these maladies. In addition, early disease prediction can lessen the financial pressure on the economy and ensure better maintenance of the overall well-being of the community.

According to Yuan, ML algorithms are highly susceptible to errors because of two factors. First, they depend on the quality and selection of the datasets, which is crucial for achieving accurate and unbiased decisions. Second, ML algorithms rely heavily on the right selection of features extracted from the dataset, which has proved to be difficult, time-consuming, and computationally demanding. These factors hinder the performance of the learning model and can generate fatal errors that endanger the lives of patients. In contrast, Ismaeel argued that standard statistical techniques, together with the work experience and intuition of medical doctors, lead to undesirable biases and errors when detecting the risks associated with a disease. For this reason, advanced computational methodologies such as ML algorithms were introduced to discover meaningful patterns and hidden information in data, which can be used for critical decision making. As a consequence, the burden on the medical staff decreased, while the survival rate of patients improved.

Medicine and healthcare are among the most crucial parts of the economy and human life. There is a tremendous amount of change between the world we are living in now and the world that existed only a short while ago; everything has turned gruesome and divergent. In this situation, where everything has turned virtual, doctors and nurses are putting in maximum effort to save people's lives, even if they have to endanger their own. There are also remote villages that lack medical facilities. Virtual doctors are board-certified doctors who choose to practice online via video and phone appointments rather than in person, but this is not possible in the case of an emergency. Machines are often considered better than humans because, free of human error, they can perform tasks more efficiently and with a consistent level of accuracy. A disease predictor can therefore be called a virtual doctor, one that predicts the disease of any patient without human error. Also, in conditions like COVID-19 and Ebola, a disease predictor can be a blessing, as it can identify a person's disease without any physical contact. Some models of virtual doctors do exist, but they do not reach the required level of accuracy because not all of the required parameters are considered. The primary goal was therefore to develop numerous models and determine which one of them provides the most accurate predictions. While ML projects vary in scale and complexity, their general structure is the same. Several rule-based techniques were drawn from machine learning to guide the development and deployment of the predictive model. Several models were initiated using various machine learning (ML) algorithms that collected raw data and then bifurcated it according to gender, age group, and symptoms. The dataset was then processed by several ML models: Fine, Medium, and Coarse Decision trees, Gaussian Naïve Bayes, Kernel Naïve Bayes, Fine, Medium, and Coarse KNN, Weighted KNN, Subspace KNN, and RUSBoosted trees. The accuracy varied across the ML models. While processing the data, the input parameter dataset was supplied to every model, and the disease was received as an output with differing accuracy levels. The model with the highest accuracy was selected.

3
Fig. 1. Proposed system for disease prediction

The doctor may not always be available when needed, but in the modern scenario one can use this prediction system anytime, according to necessity. The symptoms of an individual, along with age and gender, are given to the ML model for further processing. After preliminary processing of the data, the ML model uses the current input to train and test the algorithm, resulting in the predicted disease.

Fig. 2 Proposed system flow diagram during training

The dataset, consisting of the gender, symptoms, and age of an individual, was preprocessed and fed as input to different ML algorithms for the prediction of the disease. The ML models used were Fine, Medium, and Coarse Decision trees, Gaussian Naïve Bayes, Kernel Naïve Bayes, Fine, Medium, and Coarse KNN, Weighted KNN, Subspace KNN, and RUSBoosted trees. The outcome of each model is the predicted disease corresponding to the symptoms, age, and gender supplied to it.

Fig. 3 Functioning of the ML models

The dataset was split into inputs, consisting of age, gender, and symptoms, and outputs, the diseases corresponding to those inputs. We randomly split the available data into train and test sets, which were then encoded, and the different algorithms were trained on the training set. Each algorithm was then evaluated on the test set and its predictions compared with the true values, yielding the accuracy of each ML algorithm. The predicted values were then decoded to give the output as the disease.
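As a sketch, the split/encode/train/decode pipeline described above might look as follows with scikit-learn on a toy symptom table (the column names, toy rows, and the choice of a decision tree classifier are illustrative assumptions, not the exact setup of this project):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset: age, gender, and two binary symptom flags (illustrative only)
df = pd.DataFrame({
    "age":     [25, 60, 45, 33, 70, 52, 29, 48, 61, 37, 55, 42],
    "gender":  ["M", "F", "F", "M", "M", "F", "F", "M", "F", "M", "M", "F"],
    "fever":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "fatigue": [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "disease": ["Flu", "Anemia", "Flu", "Healthy", "Flu", "Anemia",
                "Flu", "Healthy", "Anemia", "Healthy", "Flu", "Anemia"],
})

# Encode the categorical columns as integers
gender_enc = LabelEncoder().fit(df["gender"])
disease_enc = LabelEncoder().fit(df["disease"])
X = df[["age", "fever", "fatigue"]].assign(gender=gender_enc.transform(df["gender"]))
y = disease_enc.transform(df["disease"])

# Random train/test split, then train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))

# Decode the numeric predictions back to disease names
predicted = disease_enc.inverse_transform(model.predict(X_test))
```

The same structure carries over to any of the listed classifiers: only the model construction line changes.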

AIM

The aim of this study is to test the proposed hypothesis that supervised ML

algorithms can improve health care by the accurate and early detection of diseases. In this

study, we investigate studies that utilize more than one supervised ML model for each

disease recognition problem. This approach renders more comprehensiveness and

precision because the evaluation of the performance of a single algorithm over various

study settings induces bias which generates imprecise results. The analysis of ML models

will be conducted on a few diseases of the heart, kidney, breast, and brain. For the detection of these diseases, numerous methodologies will be evaluated, such as KNN, NB, DT, CNN, SVM, and LR. At the end of this survey, the best performing ML models for each disease will be identified.

CHAPTER 2

LITERATURE SURVEY

Numerous research works have been carried out for the prediction of diseases based on the symptoms shown by an individual using machine learning algorithms:

Monto et al. designed a statistical model to predict whether a case had influenza or not. They included 3744 unvaccinated adult and adolescent cases of influenza who had fever and at least 2 other symptoms of influenza. Out of the 3744, 2470 were verified to have influenza by the laboratory. Based on this data, their model gave an accuracy of 79%. Various machine learning algorithms were studied for the effective prediction of a chronic disease outbreak by Chen et al. The data collected for the training purpose was incomplete; to overcome this, a latent factor model was used. A new convolutional neural network based multimodal disease risk prediction (CNN-MDRP) model was structured. The algorithm reached an accuracy of around 94.8%.

The DNN model performed better in terms of average performance, and the LSTM model gave close predictions when the number of occurrences was large. Haq et al. used a database that contained information about patients having heart disease. They extracted features using three selection algorithms, Relief, minimum redundancy maximum relevance (mRMR), and the least absolute shrinkage and selection operator (LASSO), which were cross-verified by the K-fold method. The extracted features were passed to 6 different machine learning algorithms, and the data was then classified based on the presence or absence of heart disease.

Dahiwade et al. proposed an ML based system that predicts common diseases. The symptoms dataset was imported from the UCI ML repository, where it contained symptoms of many common diseases. The system used CNN and KNN as classification techniques to achieve multiple disease prediction. Moreover, the proposed solution was supplemented with additional information concerning the living habits of the tested patient, which proved helpful for understanding the level of risk attached to the predicted disease. Dahiwade et al. compared the results of the KNN and CNN algorithms in terms of processing time and accuracy. The accuracy and processing time of CNN were 84.5% and 11.1 seconds, respectively. The statistics showed that the KNN algorithm underperforms compared to the CNN algorithm.

In light of this study, the findings of Chen et al. also agreed that CNN outperformed typical supervised algorithms such as KNN, NB, and DT. The authors concluded that the proposed model scored higher in terms of accuracy, which is explained by the capability of the model to detect complex nonlinear relationships in the feature space. Moreover, CNN detects features of high importance that render a better description of the disease, which enables it to accurately predict diseases of high complexity. This conclusion is well supported and backed by empirical observations and statistical arguments. Nonetheless, the presented models lacked details, for instance, neural network parameters such as network size, architecture type, learning rate, and the backpropagation algorithm. In addition, the performance analysis is evaluated only in terms of accuracy, which weakens the validity of the presented findings. Moreover, the authors did not take into consideration the bias problem faced by the tested algorithms. For instance, the incorporation of more feature variables could immensely improve the performance metrics of the underperforming algorithms.

Serek et al. planned a comparative study of classifier performance for Chronic Kidney Disease (CKD) detection using the Kidney Function Test (KFT) dataset. In this study, the classifiers used are the KNN, NB, and RF classifiers; their performance is examined in terms of F-measure, precision, and accuracy. Per the analysis, RF scored better in terms of F-measure and accuracy, while NB yielded better precision. In consideration of this study, Vijayarani aimed to detect kidney diseases using SVM and NB. The classifiers were used to identify four types of kidney diseases, namely Acute Nephritic Syndrome, Acute Renal Failure, Chronic Glomerulonephritis, and CKD. Additionally, the research focused on determining the better performing classification algorithm based on accuracy and execution time. From the results, SVM achieved considerably higher accuracy than NB, which makes it the better performing algorithm; however, NB classified data with the minimum execution time. Several other empirical studies also focused on detecting CKD; Charleonnan et al. and Kotturu et al. concluded that the SVM classifier is the most adequate for kidney diseases because it deals well with semi-structured and unstructured data. Such flexibility allowed SVM to handle larger feature spaces, which resulted in acquiring high accuracy when detecting complex kidney diseases. Although supported by findings, the conclusion is weakened by the prior suggestion that different hyper-parameters were not experimented with when evaluating the performances of the ML algorithms. According to Uddin, exploration of the hyper-parameter space can generate different accuracy results and render better performances for ML algorithms.

Marimuthu et al. aimed to predict heart diseases using supervised ML techniques. The authors structured the attributes of the data as gender, age, chest pain, target, and slope. The ML algorithms deployed were DT, KNN, LR, and NB. Per the analysis, the LR algorithm gave a high accuracy of 86.89%, which was deemed the most effective compared to the other mentioned algorithms. In 2018, Dwivedi attempted to add more precision to the prediction of heart diseases by accounting for additional parameters such as resting blood pressure, serum cholesterol in mg/dl, and maximum heart rate achieved. The dataset used was imported from the UCI ML laboratory; it comprised 120 samples that were heart disease positive and 150 samples that were heart disease negative. Dwivedi evaluated the performance of Artificial Neural Networks (ANN), SVM, KNN, NB, LR, and Classification Tree. With tenfold cross-validation, the results showed that LR had the highest classification accuracy and sensitivity, which shows high dependability at detecting heart diseases. This conclusion is strengthened by the findings of Polaraju and Vahid et al., where Logistic Regression outperformed other techniques such as ANN, SVM, and AdaBoost. The studies excelled in conducting an extensive analysis of the ML models. For instance, various hyper-parameters were tested for each ML algorithm to converge to the best possible accuracy and precision values. Despite that advantage, the small size of the imported datasets constrains the learning models from targeting diseases with higher accuracy and precision.

Shubair attempted the detection of breast cancer using ML algorithms, namely RF, Bayesian Networks, and SVM. The researchers obtained the Wisconsin original breast cancer dataset from the UCI Repository and used it to compare the learning models in terms of key parameters such as accuracy, recall, precision, and the area under the ROC curve. The classifiers were tested using the K-fold validation method, where the chosen value of K was 10. The simulation results proved that SVM excelled in terms of recall, accuracy, and precision. However, RF had a higher probability of correctly classifying the tumor, which was implied by the ROC curve.

In contrast, Yao experimented with various data mining methods, including RF and SVM, to determine the best suited algorithm for breast cancer prediction. Per the results, the classification rate, sensitivity, and specificity of the Random Forest algorithm were 96.27%, 96.78%, and 94.57%, respectively, while SVM scored an accuracy of 95.85%, a sensitivity of 95.95%, and a specificity of 95.53%. Yao came to the conclusion that the RF algorithm performed better than SVM because the former provides better estimates of the information gained from each feature attribute. Furthermore, RF is the most adequate at breast disease classification, since it scales well for large datasets and presents lower chances of variance and data overfitting. The studies advantageously presented multiple performance metrics that solidified the underlying argument. Nevertheless, the inclusion of the pre-processing stage to prepare raw data for training proved to be disadvantageous for the ML models: according to Yao, omitting parts of the data reduces the quality of the images, and therefore the performance of the ML algorithm is hindered.

Chen et al. presented an effective diagnosis system using Fuzzy k-Nearest Neighbor (FKNN) for the diagnosis of Parkinson's disease (PD). The study focused on comparing the proposed SVM-based and FKNN-based approaches. Principal Component Analysis (PCA) was utilized to assemble the most discriminative features for the construction of an optimal FKNN model. The dataset was taken from the UCI repository; it recorded numerous biomedical voice measurements from 31 people, 24 of whom had PD. The experimental findings indicated that the FKNN approach outperforms the SVM methodology in terms of sensitivity, accuracy, and specificity.

In line with this study, Behroozi aimed to propose a new classification framework to diagnose PD, enhanced by a filter-based feature selection algorithm that increased the classification accuracy by up to 15%. The classification framework was characterized by applying independent classifiers to each subset of the dataset to account for the loss of valuable information. The chosen classifiers were KNN, SVM, Discriminant Analysis, and NB. The results showed that SVM achieved the highest scores on all performance metrics. In addition, Eskidere concentrated on tracking the progression of PD by comparing the performance of SVM with other classifiers such as Least Squares Support Vector Machines (LS-SVM), General Regression Neural Network (GRNN), and Multilayer Perceptron Neural Network (MLPNN). The findings indicated that LS-SVM is the highest performing model. This conclusion is strengthened by the adequate comparison of classifiers against their optimal performance metric. According to Lavesson, different ML algorithms are designed to optimize different performance metrics (e.g., neural networks optimize squared error, whereas KNN and SVM optimize accuracy). Furthermore, the authors described the proposed frameworks in detail; for example, SVM parameters such as the kernel and the regularization value were outlined in depth.

CHAPTER – 3

SYSTEM DESIGN

The experiment was carried out on a publicly available database for heart disease.

The dataset contains a total of 303 records that were divided into two sets, training set

(40%) and testing set (60%). A data mining tool named Weka 3.6.11 was used for the

experiment. Additionally, multilayer perceptron neural network (MLPNN) with

backpropagation (BP) was used as the training algorithm.

Fig -4 MLPNN

MLPNN is one of the most significant models in artificial neural networks. The MLPNN consists of one input layer, one or more hidden layers, and one output layer. In an MLPNN, the input nodes pass values to the first hidden layer, the nodes of the first hidden layer pass values to the second, and so on until the outputs are produced.
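This layer-by-layer passing of values can be sketched in NumPy; the layer sizes, random weights, and sigmoid activation below are illustrative assumptions rather than the parameters of the Weka experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input nodes, 5 hidden nodes, 2 output nodes
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

def forward(x):
    # Input nodes pass values to the first hidden layer...
    h = sigmoid(x @ W1 + b1)
    # ...which passes values onward until the outputs are produced
    return sigmoid(h @ W2 + b2)

out = forward(np.array([1.0, 0.0, 0.5, 1.0]))
```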

CHAPTER – 4

SOFTWARE AND HARDWARE REQUIREMENTS

4.1 Hardware Requirements:

• System : Dual Core.


• Hard Disk : 500 GB.
• Monitor : LED Monitor.
• Mouse : Optical Mouse
• RAM : 4 GB

4.2 Software Requirements:

• Operating system : Windows 10.


• Coding Language : Python 3.7 (Tkinter)
• IDE : PyCharm
• Database : Access

CHAPTER – 5

SYSTEM ANALYSIS

Fig- 5 Functioning of the ML models

The dataset was split into inputs, consisting of age, gender, and symptoms, and outputs, the diseases corresponding to those inputs. We randomly split the available data into train and test sets, which were then encoded, and the different algorithms were trained on the training set. Each algorithm was then evaluated on the test set and its predictions compared with the true values, yielding the accuracy of each ML algorithm. The predicted values were then decoded to give the output as the disease.


Data Flow Diagram

Fig-6 BP Algorithm

The BP algorithm has served as a useful methodology to train multilayer perceptrons for a wide range of applications. The BP network calculates the difference between real and predicted values, which is propagated from the output nodes backwards to the nodes in the previous layer. The BP learning algorithm can be divided into two phases: propagation and weight update.

First, this learning algorithm provides training data to the network and compares the actual and desired outputs, then calculates the error in each neuron. Based on this, the algorithm calculates what the output of each neuron should be and how much higher or lower the output must be adjusted to approach the desired output, and finally adjusts the weights. This overall process improves the weights during training.
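The two phases can be sketched in NumPy for a single-hidden-layer network trained on a toy XOR-style problem (the data, layer sizes, and learning rate are illustrative assumptions; the report's experiment used Weka's MLPNN instead):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary target (XOR-like, illustrative)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5  # learning rate (illustrative)

losses = []
for _ in range(2000):
    # Propagation phase: pass inputs forward to produce outputs
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))

    # Error propagation: circulate the error from the output nodes
    # backwards to the nodes in the previous layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Weight update phase: adjust each weight against its error gradient
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
```

Over the iterations the recorded mean squared error falls, which is exactly the "improve weights during training" behaviour described above.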

Chapter – 6

MODULES

6.1 Data collection

Medical data of patients with heart disease is collected using the UCI dataset. All cases are suspected of CAD and registered for angiography. Every patient's demographic, historic, and laboratory attributes are assembled, such as sex, age, hypertension, smoking history, diabetes mellitus, chest pain type, dyslipidemia, random blood sugar, low and high density lipoprotein, cholesterol, triglycerides, systolic and diastolic blood pressure, weight, height, BMI (body mass index), central obesity, waist circumference, ankle–brachial index, duration of exercise, METS obtained, rate pressure product, recovery duration with persistent ST changes, Duke treadmill test, and angiography result.

6.2 Disease Prediction:

With the launch of automated medical diagnosis systems, there has been rapid development in the medical domain, and at the same time cost consumption has been reduced. Numerous factors prevail in heart attack diagnosis, and mostly the patient's test records are referred to and analyzed for carrying out the diagnosis. To enhance the diagnosis process, the experience and knowledge of various medical experts and doctors, as well as the patient's medical screening data, is collected in databases, resulting in an extremely significant system. With the blend of clinical decision support and computerized patient records, medical faults can be reduced, patient safety can be enhanced, and variation in unwanted practices can be minimized, thereby improving overall patient outcomes.

Doctors and medical professionals are always required in case of an emergency.

In the current situation of COVID-19, where sufficient facilities and resources are

unavailable, our prediction system can prove to be helpful and can be used in the

diagnosis of a disease.

Chapter – 7

SYSTEM TESTING

Fig. 7. Comparison of the accuracy values of the different ML algorithms.

The Weighted KNN model gave the highest accuracy compared to the other ML algorithms, while the RUSBoosted trees were the least accurate model. The Fine KNN performed better than the Subspace, Medium, and Coarse KNN models; the least efficient KNN model was Coarse KNN. The Gaussian and Kernel Naïve Bayes algorithms had comparable accuracy to each other, though less than the KNN models. The Fine tree had a higher accuracy than the Medium and Coarse decision tree models.

The decision tree is a type of supervised learning algorithm that is mostly used for classification problems. Notably, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes (independent variables), to form groups that are as distinct as possible. A tree has many analogies in the real world, and it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making; as the name goes, it uses a tree-like model of decisions. Though commonly used in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning. Once we have completed modelling the decision tree classifier, we can use the trained model to predict, for example, whether a balance scale tips to the right, tips to the left, or stays balanced.
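As a sketch of the balance-scale example above, a decision tree can be fit on synthetically generated balance-scale rows with scikit-learn (the feature encoding and torque-based labels below are illustrative assumptions, not the UCI dataset itself):

```python
from sklearn.tree import DecisionTreeClassifier

# Generate toy balance-scale rows: (left_weight, left_distance, right_weight, right_distance),
# labelled "L", "R", or "B" by comparing the torques on each side
rows, labels = [], []
for lw in range(1, 4):
    for ld in range(1, 4):
        for rw in range(1, 4):
            for rd in range(1, 4):
                rows.append([lw, ld, rw, rd])
                torque = lw * ld - rw * rd
                labels.append("B" if torque == 0 else ("L" if torque > 0 else "R"))

clf = DecisionTreeClassifier(random_state=0).fit(rows, labels)
pred = clf.predict([[3, 3, 1, 1]])[0]  # left side carries far more torque
```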

Random Forest is a great algorithm to train early in the model development process, to see how it performs, and it is hard to build a "bad" Random Forest because of its simplicity. This algorithm is also an excellent choice if you want to develop a model in a short amount of time. On top of that, it provides a fairly good indicator of the importance it assigns to your features. Random Forests are very hard to beat in terms of performance, and on top of that they can handle many different feature types, such as binary, categorical, and numerical. Overall, Random Forest is a (mostly) fast, simple, and versatile tool, though it has its limitations. Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Weighted KNN is a modified version of KNN. In KNN we choose an integer parameter K and, using the K nearest points, determine the majority predicted value. But if the value of K is too small, the algorithm is much more sensitive to points that are outliers; if the value of K is too large, points far from the query also influence the vote. To overcome this issue, weighted KNN gives more weight to the points that are nearest to the query and less weight to the points that are farther away. We were able to get the highest accuracy using this model; among all the KNN models, this model gave us the best results.
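The idea can be sketched with scikit-learn, which exposes distance weighting directly through the weights parameter of KNeighborsClassifier (the one-dimensional toy points below are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-0 point right next to the query, three class-1 points farther away
X = np.array([[0.0], [1.0], [1.1], [1.2]])
y = np.array([0, 1, 1, 1])
query = [[0.05]]

uniform = KNeighborsClassifier(n_neighbors=4, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=4, weights="distance").fit(X, y)

u_pred = uniform.predict(query)[0]   # plain majority vote over all 4 neighbors
w_pred = weighted.predict(query)[0]  # votes scaled by 1/distance to the query
```

With uniform weights the three distant class-1 points outvote the single nearby class-0 point, whereas distance weighting lets the nearest neighbor dominate.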

Chapter – 8

ALGORITHM

8.1. Multinomial Naive Bayes: The naive Bayes classification strategy works with Bayes' theorem, which is widely used in probabilistic analysis in mathematics and computer science. This algorithm treats each attribute as an independent feature that contributes to the final classification. For this reason, it has been praised in various studies for its accuracy and reliability.

It is a straightforward method for building classifiers, which are models that give

class labels to problem cases represented as vectors of characteristic values. Here the

class labels are selected from a finite set. For training such classifiers, there is no one

algorithm, but rather a variety of algorithms based on the same principle: all naive Bayes

classifiers assume that the value of one feature is independent of the value of any other

feature, given the class variable.

For example, if the fruit is red, round, and around 10 cm in diameter, it is termed

an apple. A naive Bayes classifier examines each of these characteristics to contribute

independently to the likelihood that this fruit is an apple, regardless of any possible

confounding variables.
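A minimal sketch with scikit-learn's MultinomialNB, applied to made-up symptom count vectors (the features, counts, and labels are illustrative assumptions):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are symptom count vectors [cough, fever, rash]; labels are diseases (illustrative)
X = np.array([
    [3, 2, 0],  # flu-like profile
    [2, 3, 0],
    [0, 1, 4],  # rash-dominated profile
    [0, 0, 3],
])
y = np.array(["flu", "flu", "measles", "measles"])

# Each feature contributes independently to the class likelihood, as described above
clf = MultinomialNB().fit(X, y)
pred = clf.predict([[4, 1, 0]])[0]
```

A cough-and-fever-heavy query is assigned the class whose independent per-feature likelihoods best explain those counts.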

8.2. Random Forest Classifier: The decision tree is the basic building block of random

forest classifiers. A decision tree is a hierarchical structure created from a data set's

characteristics (or independent variables). The decision tree is divided into nodes based

on a measure connected with a subset of the characteristics. The random forest approach

was used to create the prediction model, and the results were compared to those of a

decision tree based on multiple logistic regression and a classification and regression

tree. The recognition rate was used to calculate the model's forecast accuracy. Random

forests are multi-decision tree ensemble classifiers that train multiple decision trees at

random. The random forest approach is made up of two steps: a training step that creates

numerous decision trees and a test step that classifies or predicts an outcome variable

based on an input vector. A forest F = {f1, ..., fT} represents the ensemble produced by random forest training, where T is the number of decision trees. Classification was performed after averaging the distributions obtained from the forest's T decision trees. The mean for continuous target variables and the majority vote for categorical target variables were used to aggregate the predictions for each sample.
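The two steps (training many randomized trees, then aggregating their votes) can be sketched with scikit-learn (the synthetic data and the choice T = 25 are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: the class is 1 when feature 0 exceeds feature 1 (illustrative)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Training step: T = 25 decision trees, each fit on a random bootstrap sample
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Test step: the forest classifies by aggregating the votes of its trees
tree_vote = forest.estimators_[0].predict([[2.0, -2.0]])[0]  # one tree's vote
forest_vote = forest.predict([[2.0, -2.0]])[0]               # majority vote
acc = forest.score(X, y)
```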

Chapter – 9

SAMPLE CODE

from tkinter import *

from tkinter import messagebox

import numpy as np

import pandas as pd

l1=['itching','skin_rash','nodal_skin_eruptions','continuous_sneezing','shivering','chills','jo
int_pain',

'stomach_pain','acidity','ulcers_on_tongue','muscle_wasting','vomiting','burning_micturiti
on','spotting_ urination','fatigue',

'weight_gain','anxiety','cold_hands_and_feets','mood_swings','weight_loss','restlessness','l
ethargy','patches_in_throat',

'irregular_sugar_level','cough','high_fever','sunken_eyes','breathlessness','sweating','dehyd
ration','indigestion',

'headache','yellowish_skin','dark_urine','nausea','loss_of_appetite','pain_behind_the_eyes',
'back_pain','constipation',

'abdominal_pain','diarrhoea','mild_fever','yellow_urine','yellowing_of_eyes','acute_liver_f
ailure','fluid_overload',

'swelling_of_stomach','swelled_lymph_nodes','malaise','blurred_and_distorted_vision','ph
legm','throat_irritation',

'redness_of_eyes','sinus_pressure','runny_nose','congestion','chest_pain','weakness_in_lim
bs','fast_heart_rate',

25
'pain_during_bowel_movements','pain_in_anal_region','bloody_stool','irritation_in_anus',
'neck_pain','dizziness','cramps',

'bruising','obesity','swollen_legs','swollen_blood_vessels','puffy_face_and_eyes','enlarged
_thyroid','brittle_nails',

'swollen_extremeties','excessive_hunger','extra_marital_contacts','drying_and_tingling_li
ps','slurred_speech','knee_pain','hip_joint_pain',

'muscle_weakness','stiff_neck','swelling_joints','movement_stiffness','spinning_movemen
ts','loss_of_balance','unsteadiness','weakness_of_one_body_side',

'loss_of_smell','bladder_discomfort','foul_smell_of
urine','continuous_feel_of_urine','passage_of_gases','internal_itching','toxic_look_(typhos
)',

'depression','irritability','muscle_pain','altered_sensorium','red_spots_over_body','belly_pa
in','abnormal_menstruation','dischromic _patches',

'watering_from_eyes','increased_appetite','polyuria','family_history','mucoid_sputum','rus
ty_sputum','lack_of_concentration','visual_disturbances',

'receiving_blood_transfusion','receiving_unsterile_injections','coma','stomach_bleeding','
distention_of_abdomen','history_of_alcohol_consumption',

'fluid_overload','blood_in_sputum','prominent_veins_on_calf','palpitations','painful_walki
ng','pus_filled_pimples','blackheads','scurring','skin_peeling',

'silver_like_dusting','small_dents_in_nails','inflammatory_nails','blister','red_sore_around
_nose','yellow_crust_ooze']

disease=['Fungal infection','Allergy','GERD','Chronic cholestasis','Drug Reaction',
'Peptic ulcer diseae','AIDS','Diabetes','Gastroenteritis','Bronchial Asthma',
'Hypertension','Migraine','Cervical spondylosis','Paralysis (brain hemorrhage)',
'Jaundice','Malaria','Chicken pox','Dengue','Typhoid','hepatitis A','Hepatitis B',
'Hepatitis C','Hepatitis D','Hepatitis E','Alcoholic hepatitis','Tuberculosis',
'Common Cold','Pneumonia','Dimorphic hemmorhoids(piles)','Heart attack',
'Varicose veins','Hypothyroidism','Hyperthyroidism','Hypoglycemia','Osteoarthristis',
'Arthritis','(vertigo) Paroymsal Positional Vertigo','Acne','Urinary tract infection',
'Psoriasis','Impetigo']

# Zero-initialised feature vector, one slot per symptom in l1
l2=[]
for x in range(0,len(l1)):
    l2.append(0)

# TESTING DATA
tr=pd.read_csv("Testing.csv")

tr.replace({'prognosis':{'Fungal infection':0,'Allergy':1,'GERD':2,
    'Chronic cholestasis':3,'Drug Reaction':4,'Peptic ulcer diseae':5,'AIDS':6,
    'Diabetes ':7,'Gastroenteritis':8,'Bronchial Asthma':9,'Hypertension ':10,
    'Migraine':11,'Cervical spondylosis':12,'Paralysis (brain hemorrhage)':13,
    'Jaundice':14,'Malaria':15,'Chicken pox':16,'Dengue':17,'Typhoid':18,
    'hepatitis A':19,'Hepatitis B':20,'Hepatitis C':21,'Hepatitis D':22,
    'Hepatitis E':23,'Alcoholic hepatitis':24,'Tuberculosis':25,'Common Cold':26,
    'Pneumonia':27,'Dimorphic hemmorhoids(piles)':28,'Heart attack':29,
    'Varicose veins':30,'Hypothyroidism':31,'Hyperthyroidism':32,'Hypoglycemia':33,
    'Osteoarthristis':34,'Arthritis':35,'(vertigo) Paroymsal Positional Vertigo':36,
    'Acne':37,'Urinary tract infection':38,'Psoriasis':39,
    'Impetigo':40}},inplace=True)

X_test = tr[l1]
y_test = tr[["prognosis"]]
np.ravel(y_test)
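The label-encoding step above can be sketched on a tiny synthetic frame; the two symptom columns and three disease names here are illustrative stand-ins for the full CSV, not the real dataset:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for Testing.csv: symptom columns plus the prognosis label.
tr = pd.DataFrame({
    'itching':   [1, 0, 0],
    'fatigue':   [0, 1, 1],
    'prognosis': ['Fungal infection', 'Allergy', 'GERD'],
})

# Map each disease name to the integer class index the classifier expects.
tr.replace({'prognosis': {'Fungal infection': 0, 'Allergy': 1, 'GERD': 2}},
           inplace=True)

X_test = tr[['itching', 'fatigue']]    # feature matrix
y_test = np.ravel(tr[['prognosis']])   # flatten the (n, 1) label frame to shape (n,)
print(list(y_test))                    # -> [0, 1, 2]
```

Note that `np.ravel` returns the flattened array rather than modifying in place, so its result must be assigned (or passed directly to `fit`) to take effect.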

# TRAINING DATA
df=pd.read_csv("Training.csv")

df.replace({'prognosis':{'Fungal infection':0,'Allergy':1,'GERD':2,
    'Chronic cholestasis':3,'Drug Reaction':4,'Peptic ulcer diseae':5,'AIDS':6,
    'Diabetes ':7,'Gastroenteritis':8,'Bronchial Asthma':9,'Hypertension ':10,
    'Migraine':11,'Cervical spondylosis':12,'Paralysis (brain hemorrhage)':13,
    'Jaundice':14,'Malaria':15,'Chicken pox':16,'Dengue':17,'Typhoid':18,
    'hepatitis A':19,'Hepatitis B':20,'Hepatitis C':21,'Hepatitis D':22,
    'Hepatitis E':23,'Alcoholic hepatitis':24,'Tuberculosis':25,'Common Cold':26,
    'Pneumonia':27,'Dimorphic hemmorhoids(piles)':28,'Heart attack':29,
    'Varicose veins':30,'Hypothyroidism':31,'Hyperthyroidism':32,'Hypoglycemia':33,
    'Osteoarthristis':34,'Arthritis':35,'(vertigo) Paroymsal Positional Vertigo':36,
    'Acne':37,'Urinary tract infection':38,'Psoriasis':39,
    'Impetigo':40}},inplace=True)

X = df[l1]
y = df[["prognosis"]]
np.ravel(y)

def message():
    if (Symptom1.get() == "None" and Symptom2.get() == "None"
            and Symptom3.get() == "None" and Symptom4.get() == "None"
            and Symptom5.get() == "None"):
        messagebox.showinfo("OPPS!!", "ENTER SYMPTOMS PLEASE")
    else:
        NaiveBayes()

def NaiveBayes():
    from sklearn.naive_bayes import MultinomialNB
    gnb = MultinomialNB()
    gnb = gnb.fit(X, np.ravel(y))

    from sklearn.metrics import accuracy_score
    y_pred = gnb.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(accuracy_score(y_test, y_pred, normalize=False))

    # Mark each selected symptom with a 1 in the binary feature vector
    psymptoms = [Symptom1.get(), Symptom2.get(), Symptom3.get(),
                 Symptom4.get(), Symptom5.get()]
    for k in range(0, len(l1)):
        for z in psymptoms:
            if(z == l1[k]):
                l2[k] = 1

    inputtest = [l2]
    predict = gnb.predict(inputtest)
    predicted = predict[0]

    h = 'no'
    for a in range(0, len(disease)):
        if(disease[predicted] == disease[a]):
            h = 'yes'
            break
    if (h == 'yes'):
        t3.delete("1.0", END)
        t3.insert(END, disease[a])
    else:
        t3.delete("1.0", END)
        t3.insert(END, "No Disease")
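The core of the classification step, turning the chosen symptom names into a 0/1 feature vector and predicting a class index with multinomial Naive Bayes, can be sketched without the GUI or the CSVs. The symptom vocabulary, disease names, and training rows below are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

l1 = ['itching', 'skin_rash', 'fatigue', 'headache']   # feature vocabulary
disease = ['Fungal infection', 'Migraine']             # class index -> disease name

# Toy training data: each row is a binary symptom vector, labels are class indices.
X = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 1]])
y = np.array([0, 1, 0, 1])

gnb = MultinomialNB().fit(X, y)

# Mark the user's selected symptoms in a zero-initialised vector, as the GUI does.
psymptoms = ['fatigue', 'headache']
l2 = [0] * len(l1)
for k in range(len(l1)):
    if l1[k] in psymptoms:
        l2[k] = 1

predicted = gnb.predict([l2])[0]   # class index of the most likely disease
print(disease[predicted])          # -> Migraine
```

The predicted integer is simply used as an index into the `disease` list, which is why the ordering of that list must match the integer codes assigned in the `replace` mapping.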

root = Tk()

root.title(" Disease Prediction From Symptoms")

root.configure()

Symptom1 = StringVar()

Symptom1.set(None)

Symptom2 = StringVar()

Symptom2.set(None)

Symptom3 = StringVar()

Symptom3.set(None)

Symptom4 = StringVar()

Symptom4.set(None)

Symptom5 = StringVar()

Symptom5.set(None)

w2 = Label(root, justify=LEFT, text=" Disease Prediction From Symptoms ")

w2.config(font=("Elephant", 30))

w2.grid(row=1, column=0, columnspan=2, padx=100)

NameLb1 = Label(root, text="")

NameLb1.config(font=("Elephant", 20))

NameLb1.grid(row=5, column=1, pady=10, sticky=W)

S1Lb = Label(root, text="Symptom 1")

S1Lb.config(font=("Elephant", 15))

S1Lb.grid(row=7, column=1, pady=10 , sticky=W)

S2Lb = Label(root, text="Symptom 2")

S2Lb.config(font=("Elephant", 15))

S2Lb.grid(row=8, column=1, pady=10, sticky=W)

S3Lb = Label(root, text="Symptom 3")

S3Lb.config(font=("Elephant", 15))

S3Lb.grid(row=9, column=1, pady=10, sticky=W)

S4Lb = Label(root, text="Symptom 4")

S4Lb.config(font=("Elephant", 15))

S4Lb.grid(row=10, column=1, pady=10, sticky=W)

S5Lb = Label(root, text="Symptom 5")

S5Lb.config(font=("Elephant", 15))

S5Lb.grid(row=11, column=1, pady=10, sticky=W)

lr = Button(root, text="Predict",height=2, width=20, command=message)

lr.config(font=("Elephant", 15))

lr.grid(row=15, column=1,pady=20)

OPTIONS = sorted(l1)

S1En = OptionMenu(root, Symptom1,*OPTIONS)

S1En.grid(row=7, column=2)

S2En = OptionMenu(root, Symptom2,*OPTIONS)

S2En.grid(row=8, column=2)

S3En = OptionMenu(root, Symptom3,*OPTIONS)

S3En.grid(row=9, column=2)

S4En = OptionMenu(root, Symptom4,*OPTIONS)

S4En.grid(row=10, column=2)

S5En = OptionMenu(root, Symptom5,*OPTIONS)

S5En.grid(row=11, column=2)

NameLb = Label(root, text="")

NameLb.config(font=("Elephant", 20))

NameLb.grid(row=13, column=1, pady=10, sticky=W)

NameLb = Label(root, text="")

NameLb.config(font=("Elephant", 15))

NameLb.grid(row=18, column=1, pady=10, sticky=W)

t3 = Text(root, height=2, width=30)

t3.config(font=("Elephant", 20))

t3.grid(row=20, column=1 , padx=10)

root.mainloop()

CHAPTER 10

SAMPLE OUTPUT

10.1 First Page

Fig -8 Selecting Symptoms

Fig -9 Selecting symptoms as required

Fig -10 Prediction according to given symptoms

Fig -11 Different results with different symptoms

Fig -12 Different results with different symptoms

CHAPTER 11

CONCLUSION

11.1 Conclusion

The ultimate goal is to facilitate coordinated and well-informed health care
systems capable of ensuring maximum patient satisfaction. In developing nations,
predictive analytics is the next big idea in medicine, the next evolution in
statistics, and roles will change as a result. Patients can become better informed
and can assume more responsibility for their own care if they make use of the
information derived. The physician's role will likely shift toward that of an
advisor who guides, warns, and helps individual patients. Physicians may find more
joy in practice as positive outcomes increase and negative outcomes decrease.
Time with individual patients may increase, and physicians may once again have
the time to build positive and lasting relationships with their patients, with
time to think, to interact, and to really help people. Relationship formation is
one of the reasons physicians say they went into medicine, and when it
diminishes, so does their satisfaction with the profession. Hospitals,
pharmaceutical corporations, and insurance providers will see changes as well,
changes that could revolutionize the way medicine is practiced, leading to better
health and reduced illness.

11.2 Future Work

Every one of us would like a good medical care system, and physicians are
expected to be medical experts who make sound decisions all the time. But it is
highly unlikely that anyone can memorize all the knowledge, patient history, and
records needed for every situation. Even with massive amounts of data and
information available, it is difficult to compare and analyse the symptoms of all
diseases and predict the outcome. Integrating that information into a patient's
personalized profile and performing in-depth research on it is beyond the scope
of a single physician. One solution is a personalized healthcare plan, crafted
exclusively for an individual. Predictive analytics is the process of making
predictions about the future by analyzing historical data; in health care, it can
support the best decision for each individual. Predictive modeling uses
artificial intelligence to build a prediction from past records, trends,
individuals, and diseases, and once the model is deployed, a new individual can
get a prediction instantly. Health and Medicare units can use such predictive
models, for example, to assess accurately when a patient can safely be released.

REFERENCES

1. Aditya Tomar, "Disease Prediction System using data mining techniques",
   International Journal of Advanced Research in Computer and Communication
   Engineering, ISO 3297, July 2016.
2. Dr. B. Srinivasan, K. Pavya, "A study on data mining prediction techniques in
   healthcare sector", International Research Journal of Engineering and
   Technology (IRJET), March 2016.
3. Megha Rathi, Vikas Pareek, "An integrated hybrid data mining approach for
   healthcare", IRACST - International Journal of Computer Science and
   Information Technology Security (IJCSITS), ISSN: 2249-9555, Vol. 6, No. 6,
   Nov-Dec 2016.
4. Feixiang Huang, Shengyong Wang, and Chien-Chung Chan, "Predicting Disease By
   Using Data Mining Based on Healthcare Information System", IEEE, 2012.
5. Jeffrey M. Girard, Jeffrey F. Cohn, et al., Science, Volume 4, August 2015,
   pp. 86-88.
6. Christine L. Lisetti, Diane J. Schiano, "Automatic Facial Expression
   Interpretation: Where Human-Computer Interaction, Artificial Intelligence and
   Cognitive Science Intersect", Pragmatics and Cognition (Special Issue on
   Facial Information Processing: A Multidisciplinary Perspective), Vol. 8(1),
   pp. 185-235, 2000.
7. James R. Williamson, MIT Lincoln Laboratory, "Detecting Depression using
   Vocal, Facial and Semantic Communication Cues", AVEC'16, October 2016,
   Amsterdam, Netherlands, ACM, ISBN 978-1-4503-4516-3/16/10.
8. V. Surakka and J. K. Hietanen, "Facial and emotional reactions to Duchenne
   and non-Duchenne smiles", International Journal of Psychophysiology, vol. 29,
   no. 1, pp. 23-33, 1998.
9. P. M. Niedenthal, M. Mermillod, M. Maringer, and U. Hess, "The simulation of
   smiles (SIMS) model: Embodied simulation and the meaning of facial
   expression", Behavioral and Brain Sciences, vol. 33, no. 06, pp. 417-433,
   2010.
10. K. El Haddad, H. Cakmak, S. Dupont, and T. Dutoit, "Laughter and smile
    processing for human-computer interactions", Just talking - casual talk
    among humans and machines, Portoroz, Slovenia, pp. 23-28, 2016.
11. H. Yu, O. G. Garrod, and P. G. Schyns, "Perception-driven facial expression
    synthesis", Computers & Graphics, vol. 36, no. 3, pp. 152-162, 2012.
12. S. Oh, J. Bailenson, N. Krämer, and B. Li, "Let the avatar brighten your
    smile: Effects of enhancing facial expressions in virtual environments",
    PLoS ONE, vol. 11, no. 9, p. e0161794, 2016.
13. Soujanya Poria, Erik Cambria, "A review of affective computing: From
    unimodal analysis to multimodal fusion", Information Fusion,
    https://doi.org/10.1016/j.inffus.2017.02.003
