
CHAPTER 1

INTRODUCTION

Disease Prediction using Machine Learning is a system that predicts a disease based on the information provided by the user. From the symptoms that the patient or user enters, the system provides an accurate prediction of the disease he or she may have, which is useful when the condition is not serious and the user simply wants to know what it is. The system also gives the user tips for maintaining good health alongside the prediction itself. Nowadays the health industry plays a major role in curing patients' diseases, so such a system benefits both sides. A user who does not want to travel to a hospital or clinic can simply enter the symptoms and any other useful information to learn which disease he or she is affected by, while the health industry can ask patients for their symptoms, enter them into the system, and obtain a reasonably accurate diagnosis within a few seconds.

The Centers for Medicare and Medicaid Services reported that 50% of Americans have multiple chronic diseases, which led US health care to spend around $3.3 trillion in 2016, amounting to $10,348 per person. Moreover, the World Health Organization and the World Economic Forum reported that India lost a huge sum of $236.6 billion by 2015 to fatal diseases caused by malnutrition and unhealthy lifestyles. Such expenditures reveal how prone people are to a spectrum of diseases, and how vital it is to detect diseases early in order to reduce the fatality of these maladies. In addition, early disease prediction can lessen the financial pressure on the economy and ensure better maintenance of the overall well-being of the community.

According to Yuan, ML algorithms are highly susceptible to errors because of two factors. First, they depend on the quality and selection of the datasets, which is crucial for achieving accurate and unbiased decisions. Second, ML algorithms rely heavily on the right selection of features extracted from the dataset, which has proved to be difficult, time-consuming, and computationally demanding. These factors hinder the performance of the learning model and can generate fatal errors that endanger the lives of patients. In contrast, Ismaeel argued that standard statistical techniques, together with the work experience and intuition of medical doctors, lead to undesirable biases and errors when detecting the risks associated with a disease. For this reason, advanced computational methodologies such as ML algorithms were introduced to discover meaningful patterns and hidden information in data, which can be used for critical decision making. As a consequence, the burden on the medical staff decreased, while the survival rate of patients improved.

Medicine and healthcare are among the most crucial parts of the economy and human life. There is a tremendous amount of change between the world we are living in now and the world that existed only a short while ago; everything has turned gruesome and divergent. In this situation, where everything has turned virtual, doctors and nurses are putting in maximum effort to save people's lives, even if they have to endanger their own. There are also remote villages that lack medical facilities. Virtual doctors are board-certified doctors who choose to practice online via video and phone appointments rather than in person, but this is not possible in the case of an emergency. Machines are often considered better than humans because, free of human error, they can perform tasks more efficiently and with a consistent level of accuracy. A disease predictor can therefore be called a virtual doctor, one that predicts the disease of any patient without human error. Also, in conditions like COVID-19 and Ebola, a disease predictor can be a blessing, as it can identify a person's disease without any physical contact. Some models of virtual doctors do exist, but they do not reach the required level of accuracy because not all of the required parameters are considered. The primary goal was therefore to develop numerous models and determine which one of them provides the most accurate predictions. While ML projects vary in scale and complexity, their general structure is the same. Several rule-based techniques were drawn from machine learning to guide the development and deployment of the predictive model. Several models were initiated using various machine learning (ML) algorithms that collected raw data and then bifurcated it according to gender, age group, and symptoms. The dataset was then processed by several ML models: Fine, Medium, and Coarse Decision trees, Gaussian Naïve Bayes, Kernel Naïve Bayes, Fine, Medium, and Coarse KNN, Weighted KNN, Subspace KNN, and RUSBoosted trees. The accuracy varied across the ML models. While processing the data, the input parameter dataset was supplied to every model, and the disease was received as an output with differing accuracy levels. The model with the highest accuracy was selected.

3
Fig. 1. Proposed system for disease prediction

The doctor may not always be available when needed, but in the modern scenario one can use this prediction system anytime, according to necessity. The symptoms of an individual, along with age and gender, are given to the ML model for further processing. After preliminary processing of the data, the ML model uses the current input to train and test the algorithm, resulting in the predicted disease.

Fig. 2 Proposed system flow diagram during training

The dataset, consisting of the gender, symptoms, and age of an individual, was preprocessed and fed as input to different ML algorithms for the prediction of the disease. The ML models used were Fine, Medium, and Coarse Decision trees, Gaussian Naïve Bayes, Kernel Naïve Bayes, Fine, Medium, and Coarse KNN, Weighted KNN, Subspace KNN, and RUSBoosted trees. The outcome of each model is the predicted disease corresponding to the symptoms, age, and gender supplied to it.

Fig. 3 Functioning of the ML models

The dataset was split into inputs, consisting of age, gender, and symptoms, and outputs, the diseases corresponding to those inputs. We randomly split the available data into train and test sets, which were then encoded, and the different algorithms were trained on the training set. Each algorithm was then evaluated on the test set and its predictions compared with the true values, yielding the accuracy of each ML algorithm. The predicted values were then decoded to give the output as the disease.
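As a sketch, the split/encode/train/decode pipeline described above might look as follows with scikit-learn on a toy symptom table (the column names, toy rows, and the choice of a decision tree classifier are illustrative assumptions, not the exact setup of this project):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset: age, gender, and two binary symptom flags (illustrative only)
df = pd.DataFrame({
    "age":     [25, 60, 45, 33, 70, 52, 29, 48, 61, 37, 55, 42],
    "gender":  ["M", "F", "F", "M", "M", "F", "F", "M", "F", "M", "M", "F"],
    "fever":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "fatigue": [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "disease": ["Flu", "Anemia", "Flu", "Healthy", "Flu", "Anemia",
                "Flu", "Healthy", "Anemia", "Healthy", "Flu", "Anemia"],
})

# Encode the categorical columns as integers
gender_enc = LabelEncoder().fit(df["gender"])
disease_enc = LabelEncoder().fit(df["disease"])
X = df[["age", "fever", "fatigue"]].assign(gender=gender_enc.transform(df["gender"]))
y = disease_enc.transform(df["disease"])

# Random train/test split, then train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))

# Decode the numeric predictions back to disease names
predicted = disease_enc.inverse_transform(model.predict(X_test))
```

The same structure carries over to any of the listed classifiers: only the model construction line changes.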

AIM

The aim of this study is to test the proposed hypothesis that supervised ML

algorithms can improve health care by the accurate and early detection of diseases. In this

study, we investigate studies that utilize more than one supervised ML model for each

disease recognition problem. This approach renders more comprehensiveness and

precision because the evaluation of the performance of a single algorithm over various

study settings induces bias which generates imprecise results. The analysis of ML models

will be conducted on a few diseases of the heart, kidney, breast, and brain. For the detection of these diseases, numerous methodologies will be evaluated, such as KNN, NB, DT, CNN, SVM, and LR. At the end of this survey, the best performing ML models for each disease will be identified.

CHAPTER 2

LITERATURE SURVEY

Numerous research works have been carried out for the prediction of diseases based on the symptoms shown by an individual using machine learning algorithms:

Monto et al. designed a statistical model to predict whether a case had influenza or not. They included 3744 unvaccinated adult and adolescent cases of influenza who had fever and at least 2 other symptoms of influenza. Out of the 3744, 2470 were verified to have influenza by the laboratory. Based on this data, their model gave an accuracy of 79%. Various machine learning algorithms were studied for the effective prediction of a chronic disease outbreak by Chen et al. The data collected for the training purpose was incomplete; to overcome this, a latent factor model was used. A new convolutional neural network based multimodal disease risk prediction (CNN-MDRP) model was structured. The algorithm reached an accuracy of around 94.8%.

The DNN model performed better in terms of average performance, and the LSTM model gave close predictions when the number of occurrences was large. Haq et al. used a database that contained information about patients having heart disease. They extracted features using three selection algorithms, Relief, minimum redundancy maximum relevance (mRMR), and the least absolute shrinkage and selection operator (LASSO), which were cross-verified by the K-fold method. The extracted features were passed to 6 different machine learning algorithms, and the data was then classified based on the presence or absence of heart disease.

Dahiwade et al. proposed an ML based system that predicts common diseases. The symptoms dataset was imported from the UCI ML repository, where it contained symptoms of many common diseases. The system used CNN and KNN as classification techniques to achieve multiple disease prediction. Moreover, the proposed solution was supplemented with additional information concerning the living habits of the tested patient, which proved helpful for understanding the level of risk attached to the predicted disease. Dahiwade et al. compared the results of the KNN and CNN algorithms in terms of processing time and accuracy. The accuracy and processing time of CNN were 84.5% and 11.1 seconds, respectively. The statistics showed that the KNN algorithm underperforms compared to the CNN algorithm.

In light of this study, the findings of Chen et al. also agreed that CNN outperformed typical supervised algorithms such as KNN, NB, and DT. The authors concluded that the proposed model scored higher in terms of accuracy, which is explained by the capability of the model to detect complex nonlinear relationships in the feature space. Moreover, CNN detects features of high importance that render a better description of the disease, which enables it to accurately predict diseases of high complexity. This conclusion is well supported and backed by empirical observations and statistical arguments. Nonetheless, the presented models lacked details, for instance, neural network parameters such as network size, architecture type, learning rate, and the backpropagation algorithm. In addition, the performance analysis is evaluated only in terms of accuracy, which weakens the validity of the presented findings. Moreover, the authors did not take into consideration the bias problem faced by the tested algorithms. For instance, the incorporation of more feature variables could immensely improve the performance metrics of the underperforming algorithms.

Serek et al. planned a comparative study of classifier performance for Chronic Kidney Disease (CKD) detection using the Kidney Function Test (KFT) dataset. In this study, the classifiers used are the KNN, NB, and RF classifiers; their performance is examined in terms of F-measure, precision, and accuracy. Per the analysis, RF scored better in terms of F-measure and accuracy, while NB yielded better precision. In consideration of this study, Vijayarani aimed to detect kidney diseases using SVM and NB. The classifiers were used to identify four types of kidney diseases, namely Acute Nephritic Syndrome, Acute Renal Failure, Chronic Glomerulonephritis, and CKD. Additionally, the research focused on determining the better performing classification algorithm based on accuracy and execution time. From the results, SVM achieved considerably higher accuracy than NB, which makes it the better performing algorithm; however, NB classified data with the minimum execution time. Several other empirical studies also focused on detecting CKD; Charleonnan et al. and Kotturu et al. concluded that the SVM classifier is the most adequate for kidney diseases because it deals well with semi-structured and unstructured data. Such flexibility allowed SVM to handle larger feature spaces, which resulted in acquiring high accuracy when detecting complex kidney diseases. Although supported by findings, the conclusion is weakened by the prior suggestion that different hyper-parameters were not experimented with when evaluating the performances of the ML algorithms. According to Uddin, exploration of the hyper-parameter space can generate different accuracy results and render better performances for ML algorithms.

Marimuthu et al. aimed to predict heart diseases using supervised ML techniques. The authors structured the attributes of the data as gender, age, chest pain, target, and slope. The ML algorithms deployed were DT, KNN, LR, and NB. Per the analysis, the LR algorithm gave a high accuracy of 86.89%, which was deemed the most effective compared to the other mentioned algorithms. In 2018, Dwivedi attempted to add more precision to the prediction of heart diseases by accounting for additional parameters such as resting blood pressure, serum cholesterol in mg/dl, and maximum heart rate achieved. The dataset used was imported from the UCI ML laboratory; it comprised 120 samples that were heart disease positive and 150 samples that were heart disease negative. Dwivedi evaluated the performance of Artificial Neural Networks (ANN), SVM, KNN, NB, LR, and Classification Tree. With tenfold cross-validation, the results showed that LR had the highest classification accuracy and sensitivity, which shows high dependability at detecting heart diseases. This conclusion is strengthened by the findings of Polaraju and Vahid et al., where Logistic Regression outperformed other techniques such as ANN, SVM, and AdaBoost. The studies excelled in conducting an extensive analysis of the ML models. For instance, various hyper-parameters were tested for each ML algorithm to converge to the best possible accuracy and precision values. Despite that advantage, the small size of the imported datasets constrains the learning models from targeting diseases with higher accuracy and precision.

Shubair attempted the detection of breast cancer using ML algorithms, namely RF, Bayesian Networks, and SVM. The researchers obtained the Wisconsin original breast cancer dataset from the UCI Repository and used it to compare the learning models in terms of key parameters such as accuracy, recall, precision, and the area under the ROC curve. The classifiers were tested using the K-fold validation method, where the chosen value of K was 10. The simulation results proved that SVM excelled in terms of recall, accuracy, and precision. However, RF had a higher probability of correctly classifying the tumor, which was implied by the ROC curve.

In contrast, Yao experimented with various data mining methods, including RF and SVM, to determine the best suited algorithm for breast cancer prediction. Per the results, the classification rate, sensitivity, and specificity of the Random Forest algorithm were 96.27%, 96.78%, and 94.57%, respectively, while SVM scored an accuracy of 95.85%, a sensitivity of 95.95%, and a specificity of 95.53%. Yao came to the conclusion that the RF algorithm performed better than SVM because the former provides better estimates of the information gained from each feature attribute. Furthermore, RF is the most adequate at breast disease classification, since it scales well for large datasets and presents lower chances of variance and data overfitting. The studies advantageously presented multiple performance metrics that solidified the underlying argument. Nevertheless, the inclusion of the pre-processing stage to prepare raw data for training proved to be disadvantageous for the ML models: according to Yao, omitting parts of the data reduces the quality of the images, and therefore the performance of the ML algorithm is hindered.

Chen et al. presented an effective diagnosis system using Fuzzy k-Nearest Neighbor (FKNN) for the diagnosis of Parkinson's disease (PD). The study focused on comparing the proposed SVM-based and FKNN-based approaches. Principal Component Analysis (PCA) was utilized to assemble the most discriminative features for the construction of an optimal FKNN model. The dataset was taken from the UCI repository; it recorded numerous biomedical voice measurements from 31 people, 24 of whom had PD. The experimental findings indicated that the FKNN approach outperforms the SVM methodology in terms of sensitivity, accuracy, and specificity.

In line with this study, Behroozi aimed to propose a new classification framework to diagnose PD, enhanced by a filter-based feature selection algorithm that increased the classification accuracy by up to 15%. The classification framework was characterized by applying independent classifiers to each subset of the dataset to account for the loss of valuable information. The chosen classifiers were KNN, SVM, Discriminant Analysis, and NB. The results showed that SVM achieved the highest scores on all performance metrics. In addition, Eskidere concentrated on tracking the progression of PD by comparing the performance of SVM with other classifiers such as Least Squares Support Vector Machines (LS-SVM), General Regression Neural Network (GRNN), and Multilayer Perceptron Neural Network (MLPNN). The findings indicated that LS-SVM is the highest performing model. This conclusion is strengthened by the adequate comparison of classifiers against their optimal performance metric. According to Lavesson, different ML algorithms are designed to optimize different performance metrics (e.g., neural networks optimize squared error, whereas KNN and SVM optimize accuracy). Furthermore, the authors described the proposed frameworks in detail; for example, SVM parameters such as the kernel and the regularization value were outlined in depth.

CHAPTER – 3

SYSTEM DESIGN

The experiment was carried out on a publicly available database for heart disease.

The dataset contains a total of 303 records that were divided into two sets, training set

(40%) and testing set (60%). A data mining tool named Weka 3.6.11 was used for the

experiment. Additionally, multilayer perceptron neural network (MLPNN) with

backpropagation (BP) was used as the training algorithm.

Fig -4 MLPNN

MLPNN is one of the most significant models in artificial neural networks. The MLPNN consists of one input layer, one or more hidden layers, and one output layer. In an MLPNN, the input nodes pass values to the first hidden layer, the nodes of the first hidden layer pass values to the second, and so on until the outputs are produced.
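This layer-by-layer passing of values can be sketched in NumPy; the layer sizes, random weights, and sigmoid activation below are illustrative assumptions rather than the parameters of the Weka experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input nodes, 5 hidden nodes, 2 output nodes
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

def forward(x):
    # Input nodes pass values to the first hidden layer...
    h = sigmoid(x @ W1 + b1)
    # ...which passes values onward until the outputs are produced
    return sigmoid(h @ W2 + b2)

out = forward(np.array([1.0, 0.0, 0.5, 1.0]))
```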

CHAPTER – 4

SOFTWARE AND HARDWARE REQUIREMENTS

4.1 Hardware Requirements:

• System : Dual Core.


• Hard Disk : 500 GB.
• Monitor : LED Monitor.
• Mouse : Optical Mouse
• RAM : 4 GB

4.2 Software Requirements:

• Operating system : Windows 10.


• Coding Language : Python 3.7 (Tkinter)
• IDE : PyCharm
• Database : Access

CHAPTER – 5

SYSTEM ANALYSIS

Fig- 5 Functioning of the ML models

The dataset was split into inputs, consisting of age, gender, and symptoms, and outputs, the diseases corresponding to those inputs. We randomly split the available data into train and test sets, which were then encoded, and the different algorithms were trained on the training set. Each algorithm was then evaluated on the test set and its predictions compared with the true values, yielding the accuracy of each ML algorithm. The predicted values were then decoded to give the output as the disease.


Data Flow Diagram

Fig-6 BP Algorithm

The BP algorithm has served as a useful methodology to train multilayer perceptrons for a wide range of applications. The BP network calculates the difference between real and predicted values, which is propagated from the output nodes backwards to the nodes in the previous layer. The BP learning algorithm can be divided into two phases: propagation and weight update.

First, this learning algorithm provides training data to the network and compares the actual and desired outputs, then calculates the error in each neuron. Based on this, the algorithm calculates what the output of each neuron should be and how much higher or lower the output must be adjusted to approach the desired output, and finally adjusts the weights. This overall process improves the weights during training.
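The two phases can be sketched in NumPy for a single-hidden-layer network trained on a toy XOR-style problem (the data, layer sizes, and learning rate are illustrative assumptions; the report's experiment used Weka's MLPNN instead):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary target (XOR-like, illustrative)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5  # learning rate (illustrative)

losses = []
for _ in range(2000):
    # Propagation phase: pass inputs forward to produce outputs
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))

    # Error propagation: circulate the error from the output nodes
    # backwards to the nodes in the previous layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Weight update phase: adjust each weight against its error gradient
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
```

Over the iterations the recorded mean squared error falls, which is exactly the "improve weights during training" behaviour described above.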

Chapter – 6

MODULES

6.1 Data collection

Medical data of patients with heart disease is collected using the UCI dataset. All cases are suspected of CAD and registered for angiography. Every patient's demographic, historic, and laboratory attributes are assembled, such as sex, age, hypertension, smoking history, diabetes mellitus, chest pain type, dyslipidemia, random blood sugar, low and high density lipoprotein, cholesterol, triglycerides, systolic and diastolic blood pressure, weight, height, BMI (body mass index), central obesity, waist circumference, ankle–brachial index, duration of exercise, METS obtained, rate pressure product, recovery duration with persistent ST changes, Duke treadmill test, and angiography result.

6.2 Disease Prediction:

With the launch of automated medical diagnosis systems, there has been rapid development in the medical domain, and at the same time cost consumption has been reduced. Numerous factors prevail in heart attack diagnosis, and mostly the patient's test records are referred to and analyzed for carrying out the diagnosis. To enhance the diagnosis process, the experience and knowledge of various medical experts and doctors, as well as the patient's medical screening data, is collected in databases, resulting in an extremely significant system. With the blend of clinical decision support and computerized patient records, medical faults can be reduced, patient safety can be enhanced, and variation in unwanted practices can be minimized, thereby improving overall patient outcomes.

Doctors and medical professionals are always required in case of an emergency.

In the current situation of COVID-19, where sufficient facilities and resources are

unavailable, our prediction system can prove to be helpful and can be used in the

diagnosis of a disease.

Chapter – 7

SYSTEM TESTING

Fig. 7. Comparison of the accuracy values of the different ML algorithms.

The Weighted KNN model gave the highest accuracy compared to the other ML algorithms, while the RUSBoosted trees were the least accurate model. The Fine KNN performed better than the Subspace, Medium, and Coarse KNN models; the least efficient KNN model was Coarse KNN. The Gaussian and Kernel Naïve Bayes algorithms had comparable accuracy to each other, though less than the KNN models. The Fine tree had a higher accuracy than the Medium and Coarse decision tree models.

The decision tree is a type of supervised learning algorithm that is mostly used for classification problems. Notably, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes (independent variables), to form groups that are as distinct as possible. A tree has many analogies in the real world, and it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making; as the name goes, it uses a tree-like model of decisions. Though commonly used in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning. Once we have completed modelling the decision tree classifier, we can use the trained model to predict, for example, whether a balance scale tips to the right, tips to the left, or stays balanced.
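As a sketch of the balance-scale example above, a decision tree can be fit on synthetically generated balance-scale rows with scikit-learn (the feature encoding and torque-based labels below are illustrative assumptions, not the UCI dataset itself):

```python
from sklearn.tree import DecisionTreeClassifier

# Generate toy balance-scale rows: (left_weight, left_distance, right_weight, right_distance),
# labelled "L", "R", or "B" by comparing the torques on each side
rows, labels = [], []
for lw in range(1, 4):
    for ld in range(1, 4):
        for rw in range(1, 4):
            for rd in range(1, 4):
                rows.append([lw, ld, rw, rd])
                torque = lw * ld - rw * rd
                labels.append("B" if torque == 0 else ("L" if torque > 0 else "R"))

clf = DecisionTreeClassifier(random_state=0).fit(rows, labels)
pred = clf.predict([[3, 3, 1, 1]])[0]  # left side carries far more torque
```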

Random Forest is a great algorithm to train early in the model development process, to see how it performs, and it is hard to build a "bad" Random Forest because of its simplicity. This algorithm is also an excellent choice if you want to develop a model in a short amount of time. On top of that, it provides a fairly good indicator of the importance it assigns to your features. Random Forests are very hard to beat in terms of performance, and on top of that they can handle many different feature types, such as binary, categorical, and numerical. Overall, Random Forest is a (mostly) fast, simple, and versatile tool, though it has its limitations. Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Weighted KNN is a modified version of KNN. In KNN we choose an integer parameter K and, using the K nearest points, determine the majority predicted value. But if the value of K is too small, the algorithm is much more sensitive to points that are outliers; if the value of K is too large, points far from the query also influence the vote. To overcome this issue, weighted KNN gives more weight to the points that are nearest to the query and less weight to the points that are farther away. We were able to get the highest accuracy using this model; among all the KNN models, this model gave us the best results.
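The idea can be sketched with scikit-learn, which exposes distance weighting directly through the weights parameter of KNeighborsClassifier (the one-dimensional toy points below are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-0 point right next to the query, three class-1 points farther away
X = np.array([[0.0], [1.0], [1.1], [1.2]])
y = np.array([0, 1, 1, 1])
query = [[0.05]]

uniform = KNeighborsClassifier(n_neighbors=4, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=4, weights="distance").fit(X, y)

u_pred = uniform.predict(query)[0]   # plain majority vote over all 4 neighbors
w_pred = weighted.predict(query)[0]  # votes scaled by 1/distance to the query
```

With uniform weights the three distant class-1 points outvote the single nearby class-0 point, whereas distance weighting lets the nearest neighbor dominate.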

Chapter – 8

ALGORITHM

8.1. Multinomial Naive Bayes: The naive Bayes classification strategy works with Bayes' theorem, which is widely used in probabilistic analysis in mathematics and computer science. This algorithm treats each attribute as an independent feature that contributes to the final classification. For this reason, it has been praised in various studies for its accuracy and reliability.

It is a straightforward method for building classifiers, which are models that give

class labels to problem cases represented as vectors of characteristic values. Here the

class labels are selected from a finite set. For training such classifiers, there is no one

algorithm, but rather a variety of algorithms based on the same principle: all naive Bayes

classifiers assume that the value of one feature is independent of the value of any other

feature, given the class variable.

For example, if the fruit is red, round, and around 10 cm in diameter, it is termed

an apple. A naive Bayes classifier examines each of these characteristics to contribute

independently to the likelihood that this fruit is an apple, regardless of any possible

confounding variables.
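A minimal sketch with scikit-learn's MultinomialNB, applied to made-up symptom count vectors (the features, counts, and labels are illustrative assumptions):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are symptom count vectors [cough, fever, rash]; labels are diseases (illustrative)
X = np.array([
    [3, 2, 0],  # flu-like profile
    [2, 3, 0],
    [0, 1, 4],  # rash-dominated profile
    [0, 0, 3],
])
y = np.array(["flu", "flu", "measles", "measles"])

# Each feature contributes independently to the class likelihood, as described above
clf = MultinomialNB().fit(X, y)
pred = clf.predict([[4, 1, 0]])[0]
```

A cough-and-fever-heavy query is assigned the class whose independent per-feature likelihoods best explain those counts.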

8.2. Random Forest Classifier: The decision tree is the basic building block of random

forest classifiers. A decision tree is a hierarchical structure created from a data set's

characteristics (or independent variables). The decision tree is divided into nodes based

on a measure connected with a subset of the characteristics. The random forest approach

was used to create the prediction model, and the results were compared to those of a

decision tree based on multiple logistic regression and a classification and regression

tree. The recognition rate was used to calculate the model's forecast accuracy. Random

forests are multi-decision tree ensemble classifiers that train multiple decision trees at

random. The random forest approach is made up of two steps: a training step that creates

numerous decision trees and a test step that classifies or predicts an outcome variable

based on an input vector. A forest F = {f1, ..., fT} represents the ensemble produced by random forest training, where T is the number of decision trees. Classification was performed after averaging the distributions obtained from the forest's T decision trees. The mean for continuous target variables and the majority vote for categorical target variables were used to aggregate the predictions for each sample.
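The two steps (training many randomized trees, then aggregating their votes) can be sketched with scikit-learn (the synthetic data and the choice T = 25 are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: the class is 1 when feature 0 exceeds feature 1 (illustrative)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Training step: T = 25 decision trees, each fit on a random bootstrap sample
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Test step: the forest classifies by aggregating the votes of its trees
tree_vote = forest.estimators_[0].predict([[2.0, -2.0]])[0]  # one tree's vote
forest_vote = forest.predict([[2.0, -2.0]])[0]               # majority vote
acc = forest.score(X, y)
```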

Chapter – 9

SAMPLE CODE

from tkinter import *

from tkinter import messagebox

import numpy as np

import pandas as pd

l1=['itching','skin_rash','nodal_skin_eruptions','continuous_sneezing','shivering','chills','jo
int_pain',

'stomach_pain','acidity','ulcers_on_tongue','muscle_wasting','vomiting','burning_micturiti
on','spotting_ urination','fatigue',

'weight_gain','anxiety','cold_hands_and_feets','mood_swings','weight_loss','restlessness','l
ethargy','patches_in_throat',

'irregular_sugar_level','cough','high_fever','sunken_eyes','breathlessness','sweating','dehyd
ration','indigestion',

'headache','yellowish_skin','dark_urine','nausea','loss_of_appetite','pain_behind_the_eyes',
'back_pain','constipation',

'abdominal_pain','diarrhoea','mild_fever','yellow_urine','yellowing_of_eyes','acute_liver_f
ailure','fluid_overload',

'swelling_of_stomach','swelled_lymph_nodes','malaise','blurred_and_distorted_vision','ph
legm','throat_irritation',

'redness_of_eyes','sinus_pressure','runny_nose','congestion','chest_pain','weakness_in_lim
bs','fast_heart_rate',

25
'pain_during_bowel_movements','pain_in_anal_region','bloody_stool','irritation_in_anus',
'neck_pain','dizziness','cramps',

'bruising','obesity','swollen_legs','swollen_blood_vessels','puffy_face_and_eyes','enlarged
_thyroid','brittle_nails',

'swollen_extremeties','excessive_hunger','extra_marital_contacts','drying_and_tingling_li
ps','slurred_speech','knee_pain','hip_joint_pain',

'muscle_weakness','stiff_neck','swelling_joints','movement_stiffness','spinning_movemen
ts','loss_of_balance','unsteadiness','weakness_of_one_body_side',

'loss_of_smell','bladder_discomfort','foul_smell_of
urine','continuous_feel_of_urine','passage_of_gases','internal_itching','toxic_look_(typhos
)',

'depression','irritability','muscle_pain','altered_sensorium','red_spots_over_body','belly_pa
in','abnormal_menstruation','dischromic _patches',

'watering_from_eyes','increased_appetite','polyuria','family_history','mucoid_sputum','rus
ty_sputum','lack_of_concentration','visual_disturbances',

'receiving_blood_transfusion','receiving_unsterile_injections','coma','stomach_bleeding','
distention_of_abdomen','history_of_alcohol_consumption',

'fluid_overload','blood_in_sputum','prominent_veins_on_calf','palpitations','painful_walki
ng','pus_filled_pimples','blackheads','scurring','skin_peeling',

'silver_like_dusting','small_dents_in_nails','inflammatory_nails','blister','red_sore_around
_nose','yellow_crust_ooze']

disease=['Fungal infection','Allergy','GERD','Chronic cholestasis','Drug Reaction',
'Peptic ulcer diseae','AIDS','Diabetes','Gastroenteritis','Bronchial Asthma',
'Hypertension','Migraine','Cervical spondylosis','Paralysis (brain hemorrhage)',
'Jaundice','Malaria','Chicken pox','Dengue','Typhoid','hepatitis A','Hepatitis B',
'Hepatitis C','Hepatitis D','Hepatitis E','Alcoholic hepatitis','Tuberculosis',
'Common Cold','Pneumonia','Dimorphic hemmorhoids(piles)','Heart attack',
'Varicose veins','Hypothyroidism','Hyperthyroidism','Hypoglycemia','Osteoarthristis',
'Arthritis','(vertigo) Paroymsal Positional Vertigo','Acne','Urinary tract infection',
'Psoriasis','Impetigo']

# Zero-initialised feature vector, one slot per symptom in l1
l2=[]
for x in range(0,len(l1)):
    l2.append(0)

# TESTING DATA
tr=pd.read_csv("Testing.csv")

tr.replace({'prognosis':{'Fungal infection':0,'Allergy':1,'GERD':2,
    'Chronic cholestasis':3,'Drug Reaction':4,'Peptic ulcer diseae':5,'AIDS':6,
    'Diabetes ':7,'Gastroenteritis':8,'Bronchial Asthma':9,'Hypertension ':10,
    'Migraine':11,'Cervical spondylosis':12,'Paralysis (brain hemorrhage)':13,
    'Jaundice':14,'Malaria':15,'Chicken pox':16,'Dengue':17,'Typhoid':18,
    'hepatitis A':19,'Hepatitis B':20,'Hepatitis C':21,'Hepatitis D':22,
    'Hepatitis E':23,'Alcoholic hepatitis':24,'Tuberculosis':25,'Common Cold':26,
    'Pneumonia':27,'Dimorphic hemmorhoids(piles)':28,'Heart attack':29,
    'Varicose veins':30,'Hypothyroidism':31,'Hyperthyroidism':32,'Hypoglycemia':33,
    'Osteoarthristis':34,'Arthritis':35,'(vertigo) Paroymsal Positional Vertigo':36,
    'Acne':37,'Urinary tract infection':38,'Psoriasis':39,
    'Impetigo':40}},inplace=True)

X_test = tr[l1]
y_test = tr[["prognosis"]]
np.ravel(y_test)
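The label-encoding step above can be sketched on a tiny synthetic frame; the two symptom columns and three disease names here are illustrative stand-ins for the full CSV, not the real dataset:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for Testing.csv: symptom columns plus the prognosis label.
tr = pd.DataFrame({
    'itching':   [1, 0, 0],
    'fatigue':   [0, 1, 1],
    'prognosis': ['Fungal infection', 'Allergy', 'GERD'],
})

# Map each disease name to the integer class index the classifier expects.
tr.replace({'prognosis': {'Fungal infection': 0, 'Allergy': 1, 'GERD': 2}},
           inplace=True)

X_test = tr[['itching', 'fatigue']]    # feature matrix
y_test = np.ravel(tr[['prognosis']])   # flatten the (n, 1) label frame to shape (n,)
print(list(y_test))                    # -> [0, 1, 2]
```

Note that `np.ravel` returns the flattened array rather than modifying in place, so its result must be assigned (or passed directly to `fit`) to take effect.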

# TRAINING DATA
df=pd.read_csv("Training.csv")

df.replace({'prognosis':{'Fungal infection':0,'Allergy':1,'GERD':2,
    'Chronic cholestasis':3,'Drug Reaction':4,'Peptic ulcer diseae':5,'AIDS':6,
    'Diabetes ':7,'Gastroenteritis':8,'Bronchial Asthma':9,'Hypertension ':10,
    'Migraine':11,'Cervical spondylosis':12,'Paralysis (brain hemorrhage)':13,
    'Jaundice':14,'Malaria':15,'Chicken pox':16,'Dengue':17,'Typhoid':18,
    'hepatitis A':19,'Hepatitis B':20,'Hepatitis C':21,'Hepatitis D':22,
    'Hepatitis E':23,'Alcoholic hepatitis':24,'Tuberculosis':25,'Common Cold':26,
    'Pneumonia':27,'Dimorphic hemmorhoids(piles)':28,'Heart attack':29,
    'Varicose veins':30,'Hypothyroidism':31,'Hyperthyroidism':32,'Hypoglycemia':33,
    'Osteoarthristis':34,'Arthritis':35,'(vertigo) Paroymsal Positional Vertigo':36,
    'Acne':37,'Urinary tract infection':38,'Psoriasis':39,
    'Impetigo':40}},inplace=True)

X = df[l1]
y = df[["prognosis"]]
np.ravel(y)

def message():
    if (Symptom1.get() == "None" and Symptom2.get() == "None"
            and Symptom3.get() == "None" and Symptom4.get() == "None"
            and Symptom5.get() == "None"):
        messagebox.showinfo("OPPS!!", "ENTER SYMPTOMS PLEASE")
    else:
        NaiveBayes()

def NaiveBayes():
    from sklearn.naive_bayes import MultinomialNB
    gnb = MultinomialNB()
    gnb = gnb.fit(X, np.ravel(y))

    from sklearn.metrics import accuracy_score
    y_pred = gnb.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(accuracy_score(y_test, y_pred, normalize=False))

    # Mark each selected symptom with a 1 in the binary feature vector
    psymptoms = [Symptom1.get(), Symptom2.get(), Symptom3.get(),
                 Symptom4.get(), Symptom5.get()]
    for k in range(0, len(l1)):
        for z in psymptoms:
            if(z == l1[k]):
                l2[k] = 1

    inputtest = [l2]
    predict = gnb.predict(inputtest)
    predicted = predict[0]

    h = 'no'
    for a in range(0, len(disease)):
        if(disease[predicted] == disease[a]):
            h = 'yes'
            break
    if (h == 'yes'):
        t3.delete("1.0", END)
        t3.insert(END, disease[a])
    else:
        t3.delete("1.0", END)
        t3.insert(END, "No Disease")
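The core of the classification step, turning the chosen symptom names into a 0/1 feature vector and predicting a class index with multinomial Naive Bayes, can be sketched without the GUI or the CSVs. The symptom vocabulary, disease names, and training rows below are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

l1 = ['itching', 'skin_rash', 'fatigue', 'headache']   # feature vocabulary
disease = ['Fungal infection', 'Migraine']             # class index -> disease name

# Toy training data: each row is a binary symptom vector, labels are class indices.
X = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 1]])
y = np.array([0, 1, 0, 1])

gnb = MultinomialNB().fit(X, y)

# Mark the user's selected symptoms in a zero-initialised vector, as the GUI does.
psymptoms = ['fatigue', 'headache']
l2 = [0] * len(l1)
for k in range(len(l1)):
    if l1[k] in psymptoms:
        l2[k] = 1

predicted = gnb.predict([l2])[0]   # class index of the most likely disease
print(disease[predicted])          # -> Migraine
```

The predicted integer is simply used as an index into the `disease` list, which is why the ordering of that list must match the integer codes assigned in the `replace` mapping.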

root = Tk()

root.title(" Disease Prediction From Symptoms")

root.configure()

Symptom1 = StringVar()

Symptom1.set(None)

Symptom2 = StringVar()

Symptom2.set(None)

Symptom3 = StringVar()

Symptom3.set(None)

Symptom4 = StringVar()

Symptom4.set(None)

Symptom5 = StringVar()

Symptom5.set(None)

w2 = Label(root, justify=LEFT, text=" Disease Prediction From Symptoms ")

w2.config(font=("Elephant", 30))

w2.grid(row=1, column=0, columnspan=2, padx=100)

NameLb1 = Label(root, text="")

NameLb1.config(font=("Elephant", 20))

NameLb1.grid(row=5, column=1, pady=10, sticky=W)

S1Lb = Label(root, text="Symptom 1")

S1Lb.config(font=("Elephant", 15))

S1Lb.grid(row=7, column=1, pady=10 , sticky=W)

S2Lb = Label(root, text="Symptom 2")

S2Lb.config(font=("Elephant", 15))

S2Lb.grid(row=8, column=1, pady=10, sticky=W)

S3Lb = Label(root, text="Symptom 3")

S3Lb.config(font=("Elephant", 15))

S3Lb.grid(row=9, column=1, pady=10, sticky=W)

S4Lb = Label(root, text="Symptom 4")

S4Lb.config(font=("Elephant", 15))

S4Lb.grid(row=10, column=1, pady=10, sticky=W)

S5Lb = Label(root, text="Symptom 5")

S5Lb.config(font=("Elephant", 15))

S5Lb.grid(row=11, column=1, pady=10, sticky=W)

lr = Button(root, text="Predict",height=2, width=20, command=message)

lr.config(font=("Elephant", 15))

lr.grid(row=15, column=1,pady=20)

OPTIONS = sorted(l1)

S1En = OptionMenu(root, Symptom1,*OPTIONS)

S1En.grid(row=7, column=2)

S2En = OptionMenu(root, Symptom2,*OPTIONS)

S2En.grid(row=8, column=2)

S3En = OptionMenu(root, Symptom3,*OPTIONS)

S3En.grid(row=9, column=2)

S4En = OptionMenu(root, Symptom4,*OPTIONS)

S4En.grid(row=10, column=2)

S5En = OptionMenu(root, Symptom5,*OPTIONS)

S5En.grid(row=11, column=2)

NameLb = Label(root, text="")

NameLb.config(font=("Elephant", 20))

NameLb.grid(row=13, column=1, pady=10, sticky=W)

NameLb = Label(root, text="")

NameLb.config(font=("Elephant", 15))

NameLb.grid(row=18, column=1, pady=10, sticky=W)

t3 = Text(root, height=2, width=30)

t3.config(font=("Elephant", 20))

t3.grid(row=20, column=1 , padx=10)

root.mainloop()

CHAPTER 10

SAMPLE OUTPUT

10.1 First Page

Fig -8 Selecting Symptoms

Fig -9 Selecting symptoms as required

Fig -10 Prediction according to given symptoms

Fig -11 Different results with different symptoms

Fig -12 Different results with different symptoms

CHAPTER 11

CONCLUSION

11.1 Conclusion

The ultimate goal is to facilitate coordinated and well-informed health care
systems capable of ensuring maximum patient satisfaction. In developing nations,
predictive analytics is the next big idea in medicine, the next evolution in
statistics, and roles will change as a result. Patients can become better informed
and can assume more responsibility for their own care if they make use of the
information derived. The physician's role will likely shift toward that of an
advisor who guides, warns, and helps individual patients. Physicians may find more
joy in practice as positive outcomes increase and negative outcomes decrease.
Time with individual patients may increase, and physicians may once again have
the time to build positive and lasting relationships with their patients, with
time to think, to interact, and to really help people. Relationship formation is
one of the reasons physicians say they went into medicine, and when it
diminishes, so does their satisfaction with the profession. Hospitals,
pharmaceutical corporations, and insurance providers will see changes as well,
changes that could revolutionize the way medicine is practiced, leading to better
health and reduced illness.

11.2 Future Work

Every one of us would like a good medical care system, and physicians are
expected to be medical experts who make sound decisions all the time. But it is
highly unlikely that anyone can memorize all the knowledge, patient history, and
records needed for every situation. Even with massive amounts of data and
information available, it is difficult to compare and analyse the symptoms of all
diseases and predict the outcome. Integrating that information into a patient's
personalized profile and performing in-depth research on it is beyond the scope
of a single physician. One solution is a personalized healthcare plan, crafted
exclusively for an individual. Predictive analytics is the process of making
predictions about the future by analyzing historical data; in health care, it can
support the best decision for each individual. Predictive modeling uses
artificial intelligence to build a prediction from past records, trends,
individuals, and diseases, and once the model is deployed, a new individual can
get a prediction instantly. Health and Medicare units can use such predictive
models, for example, to assess accurately when a patient can safely be released.

REFERENCES

1. Aditya Tomar, "Disease Prediction System using data mining techniques",
   International Journal of Advanced Research in Computer and Communication
   Engineering, ISO 3297, July 2016.
2. Dr. B. Srinivasan, K. Pavya, "A study on data mining prediction techniques in
   healthcare sector", International Research Journal of Engineering and
   Technology (IRJET), March 2016.
3. Megha Rathi, Vikas Pareek, "An integrated hybrid data mining approach for
   healthcare", IRACST - International Journal of Computer Science and
   Information Technology Security (IJCSITS), ISSN: 2249-9555, Vol. 6, No. 6,
   Nov-Dec 2016.
4. Feixiang Huang, Shengyong Wang, and Chien-Chung Chan, "Predicting Disease By
   Using Data Mining Based on Healthcare Information System", IEEE, 2012.
5. Jeffrey M. Girard, Jeffrey F. Cohn, et al., Science, Volume 4, August 2015,
   pp. 86-88.
6. Christine L. Lisetti, Diane J. Schiano, "Automatic Facial Expression
   Interpretation: Where Human-Computer Interaction, Artificial Intelligence and
   Cognitive Science Intersect", Pragmatics and Cognition (Special Issue on
   Facial Information Processing: A Multidisciplinary Perspective), Vol. 8(1),
   pp. 185-235, 2000.
7. James R. Williamson, MIT Lincoln Laboratory, "Detecting Depression using
   Vocal, Facial and Semantic Communication Cues", AVEC'16, October 2016,
   Amsterdam, Netherlands, ACM, ISBN 978-1-4503-4516-3/16/10.
8. V. Surakka and J. K. Hietanen, "Facial and emotional reactions to Duchenne
   and non-Duchenne smiles", International Journal of Psychophysiology, vol. 29,
   no. 1, pp. 23-33, 1998.
9. P. M. Niedenthal, M. Mermillod, M. Maringer, and U. Hess, "The simulation of
   smiles (SIMS) model: Embodied simulation and the meaning of facial
   expression", Behavioral and Brain Sciences, vol. 33, no. 06, pp. 417-433,
   2010.
10. K. El Haddad, H. Cakmak, S. Dupont, and T. Dutoit, "Laughter and smile
    processing for human-computer interactions", Just talking - casual talk
    among humans and machines, Portoroz, Slovenia, pp. 23-28, 2016.
11. H. Yu, O. G. Garrod, and P. G. Schyns, "Perception-driven facial expression
    synthesis", Computers & Graphics, vol. 36, no. 3, pp. 152-162, 2012.
12. S. Oh, J. Bailenson, N. Krämer, and B. Li, "Let the avatar brighten your
    smile: Effects of enhancing facial expressions in virtual environments",
    PLoS ONE, vol. 11, no. 9, p. e0161794, 2016.
13. Soujanya Poria, Erik Cambria, "A review of affective computing: From
    unimodal analysis to multimodal fusion", Information Fusion,
    https://doi.org/10.1016/j.inffus.2017.02.003
