
PARKINSON’S DISEASE PREDICTION USING

MACHINE LEARNING TECHNIQUES

A Project report submitted in partial fulfillment of the requirements for


the award of the degree of

BACHELOR OF TECHNOLOGY IN
COMPUTER SCIENCE ENGINEERING

Submitted by

K. MANOJ (317126510078)
I. MANIKANTA (318126510L13)
CH. DEEKSHITH (318126510L16)
K.V. MUKESH (318126510L20)

Under the guidance of


DR. K.S. DEEPTHI

(Associate Professor)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND SCIENCES


(UGC AUTONOMOUS)
(Permanently Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’ Grade)
Sangivalasa, Bheemili Mandal, Visakhapatnam District (A.P.)
2020-2021

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND
SCIENCES (UGC AUTONOMOUS)
(Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’
Grade)
Sangivalasa, Bheemili Mandal, Visakhapatnam District (A.P.)

BONAFIDE CERTIFICATE

This is to certify that the project report entitled “PARKINSON’S DISEASE


PREDICTION USING MACHINE LEARNING TECHNIQUES” submitted by
K. MANOJ (317126510078), I. MANIKANTA (318126510L13), CH.
DEEKSHITH (318126510L16), K.V. MUKESH (318126510L20) in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science Engineering of Anil Neerukonda Institute of Technology and
Sciences (A), Visakhapatnam, is a record of bonafide work carried out under my
guidance and supervision.

Project Guide Head of the Department


DR. K.S. DEEPTHI DR. R. SIVARANJANI
Associate Professor Professor
Department of CSE Department of CSE
ANITS ANITS

DECLARATION

We, K.MANOJ, I. MANIKANTA, CH. DEEKSHITH, K.V. MUKESH, of final


semester B.Tech., in the department of Computer Science and Engineering from ANITS,
Visakhapatnam, hereby declare that the project work entitled PARKINSON’S
DISEASE PREDICTION USING MACHINE LEARNING TECHNIQUES is
carried out by us and submitted in partial fulfillment of the requirements for the award
of Bachelor of Technology in Computer Science Engineering, under Anil
Neerukonda Institute of Technology & Sciences (A) during the academic year 2017-2021
and has not been submitted to any other university for the award of any kind of degree.

K. MANOJ 317126510078
I. MANIKANTA 318126510L13
CH. DEEKSHITH 318126510L16
K.V. MUKESH 318126510L20

ACKNOWLEDGEMENT

We would like to express our deep gratitude to our project guide Dr. K.S. Deepthi,
Associate Professor, Department of Computer Science and Engineering, ANITS, for her
guidance with unsurpassed knowledge and immense encouragement. We are grateful to
Dr. R. Sivaranjani, Head of the Department, Computer Science and Engineering, for
providing us with the required facilities for the completion of the project work.
We are very much thankful to the Principal and Management, ANITS, Sangivalasa, for
their encouragement and cooperation to carry out this work.
We express our thanks to our Project Coordinator, Dr. K.S. Deepthi, for her continuous support
and encouragement. We thank all the teaching faculty of the Department of CSE, whose
suggestions during reviews helped us in the accomplishment of our project. We would like to
thank Mrs. B.V. Udaya Lakshmi of the Department of CSE, ANITS, for providing great
assistance in the accomplishment of our project.
We would like to thank our parents, friends, and classmates for their encouragement
throughout our project period. Last but not least, we thank everyone who supported
us directly or indirectly in completing this project successfully.

PROJECT STUDENTS

K. MANOJ (317126510078),
I. MANIKANTA (318126510L13),
CH. DEEKSHITH (318126510L16),
K.V. MUKESH (318126510L20).

ABSTRACT

Large medical datasets are available in various data repositories and can
be used to identify diseases. Parkinson's is considered one of the most
serious progressive nervous system diseases affecting movement. It is the
second most common neurological disorder that causes disability, reduces
the life span, and still has no cure. Nearly 90% of the people affected by
this disease have speech disorders. Machine learning algorithms help to
extract useful information from such data, and they can be used to detect
diseases in the early stages and thereby increase the lifespan of elderly
people. Speech features are the central concept when considering
Parkinson's. In this work, we use Machine Learning techniques such as
KNN, Naïve Bayes, and Logistic Regression to predict Parkinson's
disease, where the algorithms are trained on a speech dataset and the
prediction is made from input taken from the user. Based on these
features we identify the algorithm that gives the highest accuracy. The
accuracies obtained for the three algorithms are 80% for KNN, 79% for
Logistic Regression, and 81% for Naïve Bayes; the Naïve Bayes model,
having the highest accuracy, is used in the frontend to predict whether a
patient has Parkinson's or not. Early prediction is important for patients
to recover in the early stages, and this process can be done with the help
of Machine Learning.

Keywords: Parkinson’s, Machine Learning, Speech disorders, KNN,


Naïve Bayes, Logistic regression.

CONTENTS
ABSTRACT v
LIST OF FIGURES ix
LIST OF TABLES xi
LIST OF ABBREVIATIONS xi

CHAPTER 1 INTRODUCTION 01
1.1 Introduction to Parkinson’s disease 01
1.2 Parkinson’s disease symptoms 04
1.3 Introduction to Machine Learning 04
1.3.1 Supervised learning 05
1.3.2 Unsupervised learning 07
1.3.3 Applications of Machine Learning 07
1.4 Motivation of the work 09
1.5 Problem Statement 09
1.6 Organization of Thesis 10

CHAPTER 2 LITERATURE SURVEY 11

CHAPTER 3 METHODOLOGY 16
3.1 Proposed System 16
3.1.1 System Architecture 16
3.2 Modules Division 17
3.2.1 Speech Dataset 17
3.2.2 Data Pre-processing 20
3.2.3 Training data 23
3.2.4 Apply Machine Learning Algorithms 23
3.2.4.1 K-Nearest Neighbor 24
3.2.4.2 Naïve Bayes 27
3.2.4.3 Logistic Regression 29
3.2.5 Testing Data 32

3.3 User Interface 32

CHAPTER 4 EXPERIMENTAL ANALYSIS AND RESULTS 33


4.1 System Requirements 33
4.1.1 Functional Requirements 33
4.1.2 Non-Functional Requirements 33
4.2 System Configuration 34
4.2.1 Software Requirements 34
4.2.1.1 Introduction to Python 34
4.2.1.2 Introduction to Flask Framework 35
4.2.1.3 Python Libraries 36
4.2.2 Hardware Requirements 37
4.3 Feasibility Study 38
4.3.1 Economic Feasibility 38
4.3.2 Technical Feasibility 38
4.3.3 Operational Feasibility 39
4.4 Sample Code 40
4.5 Experimental analysis and Performance Measures 56
4.5.1 Confusion Matrix 56
4.5.2 Accuracy 57
4.5.3 Precision 57
4.5.4 Recall (or) Sensitivity 57
4.5.5 F1-Score 57
4.5.6 Specificity 57
4.5.7 LR- 58
4.5.8 LR+ 58
4.5.9 Odd Score 58
4.5.10 Youden Score 58
4.5.11 Comparative Analysis 58

4.6 Results 60

CHAPTER 5 CONCLUSION AND FUTURE WORK 65
5.1 Conclusion 65
5.2 Future Work 65

REFERENCES 66

LIST OF FIGURES

Fig. No. Name Page No.

1.1 Structure of Neuron 02

3.1 System architecture 17

3.2 Sample of acquired speech dataset from kaggle 18

3.3 Reading the data from the CSV file into notebook 19

3.4 Gender count 20

3.5 Correlation matrix 21

3.6 Dropping unnecessary features from data frame 21

3.7 Process of Feature selection and sample data 22

3.8 Splitting dataset into training data and test data 23

3.9 KNN Graph-1 24

3.10 KNN Graph-2 25

3.11 Finding K value using error rate 26

3.12 KNN classifier model 26

3.13 Naïve Bayes classifier model 29

3.14 S-shaped curve 30

3.15 Logistic Regression model 31

4.1 Comparative analysis of three models 59

4.2 Comparative analysis of three models using evaluation metrics 59

4.3 KNN Test Accuracy 60

4.4 Naïve Bayes Test Accuracy 61

4.5 Logistic Regression Test Accuracy 61

4.6 User Interface to enter the patient details 62

4.7 Prediction Result showing for fig 4.6 details 62

4.8 User Interface to enter another patient's details 63

4.9 Prediction Result showing for fig 4.8 details 63

LIST OF TABLES

Table No. Table Name Page No.

4.1 Confusion Matrix 56

LIST OF ABBREVIATIONS

DNA Deoxyribonucleic Acid
PD Parkinson’s Disease
IT Information Technology
KNN K-Nearest Neighbor
CSV Comma Separated Values
NB Naïve Bayes
HTML Hyper Text Markup Language
CSS Cascading Style Sheets

CHAPTER 1 – INTRODUCTION

1.1 Introduction to Parkinson’s disease:


A recent report of the World Health Organization shows a visible increase in the
number of Parkinson’s disease patients and in the health burden of the disease. In China,
the disease is spreading so fast that it is estimated to account for nearly half of the
world's Parkinson's population within the next 10 years. Classification algorithms are mainly used in the medical field for
classifying data into different categories according to a number of characteristics.
Parkinson’s disease is the second most dangerous neurological disorder; it can lead
to shaking, shivering, stiffness, and difficulty with walking and balance. It is caused mainly
by the breakdown of cells in the nervous system. Parkinson’s can have both
motor and non-motor symptoms. The motor symptoms include slowness of movement,
rigidity, balance problems, and tremors. If this disease continues, the patients may have
difficulty walking and talking. The non-motor symptoms include anxiety, breathing
problems, depression, loss of smell, and changes in speech. If the above-mentioned
symptoms are present in a person, then the details are stored in the records. In this
work, we consider the speech features of the patient, and this data is used for
predicting whether the patient has Parkinson’s disease or not.
Neurodegenerative disorders are the result of progressive deterioration and loss of neurons in
different areas of the nervous system. Neurons are the functional units of the brain. They
are contiguous rather than continuous. A healthy-looking neuron, as shown in Fig
1.1, has extensions called dendrites or axons, a cell body, and a nucleus that contains our
DNA. DNA is our genome, and each of the hundred billion neurons contains our entire genome
packaged into it. When a neuron gets sick, it loses its extensions and hence its
ability to communicate; its metabolism becomes low, so it
starts to accumulate junk, which it tries to contain in little packages, in little
pockets. When things become worse, and if the neuron is in a cell culture, it completely
loses its extensions and becomes round and full of vacuoles.

Fig-1.1 Structure of Neuron

This work deals with the prediction of Parkinson’s disorder, which is nowadays a
tremendously increasing, incurable disease. Parkinson’s disease is a rapidly spreading
disease which gets its name from James Parkinson, who earlier described it as
paralysis agitans; later it became known by his surname as PD. It generally affects the
neurons which are responsible for overall body movements. The main chemicals involved are
dopamine and acetylcholine, which affect the human brain. There are various
environmental factors which have been implicated in PD; listed below are the factors which
cause Parkinson’s disease in an individual.
 Environmental factors: Environment is defined as the surroundings or the
place in which an individual lives. The environment is a major factor that
not only affects the human brain but also affects all the living organisms
that live in its vicinity. Much research and evidence have proved
that the environment has a big hand in the development of neurodegenerative
disorders, mainly Alzheimer’s and Parkinson’s. The environmental
factors that are influencing neurodegenerative disorders at a high pace are:
 Exposure to heavy metals (like lead and aluminum) and pesticides.
 Air Quality: Pollution results in respiratory diseases.

 Water quality: Biotic and Abiotic contaminants present in water lead to
water pollution.
 Unhealthy lifestyle: It leads to obesity and a sedentary lifestyle.
 Psychological stress: It increases the level of stress hormone that
depletes the functions of neurons.

 Brain injuries or Biochemical Factors: The brain is the control center of our
entire body. Due to certain trauma, people suffer brain injuries, which cause
certain biochemical enzymes to come into the picture; these provide neurons
with stability and support some chromosomes and genes in maintenance.

 Aging Factor: Aging is one of the reasons for the development of Parkinson’s
disease. According to the author, in India 11,747,102 people out of
1,065,070,607 are affected by Parkinson’s disease.

 Genetic factors: The genetic factor is considered the main molecular
physiological cause that leads to neurodegenerative disorders. The size,
depth, and effect of the actions of different genes define the status or level of
a neurodegenerative disease, which increases gradually over time. The
genetic factors which lead to neurodegenerative disorders are mainly categorized
into pharmacodynamics and pharmacokinetics.

 Speech Articulation factors: Due to the conditions associated with Parkinson’s
disease (rigidity and bradykinesia), some speech-language pathologies such as
voice, articulation, and swallowing alterations are found. There are various
ways in which Parkinson’s disease (PD) might affect the individual:
 The voice gets breathy and softer.
 Speech may be slurred.
 The person finds difficulty in finding the right words, due to which
speech becomes slower.

1.2 Parkinson’s disease symptoms

The symptoms of Parkinson’s disease are broadly divided into two categories.


 Motor symptoms: These are symptoms involving voluntary action. They
indicate movement-related disorders like tremors, rigidity, freezing,
bradykinesia, or any voluntary muscle movement disorder.
 Non-Motor symptoms: Non-motor symptoms include disorders of mood and
affect with apathy, cognitive dysfunction, as well as complex behavioral
disorders. There are two other categories of PD which are distinguished by doctors:
primary symptoms and secondary symptoms.
 Primary symptoms: These are the most important symptoms. Primary symptoms are
rigidity, tremor, and slowness of movement.
 Secondary symptoms: These are symptoms that directly impact the life of an
individual. They can be either motor or non-motor, and their effect differs from person
to person. A very wide range of symptoms is associated with Parkinson‘s.
Besides these symptoms, there are some other symptoms found that accompany
Parkinson’s disease, such as micrographia, decreased olfaction,
postural instability, slowing of the digestive system, constipation, fatigue,
weakness, and hypotension. Speech difficulties, i.e. dysphonia (impaired speech
production) and dysarthria (speech articulation difficulties), are found in patients
with Parkinson’s.

1.3 Introduction to Machine Learning

Machine Learning is a sub-area of AI, whereby the term refers to the
ability of IT systems to independently find solutions to problems by
recognizing patterns in databases. In other words, Machine Learning enables
IT systems to recognize patterns on the basis of existing algorithms and data
sets and to develop adequate solution concepts. Therefore, in Machine
Learning, artificial knowledge is generated on the basis of experience. In order
to enable the software to independently generate solutions, the prior action of
people is important. For example, the required algorithms and data must be
fed into the systems in advance, and the respective analysis rules for the
recognition of patterns in the data stock must be defined. Once these two steps
have been completed, the system can perform the following tasks using Machine
Learning:

 Finding, extracting and summarizing relevant data


 Making predictions based on the analysis data
 Calculating probabilities for specific results

Basically, algorithms play a crucial role in Machine Learning: on the one
hand, they are responsible for recognizing patterns, and on the other hand, they
generate solutions. Algorithms can be divided into different categories:

1.3.1 Supervised learning:


In the course of supervised learning, example models are defined beforehand.
In order to ensure an adequate allocation of the information to the respective
model groups of the algorithms, these then need to be specified. In other
words, the system learns on the basis of given input and output pairs. In
the course of supervised learning, a programmer, who acts as a kind of teacher,
provides the appropriate values for a specific input. The aim is to train the
system in the context of successive calculations with different inputs and
outputs to establish connections.
Supervised learning is where you have input variables (X) and an output
variable (Y) and you use an algorithm to learn the mapping function
from the input to the output, Y = f(X). The goal is to approximate the mapping
function so well that when you have new input data (X) you can
predict the output variables (Y) for that data. It is called supervised learning
because the process of an algorithm learning from the training dataset can be
thought of as a teacher supervising the learning process. We know the
correct answers; the algorithm iteratively makes predictions on the training
data and is corrected. Learning stops when the algorithm achieves an acceptable
level of performance.
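As a minimal, hypothetical sketch of this idea (the toy inputs below are illustrative and are not part of the project's dataset), a model is fitted on known (X, Y) pairs and then asked to predict Y for unseen X:

# Minimal sketch of supervised learning on toy data (illustrative only)
from sklearn.linear_model import LogisticRegression

# Input variables X and the known outputs Y (the "teacher's" correct answers)
X = [[1.0], [2.0], [3.0], [4.0]]
Y = [0, 0, 1, 1]

model = LogisticRegression()   # learns an approximation of the mapping f in Y = f(X)
model.fit(X, Y)                # iteratively corrected against the known Y values

print(model.predict([[2.5]]))  # predict Y for new, unseen input X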

Techniques of Supervised Machine Learning algorithms include linear and
logistic regression, multi-class classification, Decision Tree, and Support
Vector Machine.

Supervised Learning problems can be further grouped into Regression and
Classification problems. The difference between these two is that the
dependent attribute is numerical for regression and categorical for
classification:

 Regression:
Linear regression is a linear model, e.g. a model
that assumes a linear relationship between the input variables (x) and
the single output variable (y). More specifically, y can be
calculated from a linear combination of the input variables (x).

When there is one input variable (x), the method is referred to as
simple linear regression. When there are multiple input variables,
the statistics literature often refers to the method as multiple linear
regression.

 Classification:
Classification is the process of categorizing a given
set of data into classes. It can be performed on both structured and
unstructured data. The method starts with predicting the class of given
data points. The classes are often referred to as target, label, or
categories.

In short, classification either predicts categorical class labels or
classifies data based on the training set and the values (class
labels) of the classifying attributes, and uses them in classifying new data.

There is a variety of classification models. Classification
models include Logistic Regression, Decision Tree, Random Forest,
Gradient Boosted Tree, One-vs.-One, and Naïve Bayes.

1.3.2 Unsupervised learning:

In unsupervised learning, the AI learns without predefined target values and
without rewards. It is mainly used for learning segmentation (clustering). The
machine tries to structure and sort the data entered according to certain
characteristics. For instance, a machine could (very simply) learn that coins of
various colors can be sorted according to the characteristic "color" in order
to structure them. Unsupervised Machine Learning algorithms are used when
the information used to train is neither classified nor labeled. The system
does not determine the right output but explores the data and draws
inferences from the datasets to describe hidden structures in unlabeled data.
Unsupervised Learning is the training of machines using information
that is neither classified nor labeled, allowing the algorithm to act on that
information without guidance.

Unsupervised Learning is divided into two categories of algorithms:

 Clustering: A clustering problem is where you want to discover
the inherent groupings in the data, such as grouping customers by
purchasing behavior.
 Association: An association rule learning problem is where you
want to discover rules that describe large portions of your data,
such as people who buy X also tend to buy Y.

1.3.3 Applications of Machine Learning:


Virtual Personal Assistants:
Siri, Alexa, and Google Now are some of the popular examples of
virtual personal assistants. As the name suggests, they help find
information when asked over voice. Machine learning is a crucial
part of these personal assistants, as they collect and refine the
information on the basis of your previous involvement with them.
Later, this set of information is used to render results that are
tailored to your preferences.

Virtual Assistants are integrated into a variety of platforms. For example:

• Smart Speakers : Amazon Echo and Google Home

• Smartphone : Samsung Bixby on Samsung S8

• Mobile Apps : Google Allo

Video Surveillance:
 Imagine one person monitoring multiple video cameras! Certainly a
difficult job to do, and a boring one as well. This is why the idea of
training computers to do this job makes sense.
 Video surveillance systems nowadays are powered by AI, which makes
it possible to detect crimes before they happen. They track unusual
behavior of individuals, like standing motionless for an extended time,
stumbling, or napping on benches, etc. The system can thus alert
human attendants, which may ultimately help to avoid
mishaps. And when such activities are reported and confirmed to be true,
they help to improve the surveillance services. This happens with
machine learning doing its job at the backend.

Social Media Services:

From personalizing your news feed to better ad targeting, social
media platforms are utilizing machine learning for their own
and their users' benefit.
 People You May Know

 Face Recognition

Search Engine Result Refining:
Google and other search engines use machine learning to improve
the search results for you. Every time you execute a search, the
algorithms at the backend keep a watch on how you respond to the
results. If you open the top results and stay on the web page
for long, the search engine assumes that the results it displayed were in
accordance with the query. Similarly, if you reach the second or third
page of the search results but do not open any of the results, the
search engine estimates that the results served did not match the
requirement. This way, the algorithms working at the backend
improve the search results.

1.4 Motivation of the work


Many people aged 65 or more have a neurodegenerative disease, which has
no cure. If we detect the disease in the early stages, then we can control it. Almost 30%
of such patients are facing this incurable disease. Current treatment is available for
patients who have minor symptoms; if the symptoms cannot be found at the early
stages, the disease can lead to death. The main cause of Parkinson’s disease is the accumulation of
protein molecules in the neuron which become misfolded, thereby causing Parkinson’s
disease. So far, researchers have identified the symptoms and the root causes, i.e. from where
this disease evolves, but only a few symptoms have a cure and there are
many symptoms that have no solution. So, in this era where Parkinson’s disease is
increasing, it is very important to find a solution which can predict it in its early
stages.

1.5 Problem Statement

The main aim is to build a prediction model whose efficiency would be beneficial for
patients who are suffering from Parkinson’s, so that the impact of the disease can be
reduced. Generally, in the first stage, Parkinson's can be controlled with proper treatment.

So it is important to identify PD at an early stage for the betterment of the patients.
The main purpose of this research work is to find the best prediction model, i.e. the best
machine learning technique which will distinguish a Parkinson’s patient from a
healthy person. The techniques used for this problem are KNN, Naïve Bayes, and
Logistic Regression. The experimental study is performed on the voice dataset of
Parkinson’s patients, which is downloaded from Kaggle. The prediction is evaluated
using evaluation metrics like the confusion matrix, precision, recall, accuracy, and F1-score.
We used feature selection, where the important features are taken into
consideration to detect Parkinson’s.

1.6 Organization of Thesis

The chapters in this document are organized as follows:

Chapter 1 introduces Parkinson’s disease and its different types of
symptoms, and gives clear insights about our project domain and related
concepts.
Chapter 2 gives an account of the literature survey, where the
different existing methods and models are examined.
Chapter 3 deals with the problem statement and specifies the proposed
system with its system architecture along with the machine learning techniques used.
Chapter 4 describes the experimental analysis of our system along with
performance measures and comparisons between different models. It also covers
the implementation along with sample code.
Chapter 5 gives the conclusion of our work and the future scope.

CHAPTER 2 – LITERATURE SURVEY

Speech or voice data is assumed to be 90% helpful in diagnosing a person for
the presence of the disease. It is one of the most important problems that has to be
detected in the early stages so that the progression rate of the disease is reduced. Many
researchers work on different datasets to predict the disease more efficiently. In general,
persons with PD suffer from speech problems, which can be categorized into two types:
hypophonia and dysarthria. Hypophonia indicates a very soft and weak voice, and
dysarthria indicates slow speech that can hardly be understood at one time;
it is caused by damage to the central nervous system. So, most of the clinicians who treat
PD patients observe dysarthria and try to rehabilitate them with specific treatments to
improve vocal intensity. Lots of researchers have worked on data pre-processing and
feature selection in the past.

Anila M and Dr G Pradeepini proposed the paper titled “Diagnosis of Parkinson’s disease
using Artificial Neural network” [2]. The main objective of this paper is the detection
of the disease using voice analysis of people affected with
Parkinson's disease. For this purpose, various machine learning techniques like ANN,
Random Forest, KNN, SVM, and XGBoost are used to select the best model; error rates are
calculated and the performance metrics are evaluated for all the models used. The main
drawback of this paper is that it is limited to an ANN with only two hidden layers, and this
type of neural network with two hidden layers is sufficient and efficient only for simple
datasets. They used only one technique for feature selection, which reduces the number of
features.

Arvind Kumar Tiwari proposed the paper titled “Machine Learning-based Approaches for
Prediction of Parkinson’s Disease” [3]. In this paper, minimum redundancy maximum
relevance feature selection algorithms were used to select the most important features among
all the features to predict Parkinson’s disease. Here, it was observed that the random forest
with 20 features selected by the minimum redundancy maximum relevance feature
selection algorithm provides an overall accuracy of 90.3%, precision of 90.2%, a Matthews
correlation coefficient value of 0.73, and an ROC value of 0.96, which is better in comparison
to all other machine learning based approaches such as bagging, boosting, random forest,
rotation forest, random subspace, support vector machine, multilayer perceptron, and
decision tree based methods.

Mohamad Alissa proposed the paper titled “Parkinson’s Disease Diagnosis Using Deep
Learning” [14]. This project mainly aims to automate the PD diagnosis process using deep
learning, Recursive Neural Networks (RNN) and Convolutional Neural Networks (CNN),
to differentiate between healthy and PD patients. Besides that, since different datasets may
capture different aspects of this disease, this project aims to explore which PD test is more
effective in the discrimination process by analysing different imaging and movement
datasets (notably the cube and spiral pentagon datasets). In general, the main aim of this paper
is to automate the PD diagnosis process in order to discover this disease as early as possible.
If the disease is discovered earlier, the treatments are more likely to improve the quality
of life of the patients and their families.
There are some limitations to this paper, namely:
 They used the validation set only to investigate the model performance during
training, and this reduced the number of samples in the training set.
 RNN training is too slow, and this is not flexible in practice.
 Disconnecting and resource exhaustion: working with cloud services like Google
Colaboratory causes many problems like sudden disconnection, and because it
is a service shared across world zones, it leads to resource exhaustion errors many
times.

Afzal Hussain Shahid and Maheshwari Prasad Singh proposed the paper titled “A deep
learning approach for prediction of Parkinson’s disease progression” [19]. This paper
proposed a deep neural network (DNN) model using the reduced input feature space of
the Parkinson’s telemonitoring dataset to predict Parkinson’s disease (PD) progression, and also
proposed a PCA based DNN model for the prediction of Motor-UPDRS and Total-UPDRS
in Parkinson's disease progression. The DNN model was evaluated on a real-world PD
dataset taken from UCI. Being a DNN model, the performance of the proposed model may
improve with the addition of more data points to the dataset.

T. J. Wroge, Y. Özkanca, C. Demiroglu, D. Si, D. C. Atkins and R. H. Ghomi proposed
the paper titled “Parkinson’s Disease Diagnosis Using Machine Learning and Voice” [24],
which explores the effectiveness of using supervised classification algorithms, such as
deep neural networks, to accurately diagnose individuals with the disease. Historically, PD
has been difficult to quantify, and doctors have tended to focus on some symptoms while
ignoring others, relying primarily on subjective rating scales. The analysis of this paper
provides a comparison of the effectiveness of various machine learning classifiers in
disease diagnosis with noisy and high dimensional data. The peak accuracy of 85%
provided by the machine learning models exceeds the average clinical diagnosis accuracy
of non-experts (73.8%) and the average accuracy of movement disorder specialists (79.6%
without follow-up, 83.9% after follow-up), with pathological post-mortem examination as
ground truth.

Siva Sankara Reddy Donthi Reddy and Udaya Kumar Ramanadham proposed the paper
“Prediction of Parkinson’s Disease at Early Stage using Big Data Analytics” [21]. This
paper mainly describes various Big Data analytical techniques that may be used in
diagnosing the right disease at the right time. The main intention is to verify the accuracy of
prediction algorithms. Their future study aims to propose an efficient method to diagnose
this type of neurological disorder from symptoms at an early stage with better accuracy,
using different Big Data analytical techniques like Hadoop, Hive, R programming,
MapReduce, Pig, ZooKeeper, HBase, Cassandra, Mahout, etc.

Daiga Heisters proposed the paper titled “Parkinson’s: symptoms, treatments and research”
[9]. This paper initially says that current treatments can help to ease the symptoms but none
can repair the damage in the brain or slow the progress of the condition. Parkinson’s
UK researchers are working to develop new treatments that can, and by working together
to build on existing discoveries and explore these innovative areas of research, it is hoped
that a cure for Parkinson’s will be found. Parkinson’s UK offers support for everyone
affected, including people with the condition, their family, friends and carers, researchers
and professionals working in this area.

T. Swapna and Y. Sravani Devi proposed a paper titled “Performance Analysis of
Classification algorithms on Parkinson’s Dataset with Voice Attributes” [23]. This paper
deals with the application of seven classification algorithms on the acquired dataset,
drawing out a comparison of the results to one another and predicting
whether a person is healthy or affected by Parkinson’s disease from the given data. The results
of the selected algorithms, namely Naïve Bayes, Random Forest, Neural Networks,
Decision Trees, AdaBoost, SVM, and KNN, were compared and tabulated according to the
outputs derived with the help of Python, using the scikit-learn libraries. The final accuracy was
calculated using these parameters. The Random Forest algorithm gives the optimum accuracy
of 78.56%, which is closely followed by the Decision Tree algorithm with an optimal accuracy
of 77.63%. Following the Decision Tree algorithm is the MLP classifier with an optimal
accuracy of 76.72%, and lastly the Naïve Bayes algorithm, which has an optimal accuracy
of 70.82%. Finally, these algorithms can help in classifying whether a person is affected
by Parkinson’s disease or not.

M. Abdar and M. Zomorodi-Moghadam proposed the paper “Impact of Patients’ Gender on
Parkinson’s disease using Classification Algorithms” [10]. In this paper, the authors chose
the UCI PD dataset for finding the accuracy of Parkinson’s prediction using SVM and Bayesian
Network algorithms. They chose the ten most important features in the dataset to
predict PD. The output variable is sex and the other factors are inputs; the authors provide an
approach for finding relationships between genders. The result obtained is that the SVM algorithm
gives better performance than the Bayesian Network, with 90.98% accuracy.

Dragana Miljkovic et al. proposed the paper “Machine Learning and Data Mining Methods
for Managing Parkinson’s Disease” [7]. In this paper, the authors concluded that, based on
the medical tests taken by the patients, the predictor component was able to predict 15 different
Parkinson’s symptoms separately. The machine learning and data mining techniques are
applied to the different symptoms separately and give accuracies ranging between 57.1% and
77.4%, where tremor detection has the highest accuracy.

Sriram, T. V., et al. proposed the paper “Intelligent Parkinson Disease Prediction Using
Machine Learning Algorithms” [22]. In this paper, the authors used voice measures of the
patients to check whether a patient has Parkinson’s or not. They applied the dataset
to various machine learning algorithms and found the maximum accuracy. To analyse the
models, they used the ROC curve and a sieve graph. The random forest gives the highest
accuracy, i.e. 90.26%.

Dr. R. GeethaRamani, G. Sivagami, and Shomona Gracia Jacob proposed the paper “Feature
Relevance Analysis and Classification of Parkinson’s Disease Tele-Monitoring data
Through Data Mining” [6]. In this paper, the authors used thirteen classification algorithms
to diagnose the disease. They used the tele-monitoring dataset, which contains 16
biomedical voice features, for evaluating the system. The aim of this paper is to predict
motor UPDRS and total UPDRS from the voice measures.

A. Ozcift proposed the paper “SVM feature selection based rotation forest ensemble
classifiers to improve computer-aided diagnosis of Parkinson disease” [1]. In this paper,
the author aims to improve PD diagnosis accuracy with the use of support
vector machine feature selection. To evaluate the performance, the author used the accuracy,
kappa statistic, and area under the curve of the classification algorithms. A Rotation
Forest ensemble of these classifiers is used to increase the performance of the system.

CHAPTER 3 – METHODOLOGY
3.1 Proposed system
3.1.1 System Architecture
Machine learning has given computer systems the ability to learn automatically
without being explicitly programmed. In this work, we have used three machine learning
algorithms (Logistic Regression, KNN, and Naïve Bayes). The architecture diagram
describes a high-level overview of the major system components and important working
relationships. It represents the flow of execution and involves the following major
steps:
 The architecture diagram is defined with the flow of the process, which is used to
refine the raw data used for predicting Parkinson’s.
 The next step is preprocessing the collected raw data into an understandable format.
 Then we have to split the dataset into training data and test data.
 The Parkinson’s data is evaluated with the application of the machine learning
algorithms, that is, Logistic Regression, KNN, and Naïve Bayes, and the
classification accuracy of each model is found.
 After training the data with these algorithms, we have to test on the same algorithms.
 Finally, the results of these three algorithms are compared on the basis of classification
accuracy.

[Figure: flow of Speech Dataset → Pre-processing Data → Training Data → Apply Machine Learning Algorithms (KNN, Naïve Bayes, Logistic Regression) → Test Data → Output]

Fig-3.1 System Architecture

3.2 Modules Division


Let us discuss the various modules in our proposed system and what each
module contributes towards achieving our goal.

3.2.1 Speech Dataset:


The main aim of this step is to identify and acquire all the data related to the problem. During
this step, we need to identify the various data sources, as data can be collected from various
sources like files and databases. The quantity and quality of the collected data will determine
the efficiency of the output: the more data there is, the more accurate the prediction
will be. We have collected our data from the Kaggle website.

Fig-3.2 Sample of acquired speech dataset from kaggle

In the above Fig-3.2, we can see the speech dataset that was collected from the Kaggle
website. This acquired dataset has around 756 patients’ records, and each row has 755 different
voice features. In this work, we chose 10 main features required for the
prediction.
The features are listed below:
 Id
 Gender
 PPE(Pitch Period Entropy)
 DFA(Detrended Fluctuation Analysis)
 RPDE(Recurrent Period Density Entropy)

 numPulses
 numPeriodPulses
 meanPeriodPulses
 stdDevPeriodPulses
 locPctJitter
 locAbsJitter
 rapJitter
 locShimmer, etc.

Fig-3.3 Reading the dataset from the CSV file into notebook

The dataset we chose is in the form of a CSV (Comma Separated Values) file. After
acquiring the data, our next step is to read the data from the CSV file into Google
Colab, also called a Python notebook. The Python notebook is used in our project for data pre-
processing, feature selection, and model comparison. In Fig-3.3, we have shown
how to read data from CSV files using the inbuilt Python functions that are part of the
pandas library.
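A minimal sketch of this loading step is given below; the file name pd_speech_features.csv and the gender column name are assumptions and should be adjusted to match the actual Kaggle download.

# Minimal sketch of loading the speech dataset into a notebook with pandas
import pandas as pd

df = pd.read_csv("pd_speech_features.csv")   # assumed file name of the Kaggle CSV

print(df.shape)                      # roughly 756 rows (patients) and 755 feature columns
print(df.head())                     # preview the first few records, as in Fig 3.3
print(df["gender"].value_counts())   # gender distribution (Fig 3.4); column name assumed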
When compared across genders, Parkinson’s disease is mostly found in males
rather than females, and accordingly the dataset we chose consists of more male subjects. In Fig-3.4, we
have shown that the male subjects outnumber the female subjects.

Fig-3.4 Gender count

3.2.2 Data Pre-Processing:

The main aim of this step is to study and understand the nature of the data that was
acquired in the previous step and also to assess the quality of the data. Real-world data
generally contains noise and missing values, and may be in an unusable format that cannot be
directly used for machine learning models. Data pre-processing is a required task for
cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the machine learning model. Identifying duplicates in the
dataset and removing them is also done in this step.

Actually, in this dataset we have 755 features, out of which some may not be useful
in building our model. So, we have to leave out all those unnecessary features which are
not responsible for producing the output. If we take more features into this model, the accuracy
we get is lower. When we check the correlation of the features, some of them are the same.
In Fig 3.5, a screenshot of our notebook shows the correlation of the columns, where two
of the columns have similar values. So, one of them is removed.

Fig-3.5 Correlation matrix

As the correlation values of the two attributes are similar, one of them can be
removed; this kind of feature must be dropped. As our data is now stored as a data frame
in the Python notebook, we can easily drop those unnecessary features using the inbuilt
functions. In Fig 3.6, a screenshot of our notebook is shown where we have dropped some
features.

Fig-3.6 Dropping unnecessary features from data frame
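A minimal sketch of this correlation check and feature dropping is given below, assuming the DataFrame df from the previous step; the 0.99 correlation cut-off is an illustrative choice, not the exact value used in our notebook.

# Minimal sketch: find highly correlated columns (Fig 3.5) and drop one of each pair (Fig 3.6)
corr = df.corr()

to_drop = set()
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > 0.99:   # near-duplicate information
            to_drop.add(cols[j])

df = df.drop(columns=list(to_drop))       # keep only one column from each redundant pair
print(df.shape)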

After identifying and dropping some features, the initial 755 features that we have
are reduced to 10 features. Those features are as follows:
 Id
 Gender
 PPE(Pitch Period Entropy)
 DFA(Detrended Fluctuation Analysis)
 RPDE(Recurrent Period Density Entropy)
 numPulses
 numPeriodPulses
 meanPeriodPulses
 stdDevPeriodPulses
 locPctJitter

After pre-processing the acquired data, the next step is to identify the best features.
The identified best features should be able to give high efficiency. In Fig 3.7, a
screenshot of our notebook shows how to select the k best features using scikit-learn. The
classes within the sklearn.feature_selection module can be used for feature
selection/dimensionality reduction on sample sets, either to enhance estimators’
accuracy scores or to boost their performance on very high-dimensional datasets.

Fig-3.7 Process of Feature selection and sample data
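The following is a hedged sketch of such a selection using SelectKBest; the label column name "class" and the f_classif scoring function are assumptions for illustration and may differ from the exact choices shown in Fig 3.7.

# Minimal sketch of selecting the k best features with scikit-learn
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=["class"])    # candidate input features (label column name assumed)
y = df["class"]                   # 1 = Parkinson's, 0 = healthy

selector = SelectKBest(score_func=f_classif, k=10)   # keep the 10 highest-scoring features
X_new = selector.fit_transform(X, y)

selected = X.columns[selector.get_support()]
print(list(selected))             # names of the retained features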

3.2.3 Training data:

Splitting the dataset into Training set and testing set:


In machine learning data preprocessing, we have to break our dataset into a
training set and a test set. This is one of the crucial steps of data
preprocessing, as by doing this we can enhance the performance of our machine learning
model.
Suppose we give training to our machine learning model using one dataset and
then test it on a totally different dataset. This will create difficulties for our model to
understand the correlations between the features and the output.
If we train our model very well and its training accuracy is also very high, but
we then offer it a new dataset, its performance will decrease. So we always
attempt to make a machine learning model which performs well with the training set and
also with the test dataset.

Fig-3.8 Splitting dataset into training data and test data

Usually, we split the dataset into train and test sets in the ratio of 7:3, i.e., 70 percent of the data is
used for training and 30 percent of the data is used for testing the model. We have done it in
the same way, as shown in the above Fig 3.8.
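A minimal sketch of this 70/30 split with scikit-learn, assuming X and y from the feature selection step (the random_state value is an illustrative choice):

# Minimal sketch of the 70/30 train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 70% train, 30% test

print(X_train.shape, X_test.shape)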

3.2.4 Apply Machine Learning Algorithms:

Now we have both the train and test data. The subsequent step is to identify the possible
training methods and train our models. As this is a classification problem, we have used
three different classification methods: KNN, Naïve Bayes, and Logistic Regression. Each
algorithm has been run over the training dataset, and their performance in terms of accuracy
is evaluated alongside the predictions made on the testing data set.

3.2.4.1 K-Nearest Neighbor:

The k-nearest neighbors (KNN) algorithm is a simple, supervised
machine learning algorithm that can be used to solve both classification and
regression problems. It is easy to implement and understand, and it belongs to the
supervised learning domain.
Let m be the number of training data samples. Let p be an unknown point.
 Store the training samples in an array of data points arr[]. This means
each element of this array represents a tuple (x, y).
 for i = 0 to m:
 Calculate the Euclidean distance d(arr[i], p).
 Make a set S of the K smallest distances obtained. Each of these distances
corresponds to an already classified data point.
 Return the majority label among S. A runnable sketch of these steps is given below.
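The following is a small, self-contained Python sketch of these steps on toy data (not the project's code); math.dist requires Python 3.8 or later.

# Minimal sketch: classify an unknown point p by a majority vote among its K nearest neighbours
from collections import Counter
import math

def knn_predict(train_points, train_labels, p, k=3):
    # Euclidean distance from p to every stored training sample
    distances = [
        (math.dist(x, p), label)
        for x, label in zip(train_points, train_labels)
    ]
    # the K smallest distances form the set S
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    # return the majority label among S
    return Counter(label for _, label in k_nearest).most_common(1)[0][0]

# toy example: two features per sample, labels "Red"/"Blue"
points = [(40, 48), (50, 52), (44, 51), (10, 12), (12, 15)]
labels = ["Red", "Red", "Blue", "Blue", "Blue"]
print(knn_predict(points, labels, p=(45, 50), k=3))   # -> "Red"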

Let us see how this algorithm works with the help of a simple example. Suppose
the dataset has two variables, which are plotted as shown in Fig 3.9.

Fig-3.9 KNN Graph-1

Your task is to classify a new data point 'X' into the "Blue" class or the "Red"
class. The coordinate values of the data point are x=45 and y=50. If the K value is
3, then the KNN algorithm starts by calculating the distance of point X from all the
points. It then finds the three nearest points with the least distance to point X. This
process is shown in Fig 3.10, where the three nearest points have
been encircled.

Fig-3.10 KNN Graph-2

The final step of the KNN algorithm is to assign the new point to the class
to which the majority of the three nearest points belong. From the figure above we can
see that two of the three nearest points belong to the class "Red" while one belongs
to the class "Blue". Therefore the new data point will be classified as "Red".

In KNN, finding the value of K is not so easy, so we use the error rate to identify
a good K value. We compute the error at each value of K and, from
that, identify the value which gives the minimal error, as shown in Fig 3.11.
Fig-3.11 Finding K value using error rate
In the scikit-learn Python library, the statement from sklearn.neighbors import KNeighborsClassifier
provides the module used for carrying out K-Nearest Neighbor classification. We have to specify the value of
K found above and assign an object to the classifier. We then use our training dataset to
fit the model. Fig 3.12 shows the sample code for training the model using K-Nearest Neighbor.

Fig-3.12 KNN classifier model
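A minimal sketch combining the error-rate search for K and the final classifier, assuming the X_train, X_test, y_train, y_test split from Section 3.2.3 (the range of candidate K values is an illustrative choice):

# Minimal sketch: choose K by error rate (Fig 3.11), then train the KNN classifier (Fig 3.12)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

error_rate = []
for k in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    error_rate.append(np.mean(pred != y_test))   # fraction of misclassified test points

best_k = int(np.argmin(error_rate)) + 1          # K with the minimum error
print("Best K:", best_k)

knn = KNeighborsClassifier(n_neighbors=best_k)   # final classifier with the chosen K
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)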

3.2.4.2 Naïve Bayes:

Naive Bayes is a statistical classification technique based on Bayes'
Theorem. It is one of the simplest supervised learning algorithms. The Naive Bayes
classifier is a fast, accurate, and reliable algorithm. Naive Bayes classifiers have
high accuracy and speed on large datasets. The Naive Bayes classifier assumes that the
effect of a specific feature in a class is independent of the other features. For
example, a loan applicant is desirable or not depending on his/her income, previous
loan and transaction history, age, and location. Even if these features are interdependent,
they are still considered independently. This assumption simplifies
computation, and that is why it is considered naive. This assumption is called class
conditional independence.
The Naive Bayes classifier makes the assumption that every particular feature in
the dataset is independent of all other features. For example, whether a patient has
Parkinson's or not depends on the speech features of the patient.
P(h|D) = P(D|h) * P(h) / P(D)     (1)

P(h): the probability of hypothesis h being true (regardless of the data). This is known
as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the
prior probability of the data, or the evidence.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.
P(D|h): the probability of the data D given that the hypothesis h is true. This is known
as the likelihood.
We can frame classification as a conditional classification problem with Bayes
Theorem as follows:
P(yi | x1, x2, …, xn) = P(x1, x2, …, xn | yi) * P(yi) / P(x1, x2, …, xn)
The prior P(yi) is easy to estimate from a dataset, but the conditional probability of
the observation based on the class P(x1, x2, …, xn | yi) is not feasible unless the
number of examples is extraordinarily large, e.g. large enough to effectively estimate

the probability distribution for all different possible combinations of values.
As such, the direct application of Bayes Theorem also becomes intractable, especially
as the number of variables or features (n) increases.
The Naive Bayes classifier calculates the probability of an event in the following steps:
Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability with each attribute for each class.
Step 3: Put these values into the Bayes formula and calculate the posterior probability.
Step 4: See which class has the higher probability; the input is assigned to the class with the
higher probability.
Types of Naive Bayes Algorithms:
 Gaussian Naïve Bayes: When the feature values are continuous in nature,
the assumption is made that the values linked with each
class are distributed according to a Gaussian, that is, a Normal Distribution.
 Multinomial Naïve Bayes: Multinomial Naive Bayes is mostly favored to be
used on data that is multinomially distributed. It is widely utilized in text
classification in NLP, where each event constitutes the
presence of a word in a document.
 Bernoulli Naïve Bayes: When data is distributed according to multivariate
Bernoulli distributions, Bernoulli Naive Bayes is used. That means
there can be multiple features, but each one is assumed to
take a binary value. So, it requires features to be binary-valued.

In the scikit-learn Python library, the statement from sklearn.naive_bayes import GaussianNB
provides the module used for carrying out the Naïve Bayes classifier. We use our training
dataset to fit the model. Fig 3.13 shows the sample code for training the model using
Naïve Bayes.

Fig-3.13 Naïve Bayes Classifier Model
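A minimal sketch of this step, under the same assumed train/test split from Section 3.2.3:

# Minimal sketch of training and testing the Gaussian Naïve Bayes model
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)          # estimate per-class mean/variance of each feature
y_pred_nb = nb.predict(X_test)    # pick the class with the highest posterior probability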

3.2.4.3 Logistic Regression:

Logistic Regression is also one of the most popular Machine
Learning algorithms; it comes under the supervised learning technique. It is used
for predicting the categorical dependent variable using a given set of independent variables. It
becomes a classification technique only when a decision threshold is brought into the picture.
The setting of the threshold value is a vital aspect of Logistic Regression and depends
on the classification problem itself.

The decision about the threshold value is majorly affected by the values of
precision and recall. Ideally, we would like both precision and recall to be 1, but this
is seldom the case. In the case of precision-recall tradeoffs, we use the
following arguments to decide upon the threshold:

 Low Precision/High Recall: In applications where we would like to reduce
the number of false negatives without necessarily reducing the number of false
positives, we choose a decision value that has a low value of Precision or a high
value of Recall.
 High Precision/Low Recall: In applications where we would like to reduce
the number of false positives without necessarily reducing the number of false
negatives, we choose a decision value that has a high value of Precision or a
low value of Recall.
In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function, which predicts two maximum values (0 or 1). The curve from the
logistic function indicates the likelihood of something, such as whether cells are
cancerous or not, or whether a mouse is obese or not based on its weight, etc.

Fig-3.14 S-shaped curve

The sigmoid function is a mathematical function used to map the predicted
values to probabilities. It maps any real value to another value within the range of
0 and 1. The value of the logistic regression must be between 0 and 1 and cannot
exceed this limit, so it forms a curve in the "S" shape. The S-shaped curve is
called the sigmoid function or the logistic function, which is shown above in Fig
3.14. In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Values above the threshold value tend to 1, and
values below the threshold value tend to 0.
F(x) = 1 / (1 + e^(-x))     (2)

F(x) = output between the 0 and 1 value
x = input to the function
e = base of the natural logarithm
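A small illustrative sketch of equation (2) and the threshold idea (the sample inputs are arbitrary): values of F(x) above 0.5 are mapped to class 1, values below 0.5 to class 0.

# Minimal sketch of the sigmoid function and a 0.5 decision threshold
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, 0.0, 3.0):
    p = sigmoid(x)                        # probability between 0 and 1
    print(x, round(p, 3), 1 if p >= 0.5 else 0)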
On the basis of the categories, Logistic Regression can be classified into three
types:
 Binomial: The target variable can have only 2 possibilities, either "0" or "1",
which may represent "win" or "loss", "pass" or "fail", "dead" or "alive",
etc.
 Multinomial: In multinomial Logistic Regression, the target variable can
have 3 or more possibilities which are not ordered, meaning they have no
quantitative measure, like "disease A" or "disease B" or "disease C".
 Ordinal: In ordinal Logistic Regression, the target variable deals with
ordered categories. For example, a test score can be categorized as "very
poor", "poor", "good", and "very good". Here, each category can be given
a score like 0, 1, 2, and 3.
In the scikit-learn Python library, the statement from sklearn.linear_model import LogisticRegression
provides the module used for carrying out Logistic Regression. We have to specify the
number of iterations as a function parameter and assign an object to the classifier. We then
use our training dataset to fit the model. Fig 3.15 shows the sample code for training the
model using Logistic Regression.

Fig-3.15 Logistic Regression model
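A minimal sketch of this step, under the same assumed train/test split; max_iter=1000 is an assumption to help the solver converge and may differ from the exact value used in Fig 3.15.

# Minimal sketch of training and testing the Logistic Regression model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)   # raised iteration limit so the solver converges
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)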

3.2.5 Testing Data

Once the Parkinson’s disease prediction model has been trained on the pre-processed
dataset, the model is tested using different data points. In this testing step, the model
is checked for correctness and accuracy by providing a test dataset to it. All the training
methods need to be verified to find the best model to be used. In figures 3.12, 3.13, and
3.15, after fitting our model with training data, we used the model to predict values for the
test dataset. These predicted values on the testing data are used for model comparison and
accuracy calculation.
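A hedged sketch of how these test predictions can be scored, assuming the prediction variables y_pred_knn, y_pred_nb, and y_pred_lr from the previous sections (the printed numbers depend on the actual data and split):

# Minimal sketch of comparing the three models on the test set
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

for name, pred in [("KNN", y_pred_knn), ("Naive Bayes", y_pred_nb),
                   ("Logistic Regression", y_pred_lr)]:
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))          # counts of TP, FP, FN, TN
    print(classification_report(y_test, pred))     # precision, recall, F1-score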

3.3 User Interface


Our front-end implementation is completed using HTML, CSS, JavaScript and the Flask
framework in the Scientific Python Development Environment (Spyder). The user interface is
extremely essential for any project, because everyone who tries to utilize the system for a
purpose will attempt to access it using an interface. Indeed, our system also features a user
interface built to help users utilize the services we provide. We have used
HTML, a markup language utilized for creating web sites. One of the useful aspects of
HTML is that it can embed programs written in a scripting language like JavaScript, which
is responsible for the behavior and content of web pages. CSS inclusion affects
the layout and appearance of the content.
Flask gives you choices when developing web applications: it provides you with
tools, libraries, and mechanisms that allow you to build various applications, but it
will not enforce any dependencies or tell you what the project should look like.

CHAPTER-4
EXPERIMENTAL ANALYSIS AND RESULTS

4.1 System Requirements


A requirement is a feature that the system must have or a constraint that it must satisfy to
be accepted by the client. Requirements Engineering aims at defining the requirements of the system
under construction. Requirements Engineering includes two main activities: requirements
elicitation, which results in a specification of the system that the client understands, and
analysis, which results in an analysis model that the developer can unambiguously interpret. A
requirement is a statement about what the proposed system will do.
Requirements can be divided into two major categories:

 Functional Requirements.
 Non-Functional Requirements.

4.1.1 Functional Requirements:


A Functional Requirement is a description of the service that the software
must offer. It describes a software system or its component. A function is nothing but inputs
to the software, its behavior, and outputs. It can be a calculation, data manipulation,
business process, user interaction, or any other specific functionality which defines what
function a system is likely to perform. Functional Requirements describe the
interactions between the system and its environment independent of its application.
 Applying the algorithms on the test data.
 Display the result with the description of having Parkinson’s or not.

4.1.2 Non-Functional Requirements:


Non-Functional Requirements specify the quality attributes of the software. They
judge the software based on responsiveness, usability, security, portability, and other
non-functional standards that are critical to the success of the software.
An example of a non-functional requirement is, “how fast does the website load?”
Failing to satisfy non-functional requirements may result in systems that fail to satisfy user
needs.
Non-functional requirements allow you to impose constraints or restrictions on the
design of the system across the various agile backlogs.
 Accuracy
 Reliability
 Flexibility

4.2 System Configuration


4.2.1 Software Requirements:
1. Software:
 Spyder
 Google Colab
2. Operating System: Windows 10
3. Tools: Web Browser
4. Python Libraries: numpy, pandas, matplotlib, seaborn, sklearn, pickle.

4.2.1.1 Introduction to Python:


Python is an interpreted, high-level, general-purpose programming language. Its simple,
easy-to-read syntax emphasizes readability and thus reduces system maintenance costs.
Python supports modules and packages, which encourage modular program layout and code
reuse. Being interpreted, it needs no separate compilation step, although it typically runs
somewhat slower than compiled languages, and care must be taken with indentation while
coding.
Python does the following:

 Python can be used on a server to create web applications.
 It connects to database systems. It can also read and modify files.
 It is able to handle big data and perform complex mathematics.
 It can be used for production-ready software development.

Python has many library functions that can be used easily for working with machine
learning algorithms. All the necessary Python libraries must be pre-installed using the "pip"
command.
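For instance, the libraries used in this project can be installed in one step and then imported; this is only an illustrative sketch (exact package versions are omitted, and the list mirrors the libraries named above):

# Run once before using the scripts/notebooks:
#   pip install numpy pandas matplotlib seaborn scikit-learn flask
# (pickle is part of the Python standard library and needs no installation.)
import numpy, pandas, matplotlib, seaborn, sklearn, flask, pickle
print("All required libraries imported successfully")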

4.2.1.2 Introduction to Flask Framework:

A web application framework, or simply web framework, is a collection of libraries
and modules that lets a web application developer write applications without having to bother
about low-level details such as protocols and thread management. Flask is a web application
framework written in Python. It was developed by Armin Ronacher, who leads an international
group of Python enthusiasts named Pocco. Flask is based on the Werkzeug WSGI toolkit and
the Jinja2 template engine, both of which are Pocco projects. Although older Flask releases
targeted Python 2, current versions of Flask and its common extensions fully support
Python 3, which is what we use. Importing the flask module in the project is mandatory, and
an object of the Flask class is our WSGI application.

The Flask constructor takes the name of the current module (__name__) as its argument.

The route() function of the Flask class is a decorator which tells the application which
URL should call the associated function. The rule parameter represents the URL binding of
the function, and the options are a list of parameters to be forwarded to the underlying Rule
object. Finally, the run() method of the Flask class runs the application on the local
development server. Opening the URL (localhost:5000) in the browser displays the returned
message.

A Flask application is started by calling the run() method. However, while the application is
under development, it would have to be restarted manually for every change in the code. To
avoid this inconvenience, debug support can be enabled: the server then reloads itself whenever
the code changes and also provides a useful debugger to trace any errors in the application.
Debug mode is enabled by setting the debug property of the application object
to True before running it, or by passing the debug parameter to the run() method.
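A minimal sketch that ties these pieces together (the greeting text and route are illustrative, not part of our application):

from flask import Flask

app = Flask(__name__)            # the Flask constructor takes the module name

@app.route('/')                  # route() binds the URL '/' to the function below
def home():
    return 'Hello, Flask!'       # response shown at http://localhost:5000/

if __name__ == '__main__':
    app.run(debug=True)          # debug=True reloads the server on code changes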

4.2.1.3 Python Libraries:
NumPy:

NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. It contains various features including these
important ones:
 A powerful N-dimensional array object
 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data.
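A brief illustration of these capabilities (the numbers are arbitrary example values):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])          # N-dimensional array object
b = a * 10                                       # broadcasting: every element times 10
print(b.mean(axis=0))                            # column-wise mean
print(np.linalg.inv(a))                          # linear algebra: matrix inverse
print(np.fft.fft(np.ones(4)))                    # Fourier transform
print(np.random.default_rng(0).normal(size=3))   # random number generation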

Pandas:

Pandas is an open-source library built on top of the NumPy library. It is a Python
package that provides various data structures and operations for manipulating numerical data
and time series. It is fast and offers high performance and productivity for its users, with
easy-to-use data structures and data analysis tools for the Python language. Pandas is employed
in a wide range of fields, including academic and commercial domains such as economics,
statistics and analytics.
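For example, the project's CSV dataset can be loaded and inspected with a few lines (the file path here is illustrative; adjust it to where the dataset is stored):

import pandas as pd

df = pd.read_csv('pd_speech_features.csv')   # illustrative path to the speech dataset
print(df.shape)                              # (rows, columns)
print(df.head())                             # first five records
print(df['class'].value_counts())            # distribution of the target label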

Sklearn:

Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning
in Python. It is an open-source Python library that implements a variety of machine
learning, pre-processing, cross-validation and visualization algorithms through a unified
interface. Sklearn provides a selection of efficient tools for machine learning and statistical
modelling, including classification, regression, clustering and dimensionality reduction, via
a consistent interface in Python. The library is largely written in Python and is built upon
NumPy, SciPy and Matplotlib.
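The typical sklearn workflow used later in this project (feature selection, splitting, fitting, scoring) looks roughly like the sketch below, with synthetic data standing in for the speech features (an assumption for illustration only):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=200, n_features=20, random_state=0)
x_best = SelectKBest(f_classif, k=5).fit_transform(x, y)   # keep the 5 best features
x_tr, x_te, y_tr, y_te = train_test_split(x_best, y, test_size=0.3, random_state=1)
clf = LogisticRegression(max_iter=300).fit(x_tr, y_tr)
print('Test accuracy:', clf.score(x_te, y_te))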

Pickle:

The Python pickle module is employed for serializing and de-serializing a Python object
structure. Pickling is a way to convert a Python object (list, dict, etc.) into a byte
stream. The idea is that this byte stream contains all the information necessary to
reconstruct the object in another Python script. Pickling is beneficial for applications where
a degree of persistence is needed: the program's state data can be saved to disk so that work
can continue on it later.
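A minimal sketch of saving and re-loading an object, mirroring how model.pkl is used later in this report (the object pickled here is a plain dictionary used only for illustration):

import pickle

obj = {'algorithm': 'Naive Bayes', 'accuracy': 0.81}   # illustrative object

with open('model.pkl', 'wb') as f:
    pickle.dump(obj, f)          # serialize to a byte stream on disk

with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)    # reconstruct the object later
print(restored)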

Matplotlib:

Matplotlib is a very powerful plotting library, useful for those working with Python and
NumPy. Drawing statistical inferences requires visualizing the data, and Matplotlib is the
tool that helps with this purpose. It provides a MATLAB-like interface, with the difference
that it uses Python and is open source.
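For instance, the error-rate-versus-K plot used later for KNN can be produced with a few calls; the values below are placeholders, not project results:

import matplotlib.pyplot as plt

k_values = range(1, 11)
error_rate = [0.30, 0.26, 0.24, 0.25, 0.22, 0.23, 0.21, 0.22, 0.20, 0.21]  # placeholder values

plt.plot(k_values, error_rate, linestyle='dashed', marker='o')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.title('Error Rate vs. K Value')
plt.show()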

Seaborn:

Seaborn is a data visualization library built on top of Matplotlib and closely
integrated with pandas data structures in Python. Visualization is the central part of
Seaborn, and it helps in the exploration and understanding of data.
It offers the following functionalities (a short example is given after this list):
 A dataset-oriented API for examining the relationships between variables.
 Automatic estimation and plotting of linear regression fits.
 High-level abstractions for multi-plot grids.
 Visualization of univariate and bivariate distributions.
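A small sketch of the heatmap-style visualization used in this project; the data frame here is synthetic and stands in for the speech features:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=['f1', 'f2', 'f3', 'f4'])  # synthetic data

sns.heatmap(df.corr(), annot=True)   # correlation heatmap, as used on the speech features
plt.show()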

4.2.2. Hardware Requirements:

1. RAM: 4 GB or above
2. Storage: 30 to 50 GB
3. Processor: Any Processor above 500MHz

4.3 Feasibility Study

The preliminary investigation examines project feasibility, i.e. the likelihood that the
system will be useful to the organization. The main objective of the feasibility study
is to check the technical, operational and economic feasibility of adding new modules
and debugging old running systems. Any system is possible given unlimited
resources and infinite time to do a task. There are three aspects in the feasibility study portion
of the preliminary investigation:
 Economical Feasibility
 Technical Feasibility
 Operational Feasibility

4.3.1 Economic Feasibility:


A system that can be developed technically and that will be used once installed
must still be a good investment for the organization. In the economic feasibility study, the
development cost of creating the system is evaluated against the ultimate benefit derived from
the new system; the financial benefits must equal or exceed the costs. The system is
economically feasible: it does not require any additional hardware or software, and since the
interface is developed using the existing resources and open-source technologies, there is only
nominal expenditure and economic feasibility is certain.

4.3.2 Technical Feasibility:


This assessment focuses on the technical resources available to the organization. It
helps organizations determine whether the technical resources meet capacity and whether
the technical team is capable of converting the ideas into working systems. Technical
feasibility also involves evaluation of the hardware, software, and other technology
requirements of the proposed system. The assessment is based on an outline design
of the system requirements, to work out whether the organization has the technical expertise to
handle completion of the project. When writing a feasibility report, the following should
be taken into consideration:
 A brief description of the business, to assess other possible factors which could
affect the study
 The part of the business being examined
 The human and economic factors
 The possible solutions to the problem
At this level, the concern is whether the proposal is both technically and legally feasible
(assuming moderate cost). The technical feasibility assessment is focused on gaining an
understanding of the present technical resources of the organization and their applicability to
the expected needs of the proposed system. It is an evaluation of the hardware and software
and of how well they meet the needs of the proposed system.

4.3.3 Operational Feasibility:


A proposed project is beneficial only if it can be turned into an information
system that will meet the organization's operating requirements. The operational feasibility
aspects of the project are to be taken as an important part of the project implementation.
Some of the important issues raised to check the operational feasibility of a project
include the following:

 Is there sufficient support for the project from management and from the users?
 Will the system be used and work properly once it is developed and
implemented?
 Will there be any resistance from the users that could undermine the possible
application benefits?
This system is targeted to be in accordance with the above-mentioned issues. The management
issues and user requirements were taken into consideration beforehand, so there is no question
of resistance from the users that could undermine the possible application benefits. The
well-planned design also ensures the optimal utilization of the computer resources and helps
improve the performance status.

4.4 Sample Code
(a) model.py:
import numpy as np               # linear algebra
import pandas as pd              # data analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualization
import pickle                    # model serialization (used at the end of this script)

from google.colab import drive


drive.mount('/content/drive',force_remount = True)

df=pd.read_csv("/content/drive/MyDrive/pd_speech_features.csv")
df.head()

def precision(class_id, TP, FP, TN, FN):
    # macro-averaged precision computed from per-class counts
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TP[i] == 0 or FP[i] == 0):
            TP[i] = 0.00000001
            FP[i] = 0.00000001
        sonuc += (TP[i] / (TP[i] + FP[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def recall(class_id, TP, FP, TN, FN):
    # macro-averaged recall (sensitivity)
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TP[i] == 0 or FN[i] == 0):
            TP[i] = 0.00000001
            FN[i] = 0.00000001
        sonuc += (TP[i] / (TP[i] + FN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def accuracy(class_id, TP, FP, TN, FN):
    # macro-averaged accuracy
    sonuc = 0
    for i in range(0, len(class_id)):
        sonuc += ((TP[i] + TN[i]) / (TP[i] + FP[i] + TN[i] + FN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def specificity(class_id, TP, FP, TN, FN):
    # macro-averaged specificity (true negative rate)
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TN[i] == 0 or FP[i] == 0):
            TN[i] = 0.00000001
            FP[i] = 0.00000001
        sonuc += (TN[i] / (FP[i] + TN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def NPV(class_id, TP, FP, TN, FN):
    # macro-averaged negative predictive value
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TN[i] == 0 or FN[i] == 0):
            TN[i] = 0.00000001
            FN[i] = 0.00000001
        sonuc += (TN[i] / (TN[i] + FN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def perf_measure(y_actual, y_pred):
    # per-class TP, FP, TN, FN counts for the classes present in either vector
    class_id = set(y_actual).union(set(y_pred))
    TP = []
    FP = []
    TN = []
    FN = []

    for index, _id in enumerate(class_id):
        TP.append(0)
        FP.append(0)
        TN.append(0)
        FN.append(0)
        for i in range(len(y_pred)):
            if y_actual[i] == y_pred[i] == _id:
                TP[index] += 1
            if y_pred[i] == _id and y_actual[i] != y_pred[i]:
                FP[index] += 1
            if y_actual[i] == y_pred[i] != _id:
                TN[index] += 1
            if y_pred[i] != _id and y_actual[i] != y_pred[i]:
                FN[index] += 1

    return class_id, TP, FP, TN, FN

df.info()

df.columns

man=df.gender.sum()
total=df.gender.count()
woman=total-man

print("man: "+str(man)+" woman: "+str(woman))

sns.heatmap(df[df.columns[0:10]].corr(),annot=True)

df.shape

from sklearn.feature_selection import SelectKBest


from sklearn.feature_selection import f_classif
y=df["class"]
x=df.iloc[:,2:7]
xnew2=SelectKBest(f_classif, k=5).fit_transform(x, y)

auc_scor=[]
precision_scor=[]
x.head()

x=pd.DataFrame(xnew2)
x.head()

y.value_counts()
y=y.values
type(y)

from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, roc_auc_score, accuracy_score

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state=1)

score_list=[]
recall_scor=[]

f1_scor=[]
LR_plus=[]
LR_minus=[]
odd_scor=[]
NPV_scor=[]
youden_scor=[]
specificity_scor=[]

from sklearn.neighbors import KNeighborsClassifier


from sklearn.metrics import roc_curve

error_rate = []
for i in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10, 6))
plt.plot(range(1, 100), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
print("Minimum error:-", min(error_rate), "at K =", error_rate.index(min(error_rate)))

k=10
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(x_train,y_train)
y_head=knn.predict(x_test)
print("KNN Algorithm test accuracy",knn.score(x_test,y_test))

classid,tn,fp,fn,tp=perf_measure(y_test,y_head)
auc_scor.append(roc_auc_score(y_test,y_head))
score_list.append(accuracy(classid,tn,fp,fn,tp))
precision_scor.append(precision(classid,tn,fp,fn,tp))
recall_scor.append(recall(classid,tn,fp,fn,tp))
f1_scor.append(f1_score(y_test,y_head,average='macro'))
NPV_scor.append(NPV(classid,tn,fp,fn,tp))
specificity_scor.append(specificity(classid,tn,fp,fn,tp))

LR_plus.append((recall(classid,tn,fp,fn,tp)/(1-specificity(classid,tn,fp,fn,tp))))
LR_minus.append(((1-recall(classid,tn,fp,fn,tp))/specificity(classid,tn,fp,fn,tp)))
odd_scor.append(((recall(classid,tn,fp,fn,tp)/(1-specificity(classid,tn,fp,fn,tp))))/(((1-
recall(classid,tn,fp,fn,tp))/specificity(classid,tn,fp,fn,tp))))
youden_scor.append((recall(classid,tn,fp,fn,tp)+specificity(classid,tn,fp,fn,tp)-1))

print("KNN algorithm report: \n",classification_report(y_test,y_head))

from sklearn.metrics import confusion_matrix


cmknn = confusion_matrix(y_test,y_head)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cmknn,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.title("KNN Algorithm")
plt.show()

from sklearn.naive_bayes import GaussianNB


nb=GaussianNB()
nb.fit(x_train,y_train)
y_head=nb.predict(x_test)

print("Naive Bayes Algorithm test accuracy",nb.score(x_test,y_test))

classid,tn,fp,fn,tp=perf_measure(y_test,y_head)
auc_scor.append(roc_auc_score(y_test,y_head))
score_list.append(accuracy(classid,tn,fp,fn,tp))
precision_scor.append(precision(classid,tn,fp,fn,tp))
recall_scor.append(recall(classid,tn,fp,fn,tp))
f1_scor.append(f1_score(y_test,y_head,average='macro'))
NPV_scor.append(NPV(classid,tn,fp,fn,tp))
specificity_scor.append(specificity(classid,tn,fp,fn,tp))
TPR=recall(classid,tn,fp,fn,tp)
TNR=specificity(classid,tn,fp,fn,tp)
FPR=1-TNR
if FPR == 0:
    FPR = 0.00001
FNR = 1 - TPR
lrminus = FNR / TNR
lrarti = TPR / FPR
if lrminus == 0:
    lrminus = 0.00000001
LR_plus.append(TPR/FPR)
LR_minus.append(FNR/TNR)
odd_scor.append(lrarti/lrminus)
youden_scor.append(TPR+TNR-1)

print("Naive Bayes algorithm report: \n",classification_report(y_test,y_head))

cmnb = confusion_matrix(y_test,y_head)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cmnb,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred")

plt.ylabel("y_true")
plt.title("Naive Bayes Algorithm")
plt.show()

from sklearn.linear_model import LogisticRegression


lr=LogisticRegression(random_state=0,max_iter=300)
lr.fit(x_train,y_train)
y_head=lr.predict(x_test)
print("Logistic Regression testaccuracy ",lr.score(x_test,y_test))

classid,tn,fp,fn,tp=perf_measure(y_test,y_head)
auc_scor.append(roc_auc_score(y_test,y_head))
score_list.append(accuracy(classid,tn,fp,fn,tp))
precision_scor.append(precision(classid,tn,fp,fn,tp))
recall_scor.append(recall(classid,tn,fp,fn,tp))
f1_scor.append(f1_score(y_test,y_head,average='macro'))
NPV_scor.append(NPV(classid,tn,fp,fn,tp))
specificity_scor.append(specificity(classid,tn,fp,fn,tp))
TPR=recall(classid,tn,fp,fn,tp)
TNR=specificity(classid,tn,fp,fn,tp)
FPR=1-TNR
if FPR == 0:
    FPR = 0.00001
FNR = 1 - TPR
lrminus = FNR / TNR
lrarti = TPR / FPR
if lrminus == 0:
    lrminus = 0.00000001
LR_plus.append(TPR/FPR)
LR_minus.append(FNR/TNR)
odd_scor.append(lrarti/lrminus)

youden_scor.append(TPR+TNR-1)

print("Logistic Regression report: \n",classification_report(y_test,y_head))

cmlr = confusion_matrix(y_test,y_head)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cmlr,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.title("Logistic Regression")
plt.show()

algo_list=["KNN","Naive Bayes","Logistic Regression"]


score={"algo_list":algo_list,"score_list":score_list,"precision":precision_scor,"recall":rec
all_scor,"f1_score":f1_scor,"AUC":auc_scor,"LR+":LR_plus,"LR-
":LR_minus,"ODD":odd_scor,"YOUDEN":youden_scor,"Specificity":specificity_scor}

z=pd.DataFrame(score)
z

f, ax1 = plt.subplots(figsize=(12, 12))

# the metrics are plotted from the summary dataframe z built above
sns.pointplot(x=z['algo_list'], y=z['score_list'], data=z, color='lime', alpha=0.8, label="score_list")
sns.pointplot(x=z['algo_list'], y=z['precision'], data=z, color='red', alpha=0.8, label="precision")
sns.pointplot(x=z['algo_list'], y=z['recall'], data=z, color='black', alpha=0.8, label="recall")
sns.pointplot(x=z['algo_list'], y=z['f1_score'], data=z, color='blue', alpha=0.8, label="f1_score")
sns.pointplot(x=z['algo_list'], y=z['AUC'], data=z, color='yellow', alpha=0.8, label="AUC")
sns.pointplot(x=z['algo_list'], y=z['LR-'], data=z, color='orange', alpha=0.8, label="LR-")
sns.pointplot(x=z['algo_list'], y=z['YOUDEN'], data=z, color='brown', alpha=0.8, label="YOUDEN")
sns.pointplot(x=z['algo_list'], y=z['Specificity'], data=z, color='purple', alpha=0.8, label="Specificity")
plt.xlabel('Algorithms', fontsize=15, color='blue')
plt.ylabel('Metrics', fontsize=15, color='blue')
plt.xticks(rotation=45)
plt.title('Parkinsons Disease (PD) Evaluation Metrics', fontsize=20, color='blue')
plt.grid()
plt.legend()
plt.show()

with open('model.pkl', 'wb') as f:
    pickle.dump(nb, f)

model = pickle.load(open('model.pkl', 'rb'))
print(model)

(b) app.py:
import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/')

def home():
return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # read the five feature values entered in the form
    features = [float(x) for x in request.form.values()]
    final_features = [np.array(features)]

    prediction = model.predict(final_features)
    print("final features", final_features)
    print("prediction:", prediction)
    output = round(prediction[0], 2)
    print(output)
    if output == 0:
        return render_template('index.html', prediction_text="THE PATIENT DOES NOT HAVE PARKINSON'S DISEASE")
    else:
        return render_template('index.html', prediction_text="THE PATIENT HAS PARKINSON'S DISEASE")

@app.route('/predict_api', methods=['POST'])
def results():
    data = request.get_json(force=True)
    prediction = model.predict([np.array(list(data.values()))])

    output = prediction[0]
    return jsonify(output)

if __name__ == "__main__":

50
app.run(debug=False)

(c) index.html:
<!DOCTYPE html>
<html >
<!--From https://codepen.io/frytyler/pen/EGdtg-->
<head>
<meta charset="UTF-8">
<title>PREDICTION</title>
<link href='https://fonts.googleapis.com/css?family=Pacifico' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Arimo' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Hind:300' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300'
rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>

<body>

<div class="login">
<h1>Parkinson's Disease Prediction Analysis</h1>
{{ prediction_text }}
<!-- Main Input For Receiving Query to our ML -->
<form action="{{ url_for('predict') }}" method="post">

<input type="text" name="PPE" placeholder="ppe" required="required" />


<input type="text" name="DFA" placeholder="dfa" required="required" />

<input type="text" name="RPDE" placeholder="rpde" required="required"
/>
<input type="text" name="numPulses" placeholder="numpulses"
required="required" />
<input type="text" name="numPeriodsPulses"
placeholder="numperiodpulses" required="required" />

<button type="submit" class="btn btn-primary btn-block btn-large">Predict</button>


</form>

</div>

</body>
</html>

(d) style.css:
@import url(https://fonts.googleapis.com/css?family=Open+Sans);
.btn { display: inline-block; *display: inline; *zoom: 1; padding: 4px 10px 4px; margin-
bottom: 0; font-size: 13px; line-height: 18px; color: #333333; text-align: center;text-
shadow: 0 1px 1px rgba(255, 255, 255, 0.75); vertical-align: middle; background-color:
#f5f5f5; background-image: -moz-linear-gradient(top, #ffffff, #e6e6e6); background-
image: -ms-linear-gradient(top, #ffffff, #e6e6e6); background-image: -webkit-
gradient(linear, 0 0, 0 100%, from(#ffffff), to(#e6e6e6)); background-image: -webkit-
linear-gradient(top, #ffffff, #e6e6e6); background-image: -o-linear-gradient(top, #ffffff,
#e6e6e6); background-image: linear-gradient(top, #ffffff, #e6e6e6); background-repeat:
repeat-x; filter: progid:dximagetransform.microsoft.gradient(startColorstr=#ffffff,
endColorstr=#e6e6e6, GradientType=0); border-color: #e6e6e6 #e6e6e6 #e6e6e6; border-
color: rgba(0, 0, 0, 0.1) rgba(0, 0, 0, 0.1) rgba(0, 0, 0, 0.25); border: 1px solid #e6e6e6; -
webkit-border-radius: 4px; -moz-border-radius: 4px; border-radius: 4px; -webkit-box-

shadow: inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0, 0, 0, 0.05); -moz-box-
shadow: inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0, 0, 0, 0.05); box-shadow:
inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0, 0, 0, 0.05); cursor: pointer;
*margin-left: .3em; }
.btn:hover, .btn:active, .btn.active, .btn.disabled, .btn[disabled] { background-color:
#e6e6e6; }
.btn-large { padding: 9px 14px; font-size: 15px; line-height: normal; -webkit-border-radius:
5px; -moz-border-radius: 5px; border-radius: 5px; }
.btn:hover { color: #333333; text-decoration: none; background-color: #e6e6e6;
background-position: 0 -15px; -webkit-transition: background-position 0.1s linear; -moz-
transition: background-position 0.1s linear; -ms-transition: background-position 0.1s
linear; -o-transition: background-position 0.1s linear; transition: background-position 0.1s
linear; }
.btn-primary, .btn-primary:hover { text-shadow: 0 -1px 0 rgba(0, 0, 0, 0.25); color: #ffffff;
}
.btn-primary.active { color: rgba(255, 255, 255, 0.75); }
.btn-primary { background-color: #4a77d4; background-image: -moz-linear-gradient(top,
#6eb6de, #4a77d4); background-image: -ms-linear-gradient(top, #6eb6de, #4a77d4);
background-image: -webkit-gradient(linear, 0 0, 0 100%, from(#6eb6de), to(#4a77d4));
background-image: -webkit-linear-gradient(top, #6eb6de, #4a77d4); background-image: -
o-linear-gradient(top, #6eb6de, #4a77d4); background-image: linear-gradient(top,
#6eb6de, #4a77d4); background-repeat: repeat-x; filter:
progid:dximagetransform.microsoft.gradient(startColorstr=#6eb6de,
endColorstr=#4a77d4, GradientType=0); border: 1px solid #3762bc; text-shadow: 1px 1px
1px rgba(0,0,0,0.4); box-shadow: inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0,
0, 0, 0.5); }
.btn-primary:hover, .btn-primary:active, .btn-primary.active, .btn-primary.disabled, .btn-
primary[disabled] { filter: none; background-color: #4a77d4; }
.btn-block { width: 100%; display:block; }

* { -webkit-box-sizing:border-box; -moz-box-sizing:border-box; -ms-box-sizing:border-

box; -o-box-sizing:border-box; box-sizing:border-box; }

html { width: 100%; height:100%; overflow:hidden; }

body {
width: 100%;
height:100%;
font-family: 'Open Sans', sans-serif;
background: #092756;
color: #fff;
overflow: scroll;
font-size: 18px;
text-align:center;
letter-spacing:1.2px;
background: -moz-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%),-moz-linear-gradient(top, rgba(57,173,219,.25) 0%,
rgba(42,60,87,.4) 100%), -moz-linear-gradient(-45deg, #670d10 0%, #092756 100%);
background: -webkit-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), -webkit-linear-gradient(top, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), -webkit-linear-gradient(-45deg, #670d10 0%,#092756
100%);
background: -o-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), -o-linear-gradient(top, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), -o-linear-gradient(-45deg, #670d10 0%,#092756 100%);
background: -ms-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), -ms-linear-gradient(top, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), -ms-linear-gradient(-45deg, #670d10 0%,#092756 100%);
background: -webkit-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), linear-gradient(to bottom, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), linear-gradient(135deg, #670d10 0%,#092756 100%);
filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#3E1D6D',

endColorstr='#092756',GradientType=1 );

}
.login {
position: absolute;
top: 40%;
left: 50%;
margin: -150px 0 0 -150px;
width:400px;
height:400px;
}

.login h1 { color: #fff; text-shadow: 0 0 10px rgba(0,0,0,0.3); letter-spacing:1px; text-


align:center; }

input {
width: 100%;
margin-bottom: 10px;
background: rgba(0,0,0,0.3);
border: none;
outline: none;
padding: 10px;
font-size: 13px;
color: #fff;
text-shadow: 1px 1px 1px rgba(0,0,0,0.3);
border: 1px solid rgba(0,0,0,0.3);
border-radius: 4px;
box-shadow: inset 0 -5px 45px rgba(100,100,100,0.2), 0 1px 1px
rgba(255,255,255,0.2);
-webkit-transition: box-shadow .5s ease;
-moz-transition: box-shadow .5s ease;

-o-transition: box-shadow .5s ease;
-ms-transition: box-shadow .5s ease;
transition: box-shadow .5s ease;
}

input:focus { box-shadow: inset 0 -5px 45px rgba(100,100,100,0.4), 0 1px 1px


rgba(255,255,255,0.2); }

4.5 Experimental Analysis and Performance Measures


Performance evaluation measures are the parameters that help in the comparative
analysis of different machine learning techniques, i.e. they indicate the best among all the
algorithms or methods that can be used by medical science in the early prediction of
Parkinson's disease.
We have used several measures to evaluate the predictive results: the confusion matrix,
accuracy, precision, recall (sensitivity), specificity, F1-score, LR-, LR+, odd score and
Youden score.

4.5.1 Confusion Matrix:


The confusion matrix is also called the error matrix. It is a table that is often used to
describe the performance of a classification method on a set of test data for which the actual
values are known. Each column of the matrix represents the instances in a predicted class and
each row the instances in an actual class. The confusion matrix is represented as given below:

                      Predicted Positive    Predicted Negative
Actual Positive               TP                    FN
Actual Negative               FP                    TN

Table 1. Confusion Matrix

Where TP: True positive
FP: False Positive
FN: False Negative
TN: True Negative

4.5.2 Accuracy:
Accuracy is the proportion of the total number of predictions that were correct. It
can be obtained as the sum of true positive and true negative instances divided by the total
number of samples.
It is expressed as: Accuracy = (TP + TN) / (TP + FP + FN + TN)    (3)

4.5.3 Precision:
Precision is the fraction of true positive instances among the instances predicted as
positive. It is also known as the ratio of correct positive results to the total number of positive
results predicted by the system.
It is expressed as: Precision (P) = TP / (TP + FP)    (4)

4.5.4 Recall (or) Sensitivity:


Recall is defined as the fraction of true positive instances among the actual positive
instances, i.e. the ratio of correct positive results to the number of all relevant samples.
It is expressed as: Recall (R) = TP / (TP + FN)    (5)

4.5.5 F1-Score:
The F1-score is the harmonic mean of precision and recall, i.e. twice the product of
precision and recall divided by their sum. It measures the test accuracy, and its range is 0 to 1.
It is expressed as: F1 score = 2 / ((1/Precision) + (1/Recall)) = 2PR / (P + R)    (6)

4.5.6 Specificity:

Specificity is a measure of how well a test can identify true negatives. Specificity is
also referred to as selectivity or the true negative rate: the proportion of true negatives out of
all the samples that do not have the condition (true negatives and false positives).
It is expressed as: Specificity = TN / (TN + FP)    (7)

4.5.7 LR-:
LR- is defined as the likelihood ratio for negative results in the test.
It is expressed as: LR- = (1 - sensitivity) / specificity    (8)

4.5.8 LR+:
LR+ is defined as the likelihood ratio for positive results in the test.
It is expressed as: LR+ = sensitivity / (1 - specificity)    (9)

4.5.9 Odd Score:


The odd score (the diagnostic odds ratio) expresses the odds of a positive test result in
subjects who have the condition relative to the odds in subjects who do not; it is the ratio of
LR+ to LR-.
It is expressed as: odd = (sensitivity / (1 - specificity)) / ((1 - sensitivity) / specificity)    (10)

4.5.10 Youden Score:


The Youden index is the sum of sensitivity and specificity minus one; equivalently, it
is the difference between the true positive rate and the false positive rate.
It is expressed as: youden = sensitivity + specificity - 1    (11)
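As a compact illustration of these definitions, the sketch below computes the same measures from a 2x2 confusion matrix; the counts are made-up example values, not results from our experiments:

# Illustrative computation of the evaluation measures from a 2x2 confusion matrix
TP, FN, FP, TN = 80, 20, 15, 85   # made-up counts for the example

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # eq. (3)
precision   = TP / (TP + FP)                                   # eq. (4)
recall      = TP / (TP + FN)                                   # eq. (5), sensitivity
f1_score    = 2 * precision * recall / (precision + recall)    # eq. (6)
specificity = TN / (TN + FP)                                   # eq. (7)
lr_minus    = (1 - recall) / specificity                       # eq. (8)
lr_plus     = recall / (1 - specificity)                       # eq. (9)
odd         = lr_plus / lr_minus                               # eq. (10)
youden      = recall + specificity - 1                         # eq. (11)
print(accuracy, precision, recall, f1_score, specificity, lr_minus, lr_plus, odd, youden)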

4.5.11 Comparative Analysis:


The model that best suits our system was identified through the graph comparison
below. The graph shows a comparative analysis of the three models based on some of the
evaluation metrics mentioned above.

[Comparative Analysis: grouped bar chart of Accuracy, Precision and Recall for KNN, Naïve Bayes and Logistic Regression; test accuracies are 0.80, 0.81 and 0.79 respectively.]

Fig 4.1 Comparative analysis of three models


In Fig 4.1 we observe the results after applying all three machine learning techniques:
KNN, Naïve Bayes and Logistic Regression. Naïve Bayes gives the highest accuracy, 81%.
Naïve Bayes is therefore taken as the final prediction model and is the technique deployed
through Flask.

Fig 4.2 Comparative analysis using all metrics

We also compared the remaining evaluation metrics computed for the three
algorithms, which can be seen in Fig 4.2: AUC, LR-, LR+, Odd, Youden and Specificity.

4.6 Results
To demonstrate the results of our project, the held-out test data are classified by the
three algorithms; the trained models are then ready to predict whether the disease is present or
not. The test accuracies were obtained in Google Colab, our Python notebook environment.

Below we describe how the three algorithms were processed. First, the KNN algorithm was
trained on the training dataset and then tested on the remaining test data. Fig 4.3 is a
screenshot of our notebook showing how the KNN algorithm is run and the accuracy it
returns, which is 80%.

Fig 4.3 KNN Test Accuracy

Second, the Naïve Bayes algorithm was trained on the training dataset and then tested on the
remaining test data. Fig 4.4 is a screenshot of our notebook showing how the Naïve Bayes
algorithm is run and the accuracy it returns, which is the highest at 81%.

Fig 4.4 Naïve Bayes Test Accuracy

Third, the Logistic Regression algorithm was trained on the training dataset and then tested
on the remaining test data. Fig 4.5 is a screenshot of our notebook showing how the Logistic
Regression algorithm is run and the accuracy it returns, which is 79%.

Fig 4.5 Logistic Regression Test Accuracy

Of these three techniques, Naïve Bayes gives the highest accuracy, so this model is used in
the front end. The trained model is saved to a pickle file, which is loaded by the front end;
the values entered by the user are passed to this model, and the result is a text message stating
whether or not the patient has Parkinson's disease.

Fig 4.6 User Interface to enter the patient details

Fig 4.7 Predicted Result showing for fig 4.6 details

Fig 4.8 User Interface to enter the another patient details

Fig 4.9 Predicted Result showing for fig 4.8 details

In Fig 4.6 and Fig 4.8, the screenshots show the data entered in the user interface; it is stored
in a data frame. This new data also goes through the same data pre-processing steps, so that it
has the same format as the training dataset.

We then used the trained model to make a prediction on the new data and obtained the
predicted result, i.e. the output of our system, which can be seen in Fig 4.7 and Fig 4.9. Every
time a prediction is required, the whole process is repeated, and the system gives one of two
class predictions: "THE PATIENT HAS PARKINSON'S DISEASE" or "THE PATIENT
DOES NOT HAVE PARKINSON'S DISEASE". Many people who are not familiar with
programming or Python notebooks would find it difficult to carry out the whole procedure, so
to avoid this problem we created a user interface from which anyone can simply enter the
details and obtain the report.

CHAPTER-5
CONCLUSION AND FUTURE WORK

5.1 Conclusion
Parkinson's disease is the second most common neurodegenerative disease; it has no
cure so far, which makes early prediction important. In this project we used three prediction
models based on machine learning techniques, namely KNN, Naïve Bayes and Logistic
Regression. The dataset was trained with these models, the models built with the different
methods were compared, and the best-fitting model was identified.
The aim was to use various evaluation metrics, such as accuracy, precision, recall,
specificity, F1-score, LR+, LR- and Youden score, to assess how efficiently each model
predicts the disease. We used the speech dataset containing voice features of the patients,
available on the Kaggle website; it consists of more than 700 features and around 750 patient
records. The models were built using the five best features identified by feature selection.
From these results, Naïve Bayes stands out from the other two machine learning algorithms
with an accuracy of 81%. The system we designed can thus make predictions of Parkinson's
disease.

5.2 FUTURE WORK


In the future, these models can be trained with different datasets that have better
features so that the predictions become more accurate. If the accuracy increases, the system
can be used by laboratories and hospitals, making prediction in the early stages easier. The
models can also be used with other medical and disease datasets. The work can further be
extended by building a hybrid model that can detect more than one disease, given an accurate
dataset containing features common to both diseases, and by building a model that extracts
the most important features from all the features in the dataset so as to produce higher
accuracy.

REFERENCES
[1] A. Ozcift, “SVM feature selection based rotation forest ensemble classifiers to improve
computer-aided diagnosis of Parkinson disease” Journal of medical systems, vol-36, no. 4,
pp. 2141-2147, 2012.
[2] Anila M Department of CS1, Dr G Pradeepini Department of CSE, “DIAGNOSIS OF
PARKINSON’S DISEASE USING ARTIFICIAL NEURAL NETWORK”, JCR, 7(19):
7260-7269, 2020.
[3] Arvind Kumar Tiwari, “Machine Learning based Approaches for Prediction of
Parkinson’s Disease” Machine Learning and Applications: An International Journal
(MLAU) vol. 3, June 2016.
[4] Carlo Ricciardi, et al, “Using gait analysis’ parameters to classify Parkinsonism: A data
mining approach” Computer Methods and Programs in Biomedicine vol. 180, Oct. 2019.
[5] Dr. Anupam Bhatia and Raunak Sulekh, “Predictive Model for Parkinson’s Disease
through Naive Bayes Classification” International Journal of Computer Science &
Communication vol-9, Dec. 2017, pp. 194- 202, Sept 2017 - March 2018.
[6] Dr. R.GeethaRamani, G.Sivagami, ShomonaGraciajacob “Feature Relevance Analysis
and Classification of Parkinson’s Disease TeleMonitoring data Through Data Mining”
International Journal of Advanced Research in Computer Science and Software
Engineering, vol-2, Issue 3, March 2012.
[7] Dragana Miljkovic et al, “Machine Learning and Data Mining Methods for Managing
Parkinson’s Disease” LNAI 9605, pp. 209-220, 2016.
[8] Farhad Soleimanian Gharehchopogh, Peyman Mohammadi, “A Case Study of
Parkinson’s Disease Diagnosis Using Artificial Neural Networks” International Journal of
Computer Applications, Vol-73, No.19, July 2013.
[9] Heisters. D, “Parkinson’s: symptoms, treatments and research”. British Journal of
Nursing, 20(9), 548–554. doi:10.12968/bjon.2011.20.9.548, 2011.

[10] M. Abdar and M. Zomorodi-Moghadam, “Impact of Patients’ Gender on Parkinson’s


disease using Classification Algorithms” Journal of AI and Data Mining, vol-6, 2018.
[11] M. A. E. Van Stiphout, J. Marinus, J. J. Van Hilten, F. Lobbezoo, and C. De Baat,
“Oral health of Parkinson’s disease patients: a case-control study” Parkinson’s disease, vol-

2018, Article ID 9315285, 8 pages, 2018.
[12] Md. Redone Hassan, et al, “A Knowledge Base Data Mining based on Parkinson’s
Disease” International Conference on System Modelling & Advancement in Research
Trends, 2019.
[13] Mandal, Indrajit, and N. Sairam. “New machine-learning algorithms for prediction of
Parkinson's disease” International Journal of Systems Science 45.3: 647-666, 2014.
[14] Mohamad Alissa,” Parkinson’s Disease Diagnosis Using Deep Learning”, August
2018.
[15] PeymanMohammadi, AbdolrezaHatamlou and Mohammed Msdaris “A Comparative
Study on Remote Tracking of Parkinson’s Disease Progression Using Data Mining
Methods” International Journal in Foundations of Computer Science and
Technology(IJFCST), vol-3, No.6, Nov 2013.
[16] R. P. Duncan, A. L. Leddy, J. T. Cavanaugh et al., “Detecting and predicting balance
decline in Parkinson disease: a prospective cohort study” Journal of Parkinson’s Disease,
vol-5, no. 1, pp. 131–139, 2015.
[17] Ramzi M. Sadek et al., “Parkinson’s Disease Prediction using Artificial Neural
Network” International Journal of Academic Health and Medical Research, vol-3, Issue 1,
January 2019.
[18] Satish Srinivasan, Michael Martin & Abhishek Tripathi, “ANN based Data Mining
Analysis of Parkinson’s Disease” International Journal of Computer Applications, vol-168,
June 2017.
[19] Shahid, A.H., Singh, M.P. A deep learning approach for prediction of Parkinson’s
disease progression, https://doi.org/10.1007/s13534-020-00156-7, Biomed. Eng. Lett. 10,
227–239, 2020.

[20] Shubham Bind, et al, “A survey of machine learning based approaches for Parkinson
disease prediction” International Journal of Computer Science and Information
Technologies vol-6, Issue 2, pp. 1648- 1655, 2015.
[21] Siva Sankara Reddy Donthi Reddy and Udaya Kumar Ramanadham “Prediction of
Parkinson’s Disease at Early Stage using Big Data Analytics”, ISSN: 2249 – 8958, Volume-
9 Issue-4, April 2020
[22] Sriram, T. V., et al. “Intelligent Parkinson Disease Prediction Using Machine Learning

Algorithms” International Journal of Engineering and Innovative Technology, vol-3, Issue
3, September 2013.
[23] T. Swapna, Y. Sravani Devi, “Performance Analysis of Classification algorithms on
Parkinson’s Dataset with Voice Attributes”. International Journal of Applied Engineering
Research ISSN 0973-4562 Volume 14, Number 2 pp. 452-458, 2019.
[24] T. J. Wroge, Y. Özkanca, C. Demiroglu, D. Si, D. C. Atkins and R. H. Ghomi,
"Parkinson’s Disease Diagnosis Using Machine Learning and Voice," IEEE Signal
Processing in Medicine and Biology Symposium (SPMB), pp.1-7, doi:
10.1109/SPMB.2018.8615607, 2018.

International conference on Signal Processing, Communication, Power and Embedded System (SCOPES)-2016

An Improved Approach for Prediction of Parkinson’s Disease using Machine Learning Techniques

Kamal Nayan Reddy Challa, School of Electrical Sciences, Computer Science and Engineering, Indian Institute of Technology Bhubaneswar, India 751013 (Email: kc11@iitbbs.ac.in)
Venkata Sasank Pagolu, School of Electrical Sciences, Computer Science and Engineering, Indian Institute of Technology Bhubaneswar, India 751013 (Email: vp12@iitbbs.ac.in)
Ganapati Panda, School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India 751013 (Email: gpanda@iitbbs.ac.in)
Babita Majhi, Department of Computer Science and IT, G.G Vishwavidyalaya, Central University, Bilaspur, India 495009 (Email: babita.majhi@gmail.com)

arXiv:1610.08250v1 [cs.LG] 26 Oct 2016

Abstract—Parkinson’s disease (PD) is one of the major public health problems in the world. It is a well-known fact that around one million people suffer from Parkinson’s disease in the United States whereas the number of people suffering from Parkinson’s disease worldwide is around 5 millions. Thus, it is important to predict Parkinson’s disease in early stages so that early plan for the necessary treatment can be made. People are mostly familiar with the motor symptoms of Parkinson’s disease, however an increasing amount of research is being done to predict the Parkinson’s disease from non-motor symptoms that precede the motor ones. If early and reliable prediction is possible then a patient can get a proper treatment at the right time. Non-motor symptoms considered are Rapid Eye Movement (REM) sleep Behaviour Disorder (RBD) and olfactory loss. Developing machine learning models that can help us in predicting the disease can play a vital role in early prediction. In this paper we extend a work which used the non-motor features such as RBD and olfactory loss. Along with this the extended work also uses important biomarkers. In this paper we try to model this classifier using different machine learning models that have not been used before. We developed automated diagnostic models using Multilayer Perceptron, BayesNet, Random Forest and Boosted Logistic Regression. It has been observed that Boosted Logistic Regression provides the best performance with an impressive accuracy of 97.159 % and the area under the ROC curve was 98.9%. Thus, it is concluded that this models can be used for early prediction of Parkinson’s disease.

Keywords—Improved Accuracy, Prediction of Parkinson’s Disease, Non Motor Features, Biomarkers, Machine Learning Techniques, Boosted Logistic Regression, BayesNet, Multilayer Perceptron

I. INTRODUCTION

Parkinson’s disease (PD) is a chronic, degenerative neurological disorder. The main cause of Parkinson’s disease is actually unknown. However, it has been researched that the combination of environmental and genetic factors play an important role in causing PD [1]. For general understanding the Parkinson’s disease is treated as disorder of the central nervous system which is the result of loss of cells from various parts of the brain. These cells also include substantia nigra cells that produce dopamine. Dopamine plays a vital role in the coordination of movement. It acts as a chemical messenger for transmitting signals within the brain. Due to the loss of these cells, patients suffer from movement disorder.

The symptoms of PD can be classified into two types i.e. non-motor and motor symptoms. Many people are aware of the motor symptoms as they can be visually perceived by human beings. These symptoms are also called as cardinal symptoms, these include resting tremor, slowness of movement (bradykinesia), postural instability (balance problems) and rigidity [2]. It is now established that there exists a time-span in which the non-motor symptoms can be observed. This symptoms are called as dopamine-non-responsive symptoms. These symptoms include cognitive impairment, sleep difficulties, loss of sense of smell, constipation, speech and swallowing problems, unexplained pains, drooling, constipation and low blood pressure when standing. It must be noted that none of these non-motor symptoms are decisive, however when these features are used along with other biomarkers from Cerebrospinal Fluid measurement (CSF) and dopamine transporter imaging, they may help us to predict the PD.

In this paper we extend works by Prashant et al [3]. This work takes into consideration the non-motor symptoms and the biomarkers such as cerebrospinal fluid measurements and dopamine transporter imaging. In this paper we follow a similar approach, however we try to use different machine learning algorithms that can help in improving the performance of model and also play a vital role in making in early prediction of PD which in turn will help us to initiate neuroprotective therapies at the right time.

The rest of the paper is organized as follows. Section 2 contains the related work. Section 3 contains the flowchart of the analysis carried out and describes about the PPMI database, explanation of different features extracted, statistical analysis of this features, classification and prediction/prognostic model design. Section 4 provides the results and discussion from the experiments carried out. And finally conclusion of the work is provided in Section 5.

II. RELATED RESEARCH WORK

Different researchers have used different features and data to predict Parkinson’s disease. Indira et al. [4] have used biomedical voice of human as the main feature. The authors have developed a model to automatically predict whether a person is suffering from PD by analysing the voice of the patients. They have used fuzzy c-means (FCM) clustering and pattern recognition methods on the dataset and have attained an accuracy of 68.04%, 75.34% sensitivity and 45.83% specificity. Amit et al. [5] have presented a unique approach of classifying PD patients on the basis of their postural instability and have used L2 norm metric in conjunction with support vector machine. In [6], the authors have applied University of Pennsylvania 40-item smell identification test (UPSIT-40) and 16-item identification test from Sniffins Sticks. This study was conducted on Brazilian population. The authors have applied logistic regression considering each of the above features separately. They observed that the Sniffin Sticks gave a specificity of 89.0 % and a sensitivity of 81.1 %. Similarly they found out that the UPSIT-40 specificity was 83.5% and sensitivity 82.1%. Prashant et al. [7] have used olfactory loss feature loss from 40-item UPSIT and sleep behaviour disorder from Rapid eye movement sleep Behaviour Disorder Screening Questionnaire (RBDSQ). Support Vector machine and classification tree methods have been employed to train their methods. They have reported an accuracy of 85.48% accuracy and 90.55% sensitivity. This work has been extended by the same authors in [3]. In this paper they added new features in the form of CSF measurements and SPECT imaging markers. They reported an accuracy of 96.40% and 97.03% sensitivity. This paper has motived us to further the study. In the present paper an attempt has been made to improve the accuracy by using advanced machine learning models. Some recent machine learning algorithms have been chosen for prediction and have made a comparative performance analysis of these models based on accuracy, area under the ROC curve and other measures.

III. MATERIALS AND METHODS

A flowchart of the proposed analysis is shown in Fig 1. The data was first collected and the required non-motor and biomarker features are then extracted. Then different machine learning algorithms are employed for the classification task. Finally, a comparative analysis is made based on the accuracy provided by different machine learning models.

A. Database

In this study the data from Parkinson’s Progression Markers Initiative (PPMI) database [8] was obtained. PPMI is an observational, multicentre study that collects clinical and imaging data and biologic samples from various cohorts that can be used by researchers to establish markers of disease progression in PD. PPMI has established a comprehensive, standardized, longitudinal PD data and biological sample repository that can play a vital role in the development of tools which assist in prediction of PD. To obtain the recent information, the official website of PPMI ( www.ppmi-info.org ) can be visited. This dataset is similar to the one used in [3]. We downloaded the database on 8th August 2016. On this date the data of 184 normal patients and 402 early PD subjects were collected. It is noted that PPMI has observations from each of the patients at different time intervals. Thus the data of each patient at different periods like screening or baseline, first visit, second visit and so on are available. In the present investigation the data at baseline observation are considered.

In [3], the authors have used features from University of Pennsylvania Smell Identification Test, RBD screening questionnaire, CSF Markers of Aβ1-42, α-syn, P-tau181, T-tau, T-tau/Aβ1-42, P-tau181/Aβ1-42 and P-tau181/T-tau, and SPECT measurements of striatal binding ratio (SBR) data. In this study these features have been used because we felt that they are a good combination of non-motor features and biomarkers. The details of these features are given in section III B.

B. Feature Description

1) University of Pennsylvania Smell Identification Test (UPSIT): Olfactory dysfunction is an important marker of Parkinson’s disease [9]. It acts as sensitive and early marker for Parkinson’s disease. It is a fact that most of the people who suffer from PD have olfactory loss however it doesn‘t mean that all the people with olfactory loss are suffering from PD [10]. Olfactory dysfunction are in various forms for instance it may be impairment in odour detection or odour differentiation. A study by Posen et al [11] showed that about 10% of the subjects who were suffering from odour dysfunction were at the risk of PD.

For quantifying this odour loss the data of University of Pennsylvania Smell Identification Test is used. This test is commercially available and is also one of the most reliable tests [12]. The procedure of the test is as follows. A subject is provided with 4 different 10 page booklets. Each of this pages has a different odour. A subject has to scratch the page and smell it. For each of this pages, there exists a question with four options. Depending on the odour the subject selects one of the options. This procedure is repeated for all the pages in all the booklets. Once the test is completed the UPSIT score is calculated. The maximum score can be 40 when the subject identifies each of the odours correctly. One main advantage of this is that the test takes only a few minutes. For the present analysis the UPSIT score at baseline check-up from PPMI [8] has been taken.
2) REM sleep Behaviour Disorder Screening Questionnaire (RBDSQ): RBD is another non-motor symptom that plays an important role in early prediction of Parkinson’s disease. People suffering from RBD have disturbances in sleep. These disturbances include vivid, aggressive or action packed dreams. Similar to olfactory loss, studies have shown that disorder in sleep behaviour increases the risk of being affected with Parkinson’s disease. For quantifying this non-motor symptom, the REM Sleep Behaviour Disorder Screening Questionnaire is used. The RBDSQ is a 10-item patient self-rating instrument [13]. The test contains ten short questions with answers as yes or no. A yes is equivalent to 1 and a no is equivalent to 0. The ten questions are divided such that each of the group of the questions provides the observations about a particular behaviour. Some of the examples of the questions from [13] are “I sometimes have vivid dreams”, “The dream contents mostly matches my nocturnal behaviour”, “My sleep is frequently disturbed”, etc. As some of the subjects may have a bed partner, they can also be used in this test.

Each of the answers are provided as either one or zero. In the present study the feature for sleep disorder is obtained by summing up all the answers. This sum can be a maximum of 12 if we take the first nine questions. It is observed here that a higher score in this case means a higher risk of PD in contrast to that of UPSIT score. This RBDSQ score is taken from PPMI [8].

3) Cerebrospinal Fluid Biomarkers: Biomarkers play a pivotal role in this analysis. Without the aid of biomarkers the prediction of PD is less accurate. The biomarkers are the significant factors in increasing the accuracy of the model. Biomarkers need to be sensitive, reproducible and must be closely associated with the disease. Cerebrospinal fluid is a clear, colourless body fluid found in the brain. It has more physical contact with the brain as compared to any other fluid [14]. Due to the close proximity with the brain, any protein or peptide which is related to the brain specific functionalities or disease are diffused into CSF. Hence, the CSF can act as an important biomarker for brain related diseases which in the present case is Parkinson’s disease.

The CSF samples are collected from PPMI. In PPMI, for each of subjects the CSF samples are obtained and certain measurements are made. These measurements include Aβ1-42 (amyloid beta (1-42)), T-tau (total tau) and P-tau181 (tau phosphorylated at threonine) [15]. According to PPMI Research Laboratory these three are the important biomarkers that can be extracted from the CSF fluid. Along with this the concentration of α-Syn was also collected from PPMI database. Kang et al have mentioned that ratios like T-tau/Aβ1-42, P-tau181/Aβ1-42 and P-tau181/T-tau also play a significant role in early detection of Parkinson’s disease [16]. In the present investigation the measurements of Aβ1-42, T-tau and P-tau181 and also the ratios T-tau/Aβ1-42, P-tau181/Aβ1-42 and P-tau181/T-tau are taken.

4) Neuroimaging markers: Single-photon emission computed tomography (SPECT) is a neuroimaging technique that uses gamma rays [17]. The SPECT is a common routine for helping a doctor to decide whether a subject is suffering from neurodegenerative diseases. According to [18], the SPECT imaging can detect the dopaminergic transporter loss during the early stages of PD. When a subject has an abnormal scanning then the person has more probability of being affected with Parkinson’s disease or other neuro degenerative disease. However, a normal scan denotes that the subject is suffering from other type of diseases [18].

DatScan SPECT imaging obtained from PPMI imaging centres are used in this study. At PPMI the striatal binding ratios were calculated. The DatScan SPECT images are collected according to the PPMI imaging protocol. This raw images are then reconstructed so as to ensure consistency among different imaging centres. After this attenuation correction is performed on these images. After this the Gaussian filter is applied and it is followed by normalization. Finally the required part is extracted from the images and then the striatal binding ratio for left and right caudate, the left and right putamen are calculated [19]. In this paper, these four striatal binding values are used as neuroimaging biomarkers.

C. Prediction models for distinguishing early PD and healthy normal subjects

In this study, four different machine learning classifiers are chosen for classification task. A brief description of each of them is provided in this section. WEKA [20] is used for classification using Multilayer Perceptron, Bayesian Network, Random Forest, and Boosted Logistic Regression. The main motive is to find an algorithm that can improve the already reported accuracy as well as to see how various models are performing. Firstly, the dataset is normalized using the Normalize filter in WEKA [20]. Then the dataset is divided in such a way that 70% is used for training and the rest 30% is used for testing. While partitioning the dataset the same class proportion in both the test and train data is maintained. For example, if the proportion of healthy people in the complete data is 40% then both in training and testing the proportion of healthy people to PD subjects is maintained at 40%. This type of partitioning is known as stratified partitioning. The accuracy, recall, precision and f-measure for each these algorithms are computed and the ROC of each of the classifiers are plotted. Finally the performance measure of different classifiers used in this paper as well as in [3] are compared.

1) Multilayer Perceptron: Multilayer perceptron is a feed-forward artificial neural network. The basic principle of multilayer perceptron is that it takes the input and maps it to a nonlinear space, then it tries to predict the corresponding outputs. A MLP architecture is viewed as a multiple layers of nodes, with each layer being fully connected with the next layer. Each node in the MLP is interpreted as a neuron that has an activation function which is non-linear [21] [22]. The back-propagation algorithm which is a supervised learning technique is used for training the model. The number of hidden layers in the MLP have a significant impact on the
TABLE I: Performance Measures for various classifiers used in the study

Performance Measure | Multilayer Perceptron (Train / Test) | BayesNet (Train / Test) | Random Forest (Train / Test) | Boosted Logistic Regression (Train / Test)
Accuracy (%)        | 96.09 / 95.4545 | 96.5854 / 96.027 | 100 / 96.59 | 95.8537 / 97.1591
Recall              | 0.961 / 0.955   | 0.966 / 0.960    | 1 / 0.966   | 0.959 / 0.972
Precision           | 0.962 / 0.955   | 0.967 / 0.965    | 1 / 0.970   | 0.959 / 0.974
F-Measure           | 0.961 / 0.955   | 0.966 / 0.961    | 1 / 0.967   | 0.959 / 0.972
AUC                 | 0.989 / 0.986   | 0.994 / 0.994    | 1 / 0.997   | 0.995 / 0.989
The number of hidden layers in the MLP has a significant impact on the performance of the classifier.

2) Bayesian Network: The Bayesian network is one of the probabilistic graphical models used in machine learning. A Bayes Net corresponds to a graphical model structure known as a directed acyclic graph (DAG). These graphical models are understood in the following manner [23]: the nodes in the graph represent the random variables, and an edge between node x and node y denotes the probabilistic dependency between the random variables corresponding to those nodes. Hence the nodes that are not connected in the Bayesian network are random variables that are independent of each other. Different computational and statistical methods are used to estimate the conditional dependencies. Bayes Network learning uses various search algorithms and quality measures. In the present model the K2 learning algorithm is used for searching.

3) Random Forest: Random forests are an ensemble learning method used for classification, regression and other tasks. In a Random forest there are many decision trees. For a given input, each of the decision trees classifies it as yes/no (in the case of binary classification) [24] [25]. Once each of the trees has produced its yes/no vote, the value which has the majority among them is taken as the output. The advantages are that this algorithm runs effectively on large inputs and it also helps in estimating which of the features are important.

4) Boosted Logistic Regression: Logistic regression was developed by statistician David Cox in 1958 [26] [27]. A logistic model is used to predict a binary class using one or more features. The logit, the natural logarithm of an odds ratio, is the central mathematical concept behind logistic regression. Logistic regression is well suited when one wants to establish a relationship between a categorical outcome variable and one or more categorical or continuous predictor variables [28].

Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance, in supervised learning. It belongs to the family of machine learning algorithms which convert weak learners into strong ones. AdaBoost is used for boosting different classifiers.

IV. RESULTS AND DISCUSSION

Table 1 shows the performance of the various classifiers used in the study. Fig. 2 shows the corresponding ROC plots. In the Multilayer Perceptron the back-propagation algorithm is used to train the model. The learning rate is set at 0.4 and the number of hidden layers is chosen as 8 in this case ((number of attributes + number of classes)/2 = 8). In BayesNet various search algorithms and quality measures are used. A Simple Estimator is chosen for estimating the conditional probability tables of a Bayes network once the structure has been learned. For searching, the K2 algorithm is used; it is a hill-climbing algorithm which is restricted by an order on the variables. In boosted logistic regression the AdaBoost M1 method is used to boost the logistic regression.

It is observed that all the classifiers performed reasonably well, with boosted logistic regression giving the best performance at 97.16% accuracy and 98.9% area under the ROC curve (AUC). Table 2 shows how these models performed in relation to the previous work [3]. It is found that the accuracy and area under the ROC curve are nearly the same among the different classifiers used. The present work and [3] have the advantage that the dataset used is very large compared to others. However, it is noted that the PPMI study includes subjects who are in early stages of PD and healthy normals; it does not include subjects who have premotor symptoms but are not diagnosed as PD due to a lack of motor symptoms.

V. CONCLUSION

The diagnosis of Parkinson's Disease is not direct, which means that one particular test like a blood test or an ECG cannot determine whether a person is suffering from PD or not. Doctors go through the medical history of a patient, followed by a thorough neurological examination. They look for at least two cardinal symptoms in the subjects and then predict whether the subject is suffering from PD. The misdiagnosis rate of PD is significant due to the lack of a definitive test. In such a case it is helpful to aid the doctor by providing a machine learning model. The prediction models are developed using the machine learning techniques of boosted logistic regression, classification trees, Bayes Net and multilayer perceptron based on these significant features. It is observed that the performance is better, and it is demonstrated that Boosted Logistic Regression produces superior results. These results encourage us to try other ensemble learning techniques. The present work employs different machine learning algorithms which are not used in [3], so this study provides a comparative analysis of various machine learning algorithms. In conclusion, these models can provide nuclear medicine experts assistance that can aid them in better and more accurate decision making and clinical diagnosis. It is also found that the proposed method is fully automated and provides improved performance and hence can be recommended for real-life applications.
TABLE II: Comparative Analysis of Machine Learning models in the current work and previous work

Machine Learning Algorithm   | Accuracy (%) Training / Testing | AUC (%) Training / Testing
Multilayer Perceptron        | 96.09 / 95.45   | 98.9 / 98.6
BayesNet                     | 96.5854 / 96.02 | 99.4 / 99.4
Random Forest                | 100 / 96.59     | 100 / 99.7
Boosted Logistic Regression  | 95.8537 / 97.16 | 99.5 / 98.9
Boosted Trees                | 100 / 95.08     | 100 / 98.23
Naive Bayes                  | 94.67 / 93.12   | 98.66 / 96.77
Support Vector Machine       | 97.14 / 96.40   | 99.27 / 98.88
Logistic Regression          | 96.50 / 95.63   | 99.20 / 98.66
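To make the training protocol behind Tables I and II concrete, the following Python sketch approximates the WEKA workflow described above (Normalize filter, stratified 70/30 partition, AdaBoost.M1 over a logistic-regression learner) with scikit-learn. It is an illustration under stated assumptions only: the placeholder matrix X and labels y stand in for the PPMI-derived features (UPSIT, RBDSQ, CSF measures, striatal binding ratios) and the PD/healthy labels, which are not reproduced here, and scikit-learn's AdaBoost is used as a stand-in for WEKA's AdaBoostM1.

```python
# Illustrative sketch only: approximates the WEKA setup (Normalize filter, stratified
# 70/30 split, AdaBoostM1 over Logistic) with scikit-learn. X and y are synthetic
# placeholders for the PPMI-derived features and PD / healthy-normal labels.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))                  # placeholder feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # placeholder labels (1 = PD, 0 = healthy)

# Stratified 70/30 partition: the PD / healthy proportion is the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Rescale each attribute to [0, 1], analogous to WEKA's Normalize filter.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Boosted logistic regression: AdaBoost over a logistic-regression base learner.
clf = AdaBoostClassifier(LogisticRegression(max_iter=1000),
                         n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("test AUC     :", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

Swapping the boosted model for an MLP, Bayesian network or random forest classifier would give the other rows of Table I in the same way.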

REFERENCES

[1] Alves G, Forsaa EB, Pedersen KF, Dreetz Gjerstad M, Larsen JP (2008) Epidemiology of Parkinson's disease. J Neurol 255 Suppl 5: 18-32.
[2] Vu TC, Nutt JG, Holford NH (2012) Progression of Motor and Non-Motor Features of Parkinson's Disease and Their Response to Treatment. Br J Clin Pharmacol.
[3] R. Prashant, Sumantra Dutta Roy, Pravat K. Mandal, Shantanu Ghosh. High Accuracy Detection of Early Parkinson's Disease through Multimodal Features and Machine Learning.
[4] Rustempasic, Indira, & Can, M. (2013). Diagnosis of Parkinson's disease using Fuzzy C-Means Clustering and Pattern Recognition. SouthEast Europe Journal of Soft Computing, 2(1).
[5] Amit S., Ashutosh M., A. Bhattacharya, F. Revilla (2014, March). Understanding Postural Response of Parkinson's Subjects Using Nonlinear Dynamics and Support Vector Machines. Austin J Biomed Eng 1(1): id1005.
[6] L. Silveira-Moriyama, J. Carvalho Mde, R. Katzenschlager, A. Petrie, R. Ranvaud, E.R. Barbosa, A.J. Lees, The use of smell identification tests in the diagnosis of Parkinson's disease in Brazil, Mov Disord, 23 (2008) 2328-2334.
[7] R. Prashanth, S. Dutta Roy, P. K. Mandal, S. Ghosh. Parkinson's disease detection using olfactory loss and REM sleep disorder features.
[8] K. Marek, D. Jennings, S. Lasch, A. Siderowf, C. Tanner, T. Simuni, C. Coffey, K. Kieburtz, E. Flagg, S. Chowdhury, W. Poewe, B. Mollenhauer, P.-E. Klinik, T. Sherer, M. Frasier, C. Meunier, A. Rudolph, C. Casaceli, J. Seibyl, S. Mendick, N. Schuff, Y. Zhang, A. Toga, K. Crawford, A. Ansbach, P. De Blasio, M. Piovella, J. Trojanowski, L. Shaw, A. Singleton, K. Hawkins, J. Eberling, D. Brooks, D. Russell, L. Leary, S. Factor, B. Sommerfeld, P. Hogarth, E. Pighetti, K. Williams, D. Standaert, S. Guthrie, R. Hauser, H. Delgado, J. Jankovic, C. Hunter, M. Stern, B. Tran, J. Leverenz, M. Baca, S. Frank, C.-A. Thomas, I. Richard, C. Deeley, L. Rees, F. Sprenger, E. Lang, H. Shill, S. Obradov, H. Fernandez, A. Winters, D. Berg, K. Gauss, D. Galasko, D. Fontaine, Z. Mari, M. Gerstenhaber, D. Brooks, S. Malloy, P. Barone, K. Longo, T. Comery, B. Ravina, I. Grachev, K. Gallagher, M. Collins, K.L. Widnell, S. Ostrowizki, P. Fontoura, T. Ho, J. Luthman, M.v.d. Brug, A.D. Reith, P. Taylor, The Parkinson Progression Marker Initiative (PPMI), Prog Neurobiol, 95 (2011) 629-635.
[9] Richard L. Doty, Olfactory dysfunction in Parkinson disease.
[10] https://www.michaeljfox.org/understanding-parkinsons/living-with-pd/topic.php?smell-loss
[11] M.M. Ponsen, D. Stoffers, J. Booij, B.L. van Eck-Smit, E.C. Wolters, H.W. Berendse, Idiopathic hyposmia as a preclinical sign of Parkinson's disease, Ann Neurol, 56 (2004) 173-181.
[12] R.L. Doty, P. Shaman, M. Dann, Development of the University of Pennsylvania Smell Identification Test: a standardized microencapsulated test of olfactory function, Physiol Behav, 32 (1984) 489-502.
[13] Karin Stiasny-Kolster, Geert Mayer, Sylvia Schäfer, Wolfgang H. Oertel (Hephata Klinik, Schwalmstadt-Treysa, Germany), The REM Sleep Behaviour Disorder Screening Questionnaire - A new diagnostic instrument.
[14] Nadia Magdalinou, Andrew J Lees, Henrik Zetterberg, Cerebrospinal fluid biomarkers in parkinsonian conditions: an update and future directions.
[15] Ju Hee Kang, John Q Trojanowski and Leslie M Shaw, Aβ1-42, t-tau and p-tau181 measurements in 120 PPMI CSF samples using xMAP/Luminex multiplex immunoassay, Department of Pathology & Laboratory Medicine and Institute on Aging, Center for Neurodegenerative Disease Research, Perelman School of Medicine, University of Pennsylvania.
[16] J.H. Kang, D.J. Irwin, A.S. Chen-Plotkin, A. Siderowf, C. Caspell, C.S. Coffey, T. Waligórska, P. Taylor, S. Pan, M. Frasier, K. Marek, K. Kieburtz, D. Jennings, T. Simuni, C.M. Tanner, A. Singleton, A.W. Toga, S. Chowdhury, B. Mollenhauer, J.Q. Trojanowski, L.M. Shaw, Parkinson's Progression Markers Initiative, Association of cerebrospinal fluid β-amyloid 1-42, T-tau, P-tau181, and α-synuclein levels with clinical features of drug-naive patients with early Parkinson disease, JAMA Neurol, 70 (2013) 1277-1287.
[17] SPECT at the US National Library of Medicine Medical Subject Headings (MeSH).
[18] J.L. Cummings, C. Henchcliffe, S. Schaier, T. Simuni, A. Waxman, P. Kemp, The role of dopaminergic imaging in patients with symptoms of dopaminergic system neurodegeneration, Brain, 134 (2011) 3146-3166.
[19] Gary Wisniewski, John Seibyl, Ken Marek, DatScan SPECT Image Processing Methods for Calculation of Striatal Binding Ratio (SBR), Institute for Neurodegenerative Disorders (IND).
[20] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
[21] Rosenblatt, Frank. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington DC, 1961.
[22] Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". In: David E. Rumelhart, James L. McClelland, and the PDP research group (editors), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, 1986.
[23] Ben-Gal I., Bayesian Networks, in Ruggeri F., Faltin F. & Kenett R., Encyclopedia of Statistics in Quality & Reliability, Wiley & Sons (2007).
[24] Ho, Tin Kam (1995). Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14-16 August 1995, pp. 278-282.
[25] Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832-844. doi:10.1109/34.709601.
[26] Walker, SH; Duncan, DB (1967). "Estimation of the probability of an event as a function of several independent variables". Biometrika. 54: 167-178. doi:10.2307/2333860.
[27] Cox, DR (1958). "The regression analysis of binary sequences (with discussion)". J Roy Stat Soc B. 20: 215-242.
[28] Chao-Ying Joanne Peng, Kuk Lida Lee, Gary M. Ingersoll, An Introduction to Logistic Regression Analysis and Reporting. Indiana University-Bloomington.
Fig. 1: Flow Chart of the proposed analysis

Fig. 2: ROC Plots for different machine learning algorithms on the test data: (a) Multilayer Perceptron, (b) BayesNet, (c) Random Forest, (d) Boosted Logistic Regression. X-axis = False Positive Rate, Y-axis = True Positive Rate.
The International journal of analytical and experimental modal analysis ISSN NO:0886-9367

Parkinson's disease Prediction Using Machine Learning Techniques

Kapuganti Manoj1, Dr. Selvani Deepthi Kavila, Ph.D.2, Immidi Manikanta3, Cheemalapathi Deekshith4, Kalaga Venkata Mukesh5

1, 3, 4, 5 Student, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam
2 Associate Professor, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam

Abstract- Parkinson's is considered one of the deadliest progressive nervous system diseases affecting movement. It is the second most common neurological disorder that causes disability, reduces the life span, and still has no cure. Nearly 90% of the people affected with this disease have speech disorders. Large datasets available in various data repositories are used to solve real-world applications, and Machine Learning techniques also help the medical field to detect diseases such as Parkinson's, which affects many people. In this paper, the authors propose a Parkinson's prediction model for better classification of Parkinson's, using features such as PPE (Pitch Period Entropy), DFA (Detrended Fluctuation Analysis), RPDE (Recurrence Period Density Entropy), etc. The authors use various Machine Learning techniques, namely KNN, Naïve Bayes, and Logistic Regression, and show how these algorithms predict Parkinson's from the input taken from the user, with the dataset serving as the input to the algorithms. The dataset used in this paper is downloaded from the Kaggle website and contains the speech features of Parkinson's patients. Based on these features the authors identify the algorithm that gives the highest accuracy. The accuracies obtained for the three algorithms are KNN with 80%, Logistic Regression with 79%, and Naïve Bayes with the highest accuracy of 81%; the latter is used in the frontend to predict whether the patient has Parkinson's or not.

Keywords - Health, Parkinson, PPE, DFA, RPDE, Logistic Regression, KNN, Naive Bayes, Prediction.

I. INTRODUCTION

A recent report of the World Health Organization shows a remarkable rise in the number and health burden of Parkinson's disease patients, and it is estimated that China will have nearly half of the world's Parkinson's disease population by 2030. Classification techniques are broadly used in the medical field for classifying data into different classes according to some features. Parkinson's disease is a neurological disorder that leads to shaking, shivering, stiffness, and difficulty with walking and balance. It is mainly caused by the breakdown of cells in the nervous system. Parkinson's can cause both motor and non-motor symptoms. The motor symptoms include slowness of movement, rigidity, balance problems, and tremors. As the disease progresses, the affected people may have difficulty walking and talking. The non-motor symptoms include anxiety, breathing problems, depression, loss of smell, and change in speech. If the mentioned symptoms are present in a person, then the details are stored in the records. In this paper, the authors consider the speech features of the patient, and this data is used for predicting whether the patient has Parkinson's disease or not.

The biggest risk factor for getting this disease is age. It mostly affects people who are 60 and older, and the risk goes up with the years. Both genders can have Parkinson's disease, but it affects men more than women, with a ratio of about 2:1. Whites are affected more often than other groups.


People working on farms or in factories, and those more in contact with chemicals, have a higher chance of getting this disease. Parkinson's disease may affect speech in different ways: people with this disease may develop a softer, breathy voice and slurred speech, their tone tends to become monotone, and they may have difficulty finding the right words. Such a disease spreads through the body gradually without early warning. To solve this kind of problem, Machine Learning algorithms play an important role. These algorithms find solutions to problems by recognizing patterns in databases. Machine Learning can analyse complex data and stored medical records for further analysis, and an automated machine can predict disease in a better way once it is trained. For linear data, classification algorithms are used. Classification is the process of categorizing a given dataset into classes; it predicts class labels for new data based on the training dataset. There are many classification algorithms, including Logistic Regression, Decision Tree, Random Forest, Naïve Bayes, and KNN. Out of these algorithms, in this paper the authors use KNN, Logistic Regression, and Naïve Bayes to predict Parkinson's disease using the speech dataset, report the accuracy of every algorithm, and predict Parkinson's based on the speech features.

In this paper the authors follow a similar approach: they use different machine learning algorithms that help in increasing the accuracy of the model and extend the work by adopting the Naïve Bayes algorithm, which gives the highest accuracy. It is used to detect whether Parkinson's disease is present or not: the speech features of a person are entered, checked against the training set, and the output is predicted.

II. LITERATURE SURVEY

Voice features of the patients are assumed to be about 90% helpful in identifying the presence of Parkinson's disease. Different researchers have applied different features of the dataset to predict the disease, and many authors used the voices of the patients to analyse Parkinson's. In general, there are two kinds of speech problems: hypophonia and dysarthria. Hypophonia means a very weak and soft voice. Dysarthria means slow, slurred speech that can be difficult to understand, caused by disorders of the central nervous system.

Dr. Anupam Bhatia and Raunak Sulekh proposed a paper "Predictive Model for Parkinson's Disease through Naive Bayes Classification" [3]. This paper uses the Naïve Bayes classifier to analyse performance on the dataset. The dataset used consists of recorded speech signals; the features are voice measures, and the aim is to predict PD status: 0 for a healthy person and 1 for a person having the disease. The authors used the RapidMiner tool to analyse the data. The Naïve Bayes classifier produced 98.5% accuracy. This model helps doctors and patients to detect the disease and take preventive measures at the right time.

Carlo Ricciardi, et al., proposed a paper "Using gait analysis' parameters to classify Parkinsonism: A data mining approach" [2]. In this paper, the authors compare the performance of two algorithms, Random Forest and Gradient Boosted Trees. PD patients at different stages were taken into consideration and identified as typical or atypical based on gait analysis using a data mining approach. The comparative analysis showed that Random Forest obtained the highest accuracy of 86.4%. This model also helped clinicians to distinguish PD patients at an early stage.

Arvind Kumar Tiwari proposed a paper "Machine Learning-based Approaches for Prediction of Parkinson's Disease" [1]. The dataset used in this paper consists of voice recordings of the patients. The author chose the most important features to predict Parkinson's disease using minimum redundancy maximum relevance feature selection, applied different machine learning algorithms, and compared them. Random Forest provided the highest accuracy of 90.3%.

Mehrbakhsh Nilashi, et al., proposed a paper "A hybrid intelligent system for the prediction of Parkinson's Disease progression using Machine Learning techniques" [8].


In this paper, the Unified Parkinson's Disease Rating Scale (UPDRS), which is mostly used to assess Parkinsonism, is the prediction target, and the authors describe that the relationship between speech signals and UPDRS is important for detecting Parkinson's. An Incremental Support Vector Machine is used to predict Total-UPDRS and Motor-UPDRS. The authors conclude that a combination of ISVR, SOM, and NIPALS gives effective results for predicting the UPDRS.

M. Abdar and M. Zomorodi-Moghadam proposed a paper "Impact of Patients' Gender on Parkinson's disease using Classification Algorithms" [5]. In this paper, the authors chose the UCI PD dataset for finding the accuracy of Parkinson's prediction using SVM and Bayesian Network algorithms, selecting the ten most important features in the dataset to predict PD. With sex as the output variable and the other factors as input, the authors provide an approach for finding relationships between genders. The result obtained is that the SVM algorithm gives better performance than the Bayesian Network, with 90.98% accuracy.

Dragana Miljkovic, et al., proposed a paper "Machine Learning and Data Mining Methods for Managing Parkinson's Disease" [4]. The authors concluded that, based on the medical tests taken by the patients, the Predictor component was able to predict 15 different Parkinson's symptoms separately. The machine learning and data mining techniques applied to the different symptoms give accuracies between 57.1% and 77.4%, with tremor detection having the highest accuracy.

Md. Redone Hassan, et al., proposed a paper "A Knowledge Base Data Mining based on Parkinson's disease" [7]. In this paper, different classification algorithms, such as SVM, KNN, and Decision Tree, are used to predict Parkinson's disease. These algorithms are applied to the training dataset and provide different accuracies. The paper summarizes that the Decision Tree algorithm provides 78.2% precision compared to the remaining algorithms.

Satish Srinivasan, Michael Martin & Abhishek Tripathi proposed a paper "ANN based Data Mining Analysis of Parkinson's Disease" [11]. This study examined how different pre-processing steps applied to the dataset lead to different accuracies. The authors classified the Parkinson's disease dataset using an ANN-based MLP classifier and obtained the highest prediction accuracy when the dataset was pre-processed with discretization and resampling techniques; with the train and test data split in the ratio of 80:20 they achieved 100% classification accuracy and F1-score.

Ramzi M. Sadek, et al., proposed a paper "Parkinson's Disease Prediction using Artificial Neural Network" [10]. In this paper, the authors chose 195 samples in the dataset, further divided into 170 training and 25 validation samples. The dataset was imported into the Just Neural Network (JNN) environment, where the authors trained and validated an Artificial Neural Network model with the most important attributes contributing to the model. The ANN model provided 100% accuracy.

III. PROPOSED SYSTEM

This architecture diagram describes the high-level overview of major system components and important working
relationships. It represents the flow of execution.

3.1 Speech Dataset

3.2 Pre-processing data

3.3 Training data

3.4 Apply Machine Learning Algorithms

3.4.1 KNN

3.4.2 NAÏVE BAYES

3.4.3 Logistic Regression


3.5 Test data

Fig. 1. System Architecture: Speech Dataset → Pre-processing Data → Training Data → Apply Machine Learning Algorithms (KNN, Naïve Bayes, Logistic Regression) → Test Data → Output


3.1 Speech Dataset:
The goal of this step is to identify a dataset that gives efficient results. In this step, the authors need to collect the different datasets that are available in data repositories; the data might be raw data stored in the form of files and databases. Important characteristics of the data are quality and quantity: the more data points in the dataset, the more accurate the result. The authors collected the speech data, which contains the voice features, downloaded from the Kaggle website.


The collected dataset is put in one place and prepared for use in the training phase. The dataset contains many voice features, namely PPE, RPDE, DFA, numPulses, numPeriodPulses, meanPeriodPulses, stdDevPeriodPulses, locPctJitter, etc., which are useful for predicting Parkinson's disease.
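As a small illustration of this step, the sketch below loads and inspects such a speech dataset with pandas. The file name pd_speech_features.csv and the label column name "class" are assumptions about the Kaggle download, not something fixed by the paper.

```python
# Hedged sketch: load and inspect the Kaggle speech dataset.
# The file name and the "class" label column are assumptions about the download.
import pandas as pd

df = pd.read_csv("pd_speech_features.csv")   # assumed file name

print(df.shape)                               # recordings x voice features
print(df.columns[:10].tolist())               # e.g. gender, PPE, DFA, RPDE, ...
print(df["class"].value_counts())             # assumed label: 1 = Parkinson's, 0 = healthy
```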
3.2 Pre-Processing data:
The main aim of this step is to understand the characteristics of the data the authors have to work with, so that the features needed to predict the output are in an understandable form.

The collected data may generally contain noise, missing values, and duplicate values that cannot be used directly by machine learning algorithms. Data pre-processing is the method of cleaning the data, removing duplicates, and filling the null values with the average of the corresponding attribute, thus making the final dataset suitable for a machine learning model and improving the accuracy and efficiency of the machine learning algorithms. The authors split the data in the ratio 70:30.
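Continuing from the loading sketch above, a minimal version of this pre-processing could look as follows. The feature names follow the listing in Section 3.1 (exact header spellings in the downloaded file may differ), and the stratified option is an assumption, since the paper only states a 70:30 split.

```python
# Hedged sketch of the pre-processing described above: remove duplicates, fill missing
# values with the attribute average, and make a 70:30 train/test split.
from sklearn.model_selection import train_test_split

df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))        # null values -> column average

features = ["gender", "PPE", "DFA", "RPDE", "numPulses", "numPeriodPulses",
            "meanPeriodPulses", "stdDevPeriodPulses", "locPctJitter", "locAbsJitter"]
X = df[features]                                  # names as listed in the paper (assumed headers)
y = df["class"]                                   # assumed label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)   # stratify is an assumption
```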

3.3 Training data:

In machine learning data pre-processing, the dataset is divided into a training dataset and a test dataset. This is one of the crucial steps of data pre-processing, as it gives a truer picture of the performance of the machine learning model.

If the model were trained on one dataset and then tested on a completely different dataset, it would struggle to capture the correlations between the features and the output, and the measured performance would decrease. Therefore the training and test sets are taken as splits of the same dataset (here in the ratio 70:30). Training the machine learning model is required so that it can learn the various features and patterns in the data.

3.4 Apply Machine Learning Algorithms:

3.4.1 KNN:

The k-nearest neighbour algorithm is a supervised machine learning algorithm used here to analyse the accuracy of the system. It solves both classification and regression problems. After the training data is given to the model, it stores all of it and uses it to classify new data points: new data given as input is put into the category most similar to the stored cases. This means that when test data is given to the model, it can easily place each data point into a suitable category. As it is a non-parametric algorithm, it makes no assumption about the underlying distribution of the dataset. KNN is also called a lazy learner algorithm, as it only stores the data at the training phase and does not learn immediately; it performs the work at the time of classification. KNN is used in both statistical estimation and pattern recognition. A data point is classified by a majority vote of its neighbours, being assigned to the class most common among its k nearest neighbours in the training data. The value of k is a small positive integer; if k is 1, the algorithm simply chooses the single nearest neighbour. In the KNN algorithm the quality of the result depends on the value of k, which is user-defined. In this paper the value is found automatically by an optimisation step that returns the value of k for which the error rate is lowest.

Algorithm steps:

3.4.1.1) Load the dataset.

3.4.1.2) Initialize the value of k to a positive integer.

3.4.1.3) To find the class label, iterate over all of the training data points.
3.4.1.3.1) Calculate the Euclidean distance between the test data point and each data point in the training data. Here the authors use Euclidean distance as the distance metric since it is the most popular method. The other metrics that can be used are
Chebyshev, cosine, etc.

3.4.1.3.2) After that sort the resultant distances in ascending order.

3.4.1.3.3) Return the first k rows from the sorted list.

3.4.1.3.4) Find the most frequent class among these rows.

3.4.1.3.5) Return the predicted class.
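A compact way to realise these steps is scikit-learn's KNeighborsClassifier with the Euclidean metric, choosing k automatically as the value with the lowest cross-validated error on the training split; the 5-fold cross-validation used here is an assumption, since the paper does not spell out its optimal-k procedure. The sketch reuses X_train, X_test, y_train, y_test from the pre-processing sketch above.

```python
# Hedged sketch: KNN with Euclidean distance; k is chosen as the value with the
# smallest cross-validated error rate (the exact selection procedure is assumed).
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

errors = {}
for k in range(1, 21):
    knn_k = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    errors[k] = 1 - cross_val_score(knn_k, X_train, y_train, cv=5).mean()

best_k = min(errors, key=errors.get)              # k with the lowest error rate
knn = KNeighborsClassifier(n_neighbors=best_k, metric="euclidean").fit(X_train, y_train)
print("chosen k:", best_k, "| test accuracy:", round(knn.score(X_test, y_test), 2))
```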

3.4.2 Naïve Bayes:


Naïve Bayes is a classification machine learning algorithm based on Bayes' theorem. It is mainly used in fields such as text mining that involve large training datasets. The Naïve Bayes classifier is one of the simpler and more efficient classification algorithms, which helps in finding accurate results and produces predictions quickly. Since the Naive Bayes algorithm is based mainly on probability, it is also known as a probabilistic classifier. Some popular examples of the Naïve Bayes algorithm are classification problems such as spam mail detection, sentiment analysis, etc.

The Naive Bayes classifier assumes that every attribute in the dataset is independent of all the other attributes given the class. In the present case, whether a patient has Parkinson's or not is predicted from the speech features of the patient, with each feature contributing independently to the decision.

P(h|D) = (P(D|h) * P(h)) / P(D)        (1)

P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.

P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or the prior probability of the data.

P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.

P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
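Since the voice measurements are continuous, a Gaussian Naive Bayes model is a natural concrete choice; the paper does not state which variant was used, so this is an assumption. The sketch below reuses the train/test split from the earlier sketches.

```python
# Hedged sketch: Gaussian Naive Bayes on the continuous voice features (the variant
# is an assumption). For each test recording the model applies equation (1) per class
# and predicts the class with the largest posterior probability.
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes test accuracy:", round(nb.score(X_test, y_test), 2))
print("posterior P(class | D) for one recording:", nb.predict_proba(X_test[:1]))
```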
3.4.3 Logistic Regression:
Logistic regression is also a supervised learning algorithm; despite its name, it is used for classification problems, where the target variable may be either 0 or 1. The logistic regression algorithm works on a function called the sigmoid, which is an S-shaped curve, so the class variable results in 0 or 1, yes or no, true or false, correct or wrong, etc. It is a classification algorithm built on a mathematical function known as the logistic or sigmoid function, which returns a value between 0 and 1. If the value is less than 0.5 it is considered as 0, and if it is greater than 0.5 it is considered as 1. Thus, to build a model using logistic regression, the sigmoid function is required.

There are three main types of logistic regression:

3.4.3.1) Binomial: The resultant variable can have only 2 classes, either "0" or "1", which represent "true" or "false", "yes" or "no", "correct" or "wrong", etc.

3.4.3.2) Multinomial: Here, the resultant variable can have three or more possible categories which are not ordered, i.e. there is no quantitative relationship among them, such as "class A", "class B", or "class C".

3.4.3.3) Ordinal: In this case, the resultant variable deals with ordered categories.

For example, the rating of teaching faculty by students can be given as: "very bad", "bad", "average", "good", "very good" and "excellent". Here, each category can be given a score like 0, 1, 2, 3, 4 and 5.

f(x) = 1 / (1 + e^(-x))        (2)

f(x) = output value between 0 and 1

x = input to the function

e = base of the natural logarithm

The output of the logistic function must lie in the range 0 to 1 and cannot go beyond this limit, so the only possible curve formed is S-shaped, called the sigmoid or logistic function. In logistic regression the threshold value, which lies between 0 and 1, plays an essential role: a resultant value greater than the threshold is mapped to class 1, and a value lower than the threshold is mapped to class 0.
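The sigmoid of equation (2) and the 0.5 threshold can be made concrete with a few lines of Python; the scikit-learn LogisticRegression call below reuses the same train/test split and is an illustrative stand-in for whatever implementation the authors used.

```python
# Hedged sketch: the sigmoid/threshold behaviour described above, then a logistic
# regression classifier fitted on the same split (illustrative, not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # maps any real value into (0, 1); outputs above 0.5 are treated as class 1
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-3.0, 0.0, 3.0])))        # roughly [0.05, 0.50, 0.95]

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = logreg.predict_proba(X_test)[:, 1]        # sigmoid outputs for the PD class
preds = (probs >= 0.5).astype(int)                # apply the 0.5 threshold
print("Logistic Regression test accuracy:", round(float((preds == y_test).mean()), 2))
```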
3.5 Test data:
Once the Parkinson's disease prediction model has been trained with the speech dataset, the authors test the model using the remaining data. As the data is divided in the ratio 70:30, 70 percent is used for training and the other 30 percent is tested in this step, and the correctness and accuracy of the model are checked by providing the test dataset to it. The test data is applied to all three algorithms to check which model gives more accuracy.
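Assuming the three classifiers fitted in the earlier sketches (knn, nb, logreg), this testing step reduces to scoring each of them on the held-out 30%:

```python
# Hedged sketch: score each fitted model on the 30% test split and compare.
for name, model in [("KNN", knn), ("Naive Bayes", nb), ("Logistic Regression", logreg)]:
    print(name, "test accuracy:", round(model.score(X_test, y_test), 2))
# The paper reports KNN 80%, Naive Bayes 81% and Logistic Regression 79% on its split.
```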

IV. PERFORMANCE ANALYSIS AND RESULTS

This is the final step for prediction. In this paper, various machine learning methods, namely KNN, Logistic Regression, and Naïve Bayes, are used to predict Parkinson's disease. The authors chose only ten important speech features from the dataset: gender, PPE, DFA, RPDE, numPulses, numPeriodPulses, meanPeriodPulses, stdDevPeriodPulses, locPctJitter, and locAbsJitter. It is important to measure the performance of the mentioned algorithms. To evaluate the experiment, the authors used different evaluation metrics: accuracy, the confusion matrix, precision, recall, and F1-score.

Classification Accuracy - It is the ratio of the number of correct predictions to the total number of inputs in the dataset.

It is expressed as: Accuracy = (TP + TN) / (TP + FP + FN + TN)        (3)

Confusion Matrix - It gives a matrix as output that summarises the overall performance of the system.

Table 1. Confusion Matrix

                    Predicted Positive   Predicted Negative
Actual Positive             TP                   FN
Actual Negative             FP                   TN


Where TP: True positive

FP: False Positive

FN: False Negative

TN: True Negative

Precision - It is the ratio of correct positive results to the total number of positive results predicted by the system.

It is expressed as: Precision (P) = TP / (TP + FP)        (4)

Recall - It is the ratio of correct positive results to the number of all relevant samples.

It is expressed as: Recall (R) = TP / (TP + FN)        (5)

F1 score - It is the harmonic mean of Precision and Recall. It measures the test accuracy, and its range is 0 to 1.

It is expressed as: F1 score = 2 / ((1/Precision) + (1/Recall)) = 2PR / (P + R)        (6)
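Equations (3) to (6) correspond directly to scikit-learn's metric functions; the sketch below computes them for one model (Naive Bayes here, as an example) from its test-set predictions.

```python
# Hedged sketch: confusion matrix and the metrics of equations (3)-(6) for one model.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_pred = nb.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   # rows = actual, cols = predicted
print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("Accuracy :", accuracy_score(y_test, y_pred))          # (TP+TN)/(TP+FP+FN+TN)
print("Precision:", precision_score(y_test, y_pred))         # TP/(TP+FP)
print("Recall   :", recall_score(y_test, y_pred))            # TP/(TP+FN)
print("F1 score :", f1_score(y_test, y_pred))                # 2PR/(P+R)
```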

After applying the various machine learning algorithms to the speech dataset, the results obtained for each individual algorithm are:

Evaluation for KNN:

Table 2. Confusion Matrix for KNN

                          Predicted Parkinson's   Predicted Non-Parkinson's
Actual Parkinson's                 161                       18
Actual Non-Parkinson's              27                       21

TP = 161, FN = 18, FP = 27, TN = 21

The values of accuracy, precision, recall, and f1-score:

Accuracy = (161+21)/ (161+27+18+21) = 0.80

Precision = 161/ (161+27) = 0.86

Recall = 161/ (161+18) = 0.90

F1-score = (2*0.86*0.9)/ (0.86+0.9) = 0.88

Evaluation for Naïve Bayes:

Table 3. Confusion Matrix for Naïve Bayes

                          Predicted Parkinson's   Predicted Non-Parkinson's
Actual Parkinson's                 159                       20
Actual Non-Parkinson's              23                       25

TP = 159, FN = 20, FP = 23, TN = 25


The values of accuracy, precision, recall, and f1-score:

Accuracy = (159+25)/ (159+23+20+25) = 0.81

Precision = 159/ (159+23) = 0.87

Recall = 159/ (159+20) = 0.89

F1-score = (2*0.87*0.89)/ (0.87+0.89) = 0.88

Evaluation for Logistic Regression:

Table 4. Confusion Matrix for Logistic Regression

                          Predicted Parkinson's   Predicted Non-Parkinson's
Actual Parkinson's                 173                        6
Actual Non-Parkinson's              41                        7

TP = 173, FN = 6, FP = 41, TN = 7

The values of accuracy, precision, recall, and f1-score:

Accuracy = (173+7)/ (173+41+6+7) = 0.79

Precision = 173/ (173+41) = 0.81

Recall = 173/ (173+6) = 0.97

F1-score = (2*0.81*0.97)/ (0.81+0.97) = 0.88

Table 5. Accuracy Table

Algorithm             | Accuracy
KNN                   | 80%
Naïve Bayes           | 81%
Logistic Regression   | 79%


Comparative Analysis chart: Accuracy, Precision and Recall for each classifier - KNN (0.80, 0.86, 0.90), Naïve Bayes (0.81, 0.87, 0.89) and Logistic Regression (0.79, 0.81, 0.97).

The various machine learning algorithms were applied to the speech dataset and compared using the evaluation metrics accuracy, confusion matrix, precision, recall and F1-score. Naïve Bayes gives the highest accuracy of 81%.
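The comparative chart summarised above can be re-drawn from the reported values with matplotlib; the numbers below are taken from the tables above, and the styling is an assumption.

```python
# Hedged sketch: re-draw the comparative bar chart from the values reported above.
import matplotlib.pyplot as plt
import numpy as np

models = ["KNN", "Naive Bayes", "Logistic Regression"]
accuracy  = [0.80, 0.81, 0.79]
precision = [0.86, 0.87, 0.81]
recall    = [0.90, 0.89, 0.97]

x = np.arange(len(models))
width = 0.25
plt.bar(x - width, accuracy, width, label="Accuracy")
plt.bar(x, precision, width, label="Precision")
plt.bar(x + width, recall, width, label="Recall")
plt.xticks(x, models)
plt.ylim(0, 1.2)
plt.title("Comparative Analysis")
plt.legend()
plt.show()
```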

V. CONCLUSION

Parkinson's is the second most common neurodegenerative disease and it has no cure. It results in difficulty with body movements, anxiety, breathing problems, loss of smell, depression, and changes in speech. In this paper, three different machine learning algorithms, KNN, Naïve Bayes, and Logistic Regression, are applied to the dataset and their performance is measured. The authors chose the voice features of patients; since the dataset contains more than 700 features, the ten important features that are useful for evaluating the system were finally selected. The authors compared the accuracies of all three machine learning methods and, based on this, one prediction model is generated. The aim is to use various evaluation metrics such as the confusion matrix, accuracy, precision, recall, and F1-score to predict the disease efficiently. Comparing all three, the Naïve Bayes classifier gives the highest accuracy of 81%.

REFERENCES

[1] Arvind Kumar Tiwari, “Machine Learning based Approaches for Prediction of Parkinson’s Disease”, Machine Learning
and Applications: An International Journal (MLAU) vol. 3, June 2016.

[2] Carlo Ricciardi, et al, “Using gait analysis’ parameters to classify Parkinsonism: A data mining approach” Computer
Methods and Programs in Biomedicine vol. 180, Oct. 2019.

[3] Dr. Anupam Bhatia and Raunak Sulekh, “Predictive Model for Parkinson’s Disease through Naive Bayes Classification”
International Journal of Computer Science & Communication vol. 9, Dec. 2017, pp. 194- 202, Sept 2017 - March 2018.

[4] Dragana Miljkovic et al, “Machine Learning and Data Mining Methods for Managing Parkinson’s Disease” LNAI 9605,
pp 209-220, 2016.

[5] M. Abdar and M. Zomorodi-Moghadam, “Impact of Patients’ Gender on Parkinson’s disease using Classification
Algorithms” Journal of AI and Data Mining, vol. 6, 2018.


[6] M. A. E. Van Stiphout, J. Marinus, J. J. Van Hilten, F. Lobbezoo, and C. De Baat, “Oral health of Parkinson’s disease
patients: a case-control study,” Parkinson’s Disease, vol. 2018, Article ID 9315285, 8 pages, 2018.

[7] Md. Redone Hassan et al, “A Knowledge Base Data Mining based on Parkinson’s Disease” International Conference
on System Modelling & Advancement in Research Trends, 2019.

[8] Mehrbakhsh Nilashi et al, “A hybrid intelligent system for the prediction of Parkinson’s Disease progression using
Machine Learning techniques” Biocybernetics and Biomedical Engineering 2017.

[9] R. P. Duncan, A. L. Leddy, J. T. Cavanaugh et al., “Detecting and predicting balance decline in Parkinson disease: a
prospective cohort study,” Journal of Parkinson’s Disease, vol. 5, no. 1, pp. 131–139, 2015.

[10] Ramzi M. Sadek et al., “Parkinson’s Disease Prediction using Artificial Neural Network” International Journal of
Academic Health and Medical Research, vol. 3, Issue 1, January 2019.

[11] Satish Srinivasan, Michael Martin & Abhishek Tripathi, “ANN based Data Mining Analysis of Parkinson’s Disease”
International Journal of Computer Applications, vol. 168, June 2017.
