Detection of Parkinson's Disease Using Machine Learning
Associate Professor
Dept of CSE
Agenda
• Introduction
• Literature Survey
• Existing System
• Problem Statement
• System Design
• Applications
• References
Introduction
Parkinson’s disease is a neurodegenerative disorder of the central
nervous system that causes partial or full loss of motor reflexes,
speech, behavior, mental processing, and other vital functions.
Literature Survey
1. Collection and Analysis of a Parkinson Speech Dataset with Multiple Types of
Sound Recordings. Betul Erdogdu Sakar, M. Erdem Isenkul, C. Okan Sakar,
Ahmet Sertbas, Fikret Gurgen, Sakir Delil, Hulya Apaydin and Olcay Kursun (2013)
• Introduced a dataset of voice samples from 40 individuals, of whom 20 were
healthy and 20 were suffering from Parkinson's disease.
• Compared the leave-one-subject-out (LOSO) and summarised leave-one-out
(s-LOSO) validation schemes on the KNN and SVM classification algorithms.
Results showed that s-LOSO performed much better than the LOO method.
• Mean and standard deviation were found to be better metrics for summarising
the features obtained from the voice samples than considering each and every
sample of a subject.
• Main drawback: the model gave low accuracy with KNN under the LOSO and
s-LOSO schemes, since vowels carried more weight in PD classification than
sentences or words.
2. Analysis of multiple types of voice recordings in cepstral domain using MFCC
for discriminating between patients with Parkinson’s disease and healthy people.
Achraf Benba, Abdelilah Jilbab, Ahmed Hammouch (2016)
• The MFCC technique was applied to obtain a voiceprint from the recorded
speech samples, which was compressed by taking the average value. This
avoided the low accuracy and long processing time otherwise incurred by the
classification algorithms.
• SVM with different kernels (RBF, linear and polynomial) was applied to the
Istanbul PD dataset, for the vowels a, o and u and their combination. The
results indicated that SVM with an MLP kernel provided the highest accuracy
for the vowels a, o and u, with 80.0%, 77.5% and 82.5% respectively.
• SVM with a linear kernel obtained the highest accuracy of 77.5% with the
vowels a, o and u combined.
• Main drawback: the SVM models obtained low accuracy with the different
kernels and could not be used for clinical or medical diagnosis of PD, as
feature selection was not performed.
3. Detection of Parkinson’s Disease from Vocal Features Using Random Subspace
Classifier Ensemble. Ömer ESKİDERE, Ali KARATUTLU, Cevat ÜNAL (2015)
• A random subspace ensemble method using KNN was proposed to improve the
performance of individual classifiers. With k-fold cross validation, KNN,
LDA and QDA were used as the base classifiers.
• These were applied to normalized features of the Istanbul PD dataset and
tested with varying numbers of features and of KNN, LDA and QDA learners.
The ensemble of KNN learners outperformed LDA and QDA; k = 10 with 114 KNN
classifiers on 7-dimensional subspaces demonstrated the lowest
classification error.
• It was also shown that the variation and type of base classifier changed
the performance of the random subspace classifier ensemble method.
• Main drawback: the random selection of feature subspaces, because some of
the randomly selected subsets might have poor discrimination capability.
4. An LSTM based Deep Learning model for voice-based detection of Parkinson’s
disease. Danish Raza Rizvi, Iqra Nissar, Sarfaraz Masood, Mumtaz Ahmed,
Faiyaz Ahmad (2020)
• DNN and LSTM models were used to detect and predict Parkinson's disease on
the UCI Parkinson voice sample dataset.
• Using sizes of 256, 128 and 64 for the first, second and third hidden
layers, the DNN produced an accuracy of 97%. A dropout of 0.5 and the ADAM
optimizer were used to prevent overfitting, together with categorical
cross-entropy loss.
• Dropout also ensured the model does not rely on a single node. The ReLU and
softmax activation functions were used in the hidden and output layers
respectively.
• For the LSTM, a batch size of 16 and a hidden layer size of 32 trained for
80 epochs produced an accuracy of 99.03%.
• These proposed models outperformed previously applied and studied methods.
5. Simultaneous Learning of Speech Feature and Segment for Classification of
Parkinson Disease. Yongming Li, Yunjian Jia, Xiaoheng Zhang, Cheng Zhang,
Ping Wang, Tingjie Xie (2015)
• A feature selection algorithm that obtains new hybrid features for
classification without any feature transformation. The Istanbul dataset was
split into training and test datasets after the hybrid features were
constructed by combining features and segments.
• The hybrid voice features were chosen after normalization, using LOSO to
construct a new training set.
• The SVM classifier was applied to the datasets for classification. The
results indicated that the linear kernel performed better than the RBF
kernel, and that p_value and SDC were better evaluation criteria than
corrcoef.
• SVM performed better on the hybrid selected features than on the original
dataset features, with a mean accuracy of 82.5%, sensitivity of 85% and
specificity of 80%.
• Main drawback: very few samples were considered for feature selection; new
samples could be acquired for further verification and modification.
6. Diagnosis of the Parkinson Disease by using Deep Neural Network Classifier.
Abdullah CALISKAN, Hasan BADEM, Alper BAŞTÜRK, Mehmet Emin YÜKSEL (2017)
• A DNN with a stacked autoencoder and a softmax classifier cascaded together
was used for prediction on the Istanbul PD and OPD datasets, with 10-fold
cross validation repeated 30 times.
• The DNN uses autoencoders to reduce the dimension of the features and
softmax layers for the classification process, in contrast with
conventional classifier techniques.
• Several simulations were performed on the two databases to demonstrate the
effectiveness of the deep neural network classifier.
• The proposed DNN model performed better than the SVM, DT and NB
classification algorithms, with accuracies of 65.549% and 86.095% on the PD
and OPD datasets respectively. The DNN classifier's ability to extract
hidden features increased its performance.
7. Classifying Parkinson Disease based on acoustic measures using artificial
neural networks. Lucijano Berus, Simon Klancnik, Miran Brezocnik and Mirko
Ficko (2018)
• Multiple ANNs were applied to the PD dataset using the LOSO scheme, which
has far less bias and provides practically unbiased predictions.
• The method involved feature selection followed by selecting the best result
from multiple ANN classifiers using a majority voting technique. Feature
selection was done using Pearson's and Kendall's correlation coefficients,
PCA and self-organizing maps.
• According to Pearson's and Kendall's correlation coefficients, the number 4
and short sentence 4 from the voice samples carried the most value in
classifying PD. The fine-tuned NN achieved a test accuracy of 86.47%.
• Main drawback: the ANNs' performance could be improved by using other
feature selection procedures and by additional fine-tuning. Vocal tests in
other languages were not included in the study; performing the
classification on those datasets would help increase the reach of the
models in predicting PD.
8. A Multiple-Classifier Framework for Parkinson’s Disease Detection Based on
Various Vocal Tests. Mahnaz Behroozi and Ashkan Sami (2016)
• Instead of considering every voice sample together for Istanbul PD
prediction, the authors separated each type of vocal sample from the PD
dataset and applied the classification algorithms to each.
• Features were selected based on the Pearson correlation coefficient. Where
vocal tests had no relevant features, MCFS and A-MCFS were applied to
select them based on prevalent features and to remove unsuccessful voice
samples.
• The LOO CV technique was used with the KNN, SVM, Naïve Bayes and
discriminant analysis classifiers; all obtained their highest accuracy with
the A-MCFS method compared to the LOSO, s-LOSO and MCFS methods. The final
classification result was a majority vote over all of the classifiers.
• Main drawback: the discriminating ability of the vocal terms is not
uniform; even some vocal terms considered discriminating in the literature,
such as the vowel "a", failed to be successful. Further studies on
different vocal terms from the proposed perspective, and on vocal tests
from various languages, are needed.
9. Can a Smartphone Diagnose Parkinson Disease? A Deep Neural Network Method
and Telediagnosis System Implementation. Mahnaz Behroozi and Ashkan Sami (2017)
• A stacked autoencoder (SAE) was used for dimension reduction, and various
classification algorithms (KNN, LDA, NB, LSBM, RSVM, CART, KELM, MSVM)
were applied to predict Parkinson disease on the PD voice telemonitoring
datasets.
• The proposed SAE had a batch size of 20, with two hidden-layer
configurations of 10, 9, 8 and 8, 7, 6 neurons respectively. SAE with KNN
gave the most accurate classification result on the Istanbul PD dataset.
The dimensional space was low for the proposed method to remap
time-frequency features.
10. A Machine Learning System for the Diagnosis of Parkinson’s Disease from
Speech Signals and Its Application to Multiple Speech Signal Types. Ismail
Cantürk and Fethullah Karabiber (2016)
• Four feature selection algorithms (LASSO, Relief, LLBFS and RMR) were
applied to the features of the PD voice dataset. Six classification
methods (AdaBoost, LibSVM, MLP, Ensemble, K-NN and NB) were applied to the
features obtained through the feature selection methods.
• The validation methods were 10-fold CV and LOSO. With 10-fold CV, KNN with
k=3, MLP with 20 neurons, MLP with 10 neurons and KNN with k=7 performed
best for the LASSO, Relief, LLBFS and RMR FS algorithms respectively.
• With LOSO (n=12), LibSVM with a linear kernel, AdaBoost with 10
iterations, LibSVM with a linear kernel and MLP with 10 neurons performed
best for the LASSO, Relief, LLBFS and RMR FS algorithms respectively. The
results also indicated that the feature selection algorithms improved
classifier performance by a significant amount.
• Main drawback: daily speech and speech variation are ignored. Daily speech
is more important, because if patients are diagnosed with certain
phonations, their stress level during the PD diagnosis process might rise,
and the voice is easily affected by stress or excitement.
11. Automated Detection of Parkinson’s Disease Based on Multiple Types of
Sustained Phonations Using Linear Discriminant Analysis and Genetically
Optimized Neural Network. Liaqat Ali, Ce Zhu, Zhonghao Zhang, Yipeng Liu (2019)
• Previous PD models lacked generalization, gave low-accuracy predictions
and suffered from issues such as subject overlap. A hybrid intelligent
system was proposed that uses LDA to reduce dimensionality and a genetic
algorithm to optimize the hyperparameters of a neural network for
predicting PD.
• The proposed model had low complexity and achieved an accuracy of 82.14%
on the test dataset after balancing the gender-imbalanced dataset.
• This provided a generalized model and highlighted the gender imbalance in
the Istanbul PD dataset and how the associated features could be
eliminated to provide high accuracy.
• Main drawback: the method was not exploited for prodromal and differential
diagnosis, which are considered challenging tasks. The independent test
dataset was collected only from PD patients and was highly imbalanced, and
information about the feature extraction process, such as whether the
extracted features were corrected for pitch, was missing.
12. Classification of Parkinson’s disease utilizing multi-edit
nearest-neighbor and ensemble learning algorithms with speech samples.
He-Hua Zhang, Liuyang Yang, Yuchuan Liu, Pin Wang, Jun Yin, Yongming Li,
Mingguo Qiu, Xueru Zhu and Fang Yan (2016)
• A multi-edit nearest-neighbour (MENN) algorithm combined with an ensemble
learning algorithm was proposed. It was observed that existing
classification methods did not consider optimizing the samples via a
selection method, which caused noisy outliers to be included in training.
• The proposed MENN method performed sample selection and reduced the effect
of these outliers. MENN removed the overlapping regions between samples of
different classes, suppressing the misleading data.
• MENN was combined with the RF and DNNE methods to improve the accuracy of
prediction on the Istanbul PD dataset.
• Main drawback: compressed speech feature data was not examined. This would
require further research to verify and possibly modify the PD_MEdit_EL
algorithm.
13. Parkinsons Disease Classification using Neural Network and Feature
selection. Anchana Khemphila and Veera Boonjing (2012)
• The authors noticed that previous classifiers had low accuracy because the
dataset contained a large number of attributes. A multi-layer perceptron
with back-propagation learning was proposed to classify the UCI Oxford PD
dataset effectively.
• The features were selected by importance using information gain: the 22
features were reduced to 16, and an ANN was trained on them.
• This method obtained accuracies of 82.05% and 83.33% on the training and
validation datasets. The results indicated that reducing the number of
attributes increased the accuracy of PD classification.
• Main drawback: information gain is biased toward variables with a large
number of distinct values, not variables whose observations have large
values. For an attribute with many distinct values, information gain fails
to discriminate accurately among the attributes.
14. An efficient diagnosis system for detection of Parkinson’s disease using
fuzzy k-nearest neighbor approach. Hui-Ling Chen, Chang-Cheng Huang,
Xin-Gang Yu, Xin Xu, Xin Sun, Gang Wang and Su-Jing Wang (2013)
• A fuzzy KNN (FKNN) method was proposed and compared against SVM and ANN on
the PD dataset. The dataset was normalized by scaling to the range 0 to 1.
After the features of the UCI PD dataset were reduced using PCA, the
optimized FKNN classifier was applied.
• A step size of 0.01 was used for the fuzzy strength parameter m, and
multiple analyses were carried out for different numbers of neighbours k.
The dataset was divided using the 10-fold CV method to obtain an unbiased
estimate of generalised accuracy.
• This increased reliability, as the test dataset was independent. FKNN with
k = 7 and m = 1.02 performed much better than SVM with linear and RBF
kernels, with an accuracy of 95.79%.
• Main drawback: FKNN is computationally expensive. Since the model must be
run over the whole dataset, it is time consuming and requires a large
amount of memory to store all the training data.
15. Accurate telemonitoring of Parkinson’s disease progression by non-invasive
speech tests. Athanasios Tsanas, Max A. Little, Patrick E. McSharry,
Lorraine O. Ramig (2009)
• Three linear and one nonlinear regression methods were studied to link the
voice attribute measures with total and motor UPDRS scores on the AHTD PD
dataset. CART outperformed the linear methods, with small deviation from
the interpolated scores.
• CART showed low prediction error and performed well in tracking the
linearly interpolated UPDRS. Using 1000 runs of 10-fold CV, the motor and
total UPDRS scores could be predicted accurately with low prediction
error.
• The linear predictors used performed better than the conventional LS and
LASSO methods. Using LASSO, the non-classical phonetic attributes also
contributed to correct prediction, backed by the AIC and BIC techniques.
• Main drawback: the study was confined to using dysphonia measures to
predict the average clinical overview of the PD metric UPDRS. Although the
dysphonia measures have physiological interpretations, it is difficult to
link self-perception and physiology.
16. A novel diagnosis system for Parkinson’s disease using complex-valued
artificial neural network with k-means clustering feature weighting method.
Huseyin Guruler (2017)
• Noting the need for a hybrid system for PD diagnosis, a combination of a
k-means-clustering-based feature weighting method and a complex-valued
artificial neural network was introduced.
• The feature weighting method helps achieve high classification accuracy.
It gathers similar data points together and helps convert a nonlinearly
separable dataset into a linearly separable one.
• The new weighted features were converted into complex-number format and
fed to the neural network as its input. The proposed method achieved high
accuracies of 99.52% with the 10-fold CV method and 99.39% with a 50-50
training/testing split.
• The method provided fast classification of PD with a low computational
load.
17. A comparative analysis of speech signal processing algorithms for
Parkinson disease classification and the use of the tunable Q-factor wavelet
transform. C. Okan Sakar, Gorkem Serbes, Aysegul Gunduz, Hunkar C. Tunc,
Hatice Nizam, Betul Erdogdu Sakar, Melih Tutuncu, Tarkan Aydin,
M. Erdem Isenkul (2018)
• The authors applied the tunable Q-factor wavelet transform (TQWT) to voice
signal samples of PD patients. The required frequency of the band-pass
filters can be determined by changing the Q-factor, the oversampling rate
and the number of analysis levels.
• Higher Q-values in the TQWT analysis made it possible to obtain narrower
frequency responses, which helped obtain better sub-band decomposition.
• The highest accuracy was obtained by using the normal Q-wavelet transform
features with a multilayer perceptron classifier. It was noted that
increasing the Q-value too much, for example to 4 or 5, does not increase
the classification accuracy, because more analysis levels are then needed.
• Main drawback: the TQWT technique, which showed promising results for the
PD classification problem, was not yet used to predict the Unified
Parkinson’s Disease Rating Scale (UPDRS) score of PD patients to build a
robust PD telemonitoring system.
18. Classification of Parkinsons Disease Using Data Mining Techniques.
Sajid Ullah Khan (2015)
• The proposed cluster analysis is an iterative process applied to modify
the data pre-processing and model parameters until the required properties
are achieved.
• The data passes through the pre-processing phases of data cleaning,
recovering missing values, and transformation, after which three
techniques are applied.
• The techniques used are K-NN, Random Forest and AdaBoost, applied to find
the most accurate model for detecting the disease. The accuracies achieved
were 90.25%, 87.17% and 88.71% for K-NN, Random Forest and AdaBoost
respectively. The results showed that K-NN was the best classification
model, with the highest accuracy.
• Main drawback: SVM was not applied to the reduced dataset and compared
with previous work.
19. An Ensemble Method for Diagnosis of Parkinson's Disease Based on Voice
Measurements. Razieh Sheibani, Elham Nikookar, and Seyed Enayatollah Alavi (2019)
• The authors applied an ensemble-based method to identify patients through
class-label prediction using voice frequency characteristics. This method
significantly reduces the probability of mistakes in determining the class
label.
• It has three stages: data pre-processing, internal classification, and
ultimate classification. In the first stage, the dataset is divided into
six subsets according to the recorded voice types. In the next stage,
prediction models are generated by applying internal classifiers.
• The result of each prediction model is then used as input to the next
stage. Finally, the ultimate classifiers determine the final class label
of the sample. The authors used the WEKA software, which includes machine
learning and data mining algorithms.
• The results showed that the k-NN algorithm with k = 1 gave the best
performance, with 90% accuracy.
20. SVM Classification to Distinguish Parkinson Disease Patients.
Ipsita Bhattacharya and M.P.S Bhatia (2010)
• The authors used the data mining tool Weka to pre-process the dataset,
then applied the SVM method to differentiate between healthy people and
people with Parkinson’s disease. The accuracy achieved was 65.217%.
• It was observed that increasing the number of cross-validation folds
increased the true positive rate and decreased the false positive rate.
• This happens because increasing the cross-validation value enlarges the
training set and shrinks the test set, which leads to an increase in
accuracy. It was also observed that accuracy can be increased by changing
the split ratio and repeating the test.
• Main drawback: the same dataset was not tested on different tools, such as
Matlab, to compare their efficiency. With proper partitioning of the
dataset, better accuracy could be achieved.
Drawbacks of Existing System
• There are currently no blood or laboratory tests to diagnose
nongenetic cases of Parkinson's disease.
• Parkinson's disease can't be cured, but medications can help
control the symptoms, often dramatically. Early detection and
treatment therefore become vital.
• There has been little attempt to summarize and synthesize
qualitative studies concerning the experience and perception
of living with Parkinson’s disease.
• There is need for improvement in accuracy of detection of
Parkinson’s disease.
• Outpatient integrated PD care models may improve patient‐
reported health‐related quality of life compared with standard
care.
Problem Statement
To develop a web application to detect if a person has Parkinson's disease or
not on the basis of input voice features using supervised machine learning
models.
Input: The training data belongs to 20 PWP (6 female, 14 male) and 20
healthy individuals (10 female, 10 male) who appealed at the Department of
Neurology in Cerrahpasa Faculty of Medicine, Istanbul University. From all
subjects, multiple types of sound recordings (26 voice samples including
sustained vowels, numbers, words and short sentences) are taken. A group of
26 linear and time frequency based features are extracted from each voice
sample. These voice features are used to train and are given as input for the
machine learning model to predict.
Output: Presence of Parkinson’s disease or not
Proposed System
The system proposes a method for detecting Parkinson’s disease
from voice features using supervised machine learning
classification algorithms: K-Nearest Neighbor, Support Vector
Machines, Random Forest and Decision Trees.
A web application where data about patients can be recorded on a
daily basis, and accessed and reviewed by doctors to monitor the
condition of the patients, is also proposed.
Requirement Engineering
Hardware Requirements:
• Processor (CPU) with 2 gigahertz (GHz) frequency or above
• Minimum of 4 GB of RAM and minimum of 2 GB of available space on
the hard disk.
• Internet connection: broadband with a speed of 2 Mbps or higher.
• NVIDIA GEFORCE GTX graphics card.
Software Requirements:
• Jupyter Notebook (open source web application that you can use to create
and share documents)
• Programming Language – Python
• Python Libraries for machine learning
• Operating System - Windows 7 / Windows 8 / Windows 10
• HTML, CSS, Bootstrap, JavaScript and Flask
Conceptual/Analysis Modelling
Scenarios
A new patient can provide his voice parameters to test whether
he has Parkinson’s or not. In case he is diagnosed with PD, he
can register a new account. Existing patients can log in to
their accounts and provide daily records for the doctor’s
reference. Doctors can log in to their accounts and check on
the patients’ daily records. If they notice any significant
changes, they can interact with the patients and set up an
appointment at a suitable time.
Use Case Diagram
Sequence Diagram
Sequence Diagram
Doctors in PD system
Activity Diagram for Patients
Activity Diagram for Doctors
State Diagram for Patients
State Diagram for Doctors
Class Diagram
Software Requirements Specification
Functional Requirements:
• Classify whether or not a person is suffering from Parkinson's Disease based
on input voice features.
• Users of the web application should be authenticated whenever they log
into the system.
• Only the patient and their concerned doctor have the right to view the
patient's history data.
Non-Functional Requirements:
• The machine learning models developed should predict whether a person is
suffering from Parkinson's Disease with high accuracy.
• Provide security by removing a user's access if he/she fails to
authenticate the account several times.
• An easy-to-use user interface.
Project Scheduling
System Design
• System Architecture
Component Design / Module Decomposition
Machine Learning Components
• Dataset collection: In this module, we collect the data from
UCI dataset archives. This dataset contains the information of
voice parameters and the presence of Parkinson’s disease.
• Data Cleaning: In this module data cleaning is done to
prepare the data for analysis by removing or modifying the
data that may be incorrect, incomplete, duplicated or
improperly formatted.
• Feature Extraction: This is done to reduce the number of
attributes in the dataset hence providing advantages like
speeding up the training and accuracy improvements.
• Model training : In this module we use supervised
classification algorithms like knn, svm, random forest and
xgboost to train the model on the cleaned dataset after
dimensionality reduction.
• Testing the trained model: In this module we test the trained
machine learning model using the test dataset.
• Performance Evaluation: In this module, we evaluate the
performance of trained machine learning model using
performance evaluation criteria such as F1 score, accuracy
and classification error. In case the model performs poorly,
we optimize the machine learning algorithms to improve the
performance.
• Prediction of Parkinson’s disease: In this module we use
trained and optimized machine learning model to predict
whether the patient has Parkinson’s disease or not using the
voice features.
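Assuming scikit-learn as the Python ML library (the requirements name Python ML libraries only generically), the modules above can be sketched end to end on a synthetic stand-in for the UCI voice dataset; all names and values here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# dataset collection: synthetic stand-in for the 26 voice parameters
X, y = make_classification(n_samples=200, n_features=26, random_state=0)

# feature extraction: reduce the number of attributes
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# model training on the reduced, cleaned data
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# testing the trained model and evaluating its performance
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
error = 1 - acc                 # classification error
```

In the real system the same trained model would be reused by the prediction module for new patients' voice features.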
Web Application Components
• Login : This module is used to check credentials and provide
access to the application.
• Registration: This module is used to register new patients in
the web application database. After successful registration, the
patient can log in from then on.
• Patient profile: This module provides information about the
patient and his/her previous prediction test results.
• Disease Prediction: This module is used to predict whether
the patient has Parkinson’s disease or not using the trained
machine learning model based on the patient’s voice feature
inputs.
• Doctor Profile: This module provides information about the
doctor and his/her associated patients.
• Patient History: This module provides past test history of
the patient.
• Appointment: This module helps the patient book multiple
appointments with his/her associated doctor. The doctor can then
accept or reject the appointments. Both patient and doctor can
also view their scheduled appointments.
• Information Page: This module provides information about
Parkinson’s disease such as symptoms, mental health issues,
advancements about cure and research.
Module Description - Doctor Profile
● Provides information about the doctor and the list of his / her
associated patients.
● This information needs to be filled in by the doctor at the time
of account registration.
● The list of patients is updated as and when patients book an
appointment with the doctor.
ALGORITHM DESIGN - KNN
STEPS :
1. Calculate the accuracy and error rate for different values of k,
plot the error rate vs k graph, and find a suitable value of k
for which the error rate is low.
2. For the selected value of k, find the accuracy for all distance
metrics and choose the metric with the best accuracy as the
selected training model.
Implementation of KNN Algorithm
import operator
import numpy as np

class distanceMetrics:
    def euclideanDistance(self, vector1, vector2):
        if len(vector1) != len(vector2):
            raise ValueError("Undefined for sequences of unequal length.")
        distance = 0.0
        for a, b in zip(vector1, vector2):
            distance += (a - b) ** 2
        return distance ** 0.5

    def manhattanDistance(self, vector1, vector2):
        if len(vector1) != len(vector2):
            raise ValueError("Undefined for sequences of unequal length.")
        return np.abs(np.array(vector1) - np.array(vector2)).sum()

    def hammingDistance(self, vector1, vector2):
        if len(vector1) != len(vector2):
            raise ValueError("Undefined for sequences of unequal length.")
        return sum(el1 != el2 for el1, el2 in zip(vector1, vector2))

class kNNClassifier:
    def __init__(self, k=3, distanceMetric='euclidean'):
        self.k = k
        self.distanceMetric = distanceMetric

    def fit(self, xTrain, yTrain):
        assert len(xTrain) == len(yTrain)
        self.trainData = xTrain
        self.trainLabels = yTrain

    def getNeighbors(self, testRow):
        # compute the chosen distance from the test row to every training row
        calcDM = distanceMetrics()
        distances = []
        for i, trainRow in enumerate(self.trainData):
            if self.distanceMetric == 'euclidean':
                d = calcDM.euclideanDistance(testRow, trainRow)
            elif self.distanceMetric == 'manhattan':
                d = calcDM.manhattanDistance(testRow, trainRow)
            elif self.distanceMetric == 'hamming':
                d = calcDM.hammingDistance(testRow, trainRow)
            distances.append([trainRow, d, self.trainLabels[i]])
        distances.sort(key=operator.itemgetter(1))
        return distances[:self.k]       # the k nearest neighbors

    def predict(self, xTest, k, distanceMetric):
        self.k = k
        self.distanceMetric = distanceMetric
        predictions = []
        for testCase in xTest:
            neighbors = self.getNeighbors(testCase)
            output = [row[-1] for row in neighbors]
            # majority vote over the neighbors' labels
            predictions.append(max(set(output), key=output.count))
        return predictions

# error rate vs k, used to pick a suitable k
error_rate = []
for i in range(1, 40):
    knn = kNNClassifier(k=i)
    knn.fit(xTrain, yTrain)
    pred_i = knn.predict(xTest, i, 'euclidean')
    error_rate.append(np.mean(np.array(pred_i) != np.array(yTest)))
ALGORITHM DESIGN - SVM
Steps:
1. Split the dataset into train and test datasets.
2. Normalise the attribute values in the train dataset.
3. Initialise the C, gamma, bias, weight, train and test dataset variables.
4. Create the kernel matrix for the train dataset attribute values.
5. For the selected number of epochs:
   * Calculate the margin value set by the boundary for the train dataset
     attribute values.
   * If the margin is less than 1, consider it a misclassification.
   * For each misclassification:
     - Calculate the number of data points which exceed the margin
       threshold and store it as a loss in a loss array.
6. After the epochs have run, plot the loss graph and print the score achieved
   by comparing the predicted target values with the actual target test values
   for validation.
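The epoch loop described in the steps above can be sketched with a plain sub-gradient update, assuming a linear kernel for brevity; the function and variable names here are illustrative, not the project's actual code.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=100):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    losses = []
    for _ in range(epochs):
        margins = y * (X @ w + b)          # step 5: margin for every point
        misclassified = margins < 1        # margin < 1 counts as a violation
        # hinge loss plus L2 regularisation, stored per epoch
        losses.append(np.maximum(0, 1 - margins).sum() + 0.5 * (w @ w) / C)
        # sub-gradient step over the violating points
        w -= lr * (w / C - (y[misclassified, None] * X[misclassified]).sum(axis=0))
        b -= lr * (-y[misclassified].sum())
    return w, b, losses

# toy, linearly separable data; labels must be -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]
w, b, losses = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)   # step 6: score on the data
```

The stored `losses` list is what step 6 plots to check that training converges.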
ALGORITHM - SVM OPTIMISATION
Steps:
1. Initialize C value and gamma values for SVM in the form of
a list
2. We use GridSearch CV method along with the specified C
and gamma values to optimize the SVM algorithm.
GridSearchCV tries all the combinations of the values passed
in the dictionary and evaluates the model for each
combination using the Cross-Validation method. We will be
using this function to get accuracy/loss for every combination
of hyperparameters and then choose the one with the best
performance to set the parameters for the model.
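GridSearchCV (from scikit-learn) implements the exhaustive search described above, with cross-validation built in. The idea itself can be sketched without any dependencies; the scoring function below is a stand-in for "mean cross-validation accuracy of an SVM with these parameters", not a real evaluation:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    # exhaustively try every combination of hyperparameter values,
    # keeping the combination with the best score
    best_score, best_params = float('-inf'), None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# stand-in scorer: pretends the best model has C=10, gamma=0.1
def mock_cv_accuracy(params):
    return -abs(params['C'] - 10) - abs(params['gamma'] - 0.1)

grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
best, _ = grid_search(grid, mock_cv_accuracy)
print(best)  # -> {'C': 10, 'gamma': 0.1}
```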
SVM - PSEUDO CODE
PURPOSE: Prediction of the presence of Parkinson's disease.
INPUT: Voice features obtained from the cleaned dataset after dimensionality
reduction.
OUTPUT: Presence or absence of Parkinson's disease.
STEPS:
1. Generate hyperplanes that segregate the classes in the best possible way.
   Many hyperplanes might classify the data; we look for the hyperplane that
   represents the largest separation, or margin, between the two classes.
2. Choose the hyperplane so that the distance from it to the support vectors
   on each side is maximised. If such a hyperplane exists, it is known as the
   maximum-margin hyperplane. This divides the target values and hence
   performs the classification.
IMPLEMENTATION OF SVM
import numpy as np

class SVMPrimalProblem:
    def __init__(self, C=1.0, kernel='rbf', sigma=.1, degree=2):
        if kernel == 'poly':
            self.kernel = self._polynomial_kernel
            self.degree = degree
        else:
            self.kernel = self._rbf_kernel
            self.sigma = sigma
        self.C = C
        self.w = None
        self.b = None
        self.X = None
        self.y = None
        self.K = None

    def _polynomial_kernel(self, X1, X2):
        # polynomial kernel matrix: (1 + <x1, x2>) ** degree
        return (1 + X1 @ X2.T) ** self.degree

    def _rbf_kernel(self, X1, X2):
        # RBF kernel matrix: exp(-||x1 - x2||^2 / (2 * sigma^2))
        sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                    + np.sum(X2 ** 2, axis=1) - 2 * X1 @ X2.T)
        return np.exp(-sq_dists / (2 * self.sigma ** 2))

    def get_params(self, deep=False):
        return {'C': self.C}
ALGORITHM DESIGN - RANDOM FOREST
Steps:
1. Select random data points from the training set and build a decision tree
   for each random sample. Choose the number of decision trees to build.
2. For new data points, obtain the prediction of each decision tree, assign
   each point to the category that receives the majority of votes, and report
   the accuracy.
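Two mechanics in these steps — drawing a bootstrap sample for each tree and combining the trees by majority vote — can be sketched with toy values (the seed and labels below are illustrative, not project data):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

# Step 1: each tree trains on a bootstrap sample of the rows, drawn
# with replacement and the same size as the original training set
X = np.arange(10)
bag = rng.choice(X, size=len(X), replace=True)
print(sorted(bag))  # some rows repeat, others are left out

# Step 2: the forest's prediction is a majority vote over the trees
tree_votes = ['pd', 'healthy', 'pd', 'pd', 'healthy']
print(Counter(tree_votes).most_common(1)[0][0])  # -> 'pd'
```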
IMPLEMENTATION OF RANDOM FOREST
import collections
import pandas as pd
import numpy as np

class random_forest_classifier:
    def __init__(self, n_trees=10, max_depth=None, n_features='sqrt', mode='rfnode', seed=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.n_features = n_features
        self.tree_filter_pairs = []
        self.mode = mode
        if seed:
            self._seed = seed
            np.random.seed(seed)

    def find_number_of_columns(self, X):
        # number of candidate columns per tree: fixed int, sqrt, or one third
        if isinstance(self.n_features, int):
            return self.n_features
        if self.n_features == 'sqrt':
            return int(np.sqrt(X.shape[1]) + 0.5)
        if self.n_features == 'div3':
            return int(X.shape[1] / 3 + 0.5)
        raise ValueError("Invalid n_features selection")

    def get_bagged_data(self, X, y):
        # bootstrap sample: draw len(X) row indices with replacement
        index = np.random.choice(np.arange(len(X)), len(X))
        return X[index], y[index]

    def randomize_columns(self, X):
        num_col = self.find_number_of_columns(X)
        filt = np.random.choice(np.arange(0, X.shape[1]), num_col, replace=False)
        filtered_X = self.apply_filter(X, filt)
        return filtered_X, filt

    def apply_filter(self, X, filt):
        filtered_X = X.T[filt]
        return filtered_X.T

    def fit(self, X, y):
        X = self.convert_to_array(X)
        y = self.pandas_to_numpy(y)
        try:
            self.base_filt = [x for x in range(X.shape[1])]
        except IndexError:
            self.base_filt = [0]
        for _ in range(self.n_trees):
            filt = self.base_filt
            bagX, bagy = self.get_bagged_data(X, y)
            if self.mode == 'rftree':
                bagX, filt = self.randomize_columns(bagX)
            # decision_tree_classifier is the base learner defined earlier
            new_tree = decision_tree_classifier(self.max_depth, mode=self.mode, n_features=self.n_features)
            new_tree.fit(bagX, bagy)
            self.tree_filter_pairs.append((new_tree, filt))

    def predict(self, X):
        # majority vote of all trees, each seeing only its own column subset
        X = self.convert_to_array(X)
        all_preds = np.array([tree.predict(self.apply_filter(X, filt))
                              for tree, filt in self.tree_filter_pairs])
        return np.array([collections.Counter(col).most_common(1)[0][0]
                         for col in all_preds.T])

    def score(self, X, y):
        pred = self.predict(X)
        correct = 0
        for i, j in zip(y, pred):
            if i == j:
                correct += 1
        return float(correct) / float(len(y))

    def pandas_to_numpy(self, x):
        if isinstance(x, (pd.DataFrame, pd.Series)):
            return x.to_numpy()
        if isinstance(x, np.ndarray):
            return x
        return np.array(x)

    def handle_1d_data(self, x):
        if x.ndim == 1:
            x = x.reshape(-1, 1)
        return x

    def convert_to_array(self, x):
        x = self.pandas_to_numpy(x)
        return self.handle_1d_data(x)
ALGORITHM DESIGN - XGBOOST
Steps:
1. Split the dataset into train and test datasets.
2. Construct a base tree using the initial training set.
3. Calculate the similarity weight and total gain for each split, and use the
   total gain to decide the priorities of the features.
4. Continue this process to form the complete decision tree.
5. Update the values of the residuals for the initial dataset.
6. Create a new decision tree in the same way as before, but using the
   updated residual values.
7. Keep updating the residual values after each decision tree is created, and
   repeat the process until the required number of iterations has been
   successfully performed.
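The residual-update loop in steps 5-7 can be demonstrated with the simplest possible weak learner. A real XGBoost round fits a regression tree to the residuals; here the "tree" is just their mean, and the targets are made-up numbers, purely to show the update mechanics:

```python
import numpy as np

y = np.array([3.0, 5.0, 9.0, 11.0])   # toy targets
pred = np.zeros_like(y)               # the ensemble starts from zero
learning_rate = 0.5
for _ in range(20):
    residuals = y - pred              # steps 5/7: residuals of the current model
    weak_learner = residuals.mean()   # stand-in for a fitted decision tree
    pred += learning_rate * weak_learner  # step 6: add the new learner's output
print(np.round(pred, 2))  # every prediction converges to the mean of y (7.0)
```

Because each new learner is fitted to what the current ensemble still gets wrong, the residual error shrinks with every iteration.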
ALGORITHM – XGBOOST OPTIMISATION
Steps:
1. List the parameter values we want to consider; the best ones will be
   selected from among these.
2. Create a RandomizedSearchCV object, fit the data and print the accuracy.
   Unlike grid search, RandomizedSearchCV samples a fixed number (n_iter) of
   random combinations from the values passed and evaluates the model for
   each sampled combination. We use this to get the accuracy for every
   sampled combination and choose the one with the best accuracy and
   performance.
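RandomizedSearchCV (from scikit-learn) draws n_iter random combinations rather than trying the full grid. The idea can be sketched without dependencies; the scoring function below is a stand-in for "cross-validation accuracy of XGBoost with these parameters", not a real evaluation:

```python
import random

def random_search(param_distributions, evaluate, n_iter=5, seed=123):
    # draw n_iter random combinations instead of trying the full grid
    rng = random.Random(seed)
    best_score, best_params = float('-inf'), None
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_distributions.items()}
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# stand-in scorer: pretends shallow-ish trees and a small learning rate win
def mock_cv_accuracy(params):
    return -abs(params['max_depth'] - 6) - params['learning_rate']

dists = {'max_depth': list(range(3, 12)), 'learning_rate': [0.01, 0.1, 0.3]}
best, _ = random_search(dists, mock_cv_accuracy, n_iter=10)
print(best)
```

Random search trades exhaustiveness for speed: with 5 parameters and several values each, the full grid would be hundreds of fits, while n_iter bounds the cost directly.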
XGBOOST - PSEUDO CODE
PURPOSE: Prediction of the presence of Parkinson's disease.
INPUT: Voice features obtained from the cleaned dataset after dimensionality
reduction.
OUTPUT: Presence or absence of Parkinson's disease.
STEPS:
1. Create the initial base decision tree from the training set, using the
   similarity weight and total gain to choose each split.
2. Update the residual values and create new decision trees using these new
   values. Each new learner thus fits the residuals of the previous model,
   reducing the residual error in successive models.
IMPLEMENTATION OF XGBOOST
import numpy as np
import matplotlib
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score

# Baseline model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
xgb_clf = xgb.XGBClassifier(random_state=123)
xgb_clf.set_params(n_estimators=10)
xgb_clf.fit(X_train, y_train)
preds = xgb_clf.predict(X_test)
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("Baseline accuracy:", accuracy)

# Feature importance plots (by split count, then by gain)
matplotlib.rcParams['figure.figsize'] = (10.0, 8)
xgb.plot_importance(xgb_clf)
xgb.plot_importance(xgb_clf, importance_type="gain")

# Cross-validation with the native xgb.cv API
df_dmatrix = xgb.DMatrix(data=X, label=y)
params = {"objective": "binary:logistic", 'max_depth': 3}
xgb_cv = xgb.cv(dtrain=df_dmatrix, params=params, nfold=3, num_boost_round=10, seed=123)
accuracy = 1 - xgb_cv["test-error-mean"].iloc[-1]
print("baseline cv accuracy:", accuracy)
xgb_cv = xgb.cv(dtrain=df_dmatrix, params=params, nfold=3,
                num_boost_round=40, early_stopping_rounds=10, seed=123)
accuracy = 1 - xgb_cv["test-error-mean"].iloc[-1]
print("accuracy:", accuracy)

# Effect of individual hyperparameters, set cumulatively on the same model
xgb_clf = xgb.XGBClassifier(n_estimators=25, random_state=123)
for param in ({'max_depth': 10}, {'colsample_bytree': 0.5}, {'subsample': 0.75},
              {'gamma': 0.25}, {'learning_rate': 0.3}, {'reg_alpha': 0.01}):
    xgb_clf.set_params(**param)
    xgb_clf.fit(X_train, y_train)
    preds = xgb_clf.predict(X_test)
    print(param, accuracy_score(y_test, preds))

# Randomised hyperparameter search
rs_param_grid = {
    'max_depth': list(range(3, 12)),
    'alpha': [0, 0.001, 0.01, 0.1, 1],
    'subsample': [0.5, 0.75, 1],
    'learning_rate': np.linspace(0.01, 0.5, 10),
    'n_estimators': [10, 25, 40],
}
xgb_clf = xgb.XGBClassifier(random_state=123)
xgb_rs = RandomizedSearchCV(estimator=xgb_clf, param_distributions=rs_param_grid,
                            cv=3, n_iter=5, verbose=2, random_state=123)
xgb_rs.fit(X_train, y_train)
Testing
Unit Testing:
Login (Patient & Doctor)
Test case 1:
Test cases checked when a Patient or Doctor tries to log in from their
respective login page.
  2. Any field is left missing
     Expected result: it should not book an appointment and an alert to
     enter all fields must be displayed
     Actual result: it should not book an appointment and an alert to
     enter all fields must be displayed
     Status: Pass
Appointment Acceptance (Doctor)
Test case 4:
Test cases covered when a doctor accepts an appointment.
  3. User leaves a field empty while entering voice attribute values
     Expected result: generate the error "empty field detected"
     Actual result: the model does not process the input because of the
     wrong number of attributes passed, and an alert is released to enter
     values for all fields
     Status: Pass
  4. User enters a negative value as one of the input values for prediction
     Expected result: generate the error "negative value entered"
     Actual result: the model processes the input with the negative values
     and an alert is thrown to enter only positive numerical values
     Status: Pass
Applications
• To help in the early detection of Parkinson's disease, enabling earlier
diagnosis and thereby slowing disease progression.