Abstract: In this study, a novel method is proposed for the detection of Parkinson's disease using features obtained from speech signals. Detection and early diagnosis of Parkinson's disease are essential for managing disease progression and the treatment process. The Parkinson's disease dataset used in this study was obtained from the UCI machine learning repository. The proposed hybrid machine learning method consists of two stages: i) data pre-processing (over-sampling), ii) classification. The Parkinson's disease dataset (PD dataset) is a two-class dataset: 192 samples belong to normal (healthy) individuals and 564 samples belong to the diseased class (PD), so the dataset has an imbalanced class distribution. To transform this imbalanced dataset into a balanced one, the SMOTE (Synthetic Minority Over-Sampling Technique) method is used. After conversion to a balanced class distribution, the random forests classification method is used to classify the PD dataset. The PD dataset consists of 753 attributes. Random forests classification alone achieved 87.037% classification success on the PD dataset, while the proposed hybrid method (the combination of SMOTE and random forests) achieved 94.89%. The results show that this hybrid method is promising for discriminating the classes of the PD dataset.

Keywords: Parkinson disease, Hybrid method, imbalanced dataset, SMOTE, classification

I. INTRODUCTION

It is difficult to classify and categorize imbalanced data sets. When one class contains many samples and the other contains very few, generalization on the data set is a difficult problem. There are several approaches to solving the class imbalance problem in the literature. Before the model is created, the imbalance can be eliminated, regardless of the classifier, by artificially balancing the training data set. This method is known as data re-sampling. Alternatively, the classifier itself can be adapted to the class imbalance by modifying the model so that it estimates the minority class better. As a further alternative, only one of the classes can be modeled, which is called one-class learning [1,2,3].

Data re-sampling can be done by reducing the data in the majority class (under-sampling) or by increasing the instances of the minority class (over-sampling). Under-sampling is primarily used in large data sets, where the loss of information caused by discarding samples is marginal [1,2,3]. SMOTE, the method used in this study, is an over-sampling method developed by Chawla et al. in 2002 [4].

In its simplest form, over-sampling is carried out by replicating existing elements of the minority class in the training set. This method leads to overfitting [1,2,3]. To prevent this overfitting, new samples can be produced artificially from the distribution of the minority class. This approach is the Synthetic Minority Over-sampling Technique (SMOTE) [4].

There are many studies in the literature on the classification of the Parkinson's disease dataset. Some of these studies are given below. Sakar et al. applied several signal-processing algorithms to speech signals for Parkinson's disease and formed the PD data set. The authors examined the effect of the tunable Q-factor wavelet transform (TQWT) method and obtained good results [5]. For the classification of Parkinson's disease, Salama A. Mostafa et al. proposed a new method called the Multiple Feature Evaluation Approach (MFEA) and combined it individually with several classification algorithms. They achieved their best success with the SVM-MFEA combination [6]. Deepak Joshi et al. proposed a new hybrid method for the detection of Parkinson's disease from walking signals. In this method, wavelet analysis methods are combined with SVM [7].

In contrast to these studies, a new hybrid method based on SMOTE and random forests classification is proposed in this work, and promising results were obtained by applying it to the Parkinson's disease dataset.

II. MATERIAL AND METHOD

A. Parkinson Disease Data Set

The Parkinson's disease dataset was taken from the UCI machine-learning database [8]. This dataset contains 756 samples and 753 features. It is a two-class problem: 192 samples belong to healthy individuals and 564 samples belong to patients. Therefore, the dataset is imbalanced. Sakar et al. created the Parkinson's disease dataset in 2018 [5]. The features were obtained with speech signal processing methods. The PD data were taken from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 [5]. Figure 1 shows the class distribution of the PD dataset.
Fig. 1. The class distribution of the Parkinson's disease dataset

Fig. 3. The schematic representation of the SMOTE algorithm [9]
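
The interpolation idea behind SMOTE, in which new minority samples are produced artificially instead of existing ones being copied, can be illustrated with a minimal sketch. This is a simplified SMOTE-style routine under stated assumptions and is not the implementation used in the study. The function name `smote_like_oversample`, the neighbour count `k=5`, the random choice of base samples, the 1:1 target balance, and the two-dimensional toy data are all assumptions introduced for illustration. Only the class counts (192 healthy, 564 PD) come from the paper.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours.

    Simplification (assumption): base samples are drawn at random instead of
    iterating over every minority sample a fixed number of times.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Class counts reported for the PD dataset.
n_healthy, n_pd = 192, 564
# Synthetic healthy samples needed for a 1:1 balance (assumed target ratio).
n_needed = n_pd - n_healthy

# Toy 2-D stand-in for the minority class, NOT the real 753 speech features.
toy_rng = random.Random(42)
toy_minority = [
    [toy_rng.gauss(0.0, 1.0), toy_rng.gauss(0.0, 1.0)] for _ in range(n_healthy)
]

new_points = smote_like_oversample(toy_minority, n_needed, k=5, seed=1)
balanced_minority = toy_minority + new_points
print(n_needed, len(balanced_minority))  # prints: 372 564
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region already covered by the minority class. This is why the approach is less prone to the overfitting caused by exact replication.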
TABLE III
THE PROPOSED HYBRID METHOD RESULTS IN DETECTION OF PARKINSON DISEASE DATASET USING HOLD OUT METHOD (50-50% TRAIN-TEST PARTITION)

Classification Acc. (%)   Kappa   Precision   Recall   F-Measure   AUC
92.34                     0.833   0.924       0.923    0.923       0.973

TABLE IV
THE PROPOSED HYBRID METHOD RESULTS IN DETECTION OF PARKINSON DISEASE DATASET USING 10-FOLD CROSS VALIDATION

Classification Acc. (%)   Kappa   Precision   Recall   F-Measure   AUC

[7] Deepak Joshi, Aayushi Khajuria, Pradeep Joshi, An automatic non-invasive method for Parkinson's disease classification, Computer Methods and Programs in Biomedicine, 145, 2017, 135-145.
[8] https://archive.ics.uci.edu/ml/datasets.html, (last accessed: April, 2019).
[9] https://medium.com/coinmonks/smote-and-adasyn-handling-imbalanced-data-set-34f5223e167, (last accessed: April, 2019).
[10] https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd, (last accessed: April, 2019).
[11] L. Breiman, Random forests, Machine Learning, 45 (1) (2001), pp. 5-32.