MANAGEMENT,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021
Automatic Diagnosis of
Parkinson's Disease Using
Machine Learning
A Comparative Study of Different Feature
Selection Algorithms, Classifiers and Sampling
Methods
JEANNIE HE
Abstract
Over the past few years, several studies have been published proposing algorithms for the automated diagnosis of Parkinson's disease using simple exams such as drawing and voice exams. However, while every classifier appears to have been outperformed by another classifier in at least one study, there appears to be no study on how well different classifiers work with a given feature selection algorithm and sampling method. More importantly, there appears to be no study that compares the proposed feature selection algorithm and/or sampling method with a baseline that involves no feature selection or oversampling. This leaves us with the question of which combination of feature selection algorithm, sampling method and classifier is the best, as well as what impact feature selection and oversampling have on the performance. Given the importance of providing a quick and accurate diagnosis of Parkinson's disease, a comparison is made between different systems of classifier, feature selection algorithm and sampling method with a focus on predictive performance. One system was chosen as the best system for the diagnosis of Parkinson's disease based on its comparative predictive performance on two sets of data - one from drawing exams and one from voice exams.
Keywords
Machine learning, Parkinson's disease, Feature Selection, Greedy Search, Genetic Algorithm, Diagnosis of Parkinson's Disease, Drawing Exams, Voice Exams
Sammanfattning
As one of the world's most common diseases, with a tendency to lead to disability, Parkinson's disease has long been at the centre of research. To ensure that as many people as possible receive treatment before it is too late, several studies have been published proposing algorithms for the automatic diagnosis of Parkinson's disease. However, while every classifier appears to have been outperformed by another classifier in at least one study, there appears to be no study on how well different classifiers work with a particular combination of feature selection algorithm and sampling method. Furthermore, there appears to be no study in which the result of the proposed feature selection algorithm and/or sampling method is compared with the result of applying the classifier directly to the data without any feature selection or resampling. This leaves us with the question of which system of classifier, feature selection algorithm and sampling method one should choose, and whether it is worth using a feature selection algorithm and an oversampling method. Given the importance of detecting Parkinson's disease quickly and accurately, a comparison was made to find the best combination of classifier, feature selection algorithm and sampling method for the automatic diagnosis of Parkinson's disease.
Keywords
Machine learning, Parkinson's disease, Greedy search, Genetic algorithm, Feature selection, Diagnosis of Parkinson's disease, Drawing exams, Voice exams
Acknowledgements
I would like to thank my supervisor and examiner for their support as well
as the Second Affiliated Hospital Zhejiang University School of Medicine for
giving me this exciting project.
Contents
1 Introduction 1
  1.1 Background 1
  1.2 Problem Definition 1
    1.2.1 Original Problem 1
    1.2.2 Scientific and Engineering Issues 2
    1.2.3 Research Question 2
  1.3 Purpose and Goals 2
    1.3.1 Purpose 2
    1.3.2 Goals 2
  1.4 Research Methodology 3
  1.5 Thesis Scope 3
2 Background 5
  2.1 Parkinson's Disease 5
    2.1.1 Automatic Diagnosis of Parkinson's Disease 6
  2.2 Feature Selection 8
    2.2.1 Search Strategy 9
    2.2.2 Search Direction 9
    2.2.3 Search Heuristic 10
    2.2.4 Greedy Search Algorithm for Feature Selection 10
    2.2.5 Genetic Algorithm for Feature Selection 11
  2.3 Random Oversampling 12
  2.4 Classifiers 13
    2.4.1 Random Forest Classifier 13
    2.4.2 Support Vector Machine 13
    2.4.3 Multi-Layer Perceptron 14
  2.5 Cross-Validation 15
  2.6 Significance Testing 17
    2.6.1 Null Hypothesis and P-value 17
3 Methodology 27
  3.1 Data 27
    3.1.1 Drawing Data 27
    3.1.2 Voice Data 28
  3.2 Method 28
    3.2.1 Feature Extraction 28
  3.3 Finding the Best System 31
  3.4 Feature Selection 31
    3.4.1 Problem Encoding 31
    3.4.2 Forward Greedy Search 31
    3.4.3 Genetic Search 32
    3.4.4 Random Oversampling Versus No Oversampling 36
  3.5 Classifiers 36
  3.6 Validation and Testing 37
    3.6.1 Cross-validation 37
    3.6.2 Metrics 39
    3.6.3 Confidence Intervals 40
    3.6.4 Significance Testing 41
  3.7 Programming Language and Library 41
  3.8 Hyperparameter Settings 41
4 Results 43
  4.1 Clarification of the Names 43
  4.2 Results on Drawing Data 43
    4.2.1 MCC 44
    4.2.2 Accuracy 44
    4.2.3 F1 Score 46
    4.2.4 Selected Features 46
    4.2.5 Calls 47
  4.3 Results on Voice Data 48
    4.3.1 MCC 48
    4.3.2 Accuracy 49
    4.3.3 F1 Score 50
    4.3.4 Selected Features 51
    4.3.5 Calls 52
5 Discussion 55
  5.1 Greedy Search Versus Genetic Algorithm 55
  5.2 The Best System 55
  5.3 Alternatives to the Best System 57
  5.4 Ethics, Economics and Sustainability 58
  5.5 Potential Parties of Interest 59
References 63
List of Acronyms and Abbreviations
CV cross-validation
DT Decision Tree
GA genetic algorithm
GS greedy search
LR Logistic Regression
PD Parkinson's disease
RF Random Forest
Chapter 1
Introduction
1.1 Background
Although many studies report promising results, it is still unclear which approach is the best, since they all use different data. At the same time, there is a problem of researchers using the same data for hyperparameter tuning and testing, making the performance look better than it is [11]. Hence, it is in the interest of the Second Affiliated Hospital Zhejiang University School of Medicine to implement and compare the best-performing components from recent studies as a basis for developing this software program for the screening of PD.
computing resource. By solving this, we will help hospitals lower their workload while making sure that those who suffer from PD can be detected before it is too late.
1.3.2 Goals
The goal of this project is to implement and evaluate our proposal to see how
well it performs. This has been divided into the following sub-goals:
says so and the probability that a system will identify a patient as having PD
given that the patient has PD.
For the sake of simplicity and time, the thesis scope was limited to using only one classifier, one sampling technique and one feature selection algorithm in each system. For the same reason, no analysis was made of the result per epoch for any combination, and no hyperparameter tuning was conducted. Instead, all classifiers were trained using the default hyperparameters provided by Python's Scikit-Learn library, and the feature selection algorithms were implemented to run until the validation score stopped improving.
Moreover, since this is a thesis for a Master's degree in Computer Science, the medical, environmental, economic and ethical aspects were discussed only to a limited degree.
Furthermore, given that the voice and drawing data came from different sources where our knowledge of the participants is limited, no discussion was made of whether the voice data is better or worse than the drawing data.
Finally, considering the overrated performance problem mentioned by [13], no comparison was made between the results of this study and those of other studies.
Chapter 2
Background
Drawing Exams
As mentioned earlier, drawing exams are often part of the clinical diagnosis of PD thanks to their ability to capture the main symptoms of PD [2]. Because of this, several models have been proposed to automate this procedure. Often, the automation implies that a diagnosis would be made by extracting relevant features from the drawing(s) made by the patient and then using a pre-trained model to make a prediction based on these features [25]. Figure 2.1 shows the device used in Memedi et al. [1]'s study, where the participants were asked to draw upon a spiral template on a digital device. Figure 2.2 shows some drawings from Gupta et al. [2]'s study, where an algorithm was proposed to automatically diagnose PD using scanned drawings.
Figure 2.2 – Some spiral drawings from Gupta et al. [2]'s study, each belonging to a different participant: (a) a 58-year-old healthy participant, (b) a 28-year-old healthy participant, (c) a 56-year-old PD patient and (d) a 65-year-old PD patient.
Voice Exams
Although not as widely used as drawing exams in practice, voice exams have drawn the attention of several machine learning scholars as a way to enable the early detection of PD. This is both because vocal impairment is amongst the most common symptoms in early PD patients [20] and because the vocal abnormalities of early PD patients may be too vague to be perceptible to humans. In other words, it is hypothesized that the automatic diagnosis of PD through voice exams can help one detect PD earlier and is thereby worth investigating [26].
Figure 2.3 – A set of waveforms from Sakar et al. [3]. The upper waveform belongs to the voice of a healthy individual, the lower waveform to a PD patient. The y-axis shows the amplitude of the signal whereas the x-axis shows the timeline [3].
and high variance [30]. This gives feature selection the potential to improve
the performance of a machine learning algorithm by removing the part of the
data that is noisy, redundant and irrelevant [29].
To conduct a feature selection, one must decide upon the search strategy,
the search direction, the search heuristic and the stopping criterion [31].
3. Random search. This refers to strategies with random elements in the search, such as the genetic algorithm, which starts with randomly selected features and proceeds through recombination and selection steps with random elements. This strategy has gained several researchers' attention as an approach to avoid getting stuck in a local optimum without having to go through all possibilities.
1. Forward. This refers to searches starting with an empty feature set and gradually adding new features without changing previous choices.
2. Backward. This refers to searches starting with the full feature set and sequentially removing features without changing previous choices.
Stopping Criteria
To get an output, one must also define a stopping point for the algorithm. To do this, one can 1) set a maximum number of iterations; 2) set a limit on the number of features; 3) let the algorithm stop after exploring all alternatives; 4) stop the algorithm when the outcome stops improving; 5) stop the algorithm when the change in outcome becomes insignificant; and/or 6) stop the algorithm when the result is "good enough", i.e. the evaluation measure has reached a certain value [27].
lead to the global optimum solution, the aim is thus to find a solution that is
as good as possible within a reasonable amount of time [32].
When used for feature selection, the greedy search algorithm can be di-
vided into two categories. The first one is the forward greedy search algorithm,
where the solution is initialized as an empty set to be gradually populated by
adding those features that lead to the best outcome. In the backward greedy
search algorithm, the solution is instead initialized as the entire data set to have
it gradually reduced by removing those with the least positive impact on the
outcome [33].
The latter is discouraged by Zhang et al., partly because it can be computationally costly to start with all features, and partly because it has a higher risk of leading to high dimensionality: features that are more informative may be removed at the start because their information overlaps with that provided by other, less informative features [33].
Population Initialization
In general, a GA is initialized by randomly generating a population of indi-
viduals, each represented by a chromosome in the form of a sequence of val-
ues corresponding to a possible solution to the problem. With the population
initialized, a cycle of natural selection, cross-over and mutation can then be
conducted as an imitation of the natural evolution until the stopping criterion
is met [34].
Selection
Often, each iteration of the evolution process starts by selecting a set of solu-
tions for recombination such that new solutions can be generated. Often, this is
done through a mechanism that gives solutions with higher performance in the
problem space, also known as fitness, a higher probability of being chosen for
recombination. Here, one example is the roulette wheel selection mechanism
where a solution’s probability of being chosen is proportional to its fitness
[34].
Cross-Over
With the parents chosen, a set of individuals is to be generated through a process called cross-over. Commonly, this is done by swapping some elements between each pair of parents. For instance, one common approach is the single-point cross-over, where a cross-over point is randomly chosen such that two new solutions can be generated by swapping the elements situated to the right of the cross-over point. Another common approach is the two-point cross-over, where two cross-over points are randomly chosen such that two new solutions can be generated by swapping the elements situated between the chosen cross-over points [34].
Mutation
With the new solutions produced, the algorithm would generally continue with a process called mutation. In the case of binary encoding, this is commonly done by randomly performing a bit-wise negation of one or more bits in the current solution [34].
Elitism
Often, researchers would ensure that the best solution so far is never lost during
the process by passing the best solution(s) to the next generation. This concept
is called elitism [34].
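To make these steps concrete, the following is a minimal Python sketch (not taken from any of the cited studies) of one GA generation for feature selection with binary encoding, combining roulette-wheel selection, single-point cross-over, bit-flip mutation and elitism; the fitness function, population size and mutation rate used here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_generation(population, fitness_fn, mutation_rate=0.05):
    """One GA generation with roulette-wheel selection, single-point
    cross-over, bit-flip mutation and elitism (illustrative sketch only)."""
    fitness = np.array([fitness_fn(ind) for ind in population], dtype=float)
    probs = fitness / fitness.sum()               # roulette wheel probabilities
    elite = population[np.argmax(fitness)].copy()  # elitism: keep the best as-is

    children = [elite]
    while len(children) < len(population):
        # Fitness-proportional choice of two distinct parents.
        i, j = rng.choice(len(population), size=2, replace=False, p=probs)
        p1, p2 = population[i], population[j]
        # Single-point cross-over: swap the tails after a random point.
        point = rng.integers(1, len(p1))
        c1 = np.concatenate([p1[:point], p2[point:]])
        c2 = np.concatenate([p2[:point], p1[point:]])
        # Bit-flip mutation on each child.
        for c in (c1, c2):
            flip = rng.random(len(c)) < mutation_rate
            c[flip] = 1 - c[flip]
            children.append(c)
    return np.array(children[:len(population)])

# Toy usage: masks over 20 features, fitness favouring the first ten features.
population = rng.integers(0, 2, size=(6, 20))
next_population = one_generation(population, lambda mask: mask[:10].sum() + 1)
```

Repeating one_generation until a stopping criterion is met (see the stopping criteria above) yields the overall search loop.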
2.4 Classifiers
2.4.1 Random Forest Classifier
Figure 2.6 – An illustration of the classification flow of an MLP classifier from
the article written by Faghfour and Frish [6].
Figure 2.7 – An illustration of nested CV with an outer loop for testing and an
inner loop for validation. The white, blue, grey and black boxes are the decode,
test, training and validation data respectively. At each outer fold, the decode
data was partitioned into 4 inner folds so that an inner CV can be done using
the decode data. The illustration is made based on the description provided by
Varoquaux et al. [7].
2.5 Cross-Validation
In machine learning, one main challenge is the bias-variance trade-off, where adapting the model too much to the training data leads to variance due to overfitting, whereas the contrary can lead to bias due to underfitting [45]. Both are components of the total expected error, where bias reflects how the estimated value differs from the true value whereas variance reflects how the predicted value differs depending on the training data [46]. To address the bias-variance trade-off, one popular tool is CV [47]. By randomly partitioning the data into folds and, for each fold, using the other folds for model training and the fold itself for testing [7], CV can serve as a better tool for performance evaluation [48] and hyperparameter tuning than regular validation [47].
Firstly, by evaluating the model on each combination of training and test data, CV can reduce the risk of the performance being affected by how the data is partitioned, especially after repeated use of CV [48]. Secondly, using CV for hyperparameter tuning means that one can set the hyperparameters based on the results from using different combinations of samples as training data, thereby reducing the risk of overfitting. As a result, the model's general performance improves. For this reason, CV is a widely used tool for performance measurement [47] and hyperparameter tuning [48].
Nested Cross-Validation
While widely used for performance measurement and hyperparameter tuning, using CV for both purposes would make the report unreliable, as it means that the hyperparameters would be affected by the test data in a way that is not possible in real life [7]. In fact, Abdulaal et al. note in [11] that one problem with contemporary research is that several researchers have overreported the performance of their models by using flat CV for both hyperparameter tuning and performance measurement. As a solution, Varoquaux et al. [7] and Abdulaal et al. [11] proposed using nested CV with an outer loop for performance evaluation and an inner loop for hyperparameter tuning. This way, one can avoid bias in both performance measurement and hyperparameter tuning [7].
The nested CV starts by dividing the data into several folds. Depending on the type of CV, the division can be done differently (see section 2.5). With the data divided into folds, an outer loop is formed where the folds take turns being the test data while the rest are sent into the inner loop for hyperparameter tuning [7].
Inside each inner loop, the data is again divided into folds such that the folds can take turns being the validation data while the rest are sent to the classifier for training. With the classifier trained using the current hyperparameter setting and the current training data, a prediction is made on the current validation data and a performance metric is computed by comparing the classifier's prediction with the true outcome. By utilizing and comparing different hyperparameter settings, the "best" model is then found for each validation set of data in the inner loop [7].
Having built the final model for the current outer step, a prediction can then be made on the test data at the current outer step. By comparing the prediction with the true value, performance metrics can then be computed, and the process continues until there is no fold left to be tested [7].
Once all folds have been used as test data for performance measurement, one can compute the final performance metrics by averaging the performance metric at each outer step [11].
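As an illustration of the nested scheme described above, the following is a minimal sketch using scikit-learn, with a grid search in the inner loop and stratified folds in both loops; the synthetic data, the classifier and the parameter grid are placeholders and not part of this thesis's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder, imbalanced toy data.
X, y = make_classification(n_samples=200, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)   # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # testing

# The inner search is refit on every outer training fold, so the outer test
# fold never influences the chosen hyperparameters.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [50, 100]}, cv=inner_cv)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())
```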
(a) Two-sided test with H0: µ1 = µ2; (b) one-sided test with H0: µ1 ≥ µ2
Figure 2.8 – Ross [8]'s illustration of a null hypothesis test on two sample means, where µ1 and µ2 are the sample means; H0 is the null hypothesis; zα/2 and zα are two constants corresponding to the significance level in a two- and one-sided test respectively [8].
to be made such that the distribution of classes is about the same between the
folds. This way, the bias and variance problem commonly seen in regular CV
on imbalanced data can be mitigated, although not solved entirely [7].
the α. Often, α would be set to 0.05, but there are also cases where 0.10 is used [8].
As a threshold for the rejection of the null hypothesis, and thereby the acceptance of the alternative hypothesis as the opposite of the null hypothesis, the significance level α sets an upper bound on the probability that a certain observation with the stated statistical significance happened by chance. Since the rejection of the null hypothesis is based on this probability threshold [8], another common way to define the threshold is to state the confidence level 1 − α as the lower bound on the likelihood that a certain observation with the stated statistical significance did not happen by chance. For instance, if the null hypothesis is rejected at a significance level of α = 0.05, then one can state that the alternative hypothesis is true at a 95% confidence level, meaning that one can be 95% certain that the alternative hypothesis is true. Hence, the confidence level of 95% can also be used as a threshold, as it automatically implies α = 0.05, and so on [49].
OCFA, Sharma et al. [52] showed that MGWO provided better results than OCFA.
from using one single dimension with the result from using all dimensions.
As a result, they found that the highest accuracy comes from using all five
dimensions followed by using one of the coordinates at the horizontal plane
[59].
pressure [25].
As features related to rigidity and bradykinesia, they computed the average drawing speed and the total drawing time, along with the spiral size in the form of spiral width and height. The motivation for including spiral size was that, given that the participants were asked to draw on the same template, the spiral size could be used to reveal signs of micrographia [25].
Finally, they added average pressure, average grip angle and the standard deviation in grip angle as features specifically related to rigidity [25].
By measuring the performance of a logistic regression classifier trained on the features independently, they found that a static test is better at detecting PD through the observation of reduction and variation in pressure, while the dynamic test with a blinking template is better at detecting PD through the observation of greater variation in grip angle and increased pressure, possibly because of the challenge posed by the dynamic test, where the participants were required to trace a blinking spiral template while being stressed and distracted by the blinking of the template [25].
Chapter 3
Methodology
3.1 Data
3.1.1 Drawing Data
For the drawing data, we have utilized the data provided by Isenkul et al. [9], where a pen-and-tablet device, the Wacom Cintiq 12WX, was used to record each participant's drawing movements in terms of grip angle, pressure and the position of the pen at each timestamp.
Figure 3.1 – A spiral drawing from a PD patient in Isenkul et al. [9]'s study. The red line is the patient's drawing; the black spiral is the template.
Since not all participants have conducted the same tests, only those who have undertaken both the static and the dynamic spiral test were included. In both the static and the dynamic spiral test, the participant is asked to draw on a spiral template as shown in figure 3.1. The only difference between a static and a dynamic spiral test is that a dynamic spiral test (DST) is one where the template blinks whereas a static spiral test (SST) is one where the template does not blink [9]. This has resulted in 89 PD patients and 15 healthy participants.
Also, given that the goal is to develop a screening test for the general public,
the project provider is interested in a system that can accurately identify PD
using the data collectable through a device that everyone owns. For this reason,
grip angles were excluded.
3.2 Method
3.2.1 Feature Extraction
Since the voice data provided by Sakar et al. [54] was already converted into features, no feature extraction was needed for the voice data. However, since this is not the case with the drawing data, a set of functions was implemented to compute 12 kinematic variables, which were, in turn, converted into 132 features. The following sections describe how the kinematic variables and features were computed.
Kinematic Variables
As shown in figure 3.2, the coordinate system used by the device in Isenkul et al. [9]'s study has a centre at $(x_c, y_c) = (250, 200)$. Using this as the
centre of the spiral, the radius and angle corresponding to each data point were computed using the following formulas:

$$r_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2} \qquad (3.1)$$

$$\theta_i = \arccos\left(\frac{x_i - x_c}{r_i}\right) \qquad (3.2)$$

where $(x_i, y_i)$ is the pen's coordinate at timestamp $i$, $r_i$ is the radial distance between $(x_i, y_i)$ and the spiral centre, and $\theta_i$ is the corresponding angle.
Using the aforementioned definitions, twelve kinematic variables were computed for each test and each participant as the basis of the feature extraction. The variables were computed with the motivation that they were used by Memedi et al. [1] and/or Poon et al. [25] for the automatic diagnosis of PD. The variables are defined as follows (a small code sketch of these computations is given after the list):

1. velocity $v_i = \frac{\sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}}{t_{i+1} - t_i}$ for $i \in 1, ..., n-1$,

2. acceleration $a_i = \frac{v_{i+1} - v_i}{(t_{i+2} - t_i)/2}$ for $i \in 1, ..., n-2$,

3. radius $r_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}$ for $i \in 1, ..., n-1$,

4. radial velocity $rv_i = \frac{r_{i+1} - r_i}{t_{i+1} - t_i}$ for $i \in 1, ..., n-1$,

5. radial acceleration $ra_i = \frac{r_{i+1} - r_i}{(t_{i+2} - t_i)/2}$ for $i \in 1, ..., n-2$,

6. angular velocity $\omega_i = \frac{\theta_{i+1} - \theta_i}{t_{i+1} - t_i}$ for $i \in 1, ..., n-1$,

7. angular acceleration $\alpha_i = \frac{\theta_{i+1} - \theta_i}{(t_{i+2} - t_i)/2}$ for $i \in 1, ..., n-2$
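As a rough illustration, the sketch below computes the variables listed above from arrays of pen coordinates and timestamps; the spiral centre is the one stated in equations (3.1)-(3.2), while the handling of edge cases and of any variables not shown in the list is an assumption rather than the thesis's implementation.

```python
import numpy as np

def kinematic_variables(x, y, t, xc=250.0, yc=200.0):
    """Sketch of the kinematic variables above, computed from pen coordinates
    (x, y) sampled at timestamps t. The spiral centre (xc, yc) follows the
    device coordinate system described in the text."""
    x, y, t = (np.asarray(a, dtype=float) for a in (x, y, t))
    r = np.sqrt((x - xc) ** 2 + (y - yc) ** 2)      # radius, eq. (3.1)
    theta = np.arccos((x - xc) / r)                  # angle, eq. (3.2)

    dt = np.diff(t)                                  # t_{i+1} - t_i
    dt2 = (t[2:] - t[:-2]) / 2.0                     # (t_{i+2} - t_i) / 2
    velocity = np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2) / dt
    acceleration = np.diff(velocity) / dt2
    # The two acceleration-type variables below follow the list as printed;
    # the original thesis may instead difference rv_i and omega_i.
    return {
        "velocity": velocity,
        "acceleration": acceleration,
        "radius": r,
        "radial_velocity": np.diff(r) / dt,
        "radial_acceleration": np.diff(r)[:-1] / dt2,
        "angular_velocity": np.diff(theta) / dt,
        "angular_acceleration": np.diff(theta)[:-1] / dt2,
    }
```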
To make a distinction between the static and dynamic spiral test results, we have kept the features derived from each test as separate features. Since there are 2 test results and 12 kinematic variables, the aforementioned 5 statistical measures have thus led to 120 features per participant.
In contrast to the other features, the histogram distance was computed based on the results from two different tests - the static and dynamic spiral tests. For this reason, only 1 histogram distance was computed for each kinematic variable and participant. As there are 12 kinematic variables, this resulted in 12 additional features for each participant.
leads to the highest MCC without removing any previously added features. The algorithm was implemented to run until the MCC stopped increasing.
The MCC was chosen for several reasons. Firstly, studies have shown that performance-based search heuristics like MCC can provide higher performance than correlation-based search heuristics like χ² by adapting the search to the chosen classifier [27, 55]. Secondly, the data sets used in this study are imbalanced, and MCC is suitable for imbalanced data as it takes into account the ratio between the positive and negative samples [12].
The MCC was computed by conducting a 5-fold CV on the data that was sent to the greedy algorithm and then averaging the results. The implementation of the CV can be found in section 3.6.1.
Algorithm 1 Greedy
1: procedure greedy(X, y)
2:     n ← the number of feature values in X
3:     mask ← an array with n zeros
4:     backlog ← a shuffled array with i ∈ 1, 2, ..., n
5:     repeat
6:         for i ∈ backlog do
7:             mask_new ← mask.copy()
8:             mask_new[i] ← 1
9:             MCC_new ← avg(CrossValidation(X, y, mask_new))
10:            if MCC_new > MCC then
11:                MCC ← MCC_new
12:                mask ← mask_new
13:                chosen ← i
14:            end if
15:        end for
16:        backlog.remove(chosen)
17:        mask[chosen] ← 1
18:    until sum(mask) = sum(mask_new)
19:    return mask
20: end procedure
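The following is a minimal Python sketch in the spirit of Algorithm 1. It commits one feature per round (a common forward greedy variant) rather than reproducing the exact update order of the pseudocode, and it assumes a helper cross_validate_mcc(X, y, mask) that returns the average MCC of a 5-fold CV on the masked features, as well as an initial MCC value of -1; the helper name and the initial value are assumptions, not taken from the thesis.

```python
import numpy as np

def greedy_forward_selection(X, y, cross_validate_mcc, seed=0):
    """Forward greedy feature selection guided by the mean cross-validated
    MCC (sketch). `cross_validate_mcc(X, y, mask)` is an assumed helper that
    returns the average MCC of a 5-fold CV using only the masked features."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    mask = np.zeros(n, dtype=int)
    backlog = list(rng.permutation(n))     # shuffled candidate features
    best_mcc = -1.0                        # assumed starting value (MCC >= -1)
    improved = True
    while improved and backlog:
        improved, chosen = False, None
        for i in backlog:                  # try adding each remaining feature
            candidate = mask.copy()
            candidate[i] = 1
            mcc = cross_validate_mcc(X, y, candidate)
            if mcc > best_mcc:
                best_mcc, chosen, improved = mcc, i, True
        if chosen is not None:             # commit the best feature this round
            backlog.remove(chosen)
            mask[chosen] = 1
    return mask
```

The helper could, for instance, wrap StratifiedKFold and matthews_corrcoef from scikit-learn, similar to the CV loop in section 3.6.1.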
Population Initialization
To reduce the risk of getting stuck at local optima without having to go through
too many solutions, the population was initiated by randomly generating a
mask and then an opposite solution by conducting a bit-wise negation opera-
tion on the generated mask. The initial population was complemented with an
array with value one at all indices as a way to ensure that the final solution is
at least as good as using all indices in terms of validation MCC.
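A minimal sketch of this initialization, assuming binary masks of length n_features, could look as follows.

```python
import numpy as np

def initialize_population(n_features, seed=0):
    """Sketch of the initialization described above: one random mask, its
    bit-wise negation, and an all-ones mask that keeps every feature."""
    rng = np.random.default_rng(seed)
    random_mask = rng.integers(0, 2, size=n_features)
    return np.vstack([random_mask, 1 - random_mask,
                      np.ones(n_features, dtype=int)])

population = initialize_population(132)   # e.g. the 132 drawing features
```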
Selection
At each iteration of evolution, a number of solutions were selected for reproduction by randomly selecting 2 distinct parent solutions n times, where:

$$n = \left\lfloor \frac{\min(n_{solutions}, n_{features})}{2} \right\rfloor \qquad (3.4)$$

Here, $n_{solutions}$ is the number of solutions in the population, whereas $n_{features}$ is the number of features in the data.
Given MCC's suitability for this study (see section 3.4.2), each solution's probability of being chosen was set to be proportional to its MCC. Again, the MCC was computed by averaging the MCC results from a 5-fold CV on the data that was sent to the genetic algorithm (see algorithm 4).
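A minimal sketch of this selection step is given below; shifting the MCC to the normalised MCC of equation (3.5) so that the selection probabilities are non-negative is an assumption, since the thesis only states that the probability is proportional to the MCC.

```python
import numpy as np

def select_parent_pairs(population, mcc_scores, seed=0):
    """Sketch of the selection step: n pairs of distinct parents are drawn,
    with probabilities proportional to each solution's normalised MCC."""
    rng = np.random.default_rng(seed)
    n_solutions, n_features = population.shape
    n_pairs = min(n_solutions, n_features) // 2               # equation (3.4)
    fitness = (1 + np.asarray(mcc_scores, dtype=float)) / 2   # MCC_norm, assumed
    probs = fitness / fitness.sum()
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.choice(n_solutions, size=2, replace=False, p=probs)
        pairs.append((population[i], population[j]))
    return pairs
```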
Cross-Over
Without changing the solutions in the current generation, a cross-over operation was conducted on each pair of parents to produce new solutions (a minimal sketch is given after figure 3.3). To do this, two cross-over points were randomly chosen between the right side of the first element and the left side of the last element, such that the probability of them being the same became $p = \frac{1}{n-1}$.
In the case where two distinct cross-over points were chosen, a two-point cross-over operation would be conducted by swapping the elements between the two cross-over points, as shown in figure 3.3a. In the case where the two cross-over points happen to be the same, a one-point cross-over operation would be conducted by swapping the elements after the cross-over point, as shown in figure 3.3b.
Figure 3.3 – An illustration of the cross-over operations used in this study. The
vertical lines are the cross-over points. The arrays to the left are the parents
and the arrays to the right are the children.
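The following is a minimal sketch of the cross-over rule described above: two cut points are drawn at random; if they coincide, a one-point cross-over is performed, otherwise a two-point cross-over.

```python
import numpy as np

def crossover(parent_a, parent_b, seed=0):
    """Sketch of the cross-over used here: draw two random cut points and
    perform a one-point cross-over if they coincide, a two-point otherwise."""
    rng = np.random.default_rng(seed)
    n = len(parent_a)
    p1, p2 = sorted(rng.integers(1, n, size=2))    # cut points in 1..n-1
    child_a, child_b = parent_a.copy(), parent_b.copy()
    if p1 == p2:
        # One-point cross-over: swap everything after the cut point.
        child_a[p1:], child_b[p1:] = parent_b[p1:], parent_a[p1:]
    else:
        # Two-point cross-over: swap the segment between the two cut points.
        child_a[p1:p2], child_b[p1:p2] = parent_b[p1:p2], parent_a[p1:p2]
    return child_a, child_b
```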
Mutation
As mentioned earlier, each iteration of evolution involves the creation of mutated copies of half the population. This was achieved by creating a copy of
Stopping Criteria
To avoid the problem of local optima, the algorithm was also implemented to stop when the best solution had remained the same for three rounds. The eldest solution was always given a higher ranking in the scenario where two solutions have the same number of features and the same performance score, so the best solution would only change if a new solution was generated with a better performance score, or with as high a performance as the current best solution but with a smaller number of features (see algorithm 2).
3.5 Classifiers
As mentioned earlier, this study involves three classifiers. These are: RF, RBF-
SVM and MLP.
The rationale behind this is that these classifiers have outperformed several other classifiers in related studies: Amongst the studies about the drawing-based diagnosis of PD, RF has outperformed kNN, DT and SVM in the studies by Sharma et al. [52], Gupta et al. [2], Parziale et al. [62] and Memedi et al. [1]; amongst the studies about the voice-based diagnosis of PD, RBF-SVM has outperformed linear SVM, MLP, Naive Bayes, LR, RF and kNN in Sakar et al. [54]'s study about the voice-based automatic diagnosis of PD, whereas MLP has outperformed DT, Naive Bayes, RF and RBF-SVM in Mostafa et al. [53]'s study.
Since the focus of this study is not the implementation of a classifier, a library was used to implement the classifiers using existing code, namely the Scikit-learn library.
For the same reason, all classifiers were initialized using the default hyperparameters set by the provider of the library.
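For illustration, the three classifiers can be instantiated with scikit-learn's defaults as in the sketch below (SVC uses an RBF kernel by default); the dictionary itself is just an illustrative convenience, not part of the thesis code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# The three classifiers, left at scikit-learn's default hyperparameters.
classifiers = {
    "RF": RandomForestClassifier(),
    "RBF-SVM": SVC(),        # kernel="rbf" is the default
    "MLP": MLPClassifier(),
}
```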
As shown in algorithm 4, both the inner and the outer loop started by partitioning the already shuffled data into k stratified folds. The rationale behind shuffling outside the CV function was to make sure that each combination of classifier, random sampling and feature selection algorithm could be evaluated under the same circumstances. The reason why stratified CV was used was that the data sets in this study are imbalanced.
With the data partitioned into k folds, the algorithm was implemented to perform a random oversampling of the current training data. The rationale behind the oversampling was to prevent the classifier from being biased towards the majority class. This was done with the help of the RandomOverSampler module in the Python library Imbalanced-learn.
Having conducted a random oversampling of the data, the algorithm continued by calling the function evaluate() with the training and test data along with a feature mask. When the CV was used as part of the search heuristic inside a search algorithm, the mask was given as a parameter input; when the CV was used for performance measurement, the mask was the result returned by the search algorithm.
Having looped through all folds, the function would return a list of metrics corresponding to the metrics returned by the function evaluate().
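A minimal sketch of one level of this loop is given below; the function name, the use of RF as the default classifier and the reporting of only the MCC are simplifying assumptions, and the nested structure with an inner and an outer loop (algorithm 4) is not reproduced.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, mask, classifier=None, k=5):
    """Sketch of one CV level: stratified folds on already shuffled data,
    random oversampling of the training part of each fold only, and the
    classifier trained and evaluated on the masked features."""
    classifier = classifier or RandomForestClassifier()
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=k).split(X, y):
        X_train, y_train = X[train_idx][:, mask == 1], y[train_idx]
        X_test, y_test = X[test_idx][:, mask == 1], y[test_idx]
        # Oversample the training data only, so no test samples leak in.
        X_train, y_train = RandomOverSampler().fit_resample(X_train, y_train)
        classifier.fit(X_train, y_train)
        scores.append(matthews_corrcoef(y_test, classifier.predict(X_test)))
    return np.array(scores)
```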
3.6.2 Metrics
Prediction-related Metrics
For the purpose of performance measurements, a confusion matrix was com-
puted for each outer fold with the following components:
With the metrics in the confusion matrix calculated, the following metrics
were then calculated as performance metrics:
$$MCC_{norm} = \frac{1 + MCC}{2} \qquad (3.5)$$

$$MCC = \begin{cases} 0 & \text{if } TP + FN = 0 \text{ or } TN + FP = 0 \\ 0 & \text{if } TP + FP = 0 \text{ or } TN + FN = 0 \\ \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} & \text{otherwise} \end{cases} \qquad (3.6)$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3.7)$$

$$Precision = \frac{TP}{TP + FP} \qquad (3.8)$$

$$Recall = \frac{TP}{TP + FN} \qquad (3.9)$$

$$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \qquad (3.10)$$
The MCC was chosen to tackle the problem of imbalanced data, as suggested by Chicco and Jurman [12]. The accuracy and F1 score were chosen both as a way to make this thesis comparable with other studies and as a way to make the results more understandable, as they are among the measures most people would understand. The precision was chosen to demonstrate the probability that a patient has PD given that the system says so. The recall was chosen as a metric to demonstrate how good the system is at detecting PD patients.
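As a worked example of equations (3.5)-(3.10), the sketch below computes the metrics from the entries of a confusion matrix; the zero guards on precision, recall and F1 are assumptions, since the thesis only defines the zero cases for the MCC.

```python
import math

def performance_metrics(tp, fp, tn, fn):
    """Sketch of the metrics in equations (3.5)-(3.10), with the MCC set to 0
    whenever one of the marginal sums in its denominator is zero."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0 or tn + fn == 0:
        mcc = 0.0
    else:
        mcc = (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    precision = tp / (tp + fp) if tp + fp else 0.0     # assumed zero guard
    recall = tp / (tp + fn) if tp + fn else 0.0        # assumed zero guard
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "MCC": mcc,
        "MCC_norm": (1 + mcc) / 2,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }

print(performance_metrics(tp=40, fp=5, tn=30, fn=10))
```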
Other Metrics
Given that feature selection is a process with the potential to reduce the amount of required computational resources by reducing the number of features used for training and testing [27], one may want to see the number of features selected by the different feature selection algorithms depending on the system in which they are used, i.e. with which classifier and sampling method they are combined. For this reason, we have, for each system, registered the number of features selected by the feature selection algorithm. For the sake of simplicity, this metric will henceforth be denoted as Features.
Similarly, given that feature selection is a process that requires a certain amount of computational resources, and that the feature set evaluation through CV is likely the part requiring the largest proportion of these resources, we have, for each system, registered the number of times CV has been called for the evaluation of a feature set throughout the course of feature selection. To achieve this, we have set a counter to zero at the start of each feature selection and configured it to increment by one each time CV is called for the evaluation of one feature set. Following this logic, the registered counter is also the number of feature sets evaluated during feature selection. For the sake of simplicity, the value of the counter will henceforth be denoted as Calls.
where µ is the mean value, σ is the standard deviation between the folds
and n is the number of folds.
The classifiers and the partition algorithm StratifiedKFold were implemented using the default hyperparameters set by the Scikit-learn library. The same applies to the random oversampling method from the Imbalanced-learn library.
Chapter 4
Results
Table 4.1 – Test results on drawing data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.
4.2.1 MCC
As shown in table 4.2, RFGA+ has achieved the highest mean MCC, followed by RF+ and RF, with an insignificant difference between them. This suggests that while the use of GA and oversampling has led to a higher mean MCC, the mere use of an RF may be sufficient, as the observed difference may be due to chance.
While there is no significant difference between SVMGS, SVMGS+ and the systems involving the use of RF (p > 0.5), it should be noted that RFGA+ is the only system that has achieved a significantly higher MCC, at the 95% confidence level, than 8 other systems (e.g. SVM, SVM+, SVMGA, MLP, MLP+, MLPGA and MLPGA+; see table 4.2 for the exact p-values).
4.2.2 Accuracy
As shown in table 4.1 and table 4.3, RFGA+ has achieved the highest mean accuracy, followed by RF+ and RF, with an insignificant difference between them. This suggests that while the use of GA and oversampling has led to a higher mean accuracy, the mere use of an RF may be sufficient, as the observed difference may be due to chance.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 1.000e+00 - 4.051e-06 3.489e-01 1.000e+00 1.000e+00 4.051e-06 2.633e-01 1.212e-04 1.357e-01 1.000e+00 1.000e+00 1.076e-05 5.874e-01
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 6.855e-08 2.999e-02 1.000e+00 8.245e-01 6.855e-08 2.146e-02 3.090e-06 9.810e-03 1.000e+00 1.000e+00 2.040e-07 5.586e-02
RFGS - - - - - - 2.415e-05 9.660e-01 1.000e+00 1.000e+00 2.415e-05 7.473e-01 5.951e-04 4.077e-01 1.000e+00 1.000e+00 6.079e-05 1.000e+00
RFGS+ - - 1.000e+00 - - - 2.574e-05 1.000e+00 1.000e+00 1.000e+00 2.574e-05 7.751e-01 6.297e-04 4.237e-01 1.000e+00 1.000e+00 6.467e-05 1.000e+00
RFGA - - 1.000e+00 1.000e+00 - - 1.352e-05 6.969e-01 1.000e+00 1.000e+00 1.352e-05 5.347e-01 3.554e-04 2.863e-01 1.000e+00 1.000e+00 3.466e-05 1.000e+00
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.630e-08 1.649e-02 1.000e+00 5.168e-01 2.630e-08 1.166e-02 1.298e-06 5.191e-03 7.284e-01 1.000e+00 8.025e-08 3.142e-02
SVM - - - - - - - - - - - - - - - - - -
SVM+ - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 -
SVMGS - - - - - - 2.256e-03 1.000e+00 - 1.000e+00 2.256e-03 1.000e+00 3.215e-02 1.000e+00 1.000e+00 1.000e+00 4.880e-03 1.000e+00
SVMGS+ - - - - - - 8.467e-02 1.000e+00 - - 8.467e-02 1.000e+00 7.177e-01 1.000e+00 - - 1.585e-01 1.000e+00
SVMGA - - - - - - - - - - - - - - - - - -
SVMGA+ - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 1.000e+00 - - 1.000e+00 -
MLP - - - - - - 1.000e+00 - - - 1.000e+00 - - - - - 1.000e+00 -
MLP+ - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 -
MLPGS - - - - - - 5.636e-02 1.000e+00 - 1.000e+00 5.636e-02 1.000e+00 5.091e-01 1.000e+00 - - 1.074e-01 1.000e+00
MLPGS+ - - - - - - 8.238e-03 1.000e+00 - 1.000e+00 8.238e-03 1.000e+00 9.855e-02 1.000e+00 1.000e+00 - 1.697e-02 1.000e+00
MLPGA - - - - - - 1.000e+00 - - - 1.000e+00 - - - - - - -
MLPGA+ - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 -
Table 4.2 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same MCC. The symbol "-"
is used when system A does not have a higher mean value than system B. A
p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 1.000e+00 - 3.033e-01 2.514e-06 1.000e+00 8.209e-02 3.033e-01 8.611e-07 5.631e-02 1.133e-04 1.000e+00 5.444e-01 6.826e-04 5.989e-05
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 3.791e-02 7.724e-08 1.000e+00 8.409e-03 3.791e-02 2.391e-08 5.463e-03 5.123e-06 1.000e+00 7.486e-02 3.772e-05 2.531e-06
RFGS - - - - - - 8.926e-01 1.677e-05 1.000e+00 2.712e-01 8.926e-01 6.097e-06 1.920e-01 6.027e-04 1.000e+00 1.000e+00 3.231e-03 3.316e-04
RFGS+ - - 1.000e+00 - - - 8.862e-01 1.656e-05 1.000e+00 2.691e-01 8.862e-01 6.016e-06 1.904e-01 5.959e-04 1.000e+00 1.000e+00 3.197e-03 3.278e-04
RFGA - - 1.000e+00 1.000e+00 - - 4.876e-01 5.741e-06 1.000e+00 1.387e-01 4.876e-01 2.017e-06 9.645e-02 2.348e-04 1.000e+00 8.548e-01 1.346e-03 1.263e-04
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 3.150e-02 5.714e-08 1.000e+00 6.872e-03 3.150e-02 1.754e-08 4.444e-03 3.913e-06 1.000e+00 6.269e-02 2.929e-05 1.922e-06
SVM - - - - - - - 1.000e+00 - 1.000e+00 - 9.557e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGS - - - - - - 1.000e+00 1.233e-03 - 1.000e+00 1.000e+00 5.198e-04 1.000e+00 2.515e-02 1.000e+00 1.000e+00 1.007e-01 1.528e-02
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA - - - - - - - 1.000e+00 - 1.000e+00 - 9.557e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - - 1.000e+00 - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - 1.000e+00 -
MLPGS - - - - - - 1.000e+00 7.377e-03 - 1.000e+00 1.000e+00 3.335e-03 1.000e+00 1.156e-01 - 1.000e+00 4.039e-01 7.355e-02
MLPGS+ - - - - - - - 9.660e-01 - 1.000e+00 - 5.485e-01 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - - - - - -
MLPGA+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -
Table 4.3 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same accuracy. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
Again, regardless of whether one looks at the confidence level of 90%, 95% or 99%, RFGA+ and RF+ share the highest position in terms of the number of systems over which each of them has achieved a significantly higher accuracy (see table 4.3 for the exact systems and the corresponding p-values).
Nevertheless, it should be noted that RFGA+ has outperformed RFGA both in terms of mean accuracy and in terms of the number of systems over which it has achieved a significantly higher accuracy at the confidence levels of 90%, 95% and 99%. More importantly, it should be noted that RFGA+ and RF+ are the only systems achieving a significantly higher accuracy, at a confidence level of 95%, than all systems except for SVMGS, SVMGS+, MLPGS+ and those involving the use of RF (see table 4.3 for the exact p-values). This suggests that it may be advisable to combine RF with random oversampling or another oversampling method.
4.2.3 F1 Score
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 1.000e+00 - 8.630e-01 1.457e-06 1.000e+00 1.757e-02 8.630e-01 5.492e-07 9.989e-02 7.870e-05 1.000e+00 1.908e-01 9.898e-04 2.215e-05
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.165e-01 3.575e-08 1.000e+00 1.282e-03 1.165e-01 1.225e-08 9.406e-03 2.934e-06 1.000e+00 1.990e-02 4.959e-05 7.190e-07
RFGS - - - - - - 1.000e+00 1.650e-05 1.000e+00 9.203e-02 1.000e+00 6.651e-06 4.371e-01 6.623e-04 1.000e+00 7.770e-01 6.758e-03 2.056e-04
RFGS+ - - 1.000e+00 - - - 1.000e+00 2.532e-05 1.000e+00 1.226e-01 1.000e+00 1.034e-05 5.635e-01 9.627e-04 1.000e+00 9.886e-01 9.453e-03 3.044e-04
RFGA - - 1.000e+00 1.000e+00 - - 1.000e+00 2.408e-06 1.000e+00 2.485e-02 1.000e+00 9.199e-07 1.363e-01 1.225e-04 1.000e+00 2.565e-01 1.477e-03 3.517e-05
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.025e-01 2.844e-08 1.000e+00 1.087e-03 1.025e-01 9.687e-09 8.099e-03 2.392e-06 1.000e+00 1.724e-02 4.113e-05 5.813e-07
SVM - - - - - - - 4.542e-01 - 1.000e+00 - 2.638e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGS - - - - - - 1.000e+00 2.884e-03 - 1.000e+00 1.000e+00 1.369e-03 1.000e+00 5.674e-02 1.000e+00 1.000e+00 3.506e-01 2.230e-02
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA - - - - - - - 4.542e-01 - 1.000e+00 - 2.638e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - - 1.000e+00 - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - -
MLPGS - - - - - - 1.000e+00 7.626e-03 - 1.000e+00 1.000e+00 3.747e-03 1.000e+00 1.290e-01 - 1.000e+00 7.170e-01 5.329e-02
MLPGS+ - - - - - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - -
MLPGA+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -
Table 4.4 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same F1 score. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
As shown in table 4.1 and table 4.4, RFGA+ has achieved the highest mean F1 score, followed by RF+ and RF, with an insignificant difference between them. This suggests that while the use of GA and oversampling has led to a higher mean F1 score, the mere use of an RF may be sufficient, as the observed difference may be due to chance.
Also, regardless of whether one looks at the confidence level of 95% or 99%, RFGA+ and RF+ share the highest position in terms of the number of systems over which each of them has achieved a significantly higher F1 score (see table 4.4 for the exact systems and the corresponding p-values).
Nevertheless, it should be noted that RFGA+ has outperformed RFGA both in terms of mean F1 score and in terms of the number of systems over which it has achieved a significantly higher F1 score at the confidence levels of 90%, 95% and 99%. More importantly, it should be noted that RFGA+ and RF+ are the only systems achieving a significantly higher F1 score, at a confidence level of 95%, than all systems except for SVMGS, SVMGS+, MLPGS+ and those involving the use of RF (see table 4.4 for the exact p-values). This suggests that it may be advisable to combine RF with random oversampling or another oversampling method.
System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - 1.000e+00 9.885e-05 1.969e-05 - - 1.734e-02 4.136e-06 1.000e+00 1.000e+00 2.070e-07 5.439e-06
RFGS+ - - 2.470e-04 5.206e-05 - - 3.523e-02 1.152e-05 1.000e+00 1.000e+00 6.324e-07 1.501e-05
RFGA - - - 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00
RFGA+ - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGS 1.000e+00 1.000e+00 6.477e-06 1.103e-06 - 1.000e+00 2.027e-03 2.004e-07 1.000e+00 1.000e+00 7.735e-09 2.702e-07
SVMGS+ 1.000e+00 1.000e+00 2.725e-05 5.031e-06 - - 6.325e-03 9.853e-07 1.000e+00 1.000e+00 4.347e-08 1.311e-06
SVMGA - - 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA+ - - - - - - - - - - 1.000e+00 1.000e+00
MLPGS - - 1.522e-02 4.255e-03 - - 7.725e-01 1.220e-03 - 1.000e+00 1.070e-04 1.521e-03
MLPGS+ - - 2.515e-02 7.309e-03 - - 1.000e+00 2.172e-03 - - 2.031e-04 2.691e-03
MLPGA - - - - - - - - - - - -
MLPGA+ - - - - - - - - - - 1.000e+00 -
Table 4.5 – Dunn's test results on the drawing data: p-values corresponding to the null hypothesis that system A and B have the same number of selected features (Features). The symbol "-" is used when system A does not have a lower mean value than system B. A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if p ∈ (0.05, 0.1] and black if p > 0.1.
observed on the confidence level of 95% are when a system utilising greedy
search (GS) is compared with a system utilising GA. This suggests that GS
indeed tends to lead to a lower number of features than GA and that while
the choice of classifier and sampling method appears to have an insignificant
impact on the resulting number of selected features, the choice of feature se-
lection algorithm has a certain impact.
4.2.5 Calls
System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - 1.000e+00 - - - - - - 1.000e+00 1.000e+00 - -
RFGS+ - - - - - - - - 1.000e+00 1.000e+00 - -
RFGA 1.146e-03 4.839e-04 - - 1.058e-02 3.477e-03 1.000e+00 - 3.681e-06 1.753e-06 - -
RFGA+ 1.384e-04 5.391e-05 1.000e+00 - 1.598e-03 4.681e-04 1.000e+00 1.000e+00 2.718e-07 1.223e-07 1.000e+00 1.000e+00
SVMGS 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00 - -
SVMGS+ 1.000e+00 1.000e+00 - - - - - - 1.000e+00 1.000e+00 - -
SVMGA 1.598e-03 6.838e-04 - - 1.422e-02 4.762e-03 - - 5.567e-06 2.676e-06 - -
SVMGA+ 5.128e-04 2.098e-04 1.000e+00 - 5.166e-03 1.623e-03 1.000e+00 - 1.359e-06 6.328e-07 - 1.000e+00
MLPGS - - - - - - - - - 1.000e+00 - -
MLPGS+ - - - - - - - - - - - -
MLPGA 4.528e-04 1.844e-04 1.000e+00 - 4.622e-03 1.443e-03 1.000e+00 - 1.166e-06 5.409e-07 - 1.000e+00
MLPGA+ 2.688e-04 1.073e-04 1.000e+00 - 2.899e-03 8.795e-04 1.000e+00 - 6.133e-07 2.807e-07 - -
Table 4.6 – Dunn's test results on the drawing data: p-values corresponding to the null hypothesis that system A and B have the same number of CV calls (Calls). The symbol "-" is used when system A does not have a lower mean value than system B. A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if p ∈ (0.05, 0.1] and black if p > 0.1.
classifier and sampling method each of them has been used together with (see table 4.1).
Also, while there is no significant difference in the number of cross-validation calls amongst those utilising the same feature selection algorithm, there is indeed a significant difference between those utilising GS and those utilising GA, regardless of which classifier and sampling method they have been combined with (p < 0.01; see table 4.6 for the exact p-values). This suggests that GS has indeed made a significantly higher number of CV calls than GA, with statistical significance at a confidence level of 99%, and thereby also at 95% and 90%.
Furthermore, it is worth noting that while the number of features appears to be proportional to the number of CV calls amongst the systems involving the use of GS, this is not true amongst those involving the use of GA. As shown in table 4.1, the order in which the systems utilising GS have led to the fewest features also happens to be the order in which the systems utilising GS have led to the fewest CV calls. This, however, is not the case when GS is replaced by GA.
4.3.1 MCC
Table 4.7 – Test results on voice data with 95% confidence intervals. The metrics were computed based on the definitions described in section 3.6.2. By comparing the mean values before rounding, the best, second best and third best values were identified and written in red, orange and yellow respectively.
As shown in table 4.7 and table 4.8, RFGA+ has achieved the highest mean MCC, followed by RF+ and RF, with an insignificant difference between them (p = 1.0). This suggests that while the use of GA and oversampling has led to a higher mean MCC, the mere use of an RF may be sufficient, as the observed difference may be due to chance.
Interestingly, Dunn's test has shown that all systems involving the use of RF have, together with SVMGS and SVMGS+, achieved a significantly higher MCC, at the 99% confidence level, than SVM, SVMGA, MLP, MLP+, MLPGA and MLPGA+. Moreover, Dunn's test has shown that all systems involving the use of RF have, along with SVMGS and SVMGS+, achieved the highest MCC with an insignificant difference between them (p = 1.0).
While this suggests that the combined use of SVM and GS has the potential to achieve as high an MCC as any system involving the use of RF, it should be noted that the systems involving the use of RF are those with the highest MCC, especially RFGA+.
4.3.2 Accuracy
As shown in table 4.7 and table 4.9, RFGA+ has achieved the highest mean accuracy, followed by RF and RFGA, with an insignificant difference between them (p = 1.0).
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.777e-05 1.000e+00 1.000e+00 1.000e+00 1.163e-06 1.000e+00 3.073e-06 6.957e-05 1.000e+00 1.000e+00 7.497e-06 4.367e-06
RF+ - - 1.000e+00 1.000e+00 1.000e+00 - 1.857e-05 1.000e+00 1.000e+00 1.000e+00 7.469e-07 1.000e+00 1.997e-06 4.710e-05 1.000e+00 1.000e+00 4.928e-06 2.851e-06
RFGS - - - 1.000e+00 - - 6.838e-04 1.000e+00 1.000e+00 1.000e+00 4.028e-05 1.000e+00 9.619e-05 1.542e-03 1.000e+00 1.000e+00 2.133e-04 1.317e-04
RFGS+ - - - - - - 1.267e-03 1.000e+00 1.000e+00 1.000e+00 8.011e-05 1.000e+00 1.873e-04 2.797e-03 1.000e+00 1.000e+00 4.074e-04 2.545e-04
RFGA - - 1.000e+00 1.000e+00 - - 7.000e-05 1.000e+00 1.000e+00 1.000e+00 3.222e-06 1.000e+00 8.278e-06 1.703e-04 1.000e+00 1.000e+00 1.967e-05 1.165e-05
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.110e-06 1.000e+00 1.000e+00 1.000e+00 6.882e-08 1.000e+00 1.959e-07 5.709e-06 1.000e+00 1.000e+00 5.125e-07 2.861e-07
SVM - - - - - - - - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 1.000e+00
SVM+ - - - - - - 1.611e-01 - - - 1.916e-02 - 3.712e-02 2.930e-01 1.000e+00 1.000e+00 6.766e-02 4.707e-02
SVMGS - - - - - - 1.173e-03 1.000e+00 - 1.000e+00 7.352e-05 1.000e+00 1.724e-04 2.597e-03 1.000e+00 1.000e+00 3.758e-04 2.344e-04
SVMGS+ - - - - - - 3.435e-03 1.000e+00 - - 2.443e-04 1.000e+00 5.515e-04 7.312e-03 1.000e+00 1.000e+00 1.161e-03 7.398e-04
SVMGA - - - - - - - - - - - - - - - - - -
SVMGA+ - - - - - - 5.152e-02 1.000e+00 - - 5.200e-03 - 1.058e-02 9.852e-02 1.000e+00 1.000e+00 2.019e-02 1.366e-02
MLP - - - - - - - - - - 1.000e+00 - - - - - - -
MLP+ - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 1.000e+00
MLPGS - - - - - - 2.782e-01 - - - 3.595e-02 - 6.796e-02 4.932e-01 - - 1.210e-01 8.539e-02
MLPGS+ - - - - - - 1.794e-01 - - - 2.168e-02 - 4.182e-02 3.247e-01 1.000e+00 - 7.588e-02 5.293e-02
MLPGA - - - - - - - - - - 1.000e+00 - 1.000e+00 - - - - -
MLPGA+ - - - - - - - - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 -
Table 4.8 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same MCC. The symbol "-"
is used when system A does not have a higher mean value than system B. A
p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.928e-02 5.903e-02 1.000e+00 1.000e+00 1.812e-02 1.509e-01 1.616e-03 3.745e-05 1.000e+00 3.658e-02 5.465e-04 1.044e-05
RF+ - - 1.000e+00 1.000e+00 - - 1.736e-02 5.351e-02 1.000e+00 1.000e+00 1.631e-02 1.377e-01 1.434e-03 3.262e-05 1.000e+00 3.306e-02 4.824e-04 9.038e-06
RFGS - - - 1.000e+00 - - 2.445e-01 6.245e-01 1.000e+00 1.000e+00 2.321e-01 1.000e+00 2.958e-02 1.122e-03 1.000e+00 4.187e-01 1.161e-02 3.644e-04
RFGS+ - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 5.280e-01 3.558e-02 1.000e+00 1.000e+00 2.465e-01 1.384e-02
RFGA - 1.000e+00 1.000e+00 1.000e+00 - - 1.439e-02 4.490e-02 1.000e+00 1.000e+00 1.351e-02 1.168e-01 1.159e-03 2.550e-05 1.000e+00 2.759e-02 3.860e-04 6.994e-06
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.124e-03 7.469e-03 1.000e+00 4.268e-01 1.982e-03 2.161e-02 1.345e-04 2.143e-06 1.000e+00 4.357e-03 4.061e-05 5.317e-07
SVM - - - - - - - 1.000e+00 - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGS - - - 1.000e+00 - - 2.925e-01 7.369e-01 - 1.000e+00 2.778e-01 1.000e+00 3.641e-02 1.434e-03 1.000e+00 4.970e-01 1.446e-02 4.715e-04
SVMGS+ - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 5.400e-01
SVMGA - - - - - - - 1.000e+00 - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
MLP - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - 1.000e+00
MLPGS - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 8.487e-01 6.365e-02 - 1.000e+00 4.092e-01 2.560e-02
MLPGS+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -
Table 4.9 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same accuracy. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
higher mean accuracy, the mere use of an RF may be sufficient, as the observed
difference may be due to chance.
While there is no significant difference between SVMGS, SVMGS+ and
the systems involving the use of RF at the 90% confidence level, it should be
noted that, counting the systems over which each system achieved a higher
accuracy with a significant difference at the 99% confidence level, RFGA+ is
the only one that did so against eight other systems at the 99% confidence level,
and against one more system at the 95% confidence level (see table 4.9 for the
exact systems and the corresponding p-values).
4.3.3 F1 Score
As shown in table 4.7 and table 4.10, RF has achieved the highest mean F1
score followed by RFGA+ and RFGA with insignificant difference between
them (p = 1.0). This suggests that the use of GA and oversampling has not
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 4.063e-01 3.805e-02 1.000e+00 1.052e-01 5.049e-01 2.350e-02 2.742e-02 4.945e-04 1.000e+00 3.294e-03 7.505e-03 9.694e-05
RF+ - - 1.000e+00 1.000e+00 - - 5.402e-01 5.341e-02 1.000e+00 1.444e-01 6.676e-01 3.331e-02 3.876e-02 7.542e-04 1.000e+00 4.854e-03 1.089e-02 1.520e-04
RFGS - - - 1.000e+00 - - 1.000e+00 3.776e-01 1.000e+00 8.892e-01 1.000e+00 2.507e-01 2.860e-01 8.936e-03 1.000e+00 4.647e-02 9.449e-02 2.137e-03
RFGS+ - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 4.094e-01 - 1.000e+00 1.000e+00 1.339e-01
RFGA - 1.000e+00 1.000e+00 1.000e+00 - - 1.783e-01 1.437e-02 1.000e+00 4.226e-02 2.250e-01 8.630e-03 1.016e-02 1.484e-04 1.000e+00 1.087e-03 2.587e-03 2.696e-05
RFGA+ - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 8.895e-02 6.357e-03 1.000e+00 1.963e-02 1.136e-01 3.732e-03 4.426e-03 5.460e-05 1.000e+00 4.313e-04 1.063e-03 9.334e-06
SVM - - - - - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGS - - - 1.000e+00 - - 1.000e+00 3.791e-01 - 8.924e-01 1.000e+00 2.518e-01 2.871e-01 8.980e-03 1.000e+00 4.668e-02 9.489e-02 2.149e-03
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - - 1.000e+00 - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
MLP - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - 1.000e+00
MLPGS - - - 1.000e+00 - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.021e-01 - 4.190e-01 7.643e-01 2.954e-02
MLPGS+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -
Table 4.10 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same F1 score. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
led to a higher mean F1 score and that the mere use of an RF may be sufficient
as far as the F1 score is concerned.
Nevertheless, it should be noted that while RFGA+ is not the system with the
highest F1 score, it has a narrower confidence interval than that system while
also having the highest lower bound of the F1 confidence interval. In particular,
the F1 confidence interval of RFGA+ ranged from 0.878 to 0.905, whereas that
of the system achieving the highest mean F1 ranged from 0.872 to 0.912.
Also, it should be noted that, counting the systems over which each system
achieved a significantly higher F1 score, RFGA+ was the one with the highest
count regardless of whether the significance is measured at the 90%, 95% or
99% confidence level (see table 4.10 for the exact systems and the corresponding
p-values).
4.3.4 Features
System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - - 1.489e-06 8.590e-06 - 1.000e+00 3.318e-02 1.553e-04 1.000e+00 - 1.797e-05 4.691e-04
RFGS+ - - 1.387e-06 8.032e-06 - 1.000e+00 3.169e-02 1.462e-04 1.000e+00 - 1.683e-05 4.428e-04
RFGA - - - - - - - - - - - -
RFGA+ - - 1.000e+00 - - - - - - - 1.000e+00 -
SVMGS 1.000e+00 1.000e+00 1.155e-06 6.754e-06 - 1.000e+00 2.814e-02 1.251e-04 1.000e+00 - 1.421e-05 3.814e-04
SVMGS+ - - 6.625e-06 3.526e-05 - - 8.632e-02 5.528e-04 1.000e+00 - 7.122e-05 1.574e-03
SVMGA - - 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA+ - - 1.000e+00 1.000e+00 - - - - - - 1.000e+00 1.000e+00
MLPGS - - 2.830e-05 1.388e-04 - - 2.139e-01 1.882e-03 - - 2.705e-04 5.051e-03
MLPGS+ 1.000e+00 1.000e+00 8.332e-08 5.564e-07 1.000e+00 1.000e+00 4.941e-03 1.305e-05 1.000e+00 - 1.241e-06 4.386e-05
MLPGA - - 1.000e+00 - - - - - - - - -
MLPGA+ - - 1.000e+00 1.000e+00 - - - - - - 1.000e+00 -
Table 4.11 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same number of selected features. The symbol
"-" is used when system A does not have a lower mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
4.3.5 Calls
System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - 1.000e+00 - - - 1.000e+00 - - 1.000e+00 - - -
RFGS+ - - - - - 1.000e+00 - - 1.000e+00 - - -
RFGA 4.709e-05 5.017e-05 - - 5.901e-05 1.158e-05 1.000e+00 1.000e+00 2.581e-06 5.041e-04 1.000e+00 1.000e+00
RFGA+ 3.820e-05 4.072e-05 1.000e+00 - 4.795e-05 9.298e-06 1.000e+00 1.000e+00 2.050e-06 4.167e-04 1.000e+00 1.000e+00
SVMGS 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 - - -
SVMGS+ - - - - - - - - 1.000e+00 - - -
SVMGA 9.034e-06 9.661e-06 - - 1.147e-05 2.050e-06 - 1.000e+00 4.209e-07 1.118e-04 1.000e+00 1.000e+00
SVMGA+ 7.050e-04 7.462e-04 - - 8.629e-04 1.999e-04 - - 5.155e-05 5.836e-03 - 1.000e+00
MLPGS - - - - - - - - - - - -
MLPGS+ 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00 - - 1.000e+00 - - -
MLPGA 1.109e-04 1.179e-04 - - 1.379e-04 2.848e-05 - 1.000e+00 6.636e-06 1.097e-03 - 1.000e+00
MLPGA+ 7.799e-03 8.201e-03 - - 9.326e-03 2.546e-03 - - 7.583e-04 5.009e-02 - -
Table 4.12 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same number of calls. The symbol "-"
is used when system A does not have a lower mean value than system B. A
p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
Chapter 5
Discussion
This shows that RFGA+ is the best performing system in terms of MCC and
that one can be 99% certain that, in another trial, RFGA+ will provide a higher
MCC than several of the systems tested in this thesis.
Similarly, while RFGA+ has not offered a significantly higher accuracy
than all systems, regardless of whether the significance is measured at a
confidence level of 90%, 95% or 99%, it should be noted that RFGA+ has
achieved a significantly higher accuracy than several other systems on both the
voice and the drawing data at a confidence level of 99%. This shows that RFGA+
is the best performing system in terms of accuracy and that one can be 99%
certain that, in another trial, RFGA+ will provide a higher accuracy than several
of the systems tested in this thesis.
While RFGA+ does not have the highest mean F1 score on the voice data, it
should be noted that it has offered the highest mean F1 score on the drawing
data and that, even on the voice data, it has achieved the highest lower bound
of the F1 score confidence interval.
More importantly, it should be noted that, regardless of whether we look at
the drawing or the voice data, no system has offered a significantly higher score
than RFGA+ in any metric at the 90% confidence level.
Again, it should be noted that precision and the recall rate do not take the
distribution of the data into account, while MCC does. Hence, given that the
data sets used in this study are imbalanced and that RFGA+ is the system with
the highest MCC on both the drawing and the voice data, RFGA+ should be
the one to go for as far as the predictive performance is concerned.
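For reference, the thesis defines its metrics in section 3.6.2; the standard definition of MCC in terms of the confusion-matrix counts (assumed here to match the one used there) is

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Because all four cells of the confusion matrix enter the formula, MCC stays low when the minority class is systematically misclassified, even when accuracy, precision or recall look good on an imbalanced data set.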
As regards why RFGA+ was the best performing system, the reasons could
be the following (a minimal code sketch of such a system is given after the list):
1. RF is a decision-tree based classifier following the logic of a typical
clinical diagnosis where the diagnosis is made based on a decision-tree-
like thought map;
2. RF is an ensemble classifier where the use of multiple weaker classifiers
makes it stronger than a non-ensemble classifier;
3. RF is better suited to GA than to GS, as RF can handle many features
while GS tends to get stuck at a local optimum in its attempt to provide
good enough performance with as few features as possible;
4. The data sets used in this study are imbalanced, meaning that, without
the use of an under-/oversampling method like random oversampling,
the system may tend to classify samples as the majority class.
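To make the combination concrete, below is a minimal sketch of what an RFGA+-style system could look like, built on scikit-learn and imbalanced-learn: random oversampling of the training data, a small genetic algorithm over binary feature masks scored by cross-validated MCC, and a random forest trained on the selected features. This is an illustration under stated assumptions, not the implementation used in this thesis; the GA operators (truncation selection, one-point crossover, bit-flip mutation), the population size, the mutation rate and the synthetic data set are all assumptions.

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score, train_test_split

def fitness(mask, X, y):
    # Fitness of a feature subset: cross-validated MCC of an RF restricted to the subset.
    if not mask.any():
        return -1.0
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=5,
                           scoring=make_scorer(matthews_corrcoef)).mean()

def ga_select(X, y, pop_size=20, generations=10, mutation_rate=0.05, seed=0):
    # Tiny GA over binary feature masks: truncation selection, one-point crossover,
    # bit-flip mutation. All settings here are illustrative assumptions.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    population = rng.integers(0, 2, size=(pop_size, n)).astype(bool)
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        parents = population[np.argsort(scores)[::-1][: pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                      # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            children.append(child ^ (rng.random(n) < mutation_rate))
        population = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[scores.argmax()]

# Synthetic, imbalanced stand-in data; the thesis uses real drawing/voice features.
X, y = make_classification(n_samples=200, n_features=30, weights=[0.75], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)   # the "+" step
mask = ga_select(X_res, y_res)                                              # the "GA" step
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_res[:, mask], y_res)
print("test MCC:", matthews_corrcoef(y_te, rf.predict(X_te[:, mask])))

The GA actually used in the thesis may well differ in its operators and stopping criterion; the sketch only illustrates how the three components fit together.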
The first and second statements are supported by the observation that, both
for the drawing and the voice data, most systems utilising RF have provided
of the drawing and the voice data (see table 4.2 and 4.8). Hence, one may want
to use SVMGS for faster prediction and lower overall resource consumption
while still having the possibility of reaching predictive performance as high as
that of the system with the highest MCC.
As regards whether it is worth the time and computational resources to in-
clude feature selection for higher predictive performance, it may be important
to note that, amongst the systems not utilising feature selection, RF+ was the
one with the highest MCC on the drawing data whereas RF was the one with
the highest MCC on the voice data. In particular, since RFGA+ did not provide
a significantly higher MCC than RF+ or RF regardless of whether we look at the
drawing or the voice data (p = 1.0), one may want to use RF+ for the drawing data
and RF for the voice data to avoid the need for feature selection. Nevertheless,
it should be noted that while feature selection is a process that takes time and
resources to run, it can lead to faster prediction and lower resource consumption
by requiring fewer features for the prediction. Hence, even if the system cannot
provide higher predictive performance with the help of feature selection, it may
still be worth the time and resources to perform feature selection.
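By way of illustration, greedy forward selection of the kind GS performs can be approximated with scikit-learn's SequentialFeatureSelector (available with the "auto" stopping rule from scikit-learn 1.1); the sketch below is a hedged example and not necessarily how GS was implemented in this thesis, and the stopping tolerance and the MCC scoring are assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import make_scorer, matthews_corrcoef

# Greedy forward selection: repeatedly add the single feature that improves the
# cross-validated MCC the most, and stop once the improvement drops below `tol`.
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=0),
    direction="forward",
    n_features_to_select="auto",  # stop based on `tol` rather than a fixed count
    tol=1e-3,                     # illustrative stopping tolerance (assumption)
    scoring=make_scorer(matthews_corrcoef),
    cv=5,
)
# X_selected = selector.fit_transform(X_train, y_train)  # X_train, y_train: placeholders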
Similarly, as regards whether it is worth the time and computational
resources to include random oversampling for higher predictive performance,
it may be important to note that, amongst the systems not utilising random
oversampling, RF was the one with the highest MCC on both the drawing and
the voice data. Again, given that RFGA+ did not provide a significantly higher
MCC than RF on any of the data sets at a confidence level of 90% (p = 1.0),
one may want to use RF to avoid random oversampling.
help the general public save money that could otherwise be used on treatment
and other medical activities.
More importantly, the removal of doctor consultancy fees and waiting time
may motivate patients to get diagnosed earlier and thereby get treatment earlier,
for better health and productivity.
Also, one should remember that the ability to get treatment earlier will
likely help patients avoid unnecessary doctor visits and treatments in the fu-
ture. This, in turn, may serve as a way to help the general public save money
and time, as doctor visits and treatments are both costly and time-consuming.
As regards sustainability in terms of environmental impacts, it should be
noted that the automatic diagnosis of Parkinson’s disease has the benefit of
allowing the general public to test themselves from home instead of having
to travel to the hospital. This would make society more sustainable by reducing
emissions. Indeed, even though the automatic diagnosis is a resource-
consuming activity with a negative impact on sustainability, the negative im-
pact caused by one single test is likely far less than the negative impact caused
by a traditional diagnosis of Parkinson’s disease where patients have to travel to
and from the hospital only to find out that they don’t have Parkinson’s disease.
More importantly, even if the convenience enabled by the automatic diag-
nosis will lead to people testing for Parkinson’s disease more often than they
do today, one should remember that it is better to test too often than not to
test at all, especially when the test does not require the involvement of a
doctor who already has much on his/her plate.
Nevertheless, one may want to investigate ways to make the automatic di-
agnosis more sustainable by, for instance, measuring the energy consumption
of the different digital tools that can be used for the diagnosis and thereby
coming up with a recommendation. Similarly, one may want to investigate the
complexity of the systems tested in this study and thereby reach a conclusion
on whether one should go for the system with the highest predictive performance
in terms of MCC and accuracy (i.e. RFGA+) or for a computationally cheaper
alternative.
patients will get treatment earlier, as it is now possible to get an early diagnosis
and because it is now less tempting to delay the diagnosis, given that it is
free and less time-consuming. This should be of interest to all medical
professionals as earlier treatment is essential.
This thesis may also be of interest to healthcare professionals as a decision
basis for which algorithms to use for the diagnosis of PD and for how reliable
the algorithms are in terms of accuracy, precision and recall rate.
This thesis may also be of interest to the general public, since Parkinson's
disease is a common disease that can lead to disability. The general public
may thus be interested in this thesis as a way to gain knowledge of a disease
that they or their close ones may be suffering from, to understand how this
disease can be detected through automatic diagnosis, and to get an overview
of the reliability of such a diagnosis.
This thesis may also be of interest to machine learning researchers as this
is a study about the automatic diagnosis of a disease using machine learning.
In particular, this thesis may be of interest to those working with automation
and classification problems as this study involves the automatic diagnosis of a
disease as a classification problem.
Chapter 6
Conclusions and Future work
6.1 Conclusion
In conclusion, while none of the systems in this thesis have shown a signifi-
cantly higher performance than all other systems in any metric, it can be stated
that, amongst the systems tested in this thesis, the best system for the diag-
nosis of PD appears to be RFGA+, i.e. the combination of RF, GA and random
oversampling.
of the brain controlling the passive hand. Hence, if PD patients in the data
set happen to have PD affecting only their passive side, it would be wrong to
use the drawings made by their dominant hand for the prediction of PD or the
assessment of the model. This problem could, for instance, be solved by regis-
tering whether the PD patients have PD affecting their dominant side and/or
asking them to draw with both hands while registering which drawing is made
by the dominant hand. This is thus another area to explore.
Since the data sets we could find contain either speech data only or drawing
data only, we were also not able to test how the combination of speech and
drawing data would impact the performance of each model. While we could
have created new PD and healthy participants by combining the speech and
drawing data, doing so would not reflect how the performance would look
in real life, where we are to identify PD using data from the same person.
Therefore, we left this as a suggestion for future research: to collect voice and
drawing data from the same participants and test our system on that data.
Moreover, a larger data set could be gathered to verify whether our pro-
posed systems indeed are the best. After all, several systems have shown
similar results without any significant difference between them. That is, by
testing our implementations on more data, one may find a clearer difference
between the combinations.
Furthermore, since no hyper-parameter tuning was done in this thesis, a
suggestion for future work could be to find the best hyper-parameters for the
proposed system.
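As a concrete starting point for such tuning, a grid search over a few random forest hyper-parameters with MCC as the selection criterion could look like the sketch below; the grid values are illustrative assumptions, not recommendations derived from this thesis.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not tuned results from the thesis.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=make_scorer(matthews_corrcoef),
    cv=5,
)
# search.fit(X_train, y_train)                    # X_train, y_train: placeholders
# print(search.best_params_, search.best_score_)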
Last but not least, the thesis has shown that while random oversampling
may help solve the problem of data imbalance, higher performance may be
achieved through the use of a more sophisticated oversampling method that
takes the over-fitting problem faced by random oversampling into account.
This could be another area to explore.
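One example of such a more sophisticated method is SMOTE, which synthesises new minority-class samples by interpolating between existing ones instead of duplicating them; a hedged sketch using imbalanced-learn is shown below, and whether it actually outperforms random oversampling on these data sets remains to be tested.

from imblearn.over_sampling import SMOTE

# SMOTE creates synthetic minority-class samples by interpolating between a sample
# and one of its k nearest minority-class neighbours, rather than duplicating rows
# as random oversampling does. X_train and y_train are placeholders.
smote = SMOTE(k_neighbors=5, random_state=0)
# X_resampled, y_resampled = smote.fit_resample(X_train, y_train)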
Appendix A
Additional Test Results
Data      MCC        Accuracy   Precision  Recall     F1         Features   Calls
Drawing   2.936e-25  4.786e-27  4.503e-16  7.230e-27  6.660e-27  2.953e-24  1.595e-23
Voice     3.790e-31  4.911e-24  9.135e-19  1.299e-13  5.369e-20  3.554e-23  1.475e-22
Table A.1 – Friedman test results: p-values corresponding to the null hypothesis
that all systems came from a population with the same distribution.
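For reproducibility, p-values of the kind reported in table A.1 and in the Dunn's test tables can be obtained with standard Python packages; the sketch below uses scipy and scikit-posthocs on synthetic scores, and the p-value adjustment shown is an illustrative choice rather than a restatement of the procedure used in this thesis.

import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# Synthetic stand-in: one row per cross-validation run, one column per system.
rng = np.random.default_rng(0)
scores = rng.uniform(0.6, 0.95, size=(10, 18))  # 10 runs x 18 systems

# Friedman test: null hypothesis that all systems come from the same distribution.
stat, p = friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))
print(f"Friedman chi-square = {stat:.3f}, p = {p:.3e}")

# Dunn's post hoc test: pairwise p-values between systems (adjustment is illustrative).
pairwise_p = sp.posthoc_dunn([scores[:, j] for j in range(scores.shape[1])],
                             p_adjust="bonferroni")
print(pairwise_p)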
Table A.2 – Test results on drawing data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - - - 1.000e+00 - 4.964e-03 - 1.000e+00 - 4.964e-03 - 3.759e-02 1.000e+00 1.000e+00 - 5.979e-02 -
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.249e-04 1.000e+00 1.000e+00 1.000e+00 1.249e-04 1.000e+00 1.360e-03 1.000e+00 1.000e+00 - 2.365e-03 1.000e+00
RFGS 1.000e+00 - - - 1.000e+00 - 6.416e-03 - 1.000e+00 - 6.416e-03 - 4.727e-02 1.000e+00 1.000e+00 - 7.468e-02 -
RFGS+ 1.000e+00 - 1.000e+00 - 1.000e+00 - 2.390e-03 - 1.000e+00 - 2.390e-03 1.000e+00 1.954e-02 1.000e+00 1.000e+00 - 3.166e-02 -
RFGA - - - - - - 1.758e-02 - 1.000e+00 - 1.758e-02 - 1.159e-01 1.000e+00 1.000e+00 - 1.781e-01 -
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.588e-05 1.000e+00 1.000e+00 1.000e+00 1.588e-05 1.000e+00 2.078e-04 1.000e+00 1.000e+00 1.000e+00 3.779e-04 1.000e+00
SVM - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -
SVM+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 2.442e-03 - 1.000e+00 - 2.442e-03 1.000e+00 1.992e-02 1.000e+00 1.000e+00 - 3.226e-02 -
SVMGS - - - - - - 4.579e-02 - - - 4.579e-02 - 2.702e-01 1.000e+00 1.000e+00 - 4.042e-01 -
SVMGS+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.309e-03 1.000e+00 1.000e+00 - 1.309e-03 1.000e+00 1.138e-02 1.000e+00 1.000e+00 - 1.871e-02 1.000e+00
SVMGA - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -
SVMGA+ 1.000e+00 - 1.000e+00 - 1.000e+00 - 6.255e-03 - 1.000e+00 - 6.255e-03 - 4.621e-02 1.000e+00 1.000e+00 - 7.305e-02 -
MLP - - - - - - 1.000e+00 - - - 1.000e+00 - - 1.000e+00 - - 1.000e+00 -
MLP+ - - - - - - - - - - - - - - - - 1.000e+00 -
MLPGS - - - - - - 6.265e-01 - - - 6.265e-01 - 1.000e+00 1.000e+00 - - 1.000e+00 -
MLPGS+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.791e-04 1.000e+00 1.000e+00 1.000e+00 1.791e-04 1.000e+00 1.886e-03 1.000e+00 1.000e+00 - 3.253e-03 1.000e+00
MLPGA - - - - - - - - - - - - - - - - - -
MLPGA+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.629e-03 1.000e+00 1.000e+00 - 1.629e-03 1.000e+00 1.385e-02 1.000e+00 1.000e+00 - 2.266e-02 -
Table A.3 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same precision. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
Recall Rate
As shown in table 4.1, SVM, SVMGA and RF+ were the systems that achieved
the highest possible recall rate. While they have the highest recall rate, it
should be noted that there is no significant difference between these three
systems and the system achieving the highest MCC, i.e. RFGA+. In particular,
it should be noted that, at the 95% confidence level, the set of systems over
which a significantly higher recall rate was achieved is the same for SVM,
SVMGA, RFGA+, RFGA and RF+: all of them achieved a higher recall rate
than SVM+, SVMGS+, SVMGA+, MLP+, MLPGS+ and MLPGA+ with
p < 0.05 (see table A.4 for the exact p-values).
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 - 1.000e+00 - 6.556e-07 1.000e+00 2.949e-02 - 7.028e-07 1.000e+00 1.859e-04 1.000e+00 7.207e-03 4.346e-01 2.366e-05
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.678e-07 1.000e+00 1.194e-02 - 1.803e-07 1.000e+00 5.927e-05 1.000e+00 2.716e-03 2.057e-01 6.932e-06
RFGS - - - 1.000e+00 - - - 1.599e-05 1.000e+00 2.298e-01 - 1.703e-05 1.000e+00 2.624e-03 1.000e+00 6.716e-02 1.000e+00 4.122e-04
RFGS+ - - - - - - - 1.257e-04 1.000e+00 8.172e-01 - 1.334e-04 1.000e+00 1.409e-02 1.000e+00 2.703e-01 1.000e+00 2.564e-03
RFGA - - 1.000e+00 1.000e+00 - 1.000e+00 - 6.556e-07 1.000e+00 2.949e-02 - 7.028e-07 1.000e+00 1.859e-04 1.000e+00 7.207e-03 4.346e-01 2.366e-05
RFGA+ - - 1.000e+00 1.000e+00 - - - 1.040e-06 1.000e+00 3.990e-02 - 1.114e-06 1.000e+00 2.733e-04 1.000e+00 1.000e-02 5.573e-01 3.582e-05
SVM 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.678e-07 1.000e+00 1.194e-02 - 1.803e-07 1.000e+00 5.927e-05 1.000e+00 2.716e-03 2.057e-01 6.932e-06
SVM+ - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
SVMGS - - - - - - - 2.166e-03 - 1.000e+00 - 2.282e-03 1.000e+00 1.368e-01 - 1.000e+00 1.000e+00 3.102e-02
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.678e-07 1.000e+00 1.194e-02 - 1.803e-07 1.000e+00 5.927e-05 1.000e+00 2.716e-03 2.057e-01 6.932e-06
SVMGA+ - - - - - - - 1.000e+00 - - - - - 1.000e+00 - - - 1.000e+00
MLP - - - - - - - 3.578e-03 - 1.000e+00 - 3.766e-03 - 2.029e-01 - 1.000e+00 1.000e+00 4.795e-02
MLP+ - - - - - - - - - - - - - - - - - -
MLPGS - - - - - - - 4.730e-04 1.000e+00 1.000e+00 - 5.003e-04 1.000e+00 4.091e-02 - 6.454e-01 1.000e+00 8.222e-03
MLPGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - 5.926e-01 - - - 6.147e-01 - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - 1.000e+00 - - - -
Table A.4 – Dunn’s test results on the drawing data: p-values corresponding
to the null hypothesis that system A and B have the same recall. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
A.2.2 Voice
Overall Result
Table A.5 – Test results on voice data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.
Precision
As shown in table 4.7, SVMGS+ was the system with the highest mean preci-
sion. While being the one with the highest precision, it should be noted that
this system did not offer the highest MCC (see table 4.1) and that there is no
significant difference between SVMGS+ and the system providing the highest
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 - 1.000e+00 - 4.447e-03 1.000e+00 1.000e+00 - 3.415e-03 1.000e+00 5.801e-03 4.680e-01 1.000e+00 1.000e+00 3.590e-01 9.362e-04
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 3.980e-04 1.000e+00 1.000e+00 - 2.966e-04 1.000e+00 5.351e-04 7.638e-02 1.000e+00 1.000e+00 5.634e-02 7.063e-05
RFGS - - - - 1.000e+00 - 1.871e-02 1.000e+00 1.000e+00 - 1.465e-02 - 2.394e-02 1.000e+00 1.000e+00 - 1.000e+00 4.401e-03
RFGS+ 1.000e+00 - 1.000e+00 - 1.000e+00 - 6.232e-04 1.000e+00 1.000e+00 - 4.670e-04 1.000e+00 8.334e-04 1.074e-01 1.000e+00 1.000e+00 7.979e-02 1.140e-04
RFGA - - - - - - 2.294e-02 1.000e+00 1.000e+00 - 1.801e-02 - 2.926e-02 1.000e+00 1.000e+00 - 1.000e+00 5.485e-03
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.680e-04 1.000e+00 1.000e+00 - 1.240e-04 1.000e+00 2.282e-04 3.946e-02 1.000e+00 1.000e+00 2.872e-02 2.817e-05
SVM - - - - - - - - - - 1.000e+00 - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVM+ - - - - - - 2.629e-02 - 1.000e+00 - 2.068e-02 - 3.346e-02 1.000e+00 1.000e+00 - 1.000e+00 6.354e-03
SVMGS - - - - - - 4.688e-02 - - - 3.718e-02 - 5.916e-02 1.000e+00 1.000e+00 - 1.000e+00 1.189e-02
SVMGS+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.147e-04 1.000e+00 1.000e+00 - 8.429e-05 1.000e+00 1.565e-04 2.940e-02 1.000e+00 1.000e+00 2.127e-02 1.876e-05
SVMGA - - - - - - - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA+ - - 1.000e+00 - 1.000e+00 - 4.493e-03 1.000e+00 1.000e+00 - 3.451e-03 - 5.860e-03 4.715e-01 1.000e+00 1.000e+00 3.618e-01 9.466e-04
MLP - - - - - - - - - - - - - - - - - 1.000e+00
MLP+ - - - - - - - - - - - - 1.000e+00 - - - - 1.000e+00
MLPGS - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 1.000e+00 - - 1.000e+00 9.858e-01
MLPGS+ - - 1.000e+00 - 1.000e+00 - 1.138e-02 1.000e+00 1.000e+00 - 8.845e-03 - 1.465e-02 9.313e-01 1.000e+00 - 7.262e-01 2.572e-03
MLPGA - - - - - - - - - - - - 1.000e+00 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -
Table A.6 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same precision. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
MCC, i.e. RFGA+, on the confidence level of 90% and thereby also not on
95% or 99% (p = 1.0).
Nevertheless, it should be noted that both SVMGS+ and RFGA+ have achieved
a higher precision than SVM, SVMGA, MLP and MLPGA with a significant
difference at the 99% confidence level, and than two more systems at the 95%
confidence level (see table A.6 for the exact systems and the corresponding
p-values).
Recall Rate
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 9.674e-01 - 1.000e+00 - 8.837e-03 1.000e+00 2.368e-02 - 1.213e-02 1.000e+00 1.000e+00 1.000e+00 8.281e-03 1.000e+00 1.781e-01
RF+ - - - 1.000e+00 - - - 9.848e-01 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 9.402e-01 1.000e+00 1.000e+00
RFGS - 1.000e+00 - 1.000e+00 - - - 5.084e-01 - 1.000e+00 - 6.462e-01 1.000e+00 1.000e+00 - 4.840e-01 1.000e+00 1.000e+00
RFGS+ - - - - - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
RFGA 1.000e+00 1.000e+00 1.000e+00 5.822e-01 - 1.000e+00 - 4.361e-03 1.000e+00 1.213e-02 - 6.056e-03 1.000e+00 8.751e-01 1.000e+00 4.077e-03 1.000e+00 9.910e-02
RFGA+ - 1.000e+00 1.000e+00 1.000e+00 - - - 2.151e-01 1.000e+00 4.785e-01 - 2.783e-01 1.000e+00 1.000e+00 - 2.040e-01 1.000e+00 1.000e+00
SVM 1.000e+00 1.000e+00 1.000e+00 1.075e-01 1.000e+00 1.000e+00 - 4.395e-04 1.000e+00 1.369e-03 - 6.325e-04 1.000e+00 1.716e-01 1.000e+00 4.079e-04 1.000e+00 1.439e-02
SVM+ - - - - - - - - - - - - 5.463e-01 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGS - 1.000e+00 1.000e+00 1.000e+00 - - - 4.010e-01 - 8.564e-01 - 5.123e-01 1.000e+00 1.000e+00 - 3.813e-01 1.000e+00 1.000e+00
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA 1.000e+00 1.000e+00 1.000e+00 6.258e-02 1.000e+00 1.000e+00 1.000e+00 2.137e-04 1.000e+00 6.882e-04 - 3.109e-04 1.000e+00 1.017e-01 1.000e+00 1.980e-04 1.000e+00 7.798e-03
SVMGA+ - - - - - - - 1.000e+00 - - - - 6.932e-01 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
MLP - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - 1.000e+00
MLPGS - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 - 2.651e-02 1.000e+00 6.689e-02 - 3.570e-02 1.000e+00 1.000e+00 - 2.494e-02 1.000e+00 4.399e-01
MLPGS+ - - - - - - - - - - - - 5.201e-01 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -
Table A.7 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same recall. The symbol "-"
is used when system A does not have a higher mean value than system B. A
p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
As shown in table 4.7, SVMGA was the system with the highest recall rate,
followed by SVM and RFGA. While these are the systems with the highest
recall rate, it should be noted that the p-value corresponding to the null
hypothesis that these systems have the same recall rate as the one achieving
the highest MCC (i.e. RFGA+) was 1.0. That is, these systems did not offer a
higher recall rate than RFGA+ with a significant difference at the 90%
confidence level, and thereby also not at 95% or 99%.
TRITA-EECS-EX-2021:387
www.kth.se