
DEGREE PROJECT IN INDUSTRIAL ENGINEERING AND MANAGEMENT,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Automatic Diagnosis of
Parkinson's Disease Using
Machine Learning
A Comparative Study of Different Feature
Selection Algorithms, Classifiers and Sampling
Methods

JEANNIE HE

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Degree Programme in Industrial Engineering and Management


Date: June 17, 2021
Supervisor: Arvind Kumar
Examiner: Mårten Björkman
School of Electrical Engineering and Computer Science
Host organization: The Second Affiliated Hospital Zhejiang University School of Medicine
Swedish title: Automatisk igenkänning av Parkinsons sjukdom med hjälp av maskininlärning
Swedish subtitle: En jämförande studie av olika urvalsalgoritm, klassificerare och provtagningsmetod
© 2021 Jeannie He

Abstract
Over the past few years, several studies have been published proposing algorithms for the automated diagnosis of Parkinson's disease using simple exams such as drawing and voice exams. However, while every classifier appears to have been outperformed by another classifier in at least one study, there appears to be no study on how well different classifiers work with a given feature selection algorithm and sampling method. More importantly, there appears to be no study that compares the proposed feature selection algorithm and/or sampling method with a baseline that uses neither feature selection nor oversampling. This leaves us with the question of which combination of feature selection algorithm, sampling method and classifier is the best, as well as what impact feature selection and oversampling may have on the performance. Given the importance of providing a quick and accurate diagnosis of Parkinson's disease, this thesis compares different systems of classifier, feature selection algorithm and sampling method with a focus on the predictive performance. One system was chosen as the best system for the diagnosis of Parkinson's disease based on its comparative predictive performance on two sets of data: one from drawing exams and one from voice exams.

Keywords
Machine learning, Parkinson's disease, Feature Selection, Greedy Search, Genetic Algorithm, Diagnosis of Parkinson's Disease, Drawing Exams, Voice Exams

Sammanfattning
As one of the world's most common diseases with a tendency to lead to disability, Parkinson's disease has long been a focus of research. To ensure that as many patients as possible receive treatment before it is too late, several studies have been published proposing algorithms for the automatic diagnosis of Parkinson's disease. However, while every classifier appears to have been outperformed by another classifier in at least one study, there appears to be no study on how well different classifiers work with a given combination of feature selection algorithm and sampling method. Moreover, there appears to be no study in which the result of the proposed feature selection algorithm and/or sampling method is compared with the result of applying the classifier directly to the data without any feature selection or resampling. This leaves us with the question of which system of classifier, feature selection algorithm and sampling method one should choose, and whether it is worth using a feature selection algorithm and an oversampling method. Given the importance of detecting Parkinson's disease quickly and accurately, a comparison has been made to find the best combination of classifier, feature selection algorithm and sampling algorithm for the automatic diagnosis of Parkinson's disease.

Nyckelord
Machine learning, Parkinson's disease, Greedy search, Genetic algorithm, Feature selection, Diagnosis of Parkinson's disease, Drawing exams, Voice exams

Acknowledgements
I would like to thank my supervisor and examiner for their support as well
as the Second Affiliated Hospital Zhejiang University School of Medicine for
giving me this exciting project.

Contents

1 Introduction 1
  1.1 Background 1
  1.2 Problem Definition 1
    1.2.1 Original Problem 1
    1.2.2 Scientific and Engineering Issues 2
    1.2.3 Research Question 2
  1.3 Purpose and Goals 2
    1.3.1 Purpose 2
    1.3.2 Goals 2
  1.4 Research Methodology 3
  1.5 Thesis Scope 3

2 Background 5
  2.1 Parkinson's Disease 5
    2.1.1 Automatic Diagnosis of Parkinson's Disease 6
  2.2 Feature Selection 8
    2.2.1 Search Strategy 9
    2.2.2 Search Direction 9
    2.2.3 Search Heuristic 10
    2.2.4 Greedy Search Algorithm for Feature Selection 10
    2.2.5 Genetic Algorithm for Feature Selection 11
  2.3 Random Oversampling 12
  2.4 Classifiers 13
    2.4.1 Random Forest Classifier 13
    2.4.2 Support Vector Machine 13
    2.4.3 Multi-Layer Perceptron 14
  2.5 Cross-Validation 15
  2.6 Significance Testing 17
    2.6.1 Null Hypothesis and P-value 17
    2.6.2 Significance Level and Confidence Level 17
    2.6.3 Friedman's Test 18
    2.6.4 Dunn's Test with Bonferroni Correction 18
  2.7 Related Work 19
    2.7.1 Automatic Diagnosis of Parkinson's Disease Using Drawing and Voice Data 19
    2.7.2 Automatic Diagnosis of Parkinson's Disease Using Voice Data 20
    2.7.3 Automatic Diagnosis of Parkinson's Disease Using Time-based Drawing Data 21
    2.7.4 Automatic Diagnosis of Parkinson's Disease Using Final Drawings 24
    2.7.5 Automatic Diagnosis of Parkinson's Disease Symptoms Using Drawing Features 24
    2.7.6 Identification of the "Best" Drawing Features for the Automatic Diagnosis of Parkinson's Disease 25

3 Methodology 27
  3.1 Data 27
    3.1.1 Drawing Data 27
    3.1.2 Voice Data 28
  3.2 Method 28
    3.2.1 Feature Extraction 28
  3.3 Finding the Best System 31
  3.4 Feature Selection 31
    3.4.1 Problem Encoding 31
    3.4.2 Forward Greedy Search 31
    3.4.3 Genetic Search 32
    3.4.4 Random Oversampling Versus No Oversampling 36
  3.5 Classifiers 36
  3.6 Validation and Testing 37
    3.6.1 Cross-validation 37
    3.6.2 Metrics 39
    3.6.3 Confidence Intervals 40
    3.6.4 Significant Testing 41
  3.7 Programming Language and Library 41
  3.8 Hyperparameter Settings 41

4 Results 43
  4.1 Clarification of the Names 43
  4.2 Results on Drawing Data 43
    4.2.1 MCC 44
    4.2.2 Accuracy 44
    4.2.3 F1 Score 46
    4.2.4 Selected Features 46
    4.2.5 Calls 47
  4.3 Results on Voice Data 48
    4.3.1 MCC 48
    4.3.2 Accuracy 49
    4.3.3 F1 Score 50
    4.3.4 Selected Features 51
    4.3.5 Calls 52

5 Discussion 55
  5.1 Greedy Search Versus Genetic Algorithm 55
  5.2 The Best System 55
  5.3 Alternatives to the Best System 57
  5.4 Ethics, Economics and Sustainability 58
  5.5 Potential Parties of Interest 59

6 Conclusions and Future work 61
  6.1 Conclusion 61
  6.2 Limitation and Suggestion of Future Work 61

References 63

A Additional Test Results 71
  A.1 Friedman's Test Results 71
  A.2 Precision, Recall Rate and Dunn's Test Results 71
    A.2.1 Drawing 71
    A.2.2 Voice 73
List of Figures

2.1 A picture of a participant drawing on a spiral template. The picture is taken from Memedi et al. [1]'s study. 7
2.2 Some spiral drawings from Gupta et al. [2]'s study, each belonging to a different participant: (a) 58-year-old healthy participant, (b) 28-year-old healthy participant, (c) 56-year-old Parkinson's disease (PD) patient and (d) 65-year-old PD patient. 7
2.3 A set of waveforms from Sakar et al. [3]. The upper waveform belongs to the voice of a healthy individual; the lower waveform belongs to a PD patient. The y-axis shows the amplitude of the signal whereas the x-axis shows the timeline [3]. 8
2.4 An illustration of the classification flow of a RF classifier from the article written by Golze et al. [4]. 13
2.5 An example of how an RBF-SVM classifier sets a boundary between two classes. The image is taken from Kamarulzaini et al. [5]'s study. The samples are shaped by class; the grey ones are those close to the boundary [5]. 13
2.6 An illustration of the classification flow of an MLP classifier from the article written by Faghfour and Frish [6]. 14
2.7 An illustration of nested cross-validation (CV) with an outer loop for testing and an inner loop for validation. The white, blue, grey and black boxes are the decode, test, training and validation data respectively. At each outer fold, the decode data was partitioned into 4 inner folds so that an inner CV can be done using the decode data. The illustration is made based on the description provided by Varoquaux et al. [7]. 15
2.8 Ross [8]'s illustration of a null hypothesis test on two sample means, where µ1 and µ2 are the sample means; H0 is the null hypothesis; zα/2 and zα are two constants corresponding to the significance level in a two- and one-sided test respectively [8]. 17
3.1 A spiral drawing from a PD patient in Isenkul et al. [9]'s study. The red line is the patient's drawing; the black spiral is the template. 27
3.2 A spiral drawing from a PD patient in Isenkul et al. [9]'s study, with a red star marking the centre of the figure. 28
3.3 An illustration of the cross-over operations used in this study. The vertical lines are the cross-over points. The arrays to the left are the parents and the arrays to the right are the children. 34
3.4 An illustration of the mutation operation used in this study, assuming that the third element of a solution was randomly selected for a bit-wise negation operation. The third element was thus flipped from 1 to 0 as a result of the mutation. 35

List of acronyms and abbreviations


CNN convolutional neural network
CV cross-validation
DST dynamic spiral test
DT decision tree
GA genetic algorithm
GS greedy search
kNN k-nearest neighbours
LR logistic regression
MCC Matthews correlation coefficient
MLP multilayer perceptron
PD Parkinson's disease
RBF radial basis function
RF random forest
SST static spiral test
SVM support vector machine

Chapter 1

Introduction

Parkinson's disease (PD) is a common neurological disorder affecting millions of elderly people and classified as a major cause of global disability. The increasing prevalence of PD adds a burden not only to the patients and their families but also to society. This calls for the early detection of PD to ensure early treatment, both as a remedy to the global disability problem [10] and as a way to prolong PD patients' lifespan as fully functional people [2].

To enable early detection of PD, a potential first step is to find an inexpensive way to predict PD accurately and quickly. This can be of value both to those who turn to the hospital and to those who are at risk of developing PD.

1.1 Background
Although many studies report promising results, it remains unclear which approach is the best: they all use different data, and there is the problem of researchers using the same data for hyper-parameter tuning and testing, which makes the performance look better than it is [11]. Hence, it is in the interest of the Second Affiliated Hospital Zhejiang University School of Medicine to implement and compare the best-performing components from recent studies as a basis for developing a software program for the screening of PD.

1.2 Problem Definition


1.2.1 Original Problem
The original problem of this thesis is to develop and evaluate a program for the automatic diagnosis of PD that is as accurate as possible without using too many computing resources. By solving this, we will help hospitals lower their workload while making sure that those who suffer from PD can be detected before it is too late.

1.2.2 Scientific and Engineering Issues


The engineering issue is to propose a new system for the diagnosis of PD by implementing and comparing different combinations of feature selection algorithm, sampling method and classifier. The scientific issue is to test and compute the performance metrics of each combination.

1.2.3 Research Question


The research question of this thesis is which of the existing classifiers, feature selection algorithms and sampling methods can, combined as a system of one classifier, one feature selection algorithm and one sampling method, provide the highest predictive performance on the automatic diagnosis of PD using digital exams. The predictive performance was measured primarily using the Matthews Correlation Coefficient (MCC), as the data sets in this thesis are imbalanced and MCC is a metric that takes class imbalance into account [12].

1.3 Purpose and Goals


1.3.1 Purpose
The purpose of this thesis is to propose a system for the diagnosis of PD upon which hospitals can develop a screening test for PD. This way, people can test for PD themselves without having to go to the hospital. This, in turn, means that doctors will have time for other medical activities while patients with PD can be detected and treated earlier.

1.3.2 Goals
The goal of this project is to implement and evaluate our proposal to see how well it performs. This has been divided into the following sub-goals:

1. Find a system that can diagnose PD with higher performance than the mere use of an existing classification algorithm.

2. Extract as many features as possible from the drawing data, taking inspiration from recent studies.

3. Implement and compare different systems of feature selection algorithm, sampling technique and classifier.

1.4 Research Methodology


An experiment was conducted to answer the research question by testing, measuring and analysing the performance of different choices of feature selection algorithm, sampling technique and classifier. More details about the experiment can be found in chapter 3.

1.5 Thesis Scope


Since the aim of this thesis is to propose a system that could be used by the general public, it is in the host organisation's interest that the system provide high performance on data that are easy to collect without a visit to the hospital. Besides being exams that can easily be conducted through a mobile device, drawing and voice exams can capture the most common symptoms of PD: drawing exams for bradykinesia, tremor, rigidity and cognition problems [2]; voice exams for voice impairment and hesitation. For this reason, drawing and voice exam data were perceived as suitable data for this thesis.

As we could not find any data that registers the voice and drawing exam results from the same group of people, we had to test our models on the two types of data independently.
Given that the data sets used in this thesis are imbalanced and MCC is a metric that takes class imbalance into account [12], the focus has been laid upon MCC. In particular, no discussion was made upon the precision and recall rate, as the inclusion of these two metrics might make the report too complicated. This selection was made based on the consideration that these two metrics share neither MCC's ability to take class imbalance into account [12] nor accuracy's and F1 score's ability to allow for comparison with other studies through their popularity amongst related studies. Therefore, these metrics were only briefly mentioned in the results as a way to demonstrate, without making the report too complicated, each system's performance in aspects not covered by the other metrics, i.e. the likelihood that a patient has PD given that a system says so, and the probability that a system will identify a patient as having PD given that the patient has PD.
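To make the metrics concrete, the snippet below computes them with Scikit-Learn on made-up predictions for an imbalanced binary toy problem; the labels are illustrative values, not data from this study:

```python
from sklearn.metrics import (matthews_corrcoef, accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy ground truth and predictions (1 = PD, 0 = healthy); the class
# imbalance (eight PD cases, two healthy) mimics an imbalanced data set.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

mcc = matthews_corrcoef(y_true, y_pred)  # takes class imbalance into account
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)   # P(patient has PD | system says PD)
rec = recall_score(y_true, y_pred)       # P(system says PD | patient has PD)
print(f"MCC={mcc:.3f} accuracy={acc:.3f} F1={f1:.3f}")
```

Note how accuracy (0.80) and F1 (0.875) remain high while MCC is only 0.375: errors on the minority class weigh heavily in MCC, which is why it serves as the primary metric for imbalanced data.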
For the sake of simplicity and time, the thesis scope was limited to using only one classifier, one sampling technique and one feature selection algorithm in each system. For the same reason, no analysis was made of the result per epoch for any combination, and no hyper-parameter tuning was conducted. Instead, all classifiers were trained using the default hyper-parameters provided by Python's Scikit-Learn library, and the feature selection algorithms were implemented to run until the validation score stops improving.

Moreover, since this is a thesis for a Master's degree in Computer Science, the medical, environmental, economic and ethical aspects were discussed to a limited degree.

Furthermore, given that the voice and drawing data came from different sources where our knowledge of the participants is limited, no discussion was made of whether the voice data are better or worse than the drawing data.

Finally, considering the overrated performance problem mentioned by [13], no comparison was made between the results of this study and those of other studies.

Chapter 2

Background

2.1 Parkinson’s Disease


PD is an age-related degenerative neurological disorder occurring due to the impairment or death of nerve cells in the substantia nigra, an area in the brain responsible for the control of body movement [14].

While the disease is incurable, early treatment is important for the long-term relief of PD-related symptoms. Depending on the stage and symptoms, various treatments are available, including medications, physical activities, occupational therapies, speech-language therapies and surgery. Amongst the available treatments, medical treatments are the most common and can be used for helping nerve cells make dopamine, mimicking dopamine, blocking a certain enzyme from breaking down dopamine in the brain, or relieving certain symptoms of PD [14].
The most common symptoms of PD can be summarized as follows:

• Bradykinesia: The slowness of movement, a prerequisite for the diagnosis of PD [15].

• Tremor: A conditional prerequisite of PD [15] which may occur in both sides of the body or only one side [14].

• Rigidity: The stiffness in muscle resulting in the inability to move freely [16].

• Postural change: The fourth main symptom of PD [16]. It includes the inability to maintain stability, which often leads to falls (postural instability) [17], and the stooping posture that is a common symptom of PD [18].

• Coordination problems: Difficulty in initiating movements and coordinating movements [14].

• Gait impairment: An abnormal walking pattern manifested in reduced walking speed, reduced step length, increased axial rigidity and/or inability to maintain walking speed [19].

• Voice impairment: One of the most common symptoms amongst early PD patients [20]. Characterized by a hoarse voice, reduced volume, restricted pitch variability (monotone), imprecise articulation (slurred speech) and unstable speech rate [21].

While all the aforementioned symptoms are common amongst PD patients, bradykinesia, tremor and rigidity (stiffness) are the cardinal symptoms of PD [14] and thereby also the symptoms commonly used for the diagnosis of PD [22]. In fact, according to Tysnes et al. [15], a patient can only be diagnosed with PD if the patient shows signs of bradykinesia along with tremor and/or rigidity.

For this reason, one common approach to diagnosing PD is the so-called drawing exam, where patients are asked to draw a spiral or similar so that the doctor can look for the three cardinal signs of PD in the patients' drawing patterns [2]. Indeed, a drawing exam can be particularly suitable for the detection of bradykinesia, a symptom that can be manifested in slow motion and/or uncomfortably small handwriting (micrographia) [23] which may progress during writing (progressive micrographia) [24].

2.1.1 Automatic Diagnosis of Parkinson’s Disease


The automatic diagnosis of PD has gained the interest of many as a way to enable a more accurate and efficient diagnosis of PD while lowering the workload of medical professionals.

Drawing Exams
As mentioned earlier, the drawing exam is often a part of the clinical diagnosis of PD thanks to its ability to capture the main symptoms of PD [2]. Because of this, several models have been proposed to automate this procedure. Often, the automation implies that a diagnosis is made by extracting relevant features from the drawing(s) made by the patient and then using a pre-trained model to make a prediction based on these features [25]. Figure 2.1 shows the device used in Memedi et al. [1]'s study, where the participants were asked to draw on a spiral template on a digital device. Figure 2.2 shows some drawings from Gupta et al. [2]'s study, where an algorithm was proposed to automatically diagnose PD using scanned drawings.

Figure 2.1 – A picture of a participant drawing on a spiral template. The picture is taken from Memedi et al. [1]'s study.

Figure 2.2 – Some spiral drawings from Gupta et al. [2]'s study, each belonging to a different participant: (a) 58-year-old healthy participant, (b) 28-year-old healthy participant, (c) 56-year-old PD patient and (d) 65-year-old PD patient.

Voice Exams
Although not as widely used as drawing exams in practice, voice exams have drawn the attention of several machine learning scholars as a way to enable the early detection of PD. This is both because vocal impairment is amongst the most common symptoms of early PD patients [20] and because early PD patients' vocal abnormalities might be too vague to be perceptible to humans. In other words, it is hypothesized that the automatic diagnosis of PD through voice exams can help one detect PD earlier and is thereby worth investigating [26].
8 | Background

Figure 2.3 – A set of waveforms from Sakar et al. [3]. The upper waveform belongs to the voice of a healthy individual; the lower waveform belongs to a PD patient. The y-axis shows the amplitude of the signal whereas the x-axis shows the timeline [3].

Unlike drawing exams, a voice exam allows one to diagnose PD through vocal impairment. Typically, this is done by converting the patient's voice to a waveform and then using a pre-trained model to make a prediction based on the features extracted from that waveform [26]. To demonstrate this, figure 2.3 shows a plot from Sakar et al. [3]'s study where the waveform belonging to a healthy person was compared against the waveform belonging to a PD patient.
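As a toy illustration of the first half of this pipeline, the sketch below synthesizes a one-second waveform in place of a real recording and computes two simple amplitude features; the frame length and the features themselves are illustrative choices, far simpler than the acoustic features used by Sakar et al. [3]:

```python
import numpy as np

# Synthetic one-second "voice" signal: a 150 Hz tone plus noise, standing
# in for a waveform decoded from a real recording.
rate = 16_000
t = np.arange(rate) / rate
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 150.0 * t) + 0.05 * rng.standard_normal(rate)

# Split into 10 ms frames and measure the loudness (RMS amplitude) of each;
# the variability of the loudness is a crude proxy for voice instability.
frames = signal.reshape(-1, 160)
rms = np.sqrt((frames ** 2).mean(axis=1))
features = {"mean_rms": float(rms.mean()), "std_rms": float(rms.std())}
print(features)
```

A real system would replace the synthetic signal with decoded audio and feed a much richer feature vector to the pre-trained model.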

2.2 Feature Selection


Typically, the automatic diagnosis of PD involves the extraction of features that are used as indicators of PD. For instance, one may want to measure the standard deviation of the patient's drawing speed as an indicator of tremor [1], or the average value of the patient's drawing speed as an indicator of rigidity and bradykinesia [25]. The list can go on and on, making it hard to determine which features are essential for the predictive power of the diagnosis. Indeed, finding the optimal set of features is essential for both the efficiency and the predictive power of a feature-based machine learning algorithm [27]. Information takes time and resources to process, making high dimensionality a critical problem for an algorithm designed for handling Big Data [28]; at the same time, too high dimensionality may leave the model with too few samples per dimension [29]. As a result, the model would be easily influenced by data that are less relevant or noisy, thus leading to overfitting [27] and high variance [30]. This gives feature selection the potential to improve the performance of a machine learning algorithm by removing the part of the data that is noisy, redundant and irrelevant [29].

To conduct a feature selection, one must decide upon the search strategy, the search direction, the search heuristic and the stopping criterion [31].

2.2.1 Search Strategy


As mentioned by Kumar et al. [27], there are three main categories of search strategies that can be used in a feature selection methodology:

1. Sequential search. This refers to methods that select features by sequentially adding and/or removing elements, such as the greedy search algorithm. While simple, this strategy has the drawback of tending to get stuck at a local optimum.

2. Exponential search. This refers to methods that explore different subsets of features either through exhaustive search or through a heuristic. This strategy is generally avoided due to its need for extensive computing resources as well as long computing time.

3. Random search. This refers to methods with random elements in the search, such as the genetic algorithm, which starts with randomly selected features and proceeds through recreations and selections with random elements. This strategy has gained several researchers' attention as an approach to avoid getting stuck in a local optimum without having to go through all possibilities.

2.2.2 Search Direction


Having decided on the search strategy, one must also decide the search direction. In general, the search direction can be one of the following [27]:

1. Forward. This refers to methods that start with an empty list and gradually add new features without changing previous choices.

2. Backward. This refers to methods that start with the full data and sequentially remove features without changing previous choices.

3. Compound. This refers to methods that combine forward selection with backward elimination without including any random element.

4. Random. This refers to methods that use an algorithm to randomly decide whether to add or remove an element.

2.2.3 Search Heuristic


Depending on whether one wants the feature selection to be classifier-specific, one may choose a different type of measure as the evaluation criterion.

Typically, a non-classifier-specific feature selection method involves the use of a metric that is representative of the features' importance without involving any learning algorithm. Some examples of such metrics are information gain, χ² and the correlation coefficient [31].

Compared to non-classifier-specific feature selection methods, a classifier-specific feature selection method has the potential of leading to a higher prediction performance by adapting the selection to the learning algorithm. This, however, comes with a cost: the selection algorithm becomes classifier-specific, and the performance of the feature selection algorithm becomes dependent on the classifier. In particular, if the classifier is supervised, the feature selection may take a long time to complete due to its need for model training and testing [27].

Nevertheless, a classifier-specific feature selection method would typically use an evaluation criterion like accuracy, where a model is trained and evaluated against a validation data set. To minimize the impact of data partitioning, such metrics are often computed with the help of CV [27].
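As an illustration, a classifier-specific (wrapper) criterion of this kind can be sketched with Scikit-Learn's cross-validation utilities; the synthetic data set and the choice of a random forest below are illustrative assumptions, not the setup of this thesis:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for exam data: 120 samples, 10 candidate features.
X, y = make_classification(n_samples=120, n_features=10, n_informative=4,
                           random_state=0)

def subset_score(feature_idx):
    """Wrapper criterion: mean 5-fold CV accuracy of a classifier
    trained only on the candidate feature subset."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, feature_idx], y, cv=5).mean()

score = subset_score([0, 1, 2])   # evaluate one candidate subset
print(round(score, 3))
```

A search algorithm would call `subset_score` once per candidate subset, which is exactly why wrapper methods can become expensive for supervised classifiers.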

Stopping Criteria
To get an output, one must also define a stopping point for the algorithm. To do this, one can 1) set a maximum number of iterations; 2) set a limit on the number of features; 3) let the algorithm stop after exploring all alternatives; 4) stop the algorithm when the outcome stops improving; 5) stop the algorithm when the change in outcome becomes insignificant; and/or 6) stop the algorithm when the result is "good enough", i.e. the evaluation measure has reached a certain value [27].

2.2.4 Greedy Search Algorithm for Feature Selection


Traditionally, a greedy search algorithm is an algorithm characterized by al-
ways choosing the alternative that leads to the best outcome at the current step
without changing any choice made by the previous steps. While this may not
Background | 11

lead to the global optimum solution, the aim is thus to find a solution that is
as good as possible within a reasonable amount of time [32].
When used for feature selection, the greedy search algorithm can be di-
vided into two categories. The first one is the forward greedy search algorithm,
where the solution is initialized as an empty set to be gradually populated by
adding those features that lead to the best outcome. In the backward greedy
search algorithm, the solution is instead initialized as the entire data set to have
it gradually reduced by removing those with the least positive impact on the
outcome [33].
The latter is discouraged by Zhang et al., partly because it can be
computationally costly to start with all features, and partly because it carries
a higher risk of leading to high dimensionality: a more informative feature may
be removed at the start because its information overlaps with that provided by
other, less informative features [33].
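Assuming scikit-learn is available, both directions can be sketched with its `SequentialFeatureSelector`; the synthetic data set and logistic regression estimator below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Placeholder data set with 10 features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
est = LogisticRegression(max_iter=1000)

# Forward: start from an empty set and repeatedly add the feature
# that improves cross-validated accuracy the most
fwd = SequentialFeatureSelector(est, n_features_to_select=3,
                                direction="forward", cv=5).fit(X, y)

# Backward: start from all 10 features and repeatedly drop the
# feature whose removal hurts cross-validated accuracy the least
bwd = SequentialFeatureSelector(est, n_features_to_select=3,
                                direction="backward", cv=5).fit(X, y)
```

Note that the two directions generally explore different subsets and need not agree on the selected features.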

2.2.5 Genetic Algorithm for Feature Selection


As an evolutionary algorithm, the genetic algorithm (GA) is an algorithm that
simulates natural selection by repeatedly generating and evaluating new
solutions such that only those that are good enough survive and become the
parents of new solutions [34]. By giving better solutions a higher probability
of survival and mating while introducing random elements through mutation, the
genetic algorithm can be suitable for problems where an exhaustive search is
practically infeasible [35] and where it is difficult to estimate the location
of the optimal solution in advance [36].

Population Initialization
In general, a GA is initialized by randomly generating a population of indi-
viduals, each represented by a chromosome in the form of a sequence of val-
ues corresponding to a possible solution to the problem. With the population
initialized, a cycle of natural selection, cross-over and mutation can then be
conducted as an imitation of the natural evolution until the stopping criterion
is met [34].

Selection
Each iteration of the evolution process often starts by selecting a set of
solutions for recombination so that new solutions can be generated. Typically,
this is done through a mechanism that gives solutions with higher performance in
the problem space, also known as fitness, a higher probability of being chosen
for recombination. One example is the roulette wheel selection mechanism, where
a solution's probability of being chosen is proportional to its fitness [34].

Cross-Over
With the parents chosen, a set of new individuals is generated through a process
called cross-over. Commonly, this is done by, for each pair of parents, swapping
some elements between the parents. For instance, one common approach is the
single-point cross-over, where a cross-over point is randomly chosen such that
two new solutions can be generated by swapping the elements situated to the
right of the cross-over point. Another common approach is the two-point
cross-over, where two cross-over points are randomly chosen such that two new
solutions can be generated by swapping the elements situated between the chosen
cross-over points [34].

Mutation
With the new solutions produced, the algorithm generally continues with a
process called mutation. In the case of binary encoding, this is commonly done
by randomly performing a bit-wise negation on one or more bits in the current
solution [34].

Elitism
Often, researchers would ensure that the best solution so far is never lost during
the process by passing the best solution(s) to the next generation. This concept
is called elitism [34].
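A minimal binary-encoded GA combining the four mechanisms above (roulette-wheel selection, single-point cross-over, bit-flip mutation and elitism) might look as follows. The `genetic_search` function and its bit-counting fitness are illustrative inventions, not an implementation from [34].

```python
import random

def genetic_search(n_bits, fitness, pop_size=20, generations=40,
                   p_mut=0.05, seed=0):
    """Minimal binary-encoded GA: roulette-wheel selection, single-point
    cross-over, bit-flip mutation and elitism (the best individual survives)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        elite = pop[scores.index(max(scores))][:]          # elitism

        def roulette():                                    # fitness-proportional pick
            r = rng.uniform(0, sum(scores))
            acc = 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]

        nxt = [elite]
        while len(nxt) < pop_size:
            p1, p2 = roulette(), roulette()
            cut = rng.randrange(1, n_bits)                 # single-point cross-over
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Toy fitness: number of 1-bits, so the optimum is the all-ones string
best = genetic_search(12, sum)
```

For feature selection, each bit would instead indicate whether the corresponding feature is included, and the fitness would be a cross-validated performance score.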

2.3 Random Oversampling


While stratified k-fold CV can be used to mitigate biased evaluation [7], random
oversampling, a technique that randomly replicates samples belonging to the
minority class in the training data set, can be used to prevent machine learning
algorithms from being biased towards the majority class [37]. Although simple
and associated with a risk of overfitting [38], random oversampling has shown
satisfactory performance in empirical studies. For this reason, random
oversampling has been one of the most popular resampling strategies [39].
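A minimal NumPy sketch of the idea is given below; the `random_oversample` helper is hypothetical, and the imbalanced-learn library offers an equivalent `RandomOverSampler`.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows of the training set
    until both classes are equally frequent (binary case)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)   # sample with replacement
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)          # imbalanced: 8 vs 2
X_bal, y_bal = random_oversample(X, y)
```

Note that the replication is applied to the training data only, so that the test data remains untouched.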
2.4 Classifiers
2.4.1 Random Forest Classifier

Figure 2.4 – An illustration of the classification flow of a RF classifier from
the article written by Golze et al. [4].

Random Forest (RF) is an ensemble learning method in which multiple decision
trees are built by, for each tree, randomly selecting some features and then
using those features to train the nodes in the tree [40]. A decision tree is a
hierarchy of rules built by learning the feature values in the training data
[41]. Using this hierarchy, a decision can be made by each decision tree. By
conducting a majority vote, a classification can then be made by an RF
classifier using its underlying decision trees [40] (see figure 2.4).
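Assuming scikit-learn, the description above can be illustrated as follows; the synthetic data stands in for a real feature-based PD data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature-based PD data set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 trees considers a random subset of features at every split
# (max_features="sqrt"); the forest predicts by majority vote over the trees
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
```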

2.4.2 Support Vector Machine

Figure 2.5 – An example of how an RBF-SVM classifier sets a boundary between
two classes. The image is taken from Kamarulzaini et al. [5]'s study. The
samples are shaped by class; the grey ones are those close to the boundary [5].
A Support Vector Machine (SVM) is a learning method that uses a mathematical
formula to set a boundary between two classes and thereby solve the binary
classification problem [42]. Compared to an ordinary SVM, a Radial Basis
Function (RBF)-SVM is an SVM that uses a Gaussian kernel to enable the
separation of two classes that are linearly inseparable [43]. Figure 2.5 shows
an example of how a classification can be made by an RBF-SVM classifier.
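The value of the Gaussian kernel can be illustrated on a linearly inseparable toy problem (two concentric circles), again assuming scikit-learn:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)   # Gaussian kernel

lin_acc = linear.score(X, y)    # typically near chance level
rbf_acc = rbf.score(X, y)       # typically near perfect
```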

2.4.3 Multi-Layer Perceptron

Figure 2.6 – An illustration of the classification flow of an MLP classifier from
the article written by Faghfour and Frish [6].

A Multilayer Perceptron (MLP) classifier is a feedforward artificial neural
network consisting of an input layer and an output layer with at least one
hidden layer in between. Each node value outside the input layer is computed as
a weighted sum of the node values from the previous layer; the input layer
accepts the values that are propagated forward to produce predicted labels at
the output layer. The concept is to train the network using a cycle of forward
and backward passes. In the forward pass, values are propagated from the input
layer to the output layer to predict the labels of the samples. By comparing the
resulting labels with the true labels, a backward pass can then be used to make
adjustments, typically by adjusting the weights using the partial derivatives of
a certain error function. With the network adjusted, a new forward pass can then
be triggered to make new predictions [44]. Figure 2.6 shows an example of an MLP
classifier.
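Assuming scikit-learn, a minimal MLP with one hidden layer, trained through the forward/backward-pass cycle described above, could be set up as follows (the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)     # MLPs are sensitive to feature scale

# One hidden layer of 32 units between the input and output layers;
# fitting alternates forward passes (predictions) with backward passes
# (gradient-based weight updates)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X, y)
acc = mlp.score(X, y)
```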
Figure 2.7 – An illustration of nested CV with an outer loop for testing and an
inner loop for validation. The white, blue, grey and black boxes are the decode,
test, training and validation data respectively. At each outer fold, the decode
data is partitioned into 4 inner folds so that an inner CV can be done using
the decode data. The illustration is made based on the description provided by
Varoquaux et al. [7].

2.5 Cross-Validation
In Machine learning, one main challenge is the bias-variance trade-off where
adapting the model too much to the training data would lead to variance due to
overfitting, whereas the contrary can lead to bias due to underfitting [45]. Both
are components of total expected error, where bias reflects how the estimated
value differs from the true value whereas variance reflects how the predicted
value differs depending on training data [46]. To address the bias-variance
trade-offs, one popular tool is CV [47]. By randomly partitioning the data
into folds and by, for each fold, use the other folds for model training and the
fold for testing [7], CV can serve as a better tool for performance evaluation
[48] and hyperparameter tuning than regular validation [47].
Firstly, by evaluating the model on each combination of training and test data,
CV can reduce the risk of the performance being affected by how the data is
partitioned, especially after repeated use of CV [48]. Secondly, using CV for
hyperparameter tuning means that one can set the hyperparameters based on the
results from using different combinations of samples as training data, thereby
reducing the risk of overfitting. As a result, the model's general performance
improves. For this reason, CV is a widely used tool for performance measurement
[47] and hyperparameter tuning [48].
Nested Cross-Validation
While CV is widely used for both performance measurement and hyperparameter
tuning, using the same CV for both purposes makes the reported results
unreliable, as it means that the hyperparameters are affected by the test data
in a way that is not possible in real life [7]. In fact, Abdulaal et al. argue
in [11] that one problem with contemporary research is that several researchers
have overreported the performance of their models by using flat CV for both
hyperparameter tuning and performance measurement. As a solution, Varoquaux et
al. [7] and Abdulaal et al. [11] proposed using nested CV with an outer loop for
performance evaluation and an inner loop for hyperparameter tuning. This way,
one can avoid bias in both performance measurement and hyperparameter tuning
[7].
The nested CV starts by dividing the data into several folds. Depending on the
type of CV, the division can be done differently (see section 2.5). With the
data divided into folds, an outer loop is formed where the folds take turns
being the test data while the rest are sent into the inner loop for
hyperparameter tuning [7].
Inside each inner loop, the data is again divided into folds such that the folds
can take turns being the validation data while the rest are sent to the
classifier for training. With the classifier trained using the current
hyperparameter setting and the current training data, a prediction is made on
the current validation data, and a performance metric is computed by comparing
the classifier's prediction with the true outcome. By applying and comparing
different hyperparameter settings, the "best" model is then found based on the
validation data in the inner loop [7].
Having built the final model for the current outer step, a prediction can then
be made on the corresponding test data. By comparing the prediction with the
true values, performance metrics can then be computed, and the process continues
until there is no fold left to be tested [7].
Once all folds have been utilized as the test data for performance mea-
surement, one can compute the final performance metrics by averaging the
performance metric at each outer step [11].
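The nested scheme can be sketched with scikit-learn by wrapping a grid search (inner loop) inside an outer cross-validation; the classifier and parameter grid below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: 4-fold CV selects C for the RBF-SVM (hyperparameter tuning)
inner = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]},
                     cv=StratifiedKFold(4))

# Outer loop: 5-fold CV estimates the performance of the tuned model on
# test folds that the inner loop never saw
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(5))
final_score = outer_scores.mean()   # average over the outer test folds
```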

Stratified K-fold Cross-Validation


Stratified k-fold CV is a variant of CV. Like the traditional CV methodology,
also known as k-fold CV, stratified k-fold CV means dividing the data into k
folds to have the folds taking turns to be the test set while the rest becomes
the training set. Unlike k-fold CV, stratified k-fold CV requires the division
to be made such that the distribution of classes is about the same between the
folds. This way, the bias and variance problems commonly seen in regular CV on
imbalanced data can be mitigated, although not solved entirely [7].

(a) Two-sided Test with H0: µ1 = µ2 (b) One-sided Test with H0: µ1 ≥ µ2

Figure 2.8 – Ross [8]'s illustration of a null hypothesis test on two sample
means, where µ1 and µ2 are the sample means; H0 is the null hypothesis; zα/2
and zα are two constants corresponding to the significance level in a two- and
one-sided test respectively [8].
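The class-preserving property can be verified with scikit-learn's `StratifiedKFold` on a toy 80/20-imbalanced label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))
y = np.array([0] * 80 + [1] * 20)        # 80/20 class imbalance

# Every test fold keeps (approximately) the 80/20 class ratio
ratios = [y[test].mean() for _, test in StratifiedKFold(5).split(X, y)]
```

Here each of the five test folds contains 16 majority-class and 4 minority-class samples, so the minority-class ratio in every fold matches the overall 20%.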

2.6 Significance Testing


2.6.1 Null Hypothesis and P-value
To determine whether there is a statistically significant difference between two
algorithms, one conventional way is to define a null hypothesis H0 and then
conduct a significance test to compute the p-value corresponding to the likeli-
hood that the null hypothesis is true [8].
Depending on the null hypothesis, different approaches are used to compute the
p-value. Typically, a two-sided test is used for testing whether a sample mean
is equal to a certain value or whether two sample means are the same. Meanwhile,
a one-sided test is used if the aim is to determine whether one sample mean is
significantly larger than a specific value or than another sample mean. To
illustrate this, figure 2.8a shows a two-sided test and figure 2.8b shows a
one-sided test.
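With SciPy (the `alternative` argument requires a reasonably recent version), the two variants can be illustrated on made-up samples:

```python
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]   # made-up sample 1
b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.6, 5.8, 5.5]   # made-up sample 2

# Two-sided: H0 says the two means are equal
p_two = stats.ttest_ind(a, b).pvalue

# One-sided: H0 says mean(a) >= mean(b); here the one-sided p-value is
# half the two-sided one, since the observed effect points the tested way
p_one = stats.ttest_ind(a, b, alternative="less").pvalue
```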

2.6.2 Significance Level and Confidence Level


Having computed the p-value, the null hypothesis is to be rejected if the
p-value is less than or equal to the predefined significance level, also known
as α. Often, α is set to 0.05, but there are also cases where 0.10 is used [8].
As a threshold for the rejection of the null hypothesis, and thereby the
acceptance of the alternative hypothesis as the opposite of the null hypothesis,
the significance level α sets an upper bound on the probability that a certain
observation with stated statistical significance happened by chance. Since the
rejection of the null hypothesis is based on this probability threshold [8],
another common way to define the threshold is to state the confidence level
1 - α as the lower threshold for the likelihood that a certain observation with
stated statistical significance did not happen by chance. For instance, if the
null hypothesis is rejected on a significance level of α = 0.05, then one can
state that the alternative hypothesis is true on a 95% confidence level, meaning
that one can be 95% certain that the alternative hypothesis is true. Hence the
confidence level of 95% can also be used as a threshold as it automatically
implies α = 0.05 and so on [49].

2.6.3 Friedman’s Test


The Friedman’s test is a widely used significance test for the comparison of
multiple matched groups. Typically, the test is used to test against the null
hypothesis that all comparing groups came from a population with the same
distribution. By computing a measure based on the groups’ values, a p-value
corresponding to the null hypothesis can be computed by finding the signif-
icance level α∗ at which the measure is equal to the (1 − α∗) quantile of
the Chi-square distribution with K − 1 degrees of freedom where K is the
number of samples in each group. Once the p-value is computed, the null
hypothesis can be rejected if the p-value is less than or equal to the predefined
significance level α and a post-hoc test is to be conducted to find out the groups
with significant difference between them [50].
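With SciPy, the test can be applied to, for example, the fold-wise accuracies of three classifiers measured on the same CV folds (the numbers below are made up):

```python
from scipy.stats import friedmanchisquare

# Made-up accuracies of three classifiers on the same 6 CV folds (matched groups)
clf_a = [0.85, 0.83, 0.86, 0.84, 0.85, 0.87]
clf_b = [0.80, 0.79, 0.81, 0.78, 0.80, 0.82]
clf_c = [0.85, 0.84, 0.86, 0.85, 0.84, 0.86]

stat, p = friedmanchisquare(clf_a, clf_b, clf_c)
reject = p <= 0.05   # reject H0: all groups share the same distribution
```

Since classifier B is consistently worse across the folds, the test rejects the null hypothesis here, and a post-hoc test would then locate the differing pairs.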

2.6.4 Dunn’s Test with Bonferroni Correction


Dunn's test with Bonferroni correction is a popular post-hoc test that can be
used after the rejection of the null hypothesis in Friedman's test. By computing
a p-value for each pair of groups, the test allows one to determine whether the
observed difference between a pair of data sets is significant at a certain
confidence level. By adjusting the p-values using the Bonferroni correction,
which takes account of the family-wise type I error 1 − (1 − α)^c, where c is
the number of comparisons, it ensures that the expected error falls below the
pre-defined significance level α [50].
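As an illustration of the Bonferroni step, the sketch below multiplies each raw pairwise p-value by the number of comparisons c. Pairwise Mann-Whitney U tests serve as a simple stand-in for Dunn's rank comparisons; the third-party scikit-posthocs package provides Dunn's test proper.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Made-up scores for three groups, where only B clearly differs
groups = {
    "A": [0.85, 0.83, 0.86, 0.84, 0.85, 0.87],
    "B": [0.70, 0.69, 0.71, 0.68, 0.70, 0.72],
    "C": [0.85, 0.84, 0.86, 0.85, 0.84, 0.86],
}

pairs = list(combinations(groups, 2))
c = len(pairs)   # number of comparisons

# Bonferroni: multiply each raw p-value by c (capped at 1.0)
adjusted = {(g1, g2): min(1.0, c * mannwhitneyu(groups[g1], groups[g2],
                                                alternative="two-sided").pvalue)
            for g1, g2 in pairs}
significant = {pair for pair, p in adjusted.items() if p <= 0.05}
```

With these toy numbers, only the pairs involving group B remain significant after the correction.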

2.7 Related Work


As mentioned earlier, drawing and voice exams have been identified as suitable
tools for the automatic diagnosis of PD for different reasons. While some
prioritise drawing exams as a simple and inexpensive tool to detect PD through
the capture of the cardinal symptoms of PD [2], others prioritise voice exams as
a tool to detect PD through voice abnormalities, which are common even amongst
early PD patients [20] and yet not easily perceptible by humans [26].
The following subsections describe the studies conducted within the field of
diagnosing PD using drawing and/or voice exams.

2.7.1 Automatic Diagnosis of Parkinson’s Disease Using Drawing and Voice Data
In one study, Gupta et al. [51] proposed a logistic regression-based feature
selection algorithm for the selection of features for the automatic diagnosis of
PD. The algorithm is called the Optimized Cuttlefish Algorithm (OCFA) and is a
modified version of the logistic regression-based Cuttlefish Algorithm (CFA)
that selects a combination of features instead of only one single feature as in
CFA. To evaluate their algorithm, they applied each of the two algorithms on a
set of drawing data and a set of voice data independently. With the features
from each algorithm on each data set, they, amongst other things, tested
detecting PD using a logistic regression classifier. As a result, they showed
that OCFA yielded a better result than CFA [51].
The OCFA proposed by Gupta et al. [51] was, however, later challenged by
Sharma et al. in [52]. In [52], Sharma et al. proposed another feature selection
algorithm and compared their algorithm with OCFA for the detection of PD
using drawing and voice data independently [52].
The algorithm proposed by Sharma et al. [52] is a modified version of the Grey
Wolf Optimization algorithm (MGWO). Similar to OCFA, MGWO is an evolutionary
algorithm and was proposed for the feature-based diagnosis of PD [52]. However,
while OCFA selects features in groups [51], MGWO selects features by assessing
them independently [52]. To test their algorithm, Sharma et al. [52] tested
combining their feature selection algorithm with three different classifiers
independently: K-nearest Neighbours (kNN), RF, and Decision Tree (DT). By
comparing the results with those from OCFA, Sharma et al. [52] showed that MGWO
provided better results than OCFA.

2.7.2 Automatic Diagnosis of Parkinson’s Disease Using Voice Data
Mostafa et al. [53] tested using a multi-agent system to select 11 features out
of the 23 features in their speech data. They then used the features to train and
test five different classifiers: DT, Naive Bayes, MLP, RF, and RBF-SVM. As
a result, they found that MLP had the best results in accuracy, precision and
recall, followed by RF. Also, they found that Naive Bayes had the worst results
in accuracy, precision and recall [53].
Sakar et al. [54] tested using an ensemble of different classifiers along with
a filter-based feature selection algorithm to solve the problem of detecting PD
based on speech data. The ensemble consists of a linear SVM, an RBF-SVM, an
MLP, a Naive Bayes, an LR, an RF and a kNN classifier. To make the ensemble
work, they tested two approaches: voting and stacking. To evaluate the
performance of the two ensembles, they also tested the performance of using the
classifiers independently [54].
The feature selection algorithm is the so-called minimum redundancy-maximum
relevance algorithm (mRMR), which prioritizes features with high joint
dependency with the class. To evaluate the impact of the feature selection
algorithm, they compared the result of selecting features from the whole data
set using mRMR with the result of categorizing the features by their origin and
testing them independently without using mRMR [54].
As a result, Sakar et al. [54] found that the best result came from using
RBF-SVM on the top 50 features ranked by mRMR. Also, they found that, when
using mRMR, all their classifiers decreased in performance as they removed the
features extracted through the Tunable Q-factor Wavelet Transform [54].
Lahmiri et al. [55] tested detecting PD based on speech data using SVM together
with eight feature ranking algorithms to see which one performs best. By
testing the ranking algorithms independently, they concluded that the receiver
operating characteristic (ROC) and the Wilcoxon statistic, on average, gave
better performance than the other heuristics, with ROC being slightly better.
Also, they showed that entropy and the Bhattacharyya statistic were the ones
with the worst results. This leaves the remaining ranking algorithms, the
t-test, fuzzy mutual information and the genetic algorithm, as moderately good
ranking algorithms. Finally, they found that the best result came from
selecting 13 patterns using ROC [55].
Ali et al. [56] tested solving the problem of voice-based diagnosis of PD using
a neural network combined with a feature and hyper-parameter selection
algorithm. In short, their algorithm can be described as sorting the features
by their χ2-score and then using leave-one-subject-out CV to, in each fold,
calculate the validation loss of all combinations of hyper-parameters, with the
number of features as one of the parameters, and save the combination with the
best result. That is, the selected features will always be those with the
highest χ2-score, with the number of features determined by the validation loss
[56].
In one study, Gunduz et al. [20] tested using a convolutional neural network
(CNN) to detect PD using some manually selected combinations of features from a
set of voice data. By applying the CNN on the different combinations and by
comparing the results with those from a support vector machine, they found that
the best result comes from using their CNN on the combination of the Tunable
Q-factor Wavelet Transform (TQWT), Mel-frequency cepstral coefficients (MFCC)
and the feature generated by concatenating the baseline, vocal fold and
time-frequency features (Concat) [20].

2.7.3 Automatic Diagnosis of Parkinson’s Disease Using Time-based Drawing Data
Diagnosis Using Features Selected through Greedy Search
Kotsavasilogloua et al. [57] proposed and tested a methodology to identify PD
using a pen-and-tablet device that can record the pen’s position on the tablet’s
XY-coordinate system. In the methodology, both healthy and PD participants
were asked to draw lines using the device. By asking the participants to draw
lines, for each hand, in one direction and then the other, they collected four
sets of drawing patterns from each participant. For each set of data, as well
as for all sets combined, they computed five different metrics: the mean
velocity, the normalized velocity variability, the standard deviation of
velocity, and the signal entropy of the X- and Y-coordinates respectively. This
led to 30 metrics as features for each participant. To test the features, they
tested different combinations of algorithms for the search for the optimal
combination of features and hyper-parameters as well as for the feature
selection and classification. The classifiers tested were Naive Bayes,
AdaBoost, Logistic Regression (LR), J48, SVM and RF. Amongst the combinations,
the best result came from Naive Bayes combined with greedy search using metrics
computed by CfsSubsetEval, an algorithm that favours features with high
correlation with the class but low correlation between each other [57].

Diagnosis Using Features Selected through Ranking


Similar to Kotsavasilogloua et al. [57], Zham et al. [58] also used a device
that records the pen's location on the tablet's XY-coordinate system at time t,
only that their pen also registers the pen pressure. Also, Zham et al. [58] did
not ask the participants to draw lines using different hands and in different
directions. Instead, they restricted themselves to right-handed participants
only and asked the participants to do different tasks. Using the data from each
task, they then used the Relief-F method to select 5 features and Naive Bayes
to detect PD. By comparing the results on each task, they found that an
Archimedean-guided spiral can provide higher performance than a sequence of
letters in terms of the AUC score, F1-score, precision and error rate. In
particular, the best performance was achieved with 14 features from the guided
spiral task [58].

Diagnosis Using Features Selected through Evolutionary Search Algorithms


Gupta et al. [2] also extracted features from the drawing exam images and used
them as inputs. One difference, however, is that they combined the drawing
features with some basic information on the participants, such as age. In
addition, they also proposed a feature selection algorithm called the Optimized
Chaotic Crow Search Algorithm (OCSA) as a way to optimise the result. To
evaluate their algorithm, they tested detecting PD using the features selected
by OCSA as inputs to three different classifiers: RF, kNN, and DT. As a
baseline, they also tested using the features selected by the original version
of OCSA, the Chaotic Crow Search Algorithm (CCSA). By applying each combination
of feature selection algorithm and classifier on spiral and meander drawings,
they showed that OCSA led to a better performance than CCSA (4 percentage
points higher accuracy while requiring half as much computation time) [2].

Diagnosis Using Time Windows


Gil-Martín et al. [59] also used a pen-and-tablet device, only that their
device also records the participant's grip angle, pressure and the pen's
z-coordinate. Together, these result in a series of 5D arrays in the form of
(x, y, z, grip angle, pressure). Firstly, they divided the samples into
3-second windows and converted each dimension in each window into 125 data
points. Secondly, they used a convolutional neural network to predict PD and
used five-fold cross-validation to evaluate the performance. Thirdly, they
compared the results from using one single dimension with the results from
using all dimensions. As a result, they found that the highest accuracy comes
from using all five dimensions, followed by using one of the coordinates in the
horizontal plane [59].

Diagnosis Using Histograms


In contrast to the aforementioned authors, Al-Yousef et al. [60] detected PD
using a static and a dynamic spiral test, meaning that they asked the
participants to draw on two identical spiral templates, only that the second
one blinks. Firstly, they computed, for each participant, the dissimilarity
between the two tests' acceleration histograms as a feature. Secondly, they
created new drawings out of the final drawings through additive and subtractive
operations. Thirdly, they used the old and new final drawings to compute
Histograms of Oriented Gradients and Edge Histogram Descriptors as features
representing each participant. Fourthly, they used an auto-encoder to create
new representations of the drawings. Finally, they evaluated each feature using
three variants of SVM as classifiers: hard-margin SVM, soft-margin SVM and
Gaussian-kernel SVM (also known as the RBF-SVM). As a result, they concluded
that the combination with the best result is a semi-local Edge Histogram
Descriptor from the dynamic spiral test and an RBF-SVM [60].

Diagnosis Using Signals


Ribeiro et al. [61] tested solving the problem of identifying PD from the
patients' drawing patterns using a recurrent neural network.
In contrast to the other authors mentioned above, their data came from
wearable sensors worn by their participants to register the sound, finger grip,
axial pressure of ink refill, tilt and the pen’s acceleration in 3D during the
drawing exams. Also, instead of converting the data to features or other rep-
resentations, they used the signals directly. That is, they did not do any pre-
processing of the data but rather used the raw data as input to make predictions.
With 50% of the data as training data and the rest as test data, they tested
their proposal against Optimum-Path Forest, ImageNet, CIFAR-10 and LeNet. As a
result, they showed that their proposal achieved higher accuracy than these
algorithms on the same data, especially compared to Optimum-Path Forest, which
had 20 percentage points lower accuracy than their proposed recurrent neural
network [61].
2.7.4 Automatic Diagnosis of Parkinson’s Disease Using Final Drawings
Parziale et al. [62] compared the performance of 3 classifiers for the diagnosis
of PD using guided meander and spiral drawings: SVM, DT and RF. To do
this, they asked 31 PD and 35 healthy participants to draw on 4 templates
of spirals and meanders respectively. From each drawing, they collected 9
metrics, or features, which they then used as input for the classification.
With 70% of the data as training data and 30% as test data, they found that RF
provided the best performance and that the performance of DT was only slightly
lower than that of RF (0.35 percentage points lower accuracy). Also, they found
that SVM provided a significantly lower performance than the other classifiers
(42 percentage points lower than RF) [62].
Similar to Parziale et al. [62], Bernado et al. [63] also tested identifying
PD using a fixed set of features, only that they used another set of features and
another set of classifiers. As features, they used 4 different distance measures
together with pixel similarity, speed, and time. These features were generated
by asking several healthy participants and PD patients to draw three different
figures each: a cube, a triangle, and a spiral. Using these features, they
tested identifying PD using a Naive Bayes classifier, Optimum-Path Forest and
SVM [63].
In contrast to the authors above, Folador et al. [64] did not use any metric-
based feature value for the detection of PD. Instead, they used Histograms of
Oriented Gradients to represent the drawings made by their participants. To
evaluate their approach, they asked 12 healthy participants and 15 PD patients
to draw between three and four sine waves each. They then tested using an RF
to detect PD using the Histograms of Oriented Gradients [64].

2.7.5 Automatic Diagnosis of Parkinson’s Disease Symptoms Using Drawing Features
Memedi et al. [1] tested using a mobile device for the automatic detection
of PD-related symptoms based on drawing exams. To do this, they asked a
group of PD and healthy participants to draw on a spiral template shown on
the mobile device such that their drawing pattern can be recorded as a series
of pen positions.
To cover bradykinesia as a symptom of PD, they calculated the average
speed, speed skewness, and the ratio between speed standard deviation and
average speed. To take account of hesitation, they calculated the mean delta
time (the average time gap between two consecutive timestamps) [1].


To take account of PD-related tremors, they, for instance, calculated the
radius velocity at each timestamp as the distance between the pen’s current
position and the spiral template’s origin [1].
With all features extracted and sent to a principal component analysis (PCA)
for dimension reduction, they used five classifiers independently to see which
one is better at identifying PD-related symptoms: MLP, linear SVM, non-linear
SVM, RF and LR. As a result, they found that MLP outperformed the other
classifiers and that the linear SVM and LR performed the worst [1].
Isenkul et al. [9] made a study about how PD can affect a patient's
coordination ability by asking several PD patients to conduct two different
drawing tests and then comparing how their drawing patterns differed between
the two tests as opposed to healthy participants. More specifically, they asked
each participant to draw on a spiral template and then on another spiral
template that looks the same except that it blinks. They called the first one a
static spiral test (SST) and the second one a dynamic spiral test (DST). By
computing, for each participant, the Euclidean distance between the two tests'
acceleration histograms, they found that the distance is significantly higher
amongst PD patients than amongst healthy participants. Therefore, they
concluded that the combined use of a static and a dynamic spiral test can help
distinguish PD patients from healthy individuals [9].

2.7.6 Identification of the "Best" Drawing Features for the Automatic Diagnosis of Parkinson’s Disease
Poon et al. [25] studied the different handwriting features extractable from the
drawing data provided by Isenkul et al. [9] in which a group of PD patients
and healthy individuals were asked to draw on two spiral templates on a pen-
and-tablet device - one that is still and another that blinks.
As features related to tremor, they, for instance, computed the peak
instantaneous acceleration, the peak instantaneous velocity and the absolute
drawing output length. The peak instantaneous acceleration and velocity were
defined as the maximum acceleration and speed between two consecutive
timestamps respectively.
As features related to tremor and rigidity, they computed standard devia-
tion in pressure as well as the maximum, average and standard deviation in
pressure increase and pressure decrease, where pressure increase is defined as
the pressure change between two consecutive timestamps where the change
is positive and pressure decrease the change between two consecutive times-
tamps where the change is negative. These were referred to as variations in
pressure [25].
As features related to rigidity and bradykinesia, they computed average drawing speed and total drawing time, along with the spiral size in the form of spiral width and height. The motivation for including spiral size was that, given that the participants were asked to draw on the same
template, the spiral size could be used to reveal signs of micrographia [25].
Finally, they added average pressure, average grip angle and standard deviation in grip angle as features specifically related to rigidity [25].
By measuring the performance of a logistic regression classifier trained
on the features independently, they found that a static test is better at detecting
PD through the observation of reduction and variation in pressure while the
dynamic test with a blinking template is better for detecting PD through the
observation of greater variation in grip angle and increased pressure, possibly
because of the challenge posed by the dynamic test where the participants were
required to trace a blinking spiral template while being stressed and distracted
by the blinking of the template [25].
Chapter 3

Methodology

3.1 Data
3.1.1 Drawing Data
For the drawing data, we have utilized the data provided by Isenkul et al. [9], where a pen-and-tablet device, the Wacom Cintiq 12WX, was used to record each participant's drawing movements in terms of grip angle, pressure and the position of the pen at each timestamp.

Figure 3.1 – A spiral drawing from a PD patient in Isenkul et al [9]’s study.
The red line is the patient’s drawing, the black spiral is the template.

Since not all participants have conducted the same tests, only those who
have undertaken both the static and dynamic spiral test were included. In both
static and dynamic spiral tests, the participant is asked to draw on a spiral
template as shown in figure 3.1. The only difference between a static and a dynamic spiral test is that a dynamic spiral test (DST) is one where the template blinks whereas a static spiral test (SST) is one where the template does not blink [9]. This resulted in 89 PD patients and 15 healthy participants.
Also, given that the goal is to develop a screening test for the general public,
the project provider is interested in a system that can accurately identify PD
using the data collectable through a device that everyone owns. For this reason,
grip angles were excluded.

3.1.2 Voice Data


Other than the drawing data, we have also used the voice data provided by
Sakar et al. [54] with 752 features collected from 188 PD patients and 64
healthy participants. The data was collected by asking the participants to pro-
nounce the vowel /a/ three times, recording their voice and then using different
algorithms to extract features from the recordings [54].

3.2 Method
3.2.1 Feature Extraction
Since the voice data provided by Sakar et al. [54] was already converted into
features, no feature extraction was needed for the voice data. However, since
this is not the case with the drawing data, a set of functions were implemented
to compute 12 kinematic variables which were, in turn, converted into 132
features. The following sections describe how the kinematic variables and
features were computed.

Kinematic Variables

Figure 3.2 – A spiral drawing from a PD patient in Isenkul et al. [9]'s study with a red star marking the centre of the figure.

As shown in figure 3.2, the coordinate system used by the device in Isenkul et al. [9]'s study has a centre at (x_c, y_c) = (250, 200). Using this as the centre of the spiral, the radius and angle corresponding to each data point were computed using the following formulas:

r_i = \sqrt{(x_i − x_c)^2 + (y_i − y_c)^2}    (3.1)

θ_i = \arccos((x_i − x_c) / r_i)    (3.2)

where (x_i, y_i) is the pen's coordinate at timestamp i, r_i is the radial distance between (x_i, y_i) and the spiral centre and θ_i is the corresponding angle.
Using the aforementioned definitions, twelve kinematic variables were com-
puted for each test and each participant as the basis of the feature extraction.
The variables were computed with the motivation that they were used by Memedi
et al [1] and/or Poon et al [25] for the automatic diagnosis of PD. The variables
are defined as follows:
1. velocity v_i = \sqrt{(x_{i+1} − x_i)^2 + (y_{i+1} − y_i)^2} / (t_{i+1} − t_i) for i ∈ 1, ..., n − 1,

2. acceleration a_i = (v_{i+1} − v_i) / ((t_{i+2} − t_i)/2) for i ∈ 1, ..., n − 2,

3. radius r_i = \sqrt{(x_i − x_c)^2 + (y_i − y_c)^2} for i ∈ 1, ..., n − 1,

4. radial velocity rv_i = (r_{i+1} − r_i) / (t_{i+1} − t_i) for i ∈ 1, ..., n − 1,

5. radial acceleration ra_i = (rv_{i+1} − rv_i) / ((t_{i+2} − t_i)/2) for i ∈ 1, ..., n − 2,

6. angular velocity ω_i = (θ_{i+1} − θ_i) / (t_{i+1} − t_i) for i ∈ 1, ..., n − 1,

7. angular acceleration α_i = (ω_{i+1} − ω_i) / ((t_{i+2} − t_i)/2) for i ∈ 1, ..., n − 2,

8. pressure p_i for i ∈ 1, ..., n,

9. pressure velocity pv_i = (p_{i+1} − p_i) / (t_{i+1} − t_i) for i ∈ 1, ..., n − 1,

10. pressure acceleration pa_i = (pv_{i+1} − pv_i) / ((t_{i+2} − t_i)/2) for i ∈ 1, ..., n − 2,

11. pressure increase ↑p_i = pv_i for i ∈ 1, ..., n − 1 where pv_i > 0,

12. pressure decrease ↓p_i = pv_i for i ∈ 1, ..., n − 1 where pv_i < 0.
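As an illustration, a few of the variables above (velocity, radius and pressure velocity) can be computed with NumPy. This is a sketch only: `kinematic_variables` is a hypothetical helper, not a function from the thesis code, and it assumes `x`, `y`, `p` and `t` are equally long arrays of pen coordinates, pressure values and timestamps.

```python
import numpy as np

def kinematic_variables(x, y, p, t, xc=250.0, yc=200.0):
    """Compute a subset of the kinematic variables from per-timestamp
    pen data (a sketch; the thesis computes 12 such variables)."""
    dt = np.diff(t)
    # 1. velocity: drawn distance between consecutive timestamps over time
    v = np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2) / dt
    # 3. radius: distance from the pen position to the spiral centre
    r = np.sqrt((x - xc) ** 2 + (y - yc) ** 2)
    # 9. pressure velocity: pressure change per unit time
    pv = np.diff(p) / dt
    return v, r, pv
```

The remaining variables follow the same pattern of finite differences over the timestamps.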


Statistical Measures as Features from Individual Tests


Using the definitions proposed by Memedi et al [1] and Poon et al [25], a set
of drawing features were, for each drawing test, extracted by applying some
statistical functions on all kinematic variables. The following is a list of the
statistical functions used in this study, where X is a kinematic variable in one
test and m is the number of samples in that test with respect to that variable:

1. the maximum value X_max

2. the minimum value X_min

3. the average value µ = (1/m) \sum_{i=1}^{m} X_i

4. the standard deviation s = \sqrt{\sum_{i=1}^{m} (X_i − µ)^2 / (m − 1)}

5. the skewness s̃ = ((1/m) \sum_{i=1}^{m} (X_i − µ)^3) / s^3

To make a distinction between the static and dynamic spiral test result, we
have kept the features derived from each test as separate features. Since there
are 2 test results and 12 kinematic variables, the aforementioned 5 statistical
measures have thus led to 120 features per participant.
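The five statistical measures applied to one kinematic variable can be sketched in NumPy as follows; `statistical_features` is a hypothetical helper name, and the skewness here uses the conventional s³ normalisation.

```python
import numpy as np

def statistical_features(X):
    """Max, min, mean, sample standard deviation and skewness of one
    kinematic variable.  ddof=1 gives the (m - 1) denominator."""
    m = len(X)
    mu = X.mean()
    s = X.std(ddof=1)
    skew = (np.sum((X - mu) ** 3) / m) / s ** 3
    return X.max(), X.min(), mu, s, skew
```

Applying these five functions to each of the 12 kinematic variables in each of the 2 tests gives the 120 features.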

Histogram Distances as Additional Features

In addition to the features derived from each individual test, we have used Isenkul et al. [9]'s definition
to compute the histogram of each kinematic variable.
For the sake of clarity, each histogram is computed by dividing the values
into 10 equally wide bins such that the left and right edge of the histogram
would correspond to the lowest and highest value of the kinematic variable for
the corresponding participant and test.
By denoting the histograms of the static spiral test as H_SST and the histograms of the dynamic spiral test as H_DST, the following is the definition of the histogram distance between the dynamic and static spiral test for each participant and each kinematic variable, based on the definition provided by [9]:

Histogram Distance = \sqrt{\sum_{i=1}^{10} (H_SST(i) − H_DST(i))^2}    (3.3)

In contrast to the other features, the histogram distance was computed based
on the result from two different tests - the static and dynamic spiral tests. For
this reason, only 1 histogram distance was computed for each kinematic vari-
able and participant. As there are 12 kinematic variables, this resulted in 12
additional features for each participant.
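The computation above can be sketched with NumPy's histogram function; `histogram_distance` is a hypothetical helper name. By default `np.histogram` spans the data's own min..max range, matching the binning described above.

```python
import numpy as np

def histogram_distance(var_sst, var_dst, bins=10):
    """Euclidean distance between the 10-bin histograms of one kinematic
    variable in the static (SST) and dynamic (DST) spiral test."""
    h_sst, _ = np.histogram(var_sst, bins=bins)
    h_dst, _ = np.histogram(var_dst, bins=bins)
    return np.sqrt(np.sum((h_sst - h_dst) ** 2))
```

Identical value distributions give a distance of zero; the more the two tests differ, the larger the distance.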

3.3 Finding the Best System


As mentioned earlier, the aim of this thesis is to find the best system for the
diagnosis of PD. To achieve this, 18 systems were implemented and tested
against each other on each data set using the following strategy:
Firstly, two different feature selection algorithms were implemented and tested against each other in combination with two different sampling methods and three different classifiers. By only having one feature selection algorithm, one sampling method and one classifier in each system, 12 systems were thus created.
Secondly, a baseline was implemented to keep all features and thereby make
it possible to measure the impact of feature selection. By testing the baseline
on all pairs of sampling method and classifier, another 6 systems were thus
created.

3.4 Feature Selection


3.4.1 Problem Encoding
To enable feature selection through the chosen feature selection algorithms, each solution was encoded as a mask in the form of a binary array, with the value 1 denoting that the feature at the same index should be used and the value 0 the opposite.
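The encoding can be illustrated in a few lines of NumPy: a mask is applied to a feature matrix by keeping only the columns whose mask value is 1. The arrays below are illustrative, not taken from the thesis data.

```python
import numpy as np

# A solution is a binary mask over the features: 1 = use, 0 = drop.
mask = np.array([1, 0, 1, 0])
X = np.arange(12).reshape(3, 4)        # 3 samples, 4 features
X_selected = X[:, mask.astype(bool)]   # keep only the masked-in columns
```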

3.4.2 Forward Greedy Search


As mentioned in section 2.2.4, a forward greedy search is a simple search
algorithm with the ability to select features based on group performance within
a reasonable amount of time [27, 7, 33]. For this reason, greedy search has
been implemented as one of the feature selection algorithms to consider.
The algorithm was implemented based on the forward greedy search de-
scribed in section 2.2.4. The implementation of the greedy search can be found
in algorithm 1. The algorithm was implemented to start with an empty set of
features and then gradually add new features by always taking the one that
leads to the highest MCC without removing any previously added features.
The algorithm was implemented to run until the MCC stops increasing.
The MCC was chosen because of several reasons. Firstly, studies have
shown that performance-based search heuristics like MCC can provide higher
performance than correlation-based search heuristics like χ2 by adapting the
search on the chosen classifier [27, 55]. Secondly, the data sets used in this
study are imbalanced and MCC is suitable for imbalanced data as it takes ac-
count of the ratio between the positive and negative samples [12].
The MCC was computed by conducting a 5-fold CV on the data that was
sent to the greedy algorithm and then averaging the results. The implementa-
tion of the CV can be found in section 3.6.1.

Algorithm 1 Greedy
1: procedure greedy(X, y)
2: n ← the number of features in X
3: mask ← an array with n zeros
4: backlog ← a shuffled array with i ∈ 1, 2, ..., n
5: MCC ← −∞
6: repeat
7: chosen ← Null
8: for i ∈ backlog do
9: masknew ← mask.copy()
10: masknew[i] ← 1
11: MCCnew ← avg(CrossValidation(X, y, masknew))
12: if MCCnew > MCC then
13: MCC ← MCCnew
14: chosen ← i
15: end if
16: end for
17: if chosen ≠ Null then
18: backlog.remove(chosen)
19: mask[chosen] ← 1
20: end if
21: until chosen = Null
22: return mask
23: end procedure
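The pseudocode above can be condensed into a small Python sketch. The `score` callback below is a hypothetical stand-in for the average 5-fold-CV MCC used in the thesis; `forward_greedy` is not a function from the thesis code.

```python
def forward_greedy(n_features, score):
    """Forward greedy feature selection.  `score(mask)` returns the
    validation score of training with the features where mask[i] == 1;
    the search stops when no single added feature improves the score."""
    mask = [0] * n_features
    best_score = float("-inf")
    improved = True
    while improved:
        improved, best_i = False, None
        for i in range(n_features):
            if mask[i]:
                continue                 # feature already selected
            trial = mask.copy()
            trial[i] = 1
            s = score(trial)
            if s > best_score:           # track the best improving feature
                best_score, best_i, improved = s, i, True
        if improved:
            mask[best_i] = 1             # commit one feature per sweep
    return mask
```

With a toy score that rewards feature weights but penalises set size, the search keeps only the features worth their cost.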

3.4.3 Genetic Search


In addition to greedy search, an elitism-based genetic algorithm has been im-
plemented as an alternative search strategy to potentially solve the local optima
problem with greedy search [27] without having to go through all possible
combinations [36]. The overall implementation can be found in algorithm 2.

Algorithm 2 Genetic Algorithm
1: procedure genetic(X, y)
2: pop ← population_init()
3: MCCs ← [avg(CrossValidation(X, y, i)) for i in pop]
4: best, bestlast, best2nd_last, deceased ← pop[argmax(MCCs)], Null, Null, []
5: repeat
6: best2nd_last, bestlast ← bestlast, best
7: pop, MCCs ← add_children(pop, MCCs, X, y)
8: pop, MCCs ← add_mutated_copies(pop, MCCs, X, y)
9: best, pop, MCCs, deceased ← keep_best(pop, MCCs, deceased)
10: until best = best2nd_last
11: return best
12: end procedure

Population Initialization
To reduce the risk of getting stuck at local optima without having to go through
too many solutions, the population was initiated by randomly generating a
mask and then an opposite solution by conducting a bit-wise negation opera-
tion on the generated mask. The initial population was complemented with an
array with value one at all indices as a way to ensure that the final solution is at least as good as using all features in terms of validation MCC.

Selection
At each iteration of evolution, a number of solutions were selected for reproduction by randomly selecting 2 distinct parent solutions n times, where:

n = ⌊min(n_solutions, n_features) / 2⌋    (3.4)

Here, n_solutions is the number of solutions in the population whereas n_features is the number of features in the data.
Given MCC’s suitability for this study (see section 3.4.2), each solution’s
probability of being chosen was set to be proportional to its MCC. Again, the
MCC was computed by averaging the MCC results from a 5-fold CV on the data that was sent to the genetic algorithm (see algorithm 4).

With new solutions created through reproduction, another set of solutions was selected for mutation using a similar strategy: selecting a number of distinct solutions equal to half the minimum of the population size and the number of features in the data, rounded downwards, where each solution's probability of being chosen for mutation was proportional to its MCC.

Cross-Over
Without changing the solutions in the current generation, a cross-over operation was conducted on each pair of parents to produce new solutions. To do this, two cross-over points were randomly chosen between the right side of the first element and the left side of the last element, such that the probability of them being the same became p = 1/(n − 1).
In the case where two distinct cross-over points were chosen, a two-point cross-over operation would be conducted by swapping the elements between the two cross-over points as shown in figure 3.3a. In the case where the two cross-over points happened to be the same, a one-point cross-over operation would be conducted by swapping the elements after the cross-over point as shown in figure 3.3b.

(a) Two-point Cross-over

(b) One-point Cross-over

Figure 3.3 – An illustration of the cross-over operations used in this study. The
vertical lines are the cross-over points. The arrays to the left are the parents
and the arrays to the right are the children.

Mutation
As mentioned earlier, each iteration of evolution involves the creation of mu-
tated copies of half the population. This was achieved by creating a copy of
each solution and then conducting a bit-wise negation operation on a random element of each copy (see figure 3.4).

Figure 3.4 – An illustration of the mutation operation used in this study.

Assume, for instance, that the third element of a solution is randomly selected for the bit-wise negation operation. The third element would then be flipped from 1 to 0 as a result of the mutation.
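The cross-over and mutation operators described above can be sketched with NumPy. `crossover` and `mutate` are hypothetical helper names; the two cross-over points are drawn from the n − 1 interior positions, and when they coincide the operation degenerates to a one-point tail swap, as in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(parent_a, parent_b):
    """Two-point cross-over; one-point (tail swap) when the two
    randomly chosen points coincide."""
    n = len(parent_a)
    c1, c2 = sorted(rng.integers(1, n, size=2))
    if c1 == c2:
        # one-point case: swap everything after the single point
        child_a = np.concatenate([parent_a[:c1], parent_b[c1:]])
        child_b = np.concatenate([parent_b[:c1], parent_a[c1:]])
    else:
        # two-point case: swap the segment between the points
        child_a = np.concatenate([parent_a[:c1], parent_b[c1:c2], parent_a[c2:]])
        child_b = np.concatenate([parent_b[:c1], parent_a[c1:c2], parent_b[c2:]])
    return child_a, child_b

def mutate(solution):
    """Return a copy of the solution with one randomly chosen bit flipped."""
    copy = solution.copy()
    i = rng.integers(len(copy))
    copy[i] = 1 - copy[i]
    return copy
```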

Elitism-based Survival Selection (Keep Best)


To ensure that the best solution is not lost during the process, all solutions
were kept until the end of each evolution at which a survival selection was
conducted by only keeping the first two solutions sorted by average MCC (the
higher the better) followed by the number of features (the lower the better).
Since there might be cases where several solutions are equally good as the best (first) solution, those that are as good as the best (first) solution are also kept. That is, in order to be kept, a solution has to either be at one of the first two positions or be as good as the best (first) solution both in terms of average MCC and the number of features.
To quicken the convergence process, those that were thrown out were put
into a list of "deceased" population to ensure that no cross-over, mutation or
CV can be made on these solutions again by the current round of genetic al-
gorithm. That is, while the solutions would not be put into consideration by
the algorithm in the remaining part of the current round of feature selection,
the solutions may be put into consideration by the algorithm again in another
round of feature selection using another set of training data. This was to make
sure that the feature selection would never be affected by the data used for the
final performance measurement.

Stopping Criteria
To avoid the problem of local optima, the algorithm was also implemented to stop when the best solution has been the same for three rounds. Since the eldest solution is always given a higher ranking when two solutions have the same number of features and the same performance score, the best solution would only change if a new solution is generated with a better performance score, or with as high a performance score as the current best solution but with fewer features (see algorithm 2).

3.4.4 Random Oversampling Versus No Oversampling


As mentioned in section 2.3, random oversampling has the benefit of mitigating class
imbalance. Since the data sets used in this thesis are imbalanced, a random
oversampling was conducted as a way to tackle the problem of class imbalance.
However, given that random oversampling can sometimes lead to overfitting
[38], one may want to measure its impact by comparing it against a baseline
where all samples were used without any oversampling. Hence, two sampling
methods were investigated in this study - one with random oversampling and
one without.
Firstly, Python’s Imbalanced-Learn library was used to randomly create
duplicates of the training samples in the minority class such that the number
of samples in the minority class becomes equal to the number of samples in
the majority class. By ensuring that the duplication is conducted on the train-
ing samples only, the training remains thus unaffected by the validation and
test data and bias towards the majority class due to class imbalance was thus
avoided.
Secondly, a baseline was enabled by allowing the random oversampling
feature to be deactivated. That is, when the random oversampling feature was
deactivated, no oversampling or undersampling was used.
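The duplication step can be illustrated with a small NumPy sketch. The thesis itself uses Imbalanced-Learn's RandomOverSampler; `random_oversample` below is a hypothetical re-implementation of the same idea for a binary class label, shown for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate random minority-class samples until both classes
    are equally large (a sketch of random oversampling)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    n_extra = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=n_extra, replace=True)  # sample with replacement
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```

Applied to the training folds only, this leaves the validation and test data untouched.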

3.5 Classifiers
As mentioned earlier, this study involves three classifiers. These are: RF, RBF-
SVM and MLP.
The rationale behind this is that these classifiers have outperformed several other classifiers in related studies: amongst the studies about drawing-based diagnosis of PD, RF has outperformed kNN, DT and SVM in the studies by Sharma et al. [52], Gupta et al. [2], Parziale et al. [62] and Memedi et al. [1]; amongst the studies about voice-based diagnosis of PD, RBF-SVM has outperformed linear SVM, MLP, Naive Bayes, LR, RF and kNN in Sakar et al. [54]'s study, whereas MLP has outperformed DT, Naive Bayes, RF and RBF-SVM in Mostafa et al. [53]'s study.
Since the focus of this study is not the implementation of classifiers, the classifiers were implemented using existing code from a library, namely the Scikit-learn library.
For the same reason, all classifiers were initiated using the default hyper-
parameters set by the provider of the library.

3.6 Validation and Testing


3.6.1 Cross-validation
As described in sections 2.2.4 - 2.2.5, a CV was used to produce the score of each solution along the search. To ensure that the feature selection is not affected by the test data used for performance measurement, a nested CV was implemented with the outer loop for performance measurement and the inner loop inside each search algorithm to enable performance-based feature selection. Algorithms 3 and 4 show a simplified version of the overall implementation.

Algorithm 3 Outer Cross-Validation


1: procedure CrossValidationouter (X, y)
2: metrics ← a matrix with 5x6 zeros . 5 folds and 6 metrics
3: X, y ← shuffle(X, y)
4: X, y ← X and y partitioned into 5 stratified folds.
5: for i ∈ 0, 1, ..., 4 do
6: Xtest , ytest ← X[i], y[i]
7: Xtrain , ytrain ← X, y excl. X[i], y[i]
8: mask ← search(Xtrain , ytrain )
9: if oversampling then . a value assigned outside the function
10: Xtrain , ytrain ← oversample(Xtrain , ytrain )
11: end if
12: metrics[i] ← evaluate(Xtrain , ytrain , Xtest , ytest , mask)
13: end for
14: return metrics
15: end procedure

As shown in algorithms 3 and 4, both the inner and outer loops started by partitioning the already shuffled data into 5 stratified folds. The rationale behind
shuffling outside the CV function was to make sure that each combination of
classifier, random sampling and feature selection algorithm can be evaluated
under the same circumstances. The reason why stratified CV was used was
that the data sets in this study are imbalanced.
With the data partitioned into stratified folds, the algorithm was implemented to make a random oversampling of the current training data when oversampling is enabled. The rationale be-
Algorithm 4 Inner Cross-Validation


1: procedure CrossValidationinner (X, y, mask)
2: X, y ← X and y partitioned into 5 stratified folds.
3: MCC ← an array with 5 zeros
4: for i ∈ 0, 1, ..., 4 do
5: Xtest , ytest ← X[i], y[i]
6: Xtrain , ytrain ← X, y excl. X[i], y[i]
7: if oversampling then . a value assigned outside the function
8: Xtrain , ytrain ← oversample(Xtrain , ytrain )
9: end if
10: MCC[i] ← getMCC(Xtrain , ytrain , Xtest , ytest , mask)
11: end for
12: return avg(MCC)
13: end procedure

hind the oversampling was to prevent the classifier from being biased towards
the majority class. This was done with the help of the RandomOversampler
module in the Python library Imbalanced-learn.
Having conducted a random oversampling on the data, the algorithm continued by calling the function evaluate() using the training and test data along with a mask. The mask was either given as a parameter input, when the CV was used as part of the search heuristic inside a search algorithm, or produced by the search algorithm itself, when the CV was used for performance measurement.
Having looped through all folds, the function would return a list of metrics corresponding to the metrics returned by the function evaluate().

Inner loop vs Outer loop


In contrast to the outer loop, the inner loop was implemented to be called inside
each search algorithm. This was to reduce the risk of the feature selection
being affected by how the data was partitioned into training and validation
data set. This was why the CV had an optional parameter called mask that is
only None when it is used outside the search algorithms as part of the final
performance measurement.

3.6.2 Metrics
Prediction-related Metrics
For the purpose of performance measurements, a confusion matrix was com-
puted for each outer fold with the following components:

• True Positive (TP) = the number of PD patient samples correctly predicted as PD patient samples

• True Negative (TN) = the number of healthy participant samples correctly predicted as healthy participant samples

• False Positive (FP) = the number of healthy participant samples falsely predicted as PD patient samples

• False Negative (FN) = the number of PD patient samples falsely predicted as healthy participant samples

With the metrics in the confusion matrix calculated, the following metrics were then calculated as performance metrics:

MCC_norm = (1 + MCC) / 2    (3.5)

MCC =
  0, if TP + FN = 0 or TN + FP = 0
  0, if TP + FP = 0 or TN + FN = 0
  (TP · TN − FP · FN) / \sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}, otherwise
(3.6)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.7)

Precision = TP / (TP + FP)    (3.8)

Recall = TP / (TP + FN)    (3.9)

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (3.10)
The MCC was chosen to tackle the problem of imbalanced data as sug-
gested by Chicco and Jurman [12]. The accuracy and F1 score were chosen
both as a way to make this thesis comparable with other studies and as a way
to make the results more understandable, as they are among the measures most people would understand. The precision was chosen to demonstrate the probability that a patient has PD given that the system says so. The recall rate was chosen as a metric to demonstrate how good the system is at detecting PD patients.
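Equations 3.5 and 3.6 can be implemented in a few lines of plain Python; `mcc` and `mcc_norm` are hypothetical helper names showing the zero-handling for degenerate confusion matrices.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient with the zero-handling of
    equation 3.6: degenerate confusion matrices give MCC = 0."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0 or tn + fn == 0:
        return 0.0
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

def mcc_norm(tp, tn, fp, fn):
    """Normalised MCC (equation 3.5), mapped from [-1, 1] to [0, 1]."""
    return (1 + mcc(tp, tn, fp, fn)) / 2
```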

Other Metrics
Given that feature selection has the potential of reducing the amount of required computational resources by reducing the number of features used for training and testing [27], one may want to see the number of features selected by the different feature selection algorithms depending on in which system they are used, i.e. with which classifier and sampling method they are combined. For this reason, we have, for each system, registered the number of features selected by the feature selection algorithm. For the sake of simplicity, this metric will henceforth be denoted as Features.
Similarly, given that feature selection is a process that requires a certain amount of computational resources, and that the feature set evaluation through CV is likely the part requiring the largest proportion of them, we have, for each system, registered the number of times CV has been called for the evaluation of a feature set throughout the course of feature selection. To achieve this, we have set a counter to zero at the start of each feature selection and configured it to increment by one each time CV is called for the evaluation of one feature set. Following this logic, the registered counter is also the number of feature sets being evaluated during feature selection. For the sake of simplicity, the value of the counter will henceforth be denoted as Calls.

3.6.3 Confidence Intervals


As a way to show the range within which one may expect the performance to vary, a 95% confidence interval was computed for each performance metric and each system using the following equation:

x = µ ± 1.96 σ/√n    (3.11)

where µ is the mean value, σ is the standard deviation between the folds and n is the number of folds.
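Equation 3.11 translates directly to Python; `confidence_interval_95` is a hypothetical helper name, and the sample standard deviation (n − 1 denominator) is assumed for σ.

```python
import math

def confidence_interval_95(values):
    """95% confidence interval of a metric over the CV folds
    (equation 3.11), assuming the sample standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)
    return mu - half, mu + half
```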

3.6.4 Significance Testing


To see whether the observed performance difference between each pair of sys-
tems is statistically significant, we have, for each metric, conducted a Fried-
man’s test on all systems to see if there is a significant difference amongst the
systems in that metric. The Friedman’s test was conducted using the existing
implementation in the SciPy library with the null hypothesis that the current
metric has the same distribution in all systems.
Given that the null hypothesis corresponding to the current metric is rejected by the Friedman's test at a confidence level of 95% (i.e. p < 0.05), a Dunn's test was conducted between each pair of systems where a difference is observed between the corresponding mean values. The confidence level was chosen based on the notion that 95% is the most common confidence level. The Dunn's test was made to compute the p-value corresponding to the null hypothesis that there is no significant difference between the two systems, i.e. that the observed difference may be due to chance.
The Dunn’s test was conducted using the existing implementation in the
Scikit-Posthocs library. The Dunn’s test was configured to apply the Bonfer-
roni correction on the p-values to take account of the family-wise type I error
when comparing multiple pairs of systems.
The purpose of this is to verify whether the observed difference between each pair of systems is statistically significant, and thereby come to a conclusion on whether the best performing system is worth using as well as with what certainty one can expect the best performing system to remain the best in another trial with the same data.
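The first step of this procedure can be sketched with SciPy's implementation of the Friedman test, which the thesis uses. The per-fold MCC scores below are hypothetical, purely to illustrate the gate on p < 0.05 before any pairwise Dunn's tests.

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-fold MCC scores of three systems over five outer folds.
sys_a = [0.85, 0.90, 0.88, 0.86, 0.91]
sys_b = [0.70, 0.72, 0.69, 0.75, 0.71]
sys_c = [0.60, 0.65, 0.58, 0.66, 0.62]

# Null hypothesis: the metric has the same distribution in all systems.
stat, p = friedmanchisquare(sys_a, sys_b, sys_c)
if p < 0.05:
    # Only then does the thesis proceed to pairwise Dunn's tests
    # (with Bonferroni correction, via Scikit-Posthocs).
    print("significant difference somewhere among the systems")
```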

3.7 Programming Language and Library


To answer the research question, scripts were written in Python using several libraries: the Scikit-Learn library for the implementation of classifiers and the stratified fold partitioning algorithm, the Imbalanced-Learn library for the implementation of random oversampling, the SciPy library for the implementation of the Friedman's test and the Scikit-Posthocs library for the Dunn's test.

3.8 Hyperparameter Settings


Since the focus is on finding the best combination of classifier and model generation mechanism as well as the most valuable features, both the classifiers and the partitioning algorithm StratifiedKFold were implemented using the default hyperparameters set by the Scikit-learn library. The same applies to the random oversampling method from the Imbalanced-Learn library.
Chapter 4

Results

4.1 Clarification of the Names


To enable easy comparison between each system, a ’+’ has been used to de-
note systems involving the use of oversampling. Also, the genetic algorithm
is referred to as GA, whereas the greedy search is referred to as GS.
As mentioned earlier, a baseline was used as a way to investigate the impact of feature selection. Hence, the mere use of a classifier without the involvement of any feature selection algorithm or oversampling was called "Baseline" in the graphs with confidence intervals, whereas those utilising a classifier combined with oversampling but no feature selection algorithm were called "Baseline+".

4.2 Results on Drawing Data


As shown in table 4.1, RFGA+ was the one with the highest MCC, accuracy
and F1 score amongst all systems tested (more metric values can be found in
table A.2 in the appendix).
Since the p-values from the Friedman’s test were below 0.05 for all metrics,
the Dunn’s test was conducted for all metrics. The Friedman’s test results can
be found in table A.1 in the appendix along with the Dunn’s test results on
the resulting precision and recall rates. The following sections describe the
Dunn’s test results on the resulting MCC, accuracy and F1 scores along with
the corresponding metric values.
MCC Accuracy F1 Features Calls


RF 0.856 ± 0.041 0.936 ± 0.017 0.964 ± 0.009 - -
RF+ 0.900 ± 0.039 0.955 ± 0.017 0.975 ± 0.009 - -
RFGS 0.839 ± 0.068 0.926 ± 0.029 0.958 ± 0.017 3.133 ± 0.550 538.533 ± 70.342
RFGS+ 0.841 ± 0.073 0.930 ± 0.029 0.960 ± 0.017 3.267 ± 0.470 555.800 ± 60.162
RFGA 0.847 ± 0.041 0.933 ± 0.017 0.962 ± 0.010 75.667 ± 12.278 9.400 ± 0.882
RFGA+ 0.909 ± 0.056 0.958 ± 0.027 0.976 ± 0.016 76.000 ± 10.627 8.800 ± 0.696
SVM 0.500 ± 0.000 0.856 ± 0.001 0.922 ± 0.001 - -
SVM+ 0.666 ± 0.054 0.700 ± 0.051 0.793 ± 0.040 - -
SVMGS 0.783 ± 0.079 0.904 ± 0.031 0.945 ± 0.017 2.600 ± 0.482 470.067 ± 62.021
SVMGS+ 0.727 ± 0.064 0.833 ± 0.032 0.896 ± 0.022 2.933 ± 0.727 512.400 ± 92.953
SVMGA 0.500 ± 0.000 0.856 ± 0.001 0.922 ± 0.001 62.733 ± 7.611 10.867 ± 3.113
SVMGA+ 0.660 ± 0.050 0.703 ± 0.041 0.798 ± 0.032 86.600 ± 13.486 9.000 ± 0.613
MLP 0.543 ± 0.056 0.808 ± 0.064 0.882 ± 0.052 - -
MLP+ 0.644 ± 0.065 0.678 ± 0.117 0.731 ± 0.150 - -
MLPGS 0.727 ± 0.083 0.894 ± 0.026 0.940 ± 0.015 4.333 ± 0.658 691.600 ± 83.518
MLPGS+ 0.774 ± 0.048 0.840 ± 0.047 0.897 ± 0.034 4.800 ± 0.946 749.933 ± 119.644
MLPGA 0.502 ± 0.052 0.672 ± 0.132 0.731 ± 0.151 95.000 ± 14.624 9.000 ± 0.640
MLPGA+ 0.673 ± 0.060 0.692 ± 0.090 0.767 ± 0.092 87.400 ± 14.277 9.267 ± 1.146

Table 4.1 – Test results on drawing data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.

4.2.1 MCC
As shown in table 4.2, RFGA+ has achieved the highest mean MCC, followed by RF+ and RF, with no significant difference between them. This suggests that while the use of GA and oversampling has led to a higher mean MCC, the mere use of an RF may be sufficient, as the observed difference may be due to chance.
While there is no significant difference between SVMGS, SVMGS+ and the systems involving the use of RF (p > 0.5), it should be noted that RFGA+ is the only system having achieved a significantly higher MCC, at the confidence level of 95%, than 8 other systems (e.g. SVM, SVM+, SVMGA, MLP, MLP+, MLPGA and MLPGA+; see table 4.2 for the exact p-values).

4.2.2 Accuracy
As shown in table 4.1 and table 4.3, RFGA+ has achieved the highest mean accuracy, followed by RF+ and RF, with no significant difference between them. This suggests that while the use of GA and oversampling has led to a higher mean accuracy, the mere use of an RF may be sufficient, as the observed difference may be due to chance.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 1.000e+00 - 4.051e-06 3.489e-01 1.000e+00 1.000e+00 4.051e-06 2.633e-01 1.212e-04 1.357e-01 1.000e+00 1.000e+00 1.076e-05 5.874e-01
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 6.855e-08 2.999e-02 1.000e+00 8.245e-01 6.855e-08 2.146e-02 3.090e-06 9.810e-03 1.000e+00 1.000e+00 2.040e-07 5.586e-02
RFGS - - - - - - 2.415e-05 9.660e-01 1.000e+00 1.000e+00 2.415e-05 7.473e-01 5.951e-04 4.077e-01 1.000e+00 1.000e+00 6.079e-05 1.000e+00
RFGS+ - - 1.000e+00 - - - 2.574e-05 1.000e+00 1.000e+00 1.000e+00 2.574e-05 7.751e-01 6.297e-04 4.237e-01 1.000e+00 1.000e+00 6.467e-05 1.000e+00
RFGA - - 1.000e+00 1.000e+00 - - 1.352e-05 6.969e-01 1.000e+00 1.000e+00 1.352e-05 5.347e-01 3.554e-04 2.863e-01 1.000e+00 1.000e+00 3.466e-05 1.000e+00
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.630e-08 1.649e-02 1.000e+00 5.168e-01 2.630e-08 1.166e-02 1.298e-06 5.191e-03 7.284e-01 1.000e+00 8.025e-08 3.142e-02
SVM - - - - - - - - - - - - - - - - - -
SVM+ - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 -
SVMGS - - - - - - 2.256e-03 1.000e+00 - 1.000e+00 2.256e-03 1.000e+00 3.215e-02 1.000e+00 1.000e+00 1.000e+00 4.880e-03 1.000e+00
SVMGS+ - - - - - - 8.467e-02 1.000e+00 - - 8.467e-02 1.000e+00 7.177e-01 1.000e+00 - - 1.585e-01 1.000e+00
SVMGA - - - - - - - - - - - - - - - - - -
SVMGA+ - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 1.000e+00 - - 1.000e+00 -
MLP - - - - - - 1.000e+00 - - - 1.000e+00 - - - - - 1.000e+00 -
MLP+ - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 -
MLPGS - - - - - - 5.636e-02 1.000e+00 - 1.000e+00 5.636e-02 1.000e+00 5.091e-01 1.000e+00 - - 1.074e-01 1.000e+00
MLPGS+ - - - - - - 8.238e-03 1.000e+00 - 1.000e+00 8.238e-03 1.000e+00 9.855e-02 1.000e+00 1.000e+00 - 1.697e-02 1.000e+00
MLPGA - - - - - - 1.000e+00 - - - 1.000e+00 - - - - - - -
MLPGA+ - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 -

Table 4.2 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same MCC. The symbol "-"
is used when system A does not have a higher mean value than system B. A
p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 1.000e+00 - 3.033e-01 2.514e-06 1.000e+00 8.209e-02 3.033e-01 8.611e-07 5.631e-02 1.133e-04 1.000e+00 5.444e-01 6.826e-04 5.989e-05
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 3.791e-02 7.724e-08 1.000e+00 8.409e-03 3.791e-02 2.391e-08 5.463e-03 5.123e-06 1.000e+00 7.486e-02 3.772e-05 2.531e-06
RFGS - - - - - - 8.926e-01 1.677e-05 1.000e+00 2.712e-01 8.926e-01 6.097e-06 1.920e-01 6.027e-04 1.000e+00 1.000e+00 3.231e-03 3.316e-04
RFGS+ - - 1.000e+00 - - - 8.862e-01 1.656e-05 1.000e+00 2.691e-01 8.862e-01 6.016e-06 1.904e-01 5.959e-04 1.000e+00 1.000e+00 3.197e-03 3.278e-04
RFGA - - 1.000e+00 1.000e+00 - - 4.876e-01 5.741e-06 1.000e+00 1.387e-01 4.876e-01 2.017e-06 9.645e-02 2.348e-04 1.000e+00 8.548e-01 1.346e-03 1.263e-04
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 3.150e-02 5.714e-08 1.000e+00 6.872e-03 3.150e-02 1.754e-08 4.444e-03 3.913e-06 1.000e+00 6.269e-02 2.929e-05 1.922e-06
SVM - - - - - - - 1.000e+00 - 1.000e+00 - 9.557e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGS - - - - - - 1.000e+00 1.233e-03 - 1.000e+00 1.000e+00 5.198e-04 1.000e+00 2.515e-02 1.000e+00 1.000e+00 1.007e-01 1.528e-02
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA - - - - - - - 1.000e+00 - 1.000e+00 - 9.557e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - - 1.000e+00 - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - 1.000e+00 -
MLPGS - - - - - - 1.000e+00 7.377e-03 - 1.000e+00 1.000e+00 3.335e-03 1.000e+00 1.156e-01 - 1.000e+00 4.039e-01 7.355e-02
MLPGS+ - - - - - - - 9.660e-01 - 1.000e+00 - 5.485e-01 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - - - - - -
MLPGA+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -

Table 4.3 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same accuracy. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

Again, regardless of whether one looks at the confidence level of 90%, 95% or
99%, RFGA+ and RF+ have shared the highest position in terms of the number
of systems over which each of them has achieved a significantly higher accu-
racy (see table 4.3 for the exact systems and the corresponding p-values).
Nevertheless, it should be noted that RFGA+ has outperformed RFGA both
in terms of mean accuracy and in terms of the number of systems over which
it has achieved a significantly higher accuracy on the confidence levels of
90%, 95% and 99%. More importantly, it should be noted that RFGA+ and
RF+ are the only systems that have achieved a significantly higher accuracy
on the confidence level of 95% than all systems except for SVMGS, SVMGS+,
MLPGS+ and those involving the use of RF (see table 4.3 for the exact p-values).
This suggests that it may be advisable to combine RF with random
oversampling or another oversampling method.
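For illustration, random oversampling simply duplicates randomly chosen minority-class samples until the classes are balanced. A minimal sketch of the idea, with illustrative data rather than the datasets used in this thesis:

```python
import numpy as np

def random_oversample(X, y, rng=np.random.default_rng(0)):
    """Duplicate randomly chosen minority-class rows until both
    classes occur equally often (binary labels assumed)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)
    X_new = np.vstack([X, X[extra]])
    y_new = np.concatenate([y, y[extra]])
    return X_new, y_new

# 6 samples of class 1 and 2 samples of class 0
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0])
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # both classes now occur 6 times
```

The rebalanced data can then be passed to any classifier, e.g. a random forest, exactly as the unbalanced data would be.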

4.2.3 F1 Score

System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 1.000e+00 - 8.630e-01 1.457e-06 1.000e+00 1.757e-02 8.630e-01 5.492e-07 9.989e-02 7.870e-05 1.000e+00 1.908e-01 9.898e-04 2.215e-05
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.165e-01 3.575e-08 1.000e+00 1.282e-03 1.165e-01 1.225e-08 9.406e-03 2.934e-06 1.000e+00 1.990e-02 4.959e-05 7.190e-07
RFGS - - - - - - 1.000e+00 1.650e-05 1.000e+00 9.203e-02 1.000e+00 6.651e-06 4.371e-01 6.623e-04 1.000e+00 7.770e-01 6.758e-03 2.056e-04
RFGS+ - - 1.000e+00 - - - 1.000e+00 2.532e-05 1.000e+00 1.226e-01 1.000e+00 1.034e-05 5.635e-01 9.627e-04 1.000e+00 9.886e-01 9.453e-03 3.044e-04
RFGA - - 1.000e+00 1.000e+00 - - 1.000e+00 2.408e-06 1.000e+00 2.485e-02 1.000e+00 9.199e-07 1.363e-01 1.225e-04 1.000e+00 2.565e-01 1.477e-03 3.517e-05
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.025e-01 2.844e-08 1.000e+00 1.087e-03 1.025e-01 9.687e-09 8.099e-03 2.392e-06 1.000e+00 1.724e-02 4.113e-05 5.813e-07
SVM - - - - - - - 4.542e-01 - 1.000e+00 - 2.638e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGS - - - - - - 1.000e+00 2.884e-03 - 1.000e+00 1.000e+00 1.369e-03 1.000e+00 5.674e-02 1.000e+00 1.000e+00 3.506e-01 2.230e-02
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA - - - - - - - 4.542e-01 - 1.000e+00 - 2.638e-01 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - - 1.000e+00 - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - -
MLPGS - - - - - - 1.000e+00 7.626e-03 - 1.000e+00 1.000e+00 3.747e-03 1.000e+00 1.290e-01 - 1.000e+00 7.170e-01 5.329e-02
MLPGS+ - - - - - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - -
MLPGA+ - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -

Table 4.4 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same F1 score. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

As shown in table 4.1 and table 4.4, RFGA+ has achieved the highest mean
F1 score, followed by RF+ and RF with insignificant difference between them.
This suggests that while the use of GA and oversampling has led to a higher
mean F1 score, the mere use of an RF may be sufficient as the observed differ-
ence may be due to chance.
Also, regardless of whether one looks at the confidence level of 95% or 99%,
RFGA+ and RF+ have shared the highest position in terms of the number of
systems over which each of them has achieved a significantly higher F1 score
(see table 4.4 for the exact systems and the corresponding p-values).
Nevertheless, it should be noted that RFGA+ has outperformed RFGA both
in terms of mean F1 score and in terms of the number of systems over which
it has achieved a significantly higher F1 score on the confidence levels of
90%, 95% and 99%. More importantly, it should be noted that RFGA+ and
RF+ are the only systems that have achieved a significantly higher F1 score
on the confidence level of 95% than all systems except for SVMGS, SVMGS+,
MLPGS+ and those involving the use of RF (see table 4.4 for the exact p-values).
This suggests that it may be advisable to combine RF with random
oversampling or another oversampling method.

4.2.4 Selected Features


As shown in table 4.1, the system in which the application of a feature selection
algorithm has led to the least number of features was SVMGS, followed by
SVMGS+ and RFGS with insignificant difference between them (p = 1.0).
Indeed, the p-values show that the only cases where a significant difference
can be observed on the confidence level of 95% are when a system utilising
greedy search (GS) is compared with a system utilising GA. This suggests that
GS indeed tends to lead to a lower number of features than GA and that, while
the choice of classifier and sampling method appears to have an insignificant
impact on the resulting number of selected features, the choice of feature se-
lection algorithm has a certain impact.

System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - 1.000e+00 9.885e-05 1.969e-05 - - 1.734e-02 4.136e-06 1.000e+00 1.000e+00 2.070e-07 5.439e-06
RFGS+ - - 2.470e-04 5.206e-05 - - 3.523e-02 1.152e-05 1.000e+00 1.000e+00 6.324e-07 1.501e-05
RFGA - - - 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00
RFGA+ - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGS 1.000e+00 1.000e+00 6.477e-06 1.103e-06 - 1.000e+00 2.027e-03 2.004e-07 1.000e+00 1.000e+00 7.735e-09 2.702e-07
SVMGS+ 1.000e+00 1.000e+00 2.725e-05 5.031e-06 - - 6.325e-03 9.853e-07 1.000e+00 1.000e+00 4.347e-08 1.311e-06
SVMGA - - 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA+ - - - - - - - - - - 1.000e+00 1.000e+00
MLPGS - - 1.522e-02 4.255e-03 - - 7.725e-01 1.220e-03 - 1.000e+00 1.070e-04 1.521e-03
MLPGS+ - - 2.515e-02 7.309e-03 - - 1.000e+00 2.172e-03 - - 2.031e-04 2.691e-03
MLPGA - - - - - - - - - - - -
MLPGA+ - - - - - - - - - - 1.000e+00 -

Table 4.5 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that system A and B have the same number of selected
features. The symbol "-" is used when system A does not have a lower mean
value than system B. A p-value is written in red if p ≤ 0.01 but orange if
p ∈ (0.01, 0.05], yellow if p ∈ (0.05, 0.1] and black if p > 0.1.

4.2.5 Calls

System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - 1.000e+00 - - - - - - 1.000e+00 1.000e+00 - -
RFGS+ - - - - - - - - 1.000e+00 1.000e+00 - -
RFGA 1.146e-03 4.839e-04 - - 1.058e-02 3.477e-03 1.000e+00 - 3.681e-06 1.753e-06 - -
RFGA+ 1.384e-04 5.391e-05 1.000e+00 - 1.598e-03 4.681e-04 1.000e+00 1.000e+00 2.718e-07 1.223e-07 1.000e+00 1.000e+00
SVMGS 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00 - -
SVMGS+ 1.000e+00 1.000e+00 - - - - - - 1.000e+00 1.000e+00 - -
SVMGA 1.598e-03 6.838e-04 - - 1.422e-02 4.762e-03 - - 5.567e-06 2.676e-06 - -
SVMGA+ 5.128e-04 2.098e-04 1.000e+00 - 5.166e-03 1.623e-03 1.000e+00 - 1.359e-06 6.328e-07 - 1.000e+00
MLPGS - - - - - - - - - 1.000e+00 - -
MLPGS+ - - - - - - - - - - - -
MLPGA 4.528e-04 1.844e-04 1.000e+00 - 4.622e-03 1.443e-03 1.000e+00 - 1.166e-06 5.409e-07 - 1.000e+00
MLPGA+ 2.688e-04 1.073e-04 1.000e+00 - 2.899e-03 8.795e-04 1.000e+00 - 6.133e-07 2.807e-07 - -

Table 4.6 – Dunn’s test results on the drawing data: p-values corresponding
to the null hypothesis that system A and B have the same number of CV calls.
The symbol "-" is used when system A does not have a lower mean value than
system B. A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05],
yellow if p ∈ (0.05, 0.1] and black if p > 0.1.

Just as the systems leading to the lowest number of selected features are
dominated by those involving the use of GS, the systems leading to the highest
number of CV calls also happen to be those involving the use of GS. Indeed,
while GS has called hundreds of cross-validations on average, GA has called
fewer than 10 cross-validations on average. This is true regardless of which
classifier and sampling method each of them has been combined with (see
table 4.1).
Also, while there is no significant difference in the number of cross-validation
calls amongst the systems utilising the same feature selection algorithm, there
is indeed a significant difference between those utilising GS and those utilising
GA, regardless of which classifier and sampling method each of them has been
combined with (p < 0.01; see table 4.6 for the exact p-values). This confirms
that GS has called a significantly higher number of cross-validations than GA
on the confidence level of 99%, and thereby also on the levels of 95% and 90%.
Furthermore, it is worth noting that while the number of selected features ap-
pears to be proportional to the number of CV calls amongst the systems involving
the use of GS, this is not true amongst those involving the use of GA. As shown
in table 4.1, the order in which the systems utilising GS are ranked by the
number of selected features also happens to be the order in which they are
ranked by the number of CV calls. This, however, is not the case when GS is
replaced by GA.

4.3 Results on Voice Data


As shown in table 4.7, RFGA+ was the system with the highest mean MCC and
accuracy amongst all systems tested. While not having the absolute highest
F1 score, RFGA+ has achieved about the same F1 score as the system with the
highest one while having the highest lower bound of the F1 confidence interval.
In particular, the F1 confidence interval of RFGA+ ranged from 0.878 to 0.905
whereas that of the system achieving the highest mean F1 ranged from 0.872
to 0.912 (more metric values can be found in table A.5 in the appendix).
Since the p-values from Friedman’s test were below 0.05 for all metrics,
Dunn’s test was conducted for all of them. The Friedman test results can
be found in table A.1 in the appendix along with the Dunn’s test results on
the resulting precision and recall rates. The following sections describe the
Dunn’s test results on the resulting MCC, accuracy and F1 scores along with
the corresponding metric values.

4.3.1 MCC
As shown in table 4.7 and table 4.8, RFGA+ has achieved the highest mean
MCC, followed by RF and RF+ with insignificant difference between them (p

MCC Accuracy F1 Features Calls


RF 0.752 ± 0.049 0.828 ± 0.032 0.892 ± 0.020 - -
RF+ 0.745 ± 0.033 0.819 ± 0.023 0.883 ± 0.015 - -
RFGS 0.716 ± 0.040 0.802 ± 0.028 0.874 ± 0.018 5.000 ± 0.666 4502.133 ± 497.962
RFGS+ 0.711 ± 0.044 0.784 ± 0.032 0.857 ± 0.022 5.000 ± 0.613 4502.267 ± 457.962
RFGA 0.735 ± 0.045 0.822 ± 0.025 0.889 ± 0.015 575.133 ± 86.938 8.867 ± 0.736
RFGA+ 0.761 ± 0.033 0.831 ± 0.020 0.892 ± 0.013 494.067 ± 78.907 8.800 ± 0.720
SVM 0.504 ± 0.018 0.732 ± 0.011 0.843 ± 0.008 - -
SVM+ 0.649 ± 0.043 0.718 ± 0.039 0.804 ± 0.030 - -
SVMGS 0.704 ± 0.040 0.797 ± 0.026 0.872 ± 0.016 4.933 ± 0.968 4451.333 ± 723.903
SVMGS+ 0.700 ± 0.036 0.752 ± 0.036 0.825 ± 0.030 5.333 ± 1.007 4750.133 ± 751.829
SVMGA 0.493 ± 0.020 0.732 ± 0.011 0.844 ± 0.007 361.333 ± 3.932 9.067 ± 1.484
SVMGA+ 0.664 ± 0.041 0.733 ± 0.032 0.816 ± 0.023 419.600 ± 48.200 10.000 ± 1.267
MLP 0.495 ± 0.017 0.553 ± 0.122 0.517 ± 0.210 - -
MLP+ 0.518 ± 0.020 0.473 ± 0.117 0.396 ± 0.199 - -
MLPGS 0.648 ± 0.057 0.775 ± 0.032 0.862 ± 0.021 5.800 ± 0.909 5099.067 ± 678.598
MLPGS+ 0.648 ± 0.039 0.702 ± 0.049 0.783 ± 0.049 4.333 ± 0.840 4003.067 ± 628.460
MLPGA 0.503 ± 0.012 0.527 ± 0.119 0.484 ± 0.203 528.200 ± 88.456 9.133 ± 0.824
MLPGA+ 0.503 ± 0.010 0.426 ± 0.116 0.302 ± 0.200 431.867 ± 67.640 10.400 ± 0.843

Table 4.7 – Test results on voice data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.

= 1.0). This suggests that while the use of GA and oversampling has led to a
higher mean MCC, the mere use of an RF may be sufficient as the observed
difference may be due to chance.
Interestingly, Dunn’s test has shown that all systems involving the use of
RF have, together with SVMGS and SVMGS+, achieved a significantly higher
MCC on the confidence level of 99% than SVM, SVMGA, MLP, MLP+,
MLPGA and MLPGA+. Moreover, Dunn’s test has shown that all systems
involving the use of RF have, along with SVMGS and SVMGS+, achieved the
highest MCC with insignificant difference between them (p = 1.0).
While this suggests that the combined use of SVM and GS has the potential
to achieve as high an MCC as any system involving the use of RF, it should be
noted that the systems involving the use of RF are those with the highest MCC,
especially RFGA+.

4.3.2 Accuracy
As shown in table 4.7 and table 4.9, RFGA+ has achieved the highest mean ac-
curacy followed by RF and RFGA with insignificant difference between them
(p = 1.0). This suggests that while the use of GA and oversampling has led to

System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.777e-05 1.000e+00 1.000e+00 1.000e+00 1.163e-06 1.000e+00 3.073e-06 6.957e-05 1.000e+00 1.000e+00 7.497e-06 4.367e-06
RF+ - - 1.000e+00 1.000e+00 1.000e+00 - 1.857e-05 1.000e+00 1.000e+00 1.000e+00 7.469e-07 1.000e+00 1.997e-06 4.710e-05 1.000e+00 1.000e+00 4.928e-06 2.851e-06
RFGS - - - 1.000e+00 - - 6.838e-04 1.000e+00 1.000e+00 1.000e+00 4.028e-05 1.000e+00 9.619e-05 1.542e-03 1.000e+00 1.000e+00 2.133e-04 1.317e-04
RFGS+ - - - - - - 1.267e-03 1.000e+00 1.000e+00 1.000e+00 8.011e-05 1.000e+00 1.873e-04 2.797e-03 1.000e+00 1.000e+00 4.074e-04 2.545e-04
RFGA - - 1.000e+00 1.000e+00 - - 7.000e-05 1.000e+00 1.000e+00 1.000e+00 3.222e-06 1.000e+00 8.278e-06 1.703e-04 1.000e+00 1.000e+00 1.967e-05 1.165e-05
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.110e-06 1.000e+00 1.000e+00 1.000e+00 6.882e-08 1.000e+00 1.959e-07 5.709e-06 1.000e+00 1.000e+00 5.125e-07 2.861e-07
SVM - - - - - - - - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 1.000e+00
SVM+ - - - - - - 1.611e-01 - - - 1.916e-02 - 3.712e-02 2.930e-01 1.000e+00 1.000e+00 6.766e-02 4.707e-02
SVMGS - - - - - - 1.173e-03 1.000e+00 - 1.000e+00 7.352e-05 1.000e+00 1.724e-04 2.597e-03 1.000e+00 1.000e+00 3.758e-04 2.344e-04
SVMGS+ - - - - - - 3.435e-03 1.000e+00 - - 2.443e-04 1.000e+00 5.515e-04 7.312e-03 1.000e+00 1.000e+00 1.161e-03 7.398e-04
SVMGA - - - - - - - - - - - - - - - - - -
SVMGA+ - - - - - - 5.152e-02 1.000e+00 - - 5.200e-03 - 1.058e-02 9.852e-02 1.000e+00 1.000e+00 2.019e-02 1.366e-02
MLP - - - - - - - - - - 1.000e+00 - - - - - - -
MLP+ - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 1.000e+00
MLPGS - - - - - - 2.782e-01 - - - 3.595e-02 - 6.796e-02 4.932e-01 - - 1.210e-01 8.539e-02
MLPGS+ - - - - - - 1.794e-01 - - - 2.168e-02 - 4.182e-02 3.247e-01 1.000e+00 - 7.588e-02 5.293e-02
MLPGA - - - - - - - - - - 1.000e+00 - 1.000e+00 - - - - -
MLPGA+ - - - - - - - - - - 1.000e+00 - 1.000e+00 - - - 1.000e+00 -

Table 4.8 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same MCC. The symbol "-"
is used when system A does not have a higher mean value than system B. A
p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.928e-02 5.903e-02 1.000e+00 1.000e+00 1.812e-02 1.509e-01 1.616e-03 3.745e-05 1.000e+00 3.658e-02 5.465e-04 1.044e-05
RF+ - - 1.000e+00 1.000e+00 - - 1.736e-02 5.351e-02 1.000e+00 1.000e+00 1.631e-02 1.377e-01 1.434e-03 3.262e-05 1.000e+00 3.306e-02 4.824e-04 9.038e-06
RFGS - - - 1.000e+00 - - 2.445e-01 6.245e-01 1.000e+00 1.000e+00 2.321e-01 1.000e+00 2.958e-02 1.122e-03 1.000e+00 4.187e-01 1.161e-02 3.644e-04
RFGS+ - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 5.280e-01 3.558e-02 1.000e+00 1.000e+00 2.465e-01 1.384e-02
RFGA - 1.000e+00 1.000e+00 1.000e+00 - - 1.439e-02 4.490e-02 1.000e+00 1.000e+00 1.351e-02 1.168e-01 1.159e-03 2.550e-05 1.000e+00 2.759e-02 3.860e-04 6.994e-06
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 2.124e-03 7.469e-03 1.000e+00 4.268e-01 1.982e-03 2.161e-02 1.345e-04 2.143e-06 1.000e+00 4.357e-03 4.061e-05 5.317e-07
SVM - - - - - - - 1.000e+00 - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGS - - - 1.000e+00 - - 2.925e-01 7.369e-01 - 1.000e+00 2.778e-01 1.000e+00 3.641e-02 1.434e-03 1.000e+00 4.970e-01 1.446e-02 4.715e-04
SVMGS+ - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 5.400e-01
SVMGA - - - - - - - 1.000e+00 - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
MLP - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - 1.000e+00
MLPGS - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 8.487e-01 6.365e-02 - 1.000e+00 4.092e-01 2.560e-02
MLPGS+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -

Table 4.9 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same accuracy. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

higher mean accuracy, the mere use of an RF may be sufficient as the observed
difference may be due to chance.
While there is no significant difference between SVMGS, SVMGS+ and
the systems involving the use of RF on the confidence level of 90%, it should
be noted that RFGA+ is the only system that has achieved a significantly higher
accuracy than 8 other systems on the confidence level of 99%, and than one
more system on the confidence level of 95% (see table 4.9 for the exact systems
and the corresponding p-values).

4.3.3 F1 Score
As shown in table 4.7 and table 4.10, RF has achieved the highest mean F1
score followed by RFGA+ and RFGA with insignificant difference between
them (p = 1.0). This suggests that while the use of GA and oversampling has

System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 4.063e-01 3.805e-02 1.000e+00 1.052e-01 5.049e-01 2.350e-02 2.742e-02 4.945e-04 1.000e+00 3.294e-03 7.505e-03 9.694e-05
RF+ - - 1.000e+00 1.000e+00 - - 5.402e-01 5.341e-02 1.000e+00 1.444e-01 6.676e-01 3.331e-02 3.876e-02 7.542e-04 1.000e+00 4.854e-03 1.089e-02 1.520e-04
RFGS - - - 1.000e+00 - - 1.000e+00 3.776e-01 1.000e+00 8.892e-01 1.000e+00 2.507e-01 2.860e-01 8.936e-03 1.000e+00 4.647e-02 9.449e-02 2.137e-03
RFGS+ - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 4.094e-01 - 1.000e+00 1.000e+00 1.339e-01
RFGA - 1.000e+00 1.000e+00 1.000e+00 - - 1.783e-01 1.437e-02 1.000e+00 4.226e-02 2.250e-01 8.630e-03 1.016e-02 1.484e-04 1.000e+00 1.087e-03 2.587e-03 2.696e-05
RFGA+ - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 8.895e-02 6.357e-03 1.000e+00 1.963e-02 1.136e-01 3.732e-03 4.426e-03 5.460e-05 1.000e+00 4.313e-04 1.063e-03 9.334e-06
SVM - - - - - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVM+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGS - - - 1.000e+00 - - 1.000e+00 3.791e-01 - 8.924e-01 1.000e+00 2.518e-01 2.871e-01 8.980e-03 1.000e+00 4.668e-02 9.489e-02 2.149e-03
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA - - - - - - 1.000e+00 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA+ - - - - - - - 1.000e+00 - - - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
MLP - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - 1.000e+00
MLPGS - - - 1.000e+00 - - 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.021e-01 - 4.190e-01 7.643e-01 2.954e-02
MLPGS+ - - - - - - - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -

Table 4.10 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same F1 score. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

led to a higher mean F1 score, the mere use of an RF may be sufficient as far
as the F1 score is concerned.
Nevertheless, it should be noted that while not being the system with the high-
est F1 score, RFGA+ has a narrower confidence interval than that system while
also having the highest lower bound of the F1 confidence interval. In particular,
the F1 confidence interval of RFGA+ ranged from 0.878 to 0.905 whereas that
of the system achieving the highest mean F1 ranged from 0.872 to 0.912.
Also, it should be noted that when it comes to the number of systems over
which each system has achieved a significantly higher F1 score, RFGA+ was
the one with the highest number regardless of whether the significance is mea-
sured on the confidence level of 90%, 95% or 99% (see table 4.10 for the exact
systems and the corresponding p-values).

4.3.4 Selected Features


As shown in table 4.7, the system in which the application of a feature selection
algorithm has led to the least number of features was SVMGS, followed by
SVMGS+ and RFGS with insignificant difference between them (p = 1.0).
Indeed, the p-values show that the only cases where a significant difference
can be observed on the confidence level of 99%, and thereby also 95% and
90%, are when a system utilising GS is compared with a system utilising GA:
regardless of which such pair of systems we compare, the p-value correspond-
ing to the null hypothesis that the system with GS and the system with GA
have the same number of selected features has been less than 0.1, often even
less than 0.01 (see table 4.11 for the exact p-values). This suggests that GS
indeed tends to lead to a lower number of features than GA and that, while
the choice of classifier and sampling method appears to have an insignificant
impact on the resulting number of selected features, the choice of feature se-
lection algorithm has a certain impact.

System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - - 1.489e-06 8.590e-06 - 1.000e+00 3.318e-02 1.553e-04 1.000e+00 - 1.797e-05 4.691e-04
RFGS+ - - 1.387e-06 8.032e-06 - 1.000e+00 3.169e-02 1.462e-04 1.000e+00 - 1.683e-05 4.428e-04
RFGA - - - - - - - - - - - -
RFGA+ - - 1.000e+00 - - - - - - - 1.000e+00 -
SVMGS 1.000e+00 1.000e+00 1.155e-06 6.754e-06 - 1.000e+00 2.814e-02 1.251e-04 1.000e+00 - 1.421e-05 3.814e-04
SVMGS+ - - 6.625e-06 3.526e-05 - - 8.632e-02 5.528e-04 1.000e+00 - 7.122e-05 1.574e-03
SVMGA - - 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA+ - - 1.000e+00 1.000e+00 - - - - - - 1.000e+00 1.000e+00
MLPGS - - 2.830e-05 1.388e-04 - - 2.139e-01 1.882e-03 - - 2.705e-04 5.051e-03
MLPGS+ 1.000e+00 1.000e+00 8.332e-08 5.564e-07 1.000e+00 1.000e+00 4.941e-03 1.305e-05 1.000e+00 - 1.241e-06 4.386e-05
MLPGA - - 1.000e+00 - - - - - - - - -
MLPGA+ - - 1.000e+00 1.000e+00 - - - - - - 1.000e+00 -

Table 4.11 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same number of selected
features. The symbol "-" is used when system A does not have a lower mean
value than system B. A p-value is written in red if p ≤ 0.01 but orange if
p ∈ (0.01, 0.05], yellow if p ∈ (0.05, 0.1] and black if p > 0.1.

4.3.5 Calls

System B
RFGS RFGS+ RFGA RFGA+ SVMGS SVMGS+ SVMGA SVMGA+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RFGS - 1.000e+00 - - - 1.000e+00 - - 1.000e+00 - - -
RFGS+ - - - - - 1.000e+00 - - 1.000e+00 - - -
RFGA 4.709e-05 5.017e-05 - - 5.901e-05 1.158e-05 1.000e+00 1.000e+00 2.581e-06 5.041e-04 1.000e+00 1.000e+00
RFGA+ 3.820e-05 4.072e-05 1.000e+00 - 4.795e-05 9.298e-06 1.000e+00 1.000e+00 2.050e-06 4.167e-04 1.000e+00 1.000e+00
SVMGS 1.000e+00 1.000e+00 - - - 1.000e+00 - - 1.000e+00 - - -
SVMGS+ - - - - - - - - 1.000e+00 - - -
SVMGA 9.034e-06 9.661e-06 - - 1.147e-05 2.050e-06 - 1.000e+00 4.209e-07 1.118e-04 1.000e+00 1.000e+00
SVMGA+ 7.050e-04 7.462e-04 - - 8.629e-04 1.999e-04 - - 5.155e-05 5.836e-03 - 1.000e+00
MLPGS - - - - - - - - - - - -
MLPGS+ 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00 - - 1.000e+00 - - -
MLPGA 1.109e-04 1.179e-04 - - 1.379e-04 2.848e-05 - 1.000e+00 6.636e-06 1.097e-03 - 1.000e+00
MLPGA+ 7.799e-03 8.201e-03 - - 9.326e-03 2.546e-03 - - 7.583e-04 5.009e-02 - -

Table 4.12 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that system A and B have the same number of CV calls.
The symbol "-" is used when system A does not have a lower mean value than
system B. A p-value is written in red if p ≤ 0.01 but orange if p ∈ (0.01, 0.05],
yellow if p ∈ (0.05, 0.1] and black if p > 0.1.

Just as the systems leading to the lowest number of selected features are
dominated by those involving the use of GS, the systems leading to the highest
number of CV calls also happen to be those involving the use of GS. Indeed,
while GS has called thousands of cross-validations on average, GA has called
only around ten cross-validations on average. This is true regardless of which
classifier and sampling method each of them has been combined with (see
table 4.7).

Also, while there is no significant difference in the number of cross-validation
calls amongst the systems utilising the same feature selection algorithm, there
is a significant difference between those utilising GS and those utilising GA.
Indeed, regardless of which such pair of systems we compare, the p-value
corresponding to the null hypothesis that the system with GS and the system
with GA have the same number of CV calls has been less than 0.1, often even
less than 0.01 (see table 4.12 for the exact p-values). This suggests that GS
tends to check more sets of features than GA.
Furthermore, as with the drawing data, the number of selected features appears
to be proportional to the number of CV calls amongst the systems involving GS,
but not amongst those involving GA. As shown in table 4.7, the order in which
the GS-based systems rank by lowest number of features is also the order in
which they rank by lowest number of CV calls. This, however, is not the case
when GS is replaced by GA.

Chapter 5

Discussion

5.1 Greedy Search Versus Genetic Algorithm


Since both GS and GA were designed to be conducted over several iterations of
cross-validation, it is no surprise that the search time was significantly
higher than the training time on the same data. Similarly, since each
iteration of GS requires looping through all remaining features, while GA was
designed to start from three solutions and then produce no more than 1.5 times
the remaining solutions while only keeping the best ones at the end of each
iteration, it makes sense that the genetic algorithm took less time to
converge: it looks at fewer solutions per iteration and only at combinations
generated from the best solutions found so far.
Similarly, since GS loops through all remaining features each time an attempt
is made to add a new feature, it is no surprise that it resulted in fewer CV
calls in systems where it led to a lower number of features. Since this is not
the case with GA, it is no surprise that the systems utilising GA did not
follow a similar pattern.
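The relationship between the number of candidate features and the number of CV calls under GS can be illustrated with a small sketch. The following is a toy Python version of greedy forward selection; the `cv_score` function is a placeholder standing in for the real cross-validated classifier score used in the thesis, and the "informative" feature set is invented purely for illustration.

```python
def cv_score(feature_subset):
    # Placeholder for a k-fold cross-validation score; in the thesis this
    # would train e.g. an RF/SVM/MLP on the selected columns. Here a toy
    # score rewards a particular (hypothetical) informative feature set
    # and slightly penalises subset size.
    informative = {0, 3, 7}
    return len(informative & set(feature_subset)) - 0.01 * len(feature_subset)

def greedy_forward_selection(n_features):
    """Add one feature at a time, looping over ALL remaining features on
    every iteration -- which is why the number of CV calls grows quickly
    with the number of candidate features."""
    selected, best_score, cv_calls = [], float("-inf"), 0
    improved = True
    while improved:
        improved = False
        best_candidate = None
        for f in range(n_features):
            if f in selected:
                continue
            cv_calls += 1                      # one CV run per candidate
            score = cv_score(selected + [f])
            if score > best_score:
                best_score, best_candidate, improved = score, f, True
        if improved:
            selected.append(best_candidate)
    return selected, cv_calls
```

With 10 candidate features, the search above evaluates 10 + 9 + 8 + 7 subsets before stopping, which mirrors why the GS-based systems dominate the CV-call counts in table 4.7.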

5.2 The Best System


As regards the best performing system, it should be noted that on both the
voice and drawing data, RFGA+ achieved the highest mean MCC and accuracy.
While RFGA+ has not offered a significantly higher MCC than all systems,
regardless of whether the significance is measured at a confidence level of
90%, 95% or 99%, it should be noted that RFGA+ has achieved a significantly
higher MCC than several other systems on both the voice and drawing data at a
confidence level of 99%. This shows that RFGA+ is the best performing system
in terms of MCC and that one can be 99% certain that, in another trial, RFGA+
will provide a higher MCC than several of the systems tested in this thesis.
Similarly, while RFGA+ has not offered a significantly higher accuracy than
all systems regardless of whether the significance is measured at a confidence
level of 90%, 95% or 99%, it should be noted that RFGA+ has achieved a
significantly higher accuracy than several other systems on both the voice and
drawing data at a confidence level of 99%. This shows that RFGA+ is also the
best performing system in terms of accuracy and that one can be 99% certain
that, in another trial, RFGA+ will provide a higher accuracy than several of
the systems tested in this thesis.
While not having the highest mean F1 score on the voice data, RFGA+ has
offered the highest mean F1 score on the drawing data and, even on the voice
data, RFGA+ has achieved the highest lower bound of the F1 score confidence
interval.
More importantly, regardless of whether we consider the drawing or the voice
data, no system has offered a significantly higher score than RFGA+ in any
metric at a confidence level of 90%.
Again, it should be noted that precision and the recall rate do not take the
distribution of the data into account, while MCC does. Hence, given that the
data sets used in this study are imbalanced and that RFGA+ is the system with
the highest MCC on both the drawing and the voice data, RFGA+ should be the
one to go for as far as prediction performance is concerned.
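Why MCC exposes the imbalance problem while accuracy does not can be seen directly from its definition. The sketch below computes MCC from the confusion-matrix counts in plain Python; the toy labels are illustrative and not taken from the thesis data.

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0 when the denominator is 0 (e.g. a classifier that only
    ever predicts one class), following the usual convention."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# An imbalanced toy set: 9 PD samples (1) and 1 healthy sample (0).
y_true = [1] * 9 + [0]
always_pd = [1] * 10          # majority-class classifier
accuracy = sum(t == p for t, p in zip(y_true, always_pd)) / len(y_true)
# accuracy is 0.9, yet MCC is 0: MCC exposes the imbalance problem.
```

A classifier that always predicts the majority class thus looks strong on accuracy but scores 0 on MCC, which is why MCC is used here as the primary metric for the imbalanced data sets.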
As regards why RFGA+ was the best performing one, the reasons could be the
following:
1. RF is a decision-tree-based classifier following the logic of a typical
clinical diagnosis, where the diagnosis is made based on a decision-tree-
like thought map;
2. RF is an ensemble classifier, where the use of multiple weaker classifiers
makes it stronger than a non-ensemble classifier;
3. RF is better suited to GA than to GS, as it can handle several features
while GS tends to get stuck at a local optimum in its attempt to provide
good enough performance with as few features as possible;
4. The data sets used in this study are imbalanced, meaning that, without
the use of an under-/oversampling method like random oversampling,
the system may tend to identify a sample as the majority class.
The first and second statements are supported by the observation that, for
both the drawing and the voice data, most systems utilising RF have provided
a significantly higher MCC, accuracy and F1 score than their corresponding
versions where RF was replaced by SVM or MLP (p < 0.1). Indeed, in systems
involving no random oversampling or feature selection, the mere use of RF
instead of SVM or MLP has led to a higher score in almost all metrics on both
the drawing and voice data (see tables 4.1 and 4.7), especially for MCC,
accuracy and F1 score, where the use of RF instead of SVM or MLP has often
led to a significantly higher score at a confidence level of 95% (see
tables 4.2, 4.3, 4.4, 4.8, 4.9 and 4.10).
Since RF is both the only decision-tree-based classifier and the only
ensemble classifier tested, the above-mentioned observations cannot tell the
first two statements apart; they can only verify that at least one of them
must be true. Hence, for further investigation, one may want to compare RF
with another classifier that is either decision-tree-based or ensemble-based,
but not both.
The third statement is supported by the observation that the system involving
RF and GA has led to a higher score than the corresponding system where GA was
replaced by GS in all metrics on both the voice and drawing data. To prove
this, however, more thorough testing would be needed, as no significant
difference was observed in this thesis to support this statement.
Similarly, the fourth statement is supported by the observation that, with
everything else equal, the use of random oversampling has led to a higher MCC
in a majority of the cases, especially on the drawing data. Indeed, as one can
see in table 4.1, all but one system involving random oversampling achieved a
higher MCC on the drawing data than their corresponding versions without
random oversampling. This, together with the fact that the drawing data is
more imbalanced than the voice data (see section 3.1), suggests that RFGA+’s
superior performance may have partially come from random oversampling’s
ability to mitigate the class imbalance problem. Again, this needs to be
verified through more thorough testing, as no significant difference was
observed in this thesis to support this statement.
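As a rough illustration of what the "+" systems do, the following is a minimal Python sketch of random oversampling. The exact implementation in the experiments may differ in details such as the random source; the key point, noted in the comment, is that only the training split should ever be oversampled.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes
    are equally represented. Only the training split should be
    oversampled; oversampling before the train/test split leaks
    duplicated samples into the evaluation."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(samples) for samples in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for xi in samples + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out
```

Because the duplicates are exact copies, this simple scheme also illustrates the over-fitting risk mentioned in section 6.2, which more sophisticated methods such as SMOTE try to avoid.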

5.3 Alternatives to the Best System


Given GA’s tendency to choose a significantly higher number of features than
GS, one may want to use GS as a way to speed up the prediction while lowering
the overall resource consumption in the long run. If so, one may want to use
SVMGS, the system providing the highest mean MCC and accuracy amongst those
utilising GS on both the drawing and voice data (see tables 4.1 and 4.7).
Indeed, SVMGS is one of the few systems over which the system with the highest
MCC has not achieved a significantly higher MCC at a confidence level of 90%,
both on the drawing and the voice data (see tables 4.2 and 4.8). Hence, one
may want to use SVMGS for a faster prediction and lower overall resource
consumption while still having the possibility to reach as high a predictive
performance as the system with the highest MCC.
As regards whether it is worth the time and computational resources to include
feature selection for higher predictive performance, it may be important to
note that, amongst the systems not utilising feature selection, RF+ was the
one with the highest MCC on the drawing data whereas RF was the one with the
highest MCC on the voice data. In particular, since RFGA+ did not provide a
significantly higher MCC than RF+ or RF on either the drawing or the voice
data (p = 1.0), one may want to use RF+ for the drawing data and RF for the
voice data to avoid the need for feature selection. Nevertheless, it should be
noted that while feature selection is a process that takes time and resources
to run, it can lead to faster prediction and lower resource consumption by
requiring fewer features for the prediction. Hence, even if the system cannot
provide higher predictive performance with the help of feature selection,
feature selection may still be worth the time and resources.
Similarly, as regards whether it is worth the time and computational resources
to include random oversampling for higher predictive performance, it may be
important to note that, amongst the systems not utilising random oversampling,
RF was the one with the highest MCC on both the drawing and voice data. Again,
given that RFGA+ did not provide a significantly higher MCC than RF on either
data set at a confidence level of 90% (p = 1.0), one may want to use RF to
avoid random oversampling.

5.4 Ethics, Economics and Sustainability


Since this thesis was conducted using published data, the data collection in
this thesis may be considered free of ethical and/or privacy issues. While one
may argue that research on automatic diagnosis risks removing human labour
from the process, one should remember that the aim of this thesis is to reduce
the burden on already overloaded hospitals while making it possible for the
general public to get a diagnosis earlier without having to worry about queues
or taking up the doctor’s time. Hence, the automatic diagnosis of Parkinson’s
disease should instead be viewed as a way to improve healthcare by freeing up
resources for other medical activities and by allowing for earlier diagnosis
before it is too late.
In fact, given that an automatic diagnosis of PD is likely much cheaper than a
traditional diagnosis of PD, this study may be motivated as a way to help the
general public save money that could otherwise be spent on treatment and other
medical activities.
More importantly, the removal of the doctor consultancy fee and the waiting
time may motivate patients to get diagnosed earlier and thereby get treatment
earlier, for better health and productivity.
Also, one should remember that the ability to get treatment earlier will
likely help patients avoid unnecessary doctor visits and treatments in the
future. This, in turn, may help the general public save money and time, as
doctor visits and treatments are both costly and time-consuming.
As regards sustainability in terms of environmental impact, it should be noted
that the automatic diagnosis of Parkinson’s disease has the benefit of
allowing the general public to test themselves from home instead of having to
travel to the hospital. This would make society more sustainable by reducing
emissions. Indeed, even though automatic diagnosis is a resource-consuming
activity with a negative impact on sustainability, the negative impact caused
by a single test is likely far smaller than that caused by a traditional
diagnosis of Parkinson’s disease, where patients have to travel to and from
the hospital only to find out that they do not have Parkinson’s disease.
More importantly, even if the convenience enabled by automatic diagnosis leads
to people testing for Parkinson’s disease more often than they do today, one
should remember that it is better to test too often than not to test at all,
especially when the test does not require the involvement of a doctor who
already has too much on his/her plate.
Nevertheless, one may want to investigate ways to make the automatic diagnosis
more sustainable by, for instance, measuring the energy consumption of the
different digital tools that can be used for the diagnosis and thereby
arriving at a recommendation. Similarly, one may want to investigate the
computational complexity of the systems tested in this study and thereby
conclude whether one should go for the system with the highest predictive
performance in terms of MCC and accuracy (i.e. RFGA+).

5.5 Potential Parties of Interest


Other than the thesis provider, this thesis may be of interest to healthcare
professionals: if they perceive the proposed system as capable of identifying
Parkinson’s disease with satisfying performance, they can propose it to their
hospital as a way to provide a free screening test for the general public. At
the same time as this gives healthcare professionals more time for other
activities, it also means that more PD patients will get treatment earlier, as
an early diagnosis is now possible and as it is now less tempting to delay the
diagnosis, the diagnosis being free and less time-consuming. This should be of
interest to all medical professionals, as earlier treatment is essential.
This thesis may also be of interest to healthcare professionals as a decision
basis of what algorithms to use for the diagnosis of PD and how reliable the
algorithms are in terms of accuracy, precision and recall rate.
This thesis may also be of interest to the general public, since Parkinson’s
disease is a common disease that can lead to disability. The general public
may thus be interested in this thesis as a way to gain knowledge of a disease
that they or their close ones may be suffering from, to understand how this
disease can be detected through automatic diagnosis, and to get an overview of
the reliability of such a diagnosis.
This thesis may also be of interest to machine learning researchers as this
is a study about the automatic diagnosis of a disease using machine learning.
In particular, this thesis may be of interest to those working with automation
and classification problems as this study involves the automatic diagnosis of a
disease as a classification problem.

Chapter 6

Conclusions and Future work

6.1 Conclusion
In conclusion, while none of the systems in this thesis has shown a
significantly higher performance than all other systems in any metric, it can
be stated that, amongst the systems tested in this thesis, the best system for
the diagnosis of PD appears to be RFGA+, a combination of RF, GA and random
oversampling.

6.2 Limitation and Suggestion of Future Work


Due to time constraints, we were not able to get the data from the project
provider. Therefore, we had to use secondary data. This implied restricted
control over the data as well as limited information about it. For instance,
not knowing how the researchers determined whether a participant had PD makes
it hard for us to determine whether the healthy participants truly do not
suffer from PD and whether those labelled with PD truly have PD. After all,
some "healthy" participants may have PD whose symptoms were simply
undetectable by human observation. Similarly, one should not forget about
possible misdiagnoses of PD, which are hard for us to estimate since we do not
know how the grouping was made. That is, while one can assume that the risk of
false grouping is negligible given that the data was provided by a medical
faculty, having access to primary data is still preferred to ensure proper
grouping while enabling an analysis of the validity of the original grouping.
As mentioned by [65], PD may affect the patient on only one side of the body.
This means that a drawing exam may not be enough to capture the PD-related
symptoms of a PD patient whose PD affects the side of the brain controlling
the passive hand. Hence, if a PD patient in the data set happens to have PD
affecting only his/her passive side, then it would be wrong to use the
drawings made by his/her dominant hand for the prediction of PD or the
assessment of the model. This problem could, for instance, be solved by
registering whether the PD patients have PD affecting their dominant side
and/or by asking them to draw with both hands while registering which drawing
was made by the dominant hand. This is thus another area to explore.
Since the data we could find contains either speech data only or drawing data
only, we were also not able to test how the combination of speech and drawing
data would impact the performance of each model. While we could have created
new PD and healthy participants by combining the speech and drawing data,
doing so would not reflect what the performance would look like in real life,
where we are to identify PD using data from the same person. Therefore, we
leave this as a suggestion for future research: to collect voice and drawing
data from the same participants and test our system on that data.
Moreover, a larger dataset could be gathered to verify whether our proposed
system indeed is the best. After all, several systems have shown similar
results without any significant difference between them. That is, by testing
our implementations on more data, one may find a clearer difference between
the combinations.
Furthermore, since no hyper-parameter tuning was done in this thesis, a
suggestion of future work could be to find the best hyperparameters for the
proposed system.
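As a starting point for such future work, an exhaustive grid search could look like the sketch below. The parameter names mimic scikit-learn’s RandomForestClassifier but are only illustrative, and `evaluate` is a placeholder standing in for a cross-validated score of the proposed system under the given hyper-parameters.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Evaluate every hyper-parameter combination and keep the best one.
    `evaluate` stands in for a cross-validated score of the system
    (e.g. RFGA+) trained with the given parameters."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical RF grid; any evaluator could be plugged in.
grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
toy_eval = lambda p: (p["n_estimators"] == 100) + (p["max_depth"] == 5)
```

Note that each combination costs one full cross-validation, so the grid size multiplies the already considerable search cost discussed in section 5.1; nested cross-validation would additionally be needed to keep the tuning unbiased.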
Last but not least, the thesis has shown that while random oversampling may
help solve the problem of data imbalance, higher performance may be achieved
through the use of a more sophisticated oversampling method that takes into
account the over-fitting problem faced by random oversampling. This could be
another area to explore.

References

[1] M. Memedi, A. Sadikov, V. Groznik, J. Žabkar, M. Možina, F. Bergquist,
A. Johansson, D. Haubenberger, and D. Nyholm, “Automatic spiral analysis
for objective assessment of motor symptoms in parkinson’s disease,”
Sensors, vol. 15, no. 9, pp. 23 727–23 744, 2015.

[2] D. Gupta, S. Sundaram, A. Khanna, A. E. Hassanien, and V. H. C.
De Albuquerque, “Improved diagnosis of parkinson’s disease using
optimized crow search algorithm,” Computers & Electrical Engineering,
vol. 68, pp. 412–424, 2018.

[3] B. E. Sakar, M. E. Isenkul, C. O. Sakar, A. Sertbas, F. Gurgen, S. Delil,
H. Apaydin, and O. Kursun, “Collection and analysis of a parkinson
speech dataset with multiple types of sound recordings,” IEEE Journal
of Biomedical and Health Informatics, vol. 17, no. 4, pp. 828–834, 2013.

[4] J. Golze, S. Zourlidou, and M. Sester, “Traffic regulator detection
using gps trajectories,” KN-Journal of Cartography and Geographic
Information, vol. 70, no. 3, pp. 95–105, 2020.

[5] K. A. A. Kamarulzaini, N. Ismail, M. H. F. Rahiman, M. N. Taib,
N. A. M. Ali, and S. N. Tajuddin, “Evaluation of rbf and mlp in
svm kernel tuned parameters for agarwood oil quality classification,” in
2018 IEEE 14th International Colloquium on Signal Processing & Its
Applications (CSPA). IEEE, 2018, pp. 250–254.

[6] A. E. Faghfouri and M. B. Frish, “Robust discrimination of human
footsteps using seismic signals,” in Unattended Ground, Sea, and Air
Sensor Technologies and Applications XIII, vol. 8046. International
Society for Optics and Photonics, 2011, p. 80460D.

[7] G. Varoquaux, P. R. Raamana, D. A. Engemann, A. Hoyos-Idrobo,
Y. Schwartz, and B. Thirion, “Assessing and tuning brain decoders:
Cross-validation, caveats, and guidelines,” NeuroImage, vol. 145, pp.
166–179, 2017.

[8] S. M. Ross, Introductory Statistics. London: Academic Press, 2017.

[9] M. Isenkul, B. Sakar, and O. Kursun, “Improved spiral test using
digitized graphics tablet for monitoring parkinson’s disease,” pp. 171–
5, 2014.

[10] E. R. Dorsey, A. Elbaz, E. Nichols, F. Abd-Allah, A. Abdelalim, J. C.
Adsuar, M. G. Ansha, C. Brayne, J.-Y. J. Choi, D. Collado-Mateo et al.,
“Global, regional, and national burden of parkinson’s disease, 1990-
2016: a systematic analysis for the global burden of disease study 2016,”
The Lancet Neurology, vol. 17, no. 11, pp. 939–953, 2018.

[11] M. J. Abdulaal, A. J. Casson, and P. Gaydecki, “Performance of nested
vs. non-nested svm cross-validation methods in visual bci: Validation
study,” in 2018 26th European Signal Processing Conference (EUSIPCO).
IEEE, 2018, pp. 1680–1684.

[12] D. Chicco and G. Jurman, “The advantages of the matthews correlation
coefficient (mcc) over f1 score and accuracy in binary classification
evaluation,” BMC genomics, vol. 21, no. 1, pp. 1–13, 2020.

[13] S. Aghanavesi, D. Nyholm, M. Senek, F. Bergquist, and M. Memedi,
“A smartphone-based system to quantify dexterity in parkinson’s disease
patients,” Informatics in Medicine Unlocked, vol. 9, pp. 11–17, 2017.

[14] Cleveland Clinic, “Parkinson’s disease,” Jan 2020. [Online]. Available:
https://my.clevelandclinic.org/health/diseases/8525-parkinsons-disease-an-overview

[15] O.-B. Tysnes and A. Storstein, “Epidemiology of parkinson’s disease,”
Journal of Neural Transmission, vol. 124, no. 8, pp. 901–905, 2017.

[16] L. Raiano, G. di Pino, L. di Biase, M. Tombini, N. L. Tagliamonte,
and D. Formica, “Pdmeter: A wrist wearable device for an at-home
assessment of the parkinson’s disease rigidity,” IEEE Transactions on
Neural Systems and Rehabilitation Engineering, vol. 28, no. 6, pp. 1325–
1333, 2020.

[17] B. Palakurthi and S. P. Burugupally, “Postural instability in parkinson’s
disease: A review,” Brain sciences, vol. 9, no. 9, p. 239, 2019.

[18] C. Schlenstedt, K. Boße, O. Gavriliuc, R. Wolke, O. Granert, G. Deuschl,
and N. G. Margraf, “Quantitative assessment of posture in healthy
controls and patients with parkinson’s disease,” Parkinsonism & related
disorders, vol. 76, pp. 85–90, 2020.

[19] A. Mirelman, P. Bonato, R. Camicioli, T. D. Ellis, N. Giladi,
J. L. Hamilton, C. J. Hass, J. M. Hausdorff, E. Pelosin, and Q. J.
Almeida, “Gait impairments in parkinson’s disease,” The Lancet
Neurology, vol. 18, no. 7, pp. 697–708, 2019.

[20] H. Gunduz, “Deep learning-based parkinson’s disease classification
using vocal feature sets,” IEEE Access, vol. 7, pp. 115 540–115 551,
2019.

[21] S. Skodda, W. Grönheit, N. Mancinelli, and U. Schlegel, “Progression
of voice and speech impairment in the course of parkinson’s disease: a
longitudinal study,” Parkinson’s disease, vol. 2013, 2013.

[22] E. Tolosa, G. Wenning, and W. Poewe, “The diagnosis of parkinson’s
disease,” The Lancet Neurology, vol. 5, pp. 75–86, 2006.

[23] T. Patel and F. Chang, “Practice recommendations for parkinson’s
disease: assessment and management by community pharmacists,”
Canadian Pharmacists Journal (Ott), vol. 148, 2015.

[24] H.-I. Ma, W.-J. Hwang, S.-H. Chang, and T.-Y. Wang, “Progressive
micrographia shown in horizontal, but not vertical, writing in parkinson’s
disease,” Behavioural neurology, vol. 27, no. 2, pp. 169–174, 2013.

[25] C. Poon, N. Gorji, M. Latt, K. Tsoi, B. Choi, C. Loy, and S. Poon,
“Derivation and analysis of dynamic handwriting features as clinical
markers of parkinson’s disease,” in Proceedings of the 52nd Hawaii
International Conference on System Sciences, 2019, pp. 3721–3730.

[26] L. Naranjo, C. J. Perez, J. Martin, and Y. Campos-Roca, “A two-stage
variable selection and classification approach for parkinson’s disease
detection by using voice recording replications,” Computer Methods and
Programs in Biomedicine, vol. 142, pp. 147–156, 2017.

[27] V. Kumar and S. Minz, “Feature selection: A literature review,” Smart
Computing Review, vol. 4, 2014.

[28] I. Tsamardinos, G. Borboudakis, P. Katsogridakis, P. Pratikakis, and
V. Christophides, “A greedy feature selection algorithm for big data of
high dimensionality,” Machine learning, vol. 108, no. 2, pp. 149–202,
2019.

[29] R. Liu and D. F. Gillies, “Overfitting in linear feature extraction for
classification of high-dimensional image data,” Pattern Recognition,
vol. 53, pp. 73–86, 2016.

[30] J. Lever, M. Krzywinski, and N. Altman, “Points of significance: model
selection and overfitting,” 2016.

[31] B. Venkatesh and J. Anuradha, “A review of feature selection and its
methods,” Cybernetics and Information Technologies, vol. 19, no. 1, pp.
3–26, 2019.

[32] N. Rachburee and W. Punlumjeak, “A comparison of feature selection
approach between greedy, ig-ratio, chi-square, and publisher in
educational mining,” pp. 420–424, 2015.

[33] T. Zhang, “Adaptive forward-backward greedy algorithm for learning
sparse representations,” IEEE Transactions on Information Theory,
vol. 57, no. 7, pp. 4689–4708, 2011.

[34] S. Mirjalili, “Genetic algorithm,” in Evolutionary algorithms and neural
networks. Springer, 2019, pp. 43–55.

[35] E. Wijanarko and H. Grandis, “Binary coded genetic algorithm (bcga)
with multi-point cross-over for magnetotelluric (mt) 1d data inversion,”
in IOP Conference Series: Earth and Environmental Science, vol. 318,
no. 1. IOP Publishing, 2019, p. 012029.

[36] H. Vafaie, I. F. Imam et al., “Feature selection methods: genetic
algorithms vs. greedy-like search,” in Proceedings of the international
conference on fuzzy and intelligent control systems, vol. 51, 1994, p. 28.

[37] B. W. Yap, K. Abd Rani, H. A. Abd Rahman, S. Fong, Z. Khairudin,
and N. N. Abdullah, “An application of oversampling, undersampling,
bagging and boosting in handling imbalanced datasets,” in Proceedings
of the first international conference on advanced data and information
engineering (DaEng-2013). Springer, 2014, pp. 13–22.

[38] G. Douzas, F. Bacao, and F. Last, “Improving imbalanced learning
through a heuristic oversampling method based on k-means and smote,”
Information Sciences, vol. 465, pp. 1–20, 2018.

[39] K. Cheng, C. Zhang, H. Yu, X. Yang, H. Zou, and S. Gao, “Grouped
smote with noise filtering mechanism for classifying imbalanced data,”
IEEE Access, vol. 7, pp. 170 668–170 681, 2019.

[40] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–
32, 2001.

[41] A. J. Myles, R. N. Feudale, Y. Liu, N. A. Woody, and S. D. Brown, “An
introduction to decision tree modeling,” Journal of Chemometrics: A
Journal of the Chemometrics Society, vol. 18, no. 6, pp. 275–285, 2004.

[42] W. S. Noble, “What is a support vector machine?” Nature biotechnology,
vol. 24, no. 12, pp. 1565–1567, 2006.

[43] F. Nie, W. Zhu, and X. Li, “Decision tree svm: An extension of linear
svm for non-linear classification,” Neurocomputing, vol. 401, pp. 153–
159, 2020.

[44] H. T. Pedro, R. H. Inman, and C. F. Coimbra, “4 - mathematical methods
for optimized solar forecasting,” in Renewable Energy Forecasting, ser.
Woodhead Publishing Series in Energy, G. Kariniotakis, Ed. Woodhead
Publishing, 2017, pp. 111–152. ISBN 978-0-08-100504-0

[45] D. Berrar, “Cross-validation,” Encyclopedia of bioinformatics and
computational biology, vol. 1, pp. 542–545, 2019.

[46] G. C. Cawley and N. L. C. Talbot, “On over-fitting in model selection
and subsequent selection bias in performance evaluation,” The Journal
of Machine Learning Research, vol. 11, pp. 2079–2107, 2010. [Online].
Available: https://www.jmlr.org/papers/volume11/cawley10a/cawley10a

[47] R. Sharma, A. V. Nori, and A. Aiken, “Bias-variance tradeoffs in
program analysis,” ACM SIGPLAN Notices, vol. 49, no. 1, pp. 127–137,
2014.

[48] T. Wong and P. Yeh, “Reliable accuracy estimates from k-fold cross
validation,” IEEE Transactions on Knowledge and Data Engineering,
vol. 32, no. 8, pp. 1586–1594, 2020.

[49] J.-B. Du Prel, G. Hommel, B. Röhrig, and M. Blettner, “Confidence
interval or p-value?: part 4 of a series on evaluation of scientific
publications,” Deutsches Ärzteblatt International, vol. 106, no. 19, p.
335, 2009.

[50] D. G. Pereira, A. Afonso, and F. M. Medeiros, “Overview of friedman’s
test and post-hoc analysis,” Communications in Statistics-Simulation and
Computation, vol. 44, no. 10, pp. 2636–2653, 2015.

[51] D. Gupta, A. Julka, S. Jain, T. Aggarwal, A. Khanna, N. Arunkumar, and
V. H. C. de Albuquerque, “Optimized cuttlefish algorithm for diagnosis
of parkinson’s disease,” Cognitive systems research, vol. 52, pp. 36–48,
2018.

[52] P. Sharma, S. Sundaram, M. Sharma, A. Sharma, and D. Gupta,
“Diagnosis of parkinson’s disease using modified grey wolf optimization,”
Cognitive Systems Research, vol. 54, pp. 100–115, 2019.

[53] S. A. Mostafa, A. Mustapha, M. A. Mohammed, R. I. Hamed,
N. Arunkumar, M. K. Abd Ghani, M. M. Jaber, and S. H. Khaleefah,
“Examining multiple feature evaluation and classification methods for
improving the diagnosis of parkinson’s disease,” Cognitive Systems
Research, vol. 54, pp. 90–99, 2019.

[54] C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar,
M. Tutuncu, T. Aydin, M. E. Isenkul, and H. Apaydin, “A comparative
analysis of speech signal processing algorithms for parkinson’s disease
classification and the use of the tunable q-factor wavelet transform,”
Applied Soft Computing, vol. 74, pp. 255–263, 2019.

[55] S. Lahmiri and A. Shmuel, “Detection of parkinson’s disease based
on voice patterns ranking and optimized support vector machine,”
Biomedical Signal Processing and Control, vol. 49, pp. 427–433, 2019.

[56] L. Ali, C. Zhu, M. Zhou, and Y. Liu, “Early diagnosis of parkinson’s
disease from multiple voice recordings by simultaneous sample and
feature selection,” Expert Systems with Applications, vol. 137, pp. 22–
28, 2019.

[57] C. Kotsavasiloglou, N. Kostikis, D. Hristu-Varsakelis, and
M. Arnaoutoglou, “Machine learning-based classification of simple
drawing movements in parkinson’s disease,” Biomedical Signal
Processing and Control, vol. 31, pp. 174–180, 2017.

[58] P. Zham, S. Raghav, P. Kempster, S. Poosapadi Arjunan, K. Wong,
K. J. Nagao, and D. K. Kumar, “A kinematic study of progressive
micrographia in parkinson’s disease,” Frontiers in neurology, vol. 10,
p. 403, 2019.

[59] M. Gil-Martín, J. M. Montero, and R. San-Segundo, “Parkinson’s
disease detection from drawing movements using convolutional neural
networks,” Electronics, vol. 8, no. 8, p. 907, 2019.

[60] N. Al-Yousef, R. Al-Saikhan, R. Al-Gowaifly, R. Al-Abdullatif, F. Al-
Mutairi, and O. Bchir, “Parkinson’s disease diagnosis using spiral test
on digital tablets,” International Journal of Advanced Computer Science
and Applications, vol. 11, no. 5, 2020.

[61] L. C. Ribeiro, L. C. Afonso, and J. P. Papa, “Bag of samplings
for computer-assisted parkinson’s disease diagnosis based on recurrent
neural networks,” Computers in biology and medicine, vol. 115, p.
103477, 2019.

[62] A. Parziale, A. Della Cioppa, R. Senatore, and A. Marcelli, “A decision
tree for automatic diagnosis of parkinson’s disease from offline drawing
samples: experiments and findings,” in International Conference on
Image Analysis and Processing. Springer, 2019, pp. 196–206.

[63] L. S. Bernardo, A. Quezada, R. Munoz, F. M. Maia, C. R. Pereira, W. Wu,
and V. H. C. de Albuquerque, “Handwritten pattern recognition for early
parkinson’s disease diagnosis,” Pattern Recognition Letters, vol. 125, pp.
78–84, 2019.

[64] J. P. Folador, A. Rosebrock, A. A. Pereira, M. F. Vieira, and A. A.
de Oliveira, “Classification of handwritten drawings of people with
parkinson’s disease by using histograms of oriented gradients and the
random forest classifier,” pp. 334–343, 2019.

[65] C. Miller-Patterson, R. Buesa, N. McLaughlin, R. Jones, U. Akbar, and
J. H. Friedman, “Motor asymmetry over time in parkinson’s disease,”
Journal of the neurological sciences, vol. 393, pp. 14–17, 2018.

Appendix A

Additional Test Results

A.1 Friedman’s Test Results

Data     MCC        Accuracy   Precision  Recall     F1         Features   Calls
Drawing  2.936e-25  4.786e-27  4.503e-16  7.230e-27  6.660e-27  2.953e-24  1.595e-23
Voice    3.790e-31  4.911e-24  9.135e-19  1.299e-13  5.369e-20  3.554e-23  1.475e-22

Table A.1 – Friedman test results: p-values corresponding to the null hypothesis
that all systems came from a population with the same distribution.
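The p-values above come from Friedman's test over the repeated evaluation runs. As a minimal sketch of how such p-values can be computed, SciPy's `friedmanchisquare` can be applied to per-run scores of each system; the score matrix below is synthetic illustration data, not the thesis results.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Synthetic per-run scores (one array per system, aligned by run).
# These numbers are illustrative only, not taken from the thesis.
rng = np.random.default_rng(0)
runs = 15
system_a = rng.normal(0.90, 0.02, runs)  # e.g. per-run MCC of one system
system_b = rng.normal(0.85, 0.02, runs)
system_c = rng.normal(0.70, 0.02, runs)

# Friedman's test: null hypothesis that all systems come from
# a population with the same distribution.
stat, p_value = friedmanchisquare(system_a, system_b, system_c)
print(f"chi-square = {stat:.3f}, p = {p_value:.3e}")
```

With clearly separated systems such as these, the test rejects the null hypothesis, which is what motivates the pairwise post-hoc (Dunn's) tests that follow.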

A.2 Precision, Recall Rate and Dunn’s Test Results
A.2.1 Drawing
Overall Result
Precision
As shown in table 4.1, RFGA+ was the system with the highest mean precision,
followed by MLPGS+ and RF+. While there is no statistically significant
difference among these three systems, all of them achieved a significantly
higher precision than SVM, SVMGA, MLP and MLPGA on the confidence level of
99% (see table A.3 for the exact p-values). It should also be noted that
RFGA+ was the system with both the highest mean MCC and the highest mean
precision.

System  MCC  Accuracy  Precision  Recall  F1  Features  Calls
RF 0.856 ± 0.041 0.936 ± 0.017 0.934 ± 0.015 0.996 ± 0.007 0.964 ± 0.009 - -
RF+ 0.900 ± 0.039 0.955 ± 0.017 0.951 ± 0.018 1.000 ± 0.000 0.975 ± 0.009 - -
RFGS 0.839 ± 0.068 0.926 ± 0.029 0.937 ± 0.022 0.981 ± 0.020 0.958 ± 0.017 3.133 ± 0.550 538.533 ± 70.342
RFGS+ 0.841 ± 0.073 0.930 ± 0.029 0.944 ± 0.022 0.978 ± 0.017 0.960 ± 0.017 3.267 ± 0.470 555.800 ± 60.162
RFGA 0.847 ± 0.041 0.933 ± 0.017 0.931 ± 0.018 0.996 ± 0.007 0.962 ± 0.010 75.667 ± 12.278 9.400 ± 0.882
RFGA+ 0.909 ± 0.056 0.958 ± 0.027 0.961 ± 0.021 0.992 ± 0.015 0.976 ± 0.016 76.000 ± 10.627 8.800 ± 0.696
SVM 0.500 ± 0.000 0.856 ± 0.001 0.856 ± 0.001 1.000 ± 0.000 0.922 ± 0.001 - -
SVM+ 0.666 ± 0.054 0.700 ± 0.051 0.945 ± 0.030 0.691 ± 0.054 0.793 ± 0.040 - -
SVMGS 0.783 ± 0.079 0.904 ± 0.031 0.929 ± 0.022 0.963 ± 0.022 0.945 ± 0.017 2.600 ± 0.482 470.067 ± 62.021
SVMGS+ 0.727 ± 0.064 0.833 ± 0.032 0.949 ± 0.028 0.858 ± 0.047 0.896 ± 0.022 2.933 ± 0.727 512.400 ± 92.953
SVMGA 0.500 ± 0.000 0.856 ± 0.001 0.856 ± 0.001 1.000 ± 0.000 0.922 ± 0.001 62.733 ± 7.611 10.867 ± 3.113
SVMGA+ 0.660 ± 0.050 0.703 ± 0.041 0.943 ± 0.027 0.698 ± 0.043 0.798 ± 0.032 86.600 ± 13.486 9.000 ± 0.613
MLP 0.543 ± 0.056 0.808 ± 0.064 0.866 ± 0.017 0.918 ± 0.085 0.882 ± 0.052 - -
MLP+ 0.644 ± 0.065 0.678 ± 0.117 0.801 ± 0.161 0.684 ± 0.151 0.731 ± 0.150 - -
MLPGS 0.727 ± 0.083 0.894 ± 0.026 0.915 ± 0.025 0.970 ± 0.023 0.940 ± 0.015 4.333 ± 0.658 691.600 ± 83.518
MLPGS+ 0.774 ± 0.048 0.840 ± 0.047 0.960 ± 0.018 0.851 ± 0.057 0.897 ± 0.034 4.800 ± 0.946 749.933 ± 119.644
MLPGA 0.502 ± 0.052 0.672 ± 0.132 0.799 ± 0.115 0.740 ± 0.178 0.731 ± 0.151 95.000 ± 14.624 9.000 ± 0.640
MLPGA+ 0.673 ± 0.060 0.692 ± 0.090 0.948 ± 0.026 0.685 ± 0.114 0.767 ± 0.092 87.400 ± 14.277 9.267 ± 1.146

Table A.2 – Test results on drawing data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.
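The "mean ± interval" entries in the table can be reproduced from per-run scores. The sketch below uses a Student-t interval for the mean; whether the thesis used a t-based or a normal-approximation interval is an assumption here, and the scores are illustrative.

```python
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Return (mean, half-width) of a t-based confidence interval
    for the mean of a set of per-run scores."""
    scores = np.asarray(scores, dtype=float)
    m = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean (ddof=1)
    half = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return m, half

# Illustrative per-run MCC values (not the thesis data).
mcc_runs = [0.88, 0.91, 0.86, 0.93, 0.90, 0.89, 0.92, 0.87, 0.90, 0.91]
m, h = mean_ci(mcc_runs)
print(f"MCC = {m:.3f} ± {h:.3f}")
```

A t-based interval widens appropriately for the small number of runs used per system, which a plain normal interval would understate.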
System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - - - 1.000e+00 - 4.964e-03 - 1.000e+00 - 4.964e-03 - 3.759e-02 1.000e+00 1.000e+00 - 5.979e-02 -
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.249e-04 1.000e+00 1.000e+00 1.000e+00 1.249e-04 1.000e+00 1.360e-03 1.000e+00 1.000e+00 - 2.365e-03 1.000e+00
RFGS 1.000e+00 - - - 1.000e+00 - 6.416e-03 - 1.000e+00 - 6.416e-03 - 4.727e-02 1.000e+00 1.000e+00 - 7.468e-02 -
RFGS+ 1.000e+00 - 1.000e+00 - 1.000e+00 - 2.390e-03 - 1.000e+00 - 2.390e-03 1.000e+00 1.954e-02 1.000e+00 1.000e+00 - 3.166e-02 -
RFGA - - - - - - 1.758e-02 - 1.000e+00 - 1.758e-02 - 1.159e-01 1.000e+00 1.000e+00 - 1.781e-01 -
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.588e-05 1.000e+00 1.000e+00 1.000e+00 1.588e-05 1.000e+00 2.078e-04 1.000e+00 1.000e+00 1.000e+00 3.779e-04 1.000e+00
SVM - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -
SVM+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 2.442e-03 - 1.000e+00 - 2.442e-03 1.000e+00 1.992e-02 1.000e+00 1.000e+00 - 3.226e-02 -
SVMGS - - - - - - 4.579e-02 - - - 4.579e-02 - 2.702e-01 1.000e+00 1.000e+00 - 4.042e-01 -
SVMGS+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.309e-03 1.000e+00 1.000e+00 - 1.309e-03 1.000e+00 1.138e-02 1.000e+00 1.000e+00 - 1.871e-02 1.000e+00
SVMGA - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 -
SVMGA+ 1.000e+00 - 1.000e+00 - 1.000e+00 - 6.255e-03 - 1.000e+00 - 6.255e-03 - 4.621e-02 1.000e+00 1.000e+00 - 7.305e-02 -
MLP - - - - - - 1.000e+00 - - - 1.000e+00 - - 1.000e+00 - - 1.000e+00 -
MLP+ - - - - - - - - - - - - - - - - 1.000e+00 -
MLPGS - - - - - - 6.265e-01 - - - 6.265e-01 - 1.000e+00 1.000e+00 - - 1.000e+00 -
MLPGS+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.791e-04 1.000e+00 1.000e+00 1.000e+00 1.791e-04 1.000e+00 1.886e-03 1.000e+00 1.000e+00 - 3.253e-03 1.000e+00
MLPGA - - - - - - - - - - - - - - - - - -
MLPGA+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.629e-03 1.000e+00 1.000e+00 - 1.629e-03 1.000e+00 1.385e-02 1.000e+00 1.000e+00 - 2.266e-02 -

Table A.3 – Dunn’s test results on the drawing data: p-values corresponding to
the null hypothesis that systems A and B have the same precision. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01, orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.
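The pairwise p-values in these tables come from Dunn's post-hoc test. A minimal sketch of a tie-corrected, Bonferroni-adjusted Dunn's test is shown below; the exact p-value adjustment used in the thesis is not restated here, so treat the `p_adjust` choice as an assumption (libraries such as scikit-posthocs offer a ready-made `posthoc_dunn` as an alternative).

```python
import numpy as np
from scipy.stats import rankdata, norm

def dunn_test(groups, p_adjust="bonferroni"):
    """Pairwise Dunn's test with tie correction.

    groups: list of 1-D score arrays, one per system.
    Returns a k x k matrix of two-sided p-values for the null
    hypothesis that two systems have the same distribution.
    """
    k = len(groups)
    sizes = np.array([len(g) for g in groups])
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n = pooled.size
    ranks = rankdata(pooled)
    # Mean rank of each group within the pooled ranking.
    mean_ranks = [r.mean() for r in np.split(ranks, np.cumsum(sizes)[:-1])]
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    _, counts = np.unique(pooled, return_counts=True)
    tie = (counts**3 - counts).sum() / (12.0 * (n - 1))
    n_pairs = k * (k - 1) // 2
    p = np.ones((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            se = np.sqrt((n * (n + 1) / 12.0 - tie)
                         * (1.0 / sizes[i] + 1.0 / sizes[j]))
            z = abs(mean_ranks[i] - mean_ranks[j]) / se
            p_ij = 2.0 * norm.sf(z)  # two-sided p from the normal tail
            if p_adjust == "bonferroni":
                p_ij = min(p_ij * n_pairs, 1.0)
            p[i, j] = p[j, i] = p_ij
    return p

# Illustrative scores for three hypothetical systems (not the thesis data).
pvals = dunn_test([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [20, 21, 22, 23, 24]])
print(np.round(pvals, 4))
```

In this toy example the third group is clearly separated from the first two, so its pairwise p-values are small while the first-vs-second comparison is not significant, mirroring how the tables above should be read.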

Recall Rate
As shown in table 4.1, SVM, SVMGA and RF+ achieved the highest possible
recall rate. While being the systems with the highest recall rate, it should
be noted that there is no significant difference between these three systems
and the system achieving the highest MCC, i.e. RFGA+. In particular, on a
confidence level of 95%, SVM, SVMGA, RFGA+, RFGA and RF+ all outperformed
the same set of systems: each achieved a higher recall rate than SVM+,
SVMGS+, SVMGA+, MLP+, MLPGS+ and MLPGA+ with p < 0.05 (see table A.4 for
the exact p-values).

System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 1.000e+00 - 1.000e+00 - 6.556e-07 1.000e+00 2.949e-02 - 7.028e-07 1.000e+00 1.859e-04 1.000e+00 7.207e-03 4.346e-01 2.366e-05
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.678e-07 1.000e+00 1.194e-02 - 1.803e-07 1.000e+00 5.927e-05 1.000e+00 2.716e-03 2.057e-01 6.932e-06
RFGS - - - 1.000e+00 - - - 1.599e-05 1.000e+00 2.298e-01 - 1.703e-05 1.000e+00 2.624e-03 1.000e+00 6.716e-02 1.000e+00 4.122e-04
RFGS+ - - - - - - - 1.257e-04 1.000e+00 8.172e-01 - 1.334e-04 1.000e+00 1.409e-02 1.000e+00 2.703e-01 1.000e+00 2.564e-03
RFGA - - 1.000e+00 1.000e+00 - 1.000e+00 - 6.556e-07 1.000e+00 2.949e-02 - 7.028e-07 1.000e+00 1.859e-04 1.000e+00 7.207e-03 4.346e-01 2.366e-05
RFGA+ - - 1.000e+00 1.000e+00 - - - 1.040e-06 1.000e+00 3.990e-02 - 1.114e-06 1.000e+00 2.733e-04 1.000e+00 1.000e-02 5.573e-01 3.582e-05
SVM 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.678e-07 1.000e+00 1.194e-02 - 1.803e-07 1.000e+00 5.927e-05 1.000e+00 2.716e-03 2.057e-01 6.932e-06
SVM+ - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
SVMGS - - - - - - - 2.166e-03 - 1.000e+00 - 2.282e-03 1.000e+00 1.368e-01 - 1.000e+00 1.000e+00 3.102e-02
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.678e-07 1.000e+00 1.194e-02 - 1.803e-07 1.000e+00 5.927e-05 1.000e+00 2.716e-03 2.057e-01 6.932e-06
SVMGA+ - - - - - - - 1.000e+00 - - - - - 1.000e+00 - - - 1.000e+00
MLP - - - - - - - 3.578e-03 - 1.000e+00 - 3.766e-03 - 2.029e-01 - 1.000e+00 1.000e+00 4.795e-02
MLP+ - - - - - - - - - - - - - - - - - -
MLPGS - - - - - - - 4.730e-04 1.000e+00 1.000e+00 - 5.003e-04 1.000e+00 4.091e-02 - 6.454e-01 1.000e+00 8.222e-03
MLPGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - 5.926e-01 - - - 6.147e-01 - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - 1.000e+00 - - - -

Table A.4 – Dunn’s test results on the drawing data: p-values corresponding
to the null hypothesis that systems A and B have the same recall. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01, orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

A.2.2 Voice
Overall Result

System  MCC  Accuracy  Precision  Recall  F1  Features  Calls
RF 0.752 ± 0.049 0.828 ± 0.032 0.841 ± 0.022 0.950 ± 0.024 0.892 ± 0.020 - -
RF+ 0.745 ± 0.033 0.819 ± 0.023 0.852 ± 0.019 0.918 ± 0.022 0.883 ± 0.015 - -
RFGS 0.716 ± 0.040 0.802 ± 0.028 0.831 ± 0.017 0.922 ± 0.027 0.874 ± 0.018 5.000 ± 0.666 4502.133 ± 497.962
RFGS+ 0.711 ± 0.044 0.784 ± 0.032 0.850 ± 0.022 0.865 ± 0.030 0.857 ± 0.022 5.000 ± 0.613 4502.267 ± 457.962
RFGA 0.735 ± 0.045 0.822 ± 0.025 0.831 ± 0.018 0.958 ± 0.017 0.889 ± 0.015 575.133 ± 86.938 8.867 ± 0.736
RFGA+ 0.761 ± 0.033 0.831 ± 0.020 0.856 ± 0.018 0.933 ± 0.018 0.892 ± 0.013 494.067 ± 78.907 8.800 ± 0.720
SVM 0.504 ± 0.018 0.732 ± 0.011 0.747 ± 0.005 0.968 ± 0.017 0.843 ± 0.008 - -
SVM+ 0.649 ± 0.043 0.718 ± 0.039 0.829 ± 0.022 0.784 ± 0.040 0.804 ± 0.030 - -
SVMGS 0.704 ± 0.040 0.797 ± 0.026 0.825 ± 0.021 0.927 ± 0.021 0.872 ± 0.016 4.933 ± 0.968 4451.333 ± 723.903
SVMGS+ 0.700 ± 0.036 0.752 ± 0.036 0.861 ± 0.023 0.796 ± 0.045 0.825 ± 0.030 5.333 ± 1.007 4750.133 ± 751.829
SVMGA 0.493 ± 0.020 0.732 ± 0.011 0.746 ± 0.005 0.972 ± 0.017 0.844 ± 0.007 361.333 ± 3.932 9.067 ± 1.484
SVMGA+ 0.664 ± 0.041 0.733 ± 0.032 0.840 ± 0.025 0.796 ± 0.032 0.816 ± 0.023 419.600 ± 48.200 10.000 ± 1.267
MLP 0.495 ± 0.017 0.553 ± 0.122 0.516 ± 0.187 0.602 ± 0.247 0.517 ± 0.210 - -
MLP+ 0.518 ± 0.020 0.473 ± 0.117 0.587 ± 0.190 0.435 ± 0.238 0.396 ± 0.199 - -
MLPGS 0.648 ± 0.057 0.775 ± 0.032 0.798 ± 0.024 0.940 ± 0.033 0.862 ± 0.021 5.800 ± 0.909 5099.067 ± 678.598
MLPGS+ 0.648 ± 0.039 0.702 ± 0.049 0.834 ± 0.025 0.752 ± 0.071 0.783 ± 0.049 4.333 ± 0.840 4003.067 ± 628.460
MLPGA 0.503 ± 0.012 0.527 ± 0.119 0.644 ± 0.171 0.547 ± 0.244 0.484 ± 0.203 528.200 ± 88.456 9.133 ± 0.824
MLPGA+ 0.503 ± 0.010 0.426 ± 0.116 0.361 ± 0.198 0.342 ± 0.236 0.302 ± 0.200 431.867 ± 67.640 10.400 ± 0.843

Table A.5 – Test results on voice data with 95% confidence intervals. The
metrics were computed based on the definitions described in section 3.6.2. By
comparing the mean values before rounding, the best, second best and third
best values were identified and written in red, orange and yellow respectively.

Precision
As shown in table 4.7, SVMGS+ was the system with the highest mean precision.
While being the one with the highest precision, it should be noted that this
system did not offer the highest MCC (see table 4.1) and that there is no
significant difference between SVMGS+ and RFGA+, the system providing the
highest MCC, on the confidence level of 90% and thereby also not on 95% or
99% (p = 1.0).
Nevertheless, it should be noted that both SVMGS+ and RFGA+ achieved a
significantly higher precision than SVM, SVMGA, MLP and MLPGA on the
confidence level of 99%, and than two more systems on the confidence level
of 95% (see table A.6 for the exact systems and the corresponding p-values).

System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - - 1.000e+00 - 1.000e+00 - 4.447e-03 1.000e+00 1.000e+00 - 3.415e-03 1.000e+00 5.801e-03 4.680e-01 1.000e+00 1.000e+00 3.590e-01 9.362e-04
RF+ 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 3.980e-04 1.000e+00 1.000e+00 - 2.966e-04 1.000e+00 5.351e-04 7.638e-02 1.000e+00 1.000e+00 5.634e-02 7.063e-05
RFGS - - - - 1.000e+00 - 1.871e-02 1.000e+00 1.000e+00 - 1.465e-02 - 2.394e-02 1.000e+00 1.000e+00 - 1.000e+00 4.401e-03
RFGS+ 1.000e+00 - 1.000e+00 - 1.000e+00 - 6.232e-04 1.000e+00 1.000e+00 - 4.670e-04 1.000e+00 8.334e-04 1.074e-01 1.000e+00 1.000e+00 7.979e-02 1.140e-04
RFGA - - - - - - 2.294e-02 1.000e+00 1.000e+00 - 1.801e-02 - 2.926e-02 1.000e+00 1.000e+00 - 1.000e+00 5.485e-03
RFGA+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 - 1.680e-04 1.000e+00 1.000e+00 - 1.240e-04 1.000e+00 2.282e-04 3.946e-02 1.000e+00 1.000e+00 2.872e-02 2.817e-05
SVM - - - - - - - - - - 1.000e+00 - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVM+ - - - - - - 2.629e-02 - 1.000e+00 - 2.068e-02 - 3.346e-02 1.000e+00 1.000e+00 - 1.000e+00 6.354e-03
SVMGS - - - - - - 4.688e-02 - - - 3.718e-02 - 5.916e-02 1.000e+00 1.000e+00 - 1.000e+00 1.189e-02
SVMGS+ 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.147e-04 1.000e+00 1.000e+00 - 8.429e-05 1.000e+00 1.565e-04 2.940e-02 1.000e+00 1.000e+00 2.127e-02 1.876e-05
SVMGA - - - - - - - - - - - - 1.000e+00 1.000e+00 - - 1.000e+00 1.000e+00
SVMGA+ - - 1.000e+00 - 1.000e+00 - 4.493e-03 1.000e+00 1.000e+00 - 3.451e-03 - 5.860e-03 4.715e-01 1.000e+00 1.000e+00 3.618e-01 9.466e-04
MLP - - - - - - - - - - - - - - - - - 1.000e+00
MLP+ - - - - - - - - - - - - 1.000e+00 - - - - 1.000e+00
MLPGS - - - - - - 1.000e+00 - - - 1.000e+00 - 1.000e+00 1.000e+00 - - 1.000e+00 9.858e-01
MLPGS+ - - 1.000e+00 - 1.000e+00 - 1.138e-02 1.000e+00 1.000e+00 - 8.845e-03 - 1.465e-02 9.313e-01 1.000e+00 - 7.262e-01 2.572e-03
MLPGA - - - - - - - - - - - - 1.000e+00 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -

Table A.6 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that systems A and B have the same precision. The symbol
"-" is used when system A does not have a higher mean value than system B.
A p-value is written in red if p ≤ 0.01, orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

Recall Rate

System B
RF RF+ RFGS RFGS+ RFGA RFGA+ SVM SVM+ SVMGS SVMGS+ SVMGA SVMGA+ MLP MLP+ MLPGS MLPGS+ MLPGA MLPGA+
System A
RF - 1.000e+00 1.000e+00 9.674e-01 - 1.000e+00 - 8.837e-03 1.000e+00 2.368e-02 - 1.213e-02 1.000e+00 1.000e+00 1.000e+00 8.281e-03 1.000e+00 1.781e-01
RF+ - - - 1.000e+00 - - - 9.848e-01 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 9.402e-01 1.000e+00 1.000e+00
RFGS - 1.000e+00 - 1.000e+00 - - - 5.084e-01 - 1.000e+00 - 6.462e-01 1.000e+00 1.000e+00 - 4.840e-01 1.000e+00 1.000e+00
RFGS+ - - - - - - - 1.000e+00 - 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
RFGA 1.000e+00 1.000e+00 1.000e+00 5.822e-01 - 1.000e+00 - 4.361e-03 1.000e+00 1.213e-02 - 6.056e-03 1.000e+00 8.751e-01 1.000e+00 4.077e-03 1.000e+00 9.910e-02
RFGA+ - 1.000e+00 1.000e+00 1.000e+00 - - - 2.151e-01 1.000e+00 4.785e-01 - 2.783e-01 1.000e+00 1.000e+00 - 2.040e-01 1.000e+00 1.000e+00
SVM 1.000e+00 1.000e+00 1.000e+00 1.075e-01 1.000e+00 1.000e+00 - 4.395e-04 1.000e+00 1.369e-03 - 6.325e-04 1.000e+00 1.716e-01 1.000e+00 4.079e-04 1.000e+00 1.439e-02
SVM+ - - - - - - - - - - - - 5.463e-01 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGS - 1.000e+00 1.000e+00 1.000e+00 - - - 4.010e-01 - 8.564e-01 - 5.123e-01 1.000e+00 1.000e+00 - 3.813e-01 1.000e+00 1.000e+00
SVMGS+ - - - - - - - 1.000e+00 - - - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
SVMGA 1.000e+00 1.000e+00 1.000e+00 6.258e-02 1.000e+00 1.000e+00 1.000e+00 2.137e-04 1.000e+00 6.882e-04 - 3.109e-04 1.000e+00 1.017e-01 1.000e+00 1.980e-04 1.000e+00 7.798e-03
SVMGA+ - - - - - - - 1.000e+00 - - - - 6.932e-01 1.000e+00 - 1.000e+00 1.000e+00 1.000e+00
MLP - - - - - - - - - - - - - 1.000e+00 - - 1.000e+00 1.000e+00
MLP+ - - - - - - - - - - - - - - - - - 1.000e+00
MLPGS - 1.000e+00 1.000e+00 1.000e+00 - 1.000e+00 - 2.651e-02 1.000e+00 6.689e-02 - 3.570e-02 1.000e+00 1.000e+00 - 2.494e-02 1.000e+00 4.399e-01
MLPGS+ - - - - - - - - - - - - 5.201e-01 1.000e+00 - - 1.000e+00 1.000e+00
MLPGA - - - - - - - - - - - - - 1.000e+00 - - - 1.000e+00
MLPGA+ - - - - - - - - - - - - - - - - - -

Table A.7 – Dunn’s test results on the voice data: p-values corresponding to
the null hypothesis that systems A and B have the same recall. The symbol "-"
is used when system A does not have a higher mean value than system B. A
p-value is written in red if p ≤ 0.01, orange if p ∈ (0.01, 0.05], yellow if
p ∈ (0.05, 0.1] and black if p > 0.1.

As shown in table 4.7, SVMGA was the system with the highest recall rate,
followed by SVM and RFGA. While being the ones with the highest recall rate,
it should be noted that the p-value corresponding to the null hypothesis that
these systems have the same recall rate as the one achieving the highest MCC
(i.e. RFGA+) was 1.0. That is, these systems did not offer a higher recall
rate than RFGA+ with a significant difference on the confidence level of 90%,
and thereby also not on 95% or 99%.
TRITA-EECS-EX-2021:387

www.kth.se
