Journal Pre-Proof: Artificial Intelligence in Medicine

Journal Pre-proof
A Machine Learning perspective on the emotional content of

Parkinsonian speech<!–<ForCover>Konstantinos Sechidis, Riccardo
Fusaroli, Juan Rafael Orozco-Arroyave, Detlef Wolf, Yan-Ping Zhang, A
Machine Learning perspective on the emotional content of Parkinsonian
speech, <![CDATA[Artificial Intelligence In Medicine]]>,
doi:</ForCover>–>
Konstantinos Sechidis, Riccardo Fusaroli, Juan Rafael

Orozco-Arroyave, Detlef Wolf, Yan-Ping Zhang
PII: S0933-3657(21)00054-3
DOI: https://doi.org/10.1016/j.artmed.2021.102061
Reference: ARTMED 102061
To appear in: Artificial Intelligence In Medicine
Received Date: 6 March 2020

Revised Date: 26 February 2021
Accepted Date: 29 March 2021
Please cite this article as: { doi: https://doi.org/
This is a PDF file of an article that has undergone enhancements after acceptance, such as
the addition of a cover page and metadata, and formatting for readability, but it is not yet the
definitive version of record. This version will undergo additional copyediting, typesetting and
review before it is published in its final form, but we are providing this version to give early
visibility of the article. Please note that, during the production process, errors may be
discovered which could affect the content, and all legal disclaimers that apply to the journal
pertain.
© 2020 Published by Elsevier.

A Machine Learning perspective on the emotional content of Parkinsonian speech
Konstantinos Sechidisa,1 , Riccardo Fusarolib , Juan Rafael Orozco-Arroyavec,d , Detlef Wolfa , Yan-Ping Zhanga,∗
a Roche Pharmaceutical Research & Early Development Informatics, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., 4070 Basel, Switzerland.
b School of Communication and Culture & the Interacting Minds Centre, Aarhus University, Aarhus, Denmark.
c Faculty of Engineering, Universidad de Antioquia UdeA, 1226 Medellı́n, Colombia.
d Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany.
Abstract
Patients with Parkinson’s disease (PD) have distinctive voice patterns, often perceived as expressing sad emotion. While this
characteristic of Parkinsonian speech has been supported through the perspective of listeners, where both PD and healthy control
(HC) subjects repeat the same speaking tasks, it has never been explored through a machine learning modelling approach. Our work
of
provides an objective evaluation of this characteristic of the PD speech, by building a transfer learning system to assess how the
PD pathology affects the sadness perception. To do so we introduce a Mixture-of-Experts (MoE) architecture for speech emotion
recognition designed to be transferable across datasets. Firstly, by relying on publicly available emotional speech corpora, we train
the MoE model and then we use it to quantify perceived sadness in never seen before PD and matched HC speech recordings. To
ro
build our models (experts), we extracted spectral features of the voicing parts of speech and we trained a gradient boosting decision
trees model in each corpus to predict happiness vs. sadness. MoE predictions are created by weighting each expert’s prediction
according to the distance between the new sample and the expert-specific training samples. The MoE approach systematically infers
more negative emotional characteristics in PD speech than in HC. Crucially, these judgments are related to the disease severity and
-p
the severity of speech impairment in the PD patients: the more impairment, the more likely the speech is to be judged as sad. Our
findings pave the way towards a better understanding of the characteristics of PD speech and show how publicly available datasets
can be used to train models that provide interesting insights on clinical data.
re
Keywords: Parkinson’s disease, Machine Learning, Mixture-of-experts, Speech emotion recognition
1. Introduction This conclusion was derived by observational studies were PD

lP
1 16
and healthy control (HC) subjects produced emotional utter- 17

2 Patients with Parkinson’s disease (PD) tend to present dis- ances and independent listeners, naive to the disease status of 18
3 tinctive voice and prosody [1]. PD pathology affects the speech each individual speaker, were detecting the intended emotion. 19
4 production system: the control of the larynx [2] and the respi- In our work we follow a novel machine modelling approach 20
5 ratory system [3] are both affected. Indeed, almost 90% of peo- to better understand the perceived sadness in PD speech, which
na
21
6 ple with PD have hypokinetic dysarthria, a motor speech disor- occurs regardless of the intended emotion, and to explore whether 22
7 der characterized by monopitch, monoloudness, short rushes of this characteristic can be leveraged for assessing PD and its pro- 23
8 speech, accelerated and/or variable speaking rate [4]. gression. In particular, there is a scarcity of openly available PD 24
9 The above peculiarities of the parkinsonian speech affect speech databases to train directly machine learning models, but 25
the perception of listeners, and as a result, PD voice is often
ur
10
there are many on more general emotional speech. We there- 26
11 described as atypical, and perceived as ‘sad’ [5, 6]: fore introduce a transfer learning architecture that uses publicly 27
12 “Parkinsonian speakers are often perceived as “sad” available corpora of emotional speech to train the models, and it 28
13 [...], a pattern which may lead to frequent misattri- is designed to be transferable across different speech corpora. 29
Jo
14 butions of how speakers with PD are truly feeling Finally, we use our model that predicts happiness vs sadness 30
15 or responding to situations in their daily life.” [5] to assess the perceived sadness in speech recordings of PD pa- 31
tients and HC. 32
The purpose of our work is thus threefold: 33

∗ Corresponding author (1) adapt an established machine learning architecture - a 34
Email addresses: kostas.sechidis@novartis.com
Mixture-of-Experts (MoE) model - for inferring the emo- 35
(Konstantinos Sechidis), fusaroli@cas.au.dk (Riccardo Fusaroli),
detlef.wolf@roche.com (Riccardo Fusaroli), tional state in various publicly available emotional speech 36
rafael.orozco@udea.edu.co (Juan Rafael Orozco-Arroyave), corpora. This modelling architecture is motivated by the 37
detlef.wolf@roche.com (Detlef Wolf), literature of multi-source domain adaptation [7]. 38

yan-ping.zhang_schaerer@roche.com (Yan-Ping Zhang)
1 During the work related to this manuscript, K. Sechidis was postdoctoral (2) employ our model to analyse speech recordings of Span- 39
fellow of F. Hoffman-La Roche Ltd; his current affiliation is Novartis Pharma ish PD patients and matched HC [8]. While the corpus 40
AG, Basel, Switzerland.
Preprint submitted to Journal of Artificial Intelligence In Medicine February 25, 2021

41 was introduced to better understand the phonatory, artic- with 10 millisecond overlap [11]. There are various types of 92
42 ulatory and prosodic pathologies in PD speech, our work LLD that have been used in the literature [10], the most impor- 93
43 is the first that uses this corpus to analyse its perceived tant ones are prosodic (e.g. pitch and energy/intensity), voice 94
44 emotional content. quality (e.g. jitter and shimmer), spectral (e.g. formants) and 95
45 (3) explore whether automated assessment of emotional con- cepstral (e.g. mel frequency cepstral coefficients, known as 96
46 tent relate not only to the diagnostic group (PD vs HC), MFCCs). There are many works that explore the predictive 97
47 but also to the level of speech impairment of the PD pa- power of the different LLDs [14]. In our work we will focus 98
48 tients and to the disease severity. on using MFCCs, which have been used extensively in the lit- 99
49 The remaining of the paper is organised as follows: Sec- erature and have many desirable properties, key amongst them 100
50 tion 2 provides a brief background on speech emotion recog- their being independent on the energy of the acoustic signal, 101
51 nition (SER) and PD speech. Section 3 presents the emotional which is strongly variable across speakers and recording con- 102
52 speech corpora we use to build and evaluate our SER model, texts (details in Section 4.2). 103
53 and it also presents the Parkinsonian speech corpus in which After extracting the LLDs in each frame, a number of sta- 104
54 we deploy our model to get more insight on the emotional con- tistical operators or functionals is applied to summarise them 105
55 tent of the PD speech. Section 4 describes the proposed MoE and generate what is known as global features [10]. At the 106
56 architecture and presents in detail how we train and validate end of this process, we derive from each utterance feature vec- 107
57 the experts. Section 5 verifies the effectiveness of the proposed tors of equal size [9]. Some popular functionals are extremes 108
of
58 method through experiments in various emotional speech cor- (min, max), measures of central tendency (mean, median), per- 109
59 pora, and presents our results on the emotional content of the centiles (quartiles, ranges) and higher moments (standard devi- 110
60 Parkinsonian speech. Section 6 provides a discussion on the ation, kurtosis). 111
ro
61 main clinical findings of our work, and Section 7 concludes the After extracting the features, several machine learning al- 112
62 paper. gorithms have been used to infer the emotional state, e.g. Hid- 113
den Markov Models, Gaussian Mixture Models and Random 114
63 2. Background Forests to name a few. In order to improve the predictive per- 115
64
65
2.1. Speech Emotion Recognition
Our work focuses on building SER machine learning mod-
-p
formance, various works are using ensemble methods. Ensem-
ble models can be generated through various ways, for example
by using different machine learning algorithms to build each
model in the ensemble [15] or by using one type of classifier
116
117
118
66 els from utterances relying only on sound processing and with- 119
re
67 out extracting any linguistic information, i.e. identifying words and random subsamples of the data to train each model of the 120
68 that could be correlated with a specific emotion.2 The first ensemble [16]. 121
69 and most important step for building a SER model is to ex- In our work, we want to infer the emotional state of Parkin- 122
tract meaningful and informative acoustic features from each sonian speech from models trained on emotional speech cor- 123
lP
70
71 utterance [9]. To this end, there are two main approaches in pora. In the machine learning literature this is referred to as 124
72 the literature: extracting hand-engineered (hand-crafted) fea- a multi-source domain adaptation problem [7]. A popular ap- 125
73 tures (more details can be found in review papers [10, 11]) or proach to this problem is using a Mixture-of-Experts model 126
74 using raw input signal or spectrograms and a deep-learning ar- [17], where each model (expert) is trained using a different 127
chitecture to derive a representation of the input signal (more emotional speech corpus (source). The ML algorithm we use
na
75 128
76 details in [12]). Besides being computationally expensive, the to train each expert is gradient boosting decision trees, a pow- 129
77 main drawbacks of the latter approach is that it requires a large erful technique that achieves state-of-the-art results in various 130
78 amount of data, which is not the case for the SER corpora we tasks [18] (details in Section 4). 131
79 focused on (details in Section 3). Furthermore, in absence of

2.2. Parkinsonian Speech
ur
80 large and heterogeneous datasets, there are concerns of over- 132
81 fitting to the specific structures of the training materials, with Speech impairment is an important symptom of PD and is 133
82 issues in generalizing to new contexts. [13]. The generalisa- generally characterised as atypicalities in the variation of the 134
83 tion is of crucial importance for our application, since we train fundamental frequency, commonly perceived as monopitch; the 135
Jo
84 models using emotional speech corpora and we deploy them in variation of intensity, commonly perceived as monoloudness, 136
85 Parkinsonian speech; for that reason, we focus on using hand- the variation of stress and emphasis, the imprecise articulation; 137
86 crafted features. as well as short rushes of speech and variable speaking rate [4]. 138
87 Extracting hand-crafted features is an active research area Several machine learning systems have been suggested for 139
88 in SER and more details can be found in [11]. The main ap- diagnosing or inferring the disease progression using speech 140
89 proach is to extract low-level descriptors (LLD), also known recordings. In one of the first works, Little et al. [19] extracted 141
90 as local features, once for every small time-frame, usually us- various features from sustained phonation, and used them to 142
91 ing a sliding overlapping window, e.g. 25 millisecond frames train a support vector machine (SVM) classifier to distinguish 143
between PD and HC. Instead of focusing on sustained phona- 144
2 Tobuild this type of models we need to train them on emotional speech

tion, many recent studies examine more realistic speech, such 145
corpora, where all subjects produce exactly the same utterances (sentences), as reading text or delivering monologues [20, 21]. Orozco- 146
more details in Section 3.1. Arroyave et al. [21] considered speech signals expressed in 147
2
148 three different languages (Spanish, German and Czech) to build how happy (or sad) a person’s utterance sounds. This model is 205
149 a SVM classifier using MFCC features from both voiced and used to study how the probabilities of sounding happy (or sad) 206
150 unvoiced sounds. Furthermore, in cross-language experiments, differ between PD and HC in various speech tests, such as read- 207
151 the authors showed the generalization capabilities of the sug- ing sentences and dialogues. Finally, we explore how different 208
152 gested system across languages and recording devices. levels of speech impairment affect the perceived emotional con- 209
153 For all the works mentioned above the cross-validation strat- tent of Parkinsonian speech. 210
154 egy (e.g. splitting the data to train/test/validation subsets) plays

155 a crucial role. For example, randomly splitting the data, a com-
3. Materials 211
156 mon procedure in the literature [19, 22], leads to overly opti-
157 mistic results [23]. Meeting the speaker independence assump- In this study we use two diverse types of speech recording 212
158 tion [21], i.e. training, testing and validation set contain records corpora. First, our proposed SER model is trained and vali- 213
159 generated from different speakers, leads to more realistic re- dated using emotional speech corpora. Afterwards, a corpus 214
160 sults. This assumption can lead to systems that achieve compet- that contains PD and HC speech is used to test whether there is 215
161 ing performance when they are transferred to different speech a difference in the emotional content between these two groups. 216
162 corpora (i.e. cross-corpus evaluation) [24, 25]. Furthermore,
163 Tsanas et al. [26] showed the importance that the gender plays 3.1. Emotional speech corpora 217
164 by revealing distinct speech PD progression characteristics for
of
We use the following six emotional speech corpora: Crowd- 218
165 males and females. As a result gender should be taking into ac-
sourced emotional multimodal actors dataset (CREMA-D) [31], 219
166 count in the cross-validation strategy, and more details can be
Ryerson audio-visual database of emotional speech and song 220
167 found in Section 4.3.
(RAVDESS) [32], Berlin emotional database (EMO-DB) [33], 221
ro
168 Few studies investigated the acoustic and perceptual fea-
Italian emotional speech database (EMOVO) [34], Toronto emo- 222
169 tures of emotional speech in PD by conducting observational
tional speech set (TESS) [35] and Surrey audio-visual expressed 223
170 studies and relying on traditional univariate testing for differ-
emotion database (SAVEE) [36]. 224
171 ences [27]. Cheang & Pell [28] assessed fundamental frequency
Each corpus contains recordings of sentences spoken by ac-
172
173
174
175
(pitch), amplitude and speech rate in emotional speech pro-
duced by patients with PD and HC. The mean fundamental
frequency distinguished the two speaker groups (PD vs HC).
Möbes et al. [29] also used pitch and intensity (loudness) to
-p
tors expressing various emotions, such as happiness, sadness,
fear, surprise, anger, and disgust. Since the focus of the cur-
rent paper is on the perceived sadness of Parkinsonian speech
225
226
227
228
re
[5, 29], we consider only two emotions: happiness and sadness, 229
176 compare prosody in PD and HC across emotional (saying “Anna”
and in the next section we will show how we can use them to 230
177 with a happy, neutral and sad emotional intonation) and non-
build a SER system that provides a probability on how happy 231
178 emotional speech (say “aaaa”). There were only significant dif-
(or sad) utterances sound. Details on the corpora can be found 232
179 ferences in pitch and loudness range during emotional speech,
lP
in Table 1, where we can see the diversity of the characteristics. 233

180 but not in the sustained phonations. Finally, Pell et al. [5]
There are three different languages: English (in three different 234
181 recorded emotional utterances of PD and HC subjects and they
variants), German and Italian. All corpora are mixed gender, 235
182 asked independent listeners, naive to the disease status of each
apart from two: SAVEE contains only males, and TESS con- 236
183 individual speaker to detect the intended emotion. One of their
tains only females. Some corpora have many speakers while 237
main findings was that the PD speakers were often perceived as
na
184
others have only few. There are various age ranges and various 238
185 “sad”.
sample sizes (number or utterances). For example, the biggest 239
186 To the best of our knowledge, only one study explored the
corpus is CREMA-D with 2534 and the smallest are EMO-DB 240
187 acoustic features of emotional speech in PD using machine learn-
and SAVEE with 133 and 120 examples, respectively. All of 241
188 ing modelling. Zhao et al. [30] asked five PD patients and
these characteristics, especially the sample size and the gender 242
ur
189 seven HC subjects to repeat ten unique statements, each pro-

information, play an important role in the pipeline for our SER 243
190 duced with different target emotion. Various acoustic features
model (details in Section 3). 244
191 were extracted to build three machine learning classifiers (naive
In our work we focus on the corpora presented in Table 1, 245
192 Bayes, support vector machines and random forests). While au-
because they share a common characteristic: the subjects are
Jo
246
193 tomated SER could accurately reconstruct the emotions in HC
asked to repeat short sentences (text) in such a way to convey 247
194 with an accuracy of 73.33%, the accuracy decreased to 65.5%
different emotions. The sentences do not contain any emotional 248
195 when applied to speech produced by PD patients. Misclassifi-
bias, and in each corpus the same sentences are repeated by 249
196 cation patterns were complex, but the authors note a tendency
the subjects. Furthermore, it is useful to clarify that the dif- 250
197 to classify the patients’ speech as less happy than HC speech,
ferent corpora contain different sets of sentences. Focusing 251
198 resonating with previous perceptual findings.
on emotions convoyed by repeating a sentence in contrast to 252
199 From all the above works we can see that the sadness per-
emotions conveyed in free-speech or natural dialogues, such as 253
200 ception in PD patients is affecting their speech in a way that
the Interactive Emotional Dyadic Motion Capture (IEMOCAP) 254
201 is, at least partially, independent of their intended emotion. In
database [37], is crucial for our work. Our main motivation is 255
202 our work, we explore how we can leverage the fact that the PD
to explore the emotional content contained in the acoustic in- 256
203 speech sounds sad into assessing PD and its progression. We
formation of the PD speech, regardless of the actual semantic 257
204 rely on a novel MoE architecture to provide a probability of
3
258 content of the speech. As a result, we should build our models tients. The clinician provides a score between 0-4 that captures 292
259 using databases where the subjects repeat the same sentences the speech impairment: 0 (normal), 1 (slight), 2 (mild), 3 (mod- 293
260 using different emotions. Afterwards, we will deploy our mod- erate) and 4 (severe), more details can be found in Supplemen- 294
261 els to infer the emotional content in recordings of PD vs HC tary material Section 1. Figure 1 shows the distribution of the 295
262 speech using a corpus that we present in the following section. scores in Question 3.1 for the 50 PD patients in PC-GITA.
Table 1: General information about the different emotional speech corpora used
in the current paper
20
count
Name, Language or dialect,
# speakers # utterances
Year Age range 10
 
CREMA-D, American English, Males: 48
 Happy: 1267

7 23 16 4
 
91  2534 
2014 [31] 20-74 years Females: 43
 Sad: 1267
 0
  0 1 2 3
RAVDESS, American English, Males: 12 Happy: 192 Speech impairment (Question 3.1)

 

24  384 
2018 [32] 21-33 years Females: 12
 Sad: 192

 
EMO-DB, German, Males: 5

 Happy: 71

 Figure 1: Distribution of the scores for the 50 PD patients provided by the
10  133  clinician in MDS-UPDRS Question 3.1, which captures the speech impairment.
of
2005 [33] 21-35 years Females: 5
 Sad: 62

 296
Happy: 84

Males: 3

EMOVO, Italian,  

6 167 
2014 [34] 23-30 years Females: 3
 Sad: 83

4. Methods
ro
297
 
TESS, Canadian English, Males: 0

 Happy: 400


2 800 
2011 [35] 26-64 years Females: 2
 Sad: 400
 This section provides a detailed description of the proposed 298
  model. Section 4.1 describes the mixture-of-experts architec- 299

SAVEE, British English, Males: 4
 Happy: 60

ture that we use for SER, Section 4.2 provides details on the
 
4 120 
2009 [36] 27-31 years Females: 0
 Sad: 60
 -p
feature extraction, while Section 4.3 presents the expert models
and the cross-validation protocol we follow for hyper-parameter
optimisation.
300
301
302
303
re
263 3.2. PD vs HC speech corpus
264 Our work focuses on the perceived emotional content of the 4.1. Mixture-of-experts architecture for SER 304
265 Parkinsonian speech, and how it differs from the HC speech; In this section we will describe the ensemble architecture 305
266 for that reason we need to use a corpus that contains both types that we use for SER. Let us assume we have K emotional speech 306
lP
267 of speeches. To this end, we use the Spanish speech corpus corpora, and each one generates a classification dataset Di = 307
268 (PC-GITA) [38], which contains speech recordings of 50 pa- {xn , yn }n=1
Ni
, ∀ i ∈ {1 . . . K}, where Ni is the sample size of the ith 308
269 tients with PD and their respective healthy controls, matched by dataset; in other words, the total number of utterances in the ith 309
270 age, gender and education level, which address possible biases speech emotion corpus. The symbol xn represents the extracted 310
271 when the subjects read sentences. All participants are Colom- features from the nth utterance, more details are provided in sub- 311
na
272 bian Spanish native speakers between 33 to 77 years old. The section 4.2. In our work, we focus on the binary classification 312
273 speakers were asked to produce various speaking tasks, (1) sus- problem of y = 1 (or y = +), having a happy sounding utterance 313
274 tained phonations of the five Spanish vowels /a/, /e/, /i/, /o/ and vs y = 0 (or y = −) having a sad sounding utterance. We train 314
275 /u/ at a constant tone, (2) sustained vowels with a modulated a classifier with each of these datasets, and each model (expert) 315
tone, i.e., varying from low to high, (3) six diadochokinetic provides a prediction, i.e. the probability on how happy the ut-
ur
276 316
277 (DDK) exercises, i.e. the rapid repetition of the of the syllables terance sounds pDi (y = 1|x), ∀ i ∈ {1 . . . K}. Since we have 317
278 /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/, /pa-pa-pa/, /ta-ta-ta/ and /ka- a binary classification problem, the probability of how sad an 318
279 ka-ka/, (4) reading six different syntacticly complex and simple utterance sounds is pDi (y = 0|x) = 1 − pDi (y = 1|x). In our 319
work we use probability as a proxy for the degree of happiness

Jo
280 sentences, (5) reading of four sentences with additional empha- 320
281 sis in particular words, and (6) reading a dialog. (or sadness), and more details about the classification model are 321
282 All PD patients were evaluated by the same neurology ex- given in subsection 4.3. 322
283 pert and were labelled according to the part III of the Movement A mixture-of-experts is an ensemble of models (experts)
284 Disorder Society-Sponsored Revision of the Unified Parkin- whose predictions are combined using a weighted sum to pro-
285 son’s Disease Rating Scale (MDS-UPDRS) [39]. The mean duce the overall prediction where the weights dynamically de-
286 value of the MDS-UPDRS-III score across the 50 patients is pend on the input. Initially, it was introduced to combine differ-
287 37.66, while the standard deviation is 18.32. Furthermore, in ent neural networks [40], but given the above definition, it can
288 part III of the MDS-UPDRS test suite, a clinician assesses the be used with any type of models and it is a very effective way
289 overall severity of a patient’s motor symptoms. For our work, of creating an ensemble [41]. In our scenario, given an input
290 we are interested in the score of each patient in the item (ques- x0 from a target domain, the probability of observing a partic-
291 tion) 3.1, which measures the speech impairment of the pa- ular emotion y given x0 is provided by the weighted mixture
4
of the predictions produced by the experts that were trained in Estimating weights:
α(x0 , D1 ), α(x0 , D2 ) . . . α(x0 , DK )
different corpora:
Model trained
K on dataset D1 pD1 (y|x0 )
X
p MoE (y|x0 ) = α(x0 , Di )pDi (y|x0 ) (1)
Model trained
i=1 x0 on dataset D2 pD2 (y|x0 ) p MoE (y|x0 ), eq. (1)
.. ..
The key component of the MoE architecture is estimating the . .
Model trained
weights. In our work, we will follow a recently suggested way on dataset DK pDK (y|x0 )
for estimating these weights, by using a point-to-set metric [17].
Firstly, the point-to-set Mahalanobis distance metric between Figure 2: A graphic representation of our MoE architecture for SER. The target
the example x0 and the dataset Di is defined as follows: example x0 is passed from all the models trained in different SER corpora to
get K class probabilities. Then, the final prediction of the MoE, given by eq.
1 (1), is the weighted average of the K class probabilities, where the weights are
d(x0 , Di ) = x0 − µXDi T ΣXDi −1 x0 − µXDi 2 ,

estimated by eq. (2).
where µXDi is the mean vector of the features in the dataset Di ,

while ΣXDi −1 is the inverse of the variance covariance matrix 4.2. Feature extraction
341
for the same feature space. Since we focus on binary classifica- Emotional speech is often associated to variations in con- 342
of
tion problems which are balanced (see Table 1), the mean vector tour of fundamental frequency, measurements of the speech 343
µXDi may be close to the boundary of the expert trained in the spectrum and temporal characteristics of utterances [10]. How- 344
dataset Di . As a result, small distances d(x0 , Di ) imply areas ever, such acoustic attributes like fundamental frequency and 345
ro
with higher uncertainty for the corresponding classifier, which speech rate systematically vary by speaker, gender, age and lin- 346
is counter-intuitive. To overcome this problem, Guo et al. [17] guistic content. This limits their usefulness in our case. For ex- 347
introduced the following measure of confidence e(x0 , Di ) as the ample, participants of public expressive speech databases such 348
difference between the distances of x0 to the data that belong to as the ones presented in Table 1 tend to be much younger than 349
the two classes:
e(x0 , Di ) = |d(x0 , D+i ) − d(x0 , D−i )|,

-p
Parkinsonian patients. We have therefore decided to exclude 350
such features from the present investigation. 351
In contrast, the Mel Frequency Cepstral Coefficients (MFCCs), 352

introduced by Davis and Mermelstein in the 1980’s [42], have 353
re
where D+i are the data belonging to the positive class (i.e. happy been widely used in speech recognition and speech emotion 354
utterances), while D−i the one that belong to the negative (i.e. recognition tasks. These coefficients are the result of a cosine 355
sad utterances). Furthermore, the final weights are obtained by transform of the log-energies in frequency bands distributed 356
the following normalization of the confidence scores:
lP
over a mel-frequency scale, allowing better representation of 357

exp (e(x , Di ))
0 sound to the human auditory system’s response. MFCCs ac- 358
α(x0 , Di ) = K . (2) curately represent the envelope of the short time power spec- 359
X
exp e(x , Di )
0 trum, providing a great source of local features. If we can de- 360
i=1 termine the statistical distribution of these local features, this 361
na
should give us a scale invariant global representation of speech 362

323 The intuition behind the above methodology for setting the weights being produced, i.e. the shape of the “feature cloud” instead 363
324 is that when the new example we want to classify, x0 , is closer to of the temporal dependencies [43]. Therefore, we propose to 364
325 a specific class of Di than the other class, then the model trained use general statistical distribution descriptors on spectral fea- 365
326 in dataset Di will get high confidence. On the other hand, if tures like MFCCs to yield global features which encode the 366
x0 is near the classification boundary, then the model trained
ur
327 content of entire speech. To only characterize the voicing part 367
328 in dataset Di will get low confidence. In Section 5.1 we will of the speech, a voice activity detection (VAD) step was first 368
329 show how the above architecture achieves better transferability applied. The VAD is performed using a custom-written loud- 369
330 by outperforming competing methods in various cross-corpora ness estimation, which is calculated as the maximal power ob- 370
Jo
331 evaluations. tained from the mel-frequency spectrogram representation of 371

332 A graphical representation of the MoE architecture can be the speech signal [44]. A max-normalization was performed 372
333 seen in Figure 2. This architecture, even if it was originally before converting the power spectrogram (amplitude squared) 373
334 developed for neural networks, is very generic, since it can be to decibel (dB) units. A voicing segment is defined by a loud- 374
335 used with any set of features and any classifier. Furthermore, it ness threshold of -40 dB and minimum duration of 100 ms. 375
336 can also be used with deep learning models, for example Guo We calculated3 thirteen MFCC coefficients, MFCC-0...12. 376
337 et al. [17] use it in combination with an adversarial module to The zeroth coefficient (MFCC-0) quantifies the total energy of 377
338 align the target and sources domain. In the remainder of this the input signal. As we are aiming for a tool that also works 378
339 section we will provide more details on the extracted features in the uncontrolled environment of remote patient monitoring, 379
340 and the experts we use in our work.
3 To calculate MFCCs we used the python speech features library [45].
5
380 where there is no control on the background noise and loud- the training, testing and validation set contain records gener- 421
381 ness, we do not make use of the zeroth coefficient [46]. The ated from different subjects. This type of splitting prevents the 422
382 parameters used to calculate the coefficients were windows of leakage of information between training, testing and validation 423
383 25 ms with overlaps of 10 ms. To represent the distribution set due to the subject’s identity [23], and in speech processing 424
384 of each coefficient across the entire speech recording, we com- is known as speaker independent CV. 425
385 puted 11 statistical descriptors, mean, variance, kurtosis, skew- In our work, we want to use the emotional speech corpora to 426
386 ness, mode, IQR, percentiles 10th, 25th, 50th, 75th, and 90th. build a global model on predicting the probability of sounding 427
387 This results in a 12 × 11 = 132 features per utterance. Figure 3 happy (or sad). The model will be used to infer the probabil- 428
illustrates the steps we follow to extract the features. ity of sounding happy from an utterance produced by a new 429
subject, not seen before. As a result, we should use a subject- 430
1.0 1.0 wise CV protocol. Furthermore, we want to have a gender- 431

MFCC coefficients
Signal amplitude
10 independent model, in order not to favor hyper-parameters that

0.5 0.5 432
7 are tailored to males or females. 433

0.0 0.0
⇒ 4 To adhere this, we will adapt the subject-wise protocol and 434
0.5 0.5
1
we will use a stratified-groups-of-subject-wise nested CV. Since 435
1.0 1.0 we focus on gender-independent modelling, each group should

0.0 0.5 1.0 1.5 2.0 0 30 60 90 120 436
Time (sec) Frame number contain subjects from both genders, and be stratified in terms of 437
of
the ratio of males and females. Adapting the 10-fold subject- 438
Figure 3: A graphical representation of the waveform of one voice sample (left) wise CV, we will use a 5-fold groups-of-subjects-wise CV, where 439
and the extracted MFCCs matrix from voicing segments (right). In the wave- each group contains at least one male and one female. This 440
form the green shadow highlights the voicing segment extracted by the VAD.
ro
We used 11 statistical descriptors to represent the distribution of each of the means that we can only use the CREMA-D, RAVDESS and 441
twelve MFCC coefficients across the frames, which leads to a total of 132 global EMO-DB. The smallest of these corpora is EMO-DB, with 10 442
features. subjects (5 female, 5 male), this is the absolute minimum num- 443
388 ber of subjects we can have for our 5-fold groups-of-subjects- 444
389
390
391
4.3. Training experts via stratified-group-wise cross-validation
The MoE architecture can be used with any type of expert
model. We use gradient boosted decision trees (GBDT), a very
-p
wise CV presented in Figure 4. In EMO-DB each group will
contain only one male and one female subject, while for the
rest of the corpora (CREMA-D and RAVDESS) we stratify the
subjects with respect to the gender in each of the 5 folds. Hav-
445
446
447
448
re
392 powerful machine learning model, and specifically CatBoost, a ing this stratification is very important: for example in the in- 449
393 recently introduced fast, scalable and high performance GBDT ner CV loop (Figure 4b), having a validation set that contains 450
394 library [18]. Since our problem is binary classification, we build recordings only of females, will bias the hyper-parameter opti- 451
395 the GBDT by optimising the log-loss. The three main hyperpa- misation to values that perform well for females, but might not 452
lP
396 rameters of GBDT are: the shrinkage parameter, the number of perform well for male subjects. 453
397 gradient boosting iterations (i.e. the number of decision trees), The inner CV (Figure 4b) is used to select the hyperparam- 454
398 and the depth of the trees. We also optimised the number of eters that optimise an evaluation measure, in our work we use 455
399 splits for numerical features and the coefficient for the regular- F-measure. Then these hyperparameters are used and the per- 456
ization term of the loss function. formance is tested in the test set of the outer CV (Figure 4a).
na
400 457
401 In order to optimise all of the above hyper-parameters we Finally, from the five models of the outer CV (one model per 458
402 designed a nested cross-validation (CV) strategy. The hyper- fold) we choose the one that achieves the best performance, i.e. 459
403 parameter optimization is performed in the inner CV loop, while highest F-measure, and we retrain it using all the available data. 460
404 the outer CV is used to derive an estimate of the generalisation After the training was completed, the optimal parameters for 461
ur
405 performance and to select the optimal model [47]. A popular three models (experts) were the following: 462
406 approach is to use 10-fold CV, where the examples are split CREMA-D: Number of trees = 174, Depth of trees = 8. 463
407 randomly into 10 non overlapping sets. One set is kept for test- RAVDESS: Number of trees = 645, Depth of trees = 12. 464
408 ing the performance of the model, while the remaining 9 sets EMO-DB: Number of trees = 986, Depth of trees = 12. 465
Jo
409 are used in an inner CV for estimating the hyper-parameter by All the results of our work are generated by these three trained 466
410 iteratively splitting the data to train and validation sets. models. As cross-validation is a stochastic process we also per- 467
411 A key characteristic of the above procedure is that the ex- formed a sensitivity analysis across different runs, achieving 468
412 amples (or records) are split randomly into the different sets. consistent results. These additional robustness checks are re- 469
413 While record-wise CV is the holy grail of model selection, it ported in the Supplementary Material Section 4. 470
414 is not a panacea, since: “... for this procedure to be meaning-

415 ful, the relationship between the training and the validation set 4.4. Summary of our SER model 471
416 should mimic the relationship between the training set and the To summarise, we use the MoE architecture presented in 472
417 dataset expected for the clinical use” [23]. This is very impor- Figure 2 with K = 3 experts trained in the three mixed-gender 473
418 tant in clinical machine learning application. For example, if we emotional speech corpora: CREMA-D, RAVDESS and EMO- 474
419 want to develop global models that can be used with new sub- DB. In each utterance we are using a sliding window to extract 475
420 jects, we should follow a subject-wise CV framework, where
6
(a) Outer cross-validation to estimate generalisation performance and select optimal model. (b) Inner cross-validation for hyperparameter tuning.
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Fold 1 Fold 2 Fold 3 Fold 4

Group 1 Train + Valid. Train + Valid. Train + Valid. Train + Valid. Test Group 1’ Train Train Train Valid.
Group 2 Train + Valid. Train + Valid. Train + Valid. Test Train + Valid. Group 2’ Train Train Valid. Train
Group 3 Train + Valid. Train + Valid. Test Train + Valid. Train + Valid. Group 3’ Train Valid. Train Train
Group 4 Train + Valid. Test Train + Valid. Train + Valid. Train + Valid. Group 4’ Valid. Train Train Train
Group 5 Test Train + Valid. Train + Valid. Train + Valid. Train + Valid.
Figure 4: Graphical representation of the groups-of-subject-wise nested CV. The groups contain mutually exclusive subjects, stratified with respect to gender.
476 12 cepstral features (MFCCs) as low-level descriptors, and we domain adaptation [7]. While pooling data has an obvious ad- 519
477 summarise them using 11 statistical operators, resulting in a to- vantage of increasing the sample size, it may introduce con- 520
478 tal number of 132 features per recording (Figure 3). A GBDT, founding biases and it does not guarantee improvements in the 521
479 using the CatBoost library, is trained on each corpus, and the performance [17, 7]. To have a fair and informative compar- 522
480 hyper-parameters are optimised using a groups-of-subjects-wise ison, we use all of the five methods as a classification model 523
481 nested cross-validation strategy, presented in Figure 4. By fol- GBDT, which are optimised through the group-of-subjects CV 524
482 lowing this experimental protocol, we avoid introducing gen- strategy presented in Section 4.3. 525
of
483 der or speaker dependent biases in our models, which is im- To compare the five methods we use four evaluation mea- 526
484 portant since we deploy our model in utterances generated by sures, which are tailored to capture different aspects of the per- 527
485 previously unseen speakers from both genders. In the inference formance and each one has each one strengths and weaknesses 528
stage, we extract the same features from the target utterance, [48]: Accuracy, F-measure, Area Under the receiver operating
ro
486 529
487 and we pass it to the three models. The final prediction is a characteristic Curve (AUC) and Cohen’s Kappa statistic. Each 530
488 weighted average of the predictions of the three models, where of these are calculated per subject in the evaluation set, using 531
489 the weights depend on the measure of distance presented in Sec- the utterances that each subject provides. This choice of the 532
490
491
tion 4.1.
5. Results
-p
subject-wise average is a result of the characteristics of the three
test corpora (EMOVO, TESS and SAVEE) presented in Table 1.
We observe that in the three test corpora the speakers generate
very different number of utterances, for example in EMOVO
533
534
535
536
re
492 First we will compare the performance of our proposed MoE on average they generate 28, in TESS 400 and in SAVEE 30. 537
493 for SER against a set of competing methods, using utterances This disparity across corpora motivates the subject-level per- 538
494 from corpora that have not been used during the training phase. formance report to facilitate comparison, which avoids biases 539
495 Afterwards, we will use our model to explore various hypothe- towards methods that perform well to specific individuals that 540
lP
496 ses related to the emotional content of the Parkinsonian speech. produce many utterances. 541
Table 2 summarises our results and compares the different 542
497 5.1. Cross-corpus evaluation for SER methods in terms of accuracy. In this table we report the accu- 543
racy for each subject and, inside brackets, we provide the rank- 544
498 As we mentioned in the previous section, for training we use
ing score: the method with the best performance is assigned 545
three of the corpora presented in Table 1: CREMA-D, RAVDESS
na
499
ranking score 1, the second best 2, while the worst is assigned 546
500 and EMO-DB, the remaining three corpora (EMOVO, SAVEE,
score 5. As we observe, our MoE method outperforms the rest 547
501 TESS) will be used for a cross-corpus evaluation. Overall, we
since it achieves the highest average accuracy (0.84) and the 548
502 use for evaluation 544 happy and 543 sad utterances from 12
smallest average score (1.88). 549
503 different subjects: 6 (3 males and 3 females) of them are Ital-
To explore the statistical significance of our results we anal-
ur
550
504 ian speakers from EMOVO, 4 (all males) of them are British
ysed the ranks of the five methods by using a Friedman test with 551
505 English speakers from SAVEE and 2 (all females) are Cana-
the Nemenyi post-hoc test [49]. These are non-parametric tests 552
506 dian English from TESS. The diversity of these three corpora
that do not assume that the data follow a normal distribution. 553
507 creates a very challenging evaluation set. We compare the per-
Figure 5 presents the critical difference diagrams, where groups
Jo
554
formance of our suggested MoE for SER against the following
of methods that are not significantly different (at α = 0.05) are
508
555
509 methods:
connected. There is a clear statistical significant difference in 556
510 Method 1: Using only CREMA-D data to train a single-model.
the performance of the five methods (p ≤ 0.001), with MoE 557
511 Method 2: Using only RAVDESS data to train a single-model.
performing significantly better than Method 1 and Method 2 558
512 Method 3: Using only EMO-DB data to train a single-model.
methods. 559
513 Method 4: Pooling all data from CREMA-D, RAVDESS and
One limitation of the rank-based tests used so far is that 560
514 EMO-DB and train a single-model.
they rely on rank statistics that ignore the absolute difference 561
515 The first three methods are natural baselines in order to see
in the accuracies across the methods. A parametric alternative 562
516 how our ensemble MoE improves over each expert indepen-
to check whether there is a statistically significant difference 563
517 dently. Method 4, pooling all data together and building one
in the performance of the five methods is to use one-way re- 564
518 mode, is a common and strong method used for multi-source
peated measures ANOVA, which uses directly the accuracies. 565
7
Table 2: Comparing the different methods in terms of Accuracy. The ranking testing databases (EMOVO, TESS and SAVEE) the actors are 582
score is given inside brackets and the best method (i.e. lowest ranking score) is
highlighted with bold font. very expressive, i.e. they produce the emotions in high inten- 583
sity. Thus, a reasonable explanation for the good performance 584
Person ID Accuracy of Method 3 is that it is trained using data that are more similar 585
(Corpus/Gender) Method 1 Method 2 Method 3 Method 4 Our MoE with the three testing databases. 586
Person 1 Supplementary material Section 2 summarises our results 587

0.68 (5.0) 0.86 (2.5) 0.89 (1.0) 0.82 (4.0) 0.86 (2.5)
(EMOVO/Female) for the other three evaluation measures: F-measure (Supple- 588
Person 2 mentary Table 1), AUC (Supplementary Table 2) and Kappa 589
0.50 (5.0) 0.54 (4.0) 0.64 (3.0) 0.82 (1.0) 0.79 (2.0)
(EMOVO/Female)
statistic (Supplementary Table 3). Overall, our suggested method 590
Person 3
0.89 (4.0) 0.7 (5.0) 0.96 (2.0) 0.96 (2.0) 0.96 (2.0) outperforms competing methods in all evaluation measures. 591
(EMOVO/Female)
Person 4
0.96 (2.5) 0.64 (5.0) 0.96 (2.5) 0.96 (2.5) 0.96 (2.5)
(EMOVO/Male) 5.2. Comparing the emotional content of PD and HC speech 592
Person 5
0.68 (3.5) 0.5 (5.0) 0.79 (1.5) 0.68 (3.5) 0.79 (1.5) Parkinsonian speech has been associated to perceptual im- 593
(EMOVO/Male)
Person 6 pressions of sadness, more so than the speech of HC. Having 594
1.00 (2.5) 0.57 (5.0) 1.00 (2.5) 1.00 (2.5) 1.00 (2.5)
(EMOVO/Male) our MoE architecture allows us to verify this hypothesis di- 595
Person 7 rectly, since we can answer the question: “how much does the
0.50 (4.5) 0.53 (3.0) 0.50 (4.5) 0.83 (1.0) 0.60 (2.0) 596
(SAVEE/Male)
MoE-inferred probability of sounding happy differ between ut-
of
597
Person 8
0.60 (4.0) 0.53 (5.0) 0.83 (2.0) 0.77 (3.0) 0.87 (1.0) terances produced by PD and HC?” To this end, we will use
(SAVEE/Male) 598
Person 9
0.50 (3.5) 0.5 (3.5) 0.53 (1.5) 0.47 (5.0) 0.53 (1.5) PC-GITA corpus presented in Section 3.2, where various speak- 599
(SAVEE/Male) ing tasks are repeated by PD and their matched HC. 600
Person 10
ro
(SAVEE/Male)
0.50 (5.0) 0.53 (4.0) 0.63 (3.0) 0.80 (1.0) 0.67 (2.0) Figure 6 presents a summary of our results, where the sig- 601
Person 11 nificance is judged by using the non-parametric Mann-Whitney 602

0.50 (5.0) 0.52 (4.0) 0.83 (2.0) 0.65 (3.0) 0.87 (1.0)
(TESS/Female) U-test with the following levels: *** p-value ≤ 0.001, ** p- 603
Person 12
0.73 (5.0) 0.90 (4.0) 1.00 (1.0) 0.98 (3.0) 0.99 (2.0) value ≤ 0.01, * p-value ≤ 0.05, not significant (ns) p-value > 604
(TESS/Female)
Avg. accuracy
Avg. ranking score
0.67
4.12
0.61
4.17
0.80
2.21
0.82
2.62
0.84
1.88
-p
0.05. These levels are used throughout our work. As we ob-
serve from Figures 6a - 6c, in all of the spearking tasks and
tests there is a statistically significant difference, and the HC is
605
606
607
consistently predicted to sound more happy than the PD.

Friedman test: p-value 0.001
re
608
Furthermore, Table 3 ranks all of the different speaking tasks 609
CD according to Cohen’s d, an effect size suitable for Mann-Whitney 610
U-test. Since in each speaking task we have multiple tests,

1 2 3 4 5 611
we report the average Cohen’s across the tests. Furthermore,

lP
612
to check the statistical significance of each speaking task, we 613
Our MoE (1.88) Method 2 (4.17) derive a combined p-value across the multiple Mann Whitney 614
Method 3 (2.21) Method 1 (4.12) tests using Meng’s correction presented in [50], which is suit- 615
Method 4 (2.62) able when the multiple tests are strongly dependent. As Table 616
3 shows, the strongest effect size is observed in the read dialog

na
617
Figure 5: Critical Difference (CD) diagrams to compare the performance of the task, where the two groups (HC vs PD) differ by 0.934 standard 618
five methods in terms of accuracy.
deviations. Strong effect sizes with statistical significance are 619
also observed for the other two tests (sentences and sentences 620
566 This test indicates again statistical significance with p ≤ 0.001. with emphasis).
ur
567 Furthermore, using a paired-samples t-test, we can infer again

Table 3: Correlation between the probability of sounding happy between the
568 that MoE performs significantly better than Methods 1 and 2. two groups (PD and HC). The results are ranked according to the Average Co-
569 Another interesting observation is that Method 3, which hen’s D effect size and we also report the combined Mann-Whitney test p-value
uses the EMO-DB database to build the model, outperforms for the different speaking tasks.
Jo
570
571 Methods 1 and 2, which use the CREMA-D and the RAVDESS
572 respectively. Initially, this looks a surprising result, since by average combined
speaking tasks (sign.)
573 checking Table 1 we can see EMO-DB is a smaller database effect size p-values
574 than RAVDESS and CREMA-D. The explanation for this find- read-dialog 0.934 0.000 (***)
575 ing comes from the fact that RAVDESS and CREMA-D are the sentences-emphasis 0.833 0.000 (***)
576 only two databases where emotions are produced at different sentences 0.814 0.001 (**)
577 levels of intensity. For example, in CREMA-D [31] the actors
578 were directed to express the sentences in three levels of inten- 621
579 sity: low, medium, and high, while in RAVDESS [32] each
580 emotion is produced at two levels of emotional intensity: nor-
581 mal and strong. On the contrary, in EMO-DB and in the three
8
(a) Sentences (b) Sentences with emphasis (c) Read dialog
** *** *** ** *** *** *** *** *** ***
1.0 *** 1.0 1.0 HC
PD
0.8 0.8 0.8
Prob of sounding happy

0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2

HC HC
PD PD
0.0 0.0 0.0
laura loslibros luisa micasa omar rosita juan preocupado triste viste ayerfuialmedico
Figure 6: Comparing probability of sounding happy across the two groups PD vs HC and across speaking tasks and speaking tests. The statistical significance is
judged using Mann-Whitney test.
622 5.3. Comparing the emotional content of PD speech across dif- the correlation between the probability of sounding happy ob- 650
623 ferent levels of speech impairment tained by our MoE method and the disease severity, measured 651
of
624 Using our model for SER and the PC-GITA corpus, we can in terms of the MDS-UPDRS-III score (details in Section 3.2). 652
625 explore an issue that has not been explored in the literature so Figure 8 presents the scatter plots for the tests from each 653
626 far: “how does the degree of speech impairment affect the per- speaking task with the strongest correlation. In these plots, 654
ception of emotional content in Parkinsonian speech?” For this each dot represents a person, the x-axis describes the MDS- 655
ro
627
628 purpose we will focus on the 50 PD patients, and we will ex- UPDRS-III score, while the y-axis the probability that the ut- 656
629 plore how the probability of sounding happy differs in the four terance sounds happy as derived by our MoE model. As we 657
630 groups with different levels of impairment presented in Figure observe, there is a strong anticorrelation between the probabil- 658
ity of sounding happy and the disease severity: patients with

631
632
633
634
1. To explore the relation between the probability of sounding
happy and the speech impairment, since the latter is in ordinal
scale, we will use the Spearman correlation coefficient and the
rank order test.
-p
high disease severity tend to sound less happy. Finally, Table
5 ranks the different speaking tasks according to the Spearman
correlation and provides the combined p-value. All the three
659
660
661
662
tests, i.e. read dialog, sentences and sentences with emphasis,

re
635 Figure 7 presents the box plots for the tests from each speak- 663
636 ing task with the strongest correlation. We observe a strong achieve statistically significant results. 664
637 anticorrelation between the probability of sounding happy and

Table 5: Correlation between the probability of sounding happy between and
638 the speech impairment: patients with higher impairment tend disease severity, measured in terms of the MDS-UPDRS-III score. The results
lP
639 to sound less happy. Table 4 ranks the different speaking tasks are ranked according to the effect size, estimated through the average Spearman
640 according to the Spearman correlation, and it also provides the correlation, and we also report the combined Spearman rank order test p-value
for the different speaking task.
641 combined p-value. Again we observe that all the three tests, i.e.
642 read dialog, sentences and sentences with emphasis, achieve
statistically significant results. average combined
643
na
effect size p-values

Table 4: Correlation between the probability of sounding happy between and
the different levels of speech impairment in PD patients. The results are ranked
read-dialog -0.495 0.001 (***)
according to the effect size, estimated through the average Spearman correla- sentences -0.463 0.005 (**)
tion, and we also report the combined Spearman rank order test p-value for the sentences-emphasis -0.443 0.010 (**)
different speaking task.
ur
average combined
effect size p-values 6. Discussion 665
Jo
read-dialog -0.415 0.006 (**) In this section we will present our principal findings and fu- 666
sentences -0.414 0.012 (*) ture directions. 667
sentences-emphasis -0.369 0.030 (*) 668
6.1. Findings on using our MoE for SER 669
644 5.4. Comparing the emotional content of PD speech across dif- In the cross-corpora evaluation (Section 5.1), our MoE method 670
645 ferent levels of disease severity is equal to or outperforms competitors in all four evaluation 671
646 Finally, in this section we will try to answer an even broader measures. In other words, using our MoE we achieve same or 672
647 question: “how does the disease severity affect the perception better transferability than using each model independently or 673
648 of emotional content in Parkinsonian speech?”. For this pur- by pooling all data together. One interesting characteristic of 674
649 pose we will focus on the 50 PD patients, and we will explore
9
(a) Sentences, test title: luisa (b) Sentences with emphasis, test title: preocupado (c) Read dialog, test title: ayerfuialmedico
1.0 Spearman rank coefficient = -0.515 1.0 Spearman rank coefficient = -0.428 1.0 Spearman rank coefficient = -0.414

0.8 0.8 0.8
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
0 1 2 3 0 1 2 3 0 1 2 3
Speech impairment Speech impairment Speech impairment
Figure 7: Comparing probability of sounding happy across different levels of speech impairment and across different speaking tasks. For each speaking task we
report the test with the highest correlation, using the Spearman’s rank correlation.
(a) Sentences, test title: laura (b) Sentences with emphasis, test title: preocupado (c) Read dialog, test title: ayerfuialmedico
Spearman rank coefficient = −0.533 Spearman rank coefficient = −0.509 Spearman rank coefficient = −0.495
1.00 1.00 1.00
●

●●
of
● ● ● ●
● ● ●● ●
● ● ● ● ● ●●
0.75 ● ● ● ●● 0.75 ● ● 0.75 ●
●● ● ● ●● ●
● ●
●
● ● ●●● ● ● ● ●
●●
● ● ● ●●
●● ● ● ● ●● ●●● ●●
● ● ● ● ● ● ●● ●●
● ● ● ●
● ● ● ● ●●●
● ● ● ● ● ● ●
● ●● ● ● ●
0.50 ● 0.50 ● ●● 0.50 ● ●
● ●
● ● ● ●
●
ro
● ● ●
● ●● ● ● ●
● ● ●
● ●● ●
● ● ●
0.25 0.25 0.25 ●
●
● ● ● ● ●
● ●
● ●
● ●
0.00 0.00 0.00
25 50 75 25 50 75 25 50 75
MDS−UPDRS−III
Figure 8: Comparing probability of sounding happy across different levels of disease severity, measured in terms of the MDS-UPDRS-III score. For each speaking
task we report the test with the highest correlation, using the Spearman’s rank correlation.
MDS−UPDRS−III
-p MDS−UPDRS−III
re
675 the MoE architecture, which can explain the superior perfor- this challenging cross-corpus evaluation setting, in the record- 703
676 mance, is that different experts focus on different areas of the ings of 9 out of 12 people the MoE model predicts the correct 704
677 input space. Supplementary Table 4 presents the 10 features emotion with relatively high accuracy (∼ 80% and above). Only 705
lP
678 with the highest importance scores for the three different GBDT for the recordings of Person 9 the accuracy is close to random 706
679 experts. As this table shows, when we want to classify a new guess, but still MoE outperforms the competing methods. At 707
680 text example (i.e. utterance) as happy or sad, different experts the time of writing, state-of-the-art methods for SER are deep 708
681 rely on different features to perform the classification. learning architectures [12], such as the recently introduced ar- 709
This is the first time in that a MoE model has been used chitecture that uses Mel-spectrograms as inputs and leverages
na
682 710
683 for combining multiple diverse sources of emotional speech in a parallel combination of an attention-based long short-term 711
684 order to build a SER system. In the recent literature of mul- memory recurrent neural networks with an attention-based fully 712
685 tidomain adaptation [17], MoE architectures have been used in convolutional networks[51]. Unfortunately, to train this type of 713
686 natural language processing for sentiment analysis and for part- models access to vast amounts of data is needed. This is not the 714
of-speech tagging, where they consistently outperform multi- case for the databases we focus on our work (Table 1), where
ur
687 715
688 ple baselines. The improvement in the performance of the MoE all the subjects repeat the same sentences. To explore the emo- 716
689 can be attributed to the fact that combining diverse set of experts tional content of Parkinsonian speech vs healthy speech, it is 717
690 helps the generalisation to unseen data sources. This property is very important to train the models using recordings from sub- 718
Jo
691 crucial for our research, since we train the models in emotional jects that repeat the same text/sentence, with different emotions. 719
692 speech databases (Table 1) and we explore the peculiarities of Since the focus of our work is the expressiveness of the emo- 720
693 Parkinsonian speech using PC-GITA, a database that contains tions, we do not want to introduce biases due to the semantic 721
694 PD and HC speech recordings. content, which would have been the case if we were using big- 722
695 At this point it is useful to clarify that the purpose of our ger databases that convey the emotion through free-speech or 723
696 work is not to build another state-of-the-art SER system, but to natural dialogues, e.g. IEMOCAP [37]. 724
697 build a system that achieves a good performance across speech One important factor that can influence the results is noise. 725
698 corpora, or in other words good transferability, and use this Bhattacharjee et al. [52] have shown that the normal distribu- 726
699 system to explore the emotional content in the Parkinsonian tion parameters of the MFCC feature vectors of noisy speech 727
700 speech. In other words, we are interested in exploring the pos- signals deviate significantly from the corresponding clean speech 728
701 sibility of learning from more easily available data something signal. Therefore, clean speech is a prerequisite to ensure a re- 729
702 useful to assess clinical data. From Table 2, we can see that in liable prediction from our model. 730
10
731 6.2. Findings on analysing the emotional content of PD speech area, one of the main applications of transfer learning is in 786
732 The main focus of our work was to provide machine learn- building deep neural network architectures, where large non- 787
733 ing insights on the emotional content of the Parkinsonian speech. parkinsonian speech corpora are used for training the models, 788
734 For this purpose we used the PC-GITA database, where there which at the end are fine-tuned with a small parkinsonian speech 789
735 are both PD and HC subjects that repeat the same speaking test, corpus. In our work, we followed a cross-domain transfer learn- 790
736 e.g. repeat exactly the same sentence. Our first finding (Section ing approach where publicly available emotional speech cor- 791
737 5.2) was that the PD speech sounds significantly less happy than pora were used to build a MoE model, which was used to val- 792
738 the HC in all tests, i.e. reading sentences, reading sentences idate the hypothesis that Parkinsonian speech sounds more sad 793
739 with emphasis on specific words, and reading dialogues. This than healthy speech. Although in our work we focused on 794
740 finding, although initially surprising, since all subjects repeat two emotions (happy and sad), our MoE architecture can be 795
741 exactly the same sentences, it is line with the literature. Pell straightforwardly extended to multi-class emotion recognition, 796
742 et al. [5] performed an observation study where they recorded i.e. training using various emotions, such as happiness, sadness, 797
743 emotional utterances by PD and HC subjects and they asked in- neutral, fear, surprise and anger. As a future direction, it will 798
744 dependent listeners, naive to the disease status of each individ- be insightful to explore if the probabilities of further emotional 799
745 ual speaker to detect the intended emotion. The main outcome categories (e.g. neutral, fear, surprise and anger) also change 800
746 was that Parkinsonian speech was often perceived as sad. significantly between the two groups (PD vs HC). A different 801
Furthermore, our finding is complementary to studies that direction can be to check how the probabilities change if we fo- 802
of
747
748 build prediction models for distinguishing between PD and HC cus on positive and negative groups of emotions. For example, 803
749 using running speech (i.e. reading text). The impact of PD in a recent work [56], by focusing on positive vs negative emo- 804
750 on speaking skills can be characterised by three principle “ditions, the authors achieved high transferability across different 805
ro
751 mensions” of speech: phonation, articulation and prosody [21]. SER corpora. 806
752 There are numerous works that build upon this and explore Our findings are complementary to previous automated voice 807
753 how to derive speech features and build prediction models to analysis of PD [22] and have the potential to improve the in- 808
754 distinguish between PD and HC [20, 21]. Our work looks at terpretation of data from PD telemonitoring systems. Further- 809
755
756
757
the PD speech characteristics from a different angle, i.e. the
speech emotional content. We rely on publicly available speech
corpora for building a predictive model to distinguish between
-p
more, our model can be used to monitor the progress of certain
therapies, for example the behavioral speech therapy, which ap-
pears to greatly improve the speech dysfunction [57]. Towards
this direction, one interesting hypothesis to explore is how the
810
811
812
happy and sad utterances. Then we transfer this model to show 813
re
758
759 that the PD speech sounds less happy, and this is due to the PD speech therapy affects the probability of sounding happy de- 814
760 speech pathology. It will be interesting to explore further the rived by our MoE model. Finally, by adopting a MoE ap- 815
761 connection between these two tasks: modelling PD vs HC and proach trained on non-clinical data with promising results on 816
modelling happy vs sad. For example, knowing which features never-seen-before clinical data, our work has implications for 817
lP
762
763 are important for both of these tasks will shed more light on the the growing research on vocal markers of neuro-psychiatric con- 818
764 characteristics of the PD speech in general, and how it relates ditions and its documented issues of small sample sizes and lack 819
765 to emotional PD speech. of generalizability [58, 59, 60]. 820
766 Since PC-GITA contains for each PD patient the score of The main motivation of our work was to rely on the clinical 821
the speech impairment, in Section 5.3 we explored another in- insight that PD speech is perceived as sad. We therefore as-
na
822
767
768 teresting hypothesis: “how the probability of sounding happy sessed whether we could use emotional speech corpora to train 823
769 changes across the different levels of impairment.” There is a machine learning model that could capture speech differences 824
770 a negative relation between the probability of sounding happy between PD patients and matched controls, as well as between 825
771 and the degree of speech impairment. The more severe the patients with varying degrees of speech impairment. While the 826
ur
772 speech pathology, the less happy the PD patient sounds. This usefulness of our results is independent on whether the specific 827
773 negative relation is statistically significant in all the speaking speech samples we investigated actually sound sad, it would 828
774 tasks. To the best of our knowledge such results are reported for have been insightful to explore whether the differences between 829
775 the first time in the literature. Finally, in Section 5.4 we showed PD and HC speech presented in Figure 6 are due to the speech 830
Jo
776 that a similar trend holds between the probability of sounding impairment or to the actual emotions expressed by the partic- 831
777 happy and the overall disease progression. To measure the dis- ipants, and whether these differences would correspond to hu- 832
778 ease progression we used the MDS-UPDRS-III score, which man judgments of sadness in the speech. Unfortunately, these 833
779 provides a clinician’s assessment over the severity of a patient’s pursuits were not possible with the current dataset: we have no 834
780 motor symptoms. This finding is complementary to studies that emotion-related perceptual judgments for the utterances of PC- 835
781 show that emotions are highly affected by the progression of the GITA. An interesting venue for future work would be to cre- 836
782 disease. ate an emotional speech corpus from both PD and HC speech, 837
where for each utterance we will know the indented emotion 838
783 6.3. Future research and the disease status of the speaker, as well as human produced 839
emotional ratings. 840

784 Transfer learning techniques have been used for detection
785 and assessment of PD from speech data [53, 54, 55]. In this
11
841 7. Conclusions [4] J. R. Orozco-Arroyave, Analysis of speech of people with Parkinson’s 894
disease, Vol. 41, Logos Verlag Berlin GmbH, 2016. 895
842 In this article, we introduced a mixture-of-experts for recog- [5] M. D. Pell, H. S. Cheang, C. L. Leonard, The impact of parkinson’s dis- 896
ease on vocal-prosodic communication from the perspective of listeners, 897
843 nising emotion from speech, combining models trained on emo-
Brain and language 97 (2) (2006) 123–134. 898
844 tional speech corpora. Under our architecture, each expert is [6] A. Jaywant, M. D PELL, Listener impressions of speakers with parkin- 899
845 trained on a different corpus and the final prediction is a weighted son’s disease, Journal of the International Neuropsychological Society: 900
846 average of the individual expert predictions, where the weights JINS 16 (1) (2010) 49. 901
[7] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, G. J. Gordon, 902

847 depend on a measure of distance between the sample at hand Adversarial multiple source domain adaptation, in: Advances in Neural 903
848 and the samples in each training corpus. We showed that our Information Processing Systems, 2018, pp. 8559–8570. 904
849 model achieves better transferability and outperforms compet- [8] J. R. Orozco-Arroyave, J. C. Vásquez-Correa, J. F. Vargas-Bonilla, 905
850 ing methods in cross-corpora experiments, i.e. using for eval- R. Arora, N. Dehak, P. S. Nidadavolu, H. Christensen, F. Rudzicz, 906
M. Yancheva, H. Chinaei, Neurospeech: An open-source software for 907

851 uation subjects from hold-out corpora. Furthermore, we used parkinson’s speech analysis, Digital Signal Processing 77 (2018) 207– 908
852 our model to get a better insight on the emotional content of 221. 909
853 the Parkinsonian speech, and how the peculiarities in this type [9] B. Schuller, A. Batliner, S. Steidl, D. Seppi, Recognising realistic emo- 910
854 of speech affect its emotional sound. To this end, we used a re- tions and affect in speech: State of the art and lessons learnt from the first 911
challenge, Speech Communication 53 (9-10) (2011) 1062–1087. 912
855 cently introduced corpus, PC-GITA, where PD and HC perform [10] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recogni- 913
856 various speaking tasks. Our first finding was that Parkinsonian tion: Features, classification schemes, and databases, Pattern Recognition 914
of
857 speech sounds less happy than speech from healthy subjects. 44 (3) (2011) 572–587. 915
858 This finding is in line with observational studies that compare [11] C.-N. Anagnostopoulos, T. Iliou, I. Giannoukos, Features and classifiers 916
for emotion recognition from speech: a survey from 2000 to 2011, Artifi- 917
859 healthy and Parkinsonian speech. Then, for first time in the cial Intelligence Review 43 (2) (2015) 155–177. 918
literature, we explored the correlation between the PD related [12] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, T. Alhussain,
ro
860 919
861 speech impairment and the emotional content of the speech. Speech emotion recognition using deep learning techniques: A review, 920
IEEE Access 7 (2019) 117327–117345. 921

862 In the group of people with Parkinson’s disease we observed [13] B. W. Schuller, Speech emotion recognition: Two decades in a nutshell, 922
863 a negative relation between the probability of sounding happy benchmarks, and ongoing trends, Communications of the ACM 61 (5) 923
864
865
and the speech impairment.
Acknowledgment
-p
(2018) 90–99.
[14] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Dev-
illers, L. Vidrascu, N. Amir, L. Kessous, The relevance of feature type for
the automatic classification of emotional user states: low level descrip-
tors and functionals, in: Eighth Annual Conference of the International
924
925
926
927
928
re
866 The authors would like to thank all of the patients and col- Speech Communication Association (INTERSPEECH), 2007. 929
[15] B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, G. Rigoll, 930
867 laborators from Fundalianza Parkinson Colombia. Without their Speaker independent speech emotion recognition by ensemble classifi- 931
868 support and contribution it would not be possible to create the cation, in: 2005 IEEE International Conference on Multimedia and Expo, 932
869 corpus PC-GITA and address this research. Konstantinos Sechidis IEEE, 2005, pp. 864–867. 933
lP
870 was funded by Roche pRED Operations Advanced Analytics [16] Y. Sun, G. Wen, Ensemble softmax regression model for speech emotion 934
recognition, Multimedia Tools and Applications 76 (2017) 8305–8328. 935

871 Postdoctoral Fellowship Program, which is aligned with the [17] J. Guo, D. Shah, R. Barzilay, Multi-source domain adaptation with mix- 936
872 pRED Postdoctoral Fellowship Program. ture of experts, in: Proceedings of the 2018 Conference on Empirical 937
873 Furthermore, the authors would like to thank Dr. Martin Methods in Natural Language Processing, 2018, pp. 4694–4703. 938
[18] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin, Cat- 939

Strahm, Dr. Florian Lipsmeier, Dr. Michael Mitchley, Dr. Kirsten
na
874
boost: unbiased boosting with categorical features, in: Advances in Neu- 940
875 Taylor, Dr. Chris Chatham, Florian Zabel and Ioannis Liabotis ral Information Processing Systems, 2018, pp. 6638–6648. 941
876 at Roche for their valuable support. Additional thanks are given [19] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, L. O. Ramig, 942
877 to Dr. Fabio Massimo Zennaro, Prof. Gavin Brown, Konstanti- Suitability of dysphonia measurements for telemonitoring of parkinson’s 943
disease, IEEE Trans. on Biomedical Engineering 56 (4) (2009) 1015. 944
878 nos Papangelou, Ludvig Olsen, Lasse Hansen and Dr. Idoia [20] J. Hlavnička, R. Čmejla, T. Tykalová, K. Šonka, E. Ržička, J. Rusz, Au- 945
ur
879 Grau Sologestoa for their feedback on the manuscript and the tomated analysis of connected speech reveals early biomarkers of parkin- 946
880 fruitful discussions on the topic. Finally, the authors would like son’s disease in patients with rapid eye movement sleep behaviour disor- 947
881 to thank the anonymous reviewers for their constructive com- der, Scientific reports 7 (1) (2017) 1–13. 948
[21] J. Orozco-Arroyave, F. Hönig, J. Arias-Londoño, J. Vargas-Bonilla, 949

882 ments. K. Daqrouq, S. Skodda, J. Rusz, E. Nöth, Automatic detection of parkin- 950
Jo
son’s disease in running speech spoken in three different languages, The 951
Journal of the Acoustical Society of America 139 (1) (2016) 481–500. 952
883 References [22] A. Tsanas, M. A. Little, P. E. McSharry, L. O. Ramig, Accurate telemon- 953
itoring of parkinson’s disease progression by noninvasive speech tests, 954
884 [1] A. K. Ho, R. Iansek, C. Marigliani, J. L. Bradshaw, S. Gates, Speech IEEE Trans. on Biomedical Engineering 57 (4) (2009) 884–893. 955
885 impairment in a large sample of patients with parkinson’s disease, Be- [23] S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, K. P. Kording, The need to 956
886 havioural neurology 11 (3) (1999) 131–137. approximate the use-case in clinical machine learning, Gigascience 6 (5) 957
887 [2] T. G. Beach, C. H. Adler, L. I. Sue, L. Vedders, L. Lue, C. L. White III, (2017) gix019. 958
888 H. Akiyama, J. N. Caviness, H. A. Shill, M. N. Sabbagh, Multi-organ dis- [24] J. C. Vásquez-Correa, T. Arias-Vergara, J. Orozco-Arroyave, B. Eskofier, 959
889 tribution of phosphorylated α-synuclein histopathology in subjects with J. Klucken, E. Nöth, Multimodal assessment of parkinson’s disease: a 960
890 lewy body disorders, Acta neuropathologica 119 (6) (2010) 689–702. deep learning approach, IEEE journal of biomedical and health informat- 961
891 [3] K. Torsney, D. Forsyth, Respiratory dysfunction in parkinson’s disease., ics 23 (4) (2018) 1618–1630. 962
892 The journal of the Royal College of Physicians of Edinburgh 47 (1) (2017) [25] L. Moro-Velazquez, J. A. Gomez-Garcia, J. I. Godino-Llorente, 963
893 35–39. F. Grandas-Perez, S. Shattuck-Hufnagel, V. Yagüe-Jimenez, N. Dehak, 964
12
965 Phonetic relevance and phonemic grouping of speech in the automatic perspective, Cambridge University Press, 2011. 1036
966 detection of parkinson’s disease, Scientific Reports 9 (1) (2019) 1–16. [49] J. Demšar, Statistical comparisons of classifiers over multiple data sets, 1037
967 [26] A. Tsanas, M. A. Little, P. E. McSharry, L. O. Ramig, Nonlinear speech Journal of Machine learning research 7 (Jan) (2006) 1–30. 1038
968 analysis algorithms mapped to a standard metric achieve clinically useful [50] V. Vovk, R. Wang, Combining p-values via averaging, arXiv preprint 1039
969 quantification of average parkinson’s disease symptom severity, Journal arXiv:1212.4966 (2012). 1040
970 of the royal society interface 8 (59) (2011) 842–855. [51] Z. Zhao, Z. Bao, Y. Zhao, Z. Zhang, N. Cummins, Z. Ren, B. Schuller, Ex- 1041
971 [27] S. Anand, C. E. Stepp, Listener perception of monopitch, naturalness, and ploring deep spectrum representations via attention-based recurrent and 1042
972 intelligibility for speakers with parkinson’s disease, Journal of Speech, convolutional neural networks for speech emotion recognition, IEEE Ac- 1043
973 Language, and Hearing Research 58 (4) (2015) 1134–1144. cess 7 (2019) 97515–97525. 1044
974 [28] H. S. Cheang, M. D. Pell, An acoustic investigation of parkinsonian [52] U. Bhattacharjee, S. Gogoi, R. Sharma, A statistical analysis on the im- 1045
975 speech in linguistic and emotional contexts, Journal of Neurolinguistics pact of noise on mfcc features for speech recognition, in: 2016 Interna- 1046
976 20 (3) (2007) 221–241. tional Conference on Recent Advances and Innovations in Engineering 1047
977 [29] J. Möbes, G. Joppich, F. Stiebritz, R. Dengler, C. Schröder, Emotional (ICRAIE), IEEE, 2016, pp. 1–5. 1048
978 speech in parkinson’s disease, Movement Disorders 23 (2008) 824–829. [53] J. C. Vásquez-Correa, T. Arias-Vergara, C. D. Rios-Urrego, M. Schus- 1049
979 [30] S. Zhao, F. Rudzicz, L. G. Carvalho, C. Márquez-Chin, S. Livingstone, ter, J. Rusz, J. R. Orozco-Arroyave, E. Nöth, Convolutional neural net- 1050
980 Automatic detection of expressed emotion in parkinson’s disease, in: works and a transfer learning strategy to classify parkinson’s disease from 1051
981 2014 IEEE International Conference on Acoustics, Speech and Signal speech in three different languages, in: Iberoamerican Congress on Pat- 1052
982 Processing (ICASSP), IEEE, 2014, pp. 4813–4817. tern Recognition, Springer, 2019, pp. 697–706. 1053
983 [31] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, [54] L. Moro-Velazquez, J. Villalba, N. Dehak, Using x-vectors to automat- 1054
984 R. Verma, Crema-d: Crowd-sourced emotional multimodal actors dataset, ically detect parkinson’s disease from speech, in: ICASSP 2020 - 2020 1055
985 IEEE Trans. on affective computing 5 (4) (2014) 377–390. IEEE International Conference on Acoustics, Speech and Signal Process- 1056
of
986 [32] S. R. Livingstone, F. A. Russo, The ryerson audio-visual database of emo- ing (ICASSP), 2020, pp. 1155–1159. 1057
987 tional speech and song (ravdess): A dynamic, multimodal set of facial and [55] L. Moro-Velazquez, J. A. Gomez-Garcia, J. D. Arias-Londoño, N. De- 1058
988 vocal expressions in north american english, PloS one 13 (5) (2018). hak, J. I. Godino-Llorente, Advances in parkinson’s disease detection 1059
989 [33] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss, A and assessment using voice and speech: A review of the articulatory and 1060
database of german emotional speech, in: Ninth European Conference on phonatory aspects, Biomedical Signal Processing and Control 66 (2021)
ro
990 1061
991 Speech Communication and Technology, 2005. 102418. 1062

992 [34] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco, Emovo corpus: an ital- [56] H. Zhou, K. Chen, Transferable positive/negative speech emotion recog- 1063
993 ian emotional speech database, in: International Conference on Language nition via class-wise adversarial domain adaptation, in: ICASSP 2019 1064
994 Resources and Evaluation (LREC 2014), European Language Resources - 2019 IEEE International Conference on Acoustics, Speech and Signal 1065
995
996
997
998
999
[35]
[36]
Association (ELRA), 2014, pp. 3501–3504.
K. Dupuis, M. K. Pichora-Fuller, Recognition of emotional speech for
younger and older talkers: Behavioural findings from the toronto emo-
tional speech set, Canadian Acoustics 39 (3) (2011) 182–183.
S. Haq, P. J. B. Jackson, Speaker-dependent audio-visual emotion recog-
-p
Processing (ICASSP), 2019, pp. 3732–3736.
[57] L. O. Ramig, C. Fox, S. Sapir, Speech treatment for parkinson’s disease,
Expert Review of Neurotherapeutics 8 (2) (2008) 297–309.
[58] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, T. F.
Quatieri, A review of depression and suicide risk assessment using speech
1066
1067
1068
1069
1070
re
1000 nition, in: Proc. Int. Conf. on Auditory-Visual Speech Processing analysis, Speech Communication 71 (2015) 10–49. 1071
1001 (AVSP’09), Norwich, UK, 2009. [59] R. Fusaroli, A. Lambrechts, D. Bang, D. M. Bowler, S. B. Gaigg, Is voice 1072
1002 [37] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. a marker for autism spectrum disorder? a systematic review and meta- 1073
1003 Chang, S. Lee, S. S. Narayanan, Iemocap: Interactive emotional dyadic analysis, Autism Research 10 (3) (2017) 384–407. 1074
1004 motion capture database, Language resources and evaluation 42 (4) [60] A. Parola, A. Simonsen, V. Bliksted, R. Fusaroli, Voice patterns in 1075
lP
1005 (2008) 335. schizophrenia: A systematic review and bayesian meta-analysis, bioRxiv 1076
1006 [38] J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, M. C. (2019) 583815. 1077

1007 González-Rátiva, E. Nöth, New Spanish speech corpus database for the
1008 analysis of people suffering from Parkinson’s disease, in: Proceedings of
1009 the Ninth International Conference on Language Resources and Evalu-
1010 ation (LREC’14), European Languages Resources Association (ELRA),
na
1011 Reykjavik, Iceland, 2014, pp. 342–347.

1012 [39] C. G. Goetz, et al., Movement disorder society-sponsored revision of the
1013 unified parkinson’s disease rating scale MDS-UPDRS: scale presentation
1014 and clinimetric testing results, Movement disorders: official journal of the
1015 Movement Disorder Society 23 (15) (2008) 2129–2170.
1016 [40] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures
ur
1017 of local experts., Neural computation 3 (1) (1991) 79–87.

1018 [41] G. Brown, Ensemble learning, Encyclopedia of Machine Learning (2010)
1019 312–320.
1020 [42] S. Davis, P. Mermelstein, Comparison of parametric representations for
1021 monosyllabic word recognition in continuously spoken sentences, IEEE
Jo
1022 Trans. on acoustics, speech, and signal processing 28 (4) (1980) 357–366.
1023 [43] T. Kinnunen, H. Li, An overview of text-independent speaker recognition:
1024 From features to supervectors, Speech communication 52 (2010) 12–40.
1025 [44] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg,
1026 O. Nieto, librosa: Audio and music signal analysis in python, 2015.
1027 [45] J. Lyons, python speech features: common speech features for asr,
1028 github.com/jameslyons/python_speech_features (Accessed
1029 July 1, 2019).
1030 [46] K. S. Rao, S. G. Koolagudi, Emotion recognition using speech features,
1031 Springer Science & Business Media, 2012.
1032 [47] G. C. Cawley, N. L. Talbot, On over-fitting in model selection and sub-
1033 sequent selection bias in performance evaluation, Journal of Machine
1034 Learning Research 11 (Jul) (2010) 2079–2107.
1035 [48] N. Japkowicz, M. Shah, Evaluating learning algorithms: a classification
13

Journal Pre-Proof: Artificial Intelligence in Medicine

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Journal Pre-Proof: Artificial Intelligence in Medicine

Uploaded by

Copyright:

Available Formats

Journal Pre-proof

A Machine Learning perspective on the emotional content of

Konstantinos Sechidis, Riccardo Fusaroli, Juan Rafael

To appear in: Artiﬁcial Intelligence In Medicine

Received Date: 6 March 2020

Please cite this article as: { doi: https://doi.org/

© 2020 Published by Elsevier.

1. Introduction This conclusion was derived by observational studies were PD

and healthy control (HC) subjects produced emotional utter- 17

tients and HC. 32

The purpose of our work is thus threefold: 33

detlef.wolf@roche.com (Detlef Wolf), literature of multi-source domain adaptation [7]. 38

AG, Basel, Switzerland.

Preprint submitted to Journal of Artificial Intelligence In Medicine February 25, 2021

60 Parkinsonian speech. Section 6 provides a discussion on the ation, kurtosis). 111

den Markov Models, Gaussian Mixture Models and Random 114

79 focused on (details in Section 3). Furthermore, in absence of

80 large and heterogeneous datasets, there are concerns of over- 132

between PD and HC. Instead of focusing on sustained phona- 144

2 Tobuild this type of models we need to train them on emotional speech

154 egy (e.g. splitting the data to train/test/validation subsets) plays

in Table 1, where we can see the diversity of the characteristics. 233

189 seven HC subjects to repeat ten unique statements, each pro-

  model. Section 4.1 describes the mixture-of-experts architec- 299

work we use probability as a proxy for the degree of happiness

where µXDi is the mean vector of the features in the dataset Di ,

e(x0 , Di ) = |d(x0 , D+i ) − d(x0 , D−i )|,

In contrast, the Mel Frequency Cepstral Coefficients (MFCCs), 352

over a mel-frequency scale, allowing better representation of 357

should give us a scale invariant global representation of speech 362

331 evaluations. tained from the mel-frequency spectrogram representation of 371

subject, not seen before. As a result, we should use a subject- 430

1.0 1.0 wise CV protocol. Furthermore, we want to have a gender- 431

10 independent model, in order not to favor hyper-parameters that

7 are tailored to males or females. 433

1.0 1.0 we focus on gender-independent modelling, each group should

414 is not a panacea, since: “... for this procedure to be meaning-

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Fold 1 Fold 2 Fold 3 Fold 4

Table 2 summarises our results and compares the different 542

sity. Thus, a reasonable explanation for the good performance 584

Person 1 Supplementary material Section 2 summarises our results 587

Person 11 nificance is judged by using the non-parametric Mann-Whitney 602

consistently predicted to sound more happy than the PD.

Furthermore, Table 3 ranks all of the different speaking tasks 609

CD according to Cohen’s d, an effect size suitable for Mann-Whitney 610

U-test. Since in each speaking task we have multiple tests,

we report the average Cohen’s across the tests. Furthermore,

to check the statistical significance of each speaking task, we 613

3 shows, the strongest effect size is observed in the read dialog

567 Furthermore, using a paired-samples t-test, we can infer again

Prob of sounding happy

0.6 0.6 0.6

0.4 0.4 0.4

0.2 0.2 0.2

ity of sounding happy and the disease severity: patients with

tests, i.e. read dialog, sentences and sentences with emphasis,

637 anticorrelation between the probability of sounding happy and

effect size p-values

6.1. Findings on using our MoE for SER 669

Prob of sounding happy

Prob of sounding happy

Prob of sounding happy

Prob of sounding happy