Learning Image-based Representations for Heart Sound Classification

Zhao Ren, Nicholas Cummins, Vedhas Pandit
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
zhao.ren@informatik.uni-augsburg.de, nicholas.cummins@ieee.org, vedhas.pandit@informatik.uni-augsburg.de

Jing Han
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
jing.han@informatik.uni-augsburg.de

Kun Qian
Machine Intelligence and Signal Processing Group, Technische Universität München, Germany
andykun.qian@tum.de

Björn Schuller
GLAM – Group on Language, Audio & Music, Imperial College London, UK (also with the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany)
schuller@ieee.org

ABSTRACT
Machine learning based heart sound classification represents an efficient technology that can help reduce the burden of manual auscultation through the automatic detection of abnormal heart sounds. In this regard, we investigate the efficacy of using pre-trained Convolutional Neural Networks (CNNs) from large-scale image data for the classification of Phonocardiogram (PCG) signals by learning deep PCG representations. First, the PCG files are segmented into chunks of equal length. Then, we extract a scalogram image from each chunk using a wavelet transformation. Next, the scalogram images are fed into either a pre-trained CNN, or the same network fine-tuned on heart sound data. Deep representations are then extracted from a fully connected layer of each network and classification is achieved by a static classifier. Alternatively, the scalogram images are fed into an end-to-end CNN formed by adapting a pre-trained network via transfer learning. Key results indicate that our deep PCG representations extracted from a fine-tuned CNN perform the strongest, 56.2 % mean accuracy, on our heart sound classification task. When compared to a baseline accuracy of 46.9 %, gained using conventional audio processing features and a support vector machine, this is a significant relative improvement of 19.8 % (p < .001 by one-tailed z-test).

CCS CONCEPTS
• Computing methodologies → Supervised learning by classification; • Applied computing → Health care information systems; Health informatics;

KEYWORDS
Heart Sound Classification, Phonocardiogram, Convolutional Neural Networks, Scalogram, Transfer Learning.

ACM Reference Format:
Zhao Ren, Nicholas Cummins, Vedhas Pandit, Jing Han, Kun Qian, and Björn Schuller. 2018. Learning Image-based Representations for Heart Sound Classification. In Proceedings of Digital Health (DH'18). ACM, April 23-26, 2018, Lyon, France, 5 pages. https://doi.org/10.475/123_4

1 INTRODUCTION
Heart disease continues to be a leading worldwide health burden [16]. Phonocardiography is a method of recording, with a stethoscope placed over various locations in the chest cavity, the sounds and murmurs made by heart beats as well as the associated turbulent blood flow [11]. The Phonocardiogram (PCG), the product of phonocardiography, is widely employed in the diagnosis of heart disease. Enhancing conventional heart disease diagnostic methods with state-of-the-art automated classification techniques based on PCG recordings is a rapidly growing field of machine learning research [13]. In this regard, the recent PhysioNet/ Computing in Cardiology (CinC) Challenge 2016 [3] has encouraged the development of heart sound classification algorithms by collecting multiple PCG datasets from different groups to construct a large heart sound database containing more than 20 hours of recordings. The two-class classification of normal/ abnormal heart sounds was the core task of the PhysioNet/ CinC Challenge 2016.

In recent years, Convolutional Neural Networks (CNNs) have proven to be effective for a range of different signal and image classification tasks [7, 9]. In particular, large-scale CNNs have revolutionised visual recognition tasks, as evidenced by their performances in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [23]. On the back of the challenge, a large number of pre-trained CNNs have been made publicly available, such as AlexNet [10] and VGG [30]. Similarly, CNNs have also been successfully used for the detection of abnormal heart sounds [14]. Herein, we utilise an Image Classification CNN (ImageNet) to process scalogram images of PCG recordings for abnormal heart sound detection. Scalogram images are constructed using wavelet transformations [22].
Wavelets are arguably the predominant feature representation used for heart sound classification [8], and have successfully been applied in other acoustic classification tasks [19–21]. Moreover, instead of training CNNs from scratch, which can be a time-consuming task due in part to the large hyperparameter space associated with CNNs, we explore the benefits of using the aforementioned pre-trained ImageNet to construct robust heart sound classification models. Such an approach has been employed in other acoustic classification paradigms [1, 4], but to the best of the authors' knowledge it has not been verified for PCG based heart sound classification. Further, we also explore whether transfer-learning based adaptation and updating of the ImageNet parameters can further improve the accuracy of classification.

The remainder of this paper is structured as follows: first, we describe our proposed approach in Section 2; the database description, experimental set-up, evaluation method and results are then presented in Section 3; finally, our conclusions and future work plans are given in Section 4.

2 METHODOLOGY
In this section, we describe the classification paradigms we test for abnormal heartbeat detection. These consist of: (i) a conventional audio-based baseline system; (ii) two deep PCG representation systems combined with a Support Vector Machine (SVM) classifier; and (iii) two end-to-end CNN based systems.

2.1 Baseline Classification System
As a baseline, we use a system based on the INTERSPEECH COMputational PARalinguistics challengE (ComParE) audio feature set [29] and SVM classification. The combination of ComParE features and an SVM has been used in a range of similar acoustic classification tasks, such as snore sound classification [28]. The ComParE feature set is a 6373-dimensional representation of an audio instance and is extracted using the openSMILE toolkit [5]; full details of the audio features contained in ComParE can be found in [5].

2.2 Scalogram Representation
In this study, to transform the PCG samples into images which can be processed by an ImageNet, scalogram images are generated using the Morse wavelet transformation [17] with a 2 kHz sampling frequency. We have previously used these scalogram images successfully for acoustic scene classification [21]. When creating the images, we represent frequency, in kHz, on the vertical axis, and time, in s, on the horizontal axis. We use the viridis colour map, which varies from blue (low range) to green (mid range) to red (upper range), to colour the wavelet coefficient values. Further, the axes and margin markings are removed to ensure only the necessary information is fed into the ImageNet. Finally, the scalogram images are scaled to 224 × 224 for compatibility with the VGG16 ImageNet [30], which is introduced in Section 2.3.

The scalogram images of a normal and an abnormal heartbeat are given in Figure 1. Even by eye, some clear distinctions between the two classes can be observed in these (exemplar) images.

[Figure 1: Scalogram images extracted from the first 4 s segments of (a) a normal heart sound (a0007.wav) and (b) an abnormal heart sound (a0001.wav), using the viridis colour map.]
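As an illustration, the chunking and scalogram-image step could be sketched as follows. This is a minimal example, not the authors' implementation: the paper generates its scalograms with the MATLAB 2017 wavelet toolbox and Morse wavelets (cf. Sections 2.2 and 3.2), whereas this sketch uses PyWavelets' complex Morlet CWT as a stand-in; the function name, output paths and the naive decimation step are illustrative assumptions.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt
from scipy.io import wavfile

def pcg_to_scalograms(wav_path, chunk_s=4, fs_target=2000, out_size=224):
    """Segment a PCG recording into non-overlapping 4 s chunks and save one
    scalogram image per chunk (sketch; assumptions noted in the comments)."""
    fs, signal = wavfile.read(wav_path)
    signal = signal.astype(np.float32)
    # The paper uses a 2 kHz sampling frequency for the wavelet transform;
    # naive decimation stands in here for a proper resampling step.
    if fs != fs_target:
        signal = signal[:: fs // fs_target]
        fs = fs_target

    chunk_len = chunk_s * fs
    images = []
    for start in range(0, len(signal) - chunk_len + 1, chunk_len):
        chunk = signal[start:start + chunk_len]
        # Continuous wavelet transform; 'cmor1.5-1.0' (complex Morlet) is an
        # assumption -- the paper uses generalised Morse wavelets [17].
        scales = np.arange(1, 128)
        coeffs, _ = pywt.cwt(chunk, scales, 'cmor1.5-1.0', sampling_period=1.0 / fs)

        # Render |coefficients| with the viridis colour map, no axes or margins,
        # and save at 224 x 224 pixels for compatibility with VGG16.
        fig = plt.figure(figsize=(1, 1), dpi=out_size)
        ax = fig.add_axes([0, 0, 1, 1])
        ax.axis('off')
        ax.imshow(np.abs(coeffs), aspect='auto', cmap='viridis', origin='lower')
        out_path = f'{wav_path}_{start // chunk_len}.png'
        fig.savefig(out_path, dpi=out_size)
        plt.close(fig)
        images.append(out_path)
    return images
```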
2.3 Convolutional Neural Networks
We use an ImageNet to process the scalogram images for heart sound classification. The VGG16 ImageNet is chosen due to its successful application in the ILSVRC Challenge (http://www.image-net.org/challenges/LSVRC/). VGG16 is constructed from 13 ([2, 2, 3, 3, 3]) convolutional layers, five max-pooling layers, three fully connected layers {fc6, fc7, fc8} and a soft-max layer for the 1000 labels of the image classification task in the ImageNet Challenge. A receptive field size of 3 × 3 is used in all of the convolutional layers. The full details of VGG16 are described in [30]. The structure and parameters of VGG16 are obtained from PyTorch (http://pytorch.org/). Further, we use VGG16 for either feature extraction or classification by transfer learning, both of which are described in the following sub-sections.

2.4 Deep PCG Feature Representations
ImageNets have gathered considerable research interest as feature extractors for a task of interest, e. g., [1]. In this regard, this sub-section presents two methodologies for unsupervised PCG feature extraction using VGG16.

2.4.1 PCG Feature Extraction from ImageNet. The activations of the first fully connected layer fc6 of VGG16 are employed as our feature representations. These features have previously proven to be effective in the task of acoustic scene classification [21]. Essentially, we feed the scalogram images into VGG16 and the deep PCG feature representations of 4096 attributes are extracted as the activations of all neurons in the first fully connected layer fc6.
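A minimal sketch of this extraction step with torchvision's pre-trained VGG16 is given below. In torchvision, the three fully connected layers live in `model.classifier` at indices 0, 3 and 6, so fc6 corresponds to `classifier[0]` (4096 outputs). The preprocessing constants are the usual ImageNet normalisation, which the paper does not spell out, so they are an assumption here.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained VGG16 from torchvision (the paper obtains the network from PyTorch).
# Newer torchvision versions use the `weights=` argument instead of `pretrained=`.
vgg16 = models.vgg16(pretrained=True).eval()

# Standard ImageNet preprocessing -- an assumption, not stated in the paper.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc6_features(image_path):
    """Return the 4096-d activations of the first fully connected layer (fc6)."""
    x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        x = vgg16.features(x)        # convolutional + max-pooling stack
        x = vgg16.avgpool(x)         # adaptive pooling, present in recent torchvision
        x = torch.flatten(x, 1)
        x = vgg16.classifier[0](x)   # fc6: Linear(25088 -> 4096)
    return x.squeeze(0).numpy()      # deep PCG representation, 4096 attributes
```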
2.4.2 PCG Feature Extraction from Adapted ImageNet. As VGG16 is normally employed for image classification tasks on data very different from that required for heart sound classification, the feature extraction method described in the previous sub-section may yield a sub-optimal feature representation. We therefore also employ a transfer learning methodology (see Section 2.5.2) to adapt the parameters of VGG16 to better suit the task of abnormal heart sound detection. After the adaptation according to Section 2.5.2, the scalogram images are fed into the updated CNN model and a new set of deep representations (also with 4096 attributes) is extracted from the first fully connected layer fc6.
2.4.3 Classification Methodology. We perform classification of the heart sound samples into one of two classes: normal and abnormal. For the deep PCG feature representations, this is achieved via a linear SVM; the robustness of the SVM for such a classification task is well known in the literature [6]. Herein, our two deep feature representations are denoted as pre-trained VGG+SVM for the set-up described in Section 2.4.1 and learnt VGG+SVM for the set-up described in Section 2.4.2.

2.5 End-to-end ImageNet based Classification
With the aim of constructing a robust end-to-end heart sound CNN classifier, we adapt the parameters of VGG16 to the heart sound data by transfer learning. To achieve this, we use two different approaches, both of which are described below.

2.5.1 Learning Classifier of ImageNet. Noting that there are three fully connected layers in VGG16, we create our ImageNet classifier, herein denoted as learning Classifier of VGG16, by freezing the parameters of the convolutional layers and fc6, and updating (using scalogram images of heart sound data) the parameters of the final two fully connected layers and the soft-max layer for classification.

2.5.2 Learning ImageNet. In this method, herein denoted as learning VGG, we replace the last fully connected layer with a new one which has 2 neurons and a soft-max layer in order to achieve the 2-class classification task. We then update the entire network (again, using scalogram images of heart sound data) so that all VGG16 parameters are adapted to the heart sound data. This method represents a faster way to achieve a full CNN based classification than training an entire CNN from scratch with random initialisation of its parameters.
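A minimal PyTorch sketch of the two adaptation modes of Sections 2.5.1 and 2.5.2 is given below, using torchvision's VGG16 and the training settings reported later in Section 3.2 (SGD, learning rate 0.001, batch size 64, cross entropy, 50 epochs). Replacing the 1000-way output layer by a 2-neuron layer is stated explicitly only for the 'learning VGG' variant, so applying it to both modes here is an assumption, and the data-loader name is illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_adapted_vgg16(freeze_up_to_fc6: bool):
    """freeze_up_to_fc6=True  -> 'learning Classifier of VGG' (Section 2.5.1)
       freeze_up_to_fc6=False -> 'learning VGG' (Section 2.5.2, whole network updated)."""
    model = models.vgg16(pretrained=True)
    # Replace the last fully connected layer with a 2-neuron layer for
    # normal/abnormal classification (the soft-max is implicit in the loss below).
    model.classifier[6] = nn.Linear(4096, 2)
    if freeze_up_to_fc6:
        for p in model.features.parameters():       # convolutional layers
            p.requires_grad = False
        for p in model.classifier[0].parameters():  # fc6
            p.requires_grad = False
    return model

def train(model, train_loader, epochs=50, lr=0.001):
    """Training settings as reported in Section 3.2 of the paper."""
    criterion = nn.CrossEntropyLoss()               # cross entropy over soft-max scores
    params = [p for p in model.parameters() if p.requires_grad]
    optimiser = torch.optim.SGD(params, lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:         # batches of scalogram images (size 64)
            optimiser.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimiser.step()
    return model
```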
2.6 Late-fusion Strategy
As the PCG recordings in the PhysioNet/ CinC Challenge are of varying lengths (cf. Section 3.1), we segment the recordings into non-overlapping chunks of 4 seconds. We therefore employ a late-fusion strategy to produce a single label (normal/ abnormal) per recording. Our strategy is based on the probabilities of the predictions, p_i, i = 1, ..., n, of each i-th segment of a PCG sample, as output by the SVM or the soft-max layer; we choose the label of a PCG sample according to the chunk with the highest probability, max{p_i}.
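This fusion rule can be written compactly as below; the array shapes, label convention (column 0 = normal, column 1 = abnormal) and function name are illustrative assumptions.

```python
import numpy as np

def fuse_recording_label(chunk_probs):
    """chunk_probs: array of shape (n_chunks, 2) with per-chunk probabilities
    for [normal, abnormal], as output by the SVM or the soft-max layer.
    Returns 0 (normal) or 1 (abnormal) for the whole PCG recording."""
    chunk_probs = np.asarray(chunk_probs)
    # Locate the single most confident chunk prediction max{p_i} ...
    best_chunk = np.unravel_index(np.argmax(chunk_probs), chunk_probs.shape)
    # ... and take its class label for the recording.
    return best_chunk[1]

# Example: three 4 s chunks of one recording -> label 1 (abnormal).
print(fuse_recording_label([[0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]))
```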
3 EXPERIMENTS

3.1 Database
Our proposed approaches are evaluated on the database of the PhysioNet/ CinC Challenge 2016 [12]. This dataset is focused on the classification of normal and abnormal heart sound recordings. As the test set labels for this data are not publicly available, we use the training set of the database and split it into new training/ development/ test sets. In total, there are 3240 heart sound recordings collected from 947 pathological patients and healthy individuals. The dataset consists of six sub-databases from different research groups:

(1) MIT heart sounds database: The Massachusetts Institute of Technology heart sounds database (MIT) [31, 32] comprises 409 PCG and ECG recordings sampled at 44.1 kHz with 16 bit quantisation from 121 subjects, in which there are 117 recordings from 38 healthy adults and 134 recordings from 37 patients. The recording duration varies from 9 s to 37 s with a 32 s average length.

(2) AAD heart sounds database: The Aalborg University heart sounds database (AAD) [25–27] is recorded at a 4 kHz sample rate with 16 bit quantisation. It contains 544 recordings from 121 healthy adults and 151 recordings from 30 patients. The recording length varies from 5 s to 8 s with an 8 s average length.

(3) AUTH heart sounds database: The Aristotle University of Thessaloniki heart sounds database (AUTH) [18] includes 45 recordings in total from 11 healthy adults and 34 patients. Each healthy adult/ patient gives one recording and the recording length varies from 10 s to 122 s with a 49 s average length. The sampling rate is 4 kHz with 16 bit quantisation.

(4) UHA heart sounds database: The University of Haute Alsace heart sounds database (UHA) [15] is sampled at 8 kHz with 16 bit quantisation. It contains 39 recordings from 25 healthy adults and 40 recordings from 30 patients. The recording length varies from 6 s to 49 s with a 15 s average length.

(5) DLUT heart sounds database: The Dalian University of Technology heart sounds database (DLUT) [33] includes 338 recordings from 174 healthy adults and 335 recordings from 335 patients. The recording length varies from 8 s to 101 s with a 23 s average length. The sampling rate is 8 kHz with 16 bit quantisation.

(6) SUA heart sounds database: The Shiraz University adult heart sounds database (SUA) [24] is constructed from 81 recordings from 79 healthy adults and 33 recordings from 33 patients. Except for three recordings sampled at 44.1 kHz and one at 384 kHz, the sampling rate is 8 kHz with 16 bit quantisation. The recording duration varies from 30 s to 60 s with a 33 s average length.

A detailed overview of the database is given in Table 1. In this work, we split the dataset into a training set (comprising MIT, AUTH, UHA, and DLUT) and a test set (comprising AAD and SUA). Further, we carry out a three-fold cross validation by excluding the database MIT, AUTH, or UHA for fold 1, fold 2, or fold 3, respectively, for validation (cf. Table 2), noting that, due to its large size, DLUT is always used for system training.

3.2 Setup
We generate the scalogram images using the MATLAB 2017 wavelet toolbox (https://de.mathworks.com/products/wavelet.html). During training/ adaptation of VGG16, both for the last two layers of VGG16 (cf. Section 2.5.1) and for the entire network (cf. Section 2.5.2), the learning rate is 0.001, the batch size is 64, and the number of epochs is set to 50. Cross entropy is applied as the loss function and stochastic gradient descent is used as the optimiser. The deep representations (cf. Section 2.4), with a dimensionality of 4096, are extracted from fc6 of VGG16.
Table 1: An overview of the training and test partitions used in this work. The training set is composed of four sub-sets from four different databases of the PhysioNet/ CinC dataset, and the test set of two. The PCG recordings in this dataset are annotated with two-class labels (normal/ abnormal).

Dataset    Database   Recordings   Normal   Abnormal   Total (s)   Min (s)   Max (s)   Average (s)
Training   MIT        409          117      292        13328.08    9.27      36.50     32.59
           AUTH       31           7        24         1532.49     9.65      122.00    49.44
           UHA        55           27       28         833.14      6.61      48.54     15.15
           DLUT       2141         1958     183        49397.15    8.06      101.67    23.07
           Total      2636         2109     527        65090.86
Test       AAD        490          386      104        3910.20     5.31      8.00      7.98
           SUA        114          80       34         3775.45     29.38     59.62     33.12
           Total      604          466      138        7685.65
Table 2: Performance comparison of the proposed approaches with the baseline. The methods are evaluated on the 3-fold development set and the test set. The experimental results are evaluated by Sensitivity (Se), Specificity (Sp), and Mean Accuracy (MAcc); all values in %.

                              Development set                                                                    Test set
                              fold 1               fold 2               fold 3               mean
Method                        Se    Sp     MAcc    Se    Sp     MAcc    Se    Sp     MAcc    Se    Sp    MAcc    Se    Sp    MAcc
ComParE+SVM (baseline)        23.6  93.2   58.4    58.3  100.0  79.2    0.0   100.0  50.0    27.3  97.7  62.5    76.8  17.0  46.9
pre-trained VGG+SVM           57.2  70.9   64.1    41.7  85.7   63.7    17.9  81.5   49.7    38.9  79.4  59.1    24.6  87.1  55.9
learnt VGG+SVM                58.6  57.3   57.9    83.3  57.1   70.2    32.1  70.4   51.3    58.0  61.6  59.8    24.6  87.8  56.2
learning Classifier of VGG    68.2  51.3   59.7    79.2  14.3   46.7    35.7  40.7   38.2    61.0  35.4  48.2    33.3  63.7  48.5
learning VGG                  83.6  40.2   61.9    95.8  28.6   62.2    53.6  44.4   49.0    77.7  37.7  57.7    12.3  95.7  54.0
When classifying by SVM, we use the LIBSVM library [2] with a linear kernel and optimise the SVM complexity parameter C ∈ [10^-5; 10^+1] on the development partition. We present the best results from this optimisation as the final result.
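A sketch of this step is shown below, using scikit-learn's LIBSVM-backed SVC with a linear kernel as a stand-in for calling LIBSVM directly. The grid follows the C ∈ [10^-5; 10^+1] range above; selecting the best C by mean accuracy on the development partition, and the variable names, are assumptions not specified in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def tune_and_predict(X_train, y_train, X_dev, y_dev, X_test):
    """Optimise the SVM complexity parameter C on the development partition and
    return per-sample probabilities of the best model on the test partition."""
    best_c, best_macc, best_model = None, -1.0, None
    for c in 10.0 ** np.arange(-5, 2):                     # C in [1e-5, 1e+1]
        svm = SVC(C=c, kernel='linear', probability=True).fit(X_train, y_train)
        dev_pred = svm.predict(X_dev)
        # Mean accuracy = (sensitivity + specificity) / 2, cf. Section 3.3.
        se = recall_score(y_dev, dev_pred, pos_label=1)    # 1 = abnormal (assumed)
        sp = recall_score(y_dev, dev_pred, pos_label=0)    # 0 = normal (assumed)
        macc = (se + sp) / 2
        if macc > best_macc:
            best_c, best_macc, best_model = c, macc, svm
    # Probabilities per chunk feed the late-fusion strategy of Section 2.6.
    return best_model.predict_proba(X_test), best_c
```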
3.3 Evaluation Method
According to the official scoring mechanism of the PhysioNet/ CinC Challenge 2016 [12], our predictions are evaluated by both Sensitivity (Se) and Specificity (Sp). For two-class classification, Se and Sp are defined as:

Se = TP / (TP + FN),    (1)

Sp = TN / (TN + FP),    (2)

where TP denotes the number of true positive abnormal samples, FN denotes the number of false negative abnormal samples, TN denotes the number of true negative normal samples, and FP denotes the number of false positive normal samples.

Finally, the Mean Accuracy (MAcc) is given as the overall score of the predictions, which is defined as:

MAcc = (Se + Sp) / 2.    (3)
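Equations (1)–(3) translate directly into a small scoring helper; the label convention (1 = abnormal positive class, 0 = normal) is assumed here.

```python
import numpy as np

def physionet_score(y_true, y_pred):
    """Sensitivity, specificity and mean accuracy as in Equations (1)-(3).
    Convention assumed: 1 = abnormal (positive), 0 = normal (negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positive abnormal samples
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negative abnormal samples
    tn = np.sum((y_true == 0) & (y_pred == 0))   # true negative normal samples
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positive normal samples
    se = tp / (tp + fn)                          # Eq. (1)
    sp = tn / (tn + fp)                          # Eq. (2)
    macc = (se + sp) / 2                         # Eq. (3)
    return se, sp, macc
```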
3.4 Results
The experimental results of the baseline and proposed methods are shown in Table 2. All CNN-based approaches achieve improvements in MAcc over the baseline on the test set. Although this consistency is not seen on the development set, the MAccs achieved on the test set indicate that the deep representation features extracted from scalogram images perform more strongly, and are more robust, than conventional audio features for heart sound classification.

When comparing the methods 'learning Classifier of VGG' and 'learning VGG', it is clear from the results that adapting the entire CNN is considerably more effective than only updating the last two fully connected layers. Moreover, a general trend can be observed of the SVM classification of features extracted from either the pre-trained or the learnt VGG topologies outperforming the end-to-end CNN classifiers. This could be due in part to the SVM classifiers being better suited than the soft-max classifiers to the relatively small amount of training data available in the PhysioNet/ CinC dataset.

Finally, the strongest performance, 56.2 % MAcc, is obtained on the test set using the method 'learnt VGG+SVM'. This MAcc offers a significant relative improvement of 19.8 % (p < .001 by one-tailed z-test) over our baseline classifier, ComParE features and an SVM. Therefore, our learnt CNN model is shown to extract more salient deep representation features for abnormal heart sound detection when compared with features gained from the pre-trained VGG16 model.
4 CONCLUSIONS
We proposed to apply and adapt pre-trained Image Classification Convolutional Neural Networks (ImageNet) on scalogram images of Phonocardiograms (PCG) for the task of normal/ abnormal heart sound classification. Deep PCG representations extracted from a task-adapted version of the popular ImageNet VGG16 were shown to be more robust for this task than the widely used ComParE audio feature set. The combination of learnt VGG features and an SVM significantly (p < .001 by one-tailed z-test) outperformed the ComParE based baseline system. We speculate this success is due to the autonomous nature of the feature extraction associated with the 'learnt VGG' topology; the representations are adapted to the dataset and are therefore more robust than a 'fixed' conventional feature set.

In future work, data augmentation will be investigated for heart sound classification to compensate for the unbalanced nature of the dataset. Further, a new ImageNet topology based on the scalogram images will be developed and validated on a variety of heart sound datasets, e. g., AudioSet (https://research.google.com/audioset/dataset/heart_sounds_heartbeat.html), to build a robust ImageNet for heart sound classification.

ACKNOWLEDGMENTS
This work was partially supported by the German national BMBF IKT2020-Grant under grant agreement No. 16SV7213 (EmotAsS), the EU Horizon 2020 Innovation Action Automatic Sentiment Estimation in the Wild under grant agreement No. 645094 (SEWA), and the EU H2020 / EFPIA Innovative Medicines Initiative under grant agreement No. 115902 (RADAR-CNS).

REFERENCES
[1] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, Alice Baird, and Björn Schuller. 2017. Snore sound classification using image-based deep spectrum features. In Proc. INTERSPEECH. Stockholm, Sweden, 3512–3516.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (Apr. 2011), 1–27.
[3] Gari D. Clifford, Chengyu Liu, Benjamin Moody, David Springer, Ikaro Silva, Qiao Li, and Roger G. Mark. 2016. Classification of normal/abnormal heart sound recordings: The PhysioNet/Computing in Cardiology Challenge 2016. In Proc. Computing in Cardiology Conference (CinC). Vancouver, Canada, 609–612.
[4] Jun Deng, Nicholas Cummins, Jing Han, Xinzhou Xu, Zhao Ren, Vedhas Pandit, Zixing Zhang, and Björn Schuller. 2016. The University of Passau open emotion recognition system for the multimodal emotion challenge. In Proc. CCPR. Chengdu, China, 652–666.
[5] Florian Eyben, Felix Weninger, Florian Groß, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proc. ACM Multimedia. Barcelona, Spain, 835–838.
[6] Steve R. Gunn. 1998. Support vector machines for classification and regression. ISIS Technical Report 14, 1 (May 1998), 5–16.
[7] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proc. CVPR. Columbus, OH, 1725–1732.
[8] Edmund Kay and Anurag Agarwal. 2016. DropConnected neural network trained with diverse features for classifying heart sounds. In Proc. Computing in Cardiology Conference (CinC). Vancouver, Canada, 617–620.
[9] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proc. NIPS. Lake Tahoe, NV, 1097–1105.
[11] Aubrey Leatham. 1952. Phonocardiography. British Medical Bulletin 8, 4 (1952), 333–342.
[12] Chengyu Liu et al. 2016. An open access database for the evaluation of heart sound algorithms. Physiological Measurement 37, 12 (Nov. 2016), 2181–2213.
[13] Ilias Maglogiannis, Euripidis Loukis, Elias Zafiropoulos, and Antonis Stasis. 2009. Support vectors machine-based identification of heart valve diseases using heart sounds. Computer Methods and Programs in Biomedicine 95, 1 (July 2009), 47–61.
[14] Vykintas Maknickas and Algirdas Maknickas. 2017. Recognition of normal–abnormal phonocardiographic signals using deep convolutional neural networks and mel-frequency spectral coefficients. Physiological Measurement 38, 8 (July 2017), 1671–1679.
[15] Ali Moukadem, Alain Dieterlen, Nicolas Hueber, and Christian Brandt. 2013. A robust heart sounds segmentation module based on S-transform. Biomedical Signal Processing and Control 8, 3 (May 2013), 273–281.
[16] Dariush Mozaffarian et al. 2016. Heart disease and stroke statistics—2016 update: A report from the American Heart Association. Circulation 133, 4 (Jan. 2016), e38–e360.
[17] Sofia C. Olhede and Andrew T. Walden. 2002. Generalized Morse wavelets. IEEE Transactions on Signal Processing 50, 11 (Nov. 2002), 2661–2670.
[18] Chrysa D. Papadaniil and Leontios J. Hadjileontiadis. 2014. Efficient heart sound segmentation and extraction using ensemble empirical mode decomposition and kurtosis features. IEEE Journal of Biomedical and Health Informatics 18, 4 (July 2014), 1138–1152.
[19] Kun Qian, Christoph Janott, Vedhas Pandit, Zixing Zhang, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Werner Hemmert, and Björn Schuller. 2017. Classification of the excitation location of snore sounds in the upper airway by acoustic multifeature analysis. IEEE Transactions on Biomedical Engineering 64, 8 (Aug. 2017), 1731–1741.
[20] Kun Qian, Christoph Janott, Zixing Zhang, Clemens Heiser, and Björn Schuller. 2016. Wavelet features for classification of vote snore sounds. In Proc. ICASSP. Shanghai, China, 221–225.
[21] Zhao Ren, Vedhas Pandit, Kun Qian, Zijiang Yang, Zixing Zhang, and Björn Schuller. 2017. Deep sequential image features on acoustic scene classification. In Proc. DCASE Workshop. Munich, Germany, 113–117.
[22] Olivier Rioul and Martin Vetterli. 1991. Wavelets and signal processing. IEEE Signal Processing Magazine 8, 4 (Oct. 1991), 14–38.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (Dec. 2015), 211–252.
[24] Maryam Samieinasab and Reza Sameni. 2015. Fetal phonocardiogram extraction using single channel blind source separation. In Proc. ICEE. Tehran, Iran, 78–83.
[25] Samuel E. Schmidt, Claus Holst-Hansen, Claus Graff, Egon Toft, and Johannes J. Struijk. 2010. Segmentation of heart sound recordings by a duration-dependent hidden Markov model. Physiological Measurement 31, 4 (Mar. 2010), 513–529.
[26] Samuel E. Schmidt, Claus Holst-Hansen, John Hansen, Egon Toft, and Johannes J. Struijk. 2015. Acoustic features for the identification of coronary artery disease. IEEE Transactions on Biomedical Engineering 62, 11 (Nov. 2015), 2611–2619.
[27] Samuel E. Schmidt, Egon Toft, Claus Holst-Hansen, and Johannes J. Struijk. 2010. Noise and the detection of coronary artery disease with an electronic stethoscope. In Proc. CIBEC. Cairo, Egypt, 53–56.
[28] Björn Schuller, Stefan Steidl, Anton Batliner, Elika Bergelson, Jarek Krajewski, Christoph Janott, Andrei Amatuni, Marisa Casillas, Amanda Seidl, Melanie Soderstrom, Anne S. Warlaumont, Guillermo Hidalgo, Sebastian Schnieder, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Maximilian Schmitt, Kun Qian, Yue Zhang, George Trigeorgis, Panagiotis Tzirakis, and Stefanos Zafeiriou. 2017. The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Proc. INTERSPEECH. Stockholm, Sweden, 3442–3446.
[29] Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, Marcello Mortillaro, Hugues Salamin, Anna Polychroniou, Fabio Valente, and Samuel Kim. 2013. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proc. INTERSPEECH. Lyon, France, 148–152.
[30] Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR. San Diego, CA.
[31] Zeeshan Syed, Daniel Leeds, Dorothy Curtis, Francesca Nesta, Robert A. Levine, and John Guttag. 2007. A framework for the analysis of acoustical cardiac signals. IEEE Transactions on Biomedical Engineering 54, 4 (Apr. 2007), 651–662.
[32] Zeeshan Hassan Syed. 2003. MIT automated auscultation system. Ph.D. Dissertation. Massachusetts Institute of Technology.
[33] Hong Tang, Ting Li, Tianshuang Qiu, and Yongwan Park. 2012. Segmentation of heart sounds based on dynamic clustering. Biomedical Signal Processing and Control 7, 5 (Sep. 2012), 509–516.