
Pathological Speech Processing: Review

Máté HIREŠ (1st year)
Supervisor: Peter DROTÁR
Department of Computers and Informatics, FEI TU of Košice, Slovak Republic
mate.hires@tuke.sk, peter.drotar@tuke.sk

Abstract—This paper describes an investigation of voice pathology detection. It is a preliminary study in the field of automatic detection of voice disorders. We compare some of the available voice databases, which contain both healthy and pathological voice recordings. We also compare methods that use neural networks for detection and have achieved significant results. Finally, we discuss the presented methods and their results in terms of their accuracy.

Keywords—voice pathology, speech disorders, machine learning, signal processing, deep learning.
I. INTRODUCTION
The human voice can have a radical effect on the health condition of an individual. It can be severely affected by neoplasms, vocal palsy and other phonotraumatic diseases. Automatic detection of these pathological voice disorders is a very challenging and significant medical classification problem. Voice pathology covers the evaluation and treatment of diseases related to voice production, which affect fluency, intonation, phonation and breathing. Since computer-aided analysis and diagnosis of the human voice can be an effective and low-cost tool for patients around the world, voice pathology has attracted particular interest amongst machine learning and signal processing scientists. The current clinical practice for detecting voice disorders includes a special procedure called laryngeal endoscopy [1], [2]. This is a very complicated and expensive procedure, and it also requires an expert to perform the evaluation. To detect voice disorders without intrusion, we can analyze the voice directly.

Many scientists have worked on the automatic detection of voice disorders, especially in the last few years. Van der Merwe [3] provides a foundation and accentuates the necessity of research related to voice disorder detection. Kent [4] discusses the connection between voice production and its dysfunctions. Some studies found specific speech dysfunctions within particular population groups [5], [6].

Thanks to the amount of available data and the advancement in computational power, many suitable Deep Learning models are available for speech processing. Hence, we can use complex model architectures. We can train convolutional layers [7] to detect various features that could help us differentiate pathological and healthy voices. Some Interspeech challenges [8], [9] have also attracted interest in the application of Machine Learning and signal processing techniques for voice pathology detection.

Many available datasets contain recordings made in several different environments, which makes it hard to find common features in the samples. The Saarbruecken Voice Database (SVD) [10] could be an ideal set of data to start the research in this area, since all of its recordings are sampled at 50 kHz with 16-bit resolution.

II. RELATED WORK

Deep learning models have been popular since the introduction of the deep Boltzmann machine in 2006 [11]. Since then, researchers using these models have achieved significant results in many fields, such as medical analysis [2], [12], [13], computer vision [14], cybersecurity [15], etc.

There are many research articles in the field of voice pathology detection in which the researchers used SVD for their work [16], [17], [18], [19], [20]. They extracted certain features from the voice recordings, including mel-frequency cepstral coefficients, energy, entropy, harmonics-to-noise ratio, normalized noise energy and multidimensional voice program parameters. Accordingly, multiple classifiers have been used. Most of the researchers used traditional methods like the Support Vector Machine (SVM) classifier, dimensionality reduction methods (DRM), glottal source excitation, Gaussian mixture models and k-means clustering. Deep Neural Network (DNN) methods were also significantly useful. Notable results were achieved, for example, with the Far Eastern Memorial Hospital (FEMH) dataset using transfer learning techniques [21], which transfer the knowledge of a model previously trained on a similar domain to a new, smaller dataset [22].

Voices are mostly affected by cancer, nodules, polyps, cysts, laryngitis, vocal tremors, spasmodic dysphonia, vocal fold paralysis and sulcus diseases [1]. The results of voice pathology detection depend on the data used. Dankovičová et al. differentiated healthy and dysfunctional voice samples with an accuracy of over 70% using traditional methods [23]. Kasuya et al. effectively identified glottic cancer, vocal cord polyps and other nerve paralysis diseases [24]. Fang et al. identified polyps, cysts, nodules, neoplasms and other diseases using deep learning models [1]; they achieved over 90% accuracy in female and over 94% accuracy in male subjects. Dibazar et al. combined pitch dynamics, mel-frequency cepstral coefficients and a Hidden Markov Model (HMM) classifier to identify several voice disorders; working with data from the Massachusetts Eye and Ear Infirmary (MEEI), they implemented a model that reached an accuracy of over 99% [25]. Souissi et al. achieved over 86% accuracy using 2000 recordings [17]. Al-nasheri et al. reached an accuracy of 99% [18]. Hemmerling et al. achieved the highest accuracy of 100%, detecting the disorders for women and men separately [19]. However, since these results were achieved on specific datasets, the high accuracy can be questionable.

III. DATASETS

A. The Saarbruecken Voice Database (SVD)
The SVD is a collection of voice recordings from more than 2000 patients [10]. One recording session contains the following recordings:
• recordings of the vowels [i, a, u] produced at normal, high and low pitch,
• recordings of the vowels [i, a, u] with rising-falling pitch,
• a recording of the sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?").
This is a total of 13 files per session. The signals for each case are stored in separate files. All the samples are recorded at 50 kHz with 16-bit resolution. The dataset contains 71 different pathologies, both organic and functional.

B. The Far Eastern Memorial Hospital database (FEMH)

This dataset includes 60 normal voice samples and 402 samples of common voice disorders, including vocal nodules, polyps and cysts, glottic neoplasm, vocal atrophy, laryngeal dystonia, unilateral vocal paralysis and sulcus vocalis. It contains 3-second long voice samples of patients sustaining the vowel sound /a/ [1]. These samples were recorded using a high-quality microphone with a digital amplifier under a background noise level between 40 and 45 dBA. The sampling rate was 44,100 Hz with 16-bit resolution.
C. The Massachusetts Eye and Ear Infirmary Voice Disorders Database (MEEI)

The MEEI Voice Disorders Database was released in 1994, recorded at the MEEI Voice and Speech Lab and partly at Kay Elemetrics Corp. [26]. It contains over 1400 recordings of:
• sustained phonation of the vowel /a/ (53 normal and 657 pathological files),
• continuous speech (53 normal and 661 pathological files).
All the samples have 16-bit resolution but, as a disadvantage, they were recorded in two different environments: the sampling frequency of the normal samples is 50 kHz, while that of the pathological samples is either 25 or 50 kHz.
D. The Arabic Voice Pathology Database (AVPD)

The AVPD dataset was developed at the Communication and Swallowing Disorders Unit of King Abdul Aziz University Hospital, Riyadh, Saudi Arabia [27]. The idea behind the creation of this dataset was to overcome the environmental problems of the MEEI dataset, which could lead to inaccurate results. The dataset contains:
• voice samples of successive speech tasks,
• sustained phonation of the vowels /a/, /e/ and /o/,
• counting from 0 to 10,
• a standardized Arabic passage,
• readings of three common words.
The sampling frequency of all collected samples in AVPD is 50 kHz.

E. New Spanish Speech corpus database

This database contains speech recordings of 50 people with Parkinson's disease and 50 healthy speakers, with 25 male and 25 female participants in each category. The participants are Colombians speaking native Spanish. The database is balanced in terms of age and gender as well; the average age of the participants is approximately 60 in each category and subcategory. The recordings were sampled at 44,100 Hz with 16-bit resolution.
The participants had to perform different tasks to analyze several aspects of their voice. The tasks can be grouped into the following aspects: (i) phonation, (ii) articulation, (iii) prosody. A detailed description of each task is given in [28].

IV. METHODOLOGY

A. DNN Architecture

The following Deep Neural Network (DNN) architecture was presented by Harar et al. [20]. They used two stacks of convolutional layers so their model can transform the input into a set of more abstract repeating patterns that could be important for decreasing the network cost. Between the stacks there is a max pooling layer that reduces the dimension of the input vector. Moreover, they wrapped all the convolution and pooling layers in a TimeDistributed layer to keep the time axis unchanged. The resulting matrices are then reshaped and connected to a recurrent Long Short-Term Memory (LSTM) layer, which is set up to remember changes in time. Finally, there is a stack of 3 fully connected layers, which ends with a 2-neuron softmax layer for the final classification: one neuron for the healthy class and one for the pathological class.
The detailed DNN architecture with its input and output parameters is shown in Fig. 1.

[Fig. 1. The detailed DNN architecture by Harar et al. [20]]
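To make the layer wiring concrete, the following is a minimal Keras sketch of an architecture in this spirit: two TimeDistributed convolution-pooling stacks, a reshape, an LSTM and a small dense stack ending in a 2-neuron softmax. The frame split, filter counts and unit sizes are illustrative assumptions, not the exact topology of Harar et al. [20].

```python
# A minimal sketch of a Harar-style raw-waveform classifier; the
# 100 x 64 frame split and all filter/unit counts are assumptions.
from tensorflow.keras import layers, models

frames, frame_len = 100, 64  # waveform split into fixed-size frames

model = models.Sequential([
    layers.Input(shape=(frames, frame_len, 1)),
    # Convolution and pooling wrapped in TimeDistributed so the
    # frame (time) axis stays untouched.
    layers.TimeDistributed(layers.Conv1D(16, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling1D(2)),
    layers.TimeDistributed(layers.Conv1D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling1D(2)),
    layers.TimeDistributed(layers.Flatten()),  # reshape for the LSTM
    layers.LSTM(64),                           # models changes in time
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(2, activation="softmax"),     # healthy vs. pathological
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```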
B. CNN Architecture

Convolutional Neural Networks (CNN) are variants of the standard neural network. Instead of fully connected hidden layers, they have a special architecture consisting of alternating convolution and pooling layers, most of them hidden. In the following text we discuss the CNN architecture used by Wu et al. in [29].
The weights are shared among all the units, where each unit is calculated as

h_m^k = \sigma\left( \sum_{l=1}^{I} \sum_{n=1}^{N_W} v_{l,n+m-1} \, w_{l,n}^k + w_0^k \right), \quad (1)

where v_{l,m} is the m-th unit of the l-th input layer V and h_m^k is the m-th unit of the k-th convolutional layer H. N_W is the size of the weights, and w_{l,n}^k is the n-th unit of the given weight. Feature extraction is performed automatically by the shared weights.
A pooling layer is fundamental for dimensionality reduction, decreasing the complexity of the computations. To build such a pooling layer, a maximization function is used. The pooling layer is defined as

p_m^k = \max_{n=1}^{G} h_{l,(m-1) \times s + n}, \quad (2)

where G is the size of the pooling window over which the maximization is applied, and s is the step by which the pooling window shifts along the convolutional layer.
They used 10 hidden layers. The first layer has the shape 8×3 with step 1; each following layer was convolved with 8 filters of shape 8×3×8 with step 1. They set 4×4 pooling windows and a ReLU activation function. Next, they flattened the feature map into a fully connected layer to train the classifier. To overcome over-fitting problems, L2 regularization is used.
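As a concrete reading of Eqs. (1) and (2), the following numpy sketch computes one convolutional layer with shared weights and a sigmoid activation, followed by max pooling with window G and step s. All shapes and values are illustrative assumptions, not the configuration of Wu et al. [29].

```python
# A numpy sketch of Eq. (1) and Eq. (2): a shared-weight convolutional
# unit followed by max pooling. Shapes are illustrative assumptions.
import numpy as np

def conv_unit(V, W, w0):
    """Eq. (1): h[k, m] = sigma(sum_l sum_n V[l, n+m-1] * W[k, l, n] + w0[k])."""
    I, M = V.shape            # input channels x input length
    K, _, NW = W.shape        # filters x input channels x kernel size
    out_len = M - NW + 1
    H = np.zeros((K, out_len))
    for k in range(K):
        for m in range(out_len):
            H[k, m] = np.sum(V[:, m:m + NW] * W[k]) + w0[k]
    return 1.0 / (1.0 + np.exp(-H))  # sigmoid activation

def max_pool(H, G, s):
    """Eq. (2): p[k, m] = max over a G-wide window shifted by step s."""
    K, M = H.shape
    out_len = (M - G) // s + 1
    return np.array([[H[k, m * s:m * s + G].max() for m in range(out_len)]
                     for k in range(K)])

V = np.random.randn(1, 32)        # one input channel, 32 samples
W = np.random.randn(8, 1, 3)      # 8 filters of width 3
H = conv_unit(V, W, np.zeros(8))
P = max_pool(H, G=4, s=4)
print(H.shape, P.shape)           # (8, 30) (8, 7)
```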

TABLE I
DESCRIPTION OF SPEECH FRAME FEATURES [13]

Feature Type                                                N. of Features
Prosodic features:
  Short-time energy, Average power, Average magnitude,                  12
  Zero crossings, Mean, Standard deviation, Median, Max,
  Min, Range, Dynamic range, Interquartile range
Vocal-tract features:
  MFCC (Mel Frequency Cepstrum Coefficients)                            39
  Teager Energy Operator                                                 1
Excitation features:
  Jitter                                                                 1
  Shimmer                                                                1
Total number of raw features per frame                                  54
First order time derivative of raw features                             54
Second order time derivative of raw features                            54
Total number of features per frame                                     162

C. Feature Extraction Procedure

The following model was created by Banerjee et al. for feature extraction [13]. First, a 25 ms segment size and a 10 ms shift size are used to extract the features for each patient. The model extracts three types of features for each segment:
• prosodic features,
• vocal-tract features,
• excitation features.
These three types produce 54 features per segment in total. Next, 162 features are formed by taking the first and second time derivatives for each segment. Finally, 15 consecutive feature vectors are concatenated to create a final vector of size 2430, which serves for diagnosis. A detailed description of the features is shown in Table I.
D. Deep Belief Network

Once the features are extracted, a deep belief network can be trained. As an example method we have chosen the transfer learning approach by Islam et al. [21]. They trained their Deep Belief Network (DBN) on the TIMIT dataset [30]. The classification is performed in two steps, pre-training and fine-tuning. In the first step, the DBN learns the input features in an unsupervised manner in its layers. A stack of layers is learned using the restricted Boltzmann machine algorithm. The states are observable only in the visible layer. The DBN learns the hidden features via the following energy function:

E(v, h) = -\sum_{i \in \mathrm{input}} b_i v_i - \sum_{j \in \mathrm{fea}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}, \quad (3)

where v_i and h_j represent the binary states of input i and feature j, respectively. Attributes b_i and b_j are the biases for input i and feature j, and w_{ij} is the weight between them.
In the second, fine-tuning step, the ground-truth labels are attached to the topmost layer with the pre-trained weights. The fine-tuning is a supervised procedure, thus a softmax layer is added on top of the learned weights so that they can be fine-tuned for the phoneme classes. The model uses the stochastic gradient descent algorithm for weight optimization. To determine the probability of the visible vector, the following function is used:

p(v) = \sum_{h \in H} p(v, h) = \frac{\sum_{h} \exp(-E(v, h))}{\sum_{u,g} \exp(-E(u, g))}, \quad (4)

where H is the set of all possible binary hidden vectors in the DBN.
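The following toy numpy sketch evaluates Eq. (3) and, by brute-force enumeration of all binary states, the normalized probability of Eq. (4). This is feasible only at toy size; real DBN training approximates these sums (e.g. with contrastive divergence). All sizes here are assumptions for illustration.

```python
# Toy evaluation of the RBM energy (Eq. 3) and visible probability
# (Eq. 4) by enumerating every binary state; sizes are illustrative.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
W = rng.normal(size=(n_v, n_h))          # weights w_ij
b_v, b_h = np.zeros(n_v), np.zeros(n_h)  # biases b_i, b_j

def energy(v, h):
    """Eq. (3): E(v,h) = -sum b_i v_i - sum b_j h_j - sum v_i h_j w_ij."""
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

def p_visible(v):
    """Eq. (4): p(v) = sum_h exp(-E(v,h)) / sum_{u,g} exp(-E(u,g))."""
    hid = [np.array(s) for s in itertools.product([0, 1], repeat=n_h)]
    vis = [np.array(s) for s in itertools.product([0, 1], repeat=n_v)]
    num = sum(np.exp(-energy(v, h)) for h in hid)
    Z = sum(np.exp(-energy(u, g)) for u in vis for g in hid)
    return num / Z

v = rng.integers(0, 2, n_v)
print(v, p_visible(v))
```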
E. Transfer Learning

Transfer learning has proved to be notably useful among deep learning methods. This technique is useful especially when there is limited training data. It consists of transferring the knowledge of a model previously trained on a similar domain to a new, smaller dataset [22]. Once the model is trained, the transfer learning approach can be applied to different layers of the trained model. The following transfer learning approach was modelled by Islam et al. [21] following the DBN modelling (see the previous subsection), and it contains the following steps (a code sketch follows the list):
1) Training a DBN model using all the samples from the TIMIT dataset for 39 phoneme classes.
2) Feeding the FEMH training dataset to the trained DBN model to find its representation in a given layer.
3) Training an SVM classifier with a linear kernel and grid optimization on the new representation of the FEMH data to find the optimal parameters.
4) Using the trained SVM model to predict the unlabelled testing data in the FEMH dataset.
5) Repeating the whole procedure for all the layers of the DBN model.
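A hedged scikit-learn sketch of steps 2) to 5) follows. The DBN forward pass is replaced by a random-projection placeholder (extract_layer is not a real API, and the FEMH arrays are dummy data), since the point is only the layer-wise SVM-on-representation loop.

```python
# Sketch of steps 2-5: grid-searched linear SVMs fit on per-layer
# representations; extract_layer and the data are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def extract_layer(X, layer):  # placeholder for the DBN forward pass
    rng = np.random.default_rng(layer)
    return np.tanh(X @ rng.normal(size=(X.shape[1], 64)))

X_train = np.random.randn(200, 162)          # dummy FEMH features
y_train = np.random.randint(0, 2, 200)       # dummy labels
X_test = np.random.randn(50, 162)

for layer in range(3):                       # step 5: every DBN layer
    Z_train = extract_layer(X_train, layer)  # step 2: representation
    grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=3)
    grid.fit(Z_train, y_train)               # step 3: grid-optimized SVM
    preds = grid.predict(extract_layer(X_test, layer))  # step 4: predict
    print(layer, grid.best_params_, preds[:5])
```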
V. DISCUSSION

In this paper we have discussed the process of voice pathology detection. We showed some notable results achieved using machine learning techniques, and we also reviewed some auxiliary methods used in machine learning algorithms. Since deep learning and transfer learning approaches tend to be more precise than traditional voice recognition methods, further investigation of voice pathology detection using deep learning techniques is warranted to achieve more accurate results. Convolutional neural networks turned out to be very effective in extracting important features from the spectrograms of voice recordings, which helps in diagnosis. A deep belief network is useful for making the system more robust by initializing the weights. In future work we will further analyze and experiment with optimizing and upgrading the existing algorithms, or combining the used methods, so that we can achieve better performance. Transfer learning and CNN-based feature extraction could potentially provide an additional approach to the automatic voice pathology detection problem.
ACKNOWLEDGMENT

This work was supported by the Slovak Research and Development Agency under contract No. APVV-16-0211.
REFERENCES

[1] S.-H. Fang, Y. Tsao, M.-J. Hsiao, J.-Y. Chen, Y.-H. Lai, F.-C. Lin, and C.-T. Wang, "Detection of pathological voice using cepstrum vectors: A deep learning approach," Journal of Voice, vol. 33, no. 5, pp. 634–641, 2019.
[2] S. R. Schwartz, S. M. Cohen, S. H. Dailey, R. M. Rosenfeld, E. S. Deutsch, M. B. Gillespie, E. Granieri, E. R. Hapner, C. E. Kimball, H. J. Krouse et al., "Clinical practice guideline: hoarseness (dysphonia)," Otolaryngology–Head and Neck Surgery, vol. 141, no. 1_suppl, pp. 1–31, 2009.
[3] M. R. McNeil, D. Robin, and R. Schmidt, Clinical Management of Sensorimotor Speech Disorders. Thieme New York, 1997.
[4] R. D. Kent, "Research on speech motor control and its disorders: A review and prospective," Journal of Communication Disorders, vol. 33, no. 5, pp. 391–428, 2000.
[5] L. E. DeLisi, "Speech disorder in schizophrenia: review of the literature and exploration of its relation to the uniquely human capacity for language," Schizophrenia Bulletin, vol. 27, no. 3, pp. 481–496, 2001.
[6] P. Lieberman, E. Kako, J. Friedman, G. Tajchman, L. S. Feldman, and E. B. Jiminez, "Speech production, syntax comprehension, and cognitive deficits in Parkinson's disease," Brain and Language, vol. 43, no. 2, pp. 169–189, 1992.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[8] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. v. Son, F. Weninger, F. Eyben, T. Bocklet et al., "The Interspeech 2012 speaker trait challenge," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[9] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. S. Narayanan, "Automatic intelligibility classification of sentence-level pathological speech," Computer Speech & Language, vol. 29, no. 1, pp. 132–144, 2015.
[10] B. Woldert-Jokisz, "Saarbruecken voice database," 2007.
[11] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[13] D. Banerjee, K. Islam, K. Xue, G. Mei, L. Xiao, G. Zhang, R. Xu, C. Lei, S. Ji, and J. Li, "A deep transfer learning approach for improved post-traumatic stress disorder diagnosis," Knowledge and Information Systems, vol. 60, no. 3, pp. 1693–1724, 2019.
[14] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[15] M. M. U. Chowdhury, F. Hammond, G. Konowicz, C. Xin, H. Wu, and J. Li, "A few-shot deep learning approach for improved intrusion detection," in 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON). IEEE, 2017, pp. 456–462.
[16] D. Martínez, E. Lleida, A. Ortega, A. Miguel, and J. Villalba, "Voice pathology detection on the Saarbrücken voice database with calibration and fusion of scores using multifocal toolkit," in Advances in Speech and Language Technologies for Iberian Languages. Springer, 2012, pp. 99–109.
[17] N. Souissi and A. Cherif, "Dimensionality reduction for voice disorders identification system based on mel frequency cepstral coefficients and support vector machine," in 2015 7th International Conference on Modelling, Identification and Control (ICMIC). IEEE, 2015, pp. 1–6.
[18] A. Al-nasheri, G. Muhammad, M. Alsulaiman, and Z. Ali, "Investigation of voice pathology detection and classification on different frequency regions using correlation functions," Journal of Voice, vol. 31, no. 1, pp. 3–15, 2017.
[19] D. Hemmerling, A. Skalski, and J. Gajda, "Voice data mining for laryngeal pathology assessment," Computers in Biology and Medicine, vol. 69, pp. 270–276, 2016.
[20] P. Harar, J. B. Alonso-Hernandezy, J. Mekyska, Z. Galaz, R. Burget, and Z. Smekal, "Voice pathology detection using deep learning: a preliminary study," in 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI). IEEE, 2017, pp. 1–4.
[21] K. A. Islam, D. Perez, and J. Li, "A transfer learning approach for the 2018 FEMH Voice Data Challenge," in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 5252–5257.
[22] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[23] Z. Dankovičová, D. Sovák, P. Drotár, and L. Vokorokos, "Machine learning approach to dysphonia detection," Applied Sciences, vol. 8, no. 10, p. 1927, 2018.
[24] H. Kasuya, S. Ogawa, Y. Kikuchi, and S. Ebihara, "An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology," Speech Communication, vol. 5, no. 2, pp. 171–181, 1986.
[25] A. A. Dibazar, S. Narayanan, and T. W. Berger, "Feature analysis for automatic detection of pathological speech," in Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society [Engineering in Medicine and Biology], vol. 1. IEEE, 2002, pp. 182–183.
[26] M. Eye and E. Infirmary, "Voice disorders database, version 1.03 (CD-ROM)," Lincoln Park, NJ: Kay Elemetrics Corporation, 1994.
[27] T. A. Mesallam, M. Farahat, K. H. Malki, M. Alsulaiman, Z. Ali, A. Al-Nasheri, and G. Muhammad, "Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms," Journal of Healthcare Engineering, vol. 2017, 2017.
[28] J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, M. C. Gonzalez-Rátiva, and E. Nöth, "New Spanish speech corpus database for the analysis of people suffering from Parkinson's disease," in LREC, 2014, pp. 342–347.
[29] H. Wu, J. Soraghan, A. Lowit, and G. Di Caterina, "A deep learning method for pathological voice detection using convolutional deep belief networks," Interspeech 2018, 2018.
[30] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
