Máté HIREŠ 1 (1st year)
Supervisor: Peter DROTÁR 2
1,2 Department of Computers and Informatics, FEI TU of Košice, Slovak Republic
1 mate.hires@tuke.sk, 2 peter.drotar@tuke.sk
Abstract—This paper describes an investigation of voice pathology detection. It is a preliminary study in the field of automatic detection of voice disorders. We compare some of the available voice databases, which contain both healthy and pathological voice recordings. We also compare several methods that use neural networks for the detection and have achieved significant results. Finally, we discuss the presented methods and their results in terms of accuracy.

Keywords—voice pathology, speech disorders, machine learning, signal processing, deep learning.

I. INTRODUCTION

The human voice can have a radical effect on the health condition of an individual. It can be radically affected by neoplasm, vocal palsy and other phonotraumatic diseases. Automatic detection of these pathological voice disorders is a very challenging and significant medical classification problem. Voice pathology includes the evaluation and treatment of voice production related diseases, which affect fluency, intonation, phonation and breathing. Since the computer-aided analysis and diagnosis of the human voice can be an effective and low-cost tool for patients around the world, voice pathology has attracted particular interest amongst machine learning and signal processing scientists. The current practice in the detection of voice disorders includes a special procedure called laryngeal endoscopy [1], [2]. This is a very complicated and expensive procedure, and it also requires an expert to perform the evaluation. To detect voice disorders without intrusion, we can analyze the voice directly.

Many scientists have worked on the automatic detection of voice disorders, especially in the last few years. Van der Merwe [3] provides a foundation and accentuates the necessity of research related to voice disorder detection. Kent [4] discusses the connection between voice production and its dysfunctions. Some studies found specific speech dysfunctions within particular population groups [5], [6].

Thanks to the amount of available data and the advancement in computational power, many suitable deep learning models are available for speech processing. Hence, we can use complex model architectures. We can train convolutional layers [7] to detect various features that could help us differentiate pathological and healthy voices. Some Interspeech challenges [8], [9] have also attracted interest in the application of machine learning and signal processing techniques for voice pathology detection.

Many of the available datasets contain recordings made in different environments, which makes it hard to find common features in the samples. The Saarbruecken Voice Database (SVD) [10] could be an ideal set of data to start the research in this area, since all of its recordings are sampled at 50 kHz with 16-bit resolution.

II. RELATED WORK

Deep learning models have grown popular since the introduction of deep belief networks in 2006 [11]. Since then, using these models, researchers have achieved significant results in many fields, such as medical analysis [2], [12], [13], computer vision [14], cybersecurity [15], etc.

There are many research articles in the field of voice pathology detection in which the researchers used the SVD for their work [16], [17], [18], [19], [20]. They extracted various features from the voice recordings, including mel-frequency cepstral coefficients, energy, entropy, harmonics-to-noise ratio, normalized noise energy, multidimensional voice program parameters, etc. Multiple classifiers have also been used. Most of the researchers used traditional methods such as the Support Vector Machine classifier (SVM), dimensionality reduction methods (DRM), glottal source excitation, Gaussian mixture models and k-means clustering. Deep Neural Network (DNN) methods were also significantly useful. Notable results were achieved, for example, with the Far Eastern Memorial Hospital (FEMH) dataset using transfer learning techniques [21], which consist of transferring knowledge from a model previously trained on a similar domain to a new task with little data [22].

Voices are mostly affected by cancer, nodules, polyps, cysts, laryngitis, vocal tremors, spasmodic dysphonia, vocal fold paralysis and sulcus diseases [1]. The results of voice pathology detection depend on the data used. Dankovičová et al. differentiated healthy and dysfunctional voice samples with an accuracy of over 70% using traditional methods [23]. Kasuya et al. effectively identified glottic cancer, vocal cord polyps and other nerve paralysis diseases [24]. Fang et al. identified polyps, cysts, nodules, neoplasms and other diseases by using deep learning models [1]. They achieved over 90% accuracy in female and over 94% accuracy in male subjects. Dibazar et al. combined pitch dynamics, mel-frequency spectral coefficients and a Hidden Markov Model classifier (HMM) to identify some voice disorders. Working with data from the Massachusetts Eye and Ear Infirmary (MEEI), they implemented a model which achieved an accuracy of over 99% [25]. Souissi et al. achieved over 86% accuracy using 2000 recordings [17]. Al-nasheri et al. reached an accuracy of 99% [18]. Hemmerling et al. achieved the highest accuracy, 100%, by detecting the disorders for women and men separately [19]. However, since these results were achieved on specific datasets, the high accuracies can be questionable.
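As a concrete illustration of the frame-based feature extraction common to the systems above, the following minimal sketch splits a signal into 25 ms windows with a 10 ms shift and computes two toy per-frame descriptors (log energy and zero-crossing rate) plus their time derivatives. The synthetic input and the descriptors are simplified stand-ins for illustration only, not the exact feature sets or pipelines of the cited papers:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (25 ms window, 10 ms shift)."""
    frame = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // shift)
    return np.stack([x[i * shift:i * shift + frame] for i in range(n)])

def frame_features(frames):
    """Toy per-frame descriptors: log energy and zero-crossing rate."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])

sr = 16000                                   # assumed sampling rate
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)              # 1 s synthetic tone as a stand-in voice
base = frame_features(frame_signal(x, sr))   # frame-level descriptors
delta = np.gradient(base, axis=0)            # first time derivative (delta features)
features = np.hstack([base, delta])          # final per-frame feature vectors
print(features.shape)                        # (98, 4)
```

In the cited systems, the toy descriptors would be replaced by MFCCs, harmonics-to-noise ratio and similar measures, and the resulting vectors fed to a classifier such as an SVM.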
The following model was created by Banerjee et al. for feature extraction [13]. First, a 25 ms segment size and a 10 ms shift size are used to extract the features for each patient. The model extracts three types of features for each segment:
• Prosodic features,
• Vocal-tract features,
• Excitation features.
These three types produce 54 features in total for each segment. Next, taking the first and second time derivatives extends these to 162 features per segment. Finally, 15 consecutive feature vectors are concatenated into a final vector of size 2430 (15 × 162), which serves for diagnosis. The detailed description of the features is shown in Table I.

E. Transfer Learning

Transfer learning has proved to be notably useful across deep learning methods. This technique is useful especially when there is limited training data. It consists of transferring knowledge from a model previously trained on a similar domain to a new task with little data [22]. Once the base model is trained, the transfer learning approach can be applied to different layers of the trained model. The following transfer learning approach was modelled by Islam et al. [21] on top of the DBN model (see the previous subsection) and contains the following steps:
1) Train a DBN model using all the samples from the TIMIT dataset for 39 phoneme classes, see Table I.
2) The FEMH training dataset is fed to the trained DBN model to find the representation in a given layer.
3) An SVM classifier with a linear kernel is trained on the new representation of the FEMH data, using grid optimization to find the optimal parameters.
4) The trained SVM model predicts the unlabelled testing data in the FEMH dataset.
5) The whole procedure is repeated for all the layers of the DBN model.

V. DISCUSSION

In this paper we have discussed the process of voice pathology detection. We showed some notable results achieved using machine learning techniques, and we also reviewed some auxiliary methods used in machine learning algorithms. Since deep learning and transfer learning approaches tend to be more precise in comparison with traditional voice recognition methods, we can consider further investigation of voice pathology detection using deep learning techniques to achieve more accurate results. Convolutional neural networks have proved very effective in extracting important features from the spectrograms of voice recordings, which helps in diagnosis. A deep belief network is useful for making the system more robust by initializing the weights. In future work we will further analyze and experiment with optimizing and upgrading the existing algorithms, or combining the used methods, so that we can achieve better performance. Transfer learning and CNN-based feature extraction could potentially provide an additional approach to the automatic voice pathology detection problem.

ACKNOWLEDGMENT

This work was supported by the Slovak Research and Development Agency under contract No. APVV-16-0211.

REFERENCES

[1] S.-H. Fang, Y. Tsao, M.-J. Hsiao, J.-Y. Chen, Y.-H. Lai, F.-C. Lin, and C.-T. Wang, “Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach,” Journal of Voice, vol. 33, no. 5, pp. 634–641, 2019.
[2] S. R. Schwartz, S. M. Cohen, S. H. Dailey, R. M. Rosenfeld, E. S. Deutsch, M. B. Gillespie, E. Granieri, E. R. Hapner, C. E. Kimball, H. J. Krouse et al., “Clinical practice guideline: hoarseness (dysphonia),” Otolaryngology–Head and Neck Surgery, vol. 141, no. 1_suppl, pp. 1–31, 2009.
[3] M. R. McNeil, D. Robin, and R. Schmidt, Clinical management of sensorimotor speech disorders. Thieme New York, 1997.
[4] R. D. Kent, “Research on speech motor control and its disorders: A review and prospective,” Journal of Communication Disorders, vol. 33, no. 5, pp. 391–428, 2000.
[5] L. E. DeLisi, “Speech disorder in schizophrenia: review of the literature and exploration of its relation to the uniquely human capacity for language,” Schizophrenia Bulletin, vol. 27, no. 3, pp. 481–496, 2001.
[6] P. Lieberman, E. Kako, J. Friedman, G. Tajchman, L. S. Feldman, and E. B. Jiminez, “Speech production, syntax comprehension, and cognitive deficits in Parkinson’s disease,” Brain and Language, vol. 43, no. 2, pp. 169–189, 1992.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[8] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. v. Son, F. Weninger, F. Eyben, T. Bocklet et al., “The Interspeech 2012 speaker trait challenge,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[9] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. S. Narayanan, “Automatic intelligibility classification of sentence-level pathological speech,” Computer Speech & Language, vol. 29, no. 1, pp. 132–144, 2015.
[10] B. Woldert-Jokisz, “Saarbruecken voice database,” 2007.
[11] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[13] D. Banerjee, K. Islam, K. Xue, G. Mei, L. Xiao, G. Zhang, R. Xu, C. Lei, S. Ji, and J. Li, “A deep transfer learning approach for improved post-traumatic stress disorder diagnosis,” Knowledge and Information Systems, vol. 60, no. 3, pp. 1693–1724, 2019.
[14] P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via a boosted deep belief network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[15] M. M. U. Chowdhury, F. Hammond, G. Konowicz, C. Xin, H. Wu, and J. Li, “A few-shot deep learning approach for improved intrusion detection,” in 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON). IEEE, 2017, pp. 456–462.
[16] D. Martínez, E. Lleida, A. Ortega, A. Miguel, and J. Villalba, “Voice pathology detection on the Saarbrücken voice database with calibration and fusion of scores using multifocal toolkit,” in Advances in Speech and Language Technologies for Iberian Languages. Springer, 2012, pp. 99–109.
[17] N. Souissi and A. Cherif, “Dimensionality reduction for voice disorders identification system based on mel frequency cepstral coefficients and support vector machine,” in 2015 7th International Conference on Modelling, Identification and Control (ICMIC). IEEE, 2015, pp. 1–6.
[18] A. Al-nasheri, G. Muhammad, M. Alsulaiman, and Z. Ali, “Investigation of voice pathology detection and classification on different frequency regions using correlation functions,” Journal of Voice, vol. 31, no. 1, pp. 3–15, 2017.
[19] D. Hemmerling, A. Skalski, and J. Gajda, “Voice data mining for laryngeal pathology assessment,” Computers in Biology and Medicine, vol. 69, pp. 270–276, 2016.
[20] P. Harar, J. B. Alonso-Hernandezy, J. Mekyska, Z. Galaz, R. Burget, and Z. Smekal, “Voice pathology detection using deep learning: a preliminary study,” in 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI). IEEE, 2017, pp. 1–4.
[21] K. A. Islam, D. Perez, and J. Li, “A transfer learning approach for the 2018 FEMH Voice Data Challenge,” in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 5252–5257.
[22] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[23] Z. Dankovičová, D. Sovák, P. Drotár, and L. Vokorokos, “Machine Learning Approach to Dysphonia Detection,” Applied Sciences, vol. 8, no. 10, p. 1927, 2018.
[24] H. Kasuya, S. Ogawa, Y. Kikuchi, and S. Ebihara, “An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology,” Speech Communication, vol. 5, no. 2, pp. 171–181, 1986.
[25] A. A. Dibazar, S. Narayanan, and T. W. Berger, “Feature analysis for automatic detection of pathological speech,” in Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society [Engineering in Medicine and Biology], vol. 1. IEEE, 2002, pp. 182–183.
[26] M. Eye and E. Infirmary, “Voice disorders database, version 1.03 (CD-ROM),” Lincoln Park, NJ: Kay Elemetrics Corporation, 1994.
[27] T. A. Mesallam, M. Farahat, K. H. Malki, M. Alsulaiman, Z. Ali, A. Al-Nasheri, and G. Muhammad, “Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms,” Journal of Healthcare Engineering, vol. 2017, 2017.
[28] J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, M. C. Gonzalez-Rátiva, and E. Nöth, “New Spanish speech corpus database for the analysis of people suffering from Parkinson’s disease,” in LREC, 2014, pp. 342–347.
[29] H. Wu, J. Soraghan, A. Lowit, and G. Di Caterina, “A deep learning method for pathological voice detection using convolutional deep belief networks,” Interspeech 2018, 2018.
[30] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, 1993.