Abstract

This paper investigates the use of Multi-Dimensional Voice Program (MDVP) parameters to automatically detect voice pathology in the Arabic voice pathology database (AVPD). MDVP parameters are very popular among physicians and clinicians for detecting voice pathology; however, MDVP is a commercial software package. The AVPD is a newly developed speech database designed to suit a wide range of experiments in the field of automatic voice pathology detection, classification, and automatic speech recognition. This paper is the first step in evaluating MDVP parameters on the AVPD using the sustained vowel /a/. The experimental results demonstrate that some of the acoustic features show an excellent ability to discriminate between normal and pathological voices. The overall best accuracy is 81.33%, obtained by using an SVM classifier. A voice disorder database is an essential element in doing research on automatic voice disorder detection and classification. Ethnicity affects the voice characteristics of a person, and so it is necessary to develop a database by collecting the voice samples of the targeted ethnic group. This will enhance the chances of arriving at a global solution for the accurate and reliable diagnosis of voice disorders by understanding the characteristics of a local group. Motivated by this idea, an Arabic voice pathology database (AVPD) was designed and developed in this study by recording three vowels, running speech, and isolated words. For each recorded sample, the perceptual severity is also provided, which is a unique aspect of the AVPD. During the development of the AVPD, the shortcomings of different voice disorder databases were identified so that they could be avoided in the AVPD. In addition, the AVPD is evaluated by using six different types of speech features and four types of machine learning algorithms. The results of detection and classification of voice disorders obtained with the sustained vowel and the running speech are also compared with the results of an English-language disorder database, the Massachusetts Eye and Ear Infirmary (MEEI) database.
INTRODUCTION

The human voice is used in talking, singing, laughing, crying and screaming in order to express feelings and to communicate with the community [1]. The voice is also considered a crucial tool in the lives of many professionals, with approximately 25% of the economically active population considering their voice to be an important instrument of their work [2]. The concept of "normal voice" is complicated, and there is no consensus on the subject. There is no pattern of "normal voice": there are no defined limits of what is considered normal, nor of the point from which it can be stated that a person has dysphonia [3]. When the voice changes negatively, it is said to be disturbed or dysphonic [4]. Dysphonia, therefore, can be defined as any difficulty or change in vocal emission that does not allow for natural voice production [5,6], preventing momentary or permanent oral communication [7]. Thus, dysphonia causes damage to the individual, since the voice produced exhibits difficulties or limitations in fulfilling its basic role of transmitting the verbal and emotional message [6]. Dysphonia is a symptom, not a disease; it is a manifestation that is part of the speech disorder picture [3]. Dysphonia is the main symptom of oral communication disorders [6]. However, voice disorders are manifested beyond the dysphonic picture; the patient may experience difficulty in keeping his/her voice (asthenia), vocal fatigue, and variation in the habitual fundamental vocal frequency.

Despite being one of the major forms of human expression and being used every day by most people, approximately 10% of the overall population presents with voice problems, and among voice professionals the share reaches 50% [8],[9]. Youngsters and adults are equally affected; however, the causes differ according to the age group. Clinical voice pathology detection is performed through several procedures, including acoustic analysis, which consists of estimating appropriate parameters extracted from the voice signal to assess any possible changes of the vocal tract, in line with the recommendations of the SIFEL protocol [10] (Società Italiana di Foniatria e Logopedia), developed by the Italian Society of Phoniatrics and Logopedics following the directions of the Committee for Phoniatrics of the European Society of Laryngology. It is a non-invasive examination in clinical practice, complementary to other clinical tests such as the laryngoscopic examination based on direct observation of the vocal folds.

Numerous acoustic parameters are computed to assess the state of health of the voice. Regrettably, the accuracy of those parameters in the detection of voice problems is frequently tied to the algorithms used to estimate them. For that reason, the principal effort of researchers is oriented to the study of acoustic parameters and the application of classification
Voice Disorder by using the AVPD including Machine Learning
strategies able to achieve a high discrimination accuracy. Currently, research in speech pathology has focused its interest on machine learning techniques.

The Arabic voice pathology database (AVPD) will have a potential impact on the assessment of voice disorders in the Arab region. Race has been suggested to contribute to the perception of voice, with Walton and Orlikoff [1] showing, for example, that measures of amplitude and frequency perturbation in African-American adult males are not equal to those of white adult males. Additionally, Sapienza [2] analyzed the vowel /a/ in a group of 20 African Americans and 20 white Americans, finding that African-American males and females had higher mean fundamental frequencies and lower sound pressure levels, although the differences were not significant. This difference was partially attributed to the larger ratio of the membranous to cartilaginous portion of the vocal folds and their increased thickness, a finding previously reported by Boshoff [3]. Sapienza [2] did not examine other acoustic parameters for gender or racial differences. Walton and Orlikoff [1] found, through acoustical analysis, that African-American speakers had significantly greater amplitude perturbation measures and significantly lower harmonics-to-noise ratios than did white adult males. Although the former had a lower mean speaking fundamental frequency than the latter, the differences were not significant in the group of 50 subjects.

We observed that most researchers were using standard databases such as the Massachusetts Eye and Ear Infirmary (MEEI) database and the Saarbruecken Voice Database (SVD). That is why we decided to use the Arabic Voice Pathology Database (AVPD): to enlarge interest in the Arab voice and in identifying disordered voices at the Arabic language level.

This paper is organized as follows: An overview of current studies and some related works is presented in Section 2. In Section 3, we describe our proposed voice pathology detection based on the AVPD database. We present the experimental results in Section 4. Finally, we present our conclusions and directions for future research in Section 5.

RELATED WORK

The results of the published studies vary significantly due to the variance among the data sets used in the experiments. According to Martínez et al. [24], the accuracy obtained using 200 recordings of the sustained vowel /a/ represents a high value, and their setup is very close to our study. Other studies applied the combination of the vowels /a/, /i/ and /u/ to obtain high accuracy and did not focus on the causes of the pathology. In the study by Souissi et al. [25], a high accuracy of 87.82% was achieved using a subset of four types of voice pathologies out of 71 kinds. Also, Al-Nasheri et al. [38,26] achieved an accuracy of 99.68% due to their use of a subset of the various pathologies to conduct experiments on records that were also present in other data sets, including the Arabic Voice Pathology Database (AVPD) and the Massachusetts Eye and Ear Infirmary (MEEI) database. Another study, carried out by Muhammad et al. [37], applied a subset of three types of voice pathologies and achieved an accuracy of 93.20%; in addition, they utilized a combination of the voice information and an electroglottograph signal to increase the accuracy to 99.98%. However, in another study, conducted by Hemmerling et al. [27], a high accuracy of 100% was achieved in the detection task by their method of separating male and female speakers. The study of Hammami et al. [28] assessed the performance of proposed high-order statistical features extracted from the wavelet space to discriminate between normal voices and pathological ones. Conventional features, including the mean wavelet value, the mean wavelet energy and the mean wavelet entropy, were used in the experiments. These features, combined with an SVM classifier, reached the highest accuracy of 99.26% in the detection step and 100% when classifying the data. With the intention of including concrete, clinically validated values, a clinical evaluation was performed on data gathered from subjects from a hospital in Tunisia; the outcomes were suitable, and the accuracies were 94.82% and 94.44% for detection and classification, respectively. Fonseca et al. [29] worked on the detection of co-existent laryngeal problems for which the predominant phonic symptom is the same, developing features with noteworthy inter-class overlap. Based on the combination of SE, ZCR and SH, all applied for feature extraction, together with DPM, employed for classification, the proposed technique was successfully concluded, productively dealing with definitions and inconsistencies with a predicted precision of 95%. A continuing challenge of dysphonic voice research noted by Rueda and Krishnan [30] is the small size of the available databases: it is very complicated to apply more advanced deep learning techniques without underfitting or overfitting. They proposed an adaptive approach to decompose a signal's components using a Fourier-based synchrosqueezing transform (FSST) for data augmentation and transformation; the resulting time-frequency (TF) representation becomes the input to a CNN.

Different voice disorders affect different frequency bands depending on the kind of disorder and its place on the vocal folds, as we discovered. Thus, monitoring the frequency bands is highly important to evaluate which one contributes more to the detection and classification of voice disorders. For instance, Pouchoulin et al. [31] stated that the lower frequencies (below 3000 Hz) are more suitable for recognizing pathological voices, and the authors of [32] verified that the behavior of dysphonic voice signals is significantly less stable in the frequency region between 2000 and 6400 Hz than in the other frequency regions. To discuss the results of a comparative literature review, they analysed voice data of sustained phonation of the vowel /a/ as well. However, in contrast to past studies, we analyze a bigger database collected from the SVD [34]. Furthermore, to propose systems capable of powerful voice pathology detection and classification, we do not confine the database to a subset of popular voice pathologies; in this study, the database consists of an expansive number of pathologies with few recordings each. As we found in the related works, apart from past work [34], no other research has applied deep learning strategies for voice pathology identification. In the following sections we make use of a robust voice pathology identification system based on acoustic feature extraction. We use voice pathology detection and identification utilising a CNN method, and we make use of transfer learning with modern, effective CNN models; in particular, the ResNet34 models were used. To address the difficulty of the inadequate distribution of voice pathologies with few recordings in the data sets, we additionally explore the utilization of abnormality detection methods.
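The abnormality-detection idea mentioned above can be illustrated with a one-class model trained only on normal-voice features, so that pathological samples surface as outliers. This is a sketch, not the cited systems' implementation; the feature vectors are synthetic stand-ins and all variable names are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for acoustic feature vectors (e.g., 12 cepstral features per sample).
normal_feats = rng.normal(0.0, 1.0, size=(200, 12))   # training data: normal voices only
patho_feats = rng.normal(4.0, 1.5, size=(20, 12))     # held-out pathological voices

scaler = StandardScaler().fit(normal_feats)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(scaler.transform(normal_feats))

# +1 = normal (inlier), -1 = abnormal (outlier)
pred_normal = ocsvm.predict(scaler.transform(normal_feats))
pred_patho = ocsvm.predict(scaler.transform(patho_feats))
print((pred_normal == 1).mean(), (pred_patho == -1).mean())
```

Because the model never sees pathological data, it sidesteps the few-recordings-per-pathology problem: rare disorders only need to look unlike normal voices, not like each other.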
Materials and methods

DATA

A voice disorder database is an essential element in an automatic voice disorder detection (AVDD) system. The dataset consists of voice recordings of both normal and pathological voices. The recordings can contain either sustained vowel phonation or continuous speech. In our paper, we have observed that most of the researchers have used standard databases, such as the Massachusetts Eye and Ear Infirmary (MEEI) database.

Table 1 (fragment): Arabic digits, their English translations, and IPA transcriptions.

Arabic digit   English   IPA transcription
sifr           Zero      /s/, /i/, /f/, /r/
wahid          One       /w/, /a/, /h/, /i/, /d/
ithnayn        Two       /a/, /th/, /n/, /a/, /y/, /n/
thalatha       Three     /th/, /a/, /l/, /a/, /th/, /a/
arba'a         Four      /a/, /r/, /b/, / /, /a/
khamsa         Five      /kh/, /a/, /m/, /s/, /a/
sitta          Six       /s/, /i/, /t/, /t/, /a/
Arabic Voice Pathology Database

Video-Laryngeal Stroboscopic Examination

KayPENTAX's video-laryngeal stroboscopic system (Model 9200C2) was used in the examination, including a 70° rigid endoscope, a 3CCD Toshiba camera, a Sony LCD monitor, and a light source (Model RLS 9100B). Clinical diagnosis and classification of voice disorders were decided based on the laryngoscopic examination. Two experienced phoniatricians were responsible for the clinical diagnosis and classification of voice disorders. In case of an unclear diagnosis, the two examiners reviewed the recorded video-laryngeal examinations and a consensus decision about the clinical diagnosis was obtained. Normal subjects were screened to ensure that they did not suffer from any vocal impairments and that they had no vocal complications in the past. All subjects gave their consent, certifying that they had no objection to the use of their recordings in the AVPD.

Figure 1 This pie chart represents the number of studies [59].

Recording Text

The text was compiled in a way that ensured that it was simple and short, and at the same time covered all the Arabic phonemes. The first type of text was three vowels, fatha /a/, damma /u/, and kasra /i/, which were recorded with a repetition. The second type of text involved isolated words, including the Arabic digits from zero to ten and some common words (see Tables 1 and 2). The third type of text was running speech (see Table 3): the continuous speech was taken from the first chapter of the Quran, called Al-Fateha. One of the reasons behind the selection of this religious text is that most of the visitors to our voice disorder unit are illiterate; we therefore selected a religious text because every Muslim
memorizes it by heart. The other reason is the duration of Al-Fateha, which is 20 seconds; this is better than the duration of the running speech of the MEEI database (9 seconds) and of the SVD database (2 seconds).

Table 3 (fragment): the sentences of Al-Fateha with sentence numbers and English translations (e.g., sentence 1 ends "... all the worlds"; sentence 2: "The Compassionate, the Merciful").

The Arabic digits and Al-Fateha covered all the Arabic letters except three. Therefore, some common words were included in the text to cover these omissions: zarf (envelope), ghazal (deer), and jamal (camel), as mentioned in Table 2. The number of occurrences of each Arabic letter in the recorded text is mentioned in Table 4. For illiterate patients, we showed pictures of the envelope, the deer, and the camel to record these words.

Fifty-one percent of the samples belong to normal subjects, and the remaining subjects are distributed among five voice disorders: sulcus 11%, nodules 5%, cyst 7%, paralysis 14%, and polyp 11% (Figure 1(a)). Among the 51% of normal subjects (188 samples), there are 116 male and 82 female speakers. The numbers of female and male subjects, respectively, for the different disorders are as follows: sulcus 20 and 21, ..., and polyps 18 and 22 (Figure 1(b)). The inner ring in Figure 1(b) represents the number of female subjects, while the outer ring shows the number of male subjects.
Approximately 60% of the subjects in the AVPD are male, while 40% are female. Information about the mean age (in years) of the recorded subjects, with standard deviation (STD), is provided in Figure 2. The average age ± STD of male subjects who are normal or suffering from sulcus, nodules, cysts, paralysis, or polyps is 27 ± 10, 35 ± 13, 12 ± 2, 25 ± 18, 46 ± 15, and 48 ± ..., respectively.

Segmentation of Recorded Samples

Recorded samples were divided into the following 22 segments: six segments for the vowels (three vowels plus their repetitions), 11 segments for the Arabic digits (zero to ten), two segments for Al-Fateha (divided in this manner so that the first part may be used to train the system and the second part to test it), and three segments for the common words. The first part of Al-Fateha starts at sentence number 1 and ends at 4, while the second part contains the last three sentences.

Each of the 22 segments was stored in a separate wav file. The segmentation was performed with the help of the Praat software [28] by labeling the start and end time of each segment. Then, these two times were used to extract a segment from a recorded sample. Once each recorded sample was divided into segments and stored into 22 wav files, the next step was the verification process, which ensured that each segmented wav file consisted of a complete utterance. During the verification process, we encountered three types of errors, as described in Table 5. A record of the errors was maintained in an Excel sheet, where the 22 segments were listed along the columns and the recorded samples were listed along the rows. If a segment had any of the above errors, then i, m, or d was noted under that segment. The next step was the correction of these errors by updating the start and end times of the segments, because these errors occur due to incorrect labeling of these two times. After updating the times, the erroneous segments were extracted again by using the updated time information. All tasks associated with the segmentation of the AVPD are presented and described in Table 6.

Feature Extraction

Many speech feature extraction algorithms (MFCC, LPC, LPCC, PLP, RASTA-PLP, and MDVP) were implemented in this module of the automatic assessment system. Before the extraction of features, the speech signal was divided into frames of 20 milliseconds, which makes the analysis easier because speech changes quickly over time. The MFCC mimics human auditory perception, while the LPC and the LPCC mimic the human speech production system. The PLP and the RASTA-PLP simulate, to some extent, both the auditory and the production mechanisms.

In the MFCC [17, 18], the time-domain speech signal is converted into a frequency-domain signal, which is filtered by applying a set of band-pass filters. The center frequencies of the filters are spaced on the Mel scale, and the bandwidths correspond to the critical bandwidths of the human auditory system. The Mel scale is given by (1), where f is the frequency in Hz and m represents the corresponding frequency on the Mel scale. In this study, 29 Mel-scale filters are used. Finally, a discrete cosine transform is applied to the filtered outputs to compress and decorrelate them.

m = 2595 log10(1 + f / 700)    (1)
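The Mel mapping in (1) and the 29-filter bank it drives can be sketched as follows. The filter count (29), the 20 ms frame, the 12 kept coefficients and the final DCT follow the text; the 16 kHz sampling rate and 512-point FFT are assumptions for the sake of a runnable example.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Eq. (1): m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=29, n_fft=512, sr=16000):
    # Center frequencies equally spaced on the Mel scale, triangular responses.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fb, n_ceps=12):
    spec = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1))) ** 2
    logmel = np.log(fb @ spec + 1e-10)                 # band-pass filtering on the Mel scale
    return dct(logmel, type=2, norm="ortho")[:n_ceps]  # DCT compresses and decorrelates

sr = 16000
frame = np.sin(2 * np.pi * 220 * np.arange(int(0.02 * sr)) / sr)  # one 20 ms frame
fb = mel_filterbank()
print(mfcc_frame(frame, fb).shape)
```

A convenient property of (1) is that it is roughly the identity below 1 kHz (hz_to_mel(1000) is about 1000) and logarithmic above it, matching the critical-band behavior of the ear.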
Table 5 Types of errors encountered during the verification of the segments, with examples.

Error type   Symbol   Description                                  Examples
Incomplete   i        Some part of the extracted text is           (a) "d" is missing in wahid; (b) "w" is missing in
                      missing at the start or the end              wahid; (c) both "w" and "d" are missing
More         m        A segment contains some part of the          (a) the segment of sifar also contains the "w" of the
                      next or the previous segment                 next segment, wahid; (b) the segment of ithnayn also
                                                                   contains the "d" of the previous segment, wahid
Different    d        The text in a segment is other than          the segment contains wahid instead of sifar
                      the expected one
During the extraction of the LPC features, linear prediction (LP) filtering is applied to speech signals to remove the effects of the formants in order to estimate the source signal [29]. For LP analysis of order P, the current sample of the source signal can be estimated from the P previous samples by using (2), where x1, x2, x3, ..., xr are the samples of the original speech signal and the ai represent the required LPC features.

x^(n) = a1 x(n-1) + a2 x(n-2) + ... + aP x(n-P)    (2)

To get accurate LPC features, it is necessary to minimize the error E between the current and the predicted samples, which is done by setting the first-order derivative of E equal to zero and solving the resulting equations by using the Levinson-Durbin algorithm [30]. Moreover, the LPCC features are calculated by using the recursive relation [31] given in (3) and (5), where σ^2 is the gain of the LP analysis, P is the order of the LP analysis, the an are the LPC features, and the cn are the obtained LPCC features. In this study, we performed LP analysis with P = 11.

c1 = ln σ^2    (3)

cn = an + Σ_{k=1}^{n-1} (k/n) ck a(n-k), 1 < n ≤ P;    cn = Σ_{k=n-P}^{n-1} (k/n) ck a(n-k), n > P    (5)

In the PLP analysis, the speech spectrum is mapped onto critical bands; the center frequency of the jth critical band is represented by fj in (4), and the corresponding weight wj is given in (6). Furthermore, the intensity-loudness power law of hearing is used to simulate the nonlinear relationship between the intensity of a sound and its perceived loudness [34]. The RASTA-PLP extraction process additionally applies the band-pass filter given in (7) to the trajectory of each band.

R(z) = z^4 × (0.2 + 0.1 z^-1 - 0.1 z^-3 - 0.2 z^-4) / (1 - 0.94 z^-1)    (7)

In all types of experiments, static as well as delta and delta-delta features were considered. The delta and delta-delta coefficients were computed by taking the first-order and second-order derivatives of the static features, respectively; the derivative was calculated by taking a linear regression over a window of five elements. All experiments for MFCC, LPCC, and RASTA-PLP were conducted using 12 static features, extended with the delta and delta-delta coefficients in the corresponding configurations; for LPC and PLP, all experiments were performed analogously.

Pattern Matching

The computed features are multidimensional. SVM was implemented with linear and RBF kernels; GMM was used with different numbers of mixtures; VQ was applied with codebook sizes including 16 and 32 to generate acoustic models; and HMM was applied by using five states with 2, 4, and 6 mixtures in each state.
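A minimal sketch of the LP-to-LPCC chain and of the RASTA filter in (7), assuming the autocorrelation method with Levinson-Durbin and P = 11 as in the text. The synthetic AR(1) signal and variable names are illustrative, and the sign convention for the LPC features follows the predictor form x^(n) = Σ ai x(n-i).

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, P):
    """Solve the LP normal equations; returns A(z) coefficients [1, A1..AP] and the gain."""
    A = np.zeros(P + 1); A[0] = 1.0
    err = r[0]
    for i in range(1, P + 1):
        acc = r[i] + np.dot(A[1:i], r[i - 1:0:-1])
        k = -acc / err
        A[1:i + 1] = A[1:i + 1] + k * A[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return A, err

def lpcc(A, gain, Q=12):
    """LPCC from LPC via the recursions (3) and (5)."""
    P = len(A) - 1
    a = -A[1:]                       # predictor coefficients a_1..a_P
    c = np.zeros(Q + 1)
    c[0] = np.log(gain)              # eq. (3): ln sigma^2
    for n in range(1, Q + 1):
        s = sum((k / n) * c[k] * a[n - k - 1] for k in range(max(1, n - P), n))
        c[n] = (a[n - 1] if n <= P else 0.0) + s   # the n > P branch is eq. (5)
    return c[1:]

rng = np.random.default_rng(0)
x = lfilter([1.0], [1.0, -0.9], rng.normal(size=4000))        # synthetic AR(1) "speech"
r = np.correlate(x, x, mode="full")[x.size - 1:x.size + 11]   # autocorrelation lags 0..11
A, gain = levinson_durbin(r, P=11)                            # P = 11 as in the text
ceps = lpcc(A, gain)

# Causal realization of the RASTA band-pass filter in (7)
# (the z^4 advance is realized as a 4-sample delay).
b = [0.2, 0.1, 0.0, -0.1, -0.2]
a_den = [1.0, -0.94]
trajectory = np.log(np.abs(x[:200]) + 1e-6)   # stand-in for a log-band trajectory
filtered = lfilter(b, a_den, trajectory)
```

On this AR(1) source the recovered first predictor coefficient is close to 0.9, which is a quick sanity check that the Levinson-Durbin recursion is solving the right system.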
Detection and Classification Results for the AVPD

Detection experiments determine whether an unknown test sample is normal or disordered. Detection is a two-class problem: the first class consists of all normal samples, and the second class contains the samples of all types of disorder. In the classification of disorders, the objective is to determine the type of voice disorder; classification is a many-class problem, and the number of classes depends upon the number of types of voice disorder considered.

All voice samples of the MEEI and the AVPD were downsampled to 25 kHz, and each speech signal was divided into frames of 20 milliseconds with 50% overlap between consecutive frames. To avoid bias in the training and testing samples, all experiments used five disjoint folds: one fold was used to test the system, while
the remaining four were used to train the system. The accuracy of each experiment was computed as in (8).

Accuracy(%) = (Total correctly detected samples / Total number of samples) × 100    (8)

The MEEI database was recorded by the Massachusetts Eye and Ear Infirmary voice and speech laboratory [40]. The subset of the database used in this study contains 53 samples of normal subjects and 173 samples of disordered subjects suffering from adductor spasmodic dysphonia, nodules, keratosis, polyps, and paralysis. The detection and classification accuracies for the MEEI database are presented in Table 8. The maximum detection accuracy for the MEEI database with the sustained vowel is 93.6%, obtained by using MFCC and RASTA-PLP with SVM. The maximum accuracy for running speech is 98.7%, obtained by using LPC and GMM. Similarly, for the classification of disorders, the maximum accuracy with the sustained vowel is 98.2%, achieved with LPCC and PLP with VQ. The classification of disorders with running speech reached an accuracy of 97.3% by using SVM with all types of speech features.

Table 8 Overall best accuracies (%) for the sustained vowel (/a/) and running speech (Al-Fateha) by using the MEEI database.

Features    Experiments      SVM             GMM             VQ              HMM
                             /a/   Speech    /a/   Speech    /a/   Speech    /a/   Speech
LPCC        Detection        91.0  97.9      90.7  96.4      83.2  97.8      87.6  98.2
            Classification   95.4  97.3      97.3  97.3      98.2  97.3      87.5  97.3
RASTA-PLP   Detection        93.6  98.0      91.6  98.1      84.1  96.4      88.9  98.1
            Classification   95.5  97.3      97.3  97.3      97.3  96.3      85.2  84.6
LPC         Detection        82.9  96.0      83.2  98.7      78.3  97.3      80.1  96.3
            Classification   95.2  97.3      97.3  97.3      97.3  94.4      75.0  82.5
PLP         Detection        87.8  96.8      91.2  97.8      89.4  97.8      87.4  96.3
            Classification   95.0  97.3      97.3  97.3      98.2  94.4      61.1  84.6
MDVP        Detection        89.5  --        88.3  --        68.3  --        --    --
            Classification   88.9  --        --    --        --    --        --    --

Overall best accuracies (%) for the sustained vowel (/a/) and running speech (Al-Fateha) by using the AVPD.

Features    Experiments      SVM             GMM             VQ              HMM
                             /a/   Speech    /a/   Speech    /a/   Speech    /a/   Speech
MFCC        Detection        76.5  77.4      74.4  77.1      70.3  71.1      71.6  78.1
            Classification   89.2  89.2      88.9  89.5      75.3  81.6      88.7  90.9
LPCC        Detection        60.1  76.5      54.5  76.7      70.3  75.9      73.5  71.5
            Classification   67.6  84.7      75.4  86.0      75.5  77.9      59.0  86.0
RASTA-PLP   Detection        77.0  76.7      72.8  74.5      67.1  75.0      66.3  79.0
            Classification   92.9  90.2      91.3  91.2      88.9  90.3      88.7  92.7
LPC         Detection        62.3  71.6      53.7  71.9      70.7  71.5      71.4  62.3
            Classification   66.3  82.4      74.6  79.7      78.6  75.3      85.9  75.9
PLP         Detection        75.8  79.1      73.2  78.5      72.0  78.1      73.6  81.6
            Classification   91.5  90.1      88.9  91.2      79.4  77.2      88.7  85.8
MDVP        Detection        79.5  --        69.8  --        64.8  --        --    --
            Classification   82.3  --        --    --        --    --        --    --

For the AVPD, the best rates are: detection with the sustained vowel 79.5% (MDVP, SVM); detection with running speech 81.6% (PLP, HMM); classification with the sustained vowel 92.9% (RASTA-PLP, SVM); and classification with running speech 92.7% (RASTA-PLP, HMM).
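The detection setup described above (20 ms frames with 50% overlap, five folds with one fold held out for testing, an SVM with a linear or RBF kernel, and accuracy as in (8)) can be sketched end to end. The synthetic "utterances" and the toy log-energy features are stand-ins for the real recordings and cepstral features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
sr = 25000                        # samples downsampled to 25 kHz, as in the text
frame_len = int(0.020 * sr)       # 20 ms frames
hop = frame_len // 2              # 50% overlap

def frame_signal(x, frame_len, hop):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def utterance_features(x):
    # Toy per-utterance feature: mean/std of per-frame log energy.
    frames = frame_signal(x, frame_len, hop)
    loge = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    return np.array([loge.mean(), loge.std()])

# Synthetic "normal" (class 0) and "disordered" (class 1) sustained vowels.
X, y = [], []
for label, amp in [(0, 1.0), (1, 0.3)]:
    for _ in range(40):
        t = np.arange(int(0.5 * sr)) / sr
        x = amp * np.sin(2 * np.pi * 150 * t) + 0.05 * rng.normal(size=t.size)
        X.append(utterance_features(x)); y.append(label)
X, y = np.array(X), np.array(y)

# Two-class detection: SVM, five folds, accuracy per eq. (8).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(f"Accuracy: {100 * scores.mean():.2f}%")
```

Stratified folds matter here: they keep the normal/disordered ratio the same in every fold, which is the cross-validation analogue of the bias-avoidance point made in the text.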
All experiments performed for the AVPD were also performed with the MEEI database in order to compare the results, and the experimental setup for the MEEI database is the same as the one used for the AVPD.

The perceptual severity of every disordered sample was also evaluated during the development of the AVPD and rated on a scale of 1 to 3, where 3 represents a voice disorder with high severity. Furthermore, the normal subjects in the AVPD were recorded after clinical evaluation under the same conditions as those used for the pathological
                        MEEI                          AVPD                            SVD
(2) Recording           Massachusetts Eye and Ear     Communication and Swallowing    Saarland University,
    location            Infirmary (MEEI) voice and    Disorders Unit, King            Germany
                        speech laboratory, USA        Abdulaziz University
                                                      Hospital, Saudi Arabia
(3) Sampling            Samples are recorded at       All samples are recorded        All samples are
    frequency           different sampling            at the same frequency:          recorded at the same
                        frequencies:                  (i) 48 kHz                      frequency:
                        (i) 10 kHz (ii) 25 kHz                                        (i) 50 kHz
                        (iii) 50 kHz
(4) Extension of        Recorded samples are          Recorded samples are            Recorded samples are
    recorded samples    stored in .NSP format         stored in .wav and .nsp         stored in .wav and
                        only                          formats                         .nsp formats
(5) Recorded text       (i) Vowel /a/                 (i) Vowel /a/                   (i) Vowel /a/
                        (ii) Rainbow passage          (ii) Vowel /i/                  (ii) Vowel /i/
                                                      (iii) Vowel /u/                 (iii) Vowel /u/
                                                      (iv) Al-Fateha (running         (iv) A sentence
                                                           speech)
                                                      (v) Arabic digits
                                                      (vi) Common words
                                                      (All vowels are recorded
                                                      with a repetition)

Table 9 Comparison of the AVPD with two publicly available voice disorder databases.
subjects. In the MEEI database, the normal subjects were not clinically evaluated, although they do not have any history of voice complications [44]. In the SVD database, no such information is available. In one study, features are used with the nearest-mean classifier; the numbers of normal and pathological samples used are 53 and 657, respectively, taken from the MEEI database. The interpretation of the results (an accuracy of 65.26%) becomes difficult when the data are unbalanced, because it cannot be determined how many normal and how many pathological samples are detected correctly by the system. One of the many possibilities may be that the specificity is 0% and the sensitivity is 70.47%; another possibility may be a different trade-off between the two.

Both normal and pathological samples are recorded at a single sampling frequency in the AVPD. This is important because Deliyski et al. concluded that the sampling frequency influences the accuracy and reliability of acoustic analysis [48]. In addition, the MEEI database contains one vowel, whereas the AVPD records three vowels.
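The arithmetic behind this caution can be checked directly; the counts (53 normal, 657 pathological) come from the cited study, while the confusion split below is the hypothetical one described in the text.

```python
# Hypothetical split matching the text: sensitivity 70.47%, specificity 0%.
normal, patho = 53, 657
tp = round(0.7047 * patho)   # pathological samples detected correctly
tn = 0                       # no normal sample detected correctly
accuracy = 100.0 * (tp + tn) / (normal + patho)
print(f"{accuracy:.2f}%")    # close to the reported 65.26%
```

In other words, a classifier that misses every normal sample can still report a seemingly informative overall accuracy on such unbalanced data, which is why sensitivity and specificity should be reported separately.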
Although the SVD also records three vowels, they are recorded only once. In the AVPD, the three vowels are recorded with a repetition, as some studies recommended that more than one sample of the same vowel helps to model the intra-speaker variability [49, 50]. Another important characteristic of the AVPD is the total length of the recorded sample, which is 60 seconds, as described in Table 1. All recorded text in the AVPD is of the same length for normal as well as disordered subjects; in the MEEI database, the recording times for normal and pathological subjects are different. Moreover, the duration of the connected speech (a sentence) in the SVD database is only 2 seconds, which is too short and not sufficient to develop an automatic detection system based on connected speech; furthermore, a text-independent system cannot be built with the SVD database. The average length of the running speech (Al-Fateha) in the AVPD is 18 seconds, and it consists of seven sentences. Al-Fateha is segmented into two parts, as described in Section 2.5, so that it may be used to develop text-independent systems.

A comparison of the highest accuracies for the detection and classification of the AVPD and MEEI databases is depicted in Figure 4. It can be observed from Figure 4 that the highest accuracy for detection with the sustained vowel is 79.5% for the AVPD and 93.6% for the MEEI database.

CONCLUSION

The AVPD may be a key factor in the progress of speech pathology assessment in the Arab region. Dysphonic patients with five different types of organic voice disorders (cysts, nodules, polyps, paralysis, and sulci) were included in the database. The database contains repeated vowels, continuous speech, Arabic digits and some common words. Assessing the perceived severity of the voice disorder and recording isolated words are unique aspects of the AVPD. All subjects, including patients and normal subjects, were recorded after clinical assessment. Baseline results for the AVPD are provided by using different types of speech features and a range of machine learning algorithms, and the accuracy of detecting and classifying voice disorders is computed for sustained vowels and continuous speech. Comparing the obtained results with those of the English-language disorder database (MEEI), the classification results of the two were comparable, although significant differences were observed in disorder detection. The recognition results of the MEEI database were also significantly different from those of the German voice disorder database (SVD); the reason may lie in the different recording environments of the normal and pathological subjects in the MEEI database. Therefore, the various shortcomings of the SVD and MEEI databases were considered before recording the AVPD. In our opinion, the AVPD is important, and it deserves more research to grow as the MEEI and the SVD have; that is the exact reason why we chose to work with this database.
… AVPDs. The AVPD may be a key factor in progress in speech pathology assessment in the Arab region. Dysphonic patients with five different types of organic voice disorders (cysts, nodules, polyps, …