
Voice Disorder Detection Using the AVPD and Machine Learning

AFARTASS Soumia 1, REGRAGUI Youssef 1 and ZEGLAZI Zayd 2
1 Mohammed V University, FSR

* Corresponding authors: Afartasssoumia@gmail.com, regragui.youssef@um5r.ac.ma, zayd.zglz@gmail.com

Abstract

This paper investigates the use of Multi-Dimensional Voice Program (MDVP) parameters to automatically detect voice pathology in the Arabic Voice Pathology Database (AVPD). MDVP parameters are very popular among physicians and clinicians for detecting voice pathology; however, MDVP is a commercial software package. The AVPD is a newly developed speech database designed to suit a wide range of experiments in the field of automatic voice pathology detection, classification, and automatic speech recognition. This paper is a first step toward evaluating MDVP parameters on the AVPD using the sustained vowel /a/. The experimental results demonstrate that some of the acoustic features show an excellent ability to discriminate between normal and pathological voices. The overall best accuracy is 81.33%, obtained with an SVM classifier. A voice disorder database is an essential element in research on automatic voice disorder detection and classification. Ethnicity affects the voice characteristics of a person, so it is necessary to develop a database by collecting voice samples of the targeted ethnic group. This will enhance the chances of arriving at a global solution for the accurate and reliable diagnosis of voice disorders by understanding the characteristics of a local group. Motivated by this idea, the Arabic Voice Pathology Database (AVPD) was designed and developed in this study by recording three vowels, running speech, and isolated words. For each recorded sample, the perceptual severity is also provided, which is a unique aspect of the AVPD. During the development of the AVPD, the shortcomings of different voice disorder databases were identified so that they could be avoided in the AVPD. In addition, the AVPD is evaluated by using six different types of speech features and four types of machine learning algorithms. The results of detection and classification of voice disorders obtained with the sustained vowel and the running speech are also compared with the results of an English-language disorder database, the Massachusetts Eye and Ear Infirmary (MEEI) database.

Keywords: AVPD; machine learning; MDVP; MFCC; LPC

Introduction

The human voice is the sound that originates from a human being and is used in talking, singing, laughing, crying, and screaming in order to express feelings and to communicate with the community [1]. The voice is also considered a crucial tool in the lives of many professionals, with approximately 25% of the economically active population regarding their voice as an important instrument of their work [2]. The concept of "normal voice" is complicated, and there is no consensus on the subject. There is no single pattern of "normal voice", there are no defined limits of what is considered normal, nor a point from which it can be stated that a person has dysphonia [3]. When the voice changes negatively, it is said to be disturbed or dysphonic [4]. Dysphonia, therefore, can be defined as any difficulty or change in vocal emission that does not allow for natural voice production [5, 6], preventing momentary or permanent oral communication [7]. Thus, dysphonia causes damage to the individual, since the voice produced exhibits difficulties or limitations in fulfilling its basic role of transmitting verbal and emotional messages [6]. Dysphonia is a symptom, not a disease; it is a manifestation that is part of the speech disorder picture [3]. Dysphonia is the main symptom of oral communication disorders [6]. However, voice disorders are manifested beyond the dysphonic picture; the patient may experience difficulty in keeping his/her voice (asthenia), vocal fatigue, variation in habitual fundamental vocal frequency, hoarseness, lack of vocal volume and projection, loss of vocal efficiency, and low resistance when speaking [7].

Despite being one of the major forms of human expression and being used every day by most people, approximately 10% of the overall population presents with voice issues, and among voice professionals the share reaches 50% [8, 9]. Youngsters and adults are equally affected; however, the causes differ according to the age group.

Clinical voice pathology detection is carried out through numerous techniques, including acoustic analysis. It consists of estimating appropriate parameters extracted from the voice signal to assess any possible changes of the vocal tract, in line with the recommendations of the SIFEL protocol [10] (Società Italiana di Foniatria e Logopedia), developed by the Italian Society of Phoniatrics and Logopedics, following the directives of the Committee for Phoniatrics of the European Society of Laryngology. It is a non-invasive examination in clinical practice, complementary to other clinical tests, such as the laryngoscopic examination based on the direct observation of the vocal folds.

Numerous acoustic parameters are estimated to assess the state of health of the voice. Regrettably, the accuracy of these parameters in the detection of voice problems is frequently tied to the algorithms used to estimate them. For that reason, the main effort of researchers is oriented toward the study of acoustic parameters and the application of classification techniques able to achieve a high discrimination accuracy.
Currently, speech pathology research has focused on machine learning techniques.

The Arabic Voice Pathology Database (AVPD) will have a potential impact on the assessment of voice disorders in the Arab region. Race has been suggested to contribute to the perception of voice, with Walton and Orlikoff [1] showing, for example, that measures of amplitude and frequency perturbation in African-American adult males are not equal to those of white adult males. Additionally, Sapienza [2] analyzed the vowel /a/ in a group of 20 African Americans and 20 white Americans, finding that African-American males and females had higher mean fundamental frequencies and lower sound pressure levels, although the differences were not significant. This difference was partially attributed to the large ratio of the membranous to cartilaginous portion of the vocal folds and increased thickness, a finding previously reported by Boshoff [3]. Sapienza [2] did not examine other acoustic parameters for gender or racial differences. Walton and Orlikoff [1] found, through acoustical analysis, that African-American speakers had significantly greater amplitude perturbation measures and significantly lower harmonics-to-noise ratios than did white adult males. Although the former had a lower mean speaking fundamental frequency than the latter, the differences were not significant in the group of 50 subjects.

We observed that most researchers were using standard databases such as the Massachusetts Eye and Ear Infirmary (MEEI) database and the Saarbruecken Voice Database (SVD); this is why we decided to use the Arabic Voice Pathology Database (AVPD), in order to enlarge interest in Arabic voices and in identifying disordered voices at the level of the Arabic language, a language that gives access to a wide range of scientific, artistic, and cultural fields.

This paper is organized as follows: an overview of current studies and some related works is presented in Section 2. In Section 3, we describe our proposed voice pathology detection based on the AVPD database. We present the experimental results in Section 4. Finally, we present our conclusions and directions for future research in Section 5.

RELATED WORK

The results of the published studies vary significantly due to the differences among the data sets used in the experiments. According to Martínez et al. [24], the accuracy obtained using 200 recordings of the sustained vowel /a/ represents a high value, and their setup is very close to our study. Other studies used the combination of the vowels /a/, /i/, and /u/ to obtain high accuracy and did not focus on the pathology causes. In the study by Souissi et al. [25], a high accuracy of 87.82% was achieved using a subset of four types of voice pathologies out of 71. Also, Al-Nasheri et al. [38, 26] achieved an accuracy of 99.68% by using a subset of pathologies to conduct experiments on recordings that also appear in other data sets, including the Arabic Voice Pathology Database (AVPD) and the Massachusetts Eye and Ear Infirmary (MEEI) database. Another study, carried out by Muhammad et al. [37], used a subset of three types of voice pathologies and achieved an accuracy of 93.20%. In addition, they combined the voice data with an electroglottographic signal to increase the accuracy to 99.98%. However, in another study conducted by Hemmerling et al. [27], a high accuracy of 100% was achieved in the detection task by their method when male and female speakers were separated. The study by Hammami et al. [28] assessed the performance of higher-order statistics features extracted from the wavelet domain to discriminate between normal and pathological voices. Conventional features, including the mean wavelet value, mean wavelet energy, and mean wavelet entropy, were also used in the experiments. These features, combined with an SVM classifier, reached a best accuracy of 99.26% in the detection step and 100% when classifying the data. In order to include concrete clinically validated values, a clinical evaluation was carried out on data gathered from subjects at a medical center in Tunisia. The outcomes were good, and the accuracies were 94.82% and 94.44% for detection and classification, respectively. Fonseca et al. [29] worked on the detection of coexisting laryngeal disorders whose predominant phonic symptom is the same, producing features with noteworthy inter-class overlap. Based on the combination of SE, ZCR, and SH, all applied for feature extraction, together with the DPM used for classification, the proposed technique was successfully concluded, effectively dealing with ambiguities and inconsistencies with an estimated accuracy of 95%. A continuing challenge of dysphonic voice research, as noted by Rueda and Krishnan [30], is the small size of the available databases. It is very difficult to apply more advanced deep learning techniques without underfitting or overfitting. They proposed an adaptive approach that decomposes a signal's components using the Fourier-based synchrosqueezing transform (FSST) for data augmentation and transformation; the resulting time-frequency (TF) representation becomes the input to a CNN.

It is clear that each voice disorder produces distinct frequencies depending on the type of voice disorder and its location on the vocal folds. Thus, monitoring the frequency bands is highly important in order to assess which one contributes more to the detection and classification of voice disorders. For instance, Pouchoulin et al. [31] stated that lower frequencies (below 3000 Hz) are more informative for recognizing dysphonic voices than higher frequencies. Moreover, Fraile et al. [32] verified that the behavior of dysphonic voice signals is considerably less stable in the frequency region between 2000 and 6400 Hz than in the other frequency regions. To allow comparison with the literature, they also analysed voice recordings of sustained phonation of the vowel /a/. However, in contrast to past studies, we analyze a larger database collected from the SVD [34]. Furthermore, to propose systems capable of effective voice pathology detection and classification, we do not confine the database to a subset of popular voice pathologies; in this study, the database consists of a large number of pathologies with few recordings each. As we found in the related works, apart from previous work [34], no other studies have applied deep learning techniques for voice pathology identification. In the following sections we make use of a robust voice pathology identification model based on acoustic feature extraction. We perform voice pathology detection and identification using a CNN, and we apply transfer learning with modern, effective CNN models; in particular, the ResNet34 model was used. To address the problem of the uneven distribution of voice pathologies with few recordings in the data sets, we also explore the use of anomaly detection methods.

Materials and methods

DATA

A voice disorder database is an essential element in an automatic voice disorder detection (AVDD) system. The dataset consists of voice recordings of both normal and pathological voices. The recordings can contain either sustained vowel phonation or continuous speech. In our paper, we have observed that most researchers have used standard databases, such as the Massachusetts Eye and Ear Infirmary (MEEI) database, the Saarbruecken Voice Database (SVD), and the Arabic Voice Pathology Database (AVPD). Since MEEI and SVD are used the most, as Figure 1 shows, we chose to use the AVPD database.

Figure 1: This pie chart represents the number of studies per database [59].

This section describes the steps of designing and developing the AVPD, includes an overview of the recorded text, and provides the statistics of the database. Moreover, the segmentation and verification processes are also discussed.

Arabic Voice Pathology Database

Video-Laryngeal Stroboscopic Examination.
KayPENTAX's video-laryngeal stroboscopic system (Model 9200C2) was used in the examination, including a 70° rigid endoscope, a 3CCD Toshiba camera, a Sony LCD monitor, and a light source (Model RLS 9100B). Clinical diagnosis and classification of voice disorders were decided based on the laryngoscopic examination. Two experienced phoniatricians were responsible for the clinical diagnosis and classification of voice disorders. In case of an unclear diagnosis, the two examiners reviewed the recorded video-laryngeal examinations and a consensus decision about the clinical diagnosis was obtained.

Recording Equipment and Protocol
The AVPD recorded normal and disordered subjects using the Computerized Speech Lab Model 4500 (CSL 4500), a product of KayPENTAX (Montvale, NJ, USA). All subjects were recorded by experienced clinicians in a sound-treated room in the Communication and Swallowing Disorders Unit of King Abdulaziz University Hospital. The samples were recorded at a sampling frequency of 48 kHz and a bit rate of 16 bits. All recordings were made with a fixed distance of 15 cm between the mouth and the microphone and were saved in two different audio formats. Five organic voice disorders (vocal cord cysts, nodules, paralysis, polyps, and sulci) were considered in the AVPD. In addition, all healthy subjects were recorded after clinical assessment to ensure that they did not suffer from any vocal impairments and that they had no vocal complications in the past.

Information on subjects' gender, age, and smoking habits was also collected, and each subject signed a document expressing their consent and certifying that they had no objection to the use of their recorded samples for research purposes. In addition, the perceived severity of impaired speech quality was scored on a scale of 1 to 3, with 1 representing mild, 2 moderate, and 3 severe speech quality impairment. In the AVPD, a large amount of text is recorded for each subject, which is explained in the following subsections.

Recording Text:
Three types of text, including three vowels, isolated words, and running speech, were considered during the development of the AVPD. The text was compiled in a way that ensured that it was simple and short, and at the same time covered all the Arabic phonemes. The first type of text was the three vowels, fatha /a/, damma /u/, and kasra /i/, which were recorded with a repetition, including onset and offset information. The second type of text involved isolated words, including the Arabic digits from zero to ten and some common words (see Tables 1 and 3). The third type of text was running speech: the continuous speech was taken from the first chapter of the Quran, called Al-Fateha, and it is given in Table 2. One of the reasons behind the selection of this religious text is that most of the visitors to our voice disorder unit are illiterate; every Muslim memorizes Al-Fateha by heart.

Table 1: Arabic digits with international phonetic alphabets (IPAs) and English translation.

Arabic digit    English translation    IPA
sifr            Zero                   /s/, /i/, /f/, /r/
wahid           One                    /w/, /a/, /ħ/, /i/, /d/
ithnayn         Two                    /a/, /th/, /n/, /a/, /y/, /n/
thalatha        Three                  /th/, /a/, /l/, /a/, /th/, /a/
arba'a          Four                   /a/, /r/, /b/, /ʕ/, /a/
khamsa          Five                   /kh/, /a/, /m/, /s/, /a/
sitta           Six                    /s/, /i/, /t/, /t/, /a/
sab'a           Seven                  /s/, /a/, /b/, /ʕ/, /a/
thamaniya       Eight                  /th/, /a/, /m/, /a/, /n/, /y/, /a/
tis'a           Nine                   /t/, /i/, /s/, /ʕ/, /a/
'ashara         Ten                    /ʕ/, /a/, /sh/, /a/, /r/, /a/

The other reason is the duration of Al-Fateha, which is about 20 seconds; this is longer than the duration of the running speech of the MEEI database (9 seconds) and of the SVD database (2 seconds).

Table 2: Text from Al-Fateha with English translation.

Sentence number    English translation
1                  Praise be to God, Lord of all the worlds
2                  The Compassionate, the Merciful
3                  Ruler on the Day of Reckoning
4                  You alone do we worship, and You alone do we ask for help
5                  Guide us on the straight path
6                  The path of those who have received your grace
7                  Not the path of those who have brought down wrath, nor of those who wander astray

The Arabic digits and Al-Fateha covered all the Arabic letters except three: ẓāʾ (ظ), ghayn (غ), and jīm (ج). Therefore, some common words were included in the text to cover these omissions. These words were ظرف (envelope), غزال (deer), and جمل (camel), as mentioned in Table 3. The number of occurrences of each Arabic letter in the recorded text is given in Table 4. For illiterate patients, we showed pictures of the envelope, the deer, and the camel to record these words.

Table 3: Common words with IPAs and English translation.

Common word    English translation    IPA
ظرف            Envelope               /z/, /a/, /r/, /f/
غزال           Deer                   /gh/, /a/, /z/, /a/, /l/
جمل            Camel                  /j/, /a/, /m/, /a/, /l/

Table 4: Number of occurrences of each Arabic letter in the recorded text (first nine letters shown).

Letter        Number of occurrences
alif (ا)      30
bāʾ (ب)       5
tāʾ (ت)       5
thāʾ (ث)      4
jīm (ج)       1
ḥāʾ (ح)       4
khāʾ (خ)      1
dāl (د)       5
dhāl (ذ)      1

Statistics.
Overall, 366 samples of normal and pathological subjects are recorded in the AVPD. Normal subjects make up 51% of the total subjects, and the remaining subjects are distributed among five voice disorders: sulcus 11%, nodules 5%, cysts 7%, paralysis 14%, and polyps 11% (Figure 2(a)). Among the 51% of normal subjects (188 samples), there are 116 male and 82 female speakers. In addition, the numbers of pathological male and female patients, respectively, are as follows for the different disorders: sulcus 20 and 22, nodules 18 and 2, cysts 17 and 7, paralysis 31 and 21, and polyps 18 and 22 (Figure 2(b)). The inner ring in Figure 2(b) represents the number of female subjects, while the outer ring shows the number of male subjects.

Figure 2: (a) Distribution of normal and voice disorder subjects in the AVPD. (b) Number of male and female samples for each disorder and for normal subjects [60].

Approximately 60% of the subjects in the AVPD are male, while 40% are female. Information about the mean age (in years) of the recorded subjects, with standard deviation (STD), is provided in Figure 3. The average age ± STD of male subjects who are normal or suffering from sulcus, nodules, cysts, paralysis, or polyps is 27 ± 10, 35 ± 13, 12 ± 2, 25 ± 18, 46 ± 15, and 48 ± 10 years, respectively, while for female subjects it is 22 ± 5, 32 ± 14, 35 ± 12, 35 ± 17, 36 ± 14, and 32 ± 10 years, respectively. A consent form is signed by each normal and disordered subject before the recording of his/her voice sample. In the consent form, each subject testified that his/her participation is completely voluntary and that their decision will not affect the medical care they receive.

Figure 3: Age distribution of male and female subjects in the AVPD [60].

Segmentation of Recorded Samples
Recorded samples were divided into the following 22 segments: six segments for the vowels (three vowels plus their repetition), 11 segments for the Arabic digits (zero to ten), two segments for Al-Fateha (divided in this manner so that the first part may be used to train the system and the second part to test the system), and three segments for the common words. The first part of Al-Fateha starts at sentence number 1 and ends at sentence 4, while the second part contains the last three sentences.

Each of the 22 segments was stored in a separate wav file. The segmentation was performed with the help of the Praat software [28] by labeling the start and end time of each segment. Then, these two times were used to extract a segment from a recorded sample. Once each recorded sample was divided into segments and stored into 22 wav files, the next step was the verification process, which ensured that each segmented wav file consisted of a complete utterance. During the verification process, we encountered three types of errors when the segments were extracted, as described in Table 5.

A record of the errors was maintained in an Excel sheet, where the 22 segments were listed along the columns and the recorded samples were listed along the rows. If a segment had any of the above errors, then i, m, or d was noted under that segment. The next step was the correction of these errors by updating the start and end times of the segments, because these errors occur due to incorrect labeling of these two times. After updating the times, the erroneous segments were extracted again by using the updated time information. All tasks associated with the segmentation of the AVPD are presented and described in Table 6.

Table 6: Tasks for the AVPD.
Task 1, Time labeling: label the start and end times of the recorded vowels, digits, Al-Fateha, and common words.
Task 2, Extraction: using the start and end times, the recorded vowels, digits, Al-Fateha, and common words are extracted and stored in new wav files.
Task 3, Verification: verification of the extracted vowels, digits, Al-Fateha, and common words.
Task 4, Repeat time labeling: update the start and end times of the erroneous segments.
Task 5, Repeat extraction: extract the segments again using the updated times.
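As a rough illustration of the time-labeling and extraction tasks in Table 6, the sketch below cuts segments out of a recording given start/end times exported from Praat. The file names, the tab-separated label format, and the soundfile package are illustrative assumptions, not part of the AVPD tooling.

```python
# Sketch of Tasks 1-2 (and Task 5 when re-run with corrected times).
# Assumes a plain "label<TAB>start<TAB>end" file with times in seconds.
import soundfile as sf

def extract_segments(wav_path, label_path, out_prefix):
    audio, sr = sf.read(wav_path)                    # full recording (48 kHz in the AVPD)
    with open(label_path) as labels:
        for line in labels:
            label, start, end = line.strip().split("\t")
            begin, stop = int(float(start) * sr), int(float(end) * sr)
            # each of the 22 segments is stored in its own wav file
            sf.write(f"{out_prefix}_{label}.wav", audio[begin:stop], sr)

extract_segments("subject_001.wav", "subject_001_times.txt", "subject_001")
```

Re-running the same function after correcting the start and end times of the erroneous segments implements the "repeat extraction" task.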

Evaluation of the AVPD:
To evaluate the AVPD, various speech disorder detection and classification experiments were performed by implementing an automatic scoring system. The same experiments were performed with the Massachusetts Eye and Ear Infirmary (MEEI) database in order to compare its results with those of the AVPD. The automatic scoring system consists of two main modules: the first module is speech feature extraction, and the second module is pattern matching, which is implemented using various machine learning techniques.

Feature Extraction Techniques
Several speech feature extraction algorithms (MFCC, LPC, LPCC, PLP, RASTA-PLP, and MDVP) were implemented in this module of the automatic assessment system. Before the extraction of features, the speech signal was divided into frames of 20 milliseconds, which made the analysis easier because speech changes quickly over time. The MFCC mimics human auditory perception, while the LPC and the LPCC mimic the human speech production system. The PLP and the RASTA-PLP simulate, to some extent, both the auditory and the production mechanisms.

In the MFCC [17, 18], the time-domain speech signal was converted into a frequency-domain signal, which was filtered by applying a set of band-pass filters. The center frequencies of the filters were spaced on the Mel scale and the bandwidths corresponded to the critical bandwidths of the human auditory system. The Mel-scale mapping is given by (1), where f is the frequency in Hz and m represents the corresponding frequency on the Mel scale. In this study, 29 Mel-scale filters are used. Later, a discrete cosine transform was applied to the filtered outputs to compress and decorrelate them.

$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$    (1)
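A minimal sketch of the MFCC computation described above (20 ms frames, 29 Mel filters, 12 coefficients); librosa and the file name are assumptions used only for illustration.

```python
# Sketch of the MFCC pipeline: Mel-spaced filter bank (eq. (1)) + DCT.
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # Eq. (1): Mel mapping used to place the filter centre frequencies.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

y, sr = librosa.load("sustained_a.wav", sr=None)      # hypothetical file name
frame = int(0.020 * sr)                               # 20 ms analysis frames
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,    # 12 static coefficients
                            n_mels=29,                # 29 Mel-scale filters
                            n_fft=frame, hop_length=frame // 2)
# The DCT inside librosa.feature.mfcc compresses and decorrelates
# the log filter-bank outputs, as described in the text.
```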

Errors in the segments Abbreviation Description Examples

Incomplete i When some part of the extracted (a) “d” is missing in wahid(b) “w” is missing
text is missing at the start or end in wahid(c) Both “w” and “d” are missing
More m When a segment contains some part (a) Segment of Sifar also contains “w” of
of the next or previous segment next segment wahid
(b) Segment of Ithnayn also contains “d”
of previous segment wahid

Different d When the text in a segment is other Segment contains wahid instead of sifar
than the expected one

Table 5 Description of errors encountered during the verification process

(LP) analysis was performed. The LP analysis applies reverse 13

filtering on speech signals to remove the effects of formants in The center frequency of the jth critical band is represented by fj in 14

order to estimate the source signal [29]. For LP analysis of order (4). Furthermore, the intensity loudness power law of hearing is 15

P, the current sample of a source signal can be estimated by using used to simulate the nonlinear relationship between the intensity 16

P previous samples by using: of sound and perceived loudness [34]. The extraction process 17

of RASTA-PLP is the same as PLP, except that the RASTA filter 18


p
given by (5) is applied after the critical bandwidth phenomena
xlr ∑ a i xr −i (2)
19
= (2)
i =1
to remove the effect of constant and slowly varying parts [22]. 20

where, x1, x2, x3,. . . , xr are samples of original speech signal and 0.2 + 0.1z−1 − 0.1z−3 − 0.2z−4
2
R ( z ) = z4 × (7) (7)
3 ai’s represent the required LPC features. To get accurate LPC 1 − 0.94z−1
4 features, it is necessary to reduce the error E between the current 21

5 and estimated sample. This can be done by substituting the 22

6 first-order derivative of E equal to zero and solve the resulting In all types of experiments, static as well as delta and delta- 23

7 equations by using the Levinson-Durbin algorithm [30]. More- delta features were considered. The delta and delta-delta coeffi- 24

8 over, the LPCC features are calculated by using the recursive cients were computed by taking the first-order and second-order 25

9 relation [31] given in (3), where 2 is the gain in LP analysis, P derivatives of static features, respectively. The derivative was 26

10 is the order of the LP analysis, an are LPC features, and cn are calculated by taking the linear regression with a window size of 27

11 obtained LPCC features. In this study, we performed LP analysis five elements. All experiments for MFCC, LPCC, and RASTA- 28

12 with P = 11. PLP were conducted using 12 features (static), 24 features (12 29

static and 12 delta), and 36 features (12 static, 12 delta, and 12 30
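The LPC and LPCC computations in (2)-(5) can be sketched as follows, with P = 11 as in the text. This is a from-scratch illustration (Levinson-Durbin plus the cepstral recursion), not the toolkit actually used in the study.

```python
import numpy as np

def lpc_lpcc(frame, P=11):
    """Sketch of eqs. (2)-(5): LPC via Levinson-Durbin, then the LPCC recursion.
    Sign convention: x_hat[r] = sum_i a[i] * x[r-i], as in eq. (2)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + P]
    a = np.zeros(P + 1)            # a[0] unused; a[1..P] are the LPC features
    E = r[0]                       # prediction error energy (gain sigma^2)
    for i in range(1, P + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    # LPCC recursion, eqs. (3)-(5); indices are kept 1-based for clarity.
    N = 2 * P                      # number of cepstral coefficients to keep
    c = np.zeros(N + 1)
    c[1] = np.log(E)               # eq. (3)
    for n in range(2, N + 1):
        acc = sum((k / n) * c[k] * a[n - k] for k in range(max(1, n - P), n))
        c[n] = (a[n] if n <= P else 0.0) + acc
    return a[1:], c[1:]
```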

The extraction of PLP features depends on three psychoacoustic principles of hearing [32]: (1) the critical bandwidth, (2) the equal-loudness hearing curve, and (3) the intensity-loudness power law of hearing. The critical bandwidths are computed by applying the Bark scale proposed by Zwicker [33]. The sensitivity of the human auditory mechanism to different frequencies differs at the same sound intensity. Therefore, each critical band is multiplied by an equal-loudness weight. The weight for the jth critical band is computed as

$w_j = \dfrac{(f_j^2 + 1.44 \times 10^6)\, f_j^4}{(f_j^2 + 1.6 \times 10^5)^2 \,(f_j^2 + 9.61 \times 10^6)}$    (6)

The center frequency of the jth critical band is represented by $f_j$ in (6). Furthermore, the intensity-loudness power law of hearing is used to simulate the nonlinear relationship between the intensity of sound and the perceived loudness [34]. The extraction process of RASTA-PLP is the same as that of PLP, except that the RASTA filter given by (7) is applied after the critical-band analysis to remove the effect of constant and slowly varying components [22]:

$R(z) = z^4 \times \dfrac{0.2 + 0.1 z^{-1} - 0.1 z^{-3} - 0.2 z^{-4}}{1 - 0.94 z^{-1}}$    (7)

In all types of experiments, static as well as delta and delta-delta features were considered. The delta and delta-delta coefficients were computed by taking the first-order and second-order derivatives of the static features, respectively. The derivative was calculated by linear regression over a window of five frames. All experiments for MFCC, LPCC, and RASTA-PLP were conducted using 12 features (static), 24 features (12 static and 12 delta), and 36 features (12 static, 12 delta, and 12 delta-delta). For LPC and PLP, all experiments were performed by using only the 12 static features.

In addition, 22 acoustic parameters were also extracted from each normal and pathological sample. These 22 parameters are defined in Table 1 of [35], and they were extracted by using the MDVP software [5]. This software is frequently used for the objective assessment of voice disorders in clinics.
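Two of the post-processing steps described above, the RASTA filtering of (7) applied along the time trajectories of the critical-band energies and the delta computation by linear regression over a five-frame window, can be sketched as follows. numpy/scipy and the function names are illustrative assumptions.

```python
# Sketch of two post-processing steps: the RASTA filter of eq. (7) and
# delta coefficients by linear regression over five frames.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_band_energies):
    # Eq. (7): numerator 0.2 + 0.1 z^-1 - 0.1 z^-3 - 0.2 z^-4, pole at 0.94.
    # The leading z^4 factor is a pure time advance and is ignored here.
    b = [0.2, 0.1, 0.0, -0.1, -0.2]
    a = [1.0, -0.94]
    return lfilter(b, a, log_band_energies, axis=-1)   # filter along time

def deltas(feats, half_win=2):
    # Linear-regression deltas over 2*half_win+1 = 5 frames; feats: (dim, T).
    padded = np.pad(feats, ((0, 0), (half_win, half_win)), mode="edge")
    num = np.zeros_like(feats, dtype=float)
    T = feats.shape[1]
    for n in range(1, half_win + 1):
        num += n * (padded[:, half_win + n:half_win + n + T]
                    - padded[:, half_win - n:half_win - n + T])
    return num / (2.0 * sum(n * n for n in range(1, half_win + 1)))

# Delta-delta features are obtained by applying deltas() to the delta features.
```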

Pattern Matching
The computed features are multidimensional and their interpretation is not easy for the human mind. Therefore, a pattern-matching phase becomes important in such situations in order to determine the trend in the data [36]. In this study, pattern matching was performed by using different machine learning techniques, which have performed better than statistical approaches in different areas [37, 38]. Machine learning techniques do not make strict assumptions about the data but instead learn to represent complex relationships in a data-driven manner [39]. In this module, various machine learning techniques (e.g., SVM [26], VQ [27], GMM [23], and HMM [24, 25]) were implemented for the automatic detection and classification of voice disorders. SVM was implemented with linear and RBF kernels, GMM was implemented using 2, 4, 8, 16, and 32 mixtures, VQ used 2, 4, 8, 16, and 32 codebooks to generate acoustic models, and HMM was applied by using five states with 2, 4, and 6 mixtures per state.
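A hedged sketch of the pattern-matching module follows: an SVM with a linear or RBF kernel for detection, and one GMM per disorder class scored by log-likelihood for classification. Only the model types and their settings (kernels, mixture counts) come from the text; the scikit-learn API, variable names, and label encoding are illustrative assumptions.

```python
# Illustrative pattern-matching sketch (not the study's original code).
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

def train_svm_detector(features, is_pathological, kernel="rbf"):
    # SVM with linear or RBF kernel for the two-class detection problem.
    return SVC(kernel=kernel).fit(features, is_pathological)

def train_gmm_classifier(features, disorder_labels, n_mixtures=8):
    # One GMM per disorder class; the study used 2, 4, 8, 16, or 32 mixtures.
    return {c: GaussianMixture(n_components=n_mixtures, random_state=0)
               .fit(features[disorder_labels == c])
            for c in np.unique(disorder_labels)}

def classify(gmm_models, sample):
    # Pick the disorder whose GMM assigns the highest log-likelihood.
    return max(gmm_models,
               key=lambda c: gmm_models[c].score(sample.reshape(1, -1)))
```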



Detection and Classification Results for the AVPD
Experiments for detection determine whether an unknown test sample is normal or disordered. This is a two-class problem: the first class consists of all normal samples, and the second class contains samples of all types of disorder. During the classification of disorders, the objective is to determine the type of voice disorder. The classification of voice disorders is a many-class problem, and the number of classes depends upon the number of types of voice disorder.

All voice samples of the MEEI and the AVPD were downsampled to 25 kHz, and each speech signal was divided into frames of 20 milliseconds with 50% overlap with the previous frame. To avoid bias in the training and testing samples, all experiments were performed using a fivefold cross-validation approach. In this approach, all samples were divided into five disjoint testing sets. Each time one of the sets was used to test the system, the remaining four were used to train the system. The accuracy is computed as

$\text{Accuracy}(\%) = \dfrac{\text{Total correctly detected samples}}{\text{Total number of samples}} \times 100$    (8)
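The fivefold cross-validation protocol and the accuracy measure in (8) can be sketched as follows; StratifiedKFold and the SVM used here are illustrative choices, not a claim about the exact implementation used in the study.

```python
# Fivefold cross-validation with accuracy as in eq. (8): each fold is used
# once for testing while the remaining four folds train the system.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def fivefold_accuracy(features, labels):
    correct, total = 0, 0
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(features, labels):
        model = SVC(kernel="rbf").fit(features[train_idx], labels[train_idx])
        predictions = model.predict(features[test_idx])
        correct += int(np.sum(predictions == labels[test_idx]))
        total += len(test_idx)
    return 100.0 * correct / total   # eq. (8): correctly detected / total samples
```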
Table 7: Overall best accuracies (%) for the sustained vowel (/AH/) and running speech (A-F, Al-Fateha) by using the AVPD.

Features    Experiment       SVM /AH/   SVM A-F   GMM /AH/   GMM A-F   VQ /AH/   VQ A-F   HMM /AH/   HMM A-F
MFCC        Detection        76.5       77.4      74.4       77.1      70.3      71.1     71.6       78.1
            Classification   89.2       89.2      88.9       89.5      75.3      81.6     88.7       90.9
LPCC        Detection        60.1       76.5      54.5       76.7      70.3      75.9     73.5       71.5
            Classification   67.6       84.7      75.4       86.0      75.5      77.9     59.0       86.0
RASTA-PLP   Detection        77.0       76.7      72.8       74.5      67.1      75.0     66.3       79.0
            Classification   92.9       90.2      91.3       91.2      88.9      90.3     88.7       92.7
LPC         Detection        62.3       71.6      53.7       71.9      70.7      71.5     71.4       62.3
            Classification   66.3       82.4      74.6       79.7      78.6      75.3     85.9       75.9
PLP         Detection        75.8       79.1      73.2       78.5      72.0      78.1     73.6       81.6
            Classification   91.5       90.1      88.9       91.2      79.4      77.2     88.7       85.8
MDVP        Detection        79.5       —         69.8       —         64.8      —        —          —
            Classification   82.3       —         —          —         —         —        —          —

Only the overall best accuracies (%) of voice disorder detection and classification for all types of feature extraction and machine learning techniques are presented in Table 7. In Table 7, "—" indicates that the corresponding experiment is not applicable. Among all feature extraction techniques, the maximum obtained detection rate with the sustained vowel is 79.5%; this maximum detection rate is achieved with MDVP by using SVM. For running speech, the maximum detection rate is 81.6%, which is obtained by using PLP and HMM. Similarly, the maximum accuracy for the classification of voice disorders is 92.9% for sustained vowels, obtained with RASTA-PLP by using SVM. Furthermore, in the case of running speech, the maximum classification accuracy is 92.72%, which is obtained with RASTA-PLP and HMM.

Detection and Classification Results for the MEEI
All experiments performed for the AVPD were also performed with the MEEI database in order to make a comparison between the results. The experimental setup for the MEEI database is the same as the one used for the AVPD. The MEEI database was recorded by the Massachusetts Eye and Ear Infirmary voice and speech laboratory [40]. A subset of the database that has been used in a number of studies was considered for the experiments in this study [9, 36, 41-44]. The subset contained 53 normal subjects and 173 samples of disordered subjects suffering from adductor spasmodic dysphonia, nodules, keratosis, polyps, and paralysis. The detection and classification accuracies for the MEEI database are presented in Table 8. The maximum obtained detection accuracy for the MEEI database with the sustained vowel is 93.6%, which is obtained by using MFCC and RASTA-PLP with SVM. The maximum accuracy for running speech is 98.7%, obtained by using LPC and GMM. Similarly, for the classification of disorders, the maximum obtained accuracy with the sustained vowel is 98.2%, achieved with LPCC and PLP with VQ. The classification of disorders with running speech obtained an accuracy of 97.3% by using SVM with all types of speech features.

Table 8: Overall best accuracies (%) for the sustained vowel (/AH/) and running speech (A-F, Al-Fateha) by using the MEEI database.

Features    Experiment       SVM /AH/   SVM A-F   GMM /AH/   GMM A-F   VQ /AH/   VQ A-F   HMM /AH/   HMM A-F
MFCC        Detection        93.6       97.4      91.6       97.3      90.3      96.0     88.9       98.3
            Classification   95.4       97.3      97.3       97.3      96.3      97.3     87.5       88.9
LPCC        Detection        91.0       97.9      90.7       96.4      83.2      97.8     87.6       98.2
            Classification   95.4       97.3      97.3       97.3      98.2      97.3     87.5       97.3
RASTA-PLP   Detection        93.6       98.0      91.6       98.1      84.1      96.4     88.9       98.1
            Classification   95.5       97.3      97.3       97.3      97.3      96.3     85.2       84.6
LPC         Detection        82.9       96.0      83.2       98.7      78.3      97.3     80.1       96.3
            Classification   95.2       97.3      97.3       97.3      97.3      94.4     75.0       82.5
PLP         Detection        87.8       96.8      91.2       97.8      89.4      97.8     87.4       96.3
            Classification   95.0       97.3      97.3       97.3      98.2      94.4     61.1       84.6
MDVP        Detection        89.5       —         88.3       —         68.3      —        —          —
            Classification   88.9       —         —          —         —         —        —          —

Discussion
Ethnicity influences the voice characteristics of people, as concluded by Walton and Orlikoff [1] and Malki et al. [4]. Therefore, the development of the AVPD was a good initiative, and it will contribute to the area of pathology assessment, especially in the Arab region. The AVPD is compared with a German voice disorder database (SVD) and an English voice disorder database (MEEI) across different aspects in Table 9. The SVD and MEEI databases are the only two publicly available voice disorder databases.

During the development of the AVPD, different shortcomings of the SVD and MEEI databases were avoided. One drawback of a sustained phonation is a loss of signal-to-noise ratio information, because a complete recording, including silence at the start and end of the recording, is necessary for its computation. Automatic systems are sometimes unable to differentiate between normal and mildly severe pathological subjects. This is the reason why perceptual severity is also considered in the AVPD and rated on a scale of 1 to 3, where 3 represents a voice disorder with a high severity. Furthermore, the normal subjects in the AVPD were recorded after clinical evaluation under the same conditions as those used for the pathological

Table 9: Comparison of the AVPD with the two publicly available voice disorder databases.

(1) Language: MEEI: English; AVPD: Arabic; SVD: German.
(2) Recording location: MEEI: Massachusetts Eye and Ear Infirmary (MEEI) voice and speech laboratory, USA; AVPD: Communication and Swallowing Disorders Unit, King Abdulaziz University Hospital, Saudi Arabia; SVD: Saarland University, Germany.
(3) Sampling frequency: MEEI: samples are recorded at different sampling frequencies (10 kHz, 25 kHz, and 50 kHz); AVPD: all samples are recorded at the same frequency (48 kHz); SVD: all samples are recorded at the same frequency (50 kHz).
(4) Extension of recorded samples: MEEI: recorded samples are stored in .NSP format only; AVPD: recorded samples are stored in .wav and .nsp formats; SVD: recorded samples are stored in .wav and .nsp formats.
(5) Recorded text: MEEI: (i) vowel /a/, (ii) the Rainbow passage; AVPD: (i) vowel /a/, (ii) vowel /i/, (iii) vowel /u/, (iv) Al-Fateha (running speech), (v) Arabic digits, (vi) common words (all vowels are recorded with a repetition); SVD: (i) vowel /a/, (ii) vowel /i/, (iii) vowel /u/, (iv) a sentence.

subjects. In the MEEI database, the normal subjects were not clinically evaluated, although they did not have any history of voice complications [44]. In the SVD database, no such information is mentioned.

The AVPD has a balance between the numbers of normal and pathological subjects. Normal subjects are 51% of the total subjects in the AVPD. On the other hand, the percentages of normal subjects in the MEEI and SVD databases are 7% and 33%, respectively. The number of normal subjects in the MEEI database compared with pathological subjects is alarming: the numbers of normal and pathological samples in the MEEI database are 7% and 93%, respectively. As a result, an automatic system for disorder detection based on the MEEI database may be biased and cannot provide reliable results. For example, Dibazar et al. [46] obtained a classification accuracy of 65.26% when MFCC features were used with the nearest mean classifier. The numbers of normal and pathological samples used in that study were 53 and 657, respectively, taken from the MEEI database. The interpretation of this result (an accuracy of 65.26%) becomes difficult when the data are unbalanced, because it cannot be determined how many normal and pathological samples are detected correctly by the system. One of the many possibilities may be that the specificity is 0% and the sensitivity is 70.47%. Another possibility may be that the specificity is 100% and the sensitivity is 62.40%. Specificity is the ratio between correctly detected normal samples and the total number of normal samples, and sensitivity is the ratio between correctly detected pathological samples and the total number of pathological samples [47]. The problem occurs due to the imbalance between normal and pathological data. Therefore, Arjmandi et al. used 50 normal and 50 pathological samples to establish a balance between normal and pathological subjects in their study [35]. Unfortunately, this significantly limited the total sample number, which may have affected the reliability of the results obtained in the study.
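To make the imbalance argument concrete, both possibilities reproduce essentially the same overall accuracy on the 53 normal / 657 pathological subset; the following check is ours, while the counts and percentages come from the text:

$\text{Specificity} = \dfrac{\text{correctly detected normal}}{53}, \qquad \text{Sensitivity} = \dfrac{\text{correctly detected pathological}}{657}$

$\text{Case 1: } \dfrac{0 + 0.7047 \times 657}{710} \approx 0.652, \qquad \text{Case 2: } \dfrac{53 + 0.6240 \times 657}{710} \approx 0.652$

Both extreme cases yield roughly the reported 65.26% overall accuracy, which is why accuracy alone is hard to interpret on unbalanced data.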
Unlike the MEEI database, it is assured that all normal and pathological samples are recorded at a single sampling frequency in the AVPD. This is important because Deliyski et al. concluded that the sampling frequency influences the accuracy and reliability of acoustic analysis [48]. In addition, the MEEI database contains one vowel, whereas the AVPD records three vowels.

Although the SVD also records three vowels, they are recorded only once. In the AVPD, the three vowels are recorded with a repetition, as some studies recommended that more than one sample of the same vowel helps to model the intra-speaker variability [49, 50]. Another important characteristic of the AVPD is the total length of the recorded sample, which is 60 seconds, as described in Table 1. All recorded text in the AVPD is of the same length for normal as well as disordered subjects. In the MEEI database, the recording times for normal and pathological subjects are different. Moreover, the duration of connected speech (a sentence) in the SVD database is only 2 seconds, which is too short and not sufficient to develop an automatic detection system based on connected speech. Furthermore, a text-independent system cannot be built with the SVD database. The average length of the running speech (Al-Fateha) in the AVPD is 18 seconds, and it consists of seven sentences. Al-Fateha is segmented into two parts, as described in the segmentation subsection, so that it may be used to develop text-independent systems.

A comparison of the highest accuracies for the detection and classification of the AVPD and MEEI databases is depicted in Figure 4. It can be observed from Figure 4 that the highest accuracy for detection with the sustained vowel is 79.5% for the AVPD and 93.6% for the MEEI database, and the maximum accuracy for detection with running speech is 81.6% for the AVPD and 98.7% for the MEEI database. There is a significant difference between the accuracies of the MEEI database and the AVPD: 14.1% for sustained vowels and 17.1% for running speech. The same kind of trend for accuracy is observed in the study by [51], where a difference of 20% was observed between accuracies; in another comparison, a significant difference of 15% was observed. The reason for the difference might be the recording environments of the MEEI database, as [53] mentions that "Normal and pathological voices were recorded at different locations (Kay Elemetrics and MEEI Voice and Speech Lab., respectively), assumedly under the same acoustic conditions".

Figure 4: Comparison of detection and classification accuracy for the AVPD and MEEI databases [60].

Conclusion
This study presents the design, development, and evaluation of the AVPD. The AVPD may be a key factor for progress in speech pathology assessment in the Arab region. Dysphonic patients with five different types of organic voice disorders (cysts, nodules, polyps, paralysis, and sulci) were included in the database. The database contains repeated vowels, continuous speech, Arabic digits, and some common words. Assessing the perceived severity of the speech impairment and recording isolated words are unique aspects of the AVPD. All subjects, including patients and normal subjects, were recorded after clinical assessment. Baseline results for the AVPD are provided by using different types of speech features and a range of machine learning algorithms. Accuracy for detecting and classifying speech disorders is computed for sustained vowels and continuous speech.

Comparing the obtained results with those of the English voice disorder database (MEEI), the classification results of the two were comparable, although significant differences were observed for disorder detection. The recognition results of the MEEI database were also significantly different from those of the German voice disorder database (SVD). The reason may lie in the different recording environments of normal and pathological subjects in the MEEI database. Therefore, various shortcomings of the SVD and MEEI databases were considered before recording the AVPD.

In our opinion, the AVPD is also important, and it would be good to do more research on this database; it needs more attention in order to grow like the MEEI and the SVD, which is exactly why we chose to discuss it in this paper.

REFERENCES

[1] "UCL Phonetics and Linguistics", archived from the original on April 26, 2015.
[2] Fortes FSG, Imamura R, Tsuji DH, Sennes LU. Perfil dos profissionais da voz com queixas vocais atendidos em um centro terciário de saúde. Braz J Otorhinolaryngol. 2007;73:27-31.
[3] CBMVO - Comitê Brasileiro Multidisciplinar de Voz Ocupacional. Boletim n. 1 da voz ocupacional [serial online]; 2010. Available at: http://www.aborlccf.org.br/conteudo/secao.asp?s=43 [cited 06.01.13].
[4] Andrade FBF, Azevedo R. Similaridades dos sinais e sintomas apresentados nas disfonias funcionais psicogênicas e nas disfonias com suspeita de simulação: diagnóstico diferencial. Distúrb Comun. 2006;18:63-73.
[5] Alves LP, Araújo LTR, Xavier Neto JA. Prevalência de queixas vocais e estudo de fatores associados em uma amostra de professores de ensino fundamental em Maceió, Alagoas, Brasil. Rev Bras Saúde Ocup. 2010;35:168-75.
[6] Sarvat M, Tsuji D, Maniglia JV, Mendes R, Gomes A, Leite J. Consenso nacional sobre voz profissional; 2004. Available at: http://www.iocmf.com.br/codigos/consenso2004
[7] Cielo CA, Beber BC, Maggi CR, Körbes D, Oliveira CF, Weber DE, et al. Disfonia funcional psicogênica por puberfonia do tipo muda vocal incompleta: aspectos fisiológicos e psicológicos. Estud Psicol (Campinas). 2009;26:227-36.
[8] Roy N, Merrill RM, Thibeault S, et al. Prevalence of voice disorders in teachers and the general population. J Speech Lang Hear Res. 2004;47:281-293.
[9] Angelillo M, Di Maio G, Costa G, et al. Prevalence of occupational voice disorders in teachers. J Prev Med Hyg. 2009;50:26-32.
[10] A. R. Maccarini and E. Lucchini, "La valutazione soggettiva ed oggettiva della disfonia. Il protocollo SIFEL," Acta Phoniatrica Latina, vol. 24, no. 1/2, pp. 13-42, 2002.

[11] S. Jothilakshmi, "Automatic system to detect the type of voice pathology," Applied Soft Computing, vol. 21, pp. 244-249, 2014.
[12] N. Saenz-Lechon, J. I. Godino-Llorente, V. Osma-Ruiz, and P. Gómez-Vilda, "Methodological issues in the development of automatic systems for voice pathology detection," Biomedical Signal Processing and Control, vol. 1, no. 2, pp. 120-128, 2006.
[13] Mohammed, M.A.; Ghani, M.K.A.; Hamed, R.I.; Ibrahim, D.A.; Abdullah, M.K. Artificial neural networks for automatic segmentation and identification of nasopharyngeal carcinoma. J. Comput. Sci. 2017, 21, 263-274. [CrossRef]
[14] Mohammed, M.A.; Ghani, M.K.A.; Arunkumar, N.A.; Hamed, R.I.; Abdullah, M.K.; Burhanuddin, M.A. A real time computer aided object detection of nasopharyngeal carcinoma using genetic algorithm and artificial neural network based on Haar feature fear. Future Gener. Comput. Syst. 2018, 89, 539-547. [CrossRef]
[15] Djenouri, D.; Laidi, R.; Djenouri, Y.; Balasingham, I. Machine learning for smart building applications: Review and taxonomy. ACM Comput. Surv. (CSUR) 2019, 52, 1-36. [CrossRef]
[16] Alhussein, M.; Muhammad, G. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access 2018, 6, 41034-41041. [CrossRef]
[17] Mostafa, S.A.; Mustapha, A.; Mohammed, M.A.; Hamed, R.I.; Arunkumar, N.; Ghani, M.K.A.; Jaber, M.M.; Khaleefah, S.H. Examining multiple feature evaluation and classification methods for improving the diagnosis of Parkinson's disease. Cogn. Syst. Res. 2019, 54, 90-99. [CrossRef]
[18] Obaid, O.I.; Mohammed, M.A.; Ghani, M.K.A.; Mostafa, A.; Taha, F. Evaluating the performance of machine learning techniques in the classification of Wisconsin Breast Cancer. Int. J. Eng. Technol. 2018, 7, 160-166.
[19] Mohammed, M.A.; Ghani, M.K.A.; Arunkumar, N.A.; Mostafa, S.A.; Abdullah, M.K.; Burhanuddin, M.A. Trainable model for segmenting and identifying nasopharyngeal carcinoma. Comput. Electr. Eng. 2018, 71, 372-387. [CrossRef]
[20] Kukharchik, P.; Martynov, D.; Kheidorov, I.; Kotov, O. Vocal fold pathology detection using modified wavelet-like features and support vector machines. In Proceedings of the 2007 15th European Signal Processing Conference, Poznan, Poland, 3-7 September 2007; pp. 2214-2218.
[21] Dubuisson, T.; Dutoit, T.; Gosselin, B.; Remacle, M. On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination. EURASIP J. Adv. Signal Process. 2009, 2009, 173967. [CrossRef]
[22] Fredouille, C.; Pouchoulin, G.; Bonastre, J.F.; Azzarello, M.; Giovanni, A.; Ghio, A. Application of automatic speaker recognition techniques to pathological voice assessment. In Proceedings of the International Conference on Acoustic Speech and Signal Processing (ICASSP 2005), Philadelphia, PA, USA, 23 March 2005.
[23] Wang, J.; Jo, C. Performance of Gaussian mixture models as a classifier for pathological voice. In Proceedings of the 11th Australian International Conference on Speech Science and Technology, Melbourne, Australia, 4-7 June 2006; Volume 107, pp. 122-131.
[24] Martínez, D.; Lleida, E.; Ortega, A.; Miguel, A.; Villalba, J. Voice pathology detection on the Saarbrücken voice database with calibration and fusion of scores using multifocal toolkit. In Advances in Speech and Language Technologies for Iberian Languages; Springer: Berlin/Heidelberg, Germany, 2012; pp. 99-109.
[25] E. Van Leer, R. C. Pster, and X. Zhou, "An iOS-based cepstral peak prominence application: Feasibility for patient practice of resonant voice," J. Voice, vol. 31, no. 1, pp. 131.e9-131.e16, 2017.
[26] R. Gravina, P. Alinia, H. Ghasemzadeh, and G. Fortino, "Multi-sensor fusion in body sensor networks: State-of-the-art and research challenges," Inf. Fusion, vol. 35, pp. 68-80, May 2017.
[27] Hemmerling, D.; Skalski, A.; Gajda, J. Voice data mining for laryngeal pathology assessment. Comput. Biol. Med. 2016, 69, 270-276. [CrossRef]
[28] Hammami, I.; Salhi, L.; Labidi, S. Voice Pathologies Classification and Detection Using EMD-DWT Analysis Based on Higher Order Statistic Features. IRBM 2020, 41, 161-171. [CrossRef]
[29] Fonseca, E.S.; Guido, R.C.; Junior, S.B.; Dezani, H.; Gati, R.R.; Pereira, D.C.M. Acoustic investigation of speech pathologies based on the discriminative paraconsistent machine (DPM). Biomed. Signal Process. Control 2020, 55, 101615. [CrossRef]
[30] Rueda, A.; Krishnan, S. Augmenting Dysphonia Voice Using Fourier-based Synchrosqueezing Transform for a CNN Classifier. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12-17 May 2019; pp. 6415-6419.
[31] Pouchoulin, G.; Fredouille, C.; Bonastre, J.F.; Ghio, A.; Révis, J. Characterization of the Pathological Voices (Dysphonia) in the Frequency Space; International Congress of Phonetic Sciences (ICPhS): Saarbrücken, Germany, 2007; pp. 1993-1996.
[32] Fraile, R.; Godino-Llorente, J.I.; Sáenz-Lechón, N.; Osma-Ruiz, V.; Gutiérrez-Arriola, J.M. Characterization of dysphonic voices by means of a filterbank-based spectral analysis: Sustained vowels and running speech. J. Voice 2013, 27, 11-23. [CrossRef] [PubMed]
[34] L. W. Lopes et al., "Accuracy of acoustic analysis measurements in the evaluation of patients with different laryngeal diagnoses," J. Voice, vol. 31, no. 3, pp. 382.e15-382.e26, 2017.
[35] Titze, I.R.; Martin, D.W. Principles of Voice Production; the Journal of the Acoustical Society of America. Acoust. Soc. Am. 1998, 104, 1148. [CrossRef]
[36] Al-Nasheri, A.; Muhammad, G.; Alsulaiman, M.; Ali, Z.; Malki, K.H.; Mesallam, T.A.; Ibrahim, M.F. Voice pathology detection and classification using auto-correlation and entropy features in different frequency regions. IEEE Access 2017, 6, 6961-6974. [CrossRef]
[37] Muhammad, G.; Alhamid, M.F.; Hossain, M.S.; Almogren, A.S.; Vasilakos, A.V. Enhanced living by assessing voice pathology using a co-occurrence matrix. Sensors 2017, 17, 267. [CrossRef] [PubMed]
[38] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, vol. 495. New York, NY, USA: Elsevier, 1995, p. 518.
[39] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, "Estimation of glottal closure instants in voiced speech using the DYPSA algorithm," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 34-43, Jan. 2007.
[40] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi, and T. Irino, "Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT," in Proc. 9th Eur. Conf. Speech Commun. Technol., 2005, pp. 537-540.

[41] L. Tan and M. Karnjanadecha, "Pitch detection algorithm: Autocorrelation method and AMDF," in Proc. 3rd Int. Symp. Commun. Inf. Technol., vol. 2, 2003, pp. 551-556.
[42] L. Verde, G. De Pietro, and G. Sannino, "A methodology for voice classification based on the personalized fundamental frequency estimation," Biomed. Signal Process. Control, vol. 42, pp. 134-144, Apr. 2018.
[43] A. De Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917-1930, 2002.
[44] M. Farrús, J. Hernando, and P. Ejarque, "Jitter and shimmer measurements for speaker recognition," in Proc. 8th Annu. Conf. Int. Speech Commun. Assoc., 2007, pp. 1-4.
[45] F. Severin, B. Bozkurt, and T. Dutoit, "HNR extraction in voiced speech, oriented towards voice quality analysis," in Proc. EUSIPCO, Sep. 2005, pp. 1-4.
[46] VOICEBOX: Speech Processing Toolbox for MATLAB. Accessed: Jan. 1, 2018. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[47] T. Marciniak, R. Weychan, S. Drgas, A. Dąbrowski, and A. Krzykowska, "Speaker recognition based on short Polish sequences," in Proc. Signal Process. Algorithms, Archit., Arrangements, Appl. Conf. (SPA), Sep. 2010, pp. 95-98.
[48] B. Schölkopf, C. J. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning. Cambridge, MA, USA: MIT Press, 1999.
[49] V. N. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988-999, Sep. 1999.
[50] E. Alpaydin, Introduction to Machine Learning. Cambridge, MA, USA: MIT Press, 2014.
[51] S. L. Salzberg, "C4.5: Programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993," Mach. Learn., vol. 16, no. 3, pp. 235-240, Sep. 1994.
[52] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proc. 11th Conf. Uncertainty Artif. Intell., 1995, pp. 338-345.
[53] N. Landwehr, M. Hall, and E. Frank, "Logistic model trees," Mach. Learn., vol. 59, no. 1, pp. 161-205, May 2005.
[54] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Mach. Learn., vol. 6, no. 1, pp. 37-66, 1991.
[55] J. G. Cleary and L. E. Trigg, "K*: An instance-based learner using an entropic distance measure," in Proc. 12th Int. Conf. Mach. Learn., vol. 5, 1995, pp. 108-114.
[56] K. Hajian-Tilaki, "Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation," Caspian J. Internal Med., vol. 4, no. 2, pp. 627-635, 2013.
[57] I. H. Witten, E. Frank, and M. Hall, Data Mining: Practical Machine Learning Tools and Techniques. San Mateo, CA, USA: Morgan Kaufmann, 2005.
[58] CorrelationAttributeEval. Accessed: Jan. 29, 2018. [Online]. Available: http://weka.sourceforge.net/doc.dev/weka/attributeSelection/CorrelationAttributeEval.html
[59] Syed, Sidra Abid, Munaf Rashid, and Samreen Hussain. "Meta-analysis of voice disorders databases and applied machine learning techniques." Mathematical Biosciences and Engineering 17.6 (2020): 7958-7979.
[60] Mesallam, T. A., Farahat, M., Malki, K. H., Alsulaiman, M., Ali, Z., Al-Nasheri, A., Muhammad, G. (2017). Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms. Journal of Healthcare Engineering, 2017.
