
Consonant Phoneme Based Extreme Learning Machine (ELM) Recognition Model for Foreign Accent Identification


Kaleem Kashif, School of Information Science & Technology, Donghua University, Shanghai, China, kaleem@mail.dhu.edu.cn
Yizhi Wu, School of Information Science & Technology, Donghua University, Shanghai, China, yz_wu@dhu.edu.cn
Adjeisah Michael, School of Computer Science & Technology, Donghua University, Shanghai, China, madjeisah@mail.dhu.edu.cn

ABSTRACT
Foreign accent automatic identification has a key role in many speech systems, such as speech recognition, speaker identification, voice conversion, and immigration screening. English speakers exhibit dialectal differences or non-native accents in specific features of their speech, and these features can be used to identify the dialect or native language of the speaker. In this paper, we propose a consonant phoneme based Extreme Learning Machine (ELM) recognition model for accent identification, based on the different pronunciations of English consonant phonemes by native Arabic speakers. Mel-frequency cepstrum coefficients (MFCCs) and the normalized energy parameter, along with their first and second derivatives, are used as acoustic features and trained with ELM, SVM, and DBN classifiers. The ELM classifier showed fast learning and better performance, with an accuracy of 88% and a low standard deviation under K-fold validation, compared with 76% by SVM and 64% by the DBN classifier. Our proposed ELM and SVM models showed 11% and 16% increases in accuracy, respectively, over a previous model that used the same classifiers on a multiple-word acoustic model to identify regional accents.

CCS Concepts
• Computing methodologies → Artificial intelligence → Natural language processing → Speech recognition

Keywords
Extreme Learning Machines (ELM); Support Vector Machines (SVM); Deep Belief Network (DBN); Accent identification.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
WSSE 2019, September 20–23, 2019, Wuhan, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-7213-8/19/09…$15.00
DOI: https://doi.org/10.1145/3362125.3362130

1. INTRODUCTION
Accent identification (AID) refers to the problem of inferring the native language of a speaker from his or her foreign-accented speech [1, 2]. AID systems have been used in a number of major applications, such as voice conversion and immigration screening [3]. Furthermore, AID has great significance for improving the robustness of existing Automatic Speech Recognition (ASR) systems [4]. The Arabic language has several regional dialects, but in our work we focus on Gulf Arabic, a variety of Arabic spoken in Eastern Arabia around the coasts of the Persian Gulf in Kuwait, Bahrain, Qatar, and the United Arab Emirates, as well as in parts of eastern Saudi Arabia, southern Iraq, southern Iran, and northern Oman. Accents or dialects differ in various acoustic traits, including the phonetic realization of vowels and consonants, rhythmical characteristics, and prosody [5, 6]. The phonetic realization of consonants can also be applied as a discriminative feature to identify the native language of a speaker. Here, we focus on the different pronunciations of consonants by Arabic and native English speakers. Gulf Arabic has eight stop consonants and English has six. Influenced by the phonemic inventory of Arabic, the pronunciation of English stop phonemes by Arabic speakers differs from that of native English speakers. Based on this discrimination of consonant phonemes, an AID recognition model is set up in this paper.

In previous approaches, many recognition methods have been analyzed for AID. Traditional methods, such as the Gaussian Mixture Model (GMM) and the Linear Discriminant Analysis (LDA) classifier, are widely used for classification but have their own limitations [7, 8]. The Support Vector Machine (SVM) has achieved good results for AID [9], but it is still less efficient due to limitations such as the difficulty of choosing a "good" kernel function and long training times on large datasets. In recent years, deep learning models, such as the Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Deep Belief Network (DBN), have shown remarkable performance in many areas, including AID [10]. However, the main problem with deep neural network architectures is the learning process, since the ordinary gradient descent algorithm does not work well and sometimes makes training quite impossible [11]. Due to these limitations and low efficiency, the Extreme Learning Machine (ELM) is investigated as a substitute: the main advantage of the ELM model is its short training time with better accuracy. The number of hidden layer nodes can be randomly selected and analyzed to reduce calculation time, and the learning speed is high. ELM uses a quadratic loss function and minimizes the sum of squared errors between the class labels and the network output. It not only penalizes wrong answers but also penalizes correct answers that are far from the decision boundary [12].

In this paper, the consonant phoneme based Extreme Learning Machine (ELM) algorithm for Arabic AID is proposed and investigated with different activation functions, achieving better accuracy. We also compared these results with traditional classifiers, the Support Vector Machine (SVM) and the Deep Belief Network (DBN), to construct a more efficient model. The choice of discriminative features is very important and is discussed in [13]: Mel-frequency cepstrum coefficients (MFCCs) and the normalized energy parameters, along with their first and second derivatives, are used as raw features in our model and trained with the ELM, SVM, and DBN classifiers.
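As a concrete illustration of this front end, the MFCC-plus-derivatives feature vector can be sketched in numpy as below. This is a minimal sketch, not the paper's exact implementation: the filterbank size, FFT length, and cepstrum count are illustrative defaults, and a simple gradient stands in for the usual regression-based delta computation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):
            fb[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mfcc(sig, sr=16000, win=0.025, hop=0.010, n_filt=26, n_ceps=13, n_fft=512):
    # Frame the signal (25 ms window, 10 ms shift) and apply a Hamming window.
    wlen, step = int(win * sr), int(hop * sr)
    n_frames = 1 + (len(sig) - wlen) // step
    idx = np.arange(wlen)[None, :] + step * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(wlen)
    # Power spectrum -> mel filterbank energies -> log -> DCT-II (cepstrum).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    feats = np.log(np.maximum(power @ mel_filterbank(n_filt, n_fft, sr).T, 1e-10))
    m = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * m + 1) / (2.0 * n_filt)))
    return feats @ dct.T

def delta(feat):
    # Simple time derivative; applied twice it gives the second derivative.
    return np.gradient(feat, axis=0)

sig = np.random.default_rng(0).standard_normal(16000)  # 1 s stand-in for a phoneme segment
static = mfcc(sig)
full = np.hstack([static, delta(static), delta(delta(static))])
print(full.shape)  # (98, 39): 98 frames, 13 static + 13 delta + 13 delta-delta
```

Stacking the static coefficients with their first and second derivatives yields the per-frame acoustic feature vector that the classifiers consume.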

2. CONSONANT PHONEME BASED ELM MODEL
The proposed consonant phoneme-based ELM recognition model identifies the speaker from the pronunciation of a particular phoneme. The system is composed of forced alignment, audio feature extraction, and ELM as a classifier, as shown in Fig. 1. Firstly, the phonemes segmented from the audio speech are separated by forced alignment implemented on a GMM-HMM basis [14]. Then, the audio segments from an audio sentence are framed, and spectrum features such as the Mel-frequency cepstrum (MFCC) of each phoneme are extracted. After normalization of the spectrum, the features are used as input to the ELM classifier to identify the speaker. During the design of the extreme learning machine algorithm, different activation functions are investigated, and 'softlim' shows better accuracy.

Figure 1. Phoneme Based Recognition Model.

2.1 Discriminative Features of Stop Consonant Phonemes
The stop consonants, also called stop sounds, such as /b/, /p/, /k/, etc., are produced by complete closure of the air passage through the vocal tract. The eight Arabic stop sounds are pronounced in five different places of articulation and include bilabial /b/, velar /k/, voiceless alveolar /t/, and voiced alveolar /d/ [15]. On the other hand, the six English stop sounds are pronounced in three different places of articulation. They are the bilabial voiced /b/ and its voiceless counterpart /p/, the voiced alveolar /d/ and its counterpart /t/, and the voiced velar /g/ and its counterpart /k/.

2.1.1 Bilabial Stop /b/ vs /p/ Pronunciation
Influenced by the phonemic inventory of Arabic, the pronunciation of English bilabial stop phonemes by Arabic speakers differs from that of native English speakers in two main ways. Firstly, English has two bilabial stop phonemes, /b/ and /p/. The difference between /b/ and /p/ is that /b/ is voiced, produced with sustained vocal fold vibration through the closure, whereas /p/ is voiceless and does not involve vocal fold vibration. Unlike English, most dialects of Arabic lack the voiceless bilabial stop /p/ and only have voiced /b/, which is marked by the letter ب in the Arabic alphabet. Consequently, Arab learners have difficulty pronouncing and hearing the English /p/ because it is not a phoneme in their language, so they articulate and interpret the closest letter that exists in their language, which is ب, often transliterated as /b/. Secondly, there are de-voicing processes changing /b/ into /p/ in English which Arabic speakers often fail to acquire [15].

2.1.2 Alveolar Plosives /t/ vs /d/ Pronunciation
The voiceless /t/ and the voiced /d/ are alveolar plosives in English, produced with the tongue tip touching the alveolar ridge, while in many Arabic dialects they are dental plosives, produced with the tongue tip touching the back of the upper teeth. Therefore, Arab learners may replace the alveolar by the dental because of the interference of the mother tongue on the target language. Another feature of English alveolar stops that may be difficult for learners is that the regular past tense suffix /d/ is realized as voiceless [t] following a voiceless consonant, for example, "danced" [danst]. This phonological rule may not be apparent to Arabic learners of English, and they may produce voiced /d/ in these contexts [16].

2.1.3 Velar Plosives /g/ vs /k/ Pronunciation
The sounds /k/ and /g/ are made by raising the tongue dorsum at the back of the mouth to make a complete closure with the soft palate, or velum. /k/ is a voiceless sound whereas /g/ is a voiced sound. These are known as velar plosives. The English voiced velar plosive /g/ has no counterpart in some dialects of Arabic, whereas other dialects, such as Egyptian Arabic, do have /g/. Speakers of the dialects that don't have /g/ sometimes pronounce it as voiceless /k/, especially in words that contain "ex-", because they have not mastered the rules of assimilation. For example, 'exist' /ɪgzɪst/ → [ɪkzɪst] [17].

2.2 Extreme Learning Machines (ELM)
The extreme learning machine (ELM) is an efficient algorithm that determines the output weights of single layer feed-forward neural networks (SLFNs). It adopts an analytical solution instead of the standard gradient descent algorithm [18]. Neural networks have been used to solve classification problems in several domains, ranging from computer vision to bioinformatics. Traditionally, for an SLFN, all the parameters (weights and biases) of the different layers need to be tuned, and there is dependency among the layers. To overcome the training problems of neural networks, the ELM randomly assigns weights to the input layer and analytically computes the weights for the output layer using a simple generalized inverse operation. The ELM framework has shown comparable classification performance, improved model representation (less complexity), and faster run times in comparison to support vector machines (SVMs) [12].
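The two-step ELM training scheme described above (randomly assigned input weights, output weights computed analytically by a generalized inverse) can be sketched in a few lines of numpy. This is a minimal sketch: 'softlim' is not a standard activation name, so the clipped-linear definition below is an assumption modeled on common Python ELM implementations, and the weight scaling is an illustrative choice.

```python
import numpy as np

def softlim(x):
    # Assumed definition: linear response soft-limited (clipped) to [0, 1].
    return np.clip(x, 0.0, 1.0)

class ELM:
    def __init__(self, n_hidden=600, activation=softlim, seed=0):
        self.L, self.g, self.rng = n_hidden, activation, np.random.default_rng(seed)

    def fit(self, X, T):
        # Step 1: input weights w_i and offsets b_i are random and stay fixed.
        # (Scaled by sqrt(dim) so pre-activations stay in a useful range.)
        self.W = self.rng.standard_normal((X.shape[1], self.L)) / np.sqrt(X.shape[1])
        self.b = self.rng.standard_normal(self.L)
        # Step 2: output weights solve the linear system H beta = T analytically.
        H = self.g(X @ self.W + self.b)
        self.beta = np.linalg.pinv(H) @ T  # Moore-Penrose generalized inverse
        return self

    def predict(self, X):
        return self.g(X @ self.W + self.b) @ self.beta

# Toy two-class run with one-hot targets, mimicking the Arabic/English setup.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 39))
y = (X[:, 0] > 0).astype(int)
model = ELM(n_hidden=100).fit(X, np.eye(2)[y])
acc = float(np.mean(model.predict(X).argmax(axis=1) == y))
print(round(acc, 2))
```

Because no iteration over the hidden-layer parameters is needed, training cost is dominated by one pseudoinverse, which is the source of the short training times reported later.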

For a single hidden layer neural network, as shown in Fig. 2, suppose that $N$ is the number of input samples $(x_j, t_j)$, where $x_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T$. A single hidden layer neural network with $L$ hidden nodes can be expressed as

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \qquad j = 1, \ldots, N,$$

where $g$ is the activation function, $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the input weight, $\beta_i$ is the output weight, $b_i$ is the offset of the $i$-th hidden layer unit, and $w_i \cdot x_j$ denotes the inner product. The goal of single hidden layer network learning is to minimize the error of the output, which can be expressed as

$$\sum_{j=1}^{N} \lVert o_j - t_j \rVert = 0,$$

i.e., there exist $\beta_i$, $w_i$, and $b_i$ such that

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \qquad j = 1, \ldots, N.$$

This can be expressed in matrix form as $H\beta = T$, where $H$ is the output matrix of the hidden layer nodes, $\beta$ is the output weight, and $T$ is the expected output:

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \qquad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m}, \qquad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}.$$

In order to train a single hidden layer neural network, we hope to obtain $\hat{w}_i$, $\hat{b}_i$, and $\hat{\beta}$ that make

$$\lVert H(\hat{w}_i, \hat{b}_i)\,\hat{\beta} - T \rVert = \min_{w_i, b_i, \beta} \lVert H(w_i, b_i)\,\beta - T \rVert,$$

where $i = 1, \ldots, L$. This is equivalent to minimizing the loss function

$$E = \sum_{j=1}^{N} \Bigl( \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) - t_j \Bigr)^2.$$

In the ELM algorithm, once the input weights and the biases of the hidden layer are randomly determined, the output matrix $H$ of the hidden layer is uniquely determined. Training the single hidden layer neural network is thus transformed into solving the linear system $H\beta = T$, and the output weight can be determined by $\hat{\beta} = H^{\dagger} T$, where $H^{\dagger}$ is the Moore–Penrose generalized inverse of $H$ [19]. The basic ELM model, with its input, hidden, and output neurons, is shown in Fig. 2.

Figure 2. ELM Basic model.

3. EXPERIMENTS AND EVALUATION
In this section, we evaluate the performance of the consonant phoneme based ELM model for Arabic-as-a-foreign-language accent identification. The data source is the Speech Accent Archive of George Mason University [20]. The archive dataset consists of recordings of speakers from a variety of language backgrounds reading a paragraph. We have taken 100 native English speakers and 100 speakers from the Gulf region. Native and non-native speakers of English all read the same English paragraph, which is as follows:

"PLEASE CALL STELLA. ASK HER TO BRING THESE THINGS WITH HER FROM THE STORE: SIX SPOONS OF FRESH SNOW PEAS, FIVE THICK SLABS OF BLUE CHEESE, AND MAYBE A SNACK FOR HER BROTHER BOB. WE ALSO NEED A SMALL PLASTIC SNAKE AND A BIG TOY FROG FOR THE KIDS. SHE CAN SCOOP THESE THINGS INTO THREE RED BAGS AND WE WILL GO MEET HER WEDNESDAY AT THE TRAIN STATION."

Because this dataset is available in .mp3 format with a 44.1 kHz sample rate, as a first step we converted our samples from .mp3 at 44.1 kHz to the more computationally efficient .wav format at 16 kHz. One of the dominant pronunciation differences between native Arabic and native English speakers lies in the pronunciation of words containing consonant phonemes. Based on this motivation, in our experiment we analyzed the phonemes /p/, /b/, /t/, /d/, /k/, and /g/. By forced alignment we extracted each phoneme from the speech sentences; as a result we obtained 1500 samples of each phoneme. MFCCs with 26 dimensions were used as features, computed with a Hamming window and a triangular filter bank. The features were extracted from 25 ms windowed signals with a 10 ms frame shift. The performance of the model has been analyzed by comparing accuracies across different classifiers, different activation functions, and different consonants.

3.1 Algorithm Implementation and Classification
In the classification phase, different machine learning models, the Support Vector Machine (SVM), Deep Belief Network (DBN), and Extreme Learning Machine (ELM), were designed according to the classification data. The six phoneme features were trained and tested by the three classifiers. For training and testing, we divided the data in a 7:3 ratio. For the first experiment, with the SVM classifier, the dataset was labeled 1 for Arabic samples and 2 for English samples. For SVM training, a grid search method was used to find the optimal SVM model parameters [21].
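A parameter search of this kind can be sketched with scikit-learn's GridSearchCV. The feature matrix below is a random stand-in for the MFCC vectors (with the paper's 1 = Arabic, 2 = English labeling), and the grids are illustrative examples rather than the paper's exact ranges.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Random stand-in for the MFCC feature vectors; labels 1 = Arabic, 2 = English.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 39))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, 2)

# Illustrative grids (the paper reports the RBF kernel with C = 1, gamma = 0.04 as best).
param_grid = [
    {"kernel": ["rbf", "sigmoid"], "C": [0.1, 1, 10], "gamma": [0.01, 0.04, 0.1]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3, 4]},
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Passing the grid as a list of dicts keeps kernel-specific parameters (gamma, degree) attached only to the kernels that use them, which avoids fitting meaningless combinations.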

We used linear, polynomial, RBF, and sigmoid kernels with d = {1, 2, …, 15} and grids of γ and C values. The best performance was shown by the RBF kernel with C = 1 and gamma = 0.04. In the DBN algorithm, a batch size of 20, a learning rate of 0.001, and 20 epochs were used with the sigmoid activation function. During ELM training, the number of neurons in the hidden layer was varied from 100 to 1000 with an increment of 100, and 'tanh', 'sine', 'tribas', 'inv_tribas', 'sigmoid', 'hardlim', 'softlim', 'gaussian', 'multiquadric', and 'inv_multiquadric' were tested as activation functions. The best performance was achieved by selecting the softlim activation function with 600 neurons in the hidden layer. The number of hidden neurons that yielded good accuracy was learned through parameter tuning. All experiments were run on Windows 10 (CPU 2.7–2.9 GHz) with Python 3.3.6.

3.2 Comparative Experiments
In order to reflect the advantages of the consonant phoneme-based ELM recognition method, we compared it with the traditional classifiers SVM and DBN, as shown in Table 1. The experimental results show that the SVM and DBN classifiers have lower accuracy than ELM. By tuning different parameters of the SVM, including 'gamma', 'kernel', and the regularization parameter 'C', the accuracy achieved was 71%, 74%, 76%, 75%, 69%, and 72% on the consonant phonemes /p/, /b/, /t/, /d/, /k/, and /g/, respectively. By changing the number of hidden layers and fine-tuning other parameters of the DBN classifier, including the learning rate and activation functions, the resulting accuracies were 60%, 62%, 64%, 66%, 65%, and 61%, respectively. Finally, the designed ELM model was applied with different tuned parameters, including the activation function and the number of neurons; the ELM performance was 86%, 87%, 88%, 86%, 85%, and 87%, respectively. Comparing the accuracy of all three classifiers, ELM has the highest performance. In our experiments, the six consonants /p/, /b/, /t/, /d/, /g/, and /k/ were investigated and their features trained using the different classifiers SVM, DBN, and ELM, among which ELM showed good accuracy, and among the different consonant phonemes /t/ gave the best predicted result.

Table 1. Accuracy comparison of the SVM, DBN, and ELM classifiers (different consonant phonemes)

Consonants   SVM (%)   DBN (%)   ELM (%)
/p/          71        60        87.76
/b/          74        62        82.55
/t/          76        64        88.00
/d/          75        66        86.00
/k/          69        65        85.00
/g/          72        61        86.00

In our experiment, a challenging task was to select a suitable and more accurate activation function during algorithm fine-tuning. Table 2 shows the comparative performance of the activation functions. It was found that softlim showed better performance than the other activation functions. The main advantage of softlim is the range of the output probabilities: the range is 0 to 1, and the sum of all the probabilities equals one. After all experiments, the consonant phoneme /t/ was found to have the highest accuracy, 88%, compared with the other consonant phonemes; the phoneme /t/ was tested by adding neurons in the range 100–1000. Accuracy increases with the number of neurons, and after a certain limit a decrease in accuracy is observed. Meanwhile, the standard deviation was also monitored while increasing the ELM neurons: when neurons were added in increments of 100 up to 600, the accuracy reached its maximum and the standard deviation its minimum. With more neurons, efficiency decreased again. Using ELMs as classifiers for the accent identification model gives better accent classification accuracy relative to SVMs and DBNs.

Table 2. Accuracy at different activation functions on the ELM classifier (consonant phonemes)

Activation Function   Phoneme   ELM (%)
tanh                  /t/       84.0
tribas                /t/       81.11
inv_tribas            /t/       80.44
sigmoid               /t/       84.0
hardlim               /t/       83.55
gaussian              /t/       82.66
multiquadric          /t/       86.66
inv_multiquadric      /t/       86.0
softlim               /t/       88.0

3.3 Time-Consuming Performance
Table 3 shows the relative time consumption of the different classifiers. During the experiments, the dataset training time was measured at each step for the SVM, DBN, and ELM classifiers. Our proposed ELM algorithm takes comparatively less training time and is seen to be more efficient.

Table 3. Performance comparison between the ELM, SVM, and DBN classifiers (dataset training time)

Methods            SVM   DBN   ELM
Training Time (s)  72    95    35

As discussed, AID is a challenging problem, and different researchers have applied various classification techniques on different datasets to identify the speaker of non-native languages. Björn Schuller chose different languages and classified them by SVM, but accuracy only reached 44.66%. Yishan Jiao continued work on the same languages and classified them using DNNs and RNNs, combining short- and long-term features; the overall accuracy achieved is 51.92%, and the UAR is 52.24% on multiple languages [10]. In an empirical study of classification on a Foreign Accented English (FAE) dataset, an average accuracy of 32.7% was obtained. Using GMMs and the Bayesian classifier, prediction rates of 73% and 58.9%, respectively, were obtained. In text-independent automatic accent classification using phoneme-based models, average classification accuracies of 64.90% at the phone level and 75.18% at the word level for pairwise classification were obtained. In another study, on the TIMIT dataset using the most discriminating vowels, a detection rate of 42.52% was obtained. Furthermore, using an ELM classifier on the TIMIT dataset for regional accent identification, the accuracy obtained was 77.88% [12]. Table 4 summarizes the comparison of accent classification results, in which the proposed ELM model shows better accuracy for identifying native Arabic speakers with consonant phoneme-based AID.
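The K-fold protocol behind mean-accuracy and standard-deviation figures like those above can be sketched as follows. The data is a synthetic stand-in, and an RBF SVM (with the paper's reported C = 1, gamma = 0.04) stands in for the classifier under test; the same loop applies unchanged to the ELM.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 39))  # stand-in phoneme feature vectors
y = np.where(X[:, 0] > 0, 1, 2)     # 1 = Arabic, 2 = English

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="rbf", C=1, gamma=0.04).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

# Mean accuracy and its standard deviation across the folds.
print(round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))
```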

Table 4. Comparison of accent classification results

Dataset & Technique      Languages              Accuracy (%)
FAE + HLDA [12]          EN, FR, GE             32.70
FAE + GMM [12]           EN, CN, FR, KR         73.00
FAE + Bayes [12]         EN, CN, FR, TH, TR     58.90
CU accent + LDA [12]     EN (regional)          64.90
TIMIT + Prosodic [12]    EN (regional)          42.52
TIMIT + ELM [12]         EN (regional)          77.88
GMU + ELM (Proposed)     EN, AR                 88.00

4. CONCLUSION AND FUTURE WORK
In this paper, a consonant phoneme-based ELM recognition model is proposed for foreign accent identification. To meet the accuracy challenges of foreign accent identification, we extracted robust features of consonant phonemes using MFCCs, fed them as the input of the ELM classifier, and chose a more efficient activation function. Compared with the traditional SVM and DBN classification models, ELM was found more effective, with higher accuracy for accent identification. In the future, we can investigate ELM more deeply by adding layers, as in the "ML-ELM" and "H-ELM" approaches; combining RNN with PCA may also boost the overall accuracy of the model. By investigating feature engineering with ML-ELM, we can extend the model to multi-class identification of more languages, for identifying multiple foreign accents with better efficiency.

5. ACKNOWLEDGMENT
This work was supported by the Shanghai Sailing Program (No. 19YF1402000) and the Fundamental Research Funds for the Central Universities (No. 2232019D3-52).

6. REFERENCES
[1] S. Xue, H. Jiang, L. Dai, and Q. Liu, "Speaker adaptation of hybrid NN/HMM model for speech recognition based on singular value decomposition," Journal of Signal Processing Systems, vol. 82, no. 2, pp. 175-185, 2016.
[2] S. Sinha, A. Jain, and S. Agrawal, "Acoustic-phonetic feature based dialect identification in Hindi speech," International Journal on Smart Sensing & Intelligent Systems, vol. 8, no. 1, 2015.
[3] H. Behravan, V. Hautamäki, and T. Kinnunen, "Factors affecting i-vector based foreign accent recognition: A case study in spoken Finnish," Speech Communication, vol. 66, pp. 118-129, 2015.
[4] C.-C. Chiu et al., "State-of-the-art speech recognition with sequence-to-sequence models," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4774-4778.
[5] C. G. Clopper, D. B. Pisoni, and K. De Jong, "Acoustic characteristics of the vowel systems of six regional varieties of American English," The Journal of the Acoustical Society of America, vol. 118, no. 3, pp. 1661-1676, 2005.
[6] M. E. Beckman, Stress and Non-Stress Accent. Walter de Gruyter, 2012.
[7] J. Padmanabhan and M. J. Johnson Premkumar, "Machine learning in automatic speech recognition: A survey," IETE Technical Review, vol. 32, no. 4, pp. 240-251, 2015.
[8] A. Tomar, "Various classifiers based on their accuracy for age estimation through facial features," International Research Journal of Engineering and Technology (IRJET), vol. 3, no. 7, 2016.
[9] K. Aida-zade, A. Xocayev, and S. Rustamov, "Speech recognition using support vector machines," in 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT), 2016, pp. 1-4.
[10] Y. Jiao, M. Tu, V. Berisha, and J. M. Liss, "Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features," in Interspeech, 2016, pp. 2388-2392.
[11] C. R. Rubi, "A review: Speech recognition with deep learning methods," International Journal of Computer Science and Mobile Computing, vol. 4, no. 5, pp. 1017-1024, 2015.
[12] M. Rizwan and D. V. Anderson, "A weighted accent classification using multiple words," Neurocomputing, vol. 277, pp. 120-128, 2018.
[13] B. Pes, "Feature selection for high-dimensional data: the issue of stability," in 2017 IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2017, pp. 170-175.
[14] S. Brognaux and T. Drugman, "HMM-based speech segmentation: Improvements of fully automatic approaches," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 5-15, 2015.
[15] M. Al Zahrani, "Saudi speakers' perception of the English bilabial stops /b/ and /p/," Sino-US English Teaching, vol. 12, no. 6, pp. 435-447, 2015.
[16] O. Hago and W. Khan, "The pronunciation problems faced by Saudi EFL learners at secondary schools," Education and Linguistics Research, vol. 1, no. 2, pp. 85-99, 2015.
[17] I. Sabir and N. Alsaeed, "A brief description of consonants in modern standard Arabic," Linguistics and Literature Studies, vol. 2, no. 7, pp. 185-189, 2014.
[18] K. Sun, J. Zhang, C. Zhang, and J. Hu, "Generalized extreme learning machine autoencoder and a new deep neural network," Neurocomputing, vol. 230, pp. 374-381, 2017.
[19] G.-B. Huang, D. H. Wang, and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107-122, 2011.
[20] S. Weinberger, "Speech Accent Archive," George Mason University, online: http://accent.gmu.edu, 2014.
[21] A. Tharwat, A. E. Hassanien, and B. E. Elnaghi, "A BA-based algorithm for parameter optimization of support vector machine," Pattern Recognition Letters, vol. 93, pp. 13-22, 2017.

