Consonant Phoneme Based Extreme Learning Machine (ELM) Recognition Model For Foreign Accent Identification
important and is discussed [13]. Mel Frequency Cepstral Coefficients (MFCCs) and the normalized energy parameters, along with their first and second derivatives, are used as raw features in our model and trained with ELM, SVM, and DBN classifiers.

2. CONSONANT PHONEME BASED ELM MODEL
The proposed consonant phoneme-based ELM recognition model identifies the speaker from the pronunciation of a particular phoneme. The system is composed of forced alignment, audio feature extraction, and an ELM classifier, as shown in Fig. 1. Firstly, the phonemes in the audio speech are segmented by forced alignment implemented with a GMM-HMM [14]. Then, the audio segments from an audio sentence are framed, and spectral features such as the Mel-frequency cepstral coefficients (MFCCs) of the phoneme are extracted. After normalization of the spectrum, the features are used as input to the ELM classifier to identify the speaker. During the design of the extreme learning machine algorithm, different activation functions are investigated, and 'softlim' shows the best accuracy.

different places of articulation. They are the bilabial voiced /b/ and its voiceless counterpart /p/, the voiced alveolar /d/ and its voiceless counterpart /t/, and the voiced velar /g/ and its voiceless counterpart /k/.

2.1.1 Bilabial Stop /b/ vs /p/ Pronunciation
Influenced by the phonemic inventory of Arabic, the pronunciation of the English bilabial stop phonemes by Arabic speakers differs from that of native English speakers in two main ways. Firstly, English has two bilabial stop phonemes, /b/ and /p/. The difference between them is that /b/ is voiced, produced with sustained vocal fold vibration through the closure, whereas /p/ is voiceless and does not involve vocal fold vibration. Unlike English, most dialects of Arabic lack the voiceless bilabial stop /p/ and have only the voiced /b/, marked by the letter ب in the Arabic alphabet. Consequently, Arab learners have difficulty pronouncing and hearing the English /p/ because it is not a phoneme in their language, so they articulate and interpret it as the closest sound that exists in their language, ب, transliterated as /b/. Secondly, there are de-voicing processes changing /b/ into /p/ in English which Arabic speakers often fail to acquire [15].
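The feature-extraction front end described in Section 2 — framing a phoneme segment and computing MFCCs — can be sketched in plain numpy. This is a minimal illustration, not the paper's exact configuration: the frame length, hop size, filter count, and pre-emphasis coefficient below are common defaults chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Convert frequency in Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Pre-emphasis to boost high frequencies.
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel filterbank energies.
    fb = mel_filterbank(n_filters, n_fft, sr)
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II over the filterbank axis, keeping the first n_ceps coefficients.
    basis = np.cos(np.pi / n_filters *
                   (np.arange(n_filters) + 0.5)[None, :] *
                   np.arange(n_ceps)[:, None])
    return logmel @ basis.T  # shape: (n_frames, n_ceps)
```

In practice the delta and delta-delta features mentioned above would be appended by differencing these coefficients across frames before normalization.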
single hidden layer neural networks, as shown in Fig. 2. Suppose N is the number of input samples (x_j, t_j), where x_j = [x_j1, x_j2, …, x_jn]^T and t_j = [t_j1, t_j2, …, t_jm]^T. For a single hidden layer neural network with L hidden layer nodes, the output can be expressed as

∑_{i=1}^{L} β_i g(w_i · x_j + b_i) = o_j,   j = 1, …, N,

where g(·) is the activation function, w_i and b_i are the randomly assigned input weights and bias of the i-th hidden node, and β_i is the output weight connecting the i-th hidden node to the output.
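The ELM training procedure implied by this formulation — fix the hidden parameters w_i, b_i at random and solve for the output weights β by the Moore-Penrose pseudo-inverse — can be sketched as follows. The toy data, layer size, and tanh activation are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L=50, activation=np.tanh):
    """Train a single-hidden-layer ELM: random input weights, analytic output weights."""
    n = X.shape[1]
    W = rng.standard_normal((n, L))   # random input weights w_i (never trained)
    b = rng.standard_normal(L)        # random biases b_i
    H = activation(X @ W + b)         # hidden layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T      # output weights: least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta, activation=np.tanh):
    return activation(X @ W + b) @ beta

# Toy demo: two linearly separable 2-D classes with one-hot targets.
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
T = np.eye(2)[y]
W, b, beta = elm_train(X, T, L=50)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
acc = (pred == y).mean()
```

Because only β is learned, and in closed form, training reduces to one matrix pseudo-inverse — the source of the short training times reported for ELM later in the paper.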
kernels with d = {1, 2, …, 15}, γ = {…}, and C = {…}. The best performance was obtained with the RBF kernel at C = 1 and γ = 0.04. For the DBN algorithm, a batch size of 20, a learning rate of 0.001, and 20 epochs were used with the sigmoid activation function. During ELM training, the number of neurons in the hidden layer was varied from 100 to 1000 in increments of 100, and 'tanh', 'sine', 'tribas', 'inv_tribas', 'sigmoid', 'hardlim', 'softlim', 'gaussian', 'multiquadric', and 'inv_multiquadric' were tested as activation functions. The best performance was achieved by selecting the softlim activation function with 600 neurons in the hidden layer. The number of neurons yielding good accuracy was determined by parameter tuning. All these experiments were run on Windows 10 (CPU 2.7-2.9 GHz) with Python version 3.3.6.

3.2 Comparative Experiments
To demonstrate the advantages of the consonant phoneme-based ELM recognition method, we compared it with the traditional classifiers SVM and DBN, as shown in Table 1. The experimental results show that the SVM and DBN classifiers have lower accuracy than ELM. By tuning different parameters of the SVM, including 'gamma', 'kernel', and the regularization parameter 'C', accuracies of 71%, 74%, 76%, 75%, 69%, and 72% were achieved on the consonant phonemes /p/, /b/, /t/, /d/, /k/, and /g/ respectively. By varying the number of hidden layers and fine-tuning other parameters of the DBN classifier, including the learning rate and activation functions, the resulting accuracies were 60%, 62%, 64%, 66%, 65%, and 61% respectively. Finally, the designed ELM model was applied, with different parameters tuned, including the activation function and the number of neurons; the ELM accuracies were 86%, 87%, 88%, 86%, 85%, and 87% respectively. Comparing the accuracy of all three classifiers, ELM performs best. In our experiments, six different consonants /p/, /b/, /t/, /d/, /g/, and /k/ were investigated and their features trained with the SVM, DBN, and ELM classifiers; ELM showed the best accuracy, and among the consonant phonemes /t/ gave the best prediction result.

Table 1. Accuracy comparison of SVM, DBN, and ELM classifiers (different consonant phonemes)

Consonants   SVM (%)   DBN (%)   ELM (%)
/p/          71        60        87.76
/b/          74        62        82.55
/t/          76        64        88.00
/d/          75        66        86.00
/k/          69        65        85.00
/g/          72        61        86.00

In our experiment, a challenging task was to select a suitable and more accurate activation function during algorithm fine-tuning. Table 2 shows the comparative performance of the activation functions. Softlim showed better performance than the other activation functions. The main advantage of softlim is its output range: the outputs lie between 0 and 1, and the sum of all the probabilities equals one. After all experiments, the consonant phoneme /t/ was found to have the highest accuracy, 88%, compared with the other consonant phonemes. The phoneme /t/ was tested by adding neurons in the range 100-1000. Accuracy increases with the number of neurons, and beyond a certain limit a decrease in accuracy is observed. The standard deviation was also monitored while increasing the ELM neurons: as the number of neurons was increased in increments of 100 up to 600, the accuracy reached its maximum and the standard deviation its minimum; with more neurons, the performance decreased again. Using ELMs as classifiers in the accent identification model gives better accent classification accuracy than SVMs and DBNs.

Table 2. Accuracy of different activation functions with the ELM classifier (consonant phonemes)

Activation Functions   Phoneme   ELM (%)
tanh                   /t/       84.0
tribas                 /t/       81.11
inv_tribas             /t/       80.44
sigmoid                /t/       84.0
hardlim                /t/       83.55
gaussian               /t/       82.66
multiquadric           /t/       86.66
inv_multiquadric       /t/       86.0
softlim                /t/       88.0

3.3 Time-consuming Performance
Table 3 shows the relative training-time comparison between the different classifiers. During the experiments, the training time on the dataset was measured at each step for the SVM, DBN, and ELM classifiers. Our proposed ELM algorithm takes comparatively less training time and is more efficient.

Table 3. Performance comparison of the ELM, SVM, and DBN classifiers (dataset training time)

Methods             SVM   DBN   ELM
Training Time (s)   72    95    35

As discussed, AID is a challenging problem, and different researchers have applied various classification techniques on different datasets to identify speakers of non-native languages. Bjorn Schuller chose different languages and classified them with an SVM, but the accuracy only reached 44.66%. Yishan Jiao continued work on the same languages and classified them using DNNs and RNNs, combining short- and long-term features; the overall accuracy achieved was 51.92%, with a UAR of 52.24% across multiple languages [10]. In the empirical study of classification on a Foreign Accented English (FAE) dataset, an average accuracy of 32.7% was obtained. Using GMMs and a Bayesian classifier, prediction rates of 73% and 58.9% respectively were obtained. In text-independent automatic accent classification using phoneme-based models, average classification accuracies of 64.90% at the phone level and 75.18% at the word level were obtained for pairwise classification. In another study, on the TIMIT dataset using the most discriminating vowels, a detection rate of 42.52% was obtained. Furthermore, using an ELM classifier on the TIMIT dataset for regional accent identification, the accuracy obtained was 77.88% [12]. Table 4 summarizes the comparison of accent classification results, in which the proposed ELM model shows better accuracy for identifying native Arabic speakers with consonant phoneme-based AID.
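The sweep in Table 2 depends on how each activation function is defined, and definitions vary between ELM implementations. The sketch below gives plausible forms for a subset of the functions named in Section 3.1; in particular, treating 'softlim' as a linear ramp clipped to [0, 1] and 'hardlim' as a 0/1 step is an assumption, not something stated in the paper.

```python
import numpy as np

# Assumed definitions for some of the ELM activation functions tested in
# Section 3.1. The exact forms used in the paper's ELM library may differ.
ACTIVATIONS = {
    "tanh":         np.tanh,
    "sine":         np.sin,
    "sigmoid":      lambda x: 1.0 / (1.0 + np.exp(-x)),
    "hardlim":      lambda x: (x >= 0).astype(float),        # 0/1 step
    "softlim":      lambda x: np.clip(x, 0.0, 1.0),          # clipped linear ramp
    "tribas":       lambda x: np.clip(1.0 - np.abs(x), 0.0, 1.0),
    "gaussian":     lambda x: np.exp(-x ** 2),
    "multiquadric": lambda x: np.sqrt(x ** 2 + 1.0),
}

# Apply each candidate to the same pre-activation values, as a hidden layer would.
x = np.linspace(-2.0, 2.0, 5)
hidden = {name: f(x) for name, f in ACTIVATIONS.items()}
```

Under these definitions, softlim's bounded [0, 1] output matches the range property the text attributes to it; swapping the function passed to the hidden layer is the only change needed to rerun the comparison.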
Table 4. Comparison of accent classification results

Dataset & Techniques     Languages            Accuracy (%)
FAE + HLDA [12]          EN, FR, GE           32.70
FAE + GMM [12]           EN, CN, FR, KR       73.00
FAE + Bayes [12]         EN, CN, FR, TH, TR   58.90
CU accent + LDA [12]     EN (Regional)        64.90
TIMIT + Prosodic [12]    EN (Regional)        42.52
TIMIT + ELM [12]         EN (Regional)        77.88
GMU + ELM (Proposed)     EN, AR               88.00

4. CONCLUSION AND FUTURE WORK
In this paper, a consonant phoneme-based ELM recognition model is proposed for foreign accent identification. To meet the accuracy challenges of foreign accent identification, we extracted robust features of a consonant phoneme using MFCCs, fed them as input to the ELM classifier, and chose the most efficient activation function. Compared with the traditional SVM and DBN classification models, ELM was found to be more effective, with higher accuracy for accent identification. In the future, we can investigate ELM more deeply by adding further layers, as in the "ML-ELM" and "H-ELM" approaches. Combining an RNN with PCA could also boost the overall accuracy of the model. By investigating feature engineering with ML-ELM, we can extend the model to multi-class identification of multiple foreign accents by adding more languages, with better efficiency.

5. ACKNOWLEDGMENT
This work was supported by the Shanghai Sailing Program (No. 19YF1402000) and the Fundamental Research Funds for the Central Universities (No. 2232019D3-52).

6. REFERENCES
[1] S. Xue, H. Jiang, L. Dai, and Q. Liu, "Speaker adaptation of hybrid NN/HMM model for speech recognition based on singular value decomposition," Journal of Signal Processing Systems, vol. 82, no. 2, pp. 175-185, 2016.
[2] S. Sinha, A. Jain, and S. Agrawal, "Acoustic-phonetic feature based dialect identification in Hindi speech," International Journal on Smart Sensing & Intelligent Systems, vol. 8, no. 1, 2015.
[3] H. Behravan, V. Hautamäki, and T. Kinnunen, "Factors affecting i-vector based foreign accent recognition: A case study in spoken Finnish," Speech Communication, vol. 66, pp. 118-129, 2015.
[4] C.-C. Chiu et al., "State-of-the-art speech recognition with sequence-to-sequence models," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: IEEE, pp. 4774-4778.
[5] C. G. Clopper, D. B. Pisoni, and K. De Jong, "Acoustic characteristics of the vowel systems of six regional varieties of American English," The Journal of the Acoustical Society of America, vol. 118, no. 3, pp. 1661-1676, 2005.
[6] M. E. Beckman, Stress and Non-stress Accent. Walter de Gruyter, 2012.
[7] J. Padmanabhan and M. J. Johnson Premkumar, "Machine learning in automatic speech recognition: A survey," IETE Technical Review, vol. 32, no. 4, pp. 240-251, 2015.
[8] A. Tomar, "Various classifiers based on their accuracy for age estimation through facial features," International Research Journal of Engineering and Technology (IRJET), vol. 03, no. 07, 2016.
[9] K. Aida-zade, A. Xocayev, and S. Rustamov, "Speech recognition using support vector machines," in 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT), 2016: IEEE, pp. 1-4.
[10] Y. Jiao, M. Tu, V. Berisha, and J. M. Liss, "Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features," in Interspeech, 2016, pp. 2388-2392.
[11] C. R. Rubi, "A review: Speech recognition with deep learning methods," International Journal of Computer Science and Mobile Computing, vol. 4, no. 5, pp. 1017-1024, 2015.
[12] M. Rizwan and D. V. Anderson, "A weighted accent classification using multiple words," Neurocomputing, vol. 277, pp. 120-128, 2018.
[13] B. Pes, "Feature selection for high-dimensional data: the issue of stability," in 2017 IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2017: IEEE, pp. 170-175.
[14] S. Brognaux and T. Drugman, "HMM-based speech segmentation: Improvements of fully automatic approaches," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 5-15, 2015.
[15] M. Al Zahrani, "Saudi speakers' perception of the English bilabial stops /b/ and /p/," Sino-US English Teaching, vol. 12, no. 6, pp. 435-447, 2015.
[16] O. Hago and W. Khan, "The pronunciation problems faced by Saudi EFL learners at secondary schools," Education and Linguistics Research, vol. 1, no. 2, pp. 85-99, 2015.
[17] I. Sabir and N. Alsaeed, "A brief description of consonants in modern standard Arabic," Linguistics and Literature Studies, vol. 2, no. 7, pp. 185-189, 2014.
[18] K. Sun, J. Zhang, C. Zhang, and J. Hu, "Generalized extreme learning machine autoencoder and a new deep neural network," Neurocomputing, vol. 230, pp. 374-381, 2017.
[19] G.-B. Huang, D. H. Wang, and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107-122, 2011.
[20] S. Weinberger, "Speech Accent Archive," George Mason University, online: http://accent.gmu.edu, 2014.
[21] A. Tharwat, A. E. Hassanien, and B. E. Elnaghi, "ABA-based algorithm for parameter optimization of support vector machine," Pattern Recognition Letters, vol. 93, pp. 13-22, 2017.