
Noise Effect on Arabic Alphadigits in Automatic Speech Recognition
Yousef Ajami Alotaibi (1), Khondaker Abdullah-Al-Mamun (2), Ghulam Muhammad (1)
(1) Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Saudi Arabia
{yalotaibi, ghulam}@ccis.ksu.edu.sa
(2) Institute of Sound and Vibration Research, University of Southampton, UK

Abstract - Automatic Speech Recognition (ASR) for Arabic speech, particularly Saudi-accented speech, is a little-researched area. Some efforts to develop ASR for Saudi-accented Arabic speech in clean environments have been reported in previous literature. These papers discuss the difficulties and problems of Arabic ASR to some extent. In this paper, we analyze the effect of noise at different signal-to-noise ratios (SNRs) on Saudi-accented Arabic alphadigits. The experimental results show an accuracy of 85.01% in a clean environment, and 82.45% and 60.55% in noisy conditions at SNRs of 20 dB and 5 dB, respectively. The most confused alphadigits are also discussed for both clean and noisy conditions.

Index Terms - Saudi accented Arabic speech, alphadigits, ASR, HMM, noise.

1 Introduction

Automatic speech recognition (ASR) is well developed for many languages such as English, Japanese, Spanish, German, and Mandarin. However, ASR for Arabic speech, particularly Saudi-accented speech, is a little-researched area. Some efforts to develop ASR for Arabic speech in clean environments have been reported in the literature [1][2]. These papers discuss the difficulties and problems of Arabic ASR to some extent. In this paper, we analyze the effect of noise on Arabic alphadigits. To the best of our knowledge, this is the first attempt towards developing a noise-robust Arabic speech recognition system. Table 1 shows the 29 alphabets and the 10 digits of the Arabic language along with their pronunciation, syllable type, and number of syllables in every spoken alphadigit. All Arabic syllables must contain at least one vowel. Also, Arabic vowels cannot be word-initial; they occur either between two consonants or as the final phoneme of a word.

2 Literature review

An extensive range of work has been done on spoken alphadigit recognition in the area of ASR, and many algorithms using a variety of techniques have been developed. In general, ASR researchers have targeted spoken alphabets and digits of different languages. Most of the research has been done for English, Japanese, and Mandarin, but very little can be found for Arabic. An artificial-neural-network-based speech recognition system was designed and tested on automatic Arabic digit recognition by Y.A. Alotaibi [3]. The system was an isolated-word speech recognizer, implemented in both multi-speaker and speaker-independent modes. Y.A. Alotaibi also developed an HMM-based ASR system for recognizing the Arabic alphabet [4]. Ghulam [5] developed an HMM/SM-based Japanese connected-digit recognizer that achieved 98.2% recognition accuracy in a clean environment. The recognizer used the subspace method to verify the N-best hypotheses from an HMM-based classifier.

The research on Arabic alphadigits discussed above involves speech recorded in a controlled environment or in clean conditions. However, natural speech may be uttered in any environment, which introduces noise. The work in this paper extends [4] to recognize the alphadigits in noisy conditions. This paper concentrates on the analysis and investigation of Arabic alphadigits in noisy environments from an ASR perspective. To simulate noisy speech, white Gaussian noise is artificially added to the clean speech at different signal-to-noise ratio (SNR) levels.

The rest of this paper is organized as follows. Section 3 describes the experimental setup, and Section 4 describes the results. Finally, the conclusion of this research is presented in Section 5.
Table 1: Arabic Alphadigits.

| Symbol | Alphadigit | Arabic Writing | Syllables | No. of Syllables |
|--------|------------|----------------|-----------|------------------|
| D0 | Sifr | صفر | CVCC | 1 |
| D1 | Wahed | واحد | CV-CVC | 2 |
| D2 | Athnayn | أثنين | CVC-CVCC | 2 |
| D3 | Thalathah | ثلاثة | CV-CV-CVC | 3 |
| D4 | Arbaah | أربعة | CVC-CV-CVC | 3 |
| D5 | Khamsah | خمسة | CVC-CVC | 2 |
| D6 | Setah | ستة | CVC-CVC | 2 |
| D7 | Sabaah | سبعة | CVC-CVC | 2 |
| D8 | Thamanyah | ثمانية | CV-CV-CV-CVC | 4 |
| D9 | Tesaah | تسعة | CVC-CVC | 2 |
| A01 | Alef | ألف | CV-CVC | 2 |
| A02 | Hamzah | همزة | CVC-CVC | 2 |
| A03 | Baa | با | CV | 1 |
| A04 | Taa | تا | CV | 1 |
| A05 | Thaa | ثا | CV | 1 |
| A06 | Jeem | جيم | CVC | 1 |
| A07 | H_aa | حا | CV | 1 |
| A08 | Khaa | خا | CV | 1 |
| A09 | Daal | دال | CVC | 1 |
| A10 | Thaal | ذال | CVC | 1 |
| A11 | Raa | را | CV | 1 |
| A12 | Zain | زين | CV-CVC | 2 |
| A13 | Seen | سين | CVC | 1 |
| A14 | Sheen | شين | CVC | 1 |
| A15 | Saad | صاد | CVC | 1 |
| A16 | Dhaad | ضاد | CVC | 1 |
| A17 | T_aa | طا | CV | 1 |
| A18 | Dhaa | ظا | CV | 1 |
| A19 | Ain | عين | CV-CVC | 2 |
| A20 | Ghain | غين | CV-CVC | 2 |
| A21 | Faa | فا | CV | 1 |
| A22 | Qaaf | قاف | CVC | 1 |

3 Experimental setup

3.1 System Parameter

The main experiment has been conducted on a connected-phoneme task consisting of isolated Arabic alphadigits. Each phoneme is modeled by a three-state HMM. The state transitions are left-to-right. Observation probability density functions are modeled using Gaussian Mixture Models (GMMs). The number of mixtures in each state is set to 1. All training and recognition experiments have been implemented with the HTK speech recognition software package [6].

The parameters of the system are: a 10 kHz sampling rate with 16-bit sample resolution; a 25 millisecond Hamming window with a step size of 10 milliseconds; 12 MFCC coefficients extracted by DCT from 26 filter outputs; and a pre-emphasis coefficient of 0.97.

3.2 Database

An in-house database was created from all Arabic alphadigits, and this database is used in the experiments. A total of fifty speakers, all male, uttered all twenty-nine Arabic alphabets and ten digits with ten repetitions. In addition, a total of seventeen male speakers uttered all ten Arabic digits ten times. All speakers spoke Arabic as their native language, and they range in age from eighteen to fifty years. This gives 16,200 alphadigit tokens. The sampling rate is 10 kHz. White Gaussian noise has been artificially added to the clean speech at segmental SNRs of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. For training, the first and the second repetitions of each alphadigit uttered by all speakers were used. Thus, the total number of tokens considered for training is 12,030. For testing, 4,170 tokens were used in the recognition phase.
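The noise-corruption step described in Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names `add_white_noise` and `measured_snr_db` are hypothetical, the test tone only stands in for a recorded utterance, and, as a simplifying assumption, the SNR is set over the whole utterance rather than segmentally (frame by frame) as in the experiments.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_white_noise(clean, snr_db, rng=None):
    """Return clean + white Gaussian noise scaled to the requested overall SNR.

    Assumption: the SNR is measured over the whole utterance, which is
    simpler than the segmental (frame-averaged) SNR mentioned in the text.
    """
    rng = rng or random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # Choose the gain so that 10*log10(P_signal / P_noise) equals snr_db.
    target_noise_power = signal_power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / signal_power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean reference."""
    noise = [y - c for c, y in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))


if __name__ == "__main__":
    # Hypothetical stand-in for an utterance: 0.5 s of a 440 Hz tone at 10 kHz.
    rate = 10_000
    clean = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate // 2)]
    for snr in (20, 15, 10, 5, 0, -5):
        noisy = add_white_noise(clean, snr)
        print(snr, round(measured_snr_db(clean, noisy), 2))
```

Scaling the noise by a single gain keeps it white and Gaussian. A segmental variant would instead compute the signal and noise power per frame and average the frame-level SNRs in dB.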
Table 1 (continued): Arabic Alphadigits.

| Symbol | Alphadigit | Arabic Writing | Syllables | No. of Syllables |
|--------|------------|----------------|-----------|------------------|
| A23 | Kaaf | كاف | CVC | 1 |
| A24 | Laam | لام | CVC | 1 |
| A25 | Meem | ميم | CVC | 1 |
| A26 | Noon | نون | CVC | 1 |
| A27 | Haa | ها | CV | 1 |
| A28 | Wawo | واو | CVC | 1 |
| A29 | Yaa | يا | CV | 1 |

4 Results

The results reported in this paper come from the designed Arabic alphadigit recognition system, evaluated in clean and noisy environments. The performance of any recognition system depends on many factors, among them the size and the perplexity of the vocabulary. In this system the vocabulary is relatively small (only thirty-nine spoken Arabic alphadigits). However, the existence of acoustically similar sets is
obviously damaging the accuracy of the system. Figures 1 and 2 show the accuracy of the system for all spoken digits and alphabets, respectively. The overall system performance is 85.01% in the clean environment, while in noisy environments the overall performance decreases with the level of noise. For example, the overall performance is 82.45%, 77.53%, 70.50%, 60.55%, 37.12% and 3.45% for SNR = 20 dB, 15 dB, 10 dB, 5 dB, 0 dB and -5 dB, respectively. Spoken alphadigits D1, A1, A2 and A28 achieve a 100% recognition rate for clean utterances. This is because these spoken alphadigits are highly dissimilar to all other spoken alphadigits. On the other hand, the worst performance for clean utterances is encountered with alphadigits A5, A10, A13, A17, A18, A21 and A27, where the performance is less than 70%.

In the experiments we found that for several alphadigits the recognition rate increases with the level of noise, contrary to the expected behavior. Among the digits, some achieve a higher recognition rate with more noise, others (e.g. D1) keep a similar recognition rate with more noise, and some follow the expected behavior. Among the alphabets, the recognition rate generally decreases as noise is added to the speech utterances, but for a few alphabets the recognition rate moves in the opposite direction.

Figure 1: Correctness for Individual Arabic digits (D0 ~ D9). Series: Clean, 20dB, 15dB, 10dB, 5dB, 0dB, Average.

Figure 2.a: Correctness for Individual Arabic Alphabets (A1 ~ A10). Series: Clean, 20dB, 15dB, 10dB, 5dB, 0dB, Average.

Figure 2.b: Correctness for Individual Arabic Alphabets (A11 ~ A20). Series: Clean, 20dB, 15dB, 10dB, 5dB, 0dB, Average.

From Figure 1, the digit D0 has 80% accuracy in clean speech, and this improves to 87%, 95% and 100% in the noisy environment at SNR 20 dB, 5 dB and 0 dB, respectively. The confusion of D0 with A22 gradually decreases as the noise level increases. A similar situation occurs with the digits D3, D5 and D6. The digit D1 maintains a 100% recognition rate for clean speech and for noisy speech down to an SNR of 5 dB. At SNRs of 0 dB and below, the recognition rate decreases. For digit D2, the accuracy is always below the average and is almost the same for clean and noisy conditions down to an SNR of 5 dB. The digit D9 achieves high accuracy for clean speech and for noisy speech down to an SNR of 10 dB; at SNR 5 dB and 0 dB it is confused with digit D6 and its recognition rate is lower.

The alphabets A1 and A2 have high accuracy in both clean and noisy conditions, but the accuracy of alphabets A3 and A4 decreases heavily as the noise increases. The alphabet A5 has very low accuracy for both clean and noisy speech utterances, and it is confused with A4, A21, A23 and A27. Here the last phonemes of alphabets A4, A5, A21 and A27 are very similar, which is a likely reason for the confusion. A similar scenario occurs for the pairs A9 and A10, A13 and A14, A17 and A18, A7 and A27, and A19 and A20. Tables 2 and 3 show the five most confused alphadigits in clean conditions and at SNR 5 dB, respectively.
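A ranking like the one in Tables 2 and 3 can be derived from recognizer output by counting, for each reference alphadigit, how often it is recognized correctly and which labels it is mistaken for. The sketch below is one plausible way to do this in Python; the function names and the sample pairs are hypothetical and are not taken from the paper's data or from HTK.

```python
from collections import Counter, defaultdict


def per_class_accuracy(pairs):
    """Map each reference label to the percentage of its tokens recognized correctly."""
    total, correct = Counter(), Counter()
    for ref, hyp in pairs:
        total[ref] += 1
        correct[ref] += (ref == hyp)
    return {label: 100.0 * correct[label] / total[label] for label in total}


def most_confused(pairs, n=5):
    """Return the n least accurate labels with their accuracy and confusions.

    Each entry is (label, accuracy_percent, [wrong labels, most frequent first]).
    """
    accuracy = per_class_accuracy(pairs)
    confusions = defaultdict(Counter)
    for ref, hyp in pairs:
        if ref != hyp:
            confusions[ref][hyp] += 1
    # Sort by accuracy, breaking ties by label so the order is deterministic.
    worst = sorted(accuracy, key=lambda label: (accuracy[label], label))[:n]
    return [
        (label, accuracy[label], [h for h, _ in confusions[label].most_common()])
        for label in worst
    ]


if __name__ == "__main__":
    # Hypothetical recognizer output as (reference, hypothesis) pairs;
    # these are made-up tokens, not results from the paper.
    pairs = [
        ("A5", "A4"), ("A5", "A4"), ("A5", "A21"), ("A5", "A5"),
        ("A4", "A4"), ("A4", "A5"),
        ("D1", "D1"), ("D1", "D1"),
    ]
    for label, acc, confused_with in most_confused(pairs, n=2):
        print(label, acc, confused_with)
```

Running the same function separately on the clean test set and on the test set at each SNR would give one ranking per noise condition.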
Table 2: The five most confused alphadigits in clean condition.

| Alphadigit | Accuracy (%) | Mostly Confused with | Comments |
|------------|--------------|----------------------|----------|
| A5 (THAA) | 41.5 | A4 (TAA), A7 (H_AA), A21 (FAA), A23 (KAAF), A27 (HAA) | Common phoneme is AA |
| A13 (SEEN) | 30.0 | D2 (ATHNAYN), A6 (JEEM), A14 (SHEEN) | Common phoneme is EE |
| A17 (T_AA) | 67.7 | A11 (RAA), A22 (QAAF), A28 (WAAWO) | Common phoneme is AA |
| A18 (DHAA) | 66.2 | A9 (DAAL), A11 (RAA), A16 (DHAAD), A17 (T_AA) | Common phoneme is AA |
| A27 (HAA) | 60.0 | A7 (H_AA), A21 (FAA) | Common phoneme is AA |

Table 3: The five most confused alphadigits in SNR = 5 dB.

| Alphadigit | Accuracy (%) | Mostly Confused with | Comments |
|------------|--------------|----------------------|----------|
| A4 (TAA) | 23.1 | A5 (THAA), A21 (FAA), A23 (KAAF) | Common phoneme is AA |
| A5 (THAA) | 15.4 | A3 (BAA), A4 (TAA), A21 (FAA) | Common phoneme is AA |
| A10 (THAAL) | 19.2 | A9 (DAAL) | Common phoneme is AA |
| A21 (FAA) | 20.8 | A3 (BAA), A4 (TAA), A5 (THAA), A17 (T_AA) | Common phoneme is AA |
| A27 (HAA) | 49.2 | A5 (THAA), A7 (H_AA), A8 (KHAA), A21 (FAA) | Common phoneme is AA |

Figure 2.c: Correctness for Individual Arabic Alphabets (A21 ~ A29). Series: Clean, 20dB, 15dB, 10dB, 5dB, 0dB, Average.

The recognition rates of alphabets A17 and A18 are not good in noisy conditions or even in clean conditions, while alphabets A19 and A20 maintain high accuracy in clean speech and in noisy speech down to an SNR of 5 dB. The five worst average accuracies are found for the alphabets A5, A10, A13, A21, and A27, which are below 40%, whereas the four best average accuracies are found for the alphadigits D1, D5, D7, and A1, which are above 80%.

The digits (e.g. D3) and alphabets (e.g. A2) that consist of a larger number of phonemes have high accuracy in the clean environment. When noise is added, their accuracies do not degrade much and in some cases even increase. The alphabets that consist of fewer phonemes have inferior accuracies.

5 Conclusion

In this investigation we studied the effect of noise at different SNRs on the recognition of Saudi-accented Arabic alphadigits. The system is designed for clean and noisy speech with SNRs ranging from 20 dB to -5 dB. In Arabic alphadigits the phoneme AA is the most common, and alphadigits containing it are also the most confused with other alphadigits. In future work we will try to improve the performance of this ASR system.

6 References

[1] A. Youssef et al., “An Arabic TTS System on the IBM Trainable Synthesizer,” Le traitement automatique de l’arabe, JEP-TALN 2004, pp. 19-21 April 2004.

[2] W. Abdulah et al., “Real-time Spoken Arabic Recognizer,” Int. J. Electronics, Vol. 59, No. 5, pp. 645-648, 1984.

[3] Y.A. Alotaibi, “Investigating spoken Arabic digits in speech recognition setting,” Journal of Information Sciences 173 (1–3), Elsevier, pp. 115–139, 2005.

[4] Y.A. Alotaibi, “Automatic recognition, investigation, and analysis of the spoken Arabic alphabet,” Egyptian Computer Science Journal, 2008.

[5] Muhammad Ghulam, “A study on auditory-based feature extraction and HMM/SM based classification for robust speech recognition,” Ph.D. Thesis, Toyohashi University of Technology, Japan, 2006.

[6] S. Young et al., “The HTK Book (for HTK Version 3.4),” Cambridge University Engineering Department, 2006. http://htk.eng.cam.ac.uk/prot-doc/htkbook.pdf.
