Conference Paper · January 2013
DOI: 10.1109/ICCSPA.2013.6487258
Data-Driven Arabic Phoneme Recognition using
Varying Number of HMM States
Khalid M. O. Nahar (nahar@kfupm.edu.sa), Computer Engineering Dept.
Wasfi G. Al-Khatib (wasfi@kfupm.edu.sa), Computer Science Dept.
Moustafa Elshafei (elshafei@kfupm.edu.sa), System Engineering Dept.
Husni Al-Muhtaseb (muhtaseb@kfupm.edu.sa), Computer Science Dept.
King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia

Mansour M. Alghamdi
mghamdi@kacst.edu.sa
Computer Research Institute
King Abdulaziz City for Science and Technology (KACST)
Riyadh 11442, Saudi Arabia

Abstract— Continuous Arabic speech recognition appears in many real-life applications. Its speed, accuracy and improvement are highly dependent on the accuracy of the language phoneme set. The main goal of this research is to recognize and transcribe Arabic phonemes based on a data-driven approach. We built a phoneme recognizer based on a data-driven approach using the HTK tool. Different numbers of Gaussian mixtures with different numbers of HMM states were used in modeling the Arabic phonemes in order to reach the best configuration. The corpus used consists of about 4000 files, representing 5 recorded hours of modern standard Arabic TV news. The maximum phoneme recognition accuracy reached was 56.79%. This result is very encouraging and shows the viability of our approach as compared to using a fixed number of HMM states.

Keywords—Phoneme recognition; Arabic speech recognition; KFUPM Arabic speech corpus; HMM.

I. INTRODUCTION

Phoneme transcription plays an important role in the process of speech recognition, text-to-speech applications and speech database construction [1]. Two traditional methods are usually used for phoneme transcription: the feature input method, which is carried out by the speech recognition task, and the text input method, which is carried out by the grapheme-to-phoneme (G2P) task [2]. Both methods include phone recognition. The maximum accuracy reached in continuous speech recognition with a large vocabulary was 80% [2]. G2P gives more accurate recognition, but relies on a perfect pronunciation lexicon.

Phoneme recognition in continuous speech is not an easy task due to coarticulation and the inherent variability in the pronunciation of some phonemes. Stop consonants are short-duration phonemes that are usually misclassified.

State-of-the-art continuous Arabic speech recognition (ASR) systems recognize modern standard Arabic (MSA), which usually appears in TV news, newspapers, and books. MSA is the formal style of writing and speech across the Arab world [5].

Arabic dialects and accents, however, pose great challenges in building automatic speech recognition (ASR) systems. For example, MSA-based ASR systems suffer from a high word error rate (WER) when applied to those dialects and accents. As a consequence, a data-driven approach needs to be investigated, and its ability to recognize and transcribe Arabic phonemes needs to be assessed in recognizing non-MSA phonemes that exist in other dialects, as well as out-of-vocabulary (OOV) words.

This study examines automatic data-driven transcription of Arabic phonemes by developing a more suitable phoneme recognition process. Multiple experiments were conducted with different parameters, using the Hidden Markov Model Toolkit (HTK). Three parameters were varied in the experimental part: the number of HMM states for each phoneme, the feature vector type and length, and the number of Gaussian mixtures for each phoneme.

The remainder of this paper is organized as follows: Section II describes the problem statement. Section III gives an overview of work related to this research. Section IV details the research methodology and our experimental setup. Section V presents the results of our experimentation. Finally, we conclude the paper and outline future work in Section VI.

II. PROBLEM STATEMENT

Many research efforts have been put into continuous Arabic speech recognition using HMMs, with different rates of accuracy. The recognition accuracy in our case is measured by the percentage of correctly recognized phonemes. ASR accuracy is usually affected by several factors, including noise, the phoneme set used, and the number of HMM states for each phoneme and its length.

In this paper we propose to build a phoneme recognizer based on a data-driven approach using the HTK tool. Different numbers of Gaussian mixtures and different numbers of HMM states are investigated in order to reach the best configuration model for each phoneme.

978-1-4673-2821-0/13/$31.00 ©2013 IEEE

III. RELATED WORK


Phoneme recognition and transcription in continuous speech recognition have been addressed by many researchers. Lee et al. [3] used a large network of trained phonemic HMMs, where the maximum likelihood state sequence of the network gave the recognized phoneme sequence. A fixed multi-state structure with three observation probabilities represented a phoneme. The acoustic variability was better characterized, even though the phoneme duration and structure were not modeled accurately. Levinson et al. [4] used a single large continuous variable-duration hidden Markov model (CVDHMM) with a number of states equal to the number of phonemes in the vocabulary. A single CVDHMM state characterizes the whole phoneme, modeling both the acoustic variability and the durational variability of each phoneme. Levinson et al. reported a reasonable performance because of the external knowledge that was incorporated into the model, such as the phoneme duration and the bigram probabilities. Elmahdy et al. [5] proposed an approach for rapid phonetic transcription of dialectal Arabic, where the Arabic Chat Alphabet (ACA) was used. The proposed approach represented an alternative to conventional phonetic transcription methods. Results showed that the ACA-based approach outperforms the graphemic baseline, while it performs as accurately as the phoneme-based baseline with a slight increase in WER. Bayeh et al. [6] presented a study of manual and data-driven association of words and phonemes. Their study was based on defining the speech recognition system in a target language based on acoustic models from another source language. They found that phoneme-to-phoneme association was more practical, but word transcription provided better results. Experiments were conducted with French as the source language and Arabic as the target language.

Alghamdi et al. [7] proposed a new labeling scheme which was able to cover all the Quranic sounds and their phonological variations. They presented a set of labels that cover all Arabic phonemes and their allophones, and then showed how it can be efficiently used to segment the Quranic corpus. No reported results in the literature could be reached; the authors claimed that the initial results were encouraging.

IV. METHODOLOGY AND EXPERIMENTAL SETUP

The methodology we followed to transcribe Arabic phonemes in a data-driven approach is shown in Fig. 1.

Figure 1. Arabic Phoneme Transcription Methodology

The methodology involves the following steps:

1. Data preparation, which includes defining a proper grammar and building the HMM models (prototypes) for each phoneme.
2. Training the HMM models.
3. Phoneme recognition and transcription.
4. Experimentation to find the best HMM configuration.

Details of Tasks 1-3 are given in the remainder of this section. Section V explains Task 4.

A. Data Preparation

Five-hour recordings of modern Arabic speech from different TV-news clips have been developed and transcribed by King Fahd University of Petroleum & Minerals (KFUPM) with the support of King Abdul-Aziz City for Science and Technology (KACST). The corpus consists of 10-20 millisecond utterances with a 16 kHz sampling rate, along with their corresponding MFCC files and their transcriptions. In addition, it contains a list of Arabic phonemes, an Arabic dictionary and other script files used for manipulating corpus information.

Using CMU Sphinx, the trigram word transcription of all utterances of the corpus was produced. Table 1 shows the transcription of the word "إرتفعت" (EEIHRTAEFAEAIAET) as an example of trigram word transcription done by CMU Sphinx. Providing CMU Sphinx with the MFCC feature files of the training-set corpus and its corresponding transcription, forced alignment was performed by Sphinx to generate the phoneme and word segmentation for all utterances of the training set.

TABLE 1. THE WORD "ارتفعت": WORD SEGMENTATION (LEFT) AND TRIGRAM PHONEME SEGMENTATION (RIGHT).

CMU Sphinx Labeling

Word Segmentation           Phoneme Segmentation
StaFrm  EdFrm  Word         StaFrm  EdFrm  Phone  Prev  Next
0       12     <s>          0       12     SIL
13      15     <s>          13      15     SIL
16      59     اِرتَفَعَت         16      21     E      SIL   IH
                            22      24     IH     E     R
                            25      27     R      IH    T
                            28      30     T      R     AE
                            31      33     AE     T     F
                            34      36     F      AE    AE
                            37      39     AE     F     AI
                            40      44     AI     AE    AE
                            45      51     AE     AI    T
                            52      59     T      AE    E
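The Sphinx frame indices of Table 1 are later mapped onto HTK's time units (HTK label times are integer multiples of 100 ns, so one 10 ms analysis frame corresponds to 100000 units). A minimal Python sketch of that conversion, under those assumptions:

```python
FRAME_100NS = 100000  # one 10 ms frame expressed in HTK's 100 ns units

def sphinx_to_htk(rows):
    """Map (start_frame, end_frame, phone) triples from CMU Sphinx
    to (start, end, phone) triples in HTK 100 ns label units."""
    return [(start * FRAME_100NS, end * FRAME_100NS, phone)
            for start, end, phone in rows]

# Example: the /E/ segment from Table 1 (frames 16-21)
print(sphinx_to_htk([(16, 21, "E")]))  # [(1600000, 2100000, 'E')]
```

The example reproduces the corresponding /E/ row of Table 2 below.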
CMU Sphinx produces the start and end frame numbers associated with each phoneme utterance, as shown in Table 1. HTK, on the other hand, uses start and end times that are integer multiples of 100 ns, as shown in Table 2. Since the output of CMU Sphinx is fed as input to the HTK tool, we had to transform the CMU Sphinx frame labeling into its equivalent HTK tool labeling and remove all other columns. Table 2 shows the HTK tool labeling of the word "إرتفعت".

TABLE 2. THE CORRESPONDING HTK LABELING OF THE WORD IN TABLE 1.

HTK Labeling
StartFrm   EndFrm    Phone
00000      1200000   SIL
1300000    1500000   SIL
1600000    2100000   E
2200000    2400000   IH
2500000    2700000   R
2800000    3000000   T
3100000    3300000   AE
3400000    3600000   F
3700000    3900000   AE
4000000    4400000   AI
4500000    5100000   AE
5200000    5900000   T

Another issue to handle was the incompatibility between the MFCC feature files included in the corpus and the MFCC feature files accepted by the HTK tool. Therefore, we had to regenerate the MFCC files from the WAV files of the corpus using the HCopy command in HTK. It is worth mentioning that the HCopy command requires proper configuration parameters according to the input WAV files. Table 3 shows the configuration parameters that were most suitable for the corpus WAV files.

TABLE 3. CONFIGURATION PARAMETERS FOR THE HCOPY COMMAND

HTK Tool-HCopy Configuration Parameters
Variable          Value
TARGETKIND        MFCC_0_D_A
TARGETRATE        100000.0
SAVECOMPRESSED    F
SAVEWITHCRC       F
WINDOWSIZE        250000.0
USEHAMMING        F
PREEMCOEF         0.97
NUMCHANS          26
CEPLIFTER         22
NUMCEPS           12
ENORMALIZE        F

Regarding the output MFCC feature files, HTK supports many different types. The two most commonly used types in speech recognition are the basic type, MFCC_0, with a 13-feature frame length, and the extended one, MFCC_0_D_A, with a 39-feature frame length. Since CMU Sphinx represents each phoneme as a triphone, we selected the MFCC_0_D_A type¹.

¹ The '0' character means 13 features per frame, 'D' stands for Delta (first difference) and 'A' stands for Acceleration (second difference).

Finally, the corpus was divided into two parts: one part for training and the other for testing. The size of the training part was 70% of the corpus; the remaining 30% was used for testing.

1) Building HMM Models

We created an HMM definition file (prototype) for each Arabic phoneme. Initially, we used a 3-emitting-state HMM with one Gaussian mixture for all the phonemes. The 3-emitting-state HMM is represented by 5 states in the HTK tool (one entry state, 3 emitting states and one final state). The feature vector used was MFCC_0, with 13 feature values for each frame. Fig. 2 shows a simple HMM prototype file of 3 emitting states. The function of the prototype is to describe the form and topology of the HMM. The initial numbers representing transition state probabilities used in the definition are not important, since they will be updated by the training process.

Different numbers of Gaussian mixtures were used along with different numbers of HMM states in modeling the Arabic phonemes.

Figure 2. Simple HMM prototype

2) Defining proper grammar, dictionary, and Network Lattices

Using the HTK grammar definition language, we define the grammar for each Arabic utterance as shown in Fig. 3.

Figure 3. Recognition Grammar
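Figure 3 itself is not reproduced in this extraction; based on the description that follows, a recognition grammar of the kind used here would look roughly like the sketch below (the phoneme alternatives are abbreviated; the full list is that of Table 4):

```
$PHONEME = SIL | +INH+ | AE | AE: | B | T | IH | R | F | AI ;
( SIL < $PHONEME > SIL )
```

This is our illustrative sketch, not the paper's exact grammar file.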
The first part of the grammar is the variable PHONEME, which defines all possible Arabic phones, including silence (SIL) and inhalation (+INH+). The character '|' represents the logical OR operator. The second part represents the utterance, where two silences are added, one at the beginning and one at the end of each utterance. In the middle is the reference to the variable PHONEME inside the characters '< >', which represent repetition. Fig. 4 shows a mapping example on the Arabic recognition grammar.

Figure 4. Utterance at the upper part, the corresponding phonemes and the grammar rule for the transcription

Using the HTK dictionary definition rules, we define the phonemes' dictionary. It consists of two identical columns containing the list of Arabic phonemes, following the HTK tool instructions for recognition at the phoneme level [8]. Fig. 5 shows part of the dictionary.

Figure 5. Part of the Dictionary

The dictionary could be generated automatically through the HTK. It could also be prepared manually using any text editor [8].

The phoneme lattice network is generated after setting the proper grammar and the dictionary. The lattice network is generated by providing the dictionary and the grammar as input to the HTK HParse and HSGen commands. The HSGen command is then used to confirm the correctness of the grammar by automatically generating a predefined number of utterances.

B. Training HMMs

The HMM models of the phonemes are first initialized. The HTK tool supports two types of initialization. The first type is applied when the boundaries of the phonemes are known. The second is applied when the boundaries of the phonemes are not known, and is called flat initialization. Since the phoneme boundaries of our data are known, the first type of initialization is applied, using the HInit command of HTK.

Fig. 6 shows the needed inputs and the resulting outputs of the initialization process. After initializing all the HMMs, the first iteration of the iterative Baum-Welch algorithm is executed using the HTK HRest command.

Figure 6. Initialization from prototype [9]

Its functionality is very similar to that of the HTK HInit command, except that it expects the input HMM definition to have already been initialized. It uses the Baum-Welch re-estimation algorithm in place of the Viterbi training [8]. All initialized HMMs are put in one file called the Master Macro File (MMF file).

Next, we carry out our phoneme recognition and transcription task, shown in Fig. 1, in order to determine the most accurate configuration achievable. Fig. 7 shows details of this process. The training module receives the training MFCC files containing the features of the segmented speech, its corresponding text, and the initialized HMM for each phoneme. The training module produces trained HMMs. The trained HMMs and the defined grammar are used to recognize the MFCC files extracted from the testing-set utterances, producing the resulting transcription.

Figure 7. Phonemes Recognition and Transcription Process

This process is repeated for different HMM configurations until no significant improvement is achieved.

V. EXPERIMENTAL RESULTS

A set of experiments was conducted. Each experiment has its own configuration, based on the study and analysis of
the results of the preceding experiments. A total of 25 experiments were conducted in order to find the best configuration. Below, we report the results of the four experiments that showed the most significant improvements based on our sensitivity analysis. This was done manually, by investigating the confusion matrix of each experiment.

A. Experiment 1: Basic feature vector with 3 HMM-emitting states

In this experiment, we used 3 HMM emitting states for all phonemes, and extracted the 13-value basic features from the WAV files. No Gaussian mixtures were used. The percentage of correctly recognized phonemes was 28.4%. A portion of the output file generated by the HTK HResults command for aligning the labeled tested utterances with the reference labeling is shown in Fig. 8. This low recognition rate was due to the usage of the basic features, which were not an adequate set of features to suitably represent the WAV files for different phonemes.

Figure 8. First Experiment Results

Once the optimal alignment is reached, the insertion errors (I), substitution errors (S), deletion errors (D) and the total number of labels in the reference transcription (N) are calculated and reported as shown above. The H value in this output refers to the number of correct hits, i.e., the number of correctly recognized phonemes. The percentages of correctness (denoted %Corr) and accuracy (denoted %Acc) are based on the values of I, S, D and N. These percentages are computed as follows [8]:

    Percent(Correct)  = (N - D - S) / N x 100 %        (1)

    Percent(Accuracy) = (N - D - S - I) / N x 100 %    (2)

B. Experiment 2: Extended feature vector with 3 HMM-emitting states

In the second experiment, we used the extended feature vector instead of the basic one, with 4-Gaussian mixtures. The recognition rate increased to 43.3%. We conclude that this significant increase in accuracy is attributed to our use of the extended feature vector of 39 features to represent the wave files. Therefore, all subsequent experiments were conducted with the extended feature vector.

C. Experiment 3: Extended feature vector with varying number of HMM states I

In the third experiment, we used the extended feature vector with 4-Gaussian mixtures. In addition, we used different numbers of emitting states, as follows: the phonemes representing long and intermediate vowels, such as /AA:/, /AE:/, /AY/, /UW/, /AW/ and /AH:/, were assigned 4 HMM emitting states. The unvoiced stop phonemes /E/, /T/, /K/, and /Q/ were assigned 1 HMM emitting state, whereas the voiced stop phonemes /B/, /D/, and /DD/ were assigned 2 HMM emitting states. The remaining phonemes were assigned 3 HMM emitting states. Table 4 shows the complete list of the Arabic phonemes with their corresponding Arabic letters, including some examples of Arabic vowel allophones. The percentage of correctly recognized phonemes in this experiment increased to 51.3%. This shows that using different numbers of HMM emitting states had a positive impact on the correctness.

TABLE 4. ARABIC PHONEME LIST

Phoneme  Arabic letter  Example      Phoneme  Arabic letter
/AE/     ‫َـ‬              ‫بَ‬           /Z/      ‫ز‬
/AE:/    ‫ـَا‬             ‫بَاب‬          /S/      ‫س‬
/AA/     ‫َـ‬              ‫خَ‬           /SH/     ‫ش‬
/AA:/    ‫َـ‬              ‫خَاب‬          /SS/     ‫ص‬
/AH/     ‫َـ‬              ‫قَ‬           /DD/     ‫ض‬
/AH:/    ‫َـ‬              ‫قَال‬          /TT/     ‫ط‬
/UH/     ‫ُـ‬              ‫بُ‬           /DH2/    ‫ظ‬
/UW/     ‫ـُو‬             ‫دُون‬          /AI/     ‫ع‬
/UX/     ‫ُـ‬              ‫غُصن‬          /GH/     ‫غ‬
/IH/     ‫ِـ‬              ‫بِنت‬          /F/      ‫ف‬
/IY/     ‫ـِي‬             ‫فِيل‬          /V/      -
/IX/     ‫ِـ‬              ‫صِنف‬          /Q/      ‫ق‬
/AW/     ‫ـَو‬             ‫لَوم‬          /K/      ‫ك‬
/AY/     ‫ـَي‬             ‫صَيف‬          /L/      ‫ل‬
/E/      ‫ء‬              ‫دفء‬          /M/      ‫م‬
/B/      ‫ب‬                           /N/      ‫ن‬
/T/      ‫ت‬                           /H/      ‫هـ‬
/TH/     ‫ث‬                           /W/      ‫و‬
/JH/     ‫ج‬                           /Y/      ‫ي‬
/G/      ‫ج‬
/ZH/     ‫ج‬
/HH/     ‫ح‬
/KH/     ‫خ‬
/D/      ‫د‬
/DH/     ‫ذ‬
/R/      ‫ر‬

D. Experiment 4: Extended feature vector with varying number of HMM states II

Based on the confusion matrix of Experiment 3, it was evident that the phoneme /T/ requires no more than one emitting state. In addition, the number of emitting states for the phonemes /E/, /K/ and /Q/ was changed to 3, instead of 1. Similarly, the phonemes /B/, /D/, and /DD/ were changed to 3, instead of 2, emitting states. The number for the phoneme /AA:/ was reduced from 4 to 3 emitting states. Finally, the phonemes /UW/ and /AE/ were assigned 4 HMM emitting states. With these changes, the percentage of correctness increased to 56.79%.
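The per-phoneme state assignments that Experiment 4 converged on can be summarized in a small lookup; this is our sketch (the function name is ours), with phoneme names following Table 4:

```python
def emitting_states(phone):
    """Emitting-state count per phoneme after Experiment 4's adjustments."""
    # Vowels kept at, or raised to, 4 emitting states
    four_states = {"AE:", "AY", "AW", "AH:", "UW", "AE"}
    if phone in four_states:
        return 4
    if phone == "T":  # /T/ needs no more than one emitting state
        return 1
    # All remaining phonemes, including /E/, /K/, /Q/ (raised from 1),
    # /B/, /D/, /DD/ (raised from 2) and /AA:/ (reduced from 4)
    return 3
```

A table like this, fed into the prototype-generation step, is what distinguishes the varying-state configuration from the fixed 3-state baseline of Experiments 1 and 2.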
Table 8 summarizes the results of the experiments that were conducted.

TABLE 8. EXPERIMENT SUMMARY

Exp No.  HMM No. States           Feature Vector  Vector Length  Gauss. Mix.  Accuracy
1        3                        MFCC_0          13             1-GM         28.4%
2        3                        MFCC_0_D_A      39             4-GM         43.3%
3        Different (1, 2, 3, 4)   MFCC_0_D_A      39             4-GM         51.3%
4        Changed based on Exp. 3  MFCC_0_D_A      39             4-GM         56.79%

VI. CONCLUSION AND FUTURE WORK

In this paper, we used a data-driven approach to recognize and transcribe Arabic phonemes. We presented the process of phoneme recognition using HMMs, implemented with the HTK toolkit. Four experiments were conducted using different HMM parameters. With a fixed number of HMM states, the maximum achieved phoneme recognition accuracy was 43.3%. When using a varying number of HMM states, this rate increased to 56.79%. A grammar, word network lattices and a dictionary were generated for the purpose of Arabic phoneme transcription.

It is worth noting that the results achieved above were at the phoneme level and without using a language model. We strongly believe that using a language model will result in a great improvement in phoneme recognition accuracy. In addition, we look forward to studying the viability of our proposed methodology on out-of-vocabulary word recognition, and subsequently on non-MSA (dialectal speech) recognition, in the near future.

REFERENCES

[1] D. Y. Kim et al., "Development of the CU-HTK 2004 broadcast news transcription systems," in Proc. ICASSP 2005, pp. 861-864.
[2] M.-S. Liang, R.-Y. Lyu, and Y.-C. Chiang, "Using speech recognition technique for constructing a phonetically transcribed Taiwanese (Min-nan)," in Proc. Ninth International Conference on Spoken Language Processing (Interspeech 2006-ICSLP), September 2006.
[3] K. F. Lee and H. W. Hon, "Speaker-independent phone recognition using HMMs," IEEE Trans. ASSP, pp. 1641-1648, November 1989.
[4] S. E. Levinson, M. Y. Liberman, A. Ljolje, and L. G. Miller, "Speaker independent phonetic transcription of fluent speech for large vocabulary speech recognition," in Proc. ICASSP'89, pp. 441-444, 1989.
[5] M. Elmahdy, R. Gruhn, S. Abdennadher, and W. Minker, "Rapid phonetic transcription using everyday life natural Chat Alphabet orthography for dialectal Arabic speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4936-4939, 22-27 May 2011. doi: 10.1109/ICASSP.2011.5947463.
[6] R. Bayeh, S. Lin, G. Chollet, and C. Mokbel, "Towards multilingual speech recognition using data driven source/target acoustical units association," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 1, pp. I-521-4, 17-21 May 2004. doi: 10.1109/ICASSP.2004.1326037.
[7] D. Abuzeina, W. Al-Khatib, M. Elshafei, and H. Al-Muhtaseb, "Within-word pronunciation variation modeling for Arabic ASRs: a direct data-driven approach," International Journal of Speech Technology, vol. 15, no. 2, pp. 65-75, June 2012.
[8] S. Young et al., The HTK Book (Version 3.4). University of Cambridge, 2006.
[9] http://www.info2.uqam.ca/~boukadoum_m/DIC9315/Notes/Markov/HTK_basic_tutorial.pdf
