Varying Number of HMM2015
5 AUTHORS, INCLUDING:
Mansour M. Alghamdi
mghamdi@kacst.edu.sa
Computer Research Institute
King Abdulaziz City for Science and Technology (KACST)
Riyadh 11442, Saudi Arabia
Abstract— Continuous Arabic speech recognition appears in many real-life applications. Its speed, accuracy, and improvement are highly dependent on the accuracy of the language phoneme set. The main goal of this research is to recognize and transcribe the Arabic phonemes based on a data-driven approach. We built a phoneme recognizer based on a data-driven approach using the HTK tool. Different numbers of Gaussian mixtures with different numbers of HMM states were used in modeling the Arabic phonemes in order to reach the best configuration. The corpus used consists of about 4000 files, representing 5 recorded hours of Modern Standard Arabic TV news. The maximum phoneme recognition accuracy reached was 56.79%. This result is very encouraging and shows the viability of our approach as compared to using a fixed number of HMM states.

Keywords—Phoneme recognition; Arabic Speech Recognition; KFUPM Arabic speech corpus; HMM.

I. INTRODUCTION

Phoneme transcription plays an important role in speech recognition, text-to-speech applications, and speech database construction [1]. Two traditional methods are usually used for phoneme transcription: the feature input method, which is carried out by the speech recognition task, and the text input method, which is carried out by the Grapheme-to-Phoneme (G2P) task [2]. Both methods include phone recognition. The maximum accuracy reached in continuous speech recognition with large vocabulary was 80% [2]. G2P gives more accurate recognition, but relies on a perfect pronunciation lexicon.

Phoneme recognition in continuous speech is not an easy task due to coarticulation and the inherent variability in the pronunciation of some phonemes. Stop consonants are short-duration phonemes that are usually misclassified.

State-of-the-art continuous Arabic speech recognition (ASR) systems recognize Modern Standard Arabic (MSA), which usually appears in TV news, newspapers, and books. MSA is the formal style of writing and speech across the Arab world [5].

Arabic dialects and accents, however, pose great challenges in building automatic speech recognition (ASR) systems. For example, MSA-based ASR systems suffer from a high word error rate (WER) when applied to those dialects and accents. As a consequence, a data-driven approach needs to be investigated, and its ability to recognize and transcribe Arabic phonemes needs to be assessed in recognizing non-MSA phonemes that exist in other dialects and out-of-vocabulary (OOV) words.

This study examines automatic data-driven transcription of the Arabic phonemes by developing a more suitable phoneme recognition process. Multiple experiments were conducted with different parameters, using the Hidden Markov Model Toolkit (HTK). Three parameters were varied in the experiments: the number of HMM states for each phoneme, the feature vector type and length, and the number of Gaussian mixtures for each phoneme.

The remainder of this paper is organized as follows. Section 2 describes the problem statement. Section 3 gives an overview of work related to this research. Section 4 details the research methodology and our experimental setup. Section 5 presents the results of our experimentation. Finally, we conclude the paper in Section 6 and outline future work.

II. PROBLEM STATEMENT

Much research effort has been put into continuous Arabic speech recognition using HMMs, with different rates of accuracy. The recognition accuracy in our case is measured by the percentage of correctly recognized phonemes. ASR accuracy is usually affected by several factors, including noise, the phoneme set used, the number of HMM states for each phoneme and its length.

In this paper we propose to build a phoneme recognizer based on a data-driven approach using the HTK tool. Different numbers of Gaussian mixtures and different numbers of HMM
The ‘0’ character means 13 features per frame, ‘D’ stands for Delta (first difference), and ‘A’ stands for Acceleration (second difference).
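The feature layout described in this note can be sketched in a few lines of Python: 13 static coefficients per frame, followed by 13 delta and 13 acceleration coefficients, for 39 values in total. The regression-style delta formula, the two-frame window, and the edge handling (repeating the first and last frame) are assumptions made for illustration; the text only states that the deltas are first and second differences. The function names `deltas` and `stack_0_d_a` are hypothetical.

```python
def deltas(frames, window=2):
    """Regression-style time derivative of a list of feature frames.

    frames: list of equal-length lists of floats (one list per frame).
    window: number of frames on each side (assumed value, not from the paper).
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]  # repeat last frame at the end
                prv = frames[max(t - k, 0)][d]      # repeat first frame at the start
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_0_d_a(static, window=2):
    """Append delta and acceleration coefficients to each static frame."""
    d = deltas(static, window)   # first difference ("D")
    a = deltas(d, window)        # second difference ("A")
    return [s + dd + aa for s, dd, aa in zip(static, d, a)]


# Toy input: 6 frames of 13 static coefficients forming a linear ramp.
static = [[float(t)] * 13 for t in range(6)]
feats = stack_0_d_a(static)
print(len(feats), len(feats[0]))
```

Each output frame has 13 + 13 + 13 = 39 entries, which matches the vector length of 39 listed for the MFCC-0-D-A configuration.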
The first part of the grammar is the variable PHONEME, which defines all possible Arabic phones, including silence (SIL) and inhalation (+INH+). The character ‘|’ represents the logical OR operator. The second part represents the utterance, where two silences are added, one at the beginning and one at the end of each utterance. In the middle is the reference to the variable PHONEME inside the characters ‘< >’, which represent repetition. Fig. 4 shows a mapping example on the Arabic recognition grammar.

Fig. 6 shows the needed inputs and the resulting outputs of the initialization process. After initializing all the HMMs, the first iteration of the iterative Baum-Welch algorithm is executed using the HTK HRest command.
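As an illustration of the recognition grammar described above (a PHONEME alternation built with ‘|’, looped inside ‘< >’ between two silences), the following Python sketch generates the grammar text from a phone list and checks whether a phone sequence fits it. Only SIL and +INH+ come from the text; the other phone labels, the function names, and the exact output layout are hypothetical, and whether a label such as +INH+ needs quoting in a real grammar file is not checked here.

```python
def build_grammar(phones):
    """Return grammar text: a PHONEME variable plus a silence-bounded loop."""
    alternatives = " | ".join(phones)  # '|' is the logical OR between phones
    return f"$PHONEME = {alternatives};\n( SIL < $PHONEME > SIL )\n"


def accepts(sequence, phones):
    """Check a phone sequence against the grammar's shape.

    The sequence must start and end with SIL and contain at least one
    phone from the PHONEME set in between (the '< >' repetition).
    """
    if len(sequence) < 3:
        return False
    if sequence[0] != "SIL" or sequence[-1] != "SIL":
        return False
    allowed = set(phones)
    return all(p in allowed for p in sequence[1:-1])


# Placeholder phone set: "b" and "a" are illustrative labels only.
phones = ["SIL", "+INH+", "b", "a"]
print(build_grammar(phones))
print(accepts(["SIL", "b", "a", "+INH+", "SIL"], phones))
```

Because the loop places no constraint on phone order, any phone may follow any other; the grammar only enforces the silence at each end of the utterance.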
3    Different (1,2,3,4)    MFCC-0-D-A    39    4-GM    51.3%
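The paper measures recognition accuracy as the percentage of correctly recognized phonemes. A minimal Python sketch of such a score is given below: it aligns a reference phone sequence with a recognized one by minimum edit distance, counts hits, substitutions, deletions, and insertions, and reports both percent correct (hits over reference length) and an accuracy figure that also penalizes insertions. The unit edit costs, the tie-breaking, and the function names are assumptions for illustration; HTK's own scoring tool uses its own alignment penalties, and this section does not say which of the two figures the reported percentages correspond to.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for a
    minimum-edit alignment of ref against hyp, using unit costs."""
    n, m = len(ref), len(hyp)
    # Each cell holds (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        c = cell[i - 1][0]
        cell[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1, c[4])
    for j in range(1, m + 1):
        c = cell[0][j - 1]
        cell[0][j] = (c[0] + 1, c[1], c[2], c[3], c[4] + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (d[0], d[1] + 1, d[2], d[3], d[4])      # hit
            else:
                diag = (d[0] + 1, d[1], d[2] + 1, d[3], d[4])  # substitution
            u = cell[i - 1][j]
            dele = (u[0] + 1, u[1], u[2], u[3] + 1, u[4])      # deletion
            l = cell[i][j - 1]
            ins = (l[0] + 1, l[1], l[2], l[3], l[4] + 1)       # insertion
            cell[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    return cell[n][m][1:]


def phone_scores(ref, hyp):
    """Return (percent correct, insertion-penalized accuracy)."""
    hits, _subs, _dels, ins = align_counts(ref, hyp)
    n = len(ref)
    return 100.0 * hits / n, 100.0 * (hits - ins) / n


# Placeholder sequences: the phone labels are illustrative only.
reference = ["SIL", "b", "a", "b", "SIL"]
recognized = ["SIL", "b", "a", "SIL"]
print(phone_scores(reference, recognized))
```

Keeping the four counts separate makes it possible to see whether a configuration loses accuracy mainly through substitutions or through inserted phones.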