
Spoken Language Identification Using Ergodic Hidden Markov Models

Ahmed Y. Hashem, M. Waleed Fakhr


Arab Academy for Science and Technology, Heliopolis, Cairo, Egypt
a_youssef83@yahoo.com, waleedf@aast.edu

Sherif Abdou
College of Computing and IT, Cairo University Sherif.Abdou@rdi-eg.com

Abstract: Spoken language identification (LID) is the process of identifying the language spoken within an utterance. This task should be performed without any prior information about the speech utterance. This paper investigates and compares the efficiency of two Ergodic Hidden Markov Model (HMM) approaches that we use to differentiate between the Arabic and English languages. We also apply the two techniques to discriminate between different Arabic dialects, namely Egyptian Arabic, Gulf Arabic, and Iraqi Arabic.

Keywords: Spoken language identification, Hidden Markov Model (HMM), Ergodic HMM, Arabic dialects.

I. INTRODUCTION
Spoken language identification has numerous applications. One important application arises in call centers dealing with speakers of different languages. Another is indexing large speech archives that contain multiple languages. A third application with great potential is the human-machine interface [1]. At present the computer can receive input from the mouse or keyboard. More recently, speech-enabled technology has emerged as a promising alternative, but speech processing technology is still unable to support applications across a wide range of languages, which can in practice prevent a user from communicating with the computer. A LID system can be used here to redirect the speech to a speech recognition system built for the detected language, thus enabling communication between the speaker and the computer. Unfortunately, the LID task often requires familiarity with the subset of languages involved, which is impractical for a large portion of the world's population. Also, phoneme sounds of different languages may be similar in pronunciation, which complicates the identification process. This problem arises even when identifying dialects of the same language.

In this paper we investigate different HMM systems for spoken LID. Ergodic HMM models are used so that each state models some distinct sounds in the language. In this approach, the state observation probabilities model the acoustics of the language, while the state transition probabilities, and to some extent the mixture weights, model some phonetic properties of the same language. In doing this, we try to combine the acoustic and phonetic approaches used in the literature. We extend our work to differentiate between Arabic dialects, namely Egyptian Arabic, Iraqi Arabic, and Gulf Arabic.

This paper is organized as follows: section II describes related work in the phonetic and acoustic approaches; section III explains the proposed techniques; section IV describes the experimental setup; section V presents the experimental results; and section VI gives the conclusion.

II. LITERATURE OVERVIEW


There are two main approaches used for spoken language identification (LID): the phonetic approach and the acoustic approach.

II.1 Phonetic approach


The phonetic approach focuses on the occurrence of phone sequences in speech. Speech is tokenized into phonemes, even if the target language is unknown, by one or more phone recognizers whose outputs are then scored by N-gram language models. With multiple recognizers this process is known as Parallel Phone Recognition followed by Language Modeling (PPRLM). Zissman [2] and Tucker [3] used a single phone recognizer followed by n-gram language modeling. Later, Zissman [2, 4] and Yan [5, 6] extended this by using multiple language-dependent phone recognizers, where k phone recognizers, each trained on a different language, are used instead of a single phone recognizer. Harbeck and Ohler [7] presented two new approaches for language identification, both based on multigrams. In the first approach they used multigram models for phonotactic modeling of phoneme sequences; the multigram model segments a new observation into larger units (e.g., words) and calculates a probability for the best segmentation. In the second approach they built a fenon recognizer using the segments of the best segmentation of the training material as words in the recognition vocabulary. English and German from OGI Stories were used for testing, with accuracies of 73% on 10-second utterances and 84% on 30-second utterances; using interpolated 3-grams they obtained 84% and 91%, respectively. Zissman et al. [8] showed that the PRLM approach produces good results when classifying Cuban and Peruvian dialects of Spanish using an English phone recognizer; the recognition accuracy of this system over the two dialects was 84%.

Recently, more research has been conducted in the area of dialect identification. The task of dialect identification is to recognize the regional dialect of a speaker given a sample of his speech. This task is more difficult than language identification because there are great similarities between dialects of the same language. Biadsy and Hirschberg [9] described a system that automatically identifies the Arabic dialect of a speaker given a sample of his speech. The system adopts the phonotactic approach to distinguish between Arabic dialects. In their implementation a logistic regression classifier is employed as the back-end combiner, which proved superior to SVMs and neural networks. They hypothesized that using multiple phone recognizers allows the system to capture the subtle phonetic differences that might be crucial to distinguish dialects.
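The PPRLM idea above can be sketched compactly: each phone recognizer's token stream is scored by a per-language n-gram model, and the language whose model assigns the highest likelihood wins. Below is a minimal, hypothetical Python sketch using add-one-smoothed bigrams over phone tokens; the toy phone inventories and function names are illustrative, not taken from the paper.

```python
import math
from collections import defaultdict

def train_bigram_lm(phone_sequences):
    """Train an add-one-smoothed bigram model over phone tokens.

    Returns a scoring function that maps a phone sequence to its
    log-likelihood under this language model.
    """
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for seq in phone_sequences:
        tokens = ["<s>"] + seq + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    vocab_size = len(vocab)

    def log_prob(seq):
        tokens = ["<s>"] + seq + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            row = counts[prev]
            # Add-one smoothing so unseen bigrams get a small nonzero mass.
            total += math.log((row[cur] + 1) / (sum(row.values()) + vocab_size))
        return total

    return log_prob

def identify(phone_seq, lms):
    """Return the language whose LM gives the highest log-likelihood."""
    return max(lms, key=lambda lang: lms[lang](phone_seq))
```

In a full PPRLM system there would be one such language model per (phone recognizer, target language) pair, with the scores combined by a back-end classifier; this sketch shows only the phonotactic-scoring core.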

II.2 Acoustic approach

The acoustic approach focuses on modeling the acoustics of features extracted from speech. Torres-Carrasquillo et al. [10] developed an alternative system that identifies two Spanish dialects (Cuban and Peruvian) using GMMs with shifted delta cepstral features. Alorfi [11] used an Ergodic HMM to model the phonetic differences between two Arabic dialects (Gulf and Egyptian Arabic), employing standard MFCC (Mel Frequency Cepstral Coefficients) and delta features. The accuracy of this system reached 96.67% on these two dialects. Kanokphara and Carson-Berndsen [12] presented an alternative approach to HMM-based language identification in which articulatory features (AF) are used instead of phoneme models. They trained HMMs from AF transcriptions, with the models for each language trained independently. The benefit of AFs is that they are similar across many languages, so portability from one language to another is easier, and the processing time for AF-HMM based systems is shorter than that required for phoneme-based systems. SantoshKumar and Ramasubramanian [13] proposed two types of Ergodic HMMs (EHMM), GMM and HMM, for automatic spoken language identification. The states of the EHMM correspond to acoustic units of a language, and its state transitions represent the bigram language model of the unit sequences. They used a segmental K-means algorithm for training both types of Ergodic HMM.

III. PROPOSED TECHNIQUE

III.1 General Framework
Acoustic LID aims at capturing the essential differences between languages by modeling the distributions of spectral features. Typically this process consists of two stages: stage one, extracting a language-independent set of spectral features from speech segments; and stage two, using a classifier to identify patterns in these features and thus classify the language they belong to. Assume we have a classifier and an acoustic feature vector sequence X = {x_1, ..., x_T}. The acoustic classification rule [14], given the acoustic evidence X, becomes

l* = arg max_{1 <= l <= L} P(X | λ_l)

Stage one is referred to as the training stage, where a model λ_l is trained with features extracted from the speech training data of each language. During stage two, which is referred to as the testing stage, the same kind of speech features X are first extracted from the unknown utterance. The feature set is then compared against the model set {λ_l}, l = 1, ..., L, where L is the number of possible languages to be recognized by the system. Finally, the most likely model is selected according to the equation above, and the language l*, represented by the selected model, is identified as the language spoken in the unknown utterance. According to this technique, an extremely important issue arises: how and what features are to be extracted from the speech data.
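The two-stage framework can be illustrated with a minimal sketch in which a single diagonal Gaussian stands in for each language model (a real system, like the one in this paper, would use a GMM or HMM); the classifier simply picks the language whose model assigns the unknown frames the highest log-likelihood. All names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class DiagonalGaussianModel:
    """Stand-in acoustic model: one diagonal Gaussian over feature frames."""

    def fit(self, frames):
        # frames: (N, D) array of training feature vectors (stage one).
        self.mean = frames.mean(axis=0)
        self.var = frames.var(axis=0) + 1e-6   # variance floor avoids division by zero
        return self

    def log_likelihood(self, frames):
        # Sum of per-frame Gaussian log-densities log P(x_t | model).
        d = frames - self.mean
        per_frame = -0.5 * (np.log(2 * np.pi * self.var) + d ** 2 / self.var).sum(axis=1)
        return per_frame.sum()

def classify(frames, models):
    """Acoustic classification rule: pick the language maximizing P(X | model)."""
    return max(models, key=lambda lang: models[lang].log_likelihood(frames))
```

Replacing `DiagonalGaussianModel` with an HMM whose likelihood is computed by the forward algorithm recovers the HMM-based LID classifier used in this paper.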

Basically, an HMM-based LID system is a classifier in which each class, representing a language, is modeled by an HMM. Language classification is performed according to the likelihood scores calculated by the language HMMs for a given feature vector. To determine the language during LID testing, multiple feature vectors are used and the likelihood scores are accumulated for each language. In this paper, Ergodic HMM topologies are used since we are trying to model the basic sounds of a language. The sequence of basic sounds may occur in any arbitrary order, hence the need for an Ergodic HMM, meaning that the model can make a transition from any state to any other state.
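To make the topology concrete, here is a small, hypothetical helper contrasting an ergodic transition matrix (every entry strictly positive, so any state can follow any other) with a left-to-right matrix of the kind used in speech recognition; in both cases each row must sum to one. This is a sketch for illustration only.

```python
import numpy as np

def ergodic_transitions(n_states, rng=None):
    """Random ergodic transition matrix: every a_ij > 0, rows sum to 1."""
    rng = rng or np.random.default_rng()
    A = rng.random((n_states, n_states)) + 0.1  # offset keeps all entries positive
    return A / A.sum(axis=1, keepdims=True)     # row-normalize to probabilities

def left_to_right_transitions(n_states):
    """Left-to-right matrix for contrast: only self-loops and forward steps."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = 0.5                            # self-loop
        A[i, min(i + 1, n_states - 1)] += 0.5    # step forward (last state absorbs)
    return A
```

In the ergodic case the zero pattern imposes no ordering on the states, which is what lets each state specialize on a distinct sound of the language regardless of where it occurs in an utterance.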

III.2 Ergodic Hidden Markov Models (EHMM):


This is the first approach we used. An HMM is a sequence of states at successive times [15]; the state at time t is denoted by ω(t), and a particular sequence of length T is denoted by ω^T = {ω(1), ω(2), ..., ω(T)}. Note that the system can revisit a state at different steps, and not every state needs to be visited. The model for the production of any sequence is described by the time-independent transition probabilities

a_ij = P(ω_j(t+1) | ω_i(t)),

the probability of being in state ω_j at step t+1 given that the state at time t was ω_i. There is no requirement that the transition probabilities be symmetric (a_ij ≠ a_ji, in general), and a state may be visited in succession, as shown in the following figure.

Training of the Ergodic models is done using two steps:
1. Segmental K-means (Viterbi training), where we use the data of each language to train the model of that language. We allow all parameters to adapt.
2. Baum-Welch (BW) training for each model with its own language data, also allowing all parameters to adapt.

III.3 Ergodic Hidden Markov Models with Selective Parameter Training (EHMM-SPT):
In this approach we divide the training phase into four steps:
1. Segmental K-means (Viterbi training), where we use all the data to train the two models.
2. Re-initializing the transition matrices (T) of each model.
3. Baum-Welch (BW) training for each model separately with its own language data.
4. Selective parameter training within the BW step, where we adapt the mixture weights only (W), the transition matrices only (T), both the weights and the transition matrices (WT), or all parameters (All).

IV. EXPERIMENTAL SETUP

In all experiments we used MFCC features with c0, delta, and delta-delta coefficients. The basic cepstrum features are 12, giving a total of 39 features; this configuration has shown the best results in the literature [1, 10, 11]. The English training and testing data used in our experiments is from the VoxForge corpus [16]. The Gulf Arabic and Iraqi Arabic data is from the NIST 2003 LDC corpus [17, 18, 19]. The Egyptian Arabic data is from the RDI company [20]. For English and for each dialect, we used 900 utterances of speech data, each of 5 seconds, for the training phase, and 500 utterances, each of 5 seconds, for testing. In the front-end processing, each frame was 20 ms with 10 ms overlap, and Hamming windows were used. For the first approach we conducted several experiments using one- to six-state HMMs with different numbers of Gaussians (2, 4, 6, and 8). For the second approach, we used 4, 6, 8, and 10 states while keeping the model complexity almost constant, on the order of 20 Gaussians. The convention used for the HMM experiments is HMM (a, b), where a is the number of states and b is the number of Gaussians per state.

Most of the data used in this work comes from telephone calls, which include a large percentage of silence frames. We built a model for silence using 15 audio files that contained only silent frames. We added the silence model to the recognition grammar to provide more accuracy during testing. The following notation shows the grammar rules used:
$language = ARABIC | ENGLISH; (<SILENCE> $language <SILENCE>)

V. EXPERIMENTAL RESULTS
The experiments are divided into two parts. In the first part, the first approach is used for all language pairs, while the second approach is used only for the Iraqi-Gulf pair, since it produced the worst results in the first approach.

V.1 EHMM Experiments:

Table 1: Performance (%) of HMM on Egyptian Arabic and English
Classifier | Egyptian Arabic | English
HMM (1, 4) | 100 | 100
HMM (1, 6) | 98.5 | 99.7
HMM (1, 8) | 99 | 100
HMM (2, 2) | 99 | 99
HMM (2, 4) | 99 | 99
HMM (2, 6) | 100 | 100
HMM (2, 8) | 100 | 100
HMM (3, 2) | 99 | 99.5
HMM (3, 4) | 99.5 | 99.5
HMM (3, 6) | 100 | 100
HMM (3, 8) | 100 | 100
HMM (4, 2) | 99 | 99.5
HMM (4, 4) | 99.5 | 99.5
HMM (4, 6) | 100 | 100
HMM (4, 8) | 100 | 100
HMM (6, 2) | 99.5 | 99.5
HMM (6, 4) | 100 | 100
HMM (6, 6) | 100 | 100
HMM (6, 8) | 100 | 100

Table 2: Performance (%) of HMM on Gulf Arabic and English
Classifier | Gulf Arabic | English
HMM (1, 4) | 99.5 | 100
HMM (1, 6) | 99 | 100
HMM (1, 8) | 99.5 | 100
HMM (2, 2) | 98 | 98.7
HMM (2, 4) | 98.8 | 99
HMM (2, 6) | 98.5 | 98.5
HMM (2, 8) | 98.5 | 98.9
HMM (3, 2) | 99 | 99.4
HMM (3, 4) | 99.5 | 98.6
HMM (3, 6) | 99.5 | 100
HMM (3, 8) | 99.5 | 100
HMM (4, 2) | 97 | 95
HMM (4, 4) | 97 | 95.5
HMM (4, 6) | 98.5 | 96.3
HMM (4, 8) | 98 | 99
HMM (6, 2) | 97.5 | 96
HMM (6, 4) | 98 | 98.5
HMM (6, 6) | 99.5 | 100
HMM (6, 8) | 99.5 | 100

Table 3: Performance (%) of HMM on Iraqi Arabic and English
Classifier | Iraqi Arabic | English
HMM (1, 4) | 100 | 98.4
HMM (1, 6) | 99 | 98
HMM (1, 8) | 99.5 | 98.5
HMM (2, 2) | 98.5 | 96
HMM (2, 4) | 99 | 96.5
HMM (2, 6) | 99 | 97
HMM (2, 8) | 99 | 97
HMM (3, 2) | 99.5 | 98
HMM (3, 4) | 100 | 98
HMM (3, 6) | 100 | 98.6
HMM (3, 8) | 100 | 98.6
HMM (4, 2) | 97 | 92
HMM (4, 4) | 97.5 | 94
HMM (4, 6) | 100 | 99
HMM (4, 8) | 100 | 99
HMM (6, 2) | 99 | 91
HMM (6, 4) | 99.5 | 93
HMM (6, 6) | 100 | 97
HMM (6, 8) | 100 | 98

Table 4: Performance (%) of HMM on Egyptian Arabic and Gulf Arabic
Classifier | Egyptian Arabic | Gulf Arabic
HMM (1, 4) | 97.7 | 89
HMM (1, 6) | 97.5 | 89.5
HMM (1, 8) | 97.7 | 90
HMM (2, 2) | 95 | 88
HMM (2, 4) | 96 | 88
HMM (2, 6) | 96.6 | 88.5
HMM (2, 8) | 96.6 | 88.3
HMM (3, 2) | 97.5 | 88.5
HMM (3, 4) | 97 | 88.5
HMM (3, 6) | 98 | 90
HMM (3, 8) | 98 | 90
HMM (4, 2) | 98 | 80.5
HMM (4, 4) | 97 | 82.5
HMM (4, 6) | 97 | 84
HMM (4, 8) | 95 | 85
HMM (6, 2) | 97.5 | 88.5
HMM (6, 4) | 97 | 87.5
HMM (6, 6) | 97.5 | 88
HMM (6, 8) | 98 | 88.5

Table 5: Performance (%) of HMM on Iraqi Arabic and Egyptian Arabic
Classifier | Iraqi Arabic | Egyptian Arabic
HMM (1, 4) | 82.5 | 98
HMM (1, 6) | 82 | 98.5
HMM (1, 8) | 82.7 | 98.5
HMM (2, 2) | 82 | 96
HMM (2, 4) | 82.6 | 95.5
HMM (2, 6) | 82.6 | 96
HMM (2, 8) | 82.9 | 96
HMM (3, 2) | 83 | 97
HMM (3, 4) | 83 | 97.5
HMM (3, 6) | 84 | 98.5
HMM (3, 8) | 83.7 | 98
HMM (4, 2) | 81 | 94
HMM (4, 4) | 82.5 | 95.5
HMM (4, 6) | 83.5 | 97
HMM (4, 8) | 83.5 | 97.5
HMM (6, 2) | 82 | 93.5
HMM (6, 4) | 83 | 94.5
HMM (6, 6) | 83.5 | 96.5
HMM (6, 8) | 83.5 | 97

Table 6: Performance (%) of HMM on Iraqi Arabic and Gulf Arabic
Classifier | Iraqi Arabic | Gulf Arabic
HMM (1, 4) | 74 | 69
HMM (1, 6) | 74.5 | 69
HMM (1, 8) | 74.5 | 69.3
HMM (2, 2) | 73 | 68
HMM (2, 4) | 73.5 | 68.7
HMM (2, 6) | 74.5 | 69.9
HMM (2, 8) | 74.5 | 69
HMM (3, 2) | 52.5 | 72
HMM (3, 4) | 54 | 72.5
HMM (3, 6) | 52 | 77.3
HMM (3, 8) | 52.5 | 76
HMM (4, 2) | 36.3 | 85
HMM (4, 4) | 37.5 | 83
HMM (4, 6) | 37 | 80
HMM (4, 8) | 39.3 | 77.5
HMM (6, 2) | 40.5 | 82
HMM (6, 4) | 42 | 79
HMM (6, 6) | 45.3 | 76
HMM (6, 8) | 46 | 75

Table 7: Performance (%) of HMM on Iraqi Arabic and Gulf Arabic
Classifier | Iraqi Arabic | Gulf Arabic | Adapted parameters
HMM (4, 5) | 70 | 73.3 | WT
HMM (4, 5) | 56 | 56 | T
HMM (4, 5) | 55.3 | 55 | W
HMM (4, 5) | 59 | 59.6 | All
HMM (6, 3) | 71.3 | 75 | WT
HMM (6, 3) | 56.3 | 55.3 | T
HMM (6, 3) | 58 | 59.6 | W
HMM (6, 3) | 60 | 61 | All
HMM (8, 2) | 71 | 83.5 | WT
HMM (8, 2) | 72.3 | 80.6 | T
HMM (8, 2) | 68 | 70 | W
HMM (8, 2) | 75 | 64 | All
HMM (10, 2) | 71 | 83 | WT
HMM (10, 2) | 59 | 93 | T
HMM (10, 2) | 66 | 80.6 | W
HMM (10, 2) | 59 | 65 | All

Table 8: Performance (%) of HMM on Egyptian Arabic and Gulf Arabic
Classifier | Egyptian Arabic | Gulf Arabic | Adapted parameters
HMM (4, 5) | 88 | 82 | WT
HMM (4, 5) | 89 | 83.5 | T
HMM (4, 5) | 89.5 | 83.5 | W
HMM (4, 5) | 90 | 84 | All
HMM (6, 3) | 90.5 | 85 | WT
HMM (6, 3) | 95 | 85.5 | T
HMM (6, 3) | 94.5 | 74.5 | W
HMM (6, 3) | 96.5 | 89.5 | All
HMM (8, 2) | 95 | 92.5 | WT
HMM (8, 2) | 94 | 88 | T
HMM (8, 2) | 93.5 | 87 | W
HMM (8, 2) | 96 | 87.5 | All
HMM (10, 2) | 95 | 88.5 | WT
HMM (10, 2) | 93 | 89 | T
HMM (10, 2) | 92 | 88 | W
HMM (10, 2) | 96 | 90 | All

The following experiments were conducted on the Iraqi Arabic, Gulf Arabic, and Egyptian Arabic pairs in an attempt to improve their results.

V.2 EHMM-SPT Experiments:


We used this approach initially on one pair of Arabic dialects (Iraqi-Gulf), aiming to enhance its performance, which was the worst in the first approach due to the great similarity between the two dialects. We also extended the experiments of this approach to the other two pairs (Iraqi-Egyptian) and (Egyptian-Gulf), as shown in tables 7, 8, and 9.
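The selective adaptation used in these experiments (W, T, WT, All) can be sketched as a post-processing step on a Baum-Welch re-estimation: only the chosen parameter groups are taken from the re-estimated model, while the rest are kept from the previous iteration. This is an illustrative sketch under assumed parameter names, not the authors' implementation.

```python
def apply_selective_update(params, reestimated, mode):
    """Keep only the selected parameter groups from a Baum-Welch re-estimation.

    params / reestimated: dicts with keys 'weights' (mixture weights),
    'trans' (transition matrix), 'means', and 'covs' (Gaussian parameters).
    mode: 'W', 'T', 'WT', or 'All', matching the paper's notation.
    """
    selected = {
        "W": ["weights"],
        "T": ["trans"],
        "WT": ["weights", "trans"],
        "All": ["weights", "trans", "means", "covs"],
    }[mode]
    # Take updated values for selected groups, keep old values otherwise.
    return {k: (reestimated[k] if k in selected else params[k]) for k in params}
```

In a full training loop this function would be called once per Baum-Welch iteration, so that, for example, mode 'T' lets the transition matrix adapt to the language data while the Gaussian parameters stay at their pooled-data initialization.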

Table 9: Performance (%) of HMM on Iraqi Arabic and Egyptian Arabic
Classifier | Iraqi Arabic | Egyptian Arabic | Adapted parameters
HMM (4, 5) | 75 | 80 | WT
HMM (4, 5) | 75 | 81.5 | T
HMM (4, 5) | 78 | 82 | W
HMM (4, 5) | 78.5 | 85 | All
HMM (6, 3) | 78.3 | 95.6 | WT
HMM (6, 3) | 77 | 93.5 | T
HMM (6, 3) | 77.5 | 94 | W
HMM (6, 3) | 79 | 95 | All
HMM (8, 2) | 73 | 85.6 | WT
HMM (8, 2) | 76 | 87 | T
HMM (8, 2) | 78.6 | 88 | W
HMM (8, 2) | 85 | 97.6 | All
HMM (10, 2) | 90 | 57.6 | WT
HMM (10, 2) | 85 | 69 | T
HMM (10, 2) | 84 | 65 | W
HMM (10, 2) | 82 | 62.5 | All

VI. CONCLUSION
We have presented two HMM approaches for spoken language and dialect identification. The first approach uses Ergodic HMMs in which Viterbi training and Baum-Welch training are performed for each model separately with its specific language data, and all parameters are allowed to adapt. In the second approach, Viterbi training is done using the pooled data; in the Baum-Welch step, each model is trained with its specific language data and only selected parameters are allowed to adapt. We have tried both approaches on English-Arabic language recognition and on Egyptian, Iraqi, and Gulf Arabic dialect recognition, all using 5-second speech segments. The results are generally high except for the Iraqi-Gulf case, where more work is needed; for example, discriminative training could be tried.

REFERENCES
[1] K.-Y. E. Wong, "Automatic spoken language identification utilizing acoustic and phonetic speech information," Ph.D. thesis, Queensland University of Technology, 2004.
[2] M. A. Zissman and E. Singer, "Automatic language identification of telephone speech messages using phoneme recognition and n-gram modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Apr. 1994, pp. 305-308.
[3] R. C. F. Tucker, M. J. Carey, and E. S. Parris, "Automatic language identification using sub-word models," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Apr. 1994, pp. 301-304.
[4] M. A. Zissman, "Language identification using phoneme recognition and phonotactic language modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), May 1995, pp. 3503-3506.
[5] Y. Yan and E. Barnard, "An approach to automatic language identification based on language-dependent phone recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), May 1995, pp. 3511-3514.
[6] Y. Yan, "Development of an approach to language identification based on language dependent phone recognition," Ph.D. thesis, Center of Spoken Language Understanding, Oregon Graduate Institute of Science and Technology, Portland, Oregon, Oct. 1995.
[7] S. Harbeck and U. Ohler, "Multigrams for language identification," in Proc. Eurospeech, Sept. 1999, vol. 1, pp. 375-378.
[8] M. A. Zissman, T. Gleason, D. Rekart, and B. Losiewicz, "Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Atlanta, 1996.
[9] F. Biadsy, J. Hirschberg, and N. Habash, "Spoken Arabic dialect identification using phonotactic modeling," in Proc. EACL 2009 Workshop on Computational Approaches to Semitic Languages.
[10] P. Torres-Carrasquillo, E. Singer, M. Kohler, R. Greene, D. Reynolds, and J. Deller, "Approaches to language identification using Gaussian mixture models and shifted delta cepstral features," in Proc. Int. Conf. on Spoken Language Processing, vol. 1, pp. 89-92, 2002.
[11] F. S. Alorfi, "Automatic identification of Arabic dialects using Hidden Markov Models," Ph.D. dissertation, University of Pittsburgh, 2008.
[12] S. Kanokphara and J. Carson-Berndsen, "Articulatory-acoustic-feature-based automatic language identification," in Proc. ISCA Tutorial and Research Workshop (ITRW) on Multilingual Speech and Language Processing, Stellenbosch, South Africa, April 2006.
[13] S. A. SantoshKumar and V. Ramasubramanian, "Automatic language identification using Ergodic-HMM," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP '05), Philadelphia, PA, USA, March 2005, vol. 1, pp. 609-612.
[14] J. Navratil, "Automatic language identification," in Multilingual Speech Processing, T. Schultz and K. Kirchhoff, Eds. Academic Press, April 2006, ISBN 978-0-12-088501-5, pp. 233-272.
[15] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons.
[16] VoxForge corpus: http://voxforge.org/
[17] 2003 NIST Language Recognition Evaluation, Linguistic Data Consortium 2006, LDC2006S31, ISBN 1-58563-364-X.
[18] Gulf Arabic Conversational Telephone Speech, Linguistic Data Consortium 2006, LDC2006S43, ISBN 1-58563-400-X.
[19] Iraqi Arabic Conversational Telephone Speech, Linguistic Data Consortium 2006, LDC2006S45, ISBN 1-58563-403-4.
[20] RDI: http://www.rdi-eg.com/. Personal communication.
