net/publication/324725659
4 authors, including:
Rabul Laskar
National Institute of Technology, Silchar
All content following this page was uploaded by Chuya Bhanja on 13 March 2019.
Abstract—A Language Identification (LID) system determines the language of a given speech utterance. Languages can be divided into tonal and non-tonal categories based on whether the meaning of a word changes with a change in pitch. Classifying languages into tonal and non-tonal categories before the individual language identification stage reduces the complexity of the LID system. Though state-of-the-art systems use prosodic features for this purpose, this work analyses the performance of spectral features for tonal and non-tonal classification of languages. Four spectral feature combinations are analysed: Mel Frequency Cepstral Coefficients (MFCC), MFCC with Shifted Delta Cepstral (SDC) coefficients, Mean Hilbert Envelope Coefficients (MHEC), and MHEC with SDC coefficients. Experiments have been performed on the Oregon Graduate Institute Multilingual Telephone Speech Corpus (OGI-MLTS) and the NITS Language Database using the GMM-UBM modelling technique. Results show that MHEC with SDC and MFCC with SDC features, at the syllabic level, give comparable performance of 33.97% Equal Error Rate (EER) for this classification task.

Keywords— Tonal/Non-tonal languages; MHEC; MFCC; SDC; Legendre polynomial; GMM-UBM

I. INTRODUCTION

Speech is the basic mode of communication between humans. Language identification (LID) is the process of identifying the language of an utterance. Every language contains specific sound patterns, called phonemes. These phonological units make up the inventory from which words are produced in a language. The rates at which these units occur and the order in which they appear differ from language to language. An LID system is designed to distinguish among different languages by utilizing various such aspects of speech information.

LID systems have numerous applications. They can be used as a front end in speech recognition systems. They are also required in language translation systems. In a multilingual translation system [1], the language of the input speech needs to be identified quickly before the translation to the target language begins. LID systems are also used in Interactive Voice Response systems.

Most state-of-the-art LID systems use training sets having phonemically labelled data for each language to be identified. This method of LID modelling is known as Phonemic Recognition followed by Language Modelling (PRLM) [1]. Although PRLM is an efficient method for LID, the amount of work involved in labelling the data is very high and it is also a time-consuming process. In this study, we focus on a system which does not require any phonetically labelled training data, so that the complexity of the system is reduced.

Languages can be broadly divided into two main categories, tonal and non-tonal. Tone is the usage of pitch in a language to discriminate lexical meanings. Languages which use tone to distinguish words are known as tonal languages. If an LID system pre-classifies the languages into tonal and non-tonal categories, the accuracy and efficiency of the system can be considerably improved, as the number of languages in the final classification stage is reduced.

A literature survey shows that classification of languages into tonal and non-tonal categories has been performed using prosodic features [2]. This work is unique in the sense that it presents the performance analysis of spectral features in the classification of languages into tonal and non-tonal categories. In this study, two spectral features are analysed, namely Mel Frequency Cepstral Coefficients (MFCC) and Mean Hilbert Envelope Coefficients (MHEC). Of these, MFCC is a very commonly used feature in many speech processing applications. MHEC is a relatively new feature which has been shown to perform better than MFCC in speaker recognition and language identification tasks under reverberant mismatched conditions [3] and noisy conditions [4]. When these features are used for the pre-classification task, the system gives considerable performance. We also combine these features with their Shifted Delta Cepstral (SDC) coefficients. The SDC feature shows a pseudo-prosodic behaviour [5], as it gives the temporal information spread over many frames. Results obtained in this study show that the combination of SDC with MHEC and MFCC gives better results than using only MHEC or MFCC.

In this study, features are considered at the syllabic level. A syllable is a region consisting of one vowel and one or more consonants. Research has shown that considering syllable-level features for the LID task gives better results, as syllables are context-dependent units [6].

Experiments have been carried out using the GMM-UBM modelling technique. The results obtained using two databases are presented in this paper. The databases used are the Oregon Graduate Institute Multilingual Telephone Speech Corpus (OGI-MLTS) database and the NITS Language Database, consisting mainly of North East Indian languages. The rest of the paper is organized as follows. Section II presents the background and related work in this field. Section III explains the methodology of the work. Section IV presents the results and discussions, and Section V summarizes and concludes the study along with the future scope.

II. BACKGROUND

Classification of languages into tonal and non-tonal categories was formerly done by making use of prosodic features, namely pitch. L. Wang et al. [2] utilized various characteristics of pitch, like speed and level of pitch variation, to discriminate between tonal and non-tonal classes. N. Ryant et al. [7] demonstrated that spectral characteristics can be used to achieve good classification performance among the five tonal categories in the Mandarin Chinese language. They performed tone classification in the absence of prosodic information by using MFCC features. In the present work, all analysis is done by considering syllable-like units. The syllabic-level approach was used by L. Mary et al. [6] for speaker and language identification tasks. They showed that syllabic-level features are more effective for LID than frame-level features.

III. METHODOLOGY

The process flow of the overall system is as shown in the block diagram below. The detailed explanation of the system is as follows.

[Block diagram: VOP Detection, Syllable Segmentation, Feature Extraction, Non-speech Frame Removal, Contour Modelling, Input to Classifier]

B. Syllable Segmentation

After VOP detection, the pre-emphasized speech signal is framed into syllable-like units. This is done by taking the Vowel Onset Point (VOP) as an anchor point and taking the region from 200 samples to the left of the VOP up to the succeeding VOP. This represents a region consisting of a single vowel and one or more consonants. This syllable is divided into frames, each of length 160 samples with an overlap of 80 samples. Features are then extracted from these frames.

Fig. 2. (a) Speech Signal (b) Speech Signal showing VOPs

C. Feature Extraction

This study utilizes two spectral features, namely MFCC and MHEC. SDC coefficients of these two features are also appended to them.

1) Mel Frequency Cepstral Coefficients (MFCC)

MFCC is one of the most widely used spectral features for different speech processing applications like speaker recognition, language identification, emotion recognition etc. The steps involved in the calculation of MFCC coefficients from the speech frames are as follows:

[Block diagram: Windowing, FFT, Mel Frequency Warping, Logarithm, DCT]
• The filterbank coefficients are multiplied with the power spectrum and summed to get the filterbank energies.
• Take the log of the filterbank energies.
• Take the Discrete Cosine Transform to obtain the cepstral coefficients.

For experiments involving only the MFCC feature, the first 13 coefficients are considered. The SDC features of the first 7 coefficients of the MFCC feature vector are computed using the configuration parameters 7-1-3-7 [9]. These are then appended to the MFCC features to get a 56-dimensional feature vector.

2) Mean Hilbert Envelope Coefficients (MHEC)

MHEC is a relatively new set of spectral features introduced by Sadjadi and Hansen in 2011 [3]. In MHEC feature extraction, a gammatone filterbank is used, which has been shown to effectively model human cochlear filtering [10]. In this procedure, the amplitude modulation spectrum of the sub-band speech signal is computed using the Hilbert envelope of the gammatone filterbank output. The gammatone filterbank uses an Equivalent Rectangular Bandwidth (ERB) scale given by

• DCT is applied to the spectral coefficients to obtain MHEC.

After obtaining the MHEC coefficients, their SDC is computed using the standard parameters 7-1-3-7 [9]. The 49 SDC coefficients thus obtained are appended to the MHEC to obtain a 56-dimensional feature vector.

D. Non-speech Frame Removal

The feature vectors for all the frames corresponding to a syllable are stacked together. In the next step, we use the Voice Activity Detection (VAD) algorithm [11] to identify the frames where speech is present. The feature vectors corresponding to non-speech frames are removed.

E. Contour Modelling

The feature vectors for all the frames of a syllable are stacked together and the contour corresponding to each cepstral coefficient is modelled as a linear combination of Legendre polynomials according to the equation

$f(t) = \sum_{i=0}^{M} a_i P_i(t)$ (4)
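Modelling a contour as a linear combination of Legendre polynomials amounts to a least-squares fit of the coefficient trajectory over the frames of a syllable. The following is a minimal sketch of that idea, not the authors' implementation. The function name `syllable_contour_features`, the polynomial order of 3, and the mapping of frame indices onto the interval [-1, 1] are assumptions made for illustration; the source does not state the order it uses.

```python
import numpy as np
from numpy.polynomial import legendre


def syllable_contour_features(frames, order=3):
    """Fit Legendre coefficients to each cepstral-coefficient contour.

    frames: array of shape (n_frames, n_coeffs), one feature vector per frame
            of a single syllable (hypothetical input layout).
    order:  highest Legendre polynomial degree (assumed value, not from the paper).
    Returns a flat vector of length n_coeffs * (order + 1).
    """
    frames = np.asarray(frames, dtype=float)
    # Map frame positions onto [-1, 1], where Legendre polynomials are orthogonal.
    t = np.linspace(-1.0, 1.0, frames.shape[0])
    # legfit fits every column at once; result has shape (order + 1, n_coeffs).
    coeffs = legendre.legfit(t, frames, order)
    # Group the fitted coefficients contour by contour.
    return coeffs.T.reshape(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    syllable = rng.normal(size=(12, 56))  # 12 frames of 56-dimensional features
    print(syllable_contour_features(syllable).shape)
```

A fixed-length vector per syllable results regardless of how many frames the syllable contains, which is what makes the contour representation convenient as classifier input. Syllables with fewer than `order + 1` frames would make the fit rank-deficient and would need separate handling.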
1) OGI-MLTS Database

The OGI-MLTS Database [13] is a multilingual telephone speech corpus in which speech data are sampled at 8 kHz. Ten languages from the OGI database are used, of which eight belong to the non-tonal category and two belong to the tonal category. The non-tonal languages are English, Farsi, French, German, Hindi, Korean, Spanish and Tamil, and the tonal languages are Mandarin Chinese and Vietnamese.

B. Classifier

In this work, all the experiments have been performed using the GMM-UBM modelling technique. In GMM modelling, a probability density which is a weighted sum of multivariate Gaussian densities,
TABLE II. EXPERIMENTAL RESULTS OF OGI AND NITS LANGUAGE DATABASE FOR DIFFERENT FEATURE COMBINATIONS

Duration of       EER obtained for OGI Database (%)          EER obtained for NITS Language Database (%)
test utterances   MFCC    MFCC+SDC   MHEC    MHEC+SDC        MFCC+SDC   MHEC+SDC
3 sec             44.02   43.06      43.06   42.1            61.25      45
9 sec             41.14   39.22      42.1    40.18           38.75      38.75
45 sec            36.85   33.97      35.89   33.97           38.75      38.75
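The relative improvements quoted in the discussion of these results follow from the EER values in Table II as (baseline EER - new EER) / baseline EER. The short sketch below reproduces that arithmetic for the OGI columns; the helper name `relative_improvement` and the dictionary layout are illustrative choices, while the numbers are copied from the table.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# EER values (%) for the OGI database, copied from Table II.
ogi_eer = {
    "3 sec": {"MFCC": 44.02, "MFCC+SDC": 43.06, "MHEC": 43.06, "MHEC+SDC": 42.1},
    "9 sec": {"MFCC": 41.14, "MFCC+SDC": 39.22, "MHEC": 42.1, "MHEC+SDC": 40.18},
    "45 sec": {"MFCC": 36.85, "MFCC+SDC": 33.97, "MHEC": 35.89, "MHEC+SDC": 33.97},
}

if __name__ == "__main__":
    for duration, eer in ogi_eer.items():
        mfcc_gain = relative_improvement(eer["MFCC"], eer["MFCC+SDC"])
        mhec_gain = relative_improvement(eer["MHEC"], eer["MHEC+SDC"])
        print(f"{duration}: MFCC->MFCC+SDC {mfcc_gain:.2f}%, "
              f"MHEC->MHEC+SDC {mhec_gain:.2f}%")
```

The same formula applied to the 3 sec NITS entries (61.25% for MFCC+SDC against 45% for MHEC+SDC) gives roughly 26.5%.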
Fig. 6. DET Curve of MFCC Feature for OGI Database
Fig. 7. DET Curve of MFCC+SDC Feature for OGI Database
Fig. 8. DET Curve of MHEC Feature for OGI Database
Fig. 9. DET Curve of MHEC+SDC Feature for OGI Database
Fig. 10. DET Curve of MFCC+SDC Feature for NITS Language Database
Fig. 11. DET Curve of MHEC+SDC Feature for NITS Language Database
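The EER is the operating point on a DET curve where the miss rate equals the false-alarm rate. As a rough illustration of how such a value can be computed from classifier scores, the sketch below sweeps a threshold over two score lists. Everything here is an assumption for illustration: the function name `equal_error_rate`, the choice of treating one class (for example tonal) as the target, and the use of a simple threshold sweep; the source does not describe its scoring procedure.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER (%) by sweeping a decision threshold over all scores.

    target_scores:    scores of trials that truly belong to the target class.
    nontarget_scores: scores of trials that do not.
    A trial is accepted as target when its score is >= the threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials rejected at each threshold.
    miss = np.array([(target < th).mean() for th in thresholds])
    # False-alarm rate: fraction of non-target trials accepted at each threshold.
    false_alarm = np.array([(nontarget >= th).mean() for th in thresholds])
    # Pick the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(miss - false_alarm))
    return 100.0 * (miss[idx] + false_alarm[idx]) / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tonal = rng.normal(loc=0.5, size=200)      # synthetic target scores
    non_tonal = rng.normal(loc=-0.5, size=200)  # synthetic non-target scores
    print(f"EER: {equal_error_rate(tonal, non_tonal):.2f}%")
```

With finitely many trials the two error rates rarely coincide exactly, so averaging them at the closest threshold is one common approximation; interpolating along the DET curve is another.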
In this classification task, EERs of 44.02%, 41.14% and 36.85% have been obtained for 3 s, 9 s and 45 s test utterances using the MFCC feature for the OGI database. Fig. 6 shows the DET curve of the GMM-UBM classifier for the MFCC feature. It should be noted that the 45 s utterance gives the lowest EER among the three test durations. On addition of SDC features to MFCC, a relative improvement of 2.18% for the 3 s test utterance, 4.67% for the 9 s test utterance and 7.82% for the 45 s test utterance can be observed. The result is shown in Fig. 7. Since the experiment has been conducted on non-noisy data, the MHEC feature gives performance comparable to that of the MFCC feature. The EER of the MHEC feature for 3 s, 9 s and 45 s test utterances is 43.06%, 42.1% and 35.89% for the OGI database. Fig. 8 shows the DET curve for classification of tonal and non-tonal languages using the MHEC feature. On adding the SDC feature, a relative improvement of 2.22%, 4.56% and 5.34% has been observed for 3 s, 9 s and 45 s test utterances. The corresponding results are shown in Fig. 9. In this experiment, MFCC along with SDC and MHEC along with SDC give almost similar results. On the other hand, for the NITS Language database, a maximum relative improvement of 26.53% has been observed for the 3 s test utterance, and for the other two sizes of test utterance it follows the same trend as that for the OGI database. Fig. 10 and Fig. 11 show the DET curves for the above-mentioned features on the NITS Language database.

V. CONCLUSION

In order to classify languages into tonal and non-tonal categories, a system using only spectral features like MHEC, MFCC and SDC has been developed. Results of experiments performed on the OGI database and the NITS Language database using the GMM-UBM modelling technique have been presented. The novelty of this system is that it does not make use of any linguistically labelled corpus or prosodic information for this classification task. The experiments show that the combinations of MHEC and MFCC with SDC features give almost similar results for clean data conditions. In future, the performance of the system for noisy data conditions can be studied. As prosodic features are the state-of-the-art features used for tonal and non-tonal classification, a combination of spectral and prosodic features is expected to give improved results. The performance of the classification system can be further enhanced by adding sub-segmental or spectral features like formants. Different classifiers like Deep Neural Network (DNN), I-vector, Restricted Boltzmann Machine (RBM) etc. can also be used to model the system.

REFERENCES

[1] M. A. Zissman, “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Trans. Speech Audio Process., vol. 4, pp. 31–44, 1996.
[2] L. Wang, E. Ambikairajah, E. H. C. Choi, “A novel method for automatic tonal and non-tonal language classification,” IEEE International Conference on Multimedia and Expo, pp. 352–355, 2007.
[3] S. O. Sadjadi, J. H. L. Hansen, “Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions,” Proc. IEEE ICASSP, Prague, Czech Republic, pp. 5448–5451, 2011.
[4] S. O. Sadjadi, J. H. L. Hansen, “Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification,” Speech Communication, vol. 72, pp. 138–148, 2015.
[5] D. R. Gonzalez, J. R. C. De Lara, “Speaker verification with shifted delta cepstral features: Its pseudo-prosodic behaviour,” Proceedings of the I Iberian SLTech, 2009.
[6] L. Mary, B. Yegnanarayana, “Extraction and representation of prosodic features for language and speaker recognition,” Speech Communication, vol. 50, pp. 782–796, 2008.
[7] N. Ryant, J. Yuan, M. Liberman, “Mandarin tone classification without pitch tracking,” International Conference on Acoustic, Speech and Signal Processing (ICASSP), 2014.
[8] S. R. Mahadeva Prasanna, B. V. S. Reddy, P. Krishnamoorthy, “Vowel onset point detection using source, spectral peaks, and modulation spectrum energies,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 4, May 2009.
[9] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, J. R. Deller, “Approaches to language identification using Gaussian mixture models and shifted delta cepstral features,” Proc. INTERSPEECH, Denver, CO, pp. 33–36, 2002.
[10] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, P. Rice, “An efficient auditory filterbank based on the gammatone function,” Appl. Psychol. Unit, Cambridge, UK, APU Rep. 2341, 1988.
[11] A. M. Kondoz, “Voice activity detection,” Digital Speech: Coding for Low Bit Rate Communication Systems, Second Edition, pp. 357–377, 2004.
[12] D. Martinez, E. Lleida, A. Ortega, A. Miguel, “Prosodic features and formant modelling for an I-vector based language recognition system,” International Conference on Acoustic, Speech and Signal Processing (ICASSP), 2013.
[13] Y. K. Muthusamy, R. A. Cole, B. T. Oshika, “The OGI multi-language telephone speech corpus,” Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.