Conference Paper · July 2017
DOI: 10.1109/ICICICT1.2017.8342752


2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT)

Spectral Feature based Automatic Tonal and Non-Tonal Language Classification

Alice Celin Alphonsa, Student, Electronics and Communication Engineering, NIT Silchar, Silchar, India (alicecelin@gmail.com)
Chuya China (Bhanja), Student, Electronics and Communication Engineering, NIT Silchar, Silchar, India (chuya.bhanja@gmail.com)
Azharuddin Laskar, Student, Electronics and Communication Engineering, NIT Silchar, Silchar, India (azharlaskar@gmail.com)
Rabul Hussain Laskar, Assistant Professor, Electronics and Communication Engineering, NIT Silchar, Silchar, India (rhlaskar@ece.nits.ac.in)

Abstract—A Language Identification (LID) system identifies the language of a given speech utterance. Languages can be divided into tonal and non-tonal categories based on whether the meaning of a word changes with a change in pitch. Classifying languages into tonal and non-tonal categories before the individual language identification stage reduces the complexity of the LID system. Though state of the art systems use prosodic features for this purpose, this work focuses on analysing the performance of spectral features for tonal and non-tonal classification of languages. A performance analysis of different spectral feature combinations, namely Mel Frequency Cepstral Coefficients (MFCC), MFCC along with Shifted Delta Cepstral (SDC) coefficients, Mean Hilbert Envelope Coefficients (MHEC), and MHEC along with SDC coefficients, is carried out in this study. Experiments have been performed on the Oregon Graduate Institute Multilingual Telephone Speech Corpus (OGI-MLTS) and the NITS Language database using the GMM-UBM modelling technique. Results show that MHEC with SDC and MFCC with SDC features, at the syllabic level, give comparable performance of 33.97% Equal Error Rate (EER) for this classification task.

Keywords—Tonal/Non-tonal languages; MHEC; MFCC; SDC; Legendre polynomial; GMM-UBM

I. INTRODUCTION

Speech is the basic mode of communication between humans. Language identification (LID) is the process of identifying the language of an utterance. Every language contains specific sound patterns, called phonemes. These phonological units make up the inventory from which words are produced in a language. The rates at which these units occur and the order in which they appear differ from language to language. An LID system is designed to distinguish among different languages by utilizing various such aspects of speech information.

LID systems have numerous applications. They can be used as a front end in speech recognition systems. They are also required in language translation systems. In a multilingual translation system [1], the language of the input speech needs to be identified quickly before the translation to the target language begins. LID systems are also used in Interactive Voice Response systems.

Most state of the art LID systems use training sets with phonemically labelled data for each language to be identified. This method of LID modelling is known as Phonemic Recognition followed by Language Modelling (PRLM) [1]. Although PRLM is an efficient method for LID, the amount of work involved in labelling the data is very high and it is also a time consuming process. In this study, we focus on a system which does not require any phonetically labelled training data, so that the complexity of the system is reduced.

Languages can be broadly divided into two main categories, tonal and non-tonal. Tone is the usage of pitch in a language to discriminate lexical meanings. Languages which use tone to distinguish words are known as tonal languages. If an LID system pre-classifies the languages into tonal and non-tonal categories, the accuracy and efficiency of the system can be considerably improved, as the number of languages in the final classification stage is reduced.

A literature survey shows that classification of languages into tonal and non-tonal categories has been performed using prosodic features [2]. This work is unique in the sense that it presents a performance analysis of spectral features for the classification of languages into tonal and non-tonal categories. In this study, two spectral features are analyzed, namely Mel Frequency Cepstral Coefficients (MFCC) and Mean Hilbert Envelope Coefficients (MHEC). Of these, MFCC is a very commonly used feature in many speech processing applications. MHEC is a relatively new feature which has been shown to perform better than MFCC in speaker recognition and language identification tasks under reverberant mismatched conditions [3] and noisy conditions [4]. When these features are used for the pre-classification task, the system gives considerable performance. We also combine these features with their Shifted Delta Cepstral (SDC) coefficients. The SDC feature shows a pseudo-prosodic behaviour [5], as it captures temporal information spread over many frames. Results obtained in this study show that the combination of SDC with MHEC and MFCC gives better results than using only MHEC or MFCC.

In this study, features are considered at the syllabic level. A syllable is a region consisting of one vowel and one or more consonants. Research has shown that considering syllable

978-1-5090-6106-8/17/$31.00 ©2017 IEEE 1271



level features for the LID task gives better results, as syllables are context dependent units [6].

Experiments have been carried out using the GMM-UBM modelling technique. The results obtained using two databases are presented in this paper. The databases used are the Oregon Graduate Institute Multilingual Telephone Speech Corpus (OGI-MLTS) database and the NITS Language Database, consisting mainly of North East Indian languages. The rest of the paper is organized as follows. Section II presents the background and related work in this field. Section III explains the methodology of the work. Section IV presents the results and discussions, and Section V summarizes and concludes the study along with the future scope.

II. BACKGROUND

Classification of languages into tonal and non-tonal categories was formerly done using prosodic features, namely pitch. L. Wang et al. [2] utilized various characteristics of pitch, such as the speed and level of pitch variation, to discriminate between tonal and non-tonal classes. N. Ryant et al. [7] demonstrated that spectral characteristics can be used to achieve good classification performance among the five tonal categories of Mandarin Chinese. They performed tone classification in the absence of prosodic information by using MFCC features. In the present work, all analysis is done by considering syllable-like units. The syllabic level approach was used by L. Mary et al. [6] for speaker and language identification tasks. They showed that syllabic level features are more effective for LID than frame level features.

III. METHODOLOGY

The process flow of the overall system is shown in the block diagram below. The detailed explanation of the system is as follows.

Fig. 1. Block diagram representation of the overall system (VOP Detection → Syllable Segmentation → Feature Extraction → Non-speech Frame Removal → Contour Modelling → Input to Classifier)

A. Vowel Onset Point (VOP) Detection

VOP is the time instant at which the vowel region starts in a speech signal. VOP detection is very important for correct identification of syllable units. A syllable can be considered as a combination of a vowel and one or more consonants. Syllables are the basic structural units for most languages and are used as anchor points for different speech applications. Different algorithms can be used to detect VOPs. In this study, information from the modulation spectrum, spectral peaks and the excitation source is combined for VOP detection [8].

B. Syllable Segmentation

After VOP detection, the pre-emphasized speech signal is framed into syllable-like units. This is done by taking the Vowel Onset Point (VOP) as an anchor point and taking the region from 200 samples to the left of the VOP to the succeeding VOP. This represents a region consisting of a single vowel and one or more consonants. Each syllable is divided into frames of length 160 samples with an overlap of 80 samples. Features are then extracted from these frames.

Fig. 2. (a) Speech signal (b) Speech signal showing VOPs

C. Feature Extraction

This study utilizes two spectral features, namely MFCC and MHEC. SDC coefficients of these two features are also appended along with them.

1) Mel Frequency Cepstral Coefficients (MFCC)

MFCC is one of the most widely used spectral features for different speech processing applications like speaker recognition, language identification, emotion recognition, etc. The steps involved in the calculation of MFCC coefficients from the speech frames are as follows:

Fig. 3. Block diagram representation of MFCC feature extraction (Windowing → FFT → Mel Frequency Wrapping → Logarithm → DCT)

- Perform windowing using a Hamming window.
- Perform the DFT of the speech frame.
- The absolute value of the complex Fourier transform is squared to obtain the periodogram estimate of the power spectrum.
- Compute the Mel filterbank. The filterbank is based on the Mel scale given by

B(f) = 1125 ln(1 + f/700)    (1)

where B is the Mel value of the corresponding frequency f in Hz.
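As a concrete illustration of (1), the Mel mapping and a standard triangular Mel filterbank can be sketched in numpy as follows. This is a minimal sketch, not the authors' implementation; the filter count and FFT size are assumed values, since the paper does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    """Mel value of frequency f in Hz, per (1): B(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=8000):
    """Triangular filters spaced uniformly on the Mel scale.
    n_filters and n_fft are illustrative; the paper does not state them."""
    pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz = 700.0 * (np.exp(pts / 1125.0) - 1.0)           # invert (1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)  # FFT bin of each edge
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):                   # rising slope
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                  # falling slope
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb
```

After applying such a filterbank to the power spectrum, taking the log and the DCT yields the cepstral coefficients, of which the paper retains the first 13.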




- The filterbank coefficients are multiplied with the power spectrum and summed to get the filterbank energies.
- Take the log of the filterbank energies.
- Take the Discrete Cosine Transform to obtain the cepstral coefficients.

For experiments involving only the MFCC feature, the first 13 coefficients are considered. The SDC features of the first 7 coefficients of the MFCC feature vector are computed using the configuration parameters 7-1-3-7 [9]. These are then appended to the MFCC features to get a 56 dimensional feature vector.

2) Mean Hilbert Envelope Coefficients (MHEC)

MHEC is a relatively new set of spectral features introduced by Sadjadi and Hansen in 2011 [3]. In MHEC feature extraction, a gammatone filterbank is used, which has been shown to effectively model human cochlear filtering [10]. In this procedure, the amplitude modulation spectrum of the sub-band speech signal is computed using the Hilbert envelope of the gammatone filterbank output. The gammatone filterbank uses an Equivalent Rectangular Bandwidth (ERB) scale given by

ERB_j = f_j/Q + B    (2)

where Q = 9.26449 and B = 24.7 are known as the Glasberg and Moore parameters and f_j is the centre frequency of the jth channel in Hz. The gammatone filterbank is defined by

h(t, j) = a t^(n-1) exp(-2π b(f_j) t) cos(2π f_j t + φ)    (3)

where a and n represent the magnitude of the response and the filter order respectively, b(f_j) is the filter bandwidth, f_j is the centre frequency of the jth channel and φ is the initial phase. In this study, we use 7 MHEC coefficients along with SDC coefficients computed from them. The steps for the computation of MHEC coefficients are as follows [3]:

Fig. 4. Block diagram representation of MHEC feature extraction (Gammatone filterbank → Hilbert envelope → Envelope smoothing → Mean computation → Root compression → DCT)

- The speech frame is decomposed into different subbands using the gammatone filterbank.
- Apply the Hilbert transform and compute the analytic signal of the jth channel output.
- Compute the squared magnitude of the analytic signal to obtain the temporal envelope of the jth channel output. This temporal envelope is the Hilbert envelope of the subband signal.
- The Hilbert envelope is smoothed using a low pass filter.
- The temporal envelope amplitude of the frame is estimated by calculating the sample mean. These are the spectral coefficients.
- Root compression is applied to reduce the dynamic range of the coefficients.
- The DCT is applied to the spectral coefficients to obtain the MHEC.

After obtaining the MHEC coefficients, their SDC is computed using the standard parameters 7-1-3-7 [9]. The 49 SDC coefficients thus obtained are appended to the MHEC to obtain a 56 dimensional feature vector.

D. Non-speech Frame Removal

The feature vectors for all the frames corresponding to a syllable are stacked together. In the next step, we use a Voice Activity Detection (VAD) algorithm [11] to identify the frames where speech is present. The feature vectors corresponding to non-speech frames are removed.

E. Contour Modelling

The feature vectors for all the frames of a syllable are stacked together, and the contour corresponding to each cepstral coefficient is modelled as a linear combination of Legendre polynomials according to the equation

c(t) = Σ_{i=0}^{M} a_i P_i(t)    (4)

where c(t) is the contour being modelled, P_i(t) is the ith Legendre polynomial and the coefficient a_i represents a characteristic of the contour shape [12]. In this experiment, Legendre polynomials of order 4 give a 65 dimensional MFCC feature, a 280 dimensional MFCC along with SDC feature, a 35 dimensional MHEC feature and a 280 dimensional MHEC along with SDC feature, for each syllable.

Fig. 5. Comparison of actual values of the first MHEC coefficient with the Legendre approximation of order 4

F. Input to Classifier

The training and testing feature vectors are normalized using z-normalization before being fed to the classifier.

IV. EXPERIMENTS AND RESULTS

A. Database

This study uses two databases.
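The Legendre contour modelling of (4) can be sketched with numpy's Legendre utilities. This is a minimal illustration, not the authors' implementation: mapping each syllable's frame axis onto [-1, 1] is an assumption, and order 4 yields 5 coefficients per contour (hence 13 MFCCs × 5 = 65 values per syllable, or 56 × 5 = 280 for the SDC-augmented vectors).

```python
import numpy as np
from numpy.polynomial import legendre

def contour_params(contour, order=4):
    """Least-squares fit of one cepstral contour c(t) as
    sum_i a_i P_i(t), i = 0..order, per (4)."""
    t = np.linspace(-1.0, 1.0, len(contour))  # assumed frame-axis mapping
    return legendre.legfit(t, contour, order)

def syllable_vector(cepstra, order=4):
    """Concatenate the Legendre coefficients of every cepstral dimension;
    cepstra has shape (frames, dims), e.g. 13 MFCCs -> 13 * 5 = 65 values."""
    return np.concatenate([contour_params(cepstra[:, d], order)
                           for d in range(cepstra.shape[1])])
```

Because the fit is a least-squares projection, a contour that is exactly a low-order Legendre combination is recovered with its original coefficients.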




1) OGI-MLTS Database

The OGI-MLTS Database [13] is a multilingual telephone speech corpus in which the speech data are sampled at 8 kHz. Ten languages from the OGI database are used, of which eight belong to the non-tonal category and two belong to the tonal category. The non-tonal languages are English, Farsi, French, German, Hindi, Korean, Spanish and Tamil, and the tonal languages are Mandarin Chinese and Vietnamese.

2) NITS Language Database

This database is prepared from archived news recordings of seven North East Indian languages from All India Radio. Of the seven languages, the non-tonal languages are Assamese, Bengali, Indian English, Hindi and Nagamese, and the tonal languages are Manipuri and Mizo. The speech samples of this database are sampled at 16 kHz and collected in a studio environment.

In this classification task, 1.5 hours of data from each class is used to train the model. All experiments in this study are performed as closed set tasks on 3, 9 and 45 second long samples of test data, in a text and speaker independent mode. The performance of the features in the GMM-UBM framework is evaluated using the Equal Error Rate (EER). EER is the operating point where the False Positive Rate (FPR) equals the False Negative Rate (FNR), and is calculated as

EER = ((FNR + FPR)/2) × 100%    (5)

The Detection Error Tradeoff (DET) curve in log mode is used to present the experimental results.

B. Classifier

In this work, all the experiments have been performed using the GMM-UBM modelling technique. In GMM modelling, a probability density which is a weighted sum of multivariate Gaussian densities,

p(x_t | λ) = Σ_{i=1}^{M} w_i g_i(x_t)    (6)

is used to model a feature vector x_t at frame time t, where λ is the set of model parameters {w_i, μ_i, Σ_i}; g_i is the multivariate Gaussian density defined by mean μ_i and covariance Σ_i, and w_i are the mixture weights. In the training phase, a Universal Background Model (UBM) is developed, which is basically a Gaussian Mixture Model (GMM) trained using the feature vectors. Bayesian adaptation is then used to adapt a GMM from the UBM for each of the classes in the system.

C. Experimental Results

In this experiment, the GMM-UBM model has been tested on 209 utterances of the OGI-MLTS Database and 50 utterances of the NITS Language Database, including tonal and non-tonal languages. Table I shows the FNR and FPR values obtained from the experiment for both databases. EER values have been calculated using (5). Table II shows the experimental results of classifying languages into tonal and non-tonal categories for both the OGI and NITS Language databases.
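The likelihood in (6) and the resulting two-class decision can be sketched as below. This is a hedged sketch, not the authors' implementation: diagonal covariances are assumed, and UBM training and Bayesian adaptation are omitted; only the scoring step is shown.

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    """Per-frame log density of a diagonal-covariance Gaussian."""
    return -0.5 * (np.sum(np.log(2.0 * np.pi * var))
                   + np.sum((X - mu) ** 2 / var, axis=1))

def gmm_loglik(X, weights, means, variances):
    """Frame-wise log of (6): log p(x_t | lambda) = log sum_i w_i g_i(x_t)."""
    per_mix = np.stack([np.log(w) + log_gauss_diag(X, m, v)
                        for w, m, v in zip(weights, means, variances)])
    top = per_mix.max(axis=0)                      # log-sum-exp for stability
    return top + np.log(np.exp(per_mix - top).sum(axis=0))

def classify(X, tonal, nontonal):
    """Compare average log-likelihoods under the two class-adapted GMMs."""
    lt = gmm_loglik(X, *tonal).mean()
    ln = gmm_loglik(X, *nontonal).mean()
    return "tonal" if lt > ln else "non-tonal"
```

In a full GMM-UBM system the two class models would be MAP-adapted from a shared UBM rather than trained independently.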
TABLE I. CONFUSION MATRIX (FNR/FPR) OF OGI AND NITS LANGUAGE DATABASES FOR DIFFERENT FEATURE COMBINATIONS

Duration of test     Rate   OGI-MLTS Database                        NITS Language Database
utterances                  MFCC    MFCC+SDC  MHEC    MHEC+SDC       MFCC+SDC  MHEC+SDC
3 sec                FNR    .4128   .4019     .4028   .4341          .6232     .4214
                     FPR    .4702   .4589     .4702   .4106          .6120     .4810
9 sec                FNR    .3731   .3918     .3811   .4011          .3682     .3982
                     FPR    .4506   .3896     .4528   .4100          .4109     .3713
45 sec               FNR    .3828   .3742     .3672   .3402          .3701     .3891
                     FPR    .3526   .3014     .3518   .3386          .4078     .3901

TABLE II. EXPERIMENTAL RESULTS (EER, %) OF OGI AND NITS LANGUAGE DATABASES FOR DIFFERENT FEATURE COMBINATIONS

Duration of test     OGI Database (EER %)                     NITS Language Database (EER %)
utterances           MFCC    MFCC+SDC  MHEC    MHEC+SDC       MFCC+SDC  MHEC+SDC
3 sec                44.02   43.06     43.06   42.1           61.25     45
9 sec                41.14   39.22     42.1    40.18          38.75     38.75
45 sec               36.85   33.97     35.89   33.97          38.75     38.75
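Each EER in Table II corresponds, via (5), to the average of an FNR/FPR pair in Table I, up to small rounding differences in the reported values. A minimal sketch follows; the score-thresholding convention used to obtain FNR and FPR is an assumption for illustration.

```python
import numpy as np

def rates(scores, labels, threshold):
    """FNR and FPR of the decision scores >= threshold
    (labels: 1 = tonal, 0 = non-tonal; the score convention is assumed)."""
    pred = scores >= threshold
    fnr = float(np.mean(~pred[labels == 1]))  # tonal samples rejected
    fpr = float(np.mean(pred[labels == 0]))   # non-tonal samples accepted
    return fnr, fpr

def eer_percent(fnr, fpr):
    """EER per (5): the mean of FNR and FPR at the operating point, in %."""
    return (fnr + fpr) / 2.0 * 100.0
```

Sweeping the threshold until FNR and FPR cross gives the operating point that (5) summarizes.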




Fig. 6. DET curve of MFCC feature for OGI Database
Fig. 7. DET curve of MFCC+SDC feature for OGI Database
Fig. 8. DET curve of MHEC feature for OGI Database
Fig. 9. DET curve of MHEC+SDC feature for OGI Database
Fig. 10. DET curve of MFCC+SDC feature for NITS Language Database
Fig. 11. DET curve of MHEC+SDC feature for NITS Language Database




In this classification task, EERs of 44.02%, 41.14% and 36.85% have been obtained for 3 s, 9 s and 45 s test utterances using the MFCC feature on the OGI database. Fig. 6 shows the DET curve for the GMM-UBM classifier with the MFCC feature. It should be noted that the 45 second utterances give the lowest EER among the three test durations. On addition of SDC features to MFCC, relative improvements of 2.18% for 3 s, 4.67% for 9 s and 7.82% for 45 s test utterances can be observed; the result is shown in Fig. 7. Since the experiment has been conducted on non-noisy data, the MHEC feature gives comparable performance to the MFCC feature. The EER of the MHEC feature for 3 s, 9 s and 45 s test utterances is 43.06%, 42.1% and 35.89% on the OGI database. Fig. 8 shows the DET curve for classification of tonal and non-tonal languages using the MHEC feature. On adding the SDC feature, relative improvements of 2.22%, 4.56% and 5.34% have been observed for 3 s, 9 s and 45 s test utterances; the corresponding results are shown in Fig. 9. In this experiment, MFCC along with SDC and MHEC along with SDC give almost similar results. For the NITS Language database, on the other hand, a maximum relative improvement of 26.53% has been observed for the 3 s test utterances, and for the other two durations the results follow the same trend as for the OGI database. Fig. 10 and Fig. 11 present the DET curves for the above mentioned features on the NITS Language database.

V. CONCLUSION

In order to classify languages into tonal and non-tonal categories, a system using only spectral features, namely MHEC, MFCC and SDC, has been developed. Results of experiments performed on the OGI database and the NITS Language database using the GMM-UBM modelling technique have been presented. The novelty of this system is that it does not make use of any linguistically labelled corpus or prosodic information for this classification task. The experiments show that the combinations of MHEC and MFCC with SDC features give almost similar results under clean data conditions. In future, the performance of the system under noisy data conditions can be studied. As prosodic features are the state of the art features for tonal and non-tonal classification, a combination of spectral and prosodic features is expected to give improved results. The performance of the classification system can be further enhanced by adding sub-segmental or spectral features like formants. Different classifiers such as Deep Neural Networks (DNN), i-vectors, Restricted Boltzmann Machines (RBM), etc. can also be used to model the system.

REFERENCES

[1] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Trans. Speech Audio Process., vol. 4, pp. 31–44, 1996.
[2] L. Wang, E. Ambikairajah, E. H. C. Choi, "A novel method for automatic tonal and non-tonal language classification," IEEE International Conference on Multimedia and Expo, pp. 352–355, 2007.
[3] S. O. Sadjadi, J. H. L. Hansen, "Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions," Proc. IEEE ICASSP, Prague, Czech Republic, pp. 5448–5451, 2011.
[4] S. O. Sadjadi, J. H. L. Hansen, "Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification," Speech Communication, vol. 72, pp. 138–148, 2015.
[5] D. R. Gonzalez, J. R. C. De Lara, "Speaker verification with shifted delta cepstral features: its pseudo-prosodic behaviour," Proceedings of the I Iberian SLTech, 2009.
[6] L. Mary, B. Yegnanarayana, "Extraction and representation of prosodic features for language and speaker recognition," Speech Communication, vol. 50, pp. 782–796, 2008.
[7] N. Ryant, J. Yuan, M. Liberman, "Mandarin tone classification without pitch tracking," International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[8] S. R. Mahadeva Prasanna, B. V. S. Reddy, P. Krishnamoorthy, "Vowel onset point detection using source, spectral peaks, and modulation spectrum energies," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 4, May 2009.
[9] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, J. R. Deller, "Approaches to language identification using Gaussian mixture models and shifted delta cepstral features," Proc. INTERSPEECH, Denver, CO, pp. 33–36, 2002.
[10] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, P. Rice, "An efficient auditory filterbank based on the gammatone function," Appl. Psychol. Unit, Cambridge, UK, APU Rep. 2341, 1988.
[11] A. M. Kondoz, "Voice activity detection," in Digital Speech: Coding for Low Bit Rate Communication Systems, 2nd ed., pp. 357–377, 2004.
[12] D. Martinez, E. Lleida, A. Ortega, A. Miguel, "Prosodic features and formant modelling for an i-vector based language recognition system," International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[13] Y. K. Muthusamy, R. A. Cole, B. T. Oshika, "The OGI multi-language telephone speech corpus," Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.

