

Proceedings of the 10th INDIACom; INDIACom-2016; IEEE Conference ID: 37465
2016 3rd International Conference on “Computing for Sustainable Global Development”, 16th – 18th March, 2016

Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)

A Comparative Study of Discriminative Approaches for Classifying Languages into Tonal and Non-Tonal Categories at Syllabic Level

Biplav Choudhury, NIT Silchar, India. Email id: biplav93@gmail.com
Chuya China (Bhanja), NIT Silchar, India. Email id: chuya.bhanja@gmail.com
Tameem S. Choudhury, NIT Silchar, India. Email id: salmantameem360@gmail.com
R. H. Laskar, NIT Silchar, India. Email id: rhlaskar@ece.nits.ac.in
Aniket Pramanik, NIT Silchar, India. Email id: aniketpramanik@yahoo.co.in
Abstract – Languages spoken by us, on the basis of how they use tone to convey meaning, can be grouped into two categories: Tonal and Non-Tonal languages. Pitch is used as a figure of speech in the case of tonal languages: the connotation of a word changes depending on the pitch or tone used. Both pitch and pitch range are found to be lower for non-tonal languages. A speech signal contains both speaker and language attributes. In tonal and non-tonal language classification, discriminating cues are extracted from the speech signal and fed to different classifiers. This work is unique in that the speech signal is divided into its constituent syllables before any further processing for feature extraction, instead of considering the utterance as a whole. In this paper, the performance analysis of different classifiers is done at the syllabic level for identifying Tonal and Non-Tonal languages. In this classification task, the artificial neural network (ANN) outperforms the other classifiers with an accuracy of 84.21%.

Keywords – Syllable; Vowel Onset Point (VOP); Tonal and non-tonal languages; Artificial neural network (ANN); Support vector machine (SVM); k-Nearest Neighbor (k-NN).

I. INTRODUCTION

Speech can be described as a "spoken" form of communication generated by the valid incorporation of sounds particular to a specific language. The sounds are produced when vowels and consonants blend. An extremely large number of languages exist, which can be differentiated on the basis of their terminologies, their vocabularies, the patterns which arrange the words, and their collections of phrases. The mechanism of identifying one's words is initiated at the most primitive level, the acoustics of the spoken word. Once the aural signal is examined, vocal sounds are further examined to isolate auditory clues and phonetic data. This data is used for higher-level language processes [1]. Generally, different types of characteristics are identified from the speech signal, and the most prominent among them are the prosodic features. Prosody represents the beat, emphasis and intonation of speech. It provides diverse characteristics of the talker and the language: for example, the mental condition of the talker and the type of the speech (assertion, doubt and instruction). It also tells about the occurrence of satire, importance, counterpoint and stress, and other factors of speech that may not be identified from language rules or by selection of words [2].

Language identification (LID) involves automatically identifying languages by analyzing the audio of the language. Over the years, different LID techniques have been developed, focusing on different types of features as well as different types of modeling techniques. Language modeling can be done either by a generative or a discriminative approach [8]. Generative classifiers model the joint probability P(x, y), whereas discriminative classifiers model the conditional probability P(y|x) directly. Though the performance of every classifier depends upon the distribution of the data, in some aspects, such as handling missing data, discriminative classifiers are preferred over generative classifiers. For classifying languages into tonal and non-tonal categories, generative (Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Naive Bayes, etc.) or discriminative (k-NN, SVM, ANN, etc.) modeling techniques can be used. In this paper, a comparison of three different discriminative classifiers, i.e. k-NN, SVM and ANN, has been carried out.

Tonal characteristics of a language mainly manifest themselves as pitch, so this classification is based on pitch information. It has been observed that for discriminating tonal and non-tonal languages, energy information is also essential. So some of the existing features like average pitch, pitch changing speed and pitch changing level [4], along with some new features like average energy, energy changing speed and energy changing level, are used for this purpose. The speech signal is first broken into constituent syllables; then the above-mentioned features are extracted from each syllable, and those features are given to the classifiers for discriminating between tonal and non-tonal languages.

For our work, the OGI language database has been used, which comprises taped telephonic data of 11 different languages. These are recorded versions of actual telephonic conversations of native speakers of English, Farsi, French, German, Japanese, Korean, Mandarin Chinese, Spanish, Tamil, and Vietnamese.
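The six syllable-level features named above can be made concrete with a small sketch. This is an illustrative Python reconstruction, not the paper's implementation (which was in MATLAB); the exact formulas for "changing speed" and "changing level" are not given in the paper, so the definitions below (mean absolute frame-to-frame change per second, and overall range) are assumptions.

```python
import numpy as np

def syllable_features(pitch, energy, frame_shift_s=0.005):
    """Six prosodic features for one syllable, computed from its
    per-frame pitch (Hz) and short-term energy contours.

    'Changing speed' is taken here as the mean absolute frame-to-frame
    change per second, and 'changing level' as the overall range --
    assumed definitions, since the paper gives no formulas."""
    pitch = np.asarray(pitch, dtype=float)
    energy = np.asarray(energy, dtype=float)
    return {
        "avg_pitch": pitch.mean(),
        "pitch_changing_speed": np.abs(np.diff(pitch)).mean() / frame_shift_s,
        "pitch_changing_level": pitch.max() - pitch.min(),
        "avg_energy": energy.mean(),
        "energy_changing_speed": np.abs(np.diff(energy)).mean() / frame_shift_s,
        "energy_changing_level": energy.max() - energy.min(),
    }

# Example: a rising pitch contour, as might occur in a tonal syllable
feats = syllable_features([180, 190, 205, 220], [0.2, 0.5, 0.6, 0.4])
print(feats["avg_pitch"])             # 198.75
print(feats["pitch_changing_level"])  # 40.0
```

In this form, each syllable yields one fixed-length feature vector regardless of its duration, which is what allows the classifiers described later to consume syllables directly.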

Copy Right © INDIACom-2016; ISSN 0973-7529; ISBN 978-93-80544-20-5



Of this database, English, Farsi, French, German, Japanese, Korean, Spanish and Tamil are non-tonal, while the remaining two, Mandarin Chinese and Vietnamese, are tonal [3].

The organization of the paper is as follows: associated work done in this field is given in Section II; the method of approach followed throughout the work is explained in Section III; the discriminative classifiers used are explained in Section IV; results are delineated in Section V; Section VI lists the findings; and Section VII finally wraps up the paper.

II. RELATED WORK

Categorization of languages into tonal and non-tonal classes has previously been attempted at the utterance level. L. Wang et al. [4] modeled the pitch by considering the rate of pitch change and the way in which it changes. Normalization was done using Voiced Duration and Voiced Counter. Using a neural network, the accuracy obtained was 80.6%. L. Mary et al. [9] conducted language and speaker identification tasks at the syllabic level using prosodic features. They segmented the continuous speech into syllable-like units by using Vowel Onset Points (VOPs) as anchor points for identifying syllables.

III. METHODOLOGY

The complete process has been split into 5 steps.

Elimination of Unvoiced Part – Voiced signals comprise many portions of silence. Thus, we have to remove such parts from our signal to identify "clear" speech segments. Usually, the signal power content of the silence parts is less than that of the voiced parts. To achieve the identification of clear speech segments, Signal Energy and Spectral Centroid are used [5]. The steps followed are:
• Signal Energy and Spectral Centroid are calculated as two sequences.
• Thresholds are estimated for both criteria.
• Both sequences are subjected to their respective thresholds.
• Using the above thresholding criterion, clean speech segments are identified.

Spotting of Vowel Onset Points (VOPs) – The Vowel Onset Point (VOP) is the moment at which the utterance of a vowel begins in our speech signal. The correct determination of VOPs is important, as it is imperative to identify consonant–vowel (CV) units in different languages. Of all possible patterns of combination of vowel and consonant, the CV pattern is the most important, as it is the most common pattern among all languages, particularly Indic languages. These onset points of vowels are detected using various methods. Some of them are: the increasing nature of resonance peaks in the amplitude/frequency spectrum; pitch and energy characteristics; zero-crossing rate; wavelet transform; neural networks; dynamic time warping; and excitation source information [6]. The transition from a consonant to a vowel can be identified by the greater strength of excitation; this is used as a cue to detect the onset of vowels in our speech signal. Fig. 1 shows such an example. Generally, the instantaneous energy of our signal is higher for vowels than for voiced consonants. Thus the places with significant change in signal energy give the confirmation for the detection of VOPs and can be used as a clue for detecting their occurrence. Using a local peak identifying algorithm, peaks are located in the signal energy plot that signal the VOPs [7].

Fig. 1. Location of a VOP in a speech signal.

Syllabic segmentation – Once the VOPs have been identified, we divide the speech into its constituent syllables. This can be referred to as the syllabic approach. Syllables are the basic units of sound, generally containing a single vowel surrounded by consonants. The region surrounding a consonant–vowel (CV) utterance gives us the required parameters. For this work, the characteristics of the signal around the VOP, 25 milliseconds before and 40 milliseconds after the VOP, are taken to parameterize the features of the CV unit. When extracting these, 10 overlapping windows surround the VOP, where each window is of 20 milliseconds. The frame shift considered is 5 milliseconds [9].

Feature extraction – This study was done considering pitch and energy as the features. Pitch tracking involves tracking the fundamental frequency. The RAPT (Robust Algorithm for Pitch Tracking) [9] algorithm is chosen to calculate the fundamental frequency, and is available as a standard function in MATLAB. The short-term energy of the signal is considered due to the dynamic nature of the speech signal. The part of the audio signal containing vocalized sounds will have very large energy compared to the silent regions. This classification using the short-term energy is extremely useful, and is particularly utilized for identifying the silent portions of our speech signal and removing them before processing the signal. So overall, the features used are the fundamental frequency and the short-term energy. For each syllable, three fundamental frequency parameters and one energy level parameter are obtained, all of which are normalized before being passed on for processing.
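The short-term energy computation and the local-peak picking used for VOP spotting can be sketched as follows. This is an illustrative Python version (the paper's implementation was in MATLAB), and the peak criterion is a simplified stand-in for the excitation-based methods of [6] and [7]; the 20 ms window and 5 ms shift follow the text.

```python
import numpy as np

def short_term_energy(signal, fs, win_s=0.020, shift_s=0.005):
    """Short-term energy contour: 20 ms windows with a 5 ms frame
    shift, matching the windowing described in the text."""
    win, shift = int(win_s * fs), int(shift_s * fs)
    n_frames = 1 + max(0, (len(signal) - win) // shift)
    return np.array([np.sum(signal[i * shift : i * shift + win] ** 2)
                     for i in range(n_frames)])

def spot_vops(energy, threshold_ratio=0.3):
    """Candidate VOPs: local peaks of the energy contour that also
    exceed a global threshold (simplified local-peak picking)."""
    thr = threshold_ratio * energy.max()
    return [i for i in range(1, len(energy) - 1)
            if energy[i] > thr and energy[i] >= energy[i - 1]
            and energy[i] > energy[i + 1]]

# Toy energy contour with two clear peaks (frames 2 and 6):
print(spot_vops(np.array([0.1, 0.2, 1.0, 0.3, 0.1, 0.2, 0.9, 0.2])))  # [2, 6]
```

Each detected peak index then anchors a 65 ms analysis region (25 ms before, 40 ms after) from which the CV-unit features are extracted.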




Data given to the classifiers – Once the pitch and the short-term energy of the signal are extracted and modeled mathematically, they are given to the classifiers as input. Normalization of the data is always carried out with respect to the global maxima. The unique aspect of our work is that the speech signal is not considered as a whole. The speech signal is divided into syllable segments, assuming the CV structure, as it is the most dominant. These segments are then analyzed individually, and the final decision is taken based on the class of the majority of the constituent syllables. This method of using the class of the majority of the syllables to determine the tonality of the language has improved the performance of our classifier: the accuracy has improved by a significant 4–5% over what was achieved by analyzing the utterance as a whole. The time required has also remained roughly the same. The accuracy of our syllable detection step is extremely important for the precise implementation of this step.

In Fig. 2, all the steps involved in our work are shown as a flowchart.

Fig. 2. Process Flow chart.

IV. CLASSIFIERS UTILIZED

Once the required features are extracted, they are given to the classifiers. Three different types of classifiers are used: the k-Nearest Neighbor algorithm (k-NN), the Artificial Neural Network (ANN) and the Support Vector Machine (SVM). A preliminary description of the classifiers follows.

A. k-Nearest Neighbor Algorithm

The k-NN algorithm is one of the most commonly used methods for classifying our testing data. This algorithm, as the name signifies, finds the k nearest neighbors in the n-dimensional space, where n refers to the number of features. The output of this classifier gives the membership of the testing data in one of the classes, depending on the class of the majority of its neighbors among the k nearest ones. k = 1 refers to a special case where the class of the testing data depends only on the single nearest neighbor among the training data. This algorithm is widely used due to its simplistic nature. Results can be made more precise by considering a weighted contribution of the closest neighbors, giving the nearer neighbors more weight than the farther ones. This classifier has a downside: it is imperative for us to know the class of the training data before we can classify the testing data. This can be said to be the training of the classifier. The parameters considered for this classifier, as available in MATLAB, are:
• Distance considered = Hamming distance
• Value of k = 5

B. Artificial Neural Network

A neural network [12] can be defined as a computer architecture, modeled after the human brain, in which processors are connected in a manner suggestive of connections between neurons, and which learns by trial and error. It is commonly used to estimate unidentified mathematical functions. The neurons are arranged in layers, and each layer has a number of neurons called nodes, which are all interconnected. Every neural network has an activation function, which determines the output of each neuron. The input layer is presented with the training data, which is passed to the hidden layer of neurons where the actual processing is done. Once the processing is done, the hidden layers provide their output to the output layer. Many such forward propagation runs can be done to minimize the error. Advantages include flexible learning, fault tolerance, non-linearity, etc. A back propagation neural network is used, which works on an error-correcting algorithm. After the completion of every forward propagation run during the training phase, the output is matched against the provided output. Depending on the error margin, further forward runs are made after weight adjustments of the interconnections. The parameters, as implemented in MATLAB, were:
• Input layer neurons = 4
• Hidden layer neurons = 14
• Processing (hidden) layers = 1
• Output layer neurons = 1

C. Support Vector Machines

This classifier is a kernel learning method and is used only in cases where the data has just two divisions. Classification is done on the basis of a decision limit or boundary: the two classes are separated by a decision boundary [10]. The hyperplane that best separates the classes is chosen by the SVM classifier, where the best hyperplane can be defined as the one having the biggest margin between the classes. In the case of a multi-layered SVM classifier, a large number of Support Vectors are present in the output layer. A Gaussian kernel function is used here. This classifier was also implemented using MATLAB.
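The three classifiers and the parameter settings listed above can be mirrored outside MATLAB. The sketch below uses scikit-learn on synthetic data; both the data and the use of scikit-learn are assumptions for illustration, not the paper's setup. Note that a Hamming metric is unusual for continuous features (almost all pairwise distances become equal), which is worth keeping in mind when reproducing the k-NN configuration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the four normalized per-syllable features
# (three fundamental-frequency parameters + one energy parameter)
X = np.vstack([rng.normal(0.7, 0.1, (200, 4)),   # "tonal" syllables
               rng.normal(0.4, 0.1, (200, 4))])  # "non-tonal" syllables
y = np.array([1] * 200 + [0] * 200)              # 1 = Class I (tonal)

classifiers = {
    # k = 5 with Hamming distance, as in the paper's k-NN settings
    "k-NN": KNeighborsClassifier(n_neighbors=5, metric="hamming"),
    # 4 input neurons, one hidden layer of 14 neurons, 1 output neuron
    "ANN": MLPClassifier(hidden_layer_sizes=(14,), max_iter=2000,
                         random_state=0),
    # Gaussian (RBF) kernel SVM
    "SVM": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```

On real data the per-syllable predictions would then be aggregated by the majority vote described earlier to produce the utterance-level decision.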




V. RESULTS

This section lists the results obtained. Training data of 4800 seconds is taken, of which 2400 seconds is tonal and the remaining 2400 seconds is non-tonal. For the purpose of testing, 38 samples are taken, which include 19 tonal and 19 non-tonal language speech files. For the testing, a single file of 600 seconds was made, which included 300 seconds of tonal and 300 seconds of non-tonal language speech. The testing files and the training files were ensured to be separate. Once the decision is taken at the syllabic level, the final decision is taken depending on the class of the majority of the syllables. The accuracy is measured by calculating the total correct decisions made out of the 38 samples, for each of the classifiers. The tonal class is referred to as Class I and the non-tonal class as Class II.

1. k-Nearest Neighbor Algorithm

Table I illustrates the results obtained using the k-NN algorithm. The accuracy obtained is 81.58%.

TABLE I
            Number of test samples   Correct Detection   Correct Detection %
Class I     19                       14                  73.684
Class II    19                       17                  89.473

2. Artificial Neural Network

The Artificial Neural Network yielded the results shown in Table II. The accuracy obtained is 84.21%.

TABLE II
            Number of test samples   Correct Detection   Correct Detection %
Class I     19                       15                  78.947
Class II    19                       17                  89.473

3. Support Vector Machines

Table III illustrates the results obtained using Support Vector Machines. The accuracy obtained is 65.79%.

TABLE III
            Number of test samples   Correct Detection   Correct Detection %
Class I     19                       13                  68.421
Class II    19                       12                  63.157

VI. OBSERVATIONS

The study of these three different classifiers gave the following observations:
• The Artificial Neural Network has the highest accuracy, with 84.21%. The k-NN classifier follows with 81.58%. The SVM classifier's performance was the least accurate, with 65.79%. Less variance among the training data, which generally downgrades the performance of the SVM classifier, may be a reason for its poor performance. A comparison of the three different discriminative approaches is shown in Fig. 3.
• Performance can be increased by increasing the duration of the speech files.
• The least time among all classifiers was taken by the k-NN classifier.

Figure 3 gives a quantitative comparison of the ability of the classifiers to correctly identify a language as tonal or non-tonal.

Fig. 3. Comparison between k-NN, ANN, SVM.

VII. CONCLUSIONS

In this paper, we have studied the performance of different classifiers, i.e. the Artificial Neural Network (ANN), k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM), in identifying whether an unknown speech sample belongs to a tonal or non-tonal class. The ANN classifier can be said to be the best of these three classifiers. Using a rather uncomplicated methodology, good results have been obtained for the ANN and k-NN classifiers. By pre-classifying the languages into two broad classes, we can reduce the complexity of automatic identification of languages: once the languages have been pre-classified into two categories, the final step of language identification becomes less time consuming. In our pre-classification stage, we use just prosodic information, whereas for identification of the language, acoustic and phonotactic features are used. This reduces the time spent in the pre-classification stage.

An automatic Language Identification System can be built using this pre-classification work as a platform.
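To make the decision rule and the reported numbers concrete, the syllable-majority vote and the accuracy arithmetic behind Tables I–III can be sketched as follows. The vote function is an illustrative reconstruction of the rule described in the text; only the correct-detection counts come from the paper.

```python
from collections import Counter

def utterance_decision(syllable_classes):
    """Final tonal/non-tonal decision for an utterance: the class
    assigned to the majority of its constituent syllables."""
    return Counter(syllable_classes).most_common(1)[0][0]

print(utterance_decision(["tonal", "tonal", "non-tonal"]))  # tonal

# Overall accuracy = correct decisions out of the 38 test samples,
# using the Class I / Class II counts from Tables I-III
for name, class_i, class_ii in [("k-NN", 14, 17), ("ANN", 15, 17),
                                ("SVM", 13, 12)]:
    print(name, round(100 * (class_i + class_ii) / 38, 2), "%")
# k-NN 81.58 %, ANN 84.21 %, SVM 65.79 %
```

This reproduces the reported accuracies exactly, confirming that they are simply the pooled correct decisions over both classes.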




REFERENCES
Journal References
[1]. "Implementation of Advanced Communication Aid for People with Severe Speech Impairment", IOSR Journal of Electronics and Communication Engineering (IOSR-JECE), e-ISSN: 2278-2834, p-ISSN: 2278-8735, Volume 9, Issue 2, Ver. III (Mar–Apr 2014), pp. 61–66.
[2]. F. Orsucci et al., "Prosody and synchronization in cognitive neuroscience", EPJ Nonlinear Biomedical Physics, 1:6, 2013.
[3]. Y. K. Muthusamy, R. A. Cole and B. T. Oshika, "The OGI Multi-language Telephone Speech Corpus", Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.
[4]. L. Wang, E. Ambikairajah and E. H. C. Choi, "A novel method for automatic tonal and non-tonal language classification", in IEEE International Conference on Multimedia and Expo, pp. 352–355, 2007.
[5]. Suryakanth V. Gangashetty, C. Chandra Shekhar and B. Yegnanarayana, "Extraction of fixed dimension patterns from varying duration segments of consonant-vowel utterances".
[6]. S. R. Mahadeva Prasanna and B. Yegnanarayana, "Detection of Vowel Onset Point Events using Excitation Information".
[7]. Theodoros Theodorou, Iosif Mporas and Nikos Fakotakis, "Automatic Sound Classification of Radio Broadcast News", International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 5, No. 1, March 2012.
[8]. A. Y. Ng and M. I. Jordan, "On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes".
[9]. Leena Mary and B. Yegnanarayana, "Extraction and representation of prosodic features for language and speaker recognition", Speech Communication, vol. 50, pp. 782–796, 2008.
[10]. W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo and D. A. Reynolds, "Language Recognition with Support Vector Machines", in Proceedings of Odyssey, pp. 41–44, 2004.
Book References
[1]. David Talkin, "A Robust Algorithm for Pitch Tracking", chapter 14, 1995.
[2]. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India Private Limited, New Delhi, 2005.
