Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
German, Japanese, Korean, Spanish and Tamil are non-tonal, while the remaining two, Mandarin Chinese and Vietnamese, are tonal [3].

The organization of the paper is as follows: related work done in this field is given in Section II; the method of approach followed throughout the work is explained in Section III; the discriminative classifiers used are explained in Section IV; results are delineated in Section V; Section VI lists the findings; and Section VII wraps up the paper.
III. METHODOLOGY
The complete process has been split into 5 steps.
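At a glance, the five steps described in this section can be strung together as in the following Python sketch. Every stage here is a hypothetical placeholder supplied by the caller; this is our illustration, not code from the work itself, which was implemented in MATLAB:

```python
def classify_tonality(signal, fs, stages):
    """Run the five-stage pipeline; `stages` maps stage names to callables.

    All stage implementations are supplied by the caller -- this sketch only
    fixes the order of the steps and the final majority vote over syllables.
    """
    voiced = stages["remove_unvoiced"](signal, fs)         # Step 1: drop silence
    vops = stages["detect_vops"](voiced, fs)               # Step 2: vowel onset points
    syllables = stages["segment_syllables"](voiced, vops)  # Step 3: CV-unit segmentation
    features = [stages["extract_features"](syl) for syl in syllables]  # Step 4
    votes = [stages["classify"](feat) for feat in features]            # Step 5
    # Final decision: the class of the majority of the constituent syllables
    return max(set(votes), key=votes.count)
```

The per-syllable majority vote in the last line is the decision rule the paper credits with a 4-5% accuracy gain over classifying the utterance as a whole.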
Elimination of Unvoiced Part - Speech signals comprise many portions of silence, which must be removed to identify "clear" speech segments. Usually, the signal power content of silent parts is lower than that of voiced parts. To identify clear speech segments, the Signal Energy and the Spectral Centroid are used [5]. The steps followed are:
- The Signal Energy and the Spectral Centroid are calculated as two sequences.
- Thresholds are estimated for both criteria.
- Both sequences are subjected to their respective thresholds.
- Using the above thresholding criterion, clean speech segments are identified.

Spotting of Vowel Onset Points (VOPs) - The Vowel Onset Point (VOP) is the moment at which the utterance of a vowel begins in the speech signal. The correct determination of VOPs is important, as it is imperative for identifying consonant-vowel (CV) units in different languages. Of all possible combinations of vowels and consonants, the CV pattern is the most important, as it is the most common pattern across languages, particularly Indic languages. These vowel onset points are detected using various methods, among them the increasing nature of resonance peaks in the amplitude and frequency spectrum; pitch and energy characteristics; zero-crossing rate; wavelet transforms; neural networks; dynamic time warping; and excitation source information [6]. The transition from a consonant to a vowel can be identified by the greater strength of excitation; this is used as a cue to detect the onset of vowels in the speech signal. Fig. 1 shows such an example. Generally, the instantaneous energy of the signal is higher for vowels than for voiced consonants. Thus, places with a significant change in signal energy give confirmation for the detection of VOPs and can be used as a clue for detecting their occurrence. Using a local peak identifying algorithm, the peaks that signal the VOPs are located in the signal energy plot [7].

Fig. 1. Location of a VOP in a speech signal.

Syllabic segmentation - Once the VOPs have been identified, we divide the speech into its constituent syllables; this can be referred to as the syllabic approach. Syllables are the basic units of sound, generally containing a single vowel surrounded by consonants. The region surrounding a consonant-vowel (CV) utterance gives us the required parameters. For this work, the characteristics of the signal around the VOP, from 25 milliseconds before to 40 milliseconds after it, are taken to parameterize the features of the CV unit. When extracting these, 10 overlapping windows surrounding the VOP are used, each window 20 milliseconds long. The frame shift considered is 5 milliseconds [9].

Feature extraction - This study considered pitch and energy as the features. Pitch tracking involves tracking the fundamental frequency; the RAPT (Robust Algorithm for Pitch Tracking) [9] algorithm, available as a standard function in MATLAB, is chosen to calculate it. The short-term energy of the signal is considered due to the dynamic nature of speech: the parts of the audio signal containing vocalized sounds have very large energy compared to the silent regions. This classification using short-term energy is extremely useful, and is particularly utilized for identifying the silent portions of the speech signal and removing them before processing. So, overall, the features used are the fundamental frequency and the short-term energy. For each syllable, three fundamental-frequency parameters and one energy-level parameter are obtained, all of which are normalized before being passed on for processing.
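The windowing scheme and short-term energy computation described above can be sketched as follows. This is our stdlib-only Python illustration of the stated parameters (10 windows, 20 ms each, 5 ms shift, spanning 25 ms before to 40 ms after a VOP); `frame_energies` is a hypothetical helper, not a function from the paper's MATLAB implementation:

```python
def window_starts(vop_ms, before_ms=25, shift_ms=5, n_windows=10):
    """Start times (ms) of the overlapping analysis windows around a VOP."""
    first = vop_ms - before_ms
    return [first + i * shift_ms for i in range(n_windows)]

def short_term_energy(samples):
    """Short-term energy of one window: sum of squared sample values."""
    return sum(s * s for s in samples)

def frame_energies(signal, fs_hz, vop_ms, win_ms=20):
    """Energy of each 20 ms window around the VOP (signal is a list of floats)."""
    win_len = int(fs_hz * win_ms / 1000)
    energies = []
    for start_ms in window_starts(vop_ms):
        start = int(fs_hz * start_ms / 1000)
        energies.append(short_term_energy(signal[start:start + win_len]))
    return energies
```

With these parameters the windows start at VOP-25 ms, VOP-20 ms, ..., VOP+20 ms, so the last 20 ms window ends exactly 40 ms after the VOP, matching the 65 ms CV region described above.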
Data given to the classifiers - Once the pitch and the short-term energy of the signal are extracted and modeled mathematically, they are given to the classifiers as input. Normalization of the data is always carried out with respect to the global maximum. The unique aspect of our work is that the speech signal is not considered as a whole: it is divided into syllable segments, assuming the CV structure, as these are the most dominant. These segments are then analyzed individually, and the final decision is taken based on the class of the majority of the constituent syllables. This method of using the class of the majority of the syllables to determine the tonality of the language has improved the performance of our classifiers: the accuracy has improved by a significant 4-5% over what was achieved by analyzing the utterance as a whole, while the time required has remained roughly the same. The accuracy of the syllable detection step is therefore extremely important for the precise implementation of this step.

In Fig. 2, all the steps involved in our work are shown as a flowchart.

Fig. 2. Process Flow chart.

IV. CLASSIFIERS UTILIZED

Once the required features are extracted, they are given to the classifiers. Three different types of classifiers are used: the k-Nearest Neighbor algorithm (k-NN), Artificial Neural Network (ANN) and Support Vector Machine (SVM). A preliminary description of the classifiers follows.

A. k-Nearest Neighbor Algorithm

The k-NN algorithm is one of the most commonly used methods for classifying testing data. This algorithm, as the name signifies, finds the k nearest neighbors in the n-dimensional space, where n is the number of features. The classifier assigns the testing data to one of the classes, depending on the class of the majority of its k nearest neighbors; k = 1 is the special case where the class of the testing data depends only on the single nearest neighbor among the training data. This algorithm is widely used due to its simplicity. Results can be made more precise by considering a weighted contribution of the closest neighbors, giving the nearer neighbors more weight than the farther ones. This classifier has a downside: it is imperative to know the class of the training data before the testing data can be classified; this can be said to be the training of the classifier. The parameters considered, as implemented in MATLAB, are given below:
Distance considered = Hamming distance
Value of k = 5

B. Artificial Neural Network

A neural network [12] can be defined as a computer architecture, modeled after the human brain, in which processors are connected in a manner suggestive of the connections between neurons and which learns by trial and error. It is commonly used to estimate unknown mathematical functions. The neurons, called nodes, are arranged in layers and are all inter-connected. Every neural network has an activation function, which determines the output of each neuron. The input layer is presented with the training data, which is passed to the hidden layer of neurons where the actual processing is done; the hidden layers then provide their output to the output layer. Many such forward-propagation runs can be done to minimize the error. Advantages include flexible learning, fault tolerance and non-linearity. A back-propagation neural network, which works on an error-correcting algorithm, is used: after every forward-propagation run during the training phase, the output is matched against the provided output and, depending on the error margin, further forward runs are made after adjusting the weights of the interconnections. The parameters, as implemented in MATLAB, were:
Input layer neurons = 4
Hidden (processing) layers = 1
Hidden layer neurons = 14
Output layer neurons = 1

C. Support Vector Machines

This classifier is a kernel learning method and is used only in cases where the data has exactly two classes. Classification is done on the basis of a decision limit or boundary: the two classes are separated by a decision boundary [10]. The hyperplane that best separates the classes is chosen by the SVM classifier, where the best hyperplane is defined as the one having the biggest margin between the classes. In the case of a multi-layered SVM classifier, a large number of Support Vectors is present in the output layer. A Gaussian kernel function is used here. This classifier was also implemented using MATLAB.
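Of the three classifiers, k-NN is the simplest to sketch. The decision rule of Section IV-A (Hamming distance, k = 5, majority vote among the neighbors) can be written in a few lines of stdlib Python; this is our illustration, not the paper's MATLAB implementation, and the 4-element feature vectors stand in for the three pitch and one energy parameters per syllable:

```python
from collections import Counter

def hamming(a, b):
    """Hamming distance: number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, sample, k=5):
    """Classify `sample` by majority vote among its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs; the distance metric
    and k match the parameters reported in Section IV-A.
    """
    neighbors = sorted(train, key=lambda pair: hamming(pair[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

A weighted variant, as noted above, would scale each neighbor's vote by the inverse of its distance instead of counting all k neighbors equally.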
V. RESULTS

This section lists the results obtained. Training data of 4800 seconds is taken, of which 2400 seconds is tonal and 2400 seconds is non-tonal. For the purpose of testing, 38 samples are taken, comprising 19 tonal and 19 non-tonal language speech files. For testing, a single file of 600 seconds was made, which included 300 seconds of tonal and 300 seconds of non-tonal language speech. The testing files and the training files were ensured to be separate. Once the decision is taken at the syllabic level, the final decision is taken depending on the class of the majority of the syllables. The accuracy is measured by counting the total correct decisions made out of the 38 samples for each of the classifiers. The tonal class is referred to as Class I and the non-tonal class as Class II.

1. k Nearest Neighbor Algorithm

Table I illustrates the results obtained using the k-NN classifier. The accuracy obtained is 81.58%.

TABLE I
           Number of test samples   Correct Detection   Correct Detection %
Class I    19                       14                  73.684
Class II   19                       17                  89.473

2. Artificial Neural Network

Table II illustrates the results obtained using the ANN classifier. The accuracy obtained is 84.21%.

TABLE II
           Number of test samples   Correct Detection   Correct Detection %
Class I    19                       15                  78.947
Class II   19                       17                  89.473

3. Support Vector Machines

Table III illustrates the results obtained using Support Vector Machines. The accuracy obtained is 65.79%.

TABLE III
           Number of test samples   Correct Detection   Correct Detection %
Class I    19                       13                  68.421
Class II   19                       12                  63.157

VI. OBSERVATIONS

The study of these three classifiers gave the following observations:
- The Artificial Neural Network has the highest accuracy, at 84.21%; the k-NN classifier follows with 81.58%; the SVM classifier's performance was the least accurate, at 65.79%. Less variance among the training data may be a reason for the poor performance of the SVM classifier. A comparison of the three discriminative approaches is shown in Fig. 3.
- Performance can be increased by increasing the duration of the speech files.
- The least time among all classifiers was taken by the k-NN classifier.

Figure 3 gives a quantitative comparison of the ability of the classifiers to correctly identify a language as tonal or non-tonal.

Fig. 3. Comparison between k-NN, ANN, SVM.

VII. CONCLUSIONS

In this paper, we have studied the performance of different classifiers, i.e. Artificial Neural Network (ANN), k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM), in identifying whether an unknown speech sample belongs to the tonal or the non-tonal class. The ANN classifier can be said to be the best of the three. Using a fairly uncomplicated methodology, good results have been obtained for the ANN and k-NN classifiers. By pre-classifying languages into these two broad classes, we can reduce the complexity of automatic language identification: once the languages have been pre-classified into the two categories, the final identification step becomes less time consuming. In our pre-classification stage we use only prosodic information, whereas for the identification of a language, acoustic and phonotactic features are used; this reduces the time spent in the pre-classification stage.

An automatic Language Identification System can be built using this pre-classification work as a platform.
REFERENCES

Journal References
[1]. "Implementation of Advanced Communication Aid for People with Severe Speech Impairment," IOSR Journal of Electronics and Communication Engineering (IOSR-JECE), e-ISSN: 2278-2834, p-ISSN: 2278-8735, vol. 9, issue 2, ver. III (Mar.-Apr. 2014), pp. 61-66.
[2]. Orsucci et al., "Prosody and synchronization in cognitive neuroscience," EPJ Nonlinear Biomedical Physics, 1:6, 2013.
[3]. Y. K. Muthusamy, R. A. Cole and B. T. Oshika, "The OGI Multi-language Telephone Speech Corpus," in Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.
[4]. L. Wang, E. Ambikairajah and E. H. C. Choi, "A novel method for automatic tonal and non-tonal language classification," in IEEE International Conference on Multimedia and Expo, pp. 352-355, 2007.
[5]. Suryakanth V. Gangashetty, C. Chandra Shekhar and B. Yegnanarayana, "Extraction of fixed dimension patterns from varying duration segments of consonant-vowel utterances."
[6]. S. R. Mahadeva Prasanna and B. Yegnanarayana, "Detection of Vowel Onset Point Events using Excitation Information."
[7]. Theodoros Theodorou, Iosif Mporas and Nikos Fakotakis, "Automatic Sound Classification of Radio Broadcast News," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 5, no. 1, March 2012.
[8]. A. Y. Ng and M. I. Jordan, "On Discriminative vs. Generative classifiers: A comparison of logistic regression and naïve Bayes."
[9]. Leena Mary and B. Yegnanarayana, "Extraction and representation of prosodic features for language and speaker recognition," Speech Communication, vol. 50, pp. 782-796, 2008.
[10]. W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo and D. A. Reynolds, "Language Recognition with Support Vector Machines," in Proceedings of Odyssey, pp. 41-44, 2004.

Book References
[1]. David Talkin, "A Robust Algorithm for Pitch Tracking," chapter 14, 1995.
[2]. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India Private Limited, New Delhi, 2005.