Classification of audio signals using SVM and RBFNN

P. Dhanalakshmi, S. Palanivel, V. Ramalingam
Department of Computer Science and Engineering, Annamalai University, Annamalainagar, Chidambaram 608 002, Tamil Nadu, India
Article info
Keywords:
Support vector machines
Radial basis function neural network
Linear predictive coefficients
Linear predictive cepstral coefficients
Mel-frequency cepstral coefficients
Abstract
In the age of digital information, audio data has become an important part of many modern computer
applications. Audio classification has become a focus in the research of audio processing and pattern
recognition. Automatic audio classification is very useful for audio indexing, content-based audio
retrieval and on-line audio distribution, but it is a challenge to extract the most common and salient
themes from unstructured raw audio data. In this paper, we propose effective algorithms to automatically
classify audio clips into one of six classes: music, news, sports, advertisement, cartoon and movie. For
these categories a number of acoustic features that include linear predictive coefficients, linear predictive
cepstral coefficients and mel-frequency cepstral coefficients are extracted to characterize the audio content.
Support vector machines are applied to classify audio into their respective classes by learning from
training data. The proposed method then extends the application of a radial basis function neural network
(RBFNN) to the classification of audio. The RBFNN applies a nonlinear transformation into a higher-dimensional
hidden space, followed by a linear transformation at the output. Experiments on different genres of the
various categories show that the classification results are significant and effective.
© 2008 Elsevier Ltd. All rights reserved.
1. Introduction
Audio data is an integral part of many modern computer and
multimedia applications. A typical multimedia database often con-
tains millions of audio clips, including environmental sounds,
machine noise, music, animal sounds, speech sounds, and other
non-speech utterances. The effectiveness of their deployment is
greatly dependent on the ability to classify and retrieve the audio
files in terms of their sound properties or content. The need to
automatically recognize to which class an audio sound belongs
makes audio classification and categorization an emerging and
important research area. However, raw audio data is a featureless collection of bytes, with only the most
rudimentary fields attached such as name, file format and sampling rate.
Content-based classification and retrieval of audio sound is
essentially a pattern recognition problem in which there are two
basic issues: feature selection, and classification based on the
selected features. In the first step, an audio sound is reduced to a
small set of parameters using various feature extraction tech-
niques. The terms linear predictive coefficients (LPC), linear predic-
tive cepstral coefficients (LPCC), mel-frequency cepstral
coefficients (MFCC) refer to the features extracted from audio data.
In the second step, classification or categorization algorithms rang-
ing from simple Euclidean distance methods to sophisticated
statistical techniques are carried out over these coefficients. The
efficacy of an audio classification or categorization depends on
the ability to capture proper audio features and to accurately clas-
sify each feature set corresponding to its own class.
1.1. Related work
In recent years, there have been many studies on auto-
matic audio classification and segmentation using several features
and techniques. The most common problem in audio classification
is speech/music classification, in which the highest accuracy has
been achieved, especially when the segmentational information
is known beforehand. In Lin, Chen, Truong, and Chang (2005),
wavelets are first applied to extract acoustical features such as sub-
band power and pitch information. The method uses a bottom-up
SVM over these acoustic features and additional parameters, such
as frequency cepstral coefficients, to accomplish audio classifica-
tion and categorization. An audio feature extraction and a multi-
group classification scheme that focuses on identifying
discriminatory time–frequency subspaces using the local discrim-
inant bases (LDB) technique has been described in Umapathy,
Krishnan, and Rao (2007). For pure music and vocal music, a num-
ber of features such as LPC and LPCC are extracted in Xu, Maddage,
and Shao (2005), to characterize the music content. Based on calcu-
lated features, a clustering algorithm is applied to structure the
music content.
A new approach towards high performance speech/music dis-
crimination on realistic tasks related to the automatic transcription
of broadcast news is described in Ajmera, McCowan, and Bourlard
(2003), in which an artificial neural network (ANN) and hidden
Markov model (HMM) are used. In Kiranyaz, Qureshi, and Gabbouj
(2006), a generic audio classification and segmentation approach
for multimedia indexing and retrieval is described. A method is
proposed in Panagiotakis and Tziritas (2005) for speech/music dis-
crimination based on root mean square and zero-crossings. The
method proposed in Eronen et al. (2006), investigates the feasibil-
ity of an audio-based context recognition system where simplistic
low-dimensional feature vectors are evaluated against more stan-
dard spectral features. Using discriminative training, competitive
recognition accuracies are achieved with very low-order hidden
Markov models (Ajmera et al., 2003).
The classification of continuous general audio data for content-
based retrieval was addressed in Li, Sethi, Dimitrova, and McGee
(2001), where the audio segments were classified based on
MFCC and LPC. They also showed that cepstral-based features
gave a better classification accuracy. The method described in
content-based audio classification and retrieval using joint
time–frequency analysis exploits the non-stationary behavior of
music signals and extracts features that characterize their spectral
change over time (Esmaili, Krishnan, & Raahemifar, 2004). The
audio signals were decomposed in Umapathy, Krishnan, and Jimaa
(2005), using an adaptive time frequency decomposition algo-
rithm, and the signal decomposition parameter based on octave
(scaling) was used to generate a set of 42 features over three fre-
quency bands within the auditory range. These features were ana-
lyzed using linear discriminant functions and classified into six
music groups. An approach given in Jiang, Bai, Zhang, and Xu
(2005), uses support vector machine (SVM) for audio scene classi-
fication, which classifies audio clips into one of five classes: pure
speech, non-pure speech, music, environment sound, and silence.
Radial basis function neural networks (RBFNN) are used in McC-
onaghy, Leung, Boss, and Varadan (2003) to classify real-life audio
radar signals that are collected by a ground surveillance radar
mounted on a tank.
For audio retrieval, a new metric has been proposed in Guo and
Li (2003), called distance-from-boundary (DFB). When a query
audio is given, the system first finds a boundary inside which the
query pattern is located. Then, all the audio patterns in the data-
base are sorted by their distances to this boundary. All boundaries
are learned by the SVMs and stored together with the audio data-
base. In Mubarak, Ambikairajah, and Epps (2005) a speech/music
discrimination systemwas proposed based on mel-frequency ceps-
tral coefficient (MFCC) and GMMclassifier. This systemcan be used
to select the optimum coding scheme for the current frame of an
input signal without knowing a priori whether it contains
speech-like or music-like characteristics. A hybrid model com-
prised of Gaussian mixtures models (GMMs) and hidden Markov
models (HMMs) is used to model generic sounds with large intra
class perceptual variations in Rajapakse and Wyse (2005). The
number of mixture components in the GMM was derived using
the minimum description length (MDL) criterion.
A new pattern classification method called the nearest feature
line (NFL) is proposed in Li (2000), where the NFL explores the
information provided by multiple prototypes per class. Audio fea-
tures like MFCC, ZCR, brightness and bandwidth, spectrum flux
were extracted (Lu, Zhang, & Li, 2003), and the performance using
SVM, K-nearest neighbor (KNN), and Gaussian mixture model
(GMM) were compared. Audio classification techniques for speech
recognition, and audio segmentation for unsupervised multispeaker
change detection, are proposed in Huang and Hansen (2006).
Two new extended-time features: variance of the spectrum flux
(VSF) and variance of the zero-crossing rate (VZCR) are used to pre-
classify the audio and supply weights to the output probabilities of
the GMM networks. The classification is then implemented using
weighted GMM networks.
1.2. Outline of the work
In this paper, automatic audio feature extraction and classifica-
tion approaches are presented. In order to discriminate the six cat-
egories of audio namely music, news, sports, advertisement,
cartoon and movie, a number of features such as LPC, LPCC, MFCC
are extracted to characterize the audio content. Support vector ma-
chine (SVM) is applied to obtain the optimal class boundary be-
tween the classes by learning from training data. The
performance of SVM is compared to RBF network. Each RBF center
approximates a cluster of training data vectors that are close to
each other in Euclidean space. When a vector is input to the
RBFNN, the centers near to that vector become strongly activated,
in turn activating certain output nodes. Experimental results show
that the RBFNN with mel cepstral features provides better
classification accuracy. Fig. 1 illustrates the block diagram of
audio classification.
The paper is organized as follows. The acoustic feature extrac-
tion is presented in Section 2, and modeling techniques for audio
classification are described in Section 3. Experimental results using SVM
and RBF are reported in Section 4. Finally, conclusions and future
work are given in Section 5.
Fig. 1. Block diagram of audio classification.
2. Acoustic feature extraction
Acoustic features representing the audio information can be
extracted from the speech signal at the segmental level. The seg-
mental features are the features extracted from short (10–30 ms)
segments of the speech signal. These features represent the
short-time spectrum of the speech signal. The short-time spectrum
envelope of the speech signal is attributed primarily to the shape of
the vocal tract. The spectral information of the same sound uttered
by two persons may differ due to change in the shape of the indi-
vidual’s vocal tract system, and the manner of speech production.
The selected features include linear prediction coefficients (LPC),
linear prediction derived cepstrum coefficients (LPCC) and mel-frequency
cepstral coefficients (MFCC).
2.1. Linear prediction analysis
For acoustic feature extraction, the differenced speech signal is
divided into frames of 20 ms, with a shift of 5 ms. A pth order LP
analysis is used to capture the properties of the signal spectrum.
In the LP analysis of speech, each sample is predicted as a linear
weighted sum of the past p samples, where p represents the order
of prediction (Rabiner & Schafer, 2005; Palanivel, 2004). If $s(n)$ is
the present sample, then it is predicted by the past $p$ samples as

$$\hat{s}(n) = -\sum_{k=1}^{p} a_k\, s(n-k). \qquad (1)$$

The difference between the actual and the predicted sample value is
termed as the prediction error or residual, and is given by

$$e(n) = s(n) - \hat{s}(n) = s(n) + \sum_{k=1}^{p} a_k\, s(n-k) \qquad (2)$$

$$\phantom{e(n)} = \sum_{k=0}^{p} a_k\, s(n-k), \quad a_0 = 1. \qquad (3)$$

The LP coefficients $\{a_k\}$ are determined by minimizing the mean
squared error over an analysis frame, as described in Rabiner
and Juang (2003).
The recursive relation (4) between the predictor coefficients
and cepstral coefficients is used to convert the LP coefficients into
LP cepstral coefficients $\{c_k\}$:

$$c_0 = \ln \sigma^2,$$
$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le p,$$
$$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad p < m \le D, \qquad (4)$$

where $\sigma^2$ is the gain term in the LP analysis and $D$ is the number of
LP cepstral coefficients. The cepstral coefficients are linearly
weighted to get the weighted linear prediction cepstral coefficients
(WLPCC). In this work, a 19 dimensional WLPCC is obtained from
the 14th order LP analysis for each frame. Linear channel effects
are compensated to some extent by removing the mean of the trajectory
of each cepstral coefficient. The 19 dimensional WLPCC
(mean subtracted) for each frame is used as an acoustic feature
vector.
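To make this concrete, the sketch below illustrates how a pth order LP analysis and the recursion in (4) might be implemented for a single frame. The autocorrelation method, the numerical safeguards and the simple lifter used for the "linear weighting" are assumptions made for the example; the paper does not specify these details.

```python
import numpy as np

def lpc_coefficients(frame, p=14):
    """LP coefficients via the autocorrelation method (an assumed choice;
    the paper does not state which LP estimation method was used)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + p]   # autocorrelation lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])                        # predictor coefficients a_1..a_p
    gain = r[0] + np.dot(a, r[1:p + 1])                        # LP gain term (residual energy)
    return a, max(gain, 1e-12)

def lpcc_from_lpc(a, gain, n_ceps=19):
    """Convert LP coefficients to LP cepstral coefficients using recursion (4)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(gain)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if 1 <= m - k <= p:
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]                                               # 19-dimensional LPCC

# Hypothetical usage on one 20 ms frame at 8 kHz (160 samples).
frame = np.random.randn(160)
a, g = lpc_coefficients(frame, p=14)
lpcc = lpcc_from_lpc(a, g, n_ceps=19)
wlpcc = lpcc * np.arange(1, len(lpcc) + 1)   # one possible "linear weighting" (assumption)
# Cepstral mean subtraction over all frames of a clip would follow, as described in the text.
```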
2.2. Mel-frequency cepstral coefficients
The mel-frequency cepstrum has proven to be highly effective
in recognizing structure of music signals and in modeling the sub-
jective pitch and frequency content of audio signals. Psychophysi-
cal studies have found the phenomena of the mel pitch scale and
the critical band, and the frequency scale-warping to the mel scale
has led to the cepstrum domain representation. The mel scale is
defined as
$$F_{mel} = \frac{c \log\left(1 + \frac{f}{c}\right)}{\log 2}, \qquad (5)$$

where $F_{mel}$ is the logarithmic (mel) scale corresponding to the normal frequency scale $f$.
mel-cepstral features (Xu et al., 2005), can be illustrated by the
MFCCs, which are computed from the fast Fourier transform (FFT)
power coefficients. The power coefficients are filtered by a triangu-
lar bandpass filter bank. When c in (5) is in the range of 250–350,
the number of triangular filters that fall in the frequency range
200–1200 Hz (i.e., the frequency range of dominant audio informa-
tion) is higher than the other values of c. Therefore, it is efficient to
set the value of c in that range for calculating MFCCs. Denoting the
output of the filter bank by $S_k$ $(k = 1, 2, \ldots, K)$, the MFCCs are calculated as

$$c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} (\log S_k) \cos\left(n\,(k - 0.5)\,\frac{\pi}{K}\right), \quad n = 1, 2, \ldots, L. \qquad (6)$$
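As an illustration, the sketch below computes MFCCs from the FFT power coefficients of one frame using the warping in (5) and the cosine transform in (6). The filter-bank construction (triangular filters equally spaced on the warped axis), the value c = 300 and the frame length are assumptions made for the example.

```python
import numpy as np

def mel_warp(f, c=300.0):
    # Mel-scale warping as in (5); c = 300 is one value in the 250-350 range suggested in the text.
    return c * np.log(1.0 + f / c) / np.log(2.0)

def mel_filterbank(n_filters=26, n_fft=256, sr=8000, c=300.0):
    """Triangular band-pass filters on the warped frequency axis (construction details are assumptions)."""
    f_max = sr / 2.0
    mel_points = np.linspace(0.0, mel_warp(f_max, c), n_filters + 2)
    hz_points = c * (2.0 ** (mel_points / c) - 1.0)            # inverse of (5)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, mid, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, mid):
            fb[i - 1, k] = (k - left) / max(mid - left, 1)
        for k in range(mid, right):
            fb[i - 1, k] = (right - k) / max(right - mid, 1)
    return fb

def mfcc(frame, fb, n_ceps=13):
    """MFCCs from FFT power coefficients filtered by the mel filter bank, using (6)."""
    spec = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1))) ** 2
    S = np.maximum(fb @ spec, 1e-10)                           # filter-bank outputs S_k
    K = len(S)
    n = np.arange(1, n_ceps + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    return np.sqrt(2.0 / K) * (np.log(S)[None, :] * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)

# Hypothetical usage on a 256-sample frame at 8 kHz.
frame = np.random.randn(256)
coeffs = mfcc(frame, mel_filterbank())
```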
3. Modeling techniques for audio classification
3.1. Support vector machine (SVM) for classification
The SVM (Vapnik, 1998) is a useful statistical machine learning
technique that has been successfully applied in the pattern recognition
area (Jiang et al., 2005; Guo & Li, 2003). If the data are linearly
nonseparable but nonlinearly separable, the nonlinear support
vector classifier will be applied. The basic idea is to transform input
vectors into a high-dimensional feature space using a nonlinear
transformation $\phi$, and then to do a linear separation in feature
space as shown in Fig. 2.
Fig. 2. Principle of support vector machines.
To construct a nonlinear support vector classifier, the inner
product (x, y) is replaced by a kernel function K(x, y):
$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{l} a_i\, y_i\, K(x_i, x) + b\right). \qquad (7)$$
The SVM has two layers. During the learning process, the first layer
selects the basis functions $K(x_i, x)$, $i = 1, 2, \ldots, N$, from the given set of bases
defined by the kernel; the second layer constructs a linear function
in this space. This is completely equivalent to constructing the optimal
hyperplane in the corresponding feature space.
The SVM algorithm can construct a variety of learning machines
by using different kernel functions. Three kinds of kernel functions
are usually used. They are as follows:
1. Polynomial kernel of degree $d$:
$$K(X, Y) = (\langle X, Y\rangle + 1)^d. \qquad (8)$$
2. Radial basis function with Gaussian kernel of width $c > 0$:
$$K(X, Y) = \exp\left(\frac{-|X - Y|^2}{c}\right). \qquad (9)$$
3. Neural networks with tanh activation function:
$$K(X, Y) = \tanh(\kappa\langle X, Y\rangle + \mu), \qquad (10)$$
where the parameters $\kappa$ and $\mu$ are the gain and shift.
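As a concrete illustration, a nonlinear support vector classifier with the Gaussian kernel (9) can be set up with an off-the-shelf library. In the sketch below the feature scaling, the hyper-parameter values, the placeholder data and the majority vote over frames are assumptions made for the example, not settings reported in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train: (n_frames, n_features) acoustic features (e.g. MFCC); y_train: labels 0..5 for
# music, news, sports, advertisement, cartoon and movie. Both are hypothetical placeholders.
X_train = np.random.randn(600, 39)
y_train = np.random.randint(0, 6, size=600)

# RBF (Gaussian) kernel as in (9); C and gamma are assumed values,
# since the paper does not report its SVM hyper-parameters.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)

# Frame-level predictions for a new clip; a clip-level label can be taken
# by majority vote over its frames.
X_clip = np.random.randn(50, 39)
frame_labels = clf.predict(X_clip)
clip_label = np.bincount(frame_labels).argmax()
```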
3.2. Radial basis function neural network (RBFNN) model for
classification
The radial basis function neural network (Haykin, 2001) has a
feed forward architecture with an input layer, a hidden layer,
and an output layer as shown in Fig. 3. Radial basis functions are
embedded into a two-layer feed forward neural network. Such a
network is characterized by a set of inputs and a set of outputs.
In between the inputs and outputs there is a layer of processing
units called hidden units. Each of them implements a radial basis
function. The input layer of this network has $n_i$ units for an
$n_i$-dimensional input vector. The input units are fully connected to
the $n_h$ hidden layer units, which are in turn fully connected to
the $n_c$ output layer units, where $n_c$ is the number of output classes.
The activation functions of the hidden layer were chosen to be
Gaussians, and are characterized by their mean vectors (centers)
$\mu_i$ and covariance matrices $C_i$, $i = 1, 2, \ldots, n_h$. For simplicity, it is
assumed that the covariance matrices are of the form $C_i = \sigma_i^2 I$,
$i = 1, 2, \ldots, n_h$. Then the activation function of the $i$th hidden unit
for an input vector $x_j$ is given by

$$g_i(x_j) = \exp\left(\frac{-\|x_j - \mu_i\|^2}{2\sigma_i^2}\right). \qquad (11)$$
The $\mu_i$ and $\sigma_i^2$ are calculated by using a suitable clustering algorithm.
Here the k-means clustering algorithm is employed to determine
the centers. The algorithm is composed of the following steps:
1. Randomly initialize the samples to $k$ means (clusters) $\mu_1, \ldots, \mu_k$.
2. Classify the $n$ samples according to the nearest $\mu_k$.
3. Recompute $\mu_k$.
4. Repeat steps 2 and 3 until there is no change in $\mu_k$.
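A minimal NumPy sketch of these four steps is given below; in practice a library implementation of k-means could be used instead, and the random data are placeholders.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means following the four steps listed above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random initialization
    for _ in range(max_iter):
        # step 2: assign each sample to the nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: recompute each center as the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                   # step 4: stop when centers no longer change
            break
        centers = new_centers
    return centers, labels

# Hypothetical usage on random 2-D data.
X = np.random.randn(200, 2)
centers, labels = kmeans(X, k=5)
```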
The number of activation functions in the network and their
spread influence the smoothness of the mapping. The assumption
$\sigma_i^2 = \sigma^2$ is made, and $\sigma^2$ is given in (12) to ensure that the activation
functions are not too peaked or too flat:

$$\sigma^2 = \frac{g d^2}{2}. \qquad (12)$$

In the above equation, $d$ is the maximum distance between the chosen
centers, and $g$ is an empirical scale factor which serves to control
the smoothness of the mapping function. Therefore, the above
equation is written as

$$g_i(x_j) = \exp\left(\frac{-\|x_j - \mu_i\|^2}{g d^2}\right). \qquad (13)$$
The hidden layer units are fully connected to the $n_c$ output layer
units through weights $w_{ik}$. The output units are linear, and the response
of the $k$th output unit for an input $x_j$ is given by

$$y_k(x_j) = \sum_{i=0}^{n_h} w_{ik}\, g_i(x_j), \quad k = 1, 2, \ldots, n_c, \qquad (14)$$

where $g_0(x_j) = 1$. Given $n_t$ feature vectors from $n_c$ classes, training
the RBFNN involves estimating $\mu_i$, $i = 1, 2, \ldots, n_h$, $g$, $d^2$, and $w_{ik}$,
$i = 0, 1, 2, \ldots, n_h$, $k = 1, 2, \ldots, n_c$. The training procedure is given
below:
Determination of $\mu_i$ and $d^2$: Conventionally, the unsupervised k-means
clustering algorithm (Duda, Hart, & Stork, 2001) can be
applied to find $n_h$ clusters from the $n_t$ training vectors. However, the
training vectors of a class may not fall into a single cluster. In
order to obtain clusters only according to class, the k-means
clustering may be used in a supervised manner. Training feature
vectors belonging to the same class are clustered into $n_h/n_c$ clusters
using the k-means clustering algorithm. This is repeated for
each class, yielding $n_h$ clusters for the $n_c$ classes. These cluster means
are used as the centers $\mu_i$ of the Gaussian activation functions in
the RBFNN. The parameter $d$ is then computed by finding the
maximum distance between the $n_h$ cluster means.
Determining the weights $w_{ik}$ between the hidden and output layer:
Given that the Gaussian function centers and widths are computed
from the $n_t$ training vectors, (14) may be written in matrix
form as
$$Y = GW, \qquad (15)$$

where $Y$ is an $n_t \times n_c$ matrix with elements $Y_{ij} = y_j(x_i)$, $G$ is an
$n_t \times (n_h + 1)$ matrix with elements $G_{ij} = g_j(x_i)$, and $W$ is an
$(n_h + 1) \times n_c$ matrix of unknown weights. $W$ is obtained from the
standard least squares solution as given by

$$W = (G^{T} G)^{-1} G^{T} Y. \qquad (16)$$

To solve for $W$ from (16), $G$ is completely specified by the clustering
results, and the elements of $Y$ are specified as

$$Y_{ij} = \begin{cases} 1 & \text{if } x_i \in \text{class } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (17)$$

Fig. 3. Radial basis function neural network.
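The training procedure above can be summarized in code. The sketch below follows Section 3.2: supervised per-class k-means for the centers, the spread from (12) and (13), and the output weights from the least-squares solution (16) with the one-hot targets (17). The use of scikit-learn's KMeans, the pseudo-inverse in place of the explicit normal equations, the value of the scale factor g and the random data are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbfnn(X, y, n_classes, clusters_per_class=5, g=1.0):
    """RBFNN training: per-class k-means centers, spread from (12)/(13),
    and output weights from the least-squares solution (16) with targets (17)."""
    # Centers: cluster the training vectors of each class separately.
    centers = np.vstack([
        KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0)
        .fit(X[y == c]).cluster_centers_
        for c in range(n_classes)
    ])
    # d^2: maximum squared distance between the chosen centers.
    diffs = centers[:, None, :] - centers[None, :, :]
    d2 = (diffs ** 2).sum(axis=2).max()
    # Hidden-layer activations (13) plus a bias unit (g_0 = 1).
    G = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / (g * d2))
    G = np.hstack([np.ones((len(X), 1)), G])
    # One-hot targets (17) and least-squares weights (16) via the pseudo-inverse.
    Y = np.eye(n_classes)[y]
    W = np.linalg.pinv(G) @ Y
    return centers, d2, W

def rbfnn_outputs(X, centers, d2, W, g=1.0):
    """Forward pass (14): linear combination of the Gaussian hidden-unit outputs."""
    G = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / (g * d2))
    G = np.hstack([np.ones((len(X), 1)), G])
    return G @ W

# Hypothetical usage with random feature vectors for the six audio classes.
X = np.random.randn(600, 39)
y = np.random.randint(0, 6, size=600)
centers, d2, W = train_rbfnn(X, y, n_classes=6)
scores = rbfnn_outputs(X[:10], centers, d2, W)
```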
4. Experimental results
The evaluation of the proposed audio classification and segmen-
tation algorithms has been performed using a generic audio
database which consists of the following contents: 100 clips of
advertisement in different languages, 100 clips of songs (sung by
male and female) in different languages, 100 cartoon clips, 100
clips of movie from different languages, 100 clips of sports and
100 clips of news (both Tamil and English). Audio samples are of
different length, ranging from 1 s to about 10 s, with a sampling
rate of 8 kHz and 16 bits per sample. The signal duration was
slightly increased, on the rationale that the longer the audio signal
analyzed, the more accurately the extracted features reflect the
audio characteristics. The training data
should be sufficient to be statistically significant. The training data
is segmented into fixed-length and overlapping frames (in our
experiments we used 20 ms frames with 10 ms overlapping).
When neighboring frames are overlapped, the temporal character-
istics of the audio content can be taken into consideration in the
training process. Due to radiation effects of the sound from lips,
high-frequency components have relatively low amplitude, which
will influence the capture of the features at the high end of the
spectrum. One simple solution is to augment the energy of the
high-frequency spectrum. This procedure can be implemented via
a pre-emphasizing filter that is defined as

$$s'(n) = s(n) - 0.96 \times s(n-1), \quad n = 1, \ldots, N-1, \qquad (18)$$

where $s(n)$ is the $n$th sample of the frame $s$ and $s'(0) = s(0)$. Then the
pre-emphasized frame is Hamming-windowed by

$$h(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1. \qquad (19)$$
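A minimal sketch of this frame-level preprocessing is given below, assuming the window h(n) is applied by pointwise multiplication of the pre-emphasized frame.

```python
import numpy as np

def preemphasize_and_window(frame, alpha=0.96):
    """Pre-emphasis (18) followed by Hamming windowing (19) for one frame."""
    out = np.copy(frame).astype(float)
    out[1:] = frame[1:] - alpha * frame[:-1]       # s'(n) = s(n) - 0.96 s(n-1), with s'(0) = s(0)
    n = np.arange(len(frame))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (len(frame) - 1))
    return out * window

# Hypothetical usage: one 20 ms frame at 8 kHz.
frame = np.random.randn(160)
processed = preemphasize_and_window(frame)
```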
4.1. Preprocessing
The aim of preprocessing is to remove silence from a music se-
quence. Silence is defined as a segment of imperceptible audio,
including unnoticeable noise and very short clicks. We use short-
time energy to detect silence. The short-time energy function of
a music signal is defined as
$$E_n = \frac{1}{N} \sum_{m} \left[x(m)\, w(n-m)\right]^2, \qquad (20)$$

where $x(m)$ is the discrete time music signal, $n$ is the time index of
the short-time energy, and $w(m)$ is a rectangular window, i.e.,

$$w(n) = \begin{cases} 1 & \text{if } 0 \le n \le N-1, \\ 0 & \text{otherwise.} \end{cases} \qquad (21)$$
If the short-time energy function is continuously lower than a cer-
tain set of thresholds (there may be durations in which the energy is
higher than the threshold, but the durations should be short enough
and far apart from each other), the segment is indexed as silence.
Silence segments will be removed from the audio sequence. The
processed audio sequence will be segmented into fixed-length
and 10 ms overlapping frames.
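The silence-removal step might be sketched as follows; the energy threshold value and the simple per-frame rule (which ignores the tolerance for brief excursions above the threshold mentioned in the text) are assumptions for the example.

```python
import numpy as np

def remove_silence(signal, frame_len=160, hop=80, threshold=1e-3):
    """Drop frames whose short-time energy (20) falls below a threshold."""
    frames, energies = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]                 # rectangular window as in (21)
        energies.append(np.sum(frame ** 2) / frame_len)         # E_n from (20)
        frames.append(frame)
    kept = [f for f, e in zip(frames, energies) if e >= threshold]
    return np.array(kept)

# Hypothetical usage: a 1 s signal at 8 kHz with 20 ms frames and 10 ms hop.
signal = np.random.randn(8000)
voiced_frames = remove_silence(signal)
```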
4.2. Feature selection from non-silent frames
Feature selection is important for audio content analysis. The
selected features should reflect the significant characteristics of
different kinds of audio signals. The selected features include linear
prediction coefficients (LPC), linear prediction derived cepstrum
coefficients (LPCC) and mel-frequency cepstrum coefficients
(MFCC). LPC and LPCC are two linear prediction methods
and they are highly correlated to each other. LPC-based algorithms
(Abu-El-Quran, Goubran, & Chan, 2006), measure three values from
the audio segment to be classified. These values are the change of
the energy of the signal, speech duration, and the change of the
pitch value. The audio signals are recorded for 60 s at 8000 samples
per second and divided into frames of 20 ms, with a shift of 10 ms.
A 14th order LP analysis is used to capture the properties of the
signal spectrum as described in Section 2.1. The recursive relation
(4) between the predictor coefficients and cepstral coefficients is
used to convert the 14 LP coefficients into 19 cepstral coefficients.
The cepstral coefficients for each frame are linearly weighted to form the
WLPCC.
In order to evaluate the relative performance of the proposed
work, we compared it with the well-known MFCC features. MFCCs
are short-term spectral features as described in Section 2.2 and are
widely used in the area of audio and speech processing. To obtain
MFCCs (Umapathy et al., 2007), the audio signals were segmented
and windowed into short frames of 256 samples. Magnitude spec-
trum was computed for each of these frames using fast Fourier
transform (FFT) and converted into a set of mel scale filter bank
outputs. Logarithm was applied to the filter bank outputs followed
by discrete cosine transformation to obtain the MFCCs. For each
audio signal we arrived at 39 features: the 13-dimensional static
vector, plus 13 delta coefficients and 13 acceleration coefficients.
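A sketch of how the 39-dimensional vectors could be assembled is given below; the paper does not state the delta-regression formula used, so a simple numerical derivative is assumed.

```python
import numpy as np

def add_deltas(mfcc_frames):
    """Stack static, delta and acceleration coefficients to obtain 39-dimensional
    vectors (13 + 13 + 13). A first-difference approximation is assumed here."""
    delta = np.gradient(mfcc_frames, axis=0)        # first temporal derivative
    accel = np.gradient(delta, axis=0)              # second temporal derivative
    return np.hstack([mfcc_frames, delta, accel])

# Hypothetical usage: 100 frames of 13 static MFCCs.
static = np.random.randn(100, 13)
features = add_deltas(static)                        # shape (100, 39)
```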
4.3. Modeling using SVM
We use a nonlinear support vector classifier to discriminate the
various categories. Classification parameters are calculated using
support vector machine learning. The training process analyzes
audio training data to find an optimal way to classify audio frames
into their respective classes. The training data should be sufficient
to be statistically significant. The training data is segmented into
fixed-length and overlapping frames. When neighboring frames
are overlapped, the temporal characteristics of audio content can
be taken into consideration in the training process. Features such
as LPC, LPCC and MFCC are calculated for each frame. The support
vector machine learning algorithm is applied to produce the classi-
fication parameters according to calculated features. The derived
classification parameters are used to classify audio data. The audio
content can be discriminated into the various categories in terms
of the designed support vector classifier. The classification results
for the different features are shown in Fig. 4. From the results,
we observe that the overall classification accuracy is 92% using
MFCC features.
Fig. 4. Performance of SVM for audio classification.
4.4. Modeling using RBFNN
For RBFNN training, 14th order LPC, 19 dimensional LPCC and
26 dimensional MFCC features are extracted from the audio frames
as described in Section 4.2. These features are given as input to the
RBFNN model. The RBF centers are located using the k-means algorithm,
and the weights are determined using the least squares algorithm.
Values of k = 1, 2, and 5 have been used in our studies for each
category, and the system gives optimal performance for k = 5. For
training, the weight matrix is calculated using the least squares
algorithm discussed in Section 3.2 for each of the features.
For classification, the feature vectors are extracted and each feature
vector is given as input to the RBFNN model. The average output is
calculated for each output neuron. The class to which the audio sample
belongs is decided based on the highest output. Fig. 5 shows the
performance of RBFNN for audio classification. Fig. 6 shows the
performance of RBFNN for different means. The performance of the
system for MFCC using SVM and RBFNN for audio classification is
given in Table 1.

Fig. 5. Performance of RBFNN for audio classification.
Fig. 6. Performance of RBFNN for different means.

Table 1
Audio classification performance using SVM and RBFNN

SVM (%)    RBFNN (%)
92.1       93.7
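The frame-averaging decision rule described in this subsection might be sketched as follows. The class ordering and the placeholder scores are hypothetical; the frame-level scores would come from the RBFNN forward pass (for example, the rbfnn_outputs sketch given after Section 3.2).

```python
import numpy as np

def decide_class(frame_scores, class_names):
    """Clip-level decision: average the output of each output neuron over all
    frames of the clip and pick the class with the highest average."""
    avg = frame_scores.mean(axis=0)              # one average per output neuron
    return class_names[int(avg.argmax())]

# Hypothetical usage with placeholder network outputs for 80 frames.
classes = ["music", "news", "sports", "advertisement", "cartoon", "movie"]
frame_scores = np.random.rand(80, 6)
print(decide_class(frame_scores, classes))
```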
5. Conclusion
In this paper, we have proposed an automatic audio classifica-
tion system using SVM and RBFNN. Linear prediction cepstrum
coefficients (LPC, LPCC) and mel-frequency cepstral coefficients
are calculated as features to characterize audio content. A nonlin-
ear support vector machine learning algorithm is applied to ob-
tain the optimal class boundary between the various classes
namely music, news, sports, advertisement, cartoon and movie,
by learning from training data. Experimental results show that
the proposed audio classification scheme is very effective and
the accuracy rate is 92%. The performance was compared to a radial
basis function neural network, which showed an accuracy of
93%. The classification rates using LPC and LPCC were slightly lower
than those of the 39 dimensional MFCC feature vectors. This work indicates
that support vector machines and radial basis function
neural networks can be effectively used for audio classification.
Even though some progress has been achieved by now, there are
still remaining challenges and directions for further research,
such as extracting different features, developing better classification
algorithms, and integrating classifiers to reduce the
classification errors.
References
Abu-El-Quran, A. R., Goubran, R. A., & Chan, A. D. C. (2006). Security monitoring
using microphone arrays and audio classification. IEEE Transactions on
Instrumentation and Measurement, 55(4), 1025–1032.
Ajmera, J., McCowan, I., & Bourlard, H. (2003). Speech/music segmentation using
entropy and dynamism features in a HMM classification framework. Speech
Communication, 40(3), 351–363.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: John
Wiley-Interscience.
Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al.
(2006). Audio-based context recognition. IEEE Transactions on Audio, Speech and
Language Processing, 14(1), 321–329.
Esmaili, S., Krishnan, S., & Raahemifar, K. (2004). Content based audio classification
and retrieval using joint time–frequency analysis. In IEEE international
conference on acoustics, speech and signal processing (pp. 665–668).
Guo, G., & Li, S. Z. (2003). Content-based audio classification and retrieval by
support vector machines. IEEE Transactions on Neural Networks, 14(1), 308–315.
Haykin, S. (2001). Neural networks a comprehensive foundation. Asia: Pearson
Education.
Huang, R., & Hansen, J. H. L. (2006). Advances in unsupervised audio classification
and segmentation for the broadcast news and NGSW corpora. IEEE Transactions
on Audio, Speech and Language Processing, 14(3), 907–919.
Jiang, H., Bai, J., Zhang, S., & Xu, B. (2005). SVM-based audio scene classification.
Proceeding of the IEEE, 131–136.
Kiranyaz, S., Qureshi, A. F., & Gabbouj, M. (2006). A generic audio classification and
segmentation approach for multimedia indexing and retrieval. IEEE Transactions
on Audio, Speech and Language Processing, 14(3), 1062–1081.
Li, S. Z. (2000). Content-based audio classification and retrieval using the nearest
feature line method. IEEE Transactions on Speech and Audio Processing, 8(5),
619–625.
Lin, C.-C., Chen, S.-H., Truong, T.-K., & Chang, Y. (2005). Audio classification and
categorization based on wavelets and support vector machine. IEEE Transactions
on Speech and Audio Processing, 13(5), 644–651.
Li, D., Sethi, I. K., Dimitrova, N., & McGee, T. (2001). Classification of general audio
data for content-based retrieval. Pattern Recognition Letters, 22, 533–544.
Lu, L., Zhang, H.-J., & Li, S. Z. (2003). Content-based audio classification and
segmentation by using support vector machines. Multimedia Systems, 8,
482–492.
McConaghy, T., Leung, H., Boss, E., & Varadan, V. (2003). Classification of audio radar
signals using radial basis function neural networks. IEEE Transactions on
Instrumentation and Measurement, 52(6), 1771–1779.
Mubarak, O. M., Ambikairajah, E., & Epps, J. (2005). Analysis of an MFCC-based audio
indexing system for efficient coding of multimedia sources. In IEEE international
conference on acoustics, speech and signal processing (pp. 619–622).
Palanivel, S. (2004). Person authentication using speech, face and visual speech.
Ph.D thesis, Madras: IIT.
Panagiotakis, C., & Tziritas, G. (2005). A speech/music discriminator based on rms
and zero-crossings. IEEE Transactions on Multimedia, 7(1), 155–156.
Rabiner, L., & Juang, B. (2003). Fundamentals of speech recognition. Singapore:
Pearson Education.
Rabiner, L., & Schafer, R. (2005). Digital processing of speech signals. Pearson
Education.
Rajapakse, M., & Wyse, L. (2005). Generic audio classification using a hybrid model
based on GMMS and HMMS. Proceedings of the IEEE, 1550–1555.
Umapathy, K., Krishnan, S., & Jimaa, S. (2005). Multigroup classification of audio
signals using time–frequency parameters. IEEE Transactions on Multimedia, 7(2),
308–315.
Umapathy, K., Krishnan, S., & Rao, R. K. (2007). Audio signal feature extraction and
classification using local discriminant bases. IEEE Transactions on Audio, Speech
and Language Processing, 15(4), 1236–1246.
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley and Sons.
Xu, C., Maddage, N. C., & Shao, X. (2005). Automatic music classification and
summarization. IEEE Transactions on Speech and Audio Processing, 13(3),
441–450.
