Classification of audio signals using SVM and RBFNN

Expert Systems with Applications (2008), doi:10.1016/j.eswa.2008.06.126

P. Dhanalakshmi*, S. Palanivel, V. Ramalingam

Department of Computer Science and Engineering, Annamalai University, Annamalainagar, Chidambaram 608 002, Tamil Nadu, India

* Corresponding author. Tel.: +91 04144 220433. E-mail addresses: abi_dhana@rediffmail.com (P. Dhanalakshmi), spal_yughu@yahoo.com (S. Palanivel), aucsevr@yahoo.com (V. Ramalingam).

Keywords: Support vector machines; Radial basis function neural network; Linear predictive coefficients; Linear predictive cepstral coefficients; Mel-frequency cepstral coefficients

Abstract

In the age of digital information, audio data has become an important part of many modern computer applications. Audio classification has become a focus of research in audio processing and pattern recognition. Automatic audio classification is very useful for audio indexing, content-based audio retrieval and on-line audio distribution, but extracting the most common and salient themes from unstructured raw audio data is a challenge. In this paper, we propose effective algorithms to automatically classify audio clips into one of six classes: music, news, sports, advertisement, cartoon and movie. For these categories a number of acoustic features, including linear predictive coefficients, linear predictive cepstral coefficients and mel-frequency cepstral coefficients, are extracted to characterize the audio content. Support vector machines are applied to classify audio into the respective classes by learning from training data. The proposed method then extends the application of a radial basis function neural network (RBFNN) to the classification of audio. The RBFNN applies a nonlinear transformation into a higher-dimensional hidden space, followed by a linear transformation at the output. Experiments on different genres within these categories show that the classification results are significant and effective.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Audio data is an integral part of many modern computer and multimedia applications. A typical multimedia database often contains millions of audio clips, including environmental sounds, machine noise, music, animal sounds, speech sounds, and other non-speech utterances. The effectiveness of their deployment depends greatly on the ability to classify and retrieve the audio files in terms of their sound properties or content. The need to automatically recognize the class to which an audio sound belongs makes audio classification and categorization an emerging and important research area. However, a raw audio signal is a featureless collection of bytes with only the most rudimentary fields attached, such as name, file format and sampling rate.

Content-based classification and retrieval of audio sound is essentially a pattern recognition problem with two basic issues: feature selection, and classification based on the selected features. In the first step, an audio sound is reduced to a small set of parameters using various feature extraction techniques. The terms linear predictive coefficients (LPC), linear predictive cepstral coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC) refer to the features extracted from audio data. In the second step, classification or categorization algorithms, ranging from simple Euclidean distance methods to sophisticated statistical techniques, are applied to these coefficients. The efficacy of audio classification or categorization depends on the ability to capture proper audio features and to accurately classify each feature set into its own class.

1.1. Related work

In recent years, there have been many studies on automatic audio classification and segmentation using several features and techniques. The most common problem in audio classification is speech/music classification, in which the highest accuracy has been achieved, especially when the segmentation information is known beforehand. In Lin, Chen, Truong, and Chang (2005), wavelets are first applied to extract acoustic features such as subband power and pitch information. The method then uses a bottom-up SVM over these acoustic features and additional parameters, such as frequency cepstral coefficients, to accomplish audio classification and categorization. An audio feature extraction and multigroup classification scheme that focuses on identifying discriminatory time–frequency subspaces using the local discriminant bases (LDB) technique has been described in Umapathy, Krishnan, and Rao (2007). For pure music and vocal music, a number of features such as LPC and LPCC are extracted in Xu, Maddage, and Shao (2005) to characterize the music content. Based on the calculated features, a clustering algorithm is applied to structure the music content.

A new approach towards high-performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news is described in Ajmera, McCowan, and Bourlard (2003), in which an artificial neural network (ANN) and a hidden Markov model (HMM) are used. In Kiranyaz, Qureshi, and Gabbouj (2006), a generic audio classification and segmentation approach for multimedia indexing and retrieval is described. A method is proposed in Panagiotakis and Tziritas (2005) for speech/music discrimination based on root mean square and zero-crossings. The method proposed in Eronen et al. (2006) investigates the feasibility of an audio-based context recognition system in which simplistic low-dimensional feature vectors are evaluated against more standard spectral features. Using discriminative training, competitive recognition accuracies are achieved with very low-order hidden Markov models (Ajmera et al., 2003).

The classification of continuous general audio data for content-based retrieval was addressed in Li, Sethi, Dimitrova, and McGee (2001), where the audio segments were classified based on MFCC and LPC. They also showed that cepstral-based features gave better classification accuracy. The method described in content-based audio classification and retrieval using joint time–frequency analysis exploits the non-stationary behavior of music signals and extracts features that characterize their spectral change over time (Esmaili, Krishnan, & Raahemifar, 2004). The audio signals were decomposed in Umapathy, Krishnan, and Jimaa (2005) using an adaptive time–frequency decomposition algorithm, and the octave-based (scaling) signal decomposition parameter was used to generate a set of 42 features over three frequency bands within the auditory range. These features were analyzed using linear discriminant functions and classified into six music groups. An approach given in Jiang, Bai, Zhang, and Xu (2005) uses a support vector machine (SVM) for audio scene classification, which classifies audio clips into one of five classes: pure speech, non-pure speech, music, environment sound, and silence. Radial basis function neural networks (RBFNN) are used in McConaghy, Leung, Boss, and Varadan (2003) to classify real-life audio radar signals collected by a ground surveillance radar mounted on a tank.

For audio retrieval, a new metric called distance-from-boundary (DFB) has been proposed in Guo and Li (2003). When a query audio is given, the system first finds a boundary inside which the query pattern is located. Then, all the audio patterns in the database are sorted by their distances to this boundary. All boundaries are learned by the SVMs and stored together with the audio database. In Mubarak, Ambikairajah, and Epps (2005), a speech/music discrimination system was proposed based on mel-frequency cepstral coefficients (MFCC) and a GMM classifier. This system can be used to select the optimum coding scheme for the current frame of an input signal without knowing a priori whether it contains speech-like or music-like characteristics. A hybrid model comprised of Gaussian mixture models (GMMs) and hidden Markov models (HMMs) is used in Rajapakse and Wyse (2005) to model generic sounds with large intra-class perceptual variations. The number of mixture components in the GMM was derived using the minimum description length (MDL) criterion.

A new pattern classification method called the nearest feature line (NFL) is proposed in Li (2000), where the NFL explores the information provided by multiple prototypes per class. Audio features such as MFCC, ZCR, brightness, bandwidth and spectrum flux were extracted in Lu, Zhang, and Li (2003), and the performance using SVM, K-nearest neighbor (KNN), and Gaussian mixture model (GMM) classifiers was compared. Audio classification and segmentation techniques for speech recognition and unsupervised multispeaker change detection are proposed in Huang and Hansen (2006). Two new extended-time features, the variance of the spectrum flux (VSF) and the variance of the zero-crossing rate (VZCR), are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. The classification is then implemented using weighted GMM networks.

1.2. Outline of the work

In this paper, automatic audio feature extraction and classification approaches are presented. In order to discriminate the six categories of audio, namely music, news, sports, advertisement, cartoon and movie, a number of features such as LPC, LPCC and MFCC are extracted to characterize the audio content. A support vector machine (SVM) is applied to obtain the optimal class boundary between the classes by learning from training data. The performance of the SVM is compared to that of an RBF network. Each RBF center approximates a cluster of training data vectors that are close to each other in Euclidean space. When a vector is input to the RBFNN, the centers near that vector become strongly activated, in turn activating certain output nodes. Experimental results show that the RBFNN with mel-cepstral features provides better classification accuracy. Fig. 1 illustrates the block diagram of audio classification.

The paper is organized as follows. Acoustic feature extraction is presented in Section 2, and modeling techniques for audio classification are described in Section 3. Experimental results using SVM and RBFNN are reported in Section 4. Finally, conclusions and future work are given in Section 5.

Fig. 1. Block diagram of audio classification.


2. Acoustic feature extraction

Acoustic features representing the audio information can be extracted from the speech signal at the segmental level. The segmental features are those extracted from short (10–30 ms) segments of the speech signal; they represent the short-time spectrum of the signal. The short-time spectrum envelope of the speech signal is attributed primarily to the shape of the vocal tract. The spectral information of the same sound uttered by two persons may differ due to differences in the shape of each individual's vocal tract system and in the manner of speech production. The selected features include linear prediction coefficients (LPC), linear prediction derived cepstrum coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC).

2.1. Linear prediction analysis

For acoustic feature extraction, the differenced speech signal is divided into frames of 20 ms, with a shift of 5 ms. A pth order LP analysis is used to capture the properties of the signal spectrum. In the LP analysis of speech, each sample is predicted as a linear weighted sum of the past p samples, where p represents the order of prediction (Rabiner & Schafer, 2005; Palanivel, 2004). If s(n) is the present sample, it is predicted from the past p samples as

\hat{s}(n) = -\sum_{k=1}^{p} a_k \, s(n-k). \quad (1)

The difference between the actual and the predicted sample value is termed the prediction error or residual, and is given by

e(n) = s(n) - \hat{s}(n) = s(n) + \sum_{k=1}^{p} a_k \, s(n-k) \quad (2)

= \sum_{k=0}^{p} a_k \, s(n-k), \quad a_0 = 1. \quad (3)

The LP coefficients \{a_k\} are determined by minimizing the mean squared error over an analysis frame, as described in Rabiner and Juang (2003).

The recursive relation (4) between the predictor coefficients and the cepstral coefficients is used to convert the LP coefficients into LP cepstral coefficients \{c_k\}:

c_0 = \ln \sigma^2,

c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} c_k a_{m-k}, \quad 1 \le m \le p,

c_m = \sum_{k=1}^{m-1} \frac{k}{m} c_k a_{m-k}, \quad p < m \le D, \quad (4)

where \sigma^2 is the gain term in the LP analysis and D is the number of LP cepstral coefficients. The cepstral coefficients are linearly weighted to obtain the weighted linear prediction cepstral coefficients (WLPCC). In this work, a 19-dimensional WLPCC is obtained from the 14th order LP analysis for each frame. Linear channel effects are compensated to some extent by removing the mean of the trajectory of each cepstral coefficient. The 19-dimensional WLPCC (mean subtracted) for each frame is used as an acoustic feature vector.
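The LP analysis and the LP-to-cepstrum recursion above can be condensed into a short sketch. This is a minimal illustration, assuming autocorrelation-method LPC with p = 14 and D = 19 as in the text; the function names are ours, and the linear cepstral weighting and mean subtraction that yield the final WLPCC vector are left as simple post-processing steps.

```python
import numpy as np

def lpc_coefficients(frame, p=14):
    """LP coefficients a_1..a_p via the autocorrelation normal equations
    (Levinson-Durbin is the usual fast route; a direct solve suffices here)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])      # predictor coefficients of Eq. (1)
    gain = r[0] + np.dot(a, r[1:p + 1])      # residual energy, the sigma^2 of Eq. (4)
    return a, gain

def lpcc_from_lpc(a, gain, D=19):
    """LP cepstral coefficients c_1..c_D via the recursion of Eq. (4)."""
    p = len(a)
    c = np.zeros(D + 1)
    c[0] = np.log(gain)
    for m in range(1, D + 1):
        acc = sum((k / m) * c[k] * a[m - k - 1]
                  for k in range(max(1, m - p), m))   # only terms with m-k <= p
        c[m] = (a[m - 1] if m <= p else 0.0) + acc
    return c[1:]
```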

2.2. Mel-frequency cepstral coefﬁcients

The mel-frequency cepstrum has proven highly effective in recognizing the structure of music signals and in modeling the subjective pitch and frequency content of audio signals. Psychophysical studies have established the mel pitch scale and the critical band, and warping the frequency scale to the mel scale leads to the cepstrum-domain representation. The mel scale is defined as

F_{mel} = \frac{c \log\left(1 + f/c\right)}{\log 2}, \quad (5)

where F_{mel} is the logarithmic counterpart of the normal frequency f. The mel-cepstral features (Xu et al., 2005) can be illustrated by the MFCCs, which are computed from the fast Fourier transform (FFT) power coefficients. The power coefficients are filtered by a triangular bandpass filter bank. When c in (5) is in the range 250–350, the number of triangular filters falling in the frequency range 200–1200 Hz (i.e., the frequency range of dominant audio information) is higher than for other values of c, so it is efficient to set c in that range when calculating MFCCs. Denoting the output of the filter bank by S_k (k = 1, 2, \ldots, K), the MFCCs are calculated as

c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} (\log S_k) \cos\left(\frac{n (k - 0.5) \pi}{K}\right), \quad n = 1, 2, \ldots, L. \quad (6)
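A compact sketch of this pipeline follows: FFT power spectrum, triangular mel filter bank, log, and the cosine transform of Eq. (6). The filter-bank construction here is an assumption (center frequencies equally spaced on the mel axis of Eq. (5), with c = 300 picked from the suggested 250–350 range); the function names are ours.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr, c=300.0):
    """Triangular filters spaced uniformly on the mel axis of Eq. (5)."""
    fmel = lambda f: c * np.log2(1.0 + f / c)
    fhz = lambda m: c * (2.0 ** (m / c) - 1.0)            # inverse of fmel
    mel_pts = np.linspace(0.0, fmel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * fhz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, m, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:m] = (np.arange(l, m) - l) / max(m - l, 1)   # rising edge
        fb[i - 1, m:r] = (r - np.arange(m, r)) / max(r - m, 1)   # falling edge
    return fb

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs per Eq. (6): cosine transform of log filter-bank outputs."""
    spec = np.abs(np.fft.rfft(frame)) ** 2                # FFT power coefficients
    S = np.maximum(mel_filterbank(n_filters, len(frame), sr) @ spec, 1e-10)
    K = n_filters
    n = np.arange(1, n_coeffs + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    return np.sqrt(2.0 / K) * (np.cos(n * (k - 0.5) * np.pi / K) @ np.log(S))
```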

3. Modeling techniques for audio classiﬁcation

3.1. Support vector machine (SVM) for classiﬁcation

The SVM (Vapnik, 1998) is a statistical machine learning technique that has been successfully applied in the pattern recognition area (Jiang et al., 2005; Guo & Li, 2003). If the data are linearly nonseparable but nonlinearly separable, a nonlinear support vector classifier is applied. The basic idea is to transform the input vectors into a high-dimensional feature space using a nonlinear transformation \phi, and then to perform a linear separation in that feature space, as shown in Fig. 2.

Fig. 2. Principle of support vector machines.

To construct a nonlinear support vector classifier, the inner product (x, y) is replaced by a kernel function K(x, y):

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right). \quad (7)

The SVM has two layers. During the learning process, the first layer selects the bases K(x_i, x), i = 1, 2, \ldots, N, from the set of bases defined by the kernel; the second layer constructs a linear function in this space. This is completely equivalent to constructing the optimal hyperplane in the corresponding feature space.

The SVM algorithm can construct a variety of learning machines through the use of different kernel functions. Three kinds of kernel functions are commonly used:

1. Polynomial kernel of degree d:

K(X, Y) = (\langle X, Y \rangle + 1)^d. \quad (8)

2. Radial basis function with Gaussian kernel of width c > 0:

K(X, Y) = \exp\left( \frac{-|X - Y|^2}{c} \right). \quad (9)

3. Neural network with tanh activation function:

K(X, Y) = \tanh(\kappa \langle X, Y \rangle + \mu), \quad (10)

where the parameters \kappa and \mu are the gain and the shift.
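As a concrete illustration, the sketch below trains a multi-class nonlinear SVM with the Gaussian kernel of Eq. (9) on per-frame feature vectors. It is a minimal sketch using scikit-learn rather than the authors' implementation; the random data, the 39-dimensional feature size and the hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-ins: one acoustic feature vector per frame (X) and a
# class label per frame (y) for the six classes music, news, sports,
# advertisement, cartoon and movie.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 39))
y = rng.integers(0, 6, size=600)

# RBF (Gaussian) kernel as in Eq. (9); sklearn's gamma corresponds to 1/c.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=10.0))
clf.fit(X, y)
print(clf.predict(X[:5]))
```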

3.2. Radial basis function neural network (RBFNN) model for classification

The radial basis function neural network (Haykin, 2001) has a feedforward architecture with an input layer, a hidden layer, and an output layer, as shown in Fig. 3. Radial basis functions are embedded into a two-layer feedforward neural network. Such a network is characterized by a set of inputs and a set of outputs. Between the inputs and outputs there is a layer of processing units called hidden units, each of which implements a radial basis function. The input layer of this network has n_i units for an n_i-dimensional input vector. The input units are fully connected to the n_h hidden layer units, which are in turn fully connected to the n_c output layer units, where n_c is the number of output classes.

Fig. 3. Radial basis function neural network.

The activation functions of the hidden layer are chosen to be Gaussians, characterized by their mean vectors (centers) \mu_i and covariance matrices C_i, i = 1, 2, \ldots, n_h. For simplicity, it is assumed that the covariance matrices are of the form C_i = \sigma_i^2 I, i = 1, 2, \ldots, n_h. The activation function of the ith hidden unit for an input vector x_j is then given by

g_i(x_j) = \exp\left( \frac{-\| x_j - \mu_i \|^2}{2 \sigma_i^2} \right). \quad (11)

The \mu_i and \sigma_i^2 are calculated using a suitable clustering algorithm. Here the k-means clustering algorithm is employed to determine the centers; it is composed of the following steps (a minimal sketch follows the list):

1. Randomly initialize the samples to k means (clusters) \mu_1, \ldots, \mu_k.
2. Classify the n samples according to the nearest \mu_k.
3. Recompute the \mu_k.
4. Repeat steps 2 and 3 until there is no change in \mu_k.
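Here is the k-means loop written out, a minimal numpy sketch of steps 1–4 above; the iteration cap and the convergence test are standard choices, not from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]    # step 1: initialize
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)                        # step 2: assign
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                           else mu[j] for j in range(k)])
        if np.allclose(new_mu, mu):                      # step 4: stop when stable
            break
        mu = new_mu                                      # step 3: recompute
    return mu, labels
```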

The number of activation functions in the network and their spread influence the smoothness of the mapping. The assumption \sigma_i^2 = \sigma^2 is made, and \sigma^2 is chosen as in (12) to ensure that the activation functions are neither too peaked nor too flat:

\sigma^2 = \frac{g d^2}{2}. \quad (12)

Here d is the maximum distance between the chosen centers, and g is an empirical scale factor which serves to control the smoothness of the mapping function. Therefore, the activation function can be written as

g_i(x_j) = \exp\left( \frac{-\| x_j - \mu_i \|^2}{g d^2} \right). \quad (13)

The hidden layer units are fully connected to the n_c output layer units through weights w_{ik}. The output units are linear, and the response of the kth output unit for an input x_j is given by

y_k(x_j) = \sum_{i=0}^{n_h} w_{ik} \, g_i(x_j), \quad k = 1, 2, \ldots, n_c, \quad (14)

where g_0(x_j) = 1. Given n_t feature vectors from n_c classes, training the RBFNN involves estimating \mu_i, i = 1, 2, \ldots, n_h, g, d^2, and w_{ik}, i = 0, 1, 2, \ldots, n_h, k = 1, 2, \ldots, n_c. The training procedure is given below.

Determination of \mu_i and d^2: Conventionally, the unsupervised k-means clustering algorithm (Duda, Hart, & Stork, 2001) can be applied to find n_h clusters from the n_t training vectors. However, the training vectors of a class may not fall into a single cluster. In order to obtain clusters only according to class, k-means clustering may be used in a supervised manner: training feature vectors belonging to the same class are clustered into n_h / n_c clusters using the k-means algorithm. This is repeated for each class, yielding n_h clusters for the n_c classes. These cluster means are used as the centers \mu_i of the Gaussian activation functions in the RBFNN. The parameter d is then computed by finding the maximum distance between the n_h cluster means.

Determining the weights w_{ik} between the hidden and output layer: Given that the Gaussian function centers and widths are computed from the n_t training vectors, (14) may be written in matrix form as

Y = G W, \quad (15)

where Y is an n_t \times n_c matrix with elements Y_{ij} = y_j(x_i), G is an n_t \times (n_h + 1) matrix with elements G_{ij} = g_j(x_i), and W is an (n_h + 1) \times n_c matrix of unknown weights. W is obtained from the standard least squares solution as

W = (G^T G)^{-1} G^T Y. \quad (16)

To solve for W in (16), G is completely specified by the clustering results, and the elements of Y are specified as

Y_{ij} = \begin{cases} 1 & \text{if } x_i \in \text{class } j, \\ 0 & \text{otherwise}. \end{cases} \quad (17)
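The whole training procedure can be sketched compactly: per-class k-means for the centers, d^2 from the maximum squared distance between centers, the design matrix of Eqs. (13)–(14) with a bias unit, and the least squares solution of Eq. (16) against the one-hot targets of Eq. (17). This is a minimal sketch reusing the kmeans function defined earlier; the helper names and the default g = 1.0 are ours.

```python
import numpy as np
from itertools import combinations

def design_matrix(X, centers, d2, g=1.0):
    """Bias unit g_0 = 1 plus the Gaussian activations of Eq. (13)."""
    act = np.exp(-np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
                 / (g * d2))
    return np.hstack([np.ones((len(X), 1)), act])

def train_rbfnn(X, y, n_c, clusters_per_class, g=1.0):
    # Supervised center selection: k-means within each class separately.
    centers = np.vstack([kmeans(X[y == j], clusters_per_class)[0]
                         for j in range(n_c)])
    d2 = max(np.sum((a - b) ** 2) for a, b in combinations(centers, 2))
    G = design_matrix(X, centers, d2, g)
    Y = np.eye(n_c)[y]                          # one-hot targets, Eq. (17)
    W, *_ = np.linalg.lstsq(G, Y, rcond=None)   # least squares, Eq. (16)
    return centers, d2, W

def rbfnn_outputs(X, centers, d2, W, g=1.0):
    """Linear output units of Eq. (14); argmax gives the predicted class."""
    return design_matrix(X, centers, d2, g) @ W
```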

4. Experimental results

The evaluation of the proposed audio classification and segmentation algorithms has been performed using a generic audio database with the following contents: 100 clips of advertisement in different languages, 100 clips of songs (sung by male and female singers) in different languages, 100 cartoon clips, 100 movie clips in different languages, 100 clips of sports and 100 clips of news (both Tamil and English). Audio samples are of different lengths, ranging from 1 s to about 10 s, with a sampling rate of 8 kHz and 16 bits per sample. The signal duration was slightly increased on the rationale that the longer the analyzed audio signal, the better the extracted features reflect the true audio characteristics. The training data should be sufficient to be statistically significant. The training data is segmented into fixed-length, overlapping frames (in our experiments we used 20 ms frames with 10 ms overlap).

When neighboring frames overlap, the temporal characteristics of the audio content can be taken into account in the training process. Due to radiation effects of the sound from the lips, high-frequency components have relatively low amplitude, which hampers the capture of features at the high end of the spectrum. One simple solution is to boost the energy of the high-frequency spectrum. This can be implemented via a pre-emphasis filter defined as

s'(n) = s(n) - 0.96 \, s(n-1), \quad n = 1, \ldots, N-1, \quad (18)

where s(n) is the nth sample of the frame s and s'(0) = s(0). The pre-emphasized frame is then Hamming-windowed by

h(n) = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N - 1} \right), \quad 0 \le n \le N-1. \quad (19)
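Eqs. (18) and (19) translate directly into code. A minimal sketch; the 160-sample frame corresponds to 20 ms at the 8 kHz sampling rate used here.

```python
import numpy as np

def preemphasize(frame, alpha=0.96):
    """Eq. (18): s'(n) = s(n) - 0.96*s(n-1), with s'(0) = s(0)."""
    out = frame.astype(float).copy()
    out[1:] = frame[1:] - alpha * frame[:-1]
    return out

def hamming(N):
    """Eq. (19): h(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.ones(160)                      # a 20 ms frame at 8 kHz
windowed = preemphasize(frame) * hamming(160)
```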

4.1. Preprocessing

The aim of preprocessing is to remove silence from an audio sequence. Silence is defined as a segment of imperceptible audio, including unnoticeable noise and very short clicks. We use short-time energy to detect silence. The short-time energy function of a signal is defined as

E_n = \frac{1}{N} \sum_{m} [x(m) \, w(n - m)]^2, \quad (20)

where x(m) is the discrete-time signal, n is the time index of the short-time energy, and w(m) is a rectangular window, i.e.,

w(n) = \begin{cases} 1 & \text{if } 0 \le n \le N-1, \\ 0 & \text{otherwise}. \end{cases} \quad (21)

If the short-time energy function remains below a certain set of thresholds (there may be durations in which the energy rises above the threshold, but these durations must be short enough and far apart from each other), the segment is indexed as silence. Silence segments are removed from the audio sequence. The processed audio sequence is then segmented into fixed-length frames with 10 ms overlap.
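A frame-level version of this energy gate is sketched below. The threshold value and the simple per-frame formulation are assumptions; the paper specifies neither the threshold nor the exact gating logic.

```python
import numpy as np

def short_time_energy(x, N=160):
    """Eq. (20) evaluated once per non-overlapping length-N frame."""
    n_frames = len(x) // N
    frames = x[:n_frames * N].reshape(n_frames, N)
    return (frames.astype(float) ** 2).mean(axis=1)

def remove_silence(x, N=160, threshold=1e-4):
    """Keep only frames whose short-time energy reaches the threshold."""
    energy = short_time_energy(x, N)
    frames = x[:len(energy) * N].reshape(len(energy), N)
    return frames[energy >= threshold].reshape(-1)
```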

4.2. Feature selection from non-silent frames

Feature selection is important for audio content analysis. The selected features should reflect the significant characteristics of different kinds of audio signals. The selected features include linear prediction coefficients (LPC), linear prediction derived cepstrum coefficients (LPCC) and mel-frequency cepstrum coefficients (MFCC). LPC and LPCC are two linear prediction methods, and they are highly correlated with each other. LPC-based algorithms (Abu-El-Quran, Goubran, & Chan, 2006) measure three values from the audio segment to be classified: the change of the energy of the signal, the speech duration, and the change of the pitch value. The audio signals are recorded for 60 s at 8000 samples per second and divided into frames of 20 ms, with a shift of 10 ms. A 14th order LP analysis is used to capture the properties of the signal spectrum, as described in Section 2.1. The recursive relation (4) between the predictor coefficients and the cepstral coefficients is used to convert the 14 LP coefficients into 19 cepstral coefficients. The LP cepstral coefficients for each frame are linearly weighted to form the WLPCC.

In order to evaluate the relative performance of the proposed work, we compared it with the well-known MFCC features. MFCCs are short-term spectral features, as described in Section 2.2, and are widely used in the area of audio and speech processing. To obtain the MFCCs (Umapathy et al., 2007), the audio signals were segmented and windowed into short frames of 256 samples. The magnitude spectrum was computed for each of these frames using the fast Fourier transform (FFT) and converted into a set of mel-scale filter bank outputs. A logarithm was applied to the filter bank outputs, followed by a discrete cosine transformation, to obtain the MFCCs. For each audio signal we arrived at 39 features: the 13-dimensional static vector, plus the delta coefficients (+13), plus the acceleration coefficients (+13).
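The delta and acceleration coefficients can be appended to the 13 static MFCCs as follows. This is a minimal sketch using the standard regression formula for deltas; the regression width of 2 and the edge handling (wrap-around via np.roll, kept for brevity) are our assumptions.

```python
import numpy as np

def deltas(C, width=2):
    """Regression-based delta coefficients over a (frames x 13) MFCC track."""
    num = sum(t * (np.roll(C, -t, axis=0) - np.roll(C, t, axis=0))
              for t in range(1, width + 1))
    return num / (2.0 * sum(t * t for t in range(1, width + 1)))

C = np.random.default_rng(0).normal(size=(100, 13))          # hypothetical MFCC track
features_39 = np.hstack([C, deltas(C), deltas(deltas(C))])   # 13 + 13 + 13
```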

4.3. Modeling using SVM

We use a nonlinear support vector classifier to discriminate the various categories. The classification parameters are calculated using support vector machine learning: the training process analyzes the audio training data to find an optimal way to classify audio frames into their respective classes. The training data should be sufficient to be statistically significant, and is segmented into fixed-length, overlapping frames; overlapping neighboring frames allows the temporal characteristics of the audio content to be taken into account during training. Features such as LPC, LPCC and MFCC are calculated for each frame. The support vector machine learning algorithm is applied to produce the classification parameters from the calculated features, and the derived parameters are then used to classify audio data; the audio content is thus discriminated into the various categories by the designed support vector classifier (a clip-level sketch follows below). The classification results for the different features are shown in Fig. 4. From the results, we observe that the overall classification accuracy is 92% using MFCC as the feature.

Fig. 4. Performance of SVM for audio classification.
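The paper classifies individual frames; how a whole clip receives a single label is not spelled out for the SVM, so the majority vote below is one plausible aggregation, shown purely as an assumption, reusing the clf pipeline from the sketch in Section 3.1.

```python
import numpy as np

def classify_clip(frame_features, clf):
    """Label a clip by majority vote over per-frame SVM decisions
    (the aggregation rule is our assumption, not stated in the paper)."""
    frame_labels = clf.predict(frame_features)
    return np.bincount(frame_labels).argmax()
```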

4.4. Modeling using RBFNN

For RBFNN training, 14th order LPC, 19-dimensional LPCC and 26-dimensional MFCC features are extracted from the audio frames as described in Section 4.2. These features are given as input to the RBFNN model. The RBF centers are located using the k-means algorithm, and the weights are determined using the least squares algorithm. Values of k = 1, 2, and 5 were used in our studies for each category; the system gives optimal performance for k = 5. For training, the weight matrix is calculated using the least squares algorithm discussed in Section 3.2 for each of the features.

For classification, the feature vectors are extracted and each feature vector is given as input to the RBFNN model. The average output is calculated for each output neuron, and the class to which the audio sample belongs is decided by the highest average output (a sketch follows below). Fig. 5 shows the performance of RBFNN for audio classification, and Fig. 6 shows its performance for different numbers of means. The performance of the system for MFCC using SVM and RBFNN is summarized in Table 1.

Fig. 5. Performance of RBFNN for audio classification.

Fig. 6. Performance of RBFNN for different means.

Table 1
Audio classification performance using SVM and RBFNN

SVM (%)    RBFNN (%)
92.1       93.7
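The clip-level decision rule described above (average the output-neuron responses over all of a clip's frames, then take the maximum) can be written directly against the RBFNN sketch from Section 3.2; the helper names are ours.

```python
import numpy as np

def classify_clip_rbfnn(frame_features, centers, d2, W, g=1.0):
    """Average the RBFNN outputs of Eq. (14) over a clip's frames and
    pick the class with the highest mean output (Section 4.4)."""
    outputs = rbfnn_outputs(frame_features, centers, d2, W, g)
    return outputs.mean(axis=0).argmax()
```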

5. Conclusion

In this paper, we have proposed an automatic audio classification system using SVM and RBFNN. Linear prediction coefficients (LPC), linear prediction cepstrum coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC) are calculated as features to characterize audio content. A nonlinear support vector machine learning algorithm is applied to obtain the optimal class boundary between the various classes, namely music, news, sports, advertisement, cartoon and movie, by learning from training data. Experimental results show that the proposed audio classification scheme is very effective, with an accuracy rate of 92%; the radial basis function neural network compared against it showed an accuracy of 93%. The classification rates using LPC and LPCC were slightly lower than with the 39-dimensional MFCC feature vectors. This work indicates that support vector machines and radial basis function neural networks can be used effectively for audio classification. Even though some progress has been achieved, challenges and directions for further research remain, such as extracting different features, developing better classification algorithms, and integrating classifiers to reduce classification errors.

References

Abu-El-Quran, A. R., Goubran, R. A., & Chan, A. D. C. (2006). Security monitoring using microphone arrays and audio classification. IEEE Transactions on Instrumentation and Measurement, 55(4), 1025–1032.

Ajmera, J., McCowan, I., & Bourlard, H. (2003). Speech/music segmentation using entropy and dynamism features in a HMM classification framework. Speech Communication, 40(3), 351–363.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley-Interscience.

Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 321–329.

Esmaili, S., Krishnan, S., & Raahemifar, K. (2004). Content based audio classification and retrieval using joint time–frequency analysis. In IEEE international conference on acoustics, speech and signal processing (pp. 665–668).

Guo, G., & Li, S. Z. (2003). Content-based audio classification and retrieval by support vector machines. IEEE Transactions on Neural Networks, 14(1), 308–315.

Haykin, S. (2001). Neural networks: A comprehensive foundation. Asia: Pearson Education.

Huang, R., & Hansen, J. H. L. (2006). Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora. IEEE Transactions on Audio, Speech and Language Processing, 14(3), 907–919.

Jiang, H., Bai, J., Zhang, S., & Xu, B. (2005). SVM-based audio scene classification. Proceedings of the IEEE, 131–136.

Kiranyaz, S., Qureshi, A. F., & Gabbouj, M. (2006). A generic audio classification and segmentation approach for multimedia indexing and retrieval. IEEE Transactions on Audio, Speech and Language Processing, 14(3), 1062–1081.

Li, S. Z. (2000). Content-based audio classification and retrieval using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, 8(5), 619–625.

Li, D., Sethi, I. K., Dimitrova, N., & McGee, T. (2001). Classification of general audio data for content-based retrieval. Pattern Recognition Letters, 22, 533–544.

Lin, C.-C., Chen, S.-H., Truong, T.-K., & Chang, Y. (2005). Audio classification and categorization based on wavelets and support vector machine. IEEE Transactions on Speech and Audio Processing, 13(5), 644–651.

Lu, L., Zhang, H.-J., & Li, S. Z. (2003). Content-based audio classification and segmentation by using support vector machines. Multimedia Systems, 8, 482–492.

McConaghy, T., Leung, H., Boss, E., & Varadan, V. (2003). Classification of audio radar signals using radial basis function neural networks. IEEE Transactions on Instrumentation and Measurement, 52(6), 1771–1779.

Mubarak, O. M., Ambikairajah, E., & Epps, J. (2005). Analysis of an MFCC-based audio indexing system for efficient coding of multimedia sources. In IEEE international conference on acoustics, speech and signal processing (pp. 619–622).

Palanivel, S. (2004). Person authentication using speech, face and visual speech. Ph.D. thesis, IIT Madras.

Panagiotakis, C., & Tziritas, G. (2005). A speech/music discriminator based on RMS and zero-crossings. IEEE Transactions on Multimedia, 7(1), 155–156.

Rabiner, L., & Juang, B. (2003). Fundamentals of speech recognition. Singapore: Pearson Education.

Rabiner, L., & Schafer, R. (2005). Digital processing of speech signals. Pearson Education.

Rajapakse, M., & Wyse, L. (2005). Generic audio classification using a hybrid model based on GMMs and HMMs. Proceedings of the IEEE, 1550–1555.

Umapathy, K., Krishnan, S., & Jimaa, S. (2005). Multigroup classification of audio signals using time–frequency parameters. IEEE Transactions on Multimedia, 7(2), 308–315.

Umapathy, K., Krishnan, S., & Rao, R. K. (2007). Audio signal feature extraction and classification using local discriminant bases. IEEE Transactions on Audio, Speech and Language Processing, 15(4), 1236–1246.

Vapnik, V. (1998). Statistical learning theory. New York: John Wiley and Sons.

Xu, C., Maddage, N. C., & Shao, X. (2005). Automatic music classification and summarization. IEEE Transactions on Speech and Audio Processing, 13(3), 441–450.


Experimental results show that the classiﬁcation accuracy of RBF with mel cepstral features can provide a better result. non-pure speech. 1. where the NFL explores the information provided by multiple prototypes per class. competitive recognition accuracies are achieved with very low-order hidden Markov models (Ajmera et al. The classiﬁcation of continuous general audio data for contentbased retrieval was addressed in Li. and Gaussian mixture model (GMM) were compared.2. Then. & Raahemifar. and Varadan (2003) to classify real-life audio radar signals that are collected by a ground surveillance radar mounted on a tank. A method is proposed in Panagiotakis and Tziritas (2005) for speech/music discrimination based on root mean square and zero-crossings. advertisement. and Xu (2005). Each RBF center approximates a cluster of training data vectors that are close to each other in Euclidean space. When a query audio is given. This system can be used to select the optimum coding scheme for the current frame of an input signal without knowing a priori whether it contains speech-like or music-like characteristics.2008. news.eswa. and Gabbouj (2006). The method described in content-based audio classiﬁcation and retrieval using joint time–frequency analysis exploits the non-stationary behavior of music signals and extracts features that characterize their spectral change over time (Esmaili. and Jimaa (2005). a generic audio classiﬁcation and segmentation approach for multimedia indexing and retrieval is described. in turn activating certain output nodes. Audio classiﬁcation techniques for speech recognition and audio segmentation. the system ﬁrst ﬁnds a boundary inside which the query pattern is located. The paper is organized as follows. in which an artiﬁcial neural network (ANN) and hidden Markov model (HMM) are used. automatic audio feature extraction and classiﬁcation approaches are presented. Bai. Ambikairajah. McCowan. sports. and silence. Zhang. Sethi. The audio signals were decomposed in Umapathy. spectrum ﬂux were extracted (Lu. ZCR. Krishnan.ARTICLE IN PRESS 2 P. Audio features like MFCC. investigates the feasibility of an audio-based context recognition system where simplistic low-dimensional feature vectors are evaluated against more standard spectral features. and McGee (2001). MFCC are extracted to characterize the audio content. A new pattern classiﬁcation method called the nearest feature line (NFL) is proposed in Li (2000). Classiﬁcation of audio signals using SVM and RBFNN. Qureshi. An approach given in Jiang. and Bourlard (2003). where the audio segments where classiﬁed based on MFCC and LPC. Outline of the work In this paper. Zhang. All boundaries are learned by the SVMs and stored together with the audio database. Dhanalakshmi et al. uses support vector machine (SVM) for audio scene classiﬁcation. using an adaptive time frequency decomposition algorithm. Using discriminative training. Krishnan. a number of features such as LPC. They also showed that cepstral-based features gave a better classiﬁcation accuracy. Fig. & Li. doi:10. modeling techniques for audio classiﬁcation is described in Section 3. Boss.. 2003). for unsupervised multispeaker change detection are proposed in Huang and Hansen (2006). For audio retrieval. In Kiranyaz. Experimental results using SVM and RBF are reported in Section 4. When a vector is input to the RBFNN. Radial basis function neural networks (RBFNN) are used in McConaghy. 
and Epps (2005) a speech/music discrimination system was proposed based on mel-frequency cepstral coefﬁcient (MFCC) and GMM classiﬁer. 1 illustrates the block diagram of audio classiﬁcation. The performance of SVM is compared to RBF network. The number of mixture components in the GMM was derived using the minimum description length (MDL) criterion. music.. all the audio patterns in the database are sorted by their distances to this boundary. (2006).126 .06. These features were analyzed using linear discriminant functions and classiﬁed into six music groups. In order to discriminate the six categories of audio namely music.1016/j. conclusions and future work are given in Section 5. The method proposed in Eronen et al. cartoon and movie. Fig. 1. Leung. which classiﬁes audio clips into one of ﬁve classes: pure speech. the centers near to that vector become strongly activated. The classiﬁcation is then implemented using weighted GMM networks. P. In Mubarak. The acoustic feature extraction is presented in Section 2. 2004). Block diagram of audio classiﬁcation. 2003). K-nearest neighbor (KNN). called distance-from-boundary (DFB). environment sound. et al. LPCC. A hybrid model comprised of Gaussian mixtures models (GMMs) and hidden Markov models (HMMs) is used to model generic sounds with large intra class perceptual variations in Rajapakse and Wyse (2005). Dimitrova. Two new extended-time features: variance of the spectrum ﬂux (VSF) and variance of the zero-crossing rate (VZCR) are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. a new metric has been proposed in Guo and Li (2003). / Expert Systems with Applications xxx (2008) xxx–xxx of broadcast news is described in Ajmera. and the performance using SVM. Finally. brightness and bandwidth. Support vector machine (SVM) is applied to obtain the optimal class boundary between the classes by learning from training data. and the signal decomposition parameter based on octave (scaling) was used to generate a set of 42 features over three frequency bands within the auditory range. Expert Systems with Applications (2008). Please cite this article in press as: Dhanalakshmi.

p < m 6 D. The recursive relation (4) between the predictor coefﬁcients and cepstral coefﬁcients is used to convert the LP coefﬁcients into LP cepstral coefﬁcients fck g The SVM (Vapnik. . . 2. Please cite this article in press as: Dhanalakshmi. The mel-cepstral features (Xu et al. the nonlinear support vector classiﬁer will be applied. and then to do a linear separation in feature space as shown in Fig. If sðnÞ is the present sample. 2003). P.. cm ¼ am þ mÀ1 Xk ck amÀk . m k¼1 mÀ1 X k ck amÀk . xÞ þ b : ð7Þ c0 ¼ ln r2 . .06. Guo & Li.. Psychophysical studies have found the phenomena of the mel pitch scale and the critical band. The selected features include linear prediction coefﬁcients (LPC). is a useful statistic machine learning technique that has been successfully applied in the pattern recognition area (Jiang et al. Expert Systems with Applications (2008).. Principle of support vector machines. To construct a nonlinear support vector classiﬁer. 2. can be illustrated by the MFCCs.. Therefore. et al. They are as follows: where r2 is the gain term in the LP analysis and D is the number of LP cepstral coefﬁcients. 2005. Linear prediction derived cepstrum coefﬁcients(LPCC) and mel-frequency cepstral coefﬁcients (MFCC). / Expert Systems with Applications xxx (2008) xxx–xxx 3 2. 2. . In the LP analysis of speech each sample is predicted as linear weighted sum of the past p samples. 1 6 m 6 p.126 . doi:10. from the given set of bases deﬁned by the kernel. Linear channel effects are compensated to some extent by removing the mean of the trajectory of each cepstral coefﬁcient. . which are computed from the fast Fourier transform (FFT) power coefﬁcients. The SVM algorithm can construct a variety of learning machines by use of different kernel functions.2.. These features represent the short-time spectrum of the speech signal. The spectral information of the same sound uttered by two persons may differ due to change in the shape of the individual’s vocal tract system.1. Dhanalakshmi et al. . N. The segmental features are the features extracted from short (10–30 ms) segments of the speech signal. and the manner of speech production. . the number of triangular ﬁlters that fall in the frequency range 200–1200 Hz (i. .1016/j. In this work. with a shift of 5 ms.eswa. This is completely equivalent to constructing the optimal hyperplane in the corresponding feature space. KÞ. The 19 dimensional WLPCC (mean subtracted) for each frame is used as an acoustic feature vector. . 2. Palanivel. If the data are linearly nonseparable but nonlinearly separable. 2 . the ﬁrst layer selects the basis Kðxi . the second layer constructs a linear function in this space. Classiﬁcation of audio signals using SVM and RBFNN. . y) is replaced by a kernel function K(x. the frequency range of dominant audio information) is higher than the other values of c. The mel scale is deﬁned as F mel ¼ f c log 1 þ c log ð2Þ .ARTICLE IN PRESS P. where p represents the order of prediction (Rabiner & Schafer. i ¼ 1.. 2. Three kinds of kernel functions are usually used.2008. . K k¼1 K n ¼ 1.1. 1998). the inner product (x. 2. it is efﬁcient to set the value of c in that range for calculating MFCCs. 2005. When c in (5) is in the range of 250–350.e. L: ð6Þ 3. a 19 dimensional WLPCC is obtained from the 14th order LP analysis for each frame. The cepstral coefﬁcients are linearly weighted to get the weighted linear prediction cepstral coefﬁcients (WLPCC). The power coefﬁcients are ﬁltered by a triangular bandpass ﬁlter bank. 2004). 
Support vector machine (SVM) for classiﬁcation ^ðnÞ ¼ À s p X k¼1 ak sðn À kÞ: ð1Þ The difference between the actual and the predicted sample value is termed as the prediction error or residual. a0 ¼ 1 The LP coefﬁcients fak g are determined by minimizing the mean squared error over an analysis frame and it is described in Rabiner and Juang (2003). y): f ðxÞ ¼ sgn l X i¼1 ! ai yi Kðxi . Denoting the output of the ﬁlter bank by Sk ðk ¼ 1. Acoustic feature extraction Acoustic features representing the audio information can be extracted from the speech signal at the segmental level. During the learning process. The short-time spectrum envelope of the speech signal is attributed primarily to the shape of the vocal tract. and the frequency scale-warping to the mel scale Fig. The basic idea is to transform input vectors into a high-dimensional feature space using a nonlinear transformation /. A pth order LP analysis is used to capture the properties of the signal spectrum. 2005). ð5Þ where F mel is the logarithmic scale of f normal frequency scale. xÞ. Modeling techniques for audio classiﬁcation 3. and is given by eðnÞ ¼ sðnÞ À ^ðnÞ ¼ sðnÞ þ s ¼ p X k¼0 Xp a sðn k¼1 k À kÞ ð2Þ ð3Þ ak sðn À kÞ. cm ¼ m k¼1 ð4Þ The SVM has two layers. Linear prediction analysis For acoustic feature extraction. Mel-frequency cepstral coefﬁcients The mel-frequency cepstrum has proven to be highly effective in recognizing structure of music signals and in modeling the subjective pitch and frequency content of audio signals. then it is predicted by the past p samples as has led to the cepstrum domain representation. the differenced speech signal is divided into frames of 20 ms. the MFCCs are calculated as cn ¼ rﬃﬃﬃﬃ K h 2X pi ðlog Sk Þ cos nðk À 0:5Þ .

. Therefore. a hidden layer. . / Expert Systems with Applications xxx (2008) xxx–xxx 1. and g is an empirical scale factor which serves to control the smoothness of the mapping function.. i Here the k-means clustering algorithm is employed to determine the centers. Each of them implements a radial basis function.126 . . Dhanalakshmi et al. the above equation is written as g i ðxj Þ ¼ exp Àkxj À li k2 ! : ð13Þ gd2 The hidden layer units are fully connected to the nc output layer units through weights wik . Then the activation function of the ith hidden unit for an input vector xj is given by In the above equation d is the maximum distance between the chosen centers. The output units are linear. G is completely speciﬁed by the clustering results. . The assumption r2 ¼ r2 is made and r2 is given in (12) to ensure that the activai tion functions are not too peaked or too ﬂat: ! ÀjX À Yj2 : KðX. Neural networks with tanh activation function: ð9Þ KðX. nh . 2001) has a feed forward architecture with an input layer. For simplicity. and the elements of Y are speciﬁed as Y ij ¼ 1 if xi 2 class j. & Stork. . The algorithm is composed of the following steps: 1. G is a nt Â ðnh þ 1Þ matrix with elements Gij ¼ yj ðxi Þ. The activation functions of the hidden layer were chosen to be Gaussians. In between the inputs and outputs there is a layer of processing units called hidden units. . nc . YÞ ¼ ðhX. . 3. i i ¼ 1. W is obtained from the standard least squares solution as given by W ¼ ðGT GÞÀ1 GT Y: ð16Þ To solve W from (16). where nc is the number of output classes. k ¼ 1. . Training feature vectors belonging to the same class are clustered to nh =nc clusters using the k-means clustering algorithm. The input units are fully connected to the nh hidden layer units. and the response of the kth output unit for an input xj is given by yk ðxj Þ ¼ i¼0 X nh wik g i ðxj Þ. Radial basis function neural network. nh . . Randomly initialize the samples to k means (clusters) l1 . can be applied to ﬁnd nh clusters from nt training vectors. doi:10. . 0 otherwise: ð17Þ 4. d . Given nt feature vectors from nc classes. lk : Y ¼ GW. 1. and wik . Classify n samples according to nearest lk . and W is a ðnh þ 1Þ Â nc matrix of unknown weights. the k-means clustering may be used in a supervised manner. g. Determining the weights wik between the hidden and output layer: Given that the Gaussian function centers and widths are computed from nt training vectors (14) may be written in matrix form as 2 g i ðxj Þ ¼ exp ! Àkxj À li k2 : 2r2 i ð11Þ The li and r2 are calculated by using suitable clustering algorithm. i ¼ 1. ð14Þ where g 0 ðxj Þ ¼ 1. the unsupervised kmeans clustering algorithm (Duda. . it is assumed that the covariance matrices are of the form C i ¼ r2 I. .ARTICLE IN PRESS 4 P. 3.1016/j. This is repeated for each class yielding nh cluster for nc classes. et al. 3.2008.2. P. 2. . Polynomial kernel of degree d: KðX.06. . and an output layer as shown in Fig. where the parameters K and ð10Þ r2 ¼ gd2 2 : ð12Þ l are the gain and shift. . The number of activation functions in the network and their spread inﬂuence the smoothness of the mapping. 4. 2. 2001). the training vectors of a class may not fall into a single cluster. YÞ ¼ tan hðKhX. . k ¼ 1. . The parameter d was then computed by ﬁnding the maximum distance between nh cluster means. ð15Þ where Y is a nt Â nc matrix with elements Y ij ¼ yj ðxi Þ. . i ¼ 1. . . These cluster means are used as the centers li of the Gaussian activation functions in the RBFNN. . i ¼ 0. 2. 
Radial basis functions are embedded into a two-layer feed forward neural network. Experimental results The evaluation of the proposed audio classiﬁcation and segmentation algorithms have been performed by using a generic audio database which consists of the following contents: 100 clips of Fig. Such a network is characterized by a set of inputs and a set of outputs. Radial basis function with Gaussian kernel of width C > 0: d ð8Þ 2. Expert Systems with Applications (2008). . Repeat steps 2 and 3 until no change in lk . 2. and covariance matrices C i . In order to obtain clusters only according to class. 2. . Classiﬁcation of audio signals using SVM and RBFNN. nh . . However. . Yi þ lÞ. Recompute lk . The input layer of this network has ni units for a ni dimensional input vector. Radial basis function neural network (RBFNN) model for classiﬁcation The radial basis function neural network (Haykin. Yi þ 1Þ : 2. which are in turn fully connected to the nc output layer units. Please cite this article in press as: Dhanalakshmi. 3. . Hart. nc . YÞ ¼ exp c 3. nh . and are characterized by their mean vectors (centers) li .eswa. The training procedure is given below: Determination of li and d : Conventionally. . 2. training 2 the RBFNN involves estimating li . .

100 clips of songs (sung by male and female) in different languages. 19 dimensional LPCC and 26 dimensional MFCC features are extracted from the audio frames as described in Section 4. 4. Dhanalakshmi et al. is computed from the length of the parameterized static vector 13. 100 clips of movie from different languages. the audio signals were segmented and windowed into short frames of 256 samples. speech duration. n ¼ 1. The training data is segmented into ﬁxed-length and overlapping frames (in our experiments we used 20 ms frames with 10 ms overlapping). 39. The training data should be sufﬁcient to be statistically signiﬁcant. 2007).2. Classiﬁcation parameters are calculated using support vector machine learning. 4.. 4. Silence segments will be removed from the audio sequence. Modeling using SVM We use a nonlinear support vector classiﬁer to discriminate the various categories.2 and are widely used in the area of audio and speech processing.1. Performance of SVM for audio classiﬁcation. 100 clips of sports and 100 clips of news (both Tamil and English).3.06. 0 otherwise: ð21Þ If the short-time energy function is continuously lower than a certain set of thresholds (there may be durations in which the energy is higher than the threshold. Features such as LPC. N m ð20Þ where xðmÞ is the discrete time music signal.e. A 14th order LP analysis is used to capture the properties of the Fig. P. This number. The short-time energy function of a music signal is deﬁned as En ¼ 1X ½xðmÞwðn À mÞ2 . . . the temporal characteristics of audio content can be taken into consideration in the training process. The selected features should reﬂect the signiﬁcant characteristics of different kinds of audio signals. The training data should be sufﬁcient to be statistically signiﬁcant. The audio content can be discriminated into the various categories in terms of the designed support vector classiﬁer. with a shift of 10 ms. the better the extracted feature which exhibits more accurate audio characteristics.1. We use shorttime energy to detect silence. high-frequency components have relatively low amplitude. 4. These values are the change of the energy of the signal. When neighboring frames are overlapped. . The training data is segmented into ﬁxed-length and overlapping frames. Modeling using RBFNN For RBFNN training. 2006). Feature selection from non-silent frames Feature selection is important for audio content analysis. and wðmÞ is a rectangular window. 14th order LPC. i. The audio signals are recorded for 60 s at 8000 samples per second and divided into frames of 20 ms. 4. measure three values from the audio segment to be classiﬁed.eswa. For each audio signal we arrived at 39 features. Due to radiation effects of the sound from lips.. From the results. Then the pre-emphasized frame is Hamming-windowed by hðnÞ ¼ 0:54 À 0:46 Ã cosð2pn=N À 1Þ. To obtain MFCCs (Umapathy et al. When neighboring frames are overlapped. LPC and LPCC are two linear prediction methods and they are highly correlated to each other. The signal duration was slightly increased using the following rationale that the longer the audio signal analyzed. Please cite this article in press as: Dhanalakshmi. The training process analyzes audio training data to ﬁnd an optimal way to classify audio frames into their respective classes. The processed audio sequence will be segmented into ﬁxed-length and 10 ms overlapping frames. These features are given as input to the s0 ðnÞ ¼ sðnÞ À 0:96 Â sðn À 1Þ.ARTICLE IN PRESS P. 
One simple solution is to augment the energy of the high-frequency spectrum. but the durations should be short enough and far apart from each other). . In order to evaluate the relative performance of the proposed work. The support vector machine learning algorithm is applied to produce the classiﬁcation parameters according to calculated features. with a sampling rate of 8 kHz and 16-bits per sample. plus the delta coefﬁcients (+13) plus the acceleration coefﬁcients (+13). The classiﬁcation results for the different features are shown in Fig.126 . including unnoticeable noise and very short clicks. the segment is indexed as silence. Expert Systems with Applications (2008). we observe that the overall classiﬁcation accuracies is 92% using MFCC as feature. Goubran. Audio samples are of different length. n is the time index of the short-time energy. Classiﬁcation of audio signals using SVM and RBFNN. MFCCs are short-term spectral features as described in Section 2. the temporal characteristics of the audio content can be taken into consideration in the training process. wðnÞ ¼ 1 if 0 6 n 6 N À 1. LPC-based algorithms (Abu-El-Quran. LPCC and MFCC are calculated for each frame. The LP coefﬁcients for each frame is linearly weighted to form the WLPCC.4. ð18Þ where sðnÞ is the nth sample of the frame s and s0 ð0Þ ¼ sð0Þ. N À 1.2. and the change of the pitch value. 4. doi:10. Magnitude spectrum was computed for each of these frames using fast Fourier transform (FFT) and converted into a set of mel scale ﬁlter bank outputs. Preprocessing 0 6 n 6 N À 1: ð19Þ The aim of preprocessing is to remove silence from a music sequence. ranging from 1 s to about 10 s. Silence is deﬁned as a segment of imperceptible audio. The derived classiﬁcation parameters are used to classify audio data.2008. & Chan. The recursive relation (4) between the predictor coefﬁcients and cepstral coefﬁcients is used to convert the 14 LP coefﬁcients into 19 cepstral coefﬁcients. we compared it with the well-known MFCC features.This procedure can be implemented via a pre-emphasizing ﬁlter that is deﬁned as signal spectrum as described in Section 2. The selected features include linear prediction coefﬁcients (LPC)–linear prediction derived cepstrum coefﬁcients (LPCC) and mel-frequency cepstrum coefﬁcients(MFCC). et al.. / Expert Systems with Applications xxx (2008) xxx–xxx 5 advertisement in different languages. Logarithm was applied to the ﬁlter bank outputs followed by discrete cosine transformation to obtain the MFCCs.1016/j. 100 cartoon clips. which will inﬂuence the capture of the features at the high end of the spectrum.
To obtain the MFCCs (Umapathy et al., 2006), the audio signals were segmented and windowed into short frames of 256 samples. The magnitude spectrum was computed for each of these frames using the fast Fourier transform (FFT) and converted into a set of mel-scale filter bank outputs; the logarithm was applied to the filter bank outputs, followed by a discrete cosine transformation, to obtain the MFCCs. For each audio signal we arrived at 39 features per frame: this number is computed from the length of the parameterized static vector (13), plus the delta coefficients (+13), plus the acceleration coefficients (+13). A sketch of this 39-dimensional MFCC computation is given below.
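As an illustration of the 13 + 13 + 13 layout, here is a minimal sketch using librosa (an assumed dependency; any MFCC implementation with delta features would serve). The 256-sample FFT frame follows the paper; the 80-sample hop (10 ms at 8 kHz) is an assumption.

```python
import numpy as np
import librosa  # assumed available; not prescribed by the paper

def mfcc_39(y, sr=8000):
    """13 static MFCCs plus delta and acceleration coefficients,
    giving the 39-dimensional per-frame feature vector."""
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                  n_fft=256, hop_length=80)
    delta = librosa.feature.delta(static)           # first derivative
    accel = librosa.feature.delta(static, order=2)  # second derivative
    return np.vstack([static, delta, accel])        # shape (39, n_frames)
```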
4.3. Modeling using SVM

We use a nonlinear support vector classifier to discriminate the various categories. The training process analyzes the audio training data to find an optimal way to classify audio frames into their respective classes:

1. The training data is segmented into fixed-length and overlapping frames (in our experiments we used 20 ms frames with 10 ms overlapping). When neighboring frames are overlapped, the temporal characteristics of the audio content can be taken into consideration in the training process. The training data should be sufficient to be statistically significant.
2. Classification parameters are calculated using support vector machine learning: the support vector machine learning algorithm is applied to produce the classification parameters from the calculated features.
3. The derived classification parameters are used to classify audio data, and the audio content is discriminated into the various categories in terms of the designed support vector classifier. In addition, three values are measured from the audio segment to be classified: the speech duration, the change of the energy of the signal, and the change of the pitch value.

In order to evaluate the relative performance of the proposed work, the LPC and LPCC features were compared with the well-known MFCC features. The classification results for the different features are shown in Fig. 4. From the results, we observe that the overall classification accuracy is 92% using MFCC as the feature; the classification rates using LPC and LPCC were slightly lower than with the 39 dimensional MFCC feature vectors. A minimal sketch of the SVM stage follows.

Fig. 4. Performance of SVM for audio classification.
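As an illustration of the SVM stage, here is a minimal sketch built on scikit-learn's SVC with an RBF kernel; the paper does not prescribe this library, and `X_train`, `y_train`, `X_clip` and the value of C are illustrative assumptions. Frame-level decisions are combined by a majority vote over the clip, one reasonable way to obtain a clip label.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

CLASSES = ["music", "news", "sports", "advertisement", "cartoon", "movie"]

def train_svm(X_train, y_train):
    """X_train: (n_frames, 39) feature rows from the extraction step;
    y_train: integer class index per frame. C=10 is an assumed value."""
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=10.0, gamma="scale"))
    clf.fit(X_train, y_train)
    return clf

def classify_clip(clf, X_clip):
    """Label every frame of a clip, then take the majority vote."""
    frame_labels = clf.predict(X_clip)
    return CLASSES[np.bincount(frame_labels).argmax()]
```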
4.4. Modeling using RBFNN

For RBFNN training, 19 dimensional LPCC and 26 dimensional MFCC features are extracted from the audio frames as described in Section 4.2, and these features are given as input to the RBFNN model. The RBF centers are located using the k-means algorithm; values of k = 1, 2, ..., 5 were used in our studies for each category, and the system gives optimal performance for k = 5. For training, the weight matrix is calculated using the least squares algorithm discussed in Section 3.2 for each of the features. For classification, the feature vectors are extracted and each feature vector is given as input to the RBFNN model; the average output is calculated for each output neuron, and the class to which the audio sample belongs is decided based on the highest output.

Fig. 5 shows the performance of RBFNN for audio classification, and Fig. 6 shows the performance of RBFNN for different means. The performance of the system for MFCC using SVM and RBFNN is given in Table 1. Experimental results show that the proposed audio classification scheme is very effective: the accuracy rate is 92% using SVM, while the radial basis function neural network showed an accuracy of 93%.

Fig. 5. Performance of RBFNN for audio classification.
Fig. 6. Performance of RBFNN for different means.

Table 1
Audio classification performance using SVM and RBFNN

    SVM (%)     RBFNN (%)
    92.1        93.7

A minimal sketch of the RBFNN stage (k-means centers plus least-squares output weights) follows.
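The following sketch mirrors the RBFNN recipe just described: k-means locates the centers, the output weights come from a linear least-squares fit, and a clip is assigned to the output neuron with the highest average response. NumPy/SciPy are assumed; the number of centers and the width heuristic are illustrative choices not fixed by the paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2   # k-means for the RBF centers

def rbf_hidden(X, centers, sigma):
    """Gaussian basis outputs: phi_ij = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rbfnn(X, Y, n_centers=30):
    """X: (n_frames, dim) features; Y: (n_frames, 6) one-hot labels.
    Centers via k-means, weights via linear least squares."""
    centers, _ = kmeans2(X, n_centers, minit="++", seed=0)
    # Assumed width heuristic: max center spacing / sqrt(2 * n_centers).
    dmax = np.sqrt(((centers[:, None] - centers[None, :]) ** 2)
                   .sum(-1)).max()
    sigma = dmax / np.sqrt(2.0 * n_centers)
    phi = rbf_hidden(X, centers, sigma)
    W, *_ = np.linalg.lstsq(phi, Y, rcond=None)   # least-squares weights
    return centers, sigma, W

def classify_clip_rbf(X_clip, centers, sigma, W):
    """Average each output neuron over the clip's frames and pick the
    class with the highest average output, as in the paper."""
    outputs = rbf_hidden(X_clip, centers, sigma) @ W
    return int(outputs.mean(axis=0).argmax())
```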

5. Conclusion

In this paper, we have proposed an automatic audio classification system using SVM and RBFNN. Linear prediction coefficients, linear prediction cepstrum coefficients (LPC, LPCC) and mel-frequency cepstral coefficients are calculated as features to characterize audio content. A nonlinear support vector machine learning algorithm is applied to obtain the optimal class boundary between the various classes, namely music, news, sports, advertisement, cartoon and movie, by learning from training data. Experimental results show that the proposed audio classification scheme is very effective, with an accuracy rate of 92%; the performance was compared to the radial basis function neural network, which showed an accuracy of 93%. This work indicates that support vector machines and radial basis function neural networks can be effectively used for audio classification. Even though some progress has been achieved, challenges and directions for further research remain, such as extracting different features, developing better classification algorithms, and integrating classifiers to reduce classification errors.

References

Abu-El-Quran, A. R., Goubran, R. A., & Chan, A. D. C. (2006). Security monitoring using microphone arrays and audio classification. IEEE Transactions on Instrumentation and Measurement, 55(4), 1025–1032.
Ajmera, J., McCowan, I., & Bourlard, H. (2003). Speech/music segmentation using entropy and dynamism features in a HMM classification framework. Speech Communication, 40(3), 351–363.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: John Wiley-Interscience.
Eronen, A., Peltonen, V., Tuomi, J., Klapuri, A., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 321–329.
Esmaili, S., Krishnan, S., & Raahemifar, K. (2004). Content based audio classification and retrieval using joint time–frequency analysis. In IEEE international conference on acoustics, speech and signal processing (pp. 665–668).
Guo, G., & Li, S. Z. (2003). Content-based audio classification and retrieval by support vector machines. IEEE Transactions on Neural Networks, 14(1), 209–215.
Haykin, S. (2001). Neural networks: A comprehensive foundation. Asia: Pearson Education.
Huang, R., & Hansen, J. H. L. (2006). Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora. IEEE Transactions on Audio, Speech and Language Processing, 14(3), 907–919.
Jiang, H., Bai, J., Zhang, S., & Xu, B. (2005). SVM-based audio scene classification. In Proceedings of the IEEE international conference on natural language processing and knowledge engineering (pp. 131–136).
Kiranyaz, S., Qureshi, A. F., & Gabbouj, M. (2006). A generic audio classification and segmentation approach for multimedia indexing and retrieval. IEEE Transactions on Audio, Speech and Language Processing, 14(3), 1062–1081.
Li, D., Sethi, I. K., Dimitrova, N., & McGee, T. (2001). Classification of general audio data for content-based retrieval. Pattern Recognition Letters, 22, 533–544.
Li, S. Z. (2000). Content-based audio classification and retrieval using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, 8(5), 619–625.
Lin, C.-C., Chen, S.-H., Truong, T.-K., & Chang, Y. (2005). Audio classification and categorization based on wavelets and support vector machine. IEEE Transactions on Speech and Audio Processing, 13(5), 644–651.
Lu, L., Zhang, H.-J., & Li, S. Z. (2003). Content-based audio classification and segmentation by using support vector machines. Multimedia Systems, 8, 482–492.
McConaghy, T., Leung, H., Bossé, E., & Varadan, V. (2003). Classification of audio radar signals using radial basis function neural networks. IEEE Transactions on Instrumentation and Measurement, 52(6), 1771–1779.
Mubarak, O., Ambikairajah, E., & Epps, J. (2005). Analysis of an MFCC-based audio indexing system for efficient coding of multimedia sources. In IEEE international conference on acoustics, speech and signal processing (pp. 619–622).
Palanivel, S. (2004). Person authentication using speech, face and visual speech. Ph.D. thesis, IIT Madras.
Panagiotakis, C., & Tziritas, G. (2005). A speech/music discriminator based on RMS and zero-crossings. IEEE Transactions on Multimedia, 7(1), 155–166.
Rabiner, L., & Juang, B.-H. (2003). Fundamentals of speech recognition. Singapore: Pearson Education.
Rabiner, L. R., & Schafer, R. W. (2004). Digital processing of speech signals. Pearson Education.
Rajapakse, M., & Wyse, L. (2005). Generic audio classification using a hybrid model based on GMMs and HMMs (pp. 1550–1555).
Umapathy, K., Krishnan, S., & Jimaa, S. (2005). Multigroup classification of audio signals using time–frequency parameters. IEEE Transactions on Multimedia, 7(2), 308–315.
Umapathy, K., Krishnan, S., & Rao, S. (2007). Audio signal feature extraction and classification using local discriminant bases. IEEE Transactions on Audio, Speech and Language Processing, 15(4), 1236–1246.
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley and Sons.
Xu, C., Maddage, N. C., & Shao, X. (2005). Automatic music classification and summarization. IEEE Transactions on Speech and Audio Processing, 13(3), 441–450.
