
Feature Analysis and Extraction for Audio Automatic Classification
Bai Liang, Hu Yanli, Lao Songyang, Chen Jianyun, Wu Lingda
Multimedia Research & Development Center
National University of Defense Technology
Changsha 410073, P. R. China
xabpz@163.com
Abstract - Feature analysis and extraction are the foundation of audio automatic classification. This paper divides audio streams into five classes: silence, noise, pure speech, speech over background sound and music. We present our work on audio feature analysis and extraction at the frame level and the clip level. Four new features are proposed: silence ratio, pitch frequency standard deviation, harmonicity ratio and smooth pitch ratio. We also present an SVM-based approach to classification. The effectiveness of the features is evaluated in experiments, and the results show that the features we selected and proposed are rational and effective.

Keywords: Feature analysis and extraction, content-based audio classification, support vector machines.

1 Introduction

Audio automatic classification can provide useful information for both audio content understanding and video content analysis [1,2]. It is of critical importance in audio indexing and retrieval. Feature analysis and extraction are the foundational step for audio automatic classification.

Many studies on audio classification employ different features and methods. Pfeiffer et al. [3] presented a theoretical framework and an application of automatic audio content analysis using perceptual features. Saunders [4] presented a speech/music classifier for radio broadcasts based on simple features such as the zero-crossing rate and short-time energy. Scheirer et al. [5] introduced more features for audio classification and performed experiments with different classification models. Z. Liu et al. [6] presented a set of low-level audio features for characterizing the semantic content of short audio clips, in order to use the associated audio information for video scene analysis. However, in spite of these research efforts, high-accuracy audio classification has only been achieved for simple cases such as speech/music discrimination, and methods based on such simple features do not work well for more diverse audio.

Other research efforts have focused on classification algorithms that discriminate more classes. In the work by Zhang and Kuo [7], a heuristic-based model with many features, including pitch-tracking methods, is introduced to discriminate audio recordings into more classes; an accuracy above 90% is reported. Srinivasan et al. [8] try to detect and classify audio that consists of mixed classes, such as combinations of speech and music together with environment sound; the reported classification accuracy is over 80%. In [9], an audio classification algorithm is presented in which audio is discriminated within a one-second window, and an accuracy above 96% is reported. However, these works all use rule-based classifiers, which require threshold setting. Such a method can only characterize static properties of the audio, and the thresholds are very difficult to select because they must be adjusted for different circumstances. This makes the rule-based approach not general enough to fit different applications.

In view of the analysis above, audio is divided in this paper into five types: silence, noise, music, pure speech and speech over background sound. Discriminating features of these audio types are analyzed at the frame level and the one-second clip level. A set of features, including some new features that can further improve the performance of audio classification, is proposed, and an SVM-based classifier is presented. The effectiveness of the proposed features and the performance of the SVM on audio classification are evaluated with this classifier.

The rest of this paper is organized as follows. Section 2 analyzes the discriminating features of different audio types and describes how an audio clip is represented by a low-level perceptual and spectral feature set. Section 3 gives an overview of linear and kernel SVMs and describes how the SVM-based classifier is designed. In Section 4, experiments and evaluations on a 5-hour database are given. Section 5 draws conclusions and proposes important directions for future research.

2 Audio Feature Analysis

In order to obtain high accuracy in audio classification, it is critical to select good features that can capture the temporal and spectral characteristics of audio signals and are robust to changing circumstances. In our work, features are analyzed and extracted at two levels: the frame level and the one-second clip level.
The features extracted from one clip are combined into one feature vector after normalization. Before feature extraction, an audio signal is converted into a common format ("wav" format in this paper): 22.050 kHz sampling rate, 16 bits per sample, mono channel. It is then pre-emphasized with a coefficient of 0.98 or 0.97 to equalize the inherent spectral tilt and divided into one-second non-overlapping clips; a clip is the classification unit. Each clip is further divided into half-overlapping 23 ms frames by applying a Hamming window.
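The following sketch illustrates this preprocessing pipeline in Python with NumPy; the function name and the exact constants derived from the 23 ms frame length are our own assumptions, not code from the paper.

```python
import numpy as np

def preemphasize_and_frame(x, sr=22050, alpha=0.98, frame_ms=23, clip_s=1.0):
    """Pre-emphasize a mono signal and split it into one-second clips of
    half-overlapping, Hamming-windowed frames (illustrative sketch)."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])        # pre-emphasis filter
    frame_len = int(sr * frame_ms / 1000)              # ~23 ms -> about 507 samples
    hop = frame_len // 2                                # half-overlapping frames
    clip_len = int(sr * clip_s)                         # one-second clips
    window = np.hamming(frame_len)
    clips = []
    for start in range(0, len(x) - clip_len + 1, clip_len):
        clip = x[start:start + clip_len]
        frames = [clip[i:i + frame_len] * window
                  for i in range(0, clip_len - frame_len + 1, hop)]
        clips.append(np.array(frames))
    return clips  # list of (num_frames, frame_len) arrays, one per clip
```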

2.1 Audio frame-level features

In our method, the frame-level features we extract are: 12-order MFCCs, frequency energy, zero-crossing rate (ZCR), sub-band energy ratio, brightness, bandwidth and pitch frequency. The definitions of these features and the methods for calculating them are given below.

Mel-frequency cepstral coefficients. First, the audio is Hamming-windowed in overlapping steps. For each window, the log of the power spectrum is computed using a discrete Fourier transform (DFT). The log spectral coefficients are perceptually weighted by a non-linear mapping of the frequency scale called Mel-scaling. The final stage is to transform the Mel-weighted spectrum (using another DFT) into "cepstral" coefficients.
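As a hedged illustration, the 12-order MFCCs of a clip could be computed with the librosa library as below; the paper does not name a toolkit, the frame and hop sizes are our assumptions, and librosa applies a discrete cosine transform in the final stage, which plays the role of the second DFT described above.

```python
import librosa

def clip_mfcc(clip, sr=22050, n_mfcc=12):
    """12-order MFCCs for one clip, averaged over frames (illustrative only)."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)  # assumed frame/hop sizes
    return mfcc.mean(axis=1)  # one 12-dimensional vector per clip
```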

Frequency energy and sub-band energy ratio. Frequency energy (FE) is the total spectrum power of a frame. In our work, its logarithm is used:

FE = \log\left( \int_0^{\omega_0} |F(\omega)|^2 \, d\omega \right)    (1)

where F(ω) denotes the Fast Fourier Transform (FFT) coefficients and ω0 is the half sampling frequency. Silence frames can be discriminated according to the FE value: a frame is considered silence when its FE is less than a threshold. The frequency spectrum is divided into four sub-bands with intervals [0, ω0/8], [ω0/8, ω0/4], [ω0/4, ω0/2] and [ω0/2, ω0]. The ratio between the power of sub-band j and the total power of a frame is defined as:

D_j = \frac{1}{FE} \int_{L_j}^{H_j} |F(\omega)|^2 \, d\omega    (2)

where L_j and H_j are the lower and upper bounds of sub-band j, respectively. FE is an effective feature, especially for discriminating speech from music signals. In general, there are more silence frames in speech than in music; thus, the variation of FE is much higher for speech than for music. Moreover, the frequency characteristics of the human voice and of musical instruments are very different: the FE of music is distributed relatively evenly over the sub-bands, whereas for speech about 80% of the FE is concentrated in the first sub-band. The sub-band energy ratio is therefore also a good discriminator between speech and music. These two features are commonly used, as in [4,5,6,7,8].
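A minimal sketch of both features for one windowed frame is given below; it follows the textual definition by normalizing the sub-band power by the total spectrum power, and the small constant added before the logarithm is our own safeguard.

```python
import numpy as np

def frequency_energy_and_subbands(frame):
    """Log frequency energy (Eq. 1) and four sub-band energy ratios for one frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2      # |F(w)|^2 up to half the sampling rate
    total = spec.sum() + 1e-10                  # guard against log(0)
    fe = np.log(total)
    n = len(spec)
    edges = [0, n // 8, n // 4, n // 2, n]      # [0, w0/8], [w0/8, w0/4], [w0/4, w0/2], [w0/2, w0]
    ratios = [spec[a:b].sum() / total for a, b in zip(edges[:-1], edges[1:])]
    return fe, ratios
```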
Zero-crossing rate. ZCR is the number of times the audio waveform crosses the zero axis within a frame. It is a simple measure of the frequency content of a signal:

ZCR = \frac{1}{2(N-1)} \sum_{m=1}^{N-1} \left| \mathrm{sgn}[x(m+1)] - \mathrm{sgn}[x(m)] \right|    (3)

where sgn[·] is the sign function and x(m) is the discrete audio signal. In general, speech signals are composed of voiced and unvoiced sounds alternating at the syllable rate, while music signals do not have this kind of structure. Hence, the variation of ZCR for a speech signal is in general greater than that of a music signal. For this reason, many systems have used ZCR for audio classification [4,7,8,9].
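Equation (3) translates directly into a few lines of NumPy; this is a sketch under the paper's definition.

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame, following Eq. (3)."""
    signs = np.sign(frame)
    # |sgn[x(m+1)] - sgn[x(m)]| is 2 at every sign change, 0 otherwise
    return np.abs(np.diff(signs)).sum() / (2.0 * (len(frame) - 1))
```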
Brightness and bandwidth. Brightness is the frequency centroid of the spectrum of a frame:

FC = \frac{\int_0^{\omega_0} \omega \, |F(\omega)|^2 \, d\omega}{\int_0^{\omega_0} |F(\omega)|^2 \, d\omega}    (4)

Bandwidth is the square root of the power-weighted average of the squared difference between the spectral components and the frequency centroid:

BW = \sqrt{\frac{\int_0^{\omega_0} (\omega - FC)^2 \, |F(\omega)|^2 \, d\omega}{\int_0^{\omega_0} |F(\omega)|^2 \, d\omega}}    (5)

In general, the bandwidth of speech ranges from 0.3 kHz to 3.4 kHz, while the range for music is much wider, ordinarily extending up to 22.05 kHz. Brightness and bandwidth represent the frequency characteristics of a signal and have shown their effectiveness in many audio classification systems [5,6,10].
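Both quantities can be approximated per frame from the FFT power spectrum; the sketch below uses frequencies in Hz rather than angular frequency, which only rescales the values.

```python
import numpy as np

def brightness_and_bandwidth(frame, sr=22050):
    """Frequency centroid (Eq. 4) and bandwidth (Eq. 5) of one windowed frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)   # 0 .. sr/2 in Hz
    total = spec.sum() + 1e-10
    fc = (freqs * spec).sum() / total                  # brightness (spectral centroid)
    bw = np.sqrt(((freqs - fc) ** 2 * spec).sum() / total)
    return fc, bw
```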
Pitch frequency. Pitch frequency represents the pitch magnitude. In this paper, we calculate the pitch frequency value using the autocorrelation function method with a center-clipping operation (clipping parameter 0.70).
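A simple sketch of such an estimator is given below; the voicing test and the 50-500 Hz search range are our own assumptions, since the paper only specifies the clipping parameter.

```python
import numpy as np

def pitch_frequency(frame, sr=22050, clip_level=0.70, fmin=50.0, fmax=500.0):
    """Autocorrelation pitch estimate with center clipping; returns 0 if no pitch is found."""
    c = clip_level * np.max(np.abs(frame))
    y = np.where(frame > c, frame - c, np.where(frame < -c, frame + c, 0.0))  # center clipping
    if not np.any(y):
        return 0.0
    ac = np.correlate(y, y, mode='full')[len(y) - 1:]   # autocorrelation, non-negative lags
    lo, hi = int(sr / fmax), int(sr / fmin)              # assumed pitch search range
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < 0.3 * ac[0]:                            # weak peak: treat as unpitched frame
        return 0.0
    return sr / lag                                      # pitch frequency in Hz
```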
2.2 Audio clip-level features

In our work, a clip is the classification unit, and clip-level features are computed from the frame-level features. For ZCR, energy, sub-band energy ratio, bandwidth and brightness, we compute the mean over all frames in a clip as the basic clip-level features. Some new clip-level features are defined as follows.
Silence ratio. Silence ratio (SR) is defined as the ratio of silence frames in a given audio clip. A frame is considered silence if its total energy is lower than a preset threshold. In general, the SR value of speech is higher than that of music because there are many more silence frames in the speech class than in the music class.
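As a small sketch, SR can be computed from the per-frame energies of a clip; the threshold itself is data-dependent and assumed here.

```python
import numpy as np

def silence_ratio(frame_energies, energy_threshold):
    """Fraction of frames in a clip whose energy falls below a preset threshold."""
    return float(np.mean(np.asarray(frame_energies) < energy_threshold))
```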
High zero-crossing rate ratio. The zero-crossing rate has proved useful for characterizing different audio signals. In general, speech signals are composed of voiced and unvoiced sounds alternating at the syllable rate, while music signals do not have this kind of structure; hence the variation of ZCR for a speech signal is in general greater than that of music. We therefore use the high ZCR ratio (HZCRR) as one feature in our approach. HZCRR is defined as the ratio of frames whose ZCR is above 1.5 times the average ZCR of a given clip:

HZCRR = \frac{1}{2N} \sum_{n=0}^{N-1} \left[ \mathrm{sgn}\big(ZCR(n) - 1.5\,avZCR\big) + 1 \right]    (6)

where n is the frame index, ZCR(n) is the ZCR of the nth frame, N is the total number of frames, and avZCR is the average ZCR over the clip. Fig. 1 illustrates the HZCRR curves for speech and music signals. It can be seen that the HZCRR values of the speech segment are around 0.14, while those of the music segment mostly fall below 0.11.

Fig. 1. HZCRR curves of speech/music (HZCRR plotted against clip index).
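Equation (6) can be computed from the per-frame ZCR values of a clip, as in this sketch.

```python
import numpy as np

def hzcrr(frame_zcrs):
    """High zero-crossing rate ratio of a clip, following Eq. (6)."""
    z = np.asarray(frame_zcrs)
    # sgn(ZCR(n) - 1.5*avZCR) + 1 equals 2 for frames above 1.5x the average, 0 below
    return float(np.sum(np.sign(z - 1.5 * z.mean()) + 1) / (2.0 * len(z)))
```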
Low frequency energy ratio. In general, there are more silence frames in speech than in music. Therefore, similar to ZCR, we select the variation, instead of the exact value, of the frequency energy as a clip-level feature. Here we use the low frequency energy ratio (LFER) to represent the variation of the frequency energy E. LFER is defined as the ratio of frames whose frequency energy is less than 0.5 times the average frequency energy of the clip:

LFER = \frac{1}{2N} \sum_{n=0}^{N-1} \left[ \mathrm{sgn}\big(0.5\,avE - E(n)\big) + 1 \right]    (7)

where N is the total number of frames, E(n) is the frequency energy of the nth frame, and avE is the average frequency energy of the clip. Fig. 2 shows clearly that the LFER values of speech lie around 0.24 to 0.48, while those of music are mostly less than 0.25. LFER is therefore a good discriminator between speech and music.

Fig. 2. LFER curves of speech/music (LFER plotted against clip index).
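LFER mirrors HZCRR and is just as compact in code; the sketch below follows Eq. (7).

```python
import numpy as np

def lfer(frame_energies):
    """Low frequency energy ratio of a clip, following Eq. (7)."""
    e = np.asarray(frame_energies)
    # sgn(0.5*avE - E(n)) + 1 equals 2 for frames below half the average energy, 0 otherwise
    return float(np.sum(np.sign(0.5 * e.mean() - e) + 1) / (2.0 * len(e)))
```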
Spectrum flux. Spectrum flux (SF) is defined as the average variation of the spectrum between adjacent frames in a given clip:

SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \log A(n,k) - \log A(n-1,k) \right]^2    (8)

where A(n,k) is the discrete Fourier transform of the nth frame of the input signal:

A(n,k) = \sum_{m=-\infty}^{\infty} x(m)\, w(nL - m)\, e^{-j\frac{2\pi}{L}km}    (9)

Here x(m) is the original audio data, w(m) is the window function, L is the window length, K is the order of the DFT, and N is the total number of frames. In our experiments we found that, in general, the SF values of speech are higher than those of music. Fig. 3 shows the SF measured on speech and music; the speech segment runs from 0 to 160 s and the music segment from 161 to 320 s.

Fig. 3. Spectrum flux curve of speech/music (spectrum flux plotted against clip index).

This feature can be used both for speech/music classification and for pure-speech/speech-over-background-sound classification.
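The sketch below computes SF from the frames of one clip; it uses the magnitude of an FFT as A(n,k) and averages over all retained bins, which differs slightly from the (N-1)(K-1) normalization of Eq. (8).

```python
import numpy as np

def spectrum_flux(frames):
    """Spectrum flux of a clip: mean squared difference of the log spectra of
    adjacent frames. `frames` is the (num_frames, frame_len) array of one clip."""
    spectra = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10   # small constant avoids log(0)
    log_spec = np.log(spectra)
    return float((np.diff(log_spec, axis=0) ** 2).mean())
```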
In general, speech signals are composed of voiced and unvoiced sounds alternating at the syllable rate. In our experiments, we found that the variation of pitch frequency for speech is higher than that for music, and that there are many more harmonicity frames in music than in speech; we consider a frame to be a harmonicity frame when its pitch frequency is not 0. We also found that audio segments whose pitch frequencies vary smoothly often occur in music, whereas such "smooth" segments are rare in speech. Therefore, three further features are proposed in our approach: pitch frequency standard deviation, harmonicity ratio, and smooth pitch ratio.

Pitch frequency standard deviation. This feature characterizes the changing range of the pitch frequency.

Harmonicity ratio. Harmonicity ratio (HR) is defined as the ratio of harmonicity frames in a given audio clip.

Smooth pitch ratio. If the pitch frequency of the ith frame is not 0 and the variation of pitch frequency between the ith frame and the (i-1)th frame is lower than a preset threshold, the ith frame is considered a smooth pitch frame. Smooth pitch ratio (SPR) is defined as the ratio of smooth pitch frames to the total number of frames whose pitch frequencies are not 0 in a given clip.
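The three pitch-based clip features can be derived from the per-frame pitch values as sketched below; the smoothness threshold in Hz, and taking the standard deviation over voiced frames only, are our assumptions.

```python
import numpy as np

def pitch_clip_features(frame_pitches, smooth_threshold=10.0):
    """Pitch frequency standard deviation, harmonicity ratio and smooth pitch ratio
    for one clip; `frame_pitches` holds one pitch value per frame (0 = no pitch)."""
    p = np.asarray(frame_pitches, dtype=float)
    voiced = p > 0
    pitch_std = float(p[voiced].std()) if voiced.any() else 0.0
    harmonicity_ratio = float(voiced.mean())
    # a frame is "smooth" if it is voiced and its pitch changes little from the previous frame
    smooth = voiced[1:] & (np.abs(np.diff(p)) < smooth_threshold)
    smooth_pitch_ratio = float(smooth.sum() / voiced.sum()) if voiced.any() else 0.0
    return pitch_std, harmonicity_ratio, smooth_pitch_ratio
```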
After extraction of the audio clip-level features, these features and the MFCCs are concatenated into a combined feature vector. However, the characteristics of the feature components are so different that it is not appropriate simply to stack them; each component should be normalized so that the scales are similar. The normalization is performed as:

x_i' = \frac{x_i - \mu_i}{\sigma_i}    (10)

where x_i is the ith feature component, and the corresponding mean μ_i and standard deviation σ_i are calculated from the ensemble of the training data. The normalized feature vector is taken as the final representation of an audio clip.
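In code, the normalization amounts to fitting per-component statistics on the training vectors and applying Eq. (10); the small constant guarding against constant components is our addition.

```python
import numpy as np

def fit_normalizer(train_vectors):
    """Per-component mean and standard deviation from the training ensemble."""
    mu = train_vectors.mean(axis=0)
    sigma = train_vectors.std(axis=0) + 1e-10   # guard against constant components
    return mu, sigma

def normalize(vec, mu, sigma):
    """Apply Eq. (10) to one clip-level feature vector."""
    return (vec - mu) / sigma
```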
feature vector. But the characteristics of the feature In our work, an audio clip is classified into five
components are so different that it is not appropriate to just classes: silence, noise, music, pure speech and speech over
put these features into a feature vector. Each feature background sound. Silence and noise are discriminated
component should be normalized to make their scale from the input audio depending on the energy and ZCR
similar. The normalization is processed as: information. This step is based on a rule-based classifier.
The rules discriminating silence are defined as:
(x i − µ i )
x i' = ⑽ avZCR < ZCR threshold ; avE < E threshold
σi

Where avZCR is mean of ZCR values of total frames in a


Where x i is ith feature component, the corresponding
given clip, corresponding avE is mean of energy values. A
mean µ i and standard derivation σ i can be calculated
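A minimal sketch of the kernel form of the decision function in Eq. (11) is shown below, with the inner product replaced by the Gaussian radial basis kernel; the support vectors, weights and bias are assumed to come from a trained model.

```python
import numpy as np

def rbf_kernel(x, y, sigma=10.0):
    """Gaussian radial basis kernel used in place of the inner product."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, sigma=10.0):
    """Kernel form of the decision function in Eq. (11)."""
    s = sum(a * y * rbf_kernel(sv, x, sigma)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return int(np.sign(s - b))
```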
3.2 Classification scheme

In our work, an audio clip is classified into one of five classes: silence, noise, music, pure speech and speech over background sound. Silence and noise are first discriminated from the input audio using the energy and ZCR information; this step uses a rule-based classifier. The rules for discriminating silence are:

avZCR < ZCR_threshold and avE < E_threshold

where avZCR is the mean ZCR over all frames of the clip and avE is the corresponding mean energy. A clip is marked as silence if both its energy and its ZCR are below the predefined thresholds.

We consider a clip to be noise (only broad-band noise is treated here) when it does not contain any semantic content that we can understand. The ZCR of noise is high because the energy of its high-frequency components is high, while the variation of noise energy over time is low. We therefore propose the following rules for discriminating noise:

ZCR_i > ZCR_threshold for all 0 ≤ i < N, and σ_Energy < σ_threshold

where N is the total number of frames in the clip, ZCR_i is the ZCR of the ith frame, and σ_Energy is the energy deviation of the clip. A clip is considered noise when both rules are met. Then, for the remaining non-silence and non-noise clips, SVMs with a Gaussian radial basis kernel (σ = 10) are used to classify the remaining three classes. Because an SVM is a two-class classifier, we propose a two-tier system of SVMs: SVM1 discriminates between music and speech, and the "speech" clips are then classified into pure speech and speech over background sound by SVM2.
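The overall scheme can be sketched as below: rule-based screening for silence and noise followed by the two-tier SVM cascade. The threshold values, the field names of the per-clip feature record and the use of scikit-learn's SVC (whose gamma = 1/(2σ²) approximates the paper's σ = 10) are all our own assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def classify_clip(clip, svm1, svm2, zcr_thr=0.05, energy_thr=-2.0, sigma_thr=0.5):
    """clip: dict with per-frame 'zcr' and 'energy' arrays and the normalized
    clip-level feature vector 'vec' (hypothetical field names)."""
    zcr, energy = np.asarray(clip['zcr']), np.asarray(clip['energy'])
    if zcr.mean() < zcr_thr and energy.mean() < energy_thr:
        return 'silence'                      # rule: low average ZCR and low average energy
    if np.all(zcr > zcr_thr) and energy.std() < sigma_thr:
        return 'noise'                        # rule: uniformly high ZCR, low energy variation
    vec = clip['vec'].reshape(1, -1)
    if svm1.predict(vec)[0] == 1:             # tier 1: music vs. speech
        return 'music'
    if svm2.predict(vec)[0] == 1:             # tier 2: pure speech vs. speech over background
        return 'pure speech'
    return 'speech over background sound'

# Training the two tiers on the labeled clip vectors (illustrative):
# gamma = 1.0 / (2 * 10.0 ** 2)
# svm1 = SVC(kernel='rbf', gamma=gamma).fit(X_music_speech, y_music_speech)
# svm2 = SVC(kernel='rbf', gamma=gamma).fit(X_speech_only, y_pure_vs_background)
```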
4 Experiment and analysis

The database used in our experiments was collected from CCTV-5 news programs and music CDs. It is composed of 18000 audio clips, 300 minutes in total, with each clip labeled with one of the predefined five classes: 1584 silence clips, 438 noise clips, 7440 music clips, 5178 pure speech clips and 3360 speech over background sound clips. The audio data are all mono channel, 16 bits per sample, sampled at 22.05 kHz and stored in "wav" format. The database is partitioned into a training set of 2 hours and a testing set of 3 hours. Classification accuracy is measured as the rate of correctly classified samples on the testing set.
4.1 Classification for silence and noise

In our method, silence and noise are discriminated from the input audio by the rule-based method using ZCR and energy information. The results are listed in Table 1.

Table 1. Experiment results of classification for silence/noise

          Right   Error   Accuracy
Silence   1524    141     96.21%
Noise     282     228     64.38%

From Table 1 it can be seen that the rule-based method performs very well for silence discrimination. Some errors occur because silence and other audio types can be present in the same clip; this problem can be mitigated by adjusting the threshold. At the same time, Table 1 shows that the accuracy for noise is low, only 64.38%. The reason is that different noise types occur in different audio types, and the simple rules of the rule-based method are not accurate for all of them; the rule-based method is therefore not well suited to noise discrimination.

4.2 Classification for music, pure speech and speech over background sound

Two kernel SVMs with Gaussian radial basis kernels (σ = 10) are trained on the training set for discriminating music/speech and pure speech/speech over background sound. Table 2 shows the results of these experiments.

Table 2. Experiment results for the two classification tasks

Classification                               Accuracy
Music / Speech                               94.5%
Pure speech / Speech over background sound   91.2%

Table 2 shows that each discriminator achieves high accuracy. This indicates that the features proposed above are very effective for audio classification and that the SVM-based approach performs well.

4.3 Analysis of feature effectiveness

In our work, some new features are used. To show the effectiveness of both the common and the new features, we first implement a base feature set (B set), which includes ZCR, energy, sub-band energy ratio, bandwidth and brightness. Silence ratio, HZCRR, LFER and spectrum flux characterize the properties that derive from the different structures of music and speech, so these features compose a new feature set called the structure feature set (S set). Based on the analysis of pitch frequency, the three new features pitch frequency standard deviation, harmonicity ratio and smooth pitch ratio compose another new feature set called the pitch feature set (P set). Classification results using the new feature sets are compared with those using the base feature set to show the effectiveness of each set. The results are listed in Table 3; the numbers in parentheses are the improvements over the base classification results.
Table 3. Effectiveness of different feature sets

Accuracy (%)                                 B set   B set + P set   B set + S set   B set + All
Music / Speech                               86.3    90.7 (+4.4)     92.4 (+6.1)     94.5 (+10.0)
Pure speech / Speech over background sound   85.1    89.4 (+4.3)     87.3 (+2.2)     91.2 (+5.1)

From Table 3, it can be seen that the base feature set already has some degree of effectiveness. After adding the P set, the performance of the two classifiers improves considerably, by 4.4% and 4.3% respectively. After adding the S set, the performance on music/speech classification improves markedly, by 6.1%; the reason is that the structures of music and speech differ greatly. Using all features, the two classifiers work very well, with accuracies of 94.5% and 91.2%, respectively. These results show that the features we proposed are rational and effective.
In many related studies, different numbers of MFCCs are used. We therefore analyze the effect of the MFCC order: in our experiments, 12-, 20-, 30- and 40-order MFCCs are computed, and the MFCCs of each order are combined with the other features to build different classifiers. The experiment results are listed in Table 4.

Table 4. Experiment results for different MFCC orders

Classifier                                   MFCC 12   MFCC 20   MFCC 30   MFCC 40
Music / Speech                               94.5      95.1      96.5      96.8
Pure speech / Speech over background sound   91.2      91.7      92.2      92.9

From Table 4, it can be seen that the accuracy of music/speech discrimination increases with the MFCC order. The reason is that higher-order MFCCs characterize the harmonicity of audio signals better, and harmonicity is a very effective feature for discriminating between music and speech.

5 Conclusion

Feature analysis and extraction are the foundation of audio automatic classification, which is of critical importance in audio indexing, retrieval and video content analysis. We have analyzed in detail the discriminating features for five audio classes and proposed a new set of features for the representation of audio streams, including silence ratio, pitch frequency standard deviation, harmonicity ratio and smooth pitch ratio. We have also presented an SVM-based approach to classification. The effectiveness of the features is evaluated in experiments, and the results show that the features we selected and proposed are rational and effective.

As for future directions, we will look for discriminating features for noise in different environments in order to increase the accuracy of noise discrimination. We will also improve our classification scheme to discriminate more audio classes and find new features that can characterize them.

References

[1] J. Foote. Content-based retrieval of music and audio. In: C.C.J. Kuo et al. (eds.), Multimedia Storage and Archiving Systems II, Proc. of SPIE, vol. 3229, pp. 138-147, 1997.

[2] J. Foote. An overview of audio information retrieval. ACM-Springer Multimedia Systems, 1998.

[3] S. Pfeiffer, S. Fischer, W. Effelsberg. Automatic audio content analysis. Proc. of the 4th ACM International Conference on Multimedia, pp. 21-30, 1996.

[4] J. Saunders. Real-time discrimination of broadcast speech/music. Proc. of ICASSP '96, Vol. II, pp. 993-996, Atlanta, May 1996.

[5] E. Scheirer, M. Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. Proc. of ICASSP '97, Vol. II, pp. 1331-1334, April 1997.

[6] Z. Liu, J. Huang, Y. Wang, T. Chen. Audio feature extraction and analysis for scene classification. IEEE Signal Processing Society 1997 Workshop on Multimedia Signal Processing, 1997.

[7] T. Zhang, C.-C. Jay Kuo. Heuristic approach for generic audio data segmentation and annotation. In: Proceedings of the 7th ACM International Conference on Multimedia, Orlando, 1999, pp. 67-76.

[8] S. Srinivasan, D. Petkovic, D. Ponceleon. Towards robust features for classifying audio in the CueVideo system. In: Proceedings of the 7th ACM International Conference on Multimedia, Orlando, 1999, pp. 393-400.

[9] L. Lu, H. Jiang, H.-J. Zhang. A robust audio classification and segmentation method. Proc. of the 9th ACM International Conference on Multimedia, pp. 203-211, 2001.

[10] S.Z. Li. Content-based classification and retrieval of audio using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, September 2000.

[11] P.J. Moreno, R. Rifkin. Using the Fisher kernel method for web audio classification. Proc. of ICASSP 2000, Vol. IV, pp. 2417-2440, June 2000.

[12] V. Vapnik. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[13] C. Cortes, V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
