Feature Analysis and Extraction for Automatic Audio Classification
Bai Liang, Hu Yanli, Lao Songyang, Chen Jianyun, Wu Lingda
Multimedia Research & Development Center
National University of Defense Technology
Changsha 410073, P. R. China
xabpz@163.com
Abstract - Feature analysis and extraction are the foundation of automatic audio classification. This paper divides audio streams into five classes: silence, noise, pure speech, speech over background sound and music. We present our work on audio feature analysis and extraction at the frame level and the clip level. Four new features are proposed: silence ratio, pitch frequency standard deviation, harmonicity ratio and smooth pitch ratio. We also present an SVM-based approach to classification. The effectiveness of the features is evaluated in experiments, and the results show that the features we selected and proposed are rational and effective.

Keywords: Feature analysis and extraction, content-based audio classification, support vector machines.
1 Introduction

Automatic audio classification can provide useful information for both audio content understanding and video content analysis [1,2], and it is of critical importance in audio indexing and retrieval. Feature analysis and extraction are the foundational step of automatic audio classification.

Many studies on audio classification employ different features and methods. Pfeiffer et al. [3] presented a theoretical framework and an application of automatic audio content analysis using perceptual features. Saunders [4] presented a speech/music classifier for radio broadcast based on simple features such as zero-crossing rate and short-time energy. Scheirer et al. [5] introduced more features for audio classification and performed experiments with different classification models. Liu et al. [6] presented a set of low-level audio features for characterizing the semantic content of short audio clips, in order to use the associated audio information for video scene analysis. However, in spite of these research efforts, high-accuracy audio classification has only been achieved for simple cases such as speech/music discrimination; methods based on such simple features do not work well beyond them.

Other research efforts have focused on classification algorithms that discriminate more classes. Zhang and Kuo [7] introduced a heuristic-based model with many features, including pitch-tracking methods, to discriminate audio recordings into more classes; an accuracy above 90% is reported. Srinivasan et al. [8] try to detect and classify audio that consists of mixed classes, such as combinations of speech and music together with environment sound, and report a classification accuracy of over 80%. In [9], an audio classification algorithm is presented in which audio is discriminated with a one-second window, and an accuracy above 96% is reported. However, these works rely on rule-based classifiers, which require threshold setting. Such classifiers can only characterize static properties of audio, and the thresholds are very difficult to select because they must be adjusted for different circumstances. This makes the rule-based approach not general enough to fit different applications.

In view of the analysis above, in this paper audio is divided into five types: silence, noise, music, pure speech and speech over background sound. Discriminating features of these audio types are analyzed at the frame level and the one-second clip level. A set of features, including some new features that can further improve classification performance, is proposed. An SVM-based classifier is also presented, and both the effectiveness of the proposed features and the performance of SVM on audio classification are evaluated with it.

The rest of this paper is organized as follows. Section 2 analyzes discriminating features of different audio types and describes how an audio clip is represented by a low-level perceptual and spectral feature set. Section 3 gives an overview of linear and kernel SVMs and describes how the SVM-based classifier is designed. In Section 4, experiments and evaluations on a 5-hour database are presented. Section 5 draws conclusions and proposes directions for future research.

2 Audio Feature Analysis

In order to obtain high accuracy for audio classification, it is critical to select good features that capture the temporal and spectral characteristics of audio signals and are robust to changing circumstances. In our work, features are analyzed and extracted at two levels: the frame level and the one-second clip level. The features extracted from one clip are combined into one feature vector after normalization.
Before feature extraction, an audio signal is converted into a common format ("wav" format is used in this paper): 22.050 kHz sampling rate, 16-bit, mono. It is then pre-emphasized with parameter 0.98 or 0.97 to equalize the inherent spectral tilt and divided into one-second non-overlapping clips; a clip is used as the classification unit. Each clip is further divided into half-overlapping 23 ms frames by means of a Hamming window.
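As a rough illustration of this preprocessing chain, the sketch below loads a file as 22.05 kHz mono, applies pre-emphasis with coefficient 0.98, and cuts the signal into one-second clips of half-overlapping 23 ms Hamming-windowed frames. It is only a sketch of our reading of the text; the use of librosa for decoding, the function names and the exact frame lengths in samples are our own assumptions.

import numpy as np
import librosa  # assumed here only for decoding/resampling; any WAV reader would do

SR = 22050                   # 22.050 kHz mono, as described in the text
CLIP_LEN = SR                # one-second, non-overlapping clips
FRAME_LEN = int(0.023 * SR)  # 23 ms frames (about 507 samples)
HOP = FRAME_LEN // 2         # half-overlapping frames

def load_and_preemphasize(path, coef=0.98):
    """Load audio as 22.05 kHz mono and apply pre-emphasis y[n] - coef * y[n-1]."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    return np.append(y[0], y[1:] - coef * y[:-1])

def split_into_clips(y):
    """Cut the signal into one-second non-overlapping clips (classification units)."""
    n_clips = len(y) // CLIP_LEN
    return [y[i * CLIP_LEN:(i + 1) * CLIP_LEN] for i in range(n_clips)]

def frame_clip(clip):
    """Divide a clip into half-overlapping 23 ms Hamming-windowed frames."""
    window = np.hamming(FRAME_LEN)
    frames = [clip[start:start + FRAME_LEN] * window
              for start in range(0, len(clip) - FRAME_LEN + 1, HOP)]
    return np.array(frames)  # shape: (n_frames, FRAME_LEN)

# Example usage (hypothetical file name):
# clips = split_into_clips(load_and_preemphasize("example.wav"))
# frames = frame_clip(clips[0])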
2.1 Audio frame-level features

In our method, the frame-level features extracted include 12-order MFCCs, frequency energy, zero-crossing rate (ZCR), sub-band energy, brightness, bandwidth and pitch frequency. The definitions and calculation methods of these features are given below.

Mel-frequency cepstral coefficients. First, the audio is Hamming-windowed in overlapping steps. For each window, the log of the power spectrum is computed using a discrete Fourier transform (DFT). The log spectral coefficients are perceptually weighted by a non-linear map of the frequency scale called Mel-scaling. The final stage is to further transform the Mel-weighted spectrum (using a discrete cosine transform) into the cepstral coefficients.

[…] discriminator for speech and music. These two features are also commonly used, as in [4,5,6,7,8].

Zero-crossing rate. ZCR counts the times that an audio waveform crosses the zero axis within a frame. It is a simple measure of the frequency content of a signal:

    ZCR = \frac{1}{2(N-1)} \sum_{m=1}^{N-1} \left| \mathrm{sgn}[x(m+1)] - \mathrm{sgn}[x(m)] \right|    (3)

where sgn[·] is the sign function, x(m) is the discrete audio signal and N is the frame length in samples. In general, speech signals are composed of alternating voiced and unvoiced sounds at the syllable rate, while music signals do not have this kind of structure. Hence, the variation of ZCR for speech is in general greater than that for music. Considering this, many systems have used ZCR for audio classification [4,7,8,9].

Brightness and Bandwidth. The brightness is the frequency centroid of the spectrum in a frame. It can be defined as:

    \omega_c = \frac{\int_0^{\omega_0} \omega \, |F(\omega)|^2 \, d\omega}{\int_0^{\omega_0} |F(\omega)|^2 \, d\omega}

where F(ω) is the Fourier transform of the frame signal. […]
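To make the frame-level definitions concrete, the following sketch computes eq. (3) and the brightness (the spectral centroid defined above) for the Hamming-windowed frames produced earlier, and obtains 12-order MFCCs through librosa. The librosa call is only a convenient stand-in whose internal framing differs from the 23 ms scheme described here, so treat it, together with the function names, as our assumption rather than the exact pipeline of the paper.

import numpy as np
import librosa

def zero_crossing_rate(frame):
    """Eq. (3): ZCR = 1/(2(N-1)) * sum over m of |sgn(x[m+1]) - sgn(x[m])|."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2.0 * (len(frame) - 1))

def brightness(frame, sr=22050):
    """Frequency centroid of the frame spectrum, weighted by |F(w)|^2."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * power) / (np.sum(power) + 1e-12)

def mfcc_12(clip, sr=22050):
    """12-order MFCCs for a clip; librosa stands in for the DFT -> Mel-scaling ->
    DCT chain described in the text (its default framing differs slightly)."""
    return librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=12)

# Per-frame features for one clip:
# zcrs = np.array([zero_crossing_rate(f) for f in frames])
# brights = np.array([brightness(f) for f in frames])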
2.2 Audio clip-level features

High zero-crossing rate ratio. The high zero-crossing rate ratio (HZCRR) is defined as the ratio of frames whose ZCR is above 1.5 times the average ZCR in a given clip:

    HZCRR = \frac{1}{2N} \sum_{n=0}^{N-1} \left[ \mathrm{sgn}\big(ZCR(n) - 1.5\,avZCR\big) + 1 \right]    (6)

where n is the frame index, ZCR(n) is the ZCR of the n-th frame, N is the total number of frames in the clip, and avZCR is the average ZCR of the clip. Fig. 1 illustrates the HZCRR curves for speech and music signals: the HZCRR values of speech segments are around 0.14, while those of music segments mostly fall below 0.11.

Fig.1. HZCRR curves of speech/music

Fig.2. LFER curves of speech/music
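A minimal sketch of eq. (6), assuming the per-frame ZCR values computed with the earlier frame-level sketch; the function name is ours.

import numpy as np

def hzcrr(zcrs):
    """Eq. (6): fraction of frames whose ZCR exceeds 1.5x the clip's average ZCR.
    sgn(.) + 1 equals 2 above the threshold and 0 below it, hence the 1/(2N) factor."""
    zcrs = np.asarray(zcrs, dtype=float)
    av = zcrs.mean()
    return np.sum(np.sign(zcrs - 1.5 * av) + 1.0) / (2.0 * len(zcrs))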
Spectrum flux. Spectrum flux (SF) is defined as the average variation of the spectrum between two adjacent frames in a given clip:

    SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \log A(n,k) - \log A(n-1,k) \right]^2    (8)

    A(n,k) = \left| \sum_{m=-\infty}^{\infty} x(m)\, w(nL - m)\, e^{-j\frac{2\pi}{L}km} \right|    (9)

where A(n,k) is the magnitude of the discrete Fourier transform of the n-th frame of the input signal x(m), w(m) is the window function, L is the window length, N is the total number of frames and K is the order of the DFT.
Fig.3. Spectrum flux curve of speech/music
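The sketch below follows eqs. (8) and (9) using the Hamming-windowed frames of a clip. The small constant added before the logarithm is our own guard against log(0), and averaging over all frequency bins of the adjacent-frame differences is our reading of eq. (8), not a detail stated in the paper.

import numpy as np

def spectrum_flux(frames, eps=1e-10):
    """Eqs. (8)-(9): average squared difference of the log magnitude spectra
    of adjacent frames in a clip. `eps` avoids log(0) and is our own addition."""
    mags = np.abs(np.fft.rfft(frames, axis=1))      # A(n, k): one row per frame
    log_diff = np.diff(np.log(mags + eps), axis=0)  # log A(n,k) - log A(n-1,k)
    return float(np.mean(log_diff ** 2))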
In general, speech signals are composed of alternating voiced and unvoiced sounds at the syllable rate. In our experiments, we found that the variation of pitch frequency for speech is higher than that for music […]. Based on this, three pitch-based clip-level features are extracted: pitch frequency standard deviation, harmonicity ratio and smooth pitch ratio.
Pitch frequency standard deviation. This feature characterizes the changing range of the pitch frequency within a clip.

Harmonicity ratio. The harmonicity ratio (HR) is defined as the ratio of harmonic frames in a given audio clip.

Smooth pitch ratio. If the pitch frequency of the i-th frame is non-zero and the variation of pitch frequency between the i-th frame and the (i-1)-th frame is lower than a preset threshold, the i-th frame is considered a smooth-pitch frame. The smooth pitch ratio (SPR) is defined as the ratio of smooth-pitch frames to the total number of frames with non-zero pitch frequency in a given clip.
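A minimal sketch of the three pitch-based clip features, assuming a per-frame pitch track (in Hz) from any pitch estimator, with 0 marking frames where no pitch is detected. Treating non-zero pitch as the harmonic-frame criterion and the 10 Hz smoothness threshold are our assumptions; the paper's exact criteria are not reproduced here.

import numpy as np

def pitch_clip_features(pitch, smooth_thresh=10.0):
    """Clip-level pitch features from a per-frame pitch track (Hz), 0 = unvoiced."""
    pitch = np.asarray(pitch, dtype=float)
    voiced = pitch > 0

    # Pitch frequency standard deviation: changing range of pitch over the clip.
    pstd = pitch[voiced].std() if voiced.any() else 0.0

    # Harmonicity ratio: fraction of (assumed) harmonic frames in the clip.
    hr = voiced.mean()

    # Smooth pitch ratio: among frames with non-zero pitch, the fraction whose
    # pitch differs from the previous frame's by less than the preset threshold.
    smooth = 0
    for i in range(1, len(pitch)):
        if pitch[i] > 0 and abs(pitch[i] - pitch[i - 1]) < smooth_thresh:
            smooth += 1
    spr = smooth / max(voiced.sum(), 1)

    return pstd, hr, spr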
3.1 Overview of SVM

This section introduces some basic concepts of the support vector machine (SVM); more details are offered in [12,13]. An SVM learns an optimal separating hyperplane from a given set of positive and negative examples. It minimizes the structural risk, in contrast to traditional pattern recognition techniques that minimize the empirical risk, i.e., that optimize performance on the training data. An SVM can be either linear or nonlinear: the former is used in the linearly separable case, and the latter in the linearly non-separable but nonlinearly separable case.

Given a set of training vectors \{x_i\}_{i=1}^{l} and their labels y_i \in \{+1, -1\}, training yields an optimal decision function:

    f(x) = \mathrm{sign}\left( \sum_{i=1}^{N_{sv}} y_i \alpha_i \, x_i^{sv} \cdot x - b \right)    (11)

where \alpha_i and b are the parameters of the classifier, and the solution vectors x_i^{sv}, for which \alpha_i is non-zero, are called the support vectors. In the linearly non-separable but nonlinearly separable case, the SVM replaces the inner product with a kernel function K(x, y) and then constructs an optimal separating hyperplane in the mapped space. According to Mercer's theorem [12], the kernel function implicitly maps the input vectors into a high-dimensional feature space in which the mapped data are linearly separable. In our method, we use the Gaussian radial basis function (RBF) kernel.
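As a sketch of how such a classifier can be assembled, the snippet below trains an RBF-kernel SVM on normalized clip-level feature vectors using scikit-learn. The one-vs-rest handling of the five classes, the feature scaling step and the C/gamma values are illustrative assumptions, not settings reported in the paper.

import numpy as np  # only needed for the toy example at the bottom
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(X, y, C=10.0, gamma="scale"):
    """Gaussian (RBF) kernel SVM over clip-level feature vectors X with labels y
    (e.g. silence, noise, music, pure speech, speech over background sound)."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
    model.fit(X, y)
    return model

# Example with random stand-in data (5 classes, 20-dimensional clip vectors):
# X = np.random.rand(200, 20); y = np.random.randint(0, 5, 200)
# clf = train_svm(X, y); predictions = clf.predict(X)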
References
[1] J. Foote. Content-based retrieval of music and audio. In: C.-C. J. Kuo et al. (eds.), Multimedia Storage and Archiving Systems II, Proc. of SPIE, vol. 3229, pp. 138-147, 1997.