Weimin Huang, Tuan Kiang Chiew, Haizhou Li, Tian Shiang Kok, Jit Biswas
Institute for Infocomm Research, Singapore 138632
{wmhuang, chiewtk, hli, tskok, biswas}@i2r.a-star.edu.sg
Abstract— Audio signal is an important cue for situation awareness, providing information complementary to the video signal. In home care, elder care, and security applications, screaming is one of the events people (family members, caregivers, and security guards) are especially interested in. We present an approach to scream detection that uses both analytic and statistical features for classification. Among audio features, sound energy is useful for detecting scream-like audio; we adopt the log energy to capture the energy continuity of the audio, since a scream often lasts longer than many other sounds. Further, a robust high-pitch detection based on autocorrelation is presented to extract the highest pitch for each frame, followed by a pitch analysis over a time window containing multiple frames. Finally, to validate the scream-like sound, an SVM-based classifier is applied to a feature vector generated from the MFCCs across a window of frames. Experiments on screaming detection are discussed, with promising results shown. The algorithm is ported to and run on a Linux-based set-top box connected to a microphone array that captures the audio for live scream detection.

Keywords: screaming detection; pitch detection; SVM learning; monitoring; home applications; eldercare; security

I. INTRODUCTION

Audio-visual monitoring plays an important role in situation awareness [1,2]. In home and many other applications, audio is sometimes the dominant or the only accessible signal indicating what happens [3-7], such as screaming, yelling, crying, gunshots, explosions, and doors opening/closing. Of these audio events, screaming/yelling is one of the signals of specific concern in home care and eldercare, for security and safety reasons.

Early works related to audio activity/background recognition started with auditory scene analysis, which recognizes an environment using audio information [8,9]. Besides continuous effort on auditory scene recognition [10], some researchers have worked on the efficient indexing and retrieval of audio data, given the large amount of music, speech and other sound clips available today, so that humans can browse the data easily [11,12,19]; audio classification of speech and music genre is one of the main focuses there [13,14].

Recently, more researchers have been attracted to the study of detecting specific audio events in sound [4,6,15-18] for different applications, of which gunshot and scream have been studied by energy surge and audio feature analysis. Other works [20-22] have studied stress detection from human speech, while some researchers have worked out approaches to general sound classification [7,14,23].

In this paper, we focus on the detection of one specific audio event, screaming, which usually represents urgent or serious cases in home care and elderly care. One example is the measurement of verbal aggression of dementia patients by detecting the duration and frequency of screaming. In this work, we propose a new method for human screaming detection, based on perceptual analysis of the scream sound. An energy analysis is first introduced to detect abnormal sound, followed by a pitch extraction to pick up the scream-like audio. An example-based learning classifier (Support Vector Machine) is adopted to further validate the two audio classes (scream/non-scream), based on spectrum features over a sliding window of frames.

A. Related works

The works most related to scream detection are presented in [4-7,18,19]. In [18], the system detects and locates gunshots from a distance. The features used to detect the shock wave (usually generated by a gunshot or other supersonic projectile) are those describing the N-shaped wave, such as pulse width, peak waveform amplitude, ratio of the positive peak to the negative peak of the N of the gunshot pulse, and slope. Periodicity of the wave is detected as one feature of a gunshot. Dynamic synapse neural networks are then used to classify gunshots (12 guns against other sounds treated as noise) with input from a wavelet decomposition of the wave. Generally, [18] relies on gunshot features, which are not directly usable for scream detection. Rabaoui et al. [7] use a one-class SVM (support vector machine) to characterize each class of sound and compare the distance to each class to decide the class of a sound. It examines a large set of common audio features such as Zero-Crossing Rate, Spectral Centroid, Spectral Roll-off point, MFCCs, LPCCs, PLP and Wavelet features. According to their experiments, the 1st and 2nd derivatives do not help the performance much, which is contrary to the observation in [6] for gunshot features and to our observation for scream features. In [6], a Gaussian Mixture Model (GMM) is used for the classification of scream, gunshot and other sounds, with a feature selection process used to select the 'optimal' features. However, this is controversial, for the MFCC-derivative features discarded in [6] are found useful in our study for scream detection. In [19], a simple energy feature of the audio is used to detect action scenes, based on the assumption that there will be high energy in the audio of an action scene. However, in a live monitoring system, the energy
978-1-4244-5046-6/10/$26.00 ©2010 IEEE
power of sound decays as the distance between the sensor and the sound source increases [16]. GMM and HMM are applied in [4] to classify explosion, gunshot and screaming, with MFCC and MPEG-7 features such as Spectrum Flatness, Waveform and Fundamental Frequency. Similar to [6], a GMM is applied to model low-level MFCC features for abnormal sound detection in [5].

B. Contribution and organization

Based on the perceptual features, a new approach to scream detection is designed in this paper. It is based on the proposed energy envelope to locate the continuous sound, the ratio of the highest pitch of consecutive frames, extracted using autocorrelation, to measure the 'high' pitch of a screaming sound, and the compact representation of MFCCs in a sliding window for SVM learning.

The paper is organized as follows. Section II gives a brief description of the approaches in related works. Section III presents the features we use. Section IV proposes the classifier and the system diagram for screaming detection. The experiments and system setup are shown in Section V, followed by the conclusion and future works in Section VI.

II. FEATURE DESCRIPTION

There are many features used for audio analysis. Perceptually, we studied some of them to represent the properties of scream sounds. One is the energy feature, which represents the sound power. However, as indicated in [16], the energy power by itself is not reliable, because the signal decays as the distance between the sound source and the microphone increases. Although the sound power may not be high enough to be a reliable feature on its own, compared to many other sounds such as normal speech, door knocking, or individual laughing, a scream usually presents as a sound segment with continuous and relatively high energy. This feature enables us to distinguish screams from many non-scream sounds.

For convenience, in a live scream detection system we can fix the sampling frequency of the audio signal; here we set the sampling rate at Fs = 11k samples per second. With the given Fs, we use different frame lengths to compute the audio features in the following sections. For the energy calculation, around 20ms of audio is used as a frame fe, which is 256 samples xi, i = 1,…,256. Experiments show that this allows the short stop between two louder sounds, or the breath between voice segments, to be detected reliably. The log energy of a 20ms frame is computed as

E = 10 log( Σ_{i=1,…,256} xi^2 )    (1)

Let a segment of audio signal S(t) be composed of overlapped frames …, fe(k-1), fe(k). There is a 128-sample overlap between any two consecutive frames fe(k-1) and fe(k). To detect a segment S(t) of continuous sound up to time t, we use the energy measurement, checking whether the energy is larger than a threshold T within the segment:

S(t) = { fe(k) | E(k) > T, any k }    (2)

The duration of S(t) is written as |S(t)|. In the experiment setting, we observed that most screams last more than 0.5 second, which can be adjusted accordingly later. Based on the probability of the duration of screaming in the audio corpus, and considering the beginning and ending of a scream, we set |S(t)| > 0.3 second, which keeps 99% of the screaming clips detected using the duration alone. If the duration threshold is too short, much speech could be detected as scream due to the similarity of the sound 'ah'; if it is too long, many scream segments will be filtered out.

An example of the energy envelope (continuity) is shown in Figure 1 below. The figure on top is a sound wave consisting of different types of sound, such as speech, scream, (glass) breaking, explosion, etc. The figure at the bottom is the
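As a minimal sketch, the energy-based segmentation of equations (1) and (2) with the |S(t)| > 0.3s duration test can be written as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes mono samples normalized to [-1, 1], and the threshold value T is left as a free parameter since the paper does not give one here.

```python
import math

FS = 11000          # sampling rate (Hz), as fixed in the text
FRAME = 256         # ~20 ms frame length in samples
HOP = 128           # consecutive frames overlap by 128 samples

def log_energy(frame):
    """Eq. (1): E = 10*log10(sum of squared samples in the frame)."""
    s = sum(x * x for x in frame)
    return 10.0 * math.log10(s) if s > 0 else float("-inf")

def continuous_segments(samples, threshold):
    """Eq. (2): group consecutive frames with E(k) > T into segments,
    returned as (start_time, end_time) pairs in seconds."""
    segments, start = [], None
    n_frames = (len(samples) - FRAME) // HOP + 1
    for k in range(n_frames):
        frame = samples[k * HOP:k * HOP + FRAME]
        if log_energy(frame) > threshold:
            if start is None:
                start = k * HOP / FS      # segment opens at this frame
        elif start is not None:
            segments.append((start, k * HOP / FS))
            start = None
    if start is not None:                 # segment still open at the end
        segments.append((start, len(samples) / FS))
    return segments

def scream_candidates(samples, threshold, min_dur=0.3):
    """Keep only segments with duration |S(t)| > 0.3 s, per the text."""
    return [(a, b) for a, b in continuous_segments(samples, threshold)
            if b - a > min_dur]
```

For example, a 0.6-second burst of non-zero samples surrounded by silence yields exactly one candidate segment, while isolated door knocks shorter than 0.3s would be rejected by the duration test.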
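The autocorrelation-based pitch extraction mentioned in the contribution is detailed in a later section not reproduced here. Purely as an illustrative sketch, a per-frame pitch estimate by picking the strongest normalized autocorrelation peak could look like the following; the frame size and the fmin/fmax search range are our assumptions, not the paper's parameters.

```python
import math

FS = 11000  # sampling rate (Hz), as fixed in the text

def autocorr_pitch(frame, fmin=80.0, fmax=2000.0):
    """Estimate the dominant pitch (Hz) of one frame by finding the lag
    with the maximum normalized autocorrelation in [FS/fmax, FS/fmin]."""
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0                       # silent frame: no pitch
    lag_min = max(1, int(FS / fmax))     # shortest period searched
    lag_max = min(n - 1, int(FS / fmin)) # longest period searched
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_r, best_lag = r, lag
    return FS / best_lag if best_lag else 0.0
```

Per the abstract, the per-frame pitches would then be analyzed over a window of multiple frames; that windowed analysis is omitted from this sketch.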
IV. EXPERIMENTS
Figure 3. System diagram of live scream detection
A. Experiment setting
To validate the results, we collected an audio corpus from our own live recordings, the internet, and some movies. The training set of screams contains 26 different scream clips, with durations from 1.0s to 10.88s. The testing set contains 56 clips, with durations from 0.34s to 9.4s. Within each clip there can be more than one scream segment. The non-screaming sounds, including speech (female, male), laughing, crying, applause, clapping, knocking, glass breaking, etc., total 49 clips for training, from 0.5s to 51.99s. The test set for non-screaming audio has 271 clips of speech, crying, breaking, applause, knocking, laughing, etc., with durations from 0.31s to 51.23s.

For simplicity, the performance is measured on a per-clip basis. If a scream is detected at any time in a non-scream clip, we call that one false positive (FP: false alarm). If no scream is detected at any time in a scream clip, we call that one miss (false negative).

The 36x20 MFCCs can be used directly for training an SVM. One issue is the training time, given the size of the feature vectors; it also causes over-fitting when the number of training samples is not big enough. We also extracted and tested statistics of the MFCCs across 20 frames, obtaining the mean, minimum, maximum and standard deviation of each MFCC coefficient. In this way, the features are further reduced to 36x4. The new representation of the compact features is shown in Figure 4.

Figure 4. Compact feature representation: the 36 MFCCs of each frame are summarized across the window by their mean, minimum, maximum and standard deviation.
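The compaction step described above — reducing the 36 MFCCs over a 20-frame window to per-coefficient mean, minimum, maximum and standard deviation — can be sketched as follows. This is an illustration only; it assumes the per-frame MFCCs are already computed and available as a 20x36 list of lists, and it uses the population standard deviation since the paper does not specify which form is used.

```python
import math

N_COEFF = 36   # MFCCs per frame, as in the text
N_FRAMES = 20  # frames per sliding window

def compact_features(window):
    """Reduce a window of per-frame MFCCs (N_FRAMES x N_COEFF) to the
    mean, min, max and std of each coefficient, i.e. N_COEFF x 4 values,
    returned as one flat feature vector of length N_COEFF * 4 = 144."""
    assert len(window) == N_FRAMES
    assert all(len(frame) == N_COEFF for frame in window)
    features = []
    for j in range(N_COEFF):
        col = [frame[j] for frame in window]      # coefficient j over time
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        features.extend([mean, min(col), max(col), math.sqrt(var)])
    return features
```

The resulting 144-dimensional vectors are what would be fed to the SVM in place of the full 36x20 = 720-dimensional representation, trading some temporal detail for faster training and less over-fitting.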
VI. ACKNOWLEDGEMENT
This work is partially supported by EU project ASTRALS
(FP6-IST-0028097).
REFERENCES
[1] W. Zajdel, J. D. Krijnders, T. Andringa, D. M. Gavrila, "CASSANDRA: Audio-video Sensor Fusion for Aggression Detection," AVSS 2007, pp. 200-205
[2] P. K. Atrey, N. C. Maddage, M. S. Kankanhalli, "Audio Based Event Detection for Multimedia Surveillance," ICASSP 2006
[3] A. F. Smeaton, M. McHugh, "Towards event detection in an audio-based sensor network," Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks, 2005, pp. 87-94
[4] S. Ntalampiras, I. Potamitis, N. Fakotakis, "On acoustic surveillance of hazardous situations," ICASSP 2009 (IEEE International Conference on Acoustics, Speech and Signal Processing), pp. 165-168, 2009

Figure 6. ROC curve of scream detection