1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues
Environmental Sound
Recognition and Classification
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/
Environmental Sound Recognition - Dan Ellis 2011-06-01 /36
1. Machine Listening
Audio Lifelog Diarization
Environment Classification:
speech / music / silent / machine
Zhang & Kuo 01
25 concepts from Kodak user study:
boat, crowd, cheer, dance, ...
[Figure: per-concept scores (0-1) for three classifiers -- 8GMM+Bha (best on 9/15 concepts), pLSA500+lognorm (12/15), 1G+KL2 (10/15) -- over the labeled concepts: group of 3+, music, crowd, cheer, singing, one person, night, group of two, show, dancing, beach, park, baby, playground, parade, boat, sports, graduation, sunset, birthday, animal, wedding, picnic, museum, ski; overlapped concepts marked.]
[Diagram: pLSA model -- p(ftr | clip) = p(ftr | topic) * p(topic | clip): each clip's feature distribution is modeled as a mixture of per-topic feature distributions weighted by the clip's topic proportions.]
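The pLSA decomposition p(ftr | clip) = Σ_topic p(ftr | topic) · p(topic | clip) can be fit with a few lines of EM. A minimal numpy sketch on toy data (the two-topic setup and all sizes are illustrative assumptions, not the 500-topic model on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy p(ftr | clip) matrix: 6 features x 4 clips, generated from 2 hidden
# topics so the factorization is recoverable (sizes are illustrative).
true_ftr_topic = np.array([[.5, .0], [.3, .0], [.2, .1],
                           [.0, .2], [.0, .3], [.0, .4]])
true_topic_clip = np.array([[1., .8, .2, .0],
                            [.0, .2, .8, 1.]])
V = true_ftr_topic @ true_topic_clip          # columns are distributions

n_topics = 2
P_ft = rng.random((6, n_topics)); P_ft /= P_ft.sum(0)   # p(ftr | topic)
P_tc = rng.random((n_topics, 4)); P_tc /= P_tc.sum(0)   # p(topic | clip)

for _ in range(300):
    # One EM step: split each observed cell across topics in proportion
    # to the current model's posterior, then renormalize both factors.
    R = V / np.maximum(P_ft @ P_tc, 1e-12)
    new_ft = P_ft * (R @ P_tc.T)
    new_tc = P_tc * (P_ft.T @ R)
    P_ft = new_ft / new_ft.sum(0)
    P_tc = new_tc / new_tc.sum(0)

err = np.abs(P_ft @ P_tc - V).max()           # reconstruction error
```

On exact low-rank data like this, EM drives the reconstruction error close to zero; on real clip features the factorization is only approximate.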
Background Classification Results
Characterize sounds by
perceptually-sufficient statistics
.. verified by matched resynthesis
Subband distributions & env x-corrs
Mahalanobis distance ...
McDermott & Simoncelli 09
Ellis, Zheng & McDermott 11
[Diagram: texture feature pipeline -- sound -> FFT -> mel filterbank (18 chans) -> automatic gain control -> subband envelopes; statistics: histogram moments mean, var, skew, kurt (18 x 4); modulation energy in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6); envelope cross-band correlations (318 samples).]
[Figure: texture features (moments and modulation energy vs mel band) over 0-10 s for two clips -- 1159_10 "urban cheer clap" and 1062_60 "quiet dubbed speech music".]
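The per-band statistics in the pipeline can be sketched roughly as follows. This is an illustrative approximation only: the 400 Hz envelope rate, the octave-band edges, and the random stand-in envelopes are assumptions, not the exact McDermott & Simoncelli implementation:

```python
import numpy as np

def texture_stats(envelopes, fs_env=400.0):
    """Per-band moments plus octave-spaced modulation energies,
    loosely following the subband-statistics idea on the slide."""
    env = np.asarray(envelopes, float)            # (n_bands, n_frames)
    mu = env.mean(1, keepdims=True)
    sd = env.std(1, keepdims=True) + 1e-12
    z = (env - mu) / sd
    # mean, variance, skewness, kurtosis per band -> (n_bands, 4)
    moments = np.stack([mu[:, 0], sd[:, 0] ** 2,
                        (z ** 3).mean(1), (z ** 4).mean(1)], 1)

    # Modulation energy: envelope power in octave bands around 0.5..16 Hz
    spec = np.abs(np.fft.rfft(env - mu, axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / fs_env)
    edges = [0.354, 0.707, 1.414, 2.83, 5.66, 11.3, 22.6]  # octave edges
    modE = np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(1)
                     for lo, hi in zip(edges[:-1], edges[1:])], 1)
    return moments, modE                          # (n_bands, 4), (n_bands, 6)

rng = np.random.default_rng(1)
env = rng.random((18, 4000))       # stand-in for 18 mel-band envelopes, 10 s
m, e = texture_stats(env)
```

The two returned arrays match the 18 x 4 moments and 18 x 6 modulation-energy blocks in the diagram.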
Sound Texture Features
Contrasts in
feature sets
correlation of labels...
Perform
~ same as MFCCs
combine well
3. Foreground Event Recognition
Refine models by EM realignment
Transients = foreground events?
Onset detector finds energy bursts
best SNR
PCA basis to represent each
300 ms x auditory freq
bag of transients
Cotton, Ellis, Loui 11
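A minimal numpy sketch of the transient idea above: flag frames where total energy jumps, then cut a fixed-length patch after each onset as a "bag of transients". The spectrogram, threshold, and patch length are illustrative assumptions, and the PCA and SNR-selection steps are omitted:

```python
import numpy as np

def transient_patches(spec, patch_frames=30, thresh=2.0):
    """Detect energy bursts in a (freq, time) array and return fixed-length
    patches starting at each onset -- a crude 'bag of transients'."""
    energy = spec.sum(0)                         # total energy per frame
    flux = np.diff(energy, prepend=energy[0])    # frame-to-frame increase
    z = (flux - flux.mean()) / (flux.std() + 1e-12)
    onsets = np.flatnonzero(z > thresh)          # frames with a big jump
    patches = [spec[:, t:t + patch_frames] for t in onsets
               if t + patch_frames <= spec.shape[1]]
    return onsets, np.array(patches)

rng = np.random.default_rng(2)
spec = 0.1 * rng.random((40, 500))               # quiet background
for t in (100, 300):                             # inject two energy bursts
    spec[:, t:t + 5] += 5.0
onsets, patches = transient_patches(spec)
```

Each patch could then be projected onto a PCA basis, as on the slide, to get a compact per-transient descriptor.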
Nonnegative Matrix Factorization
Compare to MFCC-HMM
detector
Cotton, Ellis 11
[Figure: 20 learned NMF patches (Patch 1-20), each ~0.5 s over frequencies 442-3474 Hz.]
[Figure: Error Rate in Noise -- Acoustic Event Error Rate (AEER %) vs SNR (10-30 dB) for MFCC, NMF, and combined systems.]
NMF more
noise-robust
combines well ...
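The patch learning behind these results is nonnegative matrix factorization; a plain multiplicative-update sketch (Euclidean cost, toy low-rank data; the rank and iteration count are illustrative, not the paper's settings):

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Multiplicative-update NMF: V ~= W @ H with all entries nonnegative.
    Columns of W act as spectral patches, rows of H as their activations."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        # Lee-Seung updates for the Euclidean cost; small eps avoids /0
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# Toy "spectrogram" built from 3 spectral patterns with separate activations
rng = np.random.default_rng(3)
W0 = rng.random((50, 3))
H0 = rng.random((3, 120))
V = W0 @ H0
W, H = nmf(V, rank=3)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)   # relative error
```

On exact rank-3 data the relative reconstruction error becomes small; on real spectrograms the factors only approximate the data, and the activations H feed the event detector.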
4. Speech Separation
Idea: Find speaker model parameter space
generalize without losing detail?
Eigenvoice model:
Kuhn et al. 98, 00
Weiss & Ellis 07, 08, 09
[Diagram: speaker models as points in a space of speaker-subspace bases.]
mu = mu_bar + U w + B h
(adapted model = mean voice + eigenvoice bases x voice weights + channel bases x channel weights)
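The adaptation equation mu = mu_bar + U w + B h can be sketched numerically: given the mean supervector and the bases, the voice weights w and channel weights h fall out of a least-squares fit. All dimensions here are toy stand-ins (the slide's real model is 280 states x 320 bins = 89,600 dimensions):

```python
import numpy as np

rng = np.random.default_rng(4)
D, n_eig, n_chan = 1000, 5, 2            # toy sizes, not the real model
mu_bar = rng.standard_normal(D)          # mean-voice supervector
U = rng.standard_normal((D, n_eig))      # eigenvoice bases (columns)
B = rng.standard_normal((D, n_chan))     # channel bases (columns)

w = rng.standard_normal(n_eig)           # per-speaker voice weights
h = rng.standard_normal(n_chan)          # channel weights
mu = mu_bar + U @ w + B @ h              # adapted model supervector

# Given an observed supervector, recover w and h by least squares
A = np.hstack([U, B])
coef, *_ = np.linalg.lstsq(A, mu - mu_bar, rcond=None)
w_hat, h_hat = coef[:n_eig], coef[n_eig:]
```

Because the toy supervector lies exactly in the span of [U, B], the fit recovers w and h exactly; with real adapted models the projection is only approximate.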
Eigenvoice Bases
Mean model
280 states x 320 bins
= 89,600 dimensions
Eigencomponents
shift formants/
coloration
additional
components for
acoustic channel
[Figure: mean-voice model and eigenvoice dimensions 1-3, shown as spectrogram-like panels (frequency 0-8 kHz) across the phone states b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa ay aw ah ao ow uw ax; level scale 0-50.]
Eigenvoice Speech Separation
Improves over simple Wu criterion
especially for mid SNR
[Figure: Mean BER (%) of pitch-tracking algorithms, log scale 1-100, vs SNR (-10 to 25 dB), comparing Wu with CS-GTPTHvarAdjC5.]
BS Lee & Ellis 11
Noise-Robust Pitch Tracking
Trained selection
includes more
off-harmonic
channels
[Figure: 08_rbf_pinknoise5dB example over 0-4 s -- spectrogram (freq 0-4 kHz) and per-channel autocorrelograms (channels vs period 0-25 ms), showing the channels selected by Wu vs CS-GT.]
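The channel-selection idea can be sketched with a summary-autocorrelation pitch estimate, using a hand-picked channel mask to stand in for the trained per-channel classifier. All signals and parameters below are illustrative assumptions, not the Lee & Ellis system:

```python
import numpy as np

def summary_autocorr_pitch(subbands, fs, selected, fmin=60, fmax=400):
    """Sum normalized autocorrelations over *selected* channels and pick
    the lag with the largest peak; return the corresponding pitch in Hz."""
    lags = np.arange(int(fs / fmax), int(fs / fmin))
    total = np.zeros(len(lags))
    for ch in np.flatnonzero(selected):
        x = subbands[ch] - subbands[ch].mean()
        ac = np.correlate(x, x, 'full')[len(x) - 1:]   # lags 0..N-1
        total += ac[lags] / (ac[0] + 1e-12)            # normalize per channel
    return fs / lags[np.argmax(total)]

fs = 8000
t = np.arange(0, 0.1, 1 / fs)
f0 = 200.0
# Two harmonic channels plus one noise channel; select only the harmonics
bands = np.stack([np.sin(2 * np.pi * f0 * t),
                  np.sin(2 * np.pi * 2 * f0 * t),
                  np.random.default_rng(5).standard_normal(len(t))])
pitch = summary_autocorr_pitch(bands, fs, selected=[1, 1, 0])
```

Replacing the fixed mask with a classifier trained to judge each channel's reliability is the step the slide describes.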
5. Outstanding Issues
Large-scale analysis
Object-related features
from both audio (transients) & video (patches)
[Diagram: joint audio-visual atom discovery -- visual atoms tracked over time, aligned with audio atoms found in the audio spectrum (frequency vs time).]
Jiang et al. 09
Audio-Visual Atoms
[Diagram: N audio atoms and M visual atoms over time feed codebook learning, yielding a joint A-V codebook of M x N A-V atoms.]
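The joint codebook step can be sketched as clustering over concatenated audio and visual atom features; here plain k-means stands in for the codebook learning, and the pairing, feature dimensions, and k are illustrative assumptions about the method:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=6):
    """Plain Lloyd's k-means; stands in for joint A-V codebook learning."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]        # initial centers
    for _ in range(n_iter):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)   # squared distances
        lab = d.argmin(1)                              # nearest center
        for j in range(k):
            if (lab == j).any():
                C[j] = X[lab == j].mean(0)             # recompute center
    return C, lab

rng = np.random.default_rng(7)
audio = rng.random((200, 8))        # stand-in audio-atom features
visual = rng.random((200, 12))      # stand-in visual-atom features (paired)
joint = np.hstack([audio, visual])  # co-occurring A-V feature pairs
codebook, labels = kmeans(joint, k=10)
```

Each codeword then represents a recurring audio-visual co-occurrence that can be counted per clip for concept classification.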
Machine Listening:
Getting useful information from sound
Jon Barker, Martin Cooke, & Dan Ellis, Decoding Speech in the Presence of Other Sources, Speech
Communication 45(1): 5-25, 2005.
Courtenay Cotton, Dan Ellis, & Alex Loui, Soundtrack classification by transient events, IEEE
ICASSP, Prague, May 2011.
Courtenay Cotton & Dan Ellis, Spectral vs. Spectro-Temporal Features for Acoustic Event
Classification, submitted to IEEE WASPAA, 2011.
Dan Ellis, Xiaohong Zheng, Josh McDermott, Classifying soundtracks with audio texture features,
IEEE ICASSP, Prague, May 2011.
Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, Short-Term Audio-Visual
Atoms for Generic Video Concept Classification, ACM MultiMedia, 5-14, Beijing, Oct 2009.
Keansub Lee & Dan Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE
Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
Keansub Lee, Dan Ellis, Alex Loui, Detecting local semantic concepts in environmental sounds
using Markov model based clustering, IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
Byung-Suk Lee & Dan Ellis, Noise-robust pitch tracking by trained channel selection, submitted to
IEEE WASPAA, 2011.
Andriy Temko & Climent Nadeu, Classification of acoustic events using SVM-based clustering
schemes, Pattern Recognition 39(4): 682-694, 2006.
Ron Weiss & Dan Ellis, Speech separation using speaker-adapted Eigenvoice speech models,
Computer Speech & Lang. 24(1): 16-29, 2010.
Ron Weiss, Michael Mandel, & Dan Ellis, Combining localization cues and source model
constraints for binaural source separation, Speech Communication 53(5): 606-621, May 2011.
Tong Zhang & C.-C. Jay Kuo, Audio content analysis for on-line audiovisual data segmentation,
IEEE TSAP 9(4): 441-457, May 2001.