Environmental Sound Recognition and Classification
Dan Ellis, 2011-06-01
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu    http://labrosa.ee.columbia.edu/

Outline:
1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues
1. Machine Listening

- Extracting useful information from sound
- ... like animals do

[Figure: tasks (Detect / Classify / Describe) vs. domains (Speech, Music, Environmental Sound); example cells: VAD, ASR, emotion (speech); speech/music, music transcription, music recommendation (music); environment awareness, sound intelligence, automatic narration (environmental sound)]
Environmental Sound Applications

- Audio lifelog diarization
- Consumer video classification & search

[Figure: two days of a personal audio lifelog (2004-09-13, 2004-09-14) segmented from 09:00 to 18:00 into episodes such as preschool, cafe, lecture, office, lab, outdoor, and meetings, with speaker labels (Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?)]
Prior Work

- Environment classification: speech / music / silent / machine (Zhang & Kuo 01)
- Nonspeech sound recognition
- Meeting-room audio event classification: sports events (cheers, bat/ball sounds, ...) (Temko & Nadeu 06)
Consumer Video Dataset

- 25 concepts from Kodak user study: boat, crowd, cheer, dance, ...
- Grab top 200 videos from YouTube search, then filter for quality / unedited = 1873 videos; manually relabel with concepts

[Figure: labeled concepts and per-concept results for 8GMM+Bha (9/15), pLSA500+lognorm (12/15), and 1G+KL2 (10/15), over concepts from "group of 3+", "music", "crowd" ... down to "museum" and "ski"]
Obtaining Labeled Data

- Amazon Mechanical Turk
- 10 s clips
- 9,641 videos in 4 weeks

Y.-G. Jiang et al. 2011
2. Background Classification

Baseline for soundtrack classification:
- divide sound into short frames (e.g. 30 ms)
- calculate features (e.g. MFCC) for each frame
- describe the clip by the statistics of its frames (mean, covariance) = "bag of features"

Classify by e.g. KL distance + SVM.

[Figure: video soundtrack (spectrogram of VTS_04_0001) -> per-frame MFCC features -> clip-level MFCC covariance matrix]
Codebook Histograms

- Convert high-dimensional distributions to multinomials
- Classify by distance on histograms (KL, chi-squared) + SVM

[Figure: MFCC features -> global Gaussian mixture model -> per-category mixture-component histogram (count per GMM mixture, components 1-15)]
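A minimal sketch of the codebook-histogram idea, substituting a k-means codebook for the global GMM (hard assignments in place of mixture posteriors); all sizes and function names here are illustrative, not from the slides:

```python
import numpy as np

def kmeans_codebook(frames, k=16, iters=20, seed=0):
    """Stand-in for the global GMM: a k-means codebook over training frames."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((frames[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = frames[assign == j].mean(axis=0)
    return centers

def clip_histogram(frames, centers):
    """Quantize each frame to its nearest codeword; normalized count histogram."""
    assign = np.argmin(((frames[:, None] - centers) ** 2).sum(-1), axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / hist.sum()

def chi2_dist(p, q, eps=1e-9):
    """Chi-squared histogram distance, usable in an SVM kernel exp(-gamma*d)."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```

With soft GMM posteriors instead of hard assignments, each frame would spread fractional counts over several components, but the per-clip multinomial summary is the same idea.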
Latent Semantic Analysis (LSA)

Probabilistic LSA (pLSA) models each histogram as a mixture of several topics
- .. each clip may have several things going on

Topic sets optimized through EM:

    p(ftr | clip) = Σ_topics p(ftr | topic) p(topic | clip)

- use (normalized?) p(topic | clip) as per-clip features

[Figure: the clip-by-feature matrix p(ftr | clip) of GMM histogram features factored as p(ftr | topic) * p(topic | clip)]
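The EM optimization for the decomposition above can be sketched as a toy pLSA fit on a clips x features count matrix; this is a simplified illustration (no tempering or held-out folding), with made-up sizes:

```python
import numpy as np

def plsa(counts, n_topics=4, iters=50, seed=0):
    """EM for pLSA on a clips-x-features count matrix.
    Returns p(feature | topic) and p(topic | clip)."""
    rng = np.random.default_rng(seed)
    n_clips, n_feats = counts.shape
    p_f_z = rng.random((n_topics, n_feats))
    p_f_z /= p_f_z.sum(1, keepdims=True)
    p_z_c = rng.random((n_clips, n_topics))
    p_z_c /= p_z_c.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities p(topic | clip, feature)
        joint = p_z_c[:, :, None] * p_f_z[None, :, :]   # clips x topics x feats
        resp = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate both factors from expected counts
        weighted = counts[:, None, :] * resp
        p_f_z = weighted.sum(0)
        p_f_z /= p_f_z.sum(1, keepdims=True) + 1e-12
        p_z_c = weighted.sum(2)
        p_z_c /= p_z_c.sum(1, keepdims=True) + 1e-12
    return p_f_z, p_z_c
```

The rows of p(topic | clip) are then the compact per-clip features fed to the classifier.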
Background Classification Results (K Lee & Ellis 10)

Wide range in performance:
- audio-defined concepts (music, ski) vs. non-audio (group, night)
- large AP uncertainty on infrequent classes

[Figure: per-concept Average Precision, from "group of 3+" through "museum" plus the MEAN, for: Guessing, MFCC + GMM classifier, single Gaussian + KL2, 8-GMM + Bhattacharyya, and 1024-GMM histogram + 500 log(P(z|c)/P(z))]
Sound Texture Features (McDermott & Simoncelli 09; Ellis, Zheng, McDermott 11)

- Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis
- Subband distributions & envelope cross-correlations; Mahalanobis distance ...

[Figure: processing chain: sound -> mel filterbank (18 channels) -> automatic gain control -> per-band moments (mean, var, skew, kurt: 18 x 4), cross-band correlations (318 samples), and envelope FFT into octave bins at 0.5, 1, 2, 4, 8, 16 Hz -> modulation energy (18 x 6); example texture features for clips "1159_10 urban cheer clap" and "1062_60 quiet dubbed speech music"]
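A simplified version of these statistics (per-band envelope moments, cross-band correlations, and octave-band modulation energies) might look like the following; the envelope sampling rate, band count, and exact octave edges are assumptions for illustration, not the precise pipeline of the papers:

```python
import numpy as np

def texture_features(env, sr=400.0):
    """env: (n_bands, n_samples) subband amplitude envelopes.
    Returns per-band moments (n_bands x 4), the cross-band correlation
    matrix, and modulation energy in octave bins centered 0.5-16 Hz."""
    mu = env.mean(1)
    centered = env - mu[:, None]
    var = centered.var(1) + 1e-12
    skew = (centered ** 3).mean(1) / var ** 1.5
    kurt = (centered ** 4).mean(1) / var ** 2
    moments = np.stack([mu, var, skew, kurt], axis=1)
    cbc = np.corrcoef(env)  # cross-band envelope correlations
    # modulation energy: envelope power spectrum summed into octave bins
    spec = np.abs(np.fft.rfft(centered, axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], 1.0 / sr)
    edges = [(0.354, 0.707), (0.707, 1.414), (1.414, 2.828),
             (2.828, 5.657), (5.657, 11.314), (11.314, 22.627)]
    mod = np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(1)
                    for lo, hi in edges], axis=1)
    return moments, cbc, mod
```

Stacking these statistics per clip gives the fixed-length texture vector that is then compared with a Mahalanobis-style distance.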
Sound Texture Features (results)

- Test on MED 2010 development data
- 10 specially-collected manual labels

[Figure: normalized texture-feature means and standard deviations (M, V, S, K, modulation spectra, cross-band correlations) for clips labeled rural, urban, quiet, noisy, other, dubbed, speech, music, cheer, clap; per-label accuracy for feature sets M, MVSK, ms, cbc, MVSK+ms, MVSK+cbc, MVSK+ms+cbc]

- Contrasts in feature sets; correlation of labels...
- Perform ~ same as MFCCs; combine well
3. Foreground Event Recognition (K Lee, Ellis, Loui 10)

Global vs. local class models:
- tell-tale acoustics may be washed out in whole-clip statistics
- try iterative realignment of HMMs
- background model shared by all clips

[Figure: spectrogram of clip "YT baby 002" with labels voice, baby, laugh. Old way: all frames contribute to every clip-level label. New way: each label claims only a limited temporal extent, with remaining frames assigned to background (bg)]
Foreground Event HMMs (Lee & Ellis 10)

- Training labels only at clip level
- Refine models by EM realignment
- Use for classifying the entire video... or seeking to the relevant part

[Figure: spectrogram of YT_beach_001.mpeg with manual clip-level labels (beach, group of 3+, cheer); Viterbi path through a 27-state Markov model (25 concepts + 2 global background states); manual frame-level annotation showing water wave, cheer, voices/speakers, drum, wind, background]
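The realignment alternates between decoding each clip's best state path (restricted to its labeled concepts plus the shared background) and re-estimating each state's model from the frames assigned to it. The decoding half is standard Viterbi; a generic sketch with illustrative log-likelihoods:

```python
import numpy as np

def viterbi(loglik, log_trans):
    """loglik: (T, S) per-frame state log-likelihoods; log_trans: (S, S)
    state transition log-probabilities. Returns the best state path."""
    T, S = loglik.shape
    delta = np.full((T, S), -np.inf)  # best score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

In the realignment setting, the columns of `loglik` would be the clip's labeled concept models plus background, and the decoded path supplies the "limited temporal extents" for the next EM round.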
Transient Features (Cotton, Ellis, Loui 11)

- Transients = foreground events?
- Onset detector finds energy bursts (best SNR)
- PCA basis to represent each 300 ms x auditory-frequency patch
- "bag of transients"
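A minimal onset detector in this spirit: half-wave-rectified spectral flux, peak-picked against a local running mean. The window length and threshold here are arbitrary illustrative choices, not the paper's settings:

```python
import numpy as np

def spectral_flux_onsets(spec, delta=1.0, win=10):
    """spec: (n_bands, n_frames) magnitude spectrogram.
    Returns frame indices of detected onsets (energy bursts)."""
    # half-wave-rectified frame-to-frame increase, summed over bands
    flux = np.maximum(np.diff(spec, axis=1), 0.0).sum(axis=0)
    # local mean for an adaptive threshold
    pad = np.pad(flux, win, mode="edge")
    local_mean = np.convolve(pad, np.ones(2 * win + 1) / (2 * win + 1), "valid")
    peaks = [t for t in range(1, len(flux) - 1)
             if flux[t] > flux[t - 1] and flux[t] >= flux[t + 1]
             and flux[t] > local_mean[t] + delta]
    return np.array(peaks, dtype=int) + 1  # +1: flux[t] compares frame t+1 to t
```

Each detected onset would then anchor a 300 ms time-frequency patch, projected onto a PCA basis to form the "bag of transients".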
Nonnegative Matrix Factorization (Smaragdis & Brown 03; Abdallah & Plumbley 04; Virtanen 07)

Decompose spectrograms into templates + activations:

    X = W H

- fast, forgiving gradient-descent algorithm
- 2D patches
- sparsity control...

[Figure: original mixture spectrogram (442-5478 Hz, 0-10 s) decomposed into Basis 1 (L2), Basis 2 (L1), and Basis 3 (L1) with their activations]
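The factorization X ≈ W H can be computed with the classic multiplicative updates, one common choice of "fast, forgiving" iteration; this minimal squared-error version omits the sparsity control and 2-D patches mentioned above:

```python
import numpy as np

def nmf(X, rank=3, iters=200, seed=0):
    """Multiplicative-update NMF minimizing squared error:
    X (freq x time) ~ W (freq x rank) @ H (rank x time), all nonnegative."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    eps = 1e-9
    for _ in range(iters):
        # updates preserve nonnegativity and never increase the error
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Applied to a magnitude spectrogram, the columns of W are spectral templates and the rows of H their activations over time.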
NMF Transient Features (Cotton & Ellis 11)

- Learn 20 patches from Meeting Room Acoustic Event data
- Compare to MFCC-HMM detector

[Figure: the 20 learned spectro-temporal patches (442-3474 Hz, 0-0.5 s); Acoustic Event Error Rate (AEER %) vs. SNR (10-30 dB) for MFCC, NMF, and combined systems]

- NMF features are more noise-robust, and combine well with MFCCs...
4. Speech Separation

Speech recognition is finding the best-fit parameters: argmax_W P(W | X)

Recognize mixtures with a Factorial HMM:
- model + state sequence for each voice/source
- exploit sequence constraints, speaker differences
- separation relies on a detailed speaker model

[Figure: two HMMs (model 1, model 2) jointly explaining the observations over time]
Eigenvoices (Kuhn et al. 98, 00; Weiss & Ellis 07, 08, 09)

Idea: find a speaker model parameter space
- generalize without losing detail?

Eigenvoice model:

    adapted model = mean voice + U w + B h

i.e. the mean voice plus eigenvoice bases U scaled by voice weights w, plus channel bases B scaled by channel weights h.
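Given the bases, adapting to a new speaker reduces to estimating the weights w and h. In the actual system this happens inside the factorial-HMM EM; as a standalone illustration with synthetic dimensions, a direct least-squares fit looks like:

```python
import numpy as np

def adapt(mu_bar, U, B, target):
    """Least-squares estimate of eigenvoice weights w and channel weights h
    such that mu_bar + U @ w + B @ h approximates a target speaker model."""
    bases = np.hstack([U, B])
    coef, *_ = np.linalg.lstsq(bases, target - mu_bar, rcond=None)
    w, h = coef[:U.shape[1]], coef[U.shape[1]:]
    return w, h, mu_bar + U @ w + B @ h
```

The low-dimensional weight vector is what lets a handful of parameters stand in for the full (e.g. 89,600-dimensional) speaker model.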
Eigenvoice Bases

- Mean model: 280 states x 320 bins = 89,600 dimensions
- Eigencomponents shift formants / coloration
- additional components for the acoustic channel

[Figure: the mean voice and eigenvoice dimensions 1-3 as state-by-frequency images (0-8 kHz) over phone classes b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa ay ah ao ow uw ax]
Eigenvoice Speech Separation (Weiss & Ellis 10)

Factorial HMM analysis with tuning of source model parameters
= eigenvoice speaker adaptation
Eigenvoice Speech Separation

Eigenvoices for the Speech Separation task:
- speaker-adapted (SA) models perform midway between speaker-dependent (SD) and speaker-independent (SI)

[Figure: spectrograms of the mixture (Mix) and of the SI, SA, and SD separation results]
Binaural Cues

Model the interaural spectrum of each source as stationary level and time differences.

[Figure: IPD, IPD residual, and ILD, e.g. for a source at 75 degrees in reverb]
Model-Based EM Source Separation and Localization (MESSL) (Mandel & Ellis 09)

- can model more sources than sensors
- flexible initialization

EM loop: assign spectrogram points to sources <-> re-estimate source parameters
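The assign/re-estimate loop can be illustrated with a toy one-dimensional version: model the interaural phase difference at every spectrogram point as a mixture of per-source Gaussians, and read the responsibilities as soft time-frequency masks. Real MESSL additionally models ILD and frequency-dependent parameters; everything below is a deliberate simplification:

```python
import numpy as np

def em_ipd(ipd, n_src=2, iters=30):
    """Toy MESSL-style EM over interaural phase differences (1-D):
    one Gaussian per source; responsibilities act as soft T-F masks."""
    mu = np.quantile(ipd, (np.arange(n_src) + 0.5) / n_src)  # spread init
    var = np.full(n_src, ipd.var() + 1e-6)
    pi = np.full(n_src, 1.0 / n_src)
    for _ in range(iters):
        # E-step: posterior probability of each source at each point
        ll = -0.5 * ((ipd[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
        post = pi * np.exp(ll)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate per-source mean, variance, and prior
        nk = post.sum(axis=0) + 1e-12
        mu = (post * ipd[:, None]).sum(axis=0) / nk
        var = (post * (ipd[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return mu, post
```

Because the masks are probabilistic rather than hard, the model can spread ambiguous points over sources, which is the "modeling uncertainty" payoff reported on the next slide.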
MESSL Results

Modeling uncertainty improves results:
- tradeoff between constraints & noisiness

[Figure: example separation results at 2.45, 9.12, 0.22, -2.72, 12.35, and 8.77 dB]
MESSL with Source Priors (Weiss, Mandel & Ellis 11)

Fixed or adaptive speech models.
MESSL-EigenVoice Results

- Source models function as priors
- Interaural parameters -> spatial separation
- source model prior improves the spatial estimate

[Figure: time-frequency masks (0-8 kHz, 0-1.5 s) for Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL SP (10.01 dB), and MESSL EV (10.37 dB)]
Noise-Robust Pitch Tracking (BS Lee & Ellis 11)

- Important for voice detection & separation
- Based on channel selection, after Wu & Wang (2003):
  - pitch from the summary autocorrelation over "good" bands
  - a trained classifier decides which channels to include
- Improves over the simple Wu criterion, especially at mid SNRs

[Figure: mean pitch-tracking BER (%) vs. SNR (-10 to 25 dB) for the Wu criterion and the trained channel-selection system (CS-GTPTHvarAdjC5)]
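The summary-autocorrelation core is easy to sketch. Here channel selection is just a boolean mask standing in for the trained classifier, and the sampling rate, pitch range, and normalization are illustrative choices:

```python
import numpy as np

def summary_autocorr_pitch(subbands, sr, fmin=60.0, fmax=400.0, select=None):
    """Pitch via summary autocorrelation: autocorrelate each subband signal,
    sum over the selected ('good') channels, pick the best lag in range."""
    n = subbands.shape[1]
    if select is None:
        select = np.ones(len(subbands), dtype=bool)
    sac = np.zeros(n)
    for band in subbands[select]:
        ac = np.correlate(band, band, mode="full")[n - 1:]  # lags 0..n-1
        sac += ac / (ac[0] + 1e-12)  # normalize by zero-lag energy
    lo, hi = int(sr / fmax), int(sr / fmin)  # lag search range
    lag = lo + np.argmax(sac[lo:hi])
    return sr / lag
```

In the trained system, the `select` mask would come from a classifier judging each channel's periodicity reliability per frame, rather than being fixed in advance.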
Noise-Robust Pitch Tracking (cont.)

Trained selection includes more off-harmonic channels.

[Figure: example "08_rbf_pinknoise5dB": spectrogram (0-4 kHz), with selected channels and per-channel autocorrelograms (period 0-25 ms) for the Wu and trained (CS-GT) selectors]
5. Outstanding Issues

Better object/event separation:
- parametric models
- spatial information?
- computational auditory scene analysis... (Barker et al. 05)

Large-scale analysis

Integration with video

[Figure: grouping cues: frequency proximity, harmonicity, common onset]
Audio-Visual Atoms (Jiang et al. 09)

Object-related features from both audio (transients) & video (patches).

[Figure: joint audio-visual atom discovery: visual atoms tracked over time, aligned with audio atoms extracted from the audio spectrum]
Audio-Visual Atoms

Multi-instance learning of A-V co-occurrences:
- N audio atoms x M visual atoms -> codebook learning -> joint A-V codebook of M x N A-V atoms
Audio-Visual Atoms (examples)

Audio only / Visual only / Visual + Audio:
- Wedding: black suit + romantic music
- Parade: marching people + parade sound
- Beach: sand + beach sounds
Summary

- Machine Listening: getting useful information from sound
- Background sound classification ... from whole-clip statistics?
- Foreground event recognition ... by focusing on peak-energy patches
- Speech content is very important ... separate with pitch, models, ...
References

- Jon Barker, Martin Cooke, & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
- Courtenay Cotton, Dan Ellis, & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
- Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
- Dan Ellis, Xiaohong Zheng, & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
- Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
- Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
- Keansub Lee, Dan Ellis, & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
- Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
- Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
- Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
- Ron Weiss, Michael Mandel, & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
- Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.