Environmental Sound Recognition and Classification
Dan Ellis, 2011-06-01
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu    http://labrosa.ee.columbia.edu/

Outline:
1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues
1. Machine Listening

- Extracting useful information from sound
- ... like animals do

[Figure: tasks (Detect / Classify / Describe) vs. domains (Speech, Music, Environmental Sound); example cells: VAD, ASR, emotion (speech); speech/music, music transcription, music recommendation (music); environment awareness, sound intelligence, automatic narration (environmental sound)]
Environmental Sound Applications

- Audio lifelog diarization
- Consumer video classification & search

[Figure: two days of a personal audio lifelog (2004-09-13, 2004-09-14) segmented from 09:00 to 18:00 into episodes such as preschool, cafe, lecture, office, lab, outdoor, and meetings, with speaker labels (Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?)]
Prior Work

- Environment classification: speech / music / silent / machine (Zhang & Kuo 01)
- Nonspeech sound recognition
- Meeting-room audio event classification: sports events (cheers, bat/ball sounds, ...) (Temko & Nadeu 06)
Consumer Video Dataset

- 25 concepts from Kodak user study: boat, crowd, cheer, dance, ...
- Grab top 200 videos from YouTube search, then filter for quality / unedited = 1873 videos; manually relabel with concepts

[Figure: labeled concepts and per-concept results for 8GMM+Bha (9/15), pLSA500+lognorm (12/15), and 1G+KL2 (10/15), over concepts from "group of 3+", "music", "crowd" ... down to "museum" and "ski"]
Obtaining Labeled Data

- Amazon Mechanical Turk
- 10 s clips
- 9,641 videos in 4 weeks

Y.-G. Jiang et al. 2011
2. Background Classification

Baseline for soundtrack classification:
- divide sound into short frames (e.g. 30 ms)
- calculate features (e.g. MFCC) for each frame
- describe the clip by the statistics of its frames (mean, covariance) = "bag of features"

Classify by e.g. KL distance + SVM.

[Figure: video soundtrack (spectrogram of VTS_04_0001) -> per-frame MFCC features -> clip-level MFCC covariance matrix]
Codebook Histograms

- Convert high-dimensional distributions to multinomials
- Classify by distance on histograms (KL, chi-squared) + SVM

[Figure: MFCC features -> global Gaussian mixture model -> per-category mixture-component histogram (count per GMM mixture, components 1-15)]
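A minimal sketch of the codebook-histogram idea, substituting a k-means codebook for the global GMM (hard assignments in place of mixture posteriors); all sizes and function names here are illustrative, not from the slides:

```python
import numpy as np

def kmeans_codebook(frames, k=16, iters=20, seed=0):
    """Stand-in for the global GMM: a k-means codebook over training frames."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((frames[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = frames[assign == j].mean(axis=0)
    return centers

def clip_histogram(frames, centers):
    """Quantize each frame to its nearest codeword; normalized count histogram."""
    assign = np.argmin(((frames[:, None] - centers) ** 2).sum(-1), axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / hist.sum()

def chi2_dist(p, q, eps=1e-9):
    """Chi-squared histogram distance, usable in an SVM kernel exp(-gamma*d)."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```

With soft GMM posteriors instead of hard assignments, each frame would spread fractional counts over several components, but the per-clip multinomial summary is the same idea.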
Latent Semantic Analysis (LSA)

Probabilistic LSA (pLSA) models each histogram as a mixture of several topics
- .. each clip may have several things going on

Topic sets optimized through EM:

    p(ftr | clip) = Σ_topics p(ftr | topic) p(topic | clip)

- use (normalized?) p(topic | clip) as per-clip features

[Figure: the clip-by-feature matrix p(ftr | clip) of GMM histogram features factored as p(ftr | topic) * p(topic | clip)]
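The EM optimization for the decomposition above can be sketched as a toy pLSA fit on a clips x features count matrix; this is a simplified illustration (no tempering or held-out folding), with made-up sizes:

```python
import numpy as np

def plsa(counts, n_topics=4, iters=50, seed=0):
    """EM for pLSA on a clips-x-features count matrix.
    Returns p(feature | topic) and p(topic | clip)."""
    rng = np.random.default_rng(seed)
    n_clips, n_feats = counts.shape
    p_f_z = rng.random((n_topics, n_feats))
    p_f_z /= p_f_z.sum(1, keepdims=True)
    p_z_c = rng.random((n_clips, n_topics))
    p_z_c /= p_z_c.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities p(topic | clip, feature)
        joint = p_z_c[:, :, None] * p_f_z[None, :, :]   # clips x topics x feats
        resp = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate both factors from expected counts
        weighted = counts[:, None, :] * resp
        p_f_z = weighted.sum(0)
        p_f_z /= p_f_z.sum(1, keepdims=True) + 1e-12
        p_z_c = weighted.sum(2)
        p_z_c /= p_z_c.sum(1, keepdims=True) + 1e-12
    return p_f_z, p_z_c
```

The rows of p(topic | clip) are then the compact per-clip features fed to the classifier.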
Background Classification Results (K Lee & Ellis 10)

Wide range in performance:
- audio-defined concepts (music, ski) vs. non-audio (group, night)
- large AP uncertainty on infrequent classes

[Figure: per-concept Average Precision, from "group of 3+" through "museum" plus the MEAN, for: Guessing, MFCC + GMM classifier, single Gaussian + KL2, 8-GMM + Bhattacharyya, and 1024-GMM histogram + 500 log(P(z|c)/P(z))]
Sound Texture Features (McDermott & Simoncelli 09; Ellis, Zheng, McDermott 11)

- Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis
- Subband distributions & envelope cross-correlations; Mahalanobis distance ...

[Figure: processing chain: sound -> mel filterbank (18 channels) -> automatic gain control -> per-band moments (mean, var, skew, kurt: 18 x 4), cross-band correlations (318 samples), and envelope FFT into octave bins at 0.5, 1, 2, 4, 8, 16 Hz -> modulation energy (18 x 6); example texture features for clips "1159_10 urban cheer clap" and "1062_60 quiet dubbed speech music"]
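A simplified version of these statistics (per-band envelope moments, cross-band correlations, and octave-band modulation energies) might look like the following; the envelope sampling rate, band count, and exact octave edges are assumptions for illustration, not the precise pipeline of the papers:

```python
import numpy as np

def texture_features(env, sr=400.0):
    """env: (n_bands, n_samples) subband amplitude envelopes.
    Returns per-band moments (n_bands x 4), the cross-band correlation
    matrix, and modulation energy in octave bins centered 0.5-16 Hz."""
    mu = env.mean(1)
    centered = env - mu[:, None]
    var = centered.var(1) + 1e-12
    skew = (centered ** 3).mean(1) / var ** 1.5
    kurt = (centered ** 4).mean(1) / var ** 2
    moments = np.stack([mu, var, skew, kurt], axis=1)
    cbc = np.corrcoef(env)  # cross-band envelope correlations
    # modulation energy: envelope power spectrum summed into octave bins
    spec = np.abs(np.fft.rfft(centered, axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], 1.0 / sr)
    edges = [(0.354, 0.707), (0.707, 1.414), (1.414, 2.828),
             (2.828, 5.657), (5.657, 11.314), (11.314, 22.627)]
    mod = np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(1)
                    for lo, hi in edges], axis=1)
    return moments, cbc, mod
```

Stacking these statistics per clip gives the fixed-length texture vector that is then compared with a Mahalanobis-style distance.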
Sound Texture Features (results)

- Test on MED 2010 development data
- 10 specially-collected manual labels

[Figure: normalized texture-feature means and standard deviations (M, V, S, K, modulation spectra, cross-band correlations) for clips labeled rural, urban, quiet, noisy, other, dubbed, speech, music, cheer, clap; per-label accuracy for feature sets M, MVSK, ms, cbc, MVSK+ms, MVSK+cbc, MVSK+ms+cbc]

- Contrasts in feature sets; correlation of labels...
- Perform ~ same as MFCCs; combine well
3. Foreground Event Recognition (K Lee, Ellis, Loui 10)

Global vs. local class models:
- tell-tale acoustics may be washed out in whole-clip statistics
- try iterative realignment of HMMs
- background model shared by all clips

[Figure: spectrogram of clip "YT baby 002" with labels voice, baby, laugh. Old way: all frames contribute to every clip-level label. New way: each label claims only a limited temporal extent, with remaining frames assigned to background (bg)]
Foreground Event HMMs (Lee & Ellis 10)

- Training labels only at clip level
- Refine models by EM realignment
- Use for classifying the entire video... or seeking to the relevant part

[Figure: spectrogram of YT_beach_001.mpeg with manual clip-level labels (beach, group of 3+, cheer); Viterbi path through a 27-state Markov model (25 concepts + 2 global background states); manual frame-level annotation showing water wave, cheer, voices/speakers, drum, wind, background]
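The realignment alternates between decoding each clip's best state path (restricted to its labeled concepts plus the shared background) and re-estimating each state's model from the frames assigned to it. The decoding half is standard Viterbi; a generic sketch with illustrative log-likelihoods:

```python
import numpy as np

def viterbi(loglik, log_trans):
    """loglik: (T, S) per-frame state log-likelihoods; log_trans: (S, S)
    state transition log-probabilities. Returns the best state path."""
    T, S = loglik.shape
    delta = np.full((T, S), -np.inf)  # best score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

In the realignment setting, the columns of `loglik` would be the clip's labeled concept models plus background, and the decoded path supplies the "limited temporal extents" for the next EM round.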
Transient Features (Cotton, Ellis, Loui 11)

- Transients = foreground events?
- Onset detector finds energy bursts (best SNR)
- PCA basis to represent each 300 ms x auditory-frequency patch
- "bag of transients"
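A minimal onset detector in this spirit: half-wave-rectified spectral flux, peak-picked against a local running mean. The window length and threshold here are arbitrary illustrative choices, not the paper's settings:

```python
import numpy as np

def spectral_flux_onsets(spec, delta=1.0, win=10):
    """spec: (n_bands, n_frames) magnitude spectrogram.
    Returns frame indices of detected onsets (energy bursts)."""
    # half-wave-rectified frame-to-frame increase, summed over bands
    flux = np.maximum(np.diff(spec, axis=1), 0.0).sum(axis=0)
    # local mean for an adaptive threshold
    pad = np.pad(flux, win, mode="edge")
    local_mean = np.convolve(pad, np.ones(2 * win + 1) / (2 * win + 1), "valid")
    peaks = [t for t in range(1, len(flux) - 1)
             if flux[t] > flux[t - 1] and flux[t] >= flux[t + 1]
             and flux[t] > local_mean[t] + delta]
    return np.array(peaks, dtype=int) + 1  # +1: flux[t] compares frame t+1 to t
```

Each detected onset would then anchor a 300 ms time-frequency patch, projected onto a PCA basis to form the "bag of transients".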
Nonnegative Matrix Factorization (Smaragdis & Brown 03; Abdallah & Plumbley 04; Virtanen 07)

Decompose spectrograms into templates + activations:

    X = W H

- fast, forgiving gradient-descent algorithm
- 2D patches
- sparsity control...

[Figure: original mixture spectrogram (442-5478 Hz, 0-10 s) decomposed into Basis 1 (L2), Basis 2 (L1), and Basis 3 (L1) with their activations]
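The factorization X ≈ W H can be computed with the classic multiplicative updates, one common choice of "fast, forgiving" iteration; this minimal squared-error version omits the sparsity control and 2-D patches mentioned above:

```python
import numpy as np

def nmf(X, rank=3, iters=200, seed=0):
    """Multiplicative-update NMF minimizing squared error:
    X (freq x time) ~ W (freq x rank) @ H (rank x time), all nonnegative."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    eps = 1e-9
    for _ in range(iters):
        # updates preserve nonnegativity and never increase the error
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Applied to a magnitude spectrogram, the columns of W are spectral templates and the rows of H their activations over time.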
NMF Transient Features (Cotton & Ellis 11)

- Learn 20 patches from Meeting Room Acoustic Event data
- Compare to MFCC-HMM detector

[Figure: the 20 learned spectro-temporal patches (442-3474 Hz, 0-0.5 s); Acoustic Event Error Rate (AEER %) vs. SNR (10-30 dB) for MFCC, NMF, and combined systems]

- NMF features are more noise-robust, and combine well with MFCCs...
4. Speech Separation

Speech recognition is finding the best-fit parameters: argmax_W P(W | X)

Recognize mixtures with a Factorial HMM:
- model + state sequence for each voice/source
- exploit sequence constraints, speaker differences
- separation relies on a detailed speaker model

[Figure: two HMMs (model 1, model 2) jointly explaining the observations over time]
Eigenvoices (Kuhn et al. 98, 00; Weiss & Ellis 07, 08, 09)

Idea: find a speaker model parameter space
- generalize without losing detail?

Eigenvoice model:

    adapted model = mean voice + U w + B h

i.e. the mean voice plus eigenvoice bases U scaled by voice weights w, plus channel bases B scaled by channel weights h.
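Given the bases, adapting to a new speaker reduces to estimating the weights w and h. In the actual system this happens inside the factorial-HMM EM; as a standalone illustration with synthetic dimensions, a direct least-squares fit looks like:

```python
import numpy as np

def adapt(mu_bar, U, B, target):
    """Least-squares estimate of eigenvoice weights w and channel weights h
    such that mu_bar + U @ w + B @ h approximates a target speaker model."""
    bases = np.hstack([U, B])
    coef, *_ = np.linalg.lstsq(bases, target - mu_bar, rcond=None)
    w, h = coef[:U.shape[1]], coef[U.shape[1]:]
    return w, h, mu_bar + U @ w + B @ h
```

The low-dimensional weight vector is what lets a handful of parameters stand in for the full (e.g. 89,600-dimensional) speaker model.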
Eigenvoice Bases

- Mean model: 280 states x 320 bins = 89,600 dimensions
- Eigencomponents shift formants / coloration
- additional components for the acoustic channel

[Figure: the mean voice and eigenvoice dimensions 1-3 as state-by-frequency images (0-8 kHz) over phone classes b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa ay ah ao ow uw ax]
Eigenvoice Speech Separation (Weiss & Ellis 10)

Factorial HMM analysis with tuning of source model parameters
= eigenvoice speaker adaptation
Eigenvoice Speech Separation

Eigenvoices for the Speech Separation task:
- speaker-adapted (SA) models perform midway between speaker-dependent (SD) and speaker-independent (SI)

[Figure: spectrograms of the mixture (Mix) and of the SI, SA, and SD separation results]
Binaural Cues

Model the interaural spectrum of each source as stationary level and time differences.

[Figure: IPD, IPD residual, and ILD, e.g. for a source at 75 degrees in reverb]
Model-Based EM Source Separation and Localization (MESSL) (Mandel & Ellis 09)

- can model more sources than sensors
- flexible initialization

EM loop: assign spectrogram points to sources <-> re-estimate source parameters
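The assign/re-estimate loop can be illustrated with a toy one-dimensional version: model the interaural phase difference at every spectrogram point as a mixture of per-source Gaussians, and read the responsibilities as soft time-frequency masks. Real MESSL additionally models ILD and frequency-dependent parameters; everything below is a deliberate simplification:

```python
import numpy as np

def em_ipd(ipd, n_src=2, iters=30):
    """Toy MESSL-style EM over interaural phase differences (1-D):
    one Gaussian per source; responsibilities act as soft T-F masks."""
    mu = np.quantile(ipd, (np.arange(n_src) + 0.5) / n_src)  # spread init
    var = np.full(n_src, ipd.var() + 1e-6)
    pi = np.full(n_src, 1.0 / n_src)
    for _ in range(iters):
        # E-step: posterior probability of each source at each point
        ll = -0.5 * ((ipd[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
        post = pi * np.exp(ll)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate per-source mean, variance, and prior
        nk = post.sum(axis=0) + 1e-12
        mu = (post * ipd[:, None]).sum(axis=0) / nk
        var = (post * (ipd[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return mu, post
```

Because the masks are probabilistic rather than hard, the model can spread ambiguous points over sources, which is the "modeling uncertainty" payoff reported on the next slide.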
MESSL Results

Modeling uncertainty improves results:
- tradeoff between constraints & noisiness

[Figure: example separation results at 2.45, 9.12, 0.22, -2.72, 12.35, and 8.77 dB]
MESSL with Source Priors (Weiss, Mandel & Ellis 11)

Fixed or adaptive speech models.
MESSL-EigenVoice Results

- Source models function as priors
- Interaural parameters -> spatial separation
- source model prior improves the spatial estimate

[Figure: time-frequency masks (0-8 kHz, 0-1.5 s) for Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL SP (10.01 dB), and MESSL EV (10.37 dB)]
Noise-Robust Pitch Tracking (BS Lee & Ellis 11)

- Important for voice detection & separation
- Based on channel selection, after Wu & Wang (2003):
  - pitch from the summary autocorrelation over "good" bands
  - a trained classifier decides which channels to include
- Improves over the simple Wu criterion, especially at mid SNRs

[Figure: mean pitch-tracking BER (%) vs. SNR (-10 to 25 dB) for the Wu criterion and the trained channel-selection system (CS-GTPTHvarAdjC5)]
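The summary-autocorrelation core is easy to sketch. Here channel selection is just a boolean mask standing in for the trained classifier, and the sampling rate, pitch range, and normalization are illustrative choices:

```python
import numpy as np

def summary_autocorr_pitch(subbands, sr, fmin=60.0, fmax=400.0, select=None):
    """Pitch via summary autocorrelation: autocorrelate each subband signal,
    sum over the selected ('good') channels, pick the best lag in range."""
    n = subbands.shape[1]
    if select is None:
        select = np.ones(len(subbands), dtype=bool)
    sac = np.zeros(n)
    for band in subbands[select]:
        ac = np.correlate(band, band, mode="full")[n - 1:]  # lags 0..n-1
        sac += ac / (ac[0] + 1e-12)  # normalize by zero-lag energy
    lo, hi = int(sr / fmax), int(sr / fmin)  # lag search range
    lag = lo + np.argmax(sac[lo:hi])
    return sr / lag
```

In the trained system, the `select` mask would come from a classifier judging each channel's periodicity reliability per frame, rather than being fixed in advance.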
Noise-Robust Pitch Tracking (cont.)

Trained selection includes more off-harmonic channels.

[Figure: example "08_rbf_pinknoise5dB": spectrogram (0-4 kHz), with selected channels and per-channel autocorrelograms (period 0-25 ms) for the Wu and trained (CS-GT) selectors]
5. Outstanding Issues

Better object/event separation:
- parametric models
- spatial information?
- computational auditory scene analysis... (Barker et al. 05)

Large-scale analysis

Integration with video

[Figure: grouping cues: frequency proximity, harmonicity, common onset]
Audio-Visual Atoms (Jiang et al. 09)

Object-related features from both audio (transients) & video (patches).

[Figure: joint audio-visual atom discovery: visual atoms tracked over time, aligned with audio atoms extracted from the audio spectrum]
Audio-Visual Atoms

Multi-instance learning of A-V co-occurrences:
- N audio atoms x M visual atoms -> codebook learning -> joint A-V codebook of M x N A-V atoms
Audio-Visual Atoms (examples)

Audio only / Visual only / Visual + Audio:
- Wedding: black suit + romantic music
- Parade: marching people + parade sound
- Beach: sand + beach sounds
Summary

- Machine Listening: getting useful information from sound
- Background sound classification ... from whole-clip statistics?
- Foreground event recognition ... by focusing on peak-energy patches
- Speech content is very important ... separate with pitch, models, ...
References

- Jon Barker, Martin Cooke, & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
- Courtenay Cotton, Dan Ellis, & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
- Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
- Dan Ellis, Xiaohong Zheng, & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
- Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
- Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
- Keansub Lee, Dan Ellis, & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
- Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
- Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
- Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
- Ron Weiss, Michael Mandel, & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
- Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.