
Cocktail Party Problem

as Binary Classification
DeLiang Wang

Perception & Neurodynamics Lab
Ohio State University
Outline of presentation
Cocktail party problem
Computational theory analysis
Ideal binary mask
Speech intelligibility tests
Unvoiced speech segregation as binary classification
Real-world audition
What?
Speech
message
speaker
age, gender, linguistic origin, mood, etc.
Music
Car passing by
Where?
Left, right, up, down
How close?
Channel characteristics
Environment characteristics
Room reverberation
Ambient noise

Sources of intrusion and distortion
Additive noise from other sound sources
Reverberation from surface reflections
Cocktail party problem
Term coined by Cherry
"One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it the cocktail party problem." (Cherry'57)
In cocktail party-like situations where all voices are equally loud, speech remains intelligible for normal-hearing listeners even with as many as six interfering talkers (Bronkhorst & Plomp'92)
"Ball-room problem" by Helmholtz: "complicated beyond conception" (Helmholtz, 1863)
Speech segregation problem
Approaches to Speech Segregation Problem
Speech enhancement
Enhance signal-to-noise ratio (SNR) or speech quality by
attenuating interference. Applicable to monaural recordings
Limitation: assumes the interference is stationary enough to be estimated
Spatial filtering (beamforming)
Extract target sound from a specific spatial direction with a sensor
array
Limitation: Configuration stationarity. What if the target switches
or changes location?
Independent component analysis (ICA)
Find a demixing matrix from mixtures of sound sources
Limitation: strong assumptions, chief among them stationarity of the mixing matrix
"No machine has yet been constructed to do just that [solving the cocktail party problem]." (Cherry'57)


Auditory scene analysis
Listeners parse the complex mixture of sounds arriving
at the ears in order to form a mental representation of
each sound source
This perceptual process is called auditory scene analysis (Bregman'90)
Two conceptual processes of auditory scene analysis
(ASA):
Segmentation. Decompose the acoustic mixture into sensory
elements (segments)
Grouping. Combine segments into groups, so that segments in the
same group likely originate from the same environmental source
Computational auditory scene analysis
Computational auditory scene analysis (CASA)
approaches sound separation based on ASA principles
Feature-based approaches
Model-based approaches
Outline of presentation
Cocktail party problem
Computational theory analysis
Ideal binary mask
Speech intelligibility tests
Unvoiced speech segregation as binary classification
What is the goal of CASA?
What is the goal of perception?
"The perceptual systems are ways of seeking and extracting information about the environment from sensory input" (Gibson'66)
"The purpose of vision is to produce a visual description of the environment for the viewer" (Marr'82)
By analogy, the purpose of audition is to produce an auditory description of the environment for the listener
What is the computational goal of ASA?
The goal of ASA is to segregate sound mixtures into separate perceptual representations (or auditory streams), each of which corresponds to an acoustic event (Bregman'90)
By extrapolation, the goal of CASA is to develop computational systems that extract individual streams from sound mixtures
Marrian three-level analysis
According to Marr (1982), a complex information processing system must be understood at three levels
Computational theory: goal, its appropriateness, and basic processing
strategy
Representation and algorithm: representations of input and output
and transformation algorithms
Implementation: physical realization
All levels of explanation are required for eventual
understanding of perceptual information processing
Computational-theory analysis, i.e. understanding the character of the problem, is critically important
Computational-theory analysis of ASA
To form a stream, a sound must be audible on its own
The number of streams that can be computed at a time is
limited
"Magical number 4" for simple sounds such as tones and vowels (Cowan'01)?
1+1, or figure-ground segregation, in noisy environment such as a
cocktail party?
Auditory masking further constrains the ASA output
Within a critical band a stronger signal masks a weaker one
Computational-theory analysis of ASA (cont.)
ASA outcome depends on sound types (overall SNR is 0 dB)
Noise-Noise: pink, white, pink+white (audio demos)
Tone-Tone: tone1, tone2, tone1+tone2 (audio demos)
Speech-Speech, Noise-Tone, Noise-Speech, Tone-Speech (audio demos)
Some alternative CASA goals
Extract all underlying sound sources or the target sound
source (the gold standard)
Implicit in speech enhancement, spatial filtering, and ICA
Segregating all sources is perceptually implausible, and probably unrealistic with one or two microphones
Enhance automatic speech recognition (ASR)
Close coupling with a primary motivation of speech segregation
"Perceiving is more than recognizing" (Treisman'99)
Enhance human listening
Advantage: close coupling with auditory perception
There are applications that involve no human listening
Ideal binary mask as CASA goal
Motivated by the above analysis, we have suggested the ideal binary mask as a main goal of CASA (Hu & Wang'01, '04)
Key idea is to retain parts of a mixture where the target sound is
stronger than the acoustic background, and discard the rest
What a target is depends on intention, attention, etc.
The definition of the ideal binary mask (IBM):

$$\mathrm{IBM}(t,f)=\begin{cases}1 & \text{if } 10\log_{10}\dfrac{s(t,f)}{n(t,f)} > \theta\\ 0 & \text{otherwise}\end{cases}$$

s(t, f): target energy in unit (t, f)
n(t, f): noise energy
θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
It does not actually separate the mixture!
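As a concrete illustration (a minimal sketch, not the authors' code), the IBM can be computed directly when the premixed target and noise are available; here STFT energies stand in for the T-F units of an auditory filterbank, and all names and parameter values are mine:

```python
import numpy as np
from scipy.signal import stft

def ideal_binary_mask(target, noise, fs, lc_db=0.0):
    """Compute the IBM from premixed target and noise signals.

    A T-F unit is labeled 1 when its local SNR exceeds the
    criterion LC (in dB), and 0 otherwise.
    """
    # STFT energies stand in for cochleagram T-F units in this sketch.
    _, _, S = stft(target, fs=fs, nperseg=320, noverlap=160)
    _, _, N = stft(noise, fs=fs, nperseg=320, noverlap=160)
    s_energy = np.abs(S) ** 2
    n_energy = np.abs(N) ** 2
    eps = np.finfo(float).tiny  # avoid division by and log of zero
    local_snr_db = 10.0 * np.log10((s_energy + eps) / (n_energy + eps))
    return (local_snr_db > lc_db).astype(np.uint8)

# Usage: apply the mask to the mixture's T-F representation, then resynthesize.
# mask = ideal_binary_mask(speech, babble, fs=16000, lc_db=0.0)
```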
IBM illustration
Properties of IBM
Flexibility: with the same mixture, the definition leads to different IBMs depending on what the target is
Well-definedness: IBM is well-defined no matter how many
intrusions are in the scene or how many targets need to be
segregated
Consistent with computational-theory analysis of ASA
Audibility and capacity
Auditory masking
Effects of target and noise types
Optimality: under certain conditions the ideal binary mask with θ = 0 dB is the optimal binary mask from the perspective of SNR gain (see the sketch after this list)
The ideal binary mask provides an excellent front-end for robust
ASR
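A sketch of the optimality argument, under the assumption that the output SNR counts discarded target energy as error and that target and noise energies add within each T-F unit:

$$\mathrm{SNR}_{\mathrm{out}}(M)=10\log_{10}\frac{\sum_{t,f}s(t,f)}{\sum_{t,f}\bigl[(1-M(t,f))\,s(t,f)+M(t,f)\,n(t,f)\bigr]}$$

The numerator does not depend on the mask M, and each unit contributes to the denominator independently, so the denominator is minimized by setting M(t,f) = 1 exactly when n(t,f) < s(t,f), i.e. when the local SNR exceeds 0 dB, which is the IBM with θ = 0 dB.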
Subject tests of ideal binary masking
Recent studies found large speech intelligibility improvements by applying ideal binary masking for normal-hearing (Brungart et al.'06; Li & Loizou'08) and hearing-impaired (Anzalone et al.'06; Wang et al.'09) listeners
Improvement for stationary noise is above 7 dB for normal-hearing
(NH) listeners, and above 9 dB for hearing-impaired (HI) listeners
Improvement for modulated noise is significantly larger than for
stationary noise

Test conditions of Wang et al.'09
SSN: unprocessed monaural mixtures of speech-shaped noise (SSN) and Dantale II sentences (demos at 0 and -10 dB)
CAF: unprocessed monaural mixtures of cafeteria noise (CAF) and Dantale II sentences (demos at 0 and -10 dB)
SSN-IBM: the IBM applied to SSN (demos at 0, -10, and -20 dB)
CAF-IBM: the IBM applied to CAF (demos at 0, -10, and -20 dB)
Intelligibility is measured in terms of the speech reception threshold (SRT), the SNR required for a 50% intelligibility score
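For concreteness, one simple way to read an SRT off measured intelligibility scores is to interpolate to the 50% point; the adaptive procedure actually used in the study is not specified here, so this is only an illustrative sketch with hypothetical data:

```python
import numpy as np

def srt_from_scores(snr_db, pct_correct, target=50.0):
    """Interpolate the SNR at which intelligibility crosses `target`%.

    Assumes scores increase monotonically with SNR; a real SRT test
    would typically use an adaptive up-down procedure instead.
    """
    snr = np.asarray(snr_db, dtype=float)
    pct = np.asarray(pct_correct, dtype=float)
    order = np.argsort(snr)  # sort by SNR so interpolation is valid
    return float(np.interp(target, pct[order], snr[order]))

# Hypothetical psychometric data: the SRT here is -10 dB.
print(srt_from_scores([-20, -15, -10, -5, 0], [5, 20, 50, 85, 98]))
```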

Wang et al.'s results
12 NH subjects (10 male, 2 female) and 12 HI subjects (9 male, 3 female)
SRT means for the four conditions (SSN, CAF, SSN-IBM, CAF-IBM) for NH listeners: (-8.2, -10.3, -15.6, -20.7) dB
SRT means for the four conditions for HI listeners: (-5.6, -3.8, -14.8, -19.4) dB

[Figure: Dantale II SRT (dB) for NH and HI listeners in the four conditions: SSN, CAF, SSN-IBM, CAF-IBM]
Speech perception of noise with binary gains
Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e., the mixture contains noise only, with no target speech)
[Figure: cochleagram-style panels (center frequency 55-7743 Hz vs. time 0-2 s), a 32-channel binary gain pattern (channel number vs. time), and a 0-96 dB intensity scale]
Wang et al.'08 results
Despite a great reduction of spectrotemporal information, a pattern
of binary gains is apparently sufficient for human speech recognition
Mean scores for the four channel conditions: (97.1%, 92.9%, 54.3%, 7.6%)
[Figure: percent correct (0-100%) vs. number of channels (4, 8, 16, 32)]
Interim summary
The ideal binary mask is an appropriate computational goal
of auditory scene analysis in general, and speech
segregation in particular
Hence solving the cocktail party problem would amount
to binary classification
This formulation opens the problem to a variety of pattern
classification methods
Outline of presentation
Cocktail party problem
Computational theory analysis
Ideal binary mask
Speech intelligibility tests
Unvoiced speech segregation as binary classification
Unvoiced speech
Speech sounds consist of vowels and consonants;
consonants further consist of voiced and unvoiced
consonants
For English, unvoiced speech sounds come from the
following consonant categories:
Stops (plosives)
Unvoiced: /p/ (pool), /t/ (tool), and /k/ (cake)
Voiced: /b/ (book), /d/ (day), and /g/ (gate)
Fricatives
Unvoiced: /s/ (six), /sh/ (sheep), /f/ (fix), and /th/ (think)
Voiced: /z/ (zoo), /zh/ (pleasure), /v/ (vine), and /dh/ (that)
Mixed: /h/ (high)
Affricates (stop followed by fricative)
Unvoiced: /ch/ (chicken)
Voiced: /jh/ (orange)
We refer to the above consonants as expanded obstruents

Unvoiced speech segregation
Unvoiced speech constitutes 20-25% of all speech sounds
It carries crucial information for speech intelligibility
Unvoiced speech is more difficult to segregate than
voiced speech
Voiced speech is highly structured, whereas unvoiced speech lacks
harmonicity and is often noise-like
Unvoiced speech is usually much weaker than voiced speech and
therefore more susceptible to interference
Processing stages of the Hu-Wang'08 model


Mixture → Auditory periphery → Segmentation → Grouping → Segregated speech

Peripheral processing results in a two-dimensional cochleagram; a sketch of cochleagram computation follows
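A cochleagram is typically computed by passing the signal through a gammatone filterbank with center frequencies spaced on the ERB-rate scale, then taking frame energies per channel. Below is a minimal sketch of that computation; the parameter choices (64 channels, 20 ms frames, 10 ms shift) are common defaults, not necessarily those of the model:

```python
import numpy as np

def erb(cf):
    # Equivalent rectangular bandwidth (Glasberg & Moore).
    return 24.7 * (4.37 * cf / 1000.0 + 1.0)

def gammatone_ir(cf, fs, dur=0.05, order=4):
    # Time-domain impulse response of a gammatone filter at center freq cf.
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * erb(cf)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)

def cochleagram(x, fs, n_chan=64, f_lo=50.0, f_hi=8000.0,
                frame=0.020, shift=0.010):
    # Center frequencies equally spaced on the ERB-rate scale.
    erb_lo = 21.4 * np.log10(4.37e-3 * f_lo + 1)
    erb_hi = 21.4 * np.log10(4.37e-3 * f_hi + 1)
    cfs = (10 ** (np.linspace(erb_lo, erb_hi, n_chan) / 21.4) - 1) / 4.37e-3
    flen, fshift = int(frame * fs), int(shift * fs)
    n_frames = 1 + (len(x) - flen) // fshift
    cg = np.empty((n_chan, n_frames))
    for c, cf in enumerate(cfs):
        y = np.convolve(x, gammatone_ir(cf, fs), mode="same")
        for m in range(n_frames):
            seg = y[m * fshift : m * fshift + flen]
            cg[c, m] = np.sum(seg ** 2)  # frame energy per channel
    return cg, cfs
```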
Auditory segmentation
Auditory segmentation decomposes an auditory scene into contiguous time-frequency (T-F) regions (segments), each of which should contain signal mostly from the same sound source
The definition of segmentation applies to both voiced and unvoiced
speech
This is equivalent to identifying onsets and offsets of
individual T-F segments, which correspond to sudden
changes of acoustic energy
Our segmentation is based on a multiscale onset/offset analysis (Hu & Wang'07); a single-scale sketch follows this list
Smoothing along the time and frequency dimensions
Onset/offset detection and onset/offset front matching
Multiscale integration
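A minimal single-scale sketch of the onset/offset detection step, assuming a log-intensity cochleagram: smooth along time at scale sigma, then mark onsets (offsets) at local maxima (minima) of the temporal derivative that exceed a threshold. The threshold value, the cross-channel front matching, and the multiscale integration of Hu & Wang'07 are omitted; names and values here are my assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def onsets_offsets(cg_db, sigma_time=6.0, theta=1.0):
    """Detect onset/offset candidates at one smoothing scale.

    cg_db: (channels, frames) log intensity
    sigma_time: Gaussian smoothing scale along time, in frames
    theta: derivative threshold (assumed value, not from the paper)
    """
    smoothed = gaussian_filter1d(cg_db, sigma_time, axis=1)
    d = np.gradient(smoothed, axis=1)  # temporal derivative of intensity
    onsets, offsets = [], []
    for c in range(d.shape[0]):
        for m in range(1, d.shape[1] - 1):
            # Local maximum of the derivative above threshold -> onset.
            if d[c, m] > theta and d[c, m] >= d[c, m - 1] and d[c, m] >= d[c, m + 1]:
                onsets.append((c, m))
            # Local minimum below -threshold -> offset.
            if d[c, m] < -theta and d[c, m] <= d[c, m - 1] and d[c, m] <= d[c, m + 1]:
                offsets.append((c, m))
    return onsets, offsets
```

Running this at several sigma values and merging the resulting segment boundaries would correspond to the multiscale integration step.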

Smoothed intensity
[Figure: four panels (a)-(d) of smoothed intensity, frequency 50-8000 Hz vs. time 0-2.5 s]
Utterance: That noise problem grows more annoying each day
Interference: Crowd noise in a playground. Mixed at 0 dB SNR
Scale in freq. and time: (a) (0, 0), initial intensity. (b) (2, 1/14). (c) (6, 1/14). (d) (6, 1/4)
Segmentation result
The bounding contours of
estimated segments from
multiscale analysis. The
background is represented
by blue:
(a) One-scale analysis
(b) Two-scale analysis
(c) Three-scale analysis
(d) Four-scale analysis
(e) The ideal binary mask
(f) The mixture
Grouping
Apply auditory segmentation to generate all segments for
the entire mixture
Segregate voiced speech using an existing algorithm
Identify segments dominated by voiced target using
segregated voiced speech
Identify segments dominated by unvoiced speech based
on speech/nonspeech classification
Nonspeech interference is assumed, due to the lack of sequential organization

Speech/nonspeech classification
A T-F segment s is classified as speech if

$$P(H_0 \mid \mathbf{X}_s) > P(H_1 \mid \mathbf{X}_s)$$

X_s: the energy of all the T-F units within segment s
H_0: the hypothesis that s is dominated by expanded obstruents
H_1: the hypothesis that s is interference-dominant
Speech/nonspeech classification (cont.)
By the Bayes rule, this is equivalent to

$$P(\mathbf{X}_s \mid H_0)\,P(H_0) > P(\mathbf{X}_s \mid H_1)\,P(H_1)$$

Since segments have varied durations, directly evaluating the above likelihoods is computationally infeasible
Instead, we assume that each time frame within a segment is statistically independent given a hypothesis:

$$\frac{P(H_0)}{P(H_1)} \prod_{m} \frac{P(X_s(m) \mid H_0)}{P(X_s(m) \mid H_1)} > 1$$

A multilayer perceptron is trained to distinguish expanded obstruents from nonspeech interference
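A minimal sketch of this decision rule in log-likelihood form, assuming per-frame posteriors from an MLP trained with balanced classes (so the posterior ratio equals the likelihood ratio) and a log prior ratio modeled as linear in the estimated input SNR, as the next slide suggests; the slope and intercept values are placeholders, not from the paper:

```python
import numpy as np

def classify_segment(frame_post_h0, snr_db, a=0.1, b=0.0):
    """Frame-independent likelihood-ratio test for one T-F segment.

    frame_post_h0: per-frame MLP posteriors P(H0 | X_s(m)); with balanced
        training classes, p / (1 - p) equals the likelihood ratio.
    snr_db: estimated input SNR; log P(H0)/P(H1) is modeled as a*snr_db + b
        (a and b are assumed placeholder values, not from the paper).
    Returns True if the segment is classified as expanded obstruents.
    """
    p = np.clip(np.asarray(frame_post_h0, dtype=float), 1e-6, 1 - 1e-6)
    log_lr = np.sum(np.log(p) - np.log(1.0 - p))  # sum over frames m
    log_prior_ratio = a * snr_db + b  # approximately linear in input SNR
    return bool(log_lr + log_prior_ratio > 0.0)

# Usage with hypothetical per-frame posteriors at 0 dB input SNR:
print(classify_segment([0.7, 0.8, 0.6], snr_db=0.0))  # True
```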
Speech/nonspeech classification (cont.)
The prior probability ratio $P(H_0)/P(H_1)$ is found to be approximately linear with respect to the input SNR
Assuming that interference energy does not vary greatly over the duration of an utterance, earlier segregation of voiced speech enables us to estimate the input SNR
Speech/nonspeech classification (cont.)
With estimated input SNR, each segment is then
classified as either expanded obstruents or interference
Segments classified as expanded obstruents join the
segregated voiced speech to produce the final output
Example of segregation
[Figure: five panels, frequency 50-8000 Hz vs. time: (a) clean utterance; (b) mixture (SNR 0 dB); (c) segregated voiced utterance; (d) segregated whole utterance; (e) utterance segregated from the IBM]
Utterance: That noise problem grows more annoying each day
Interference: crowd noise in a playground (IBM: ideal binary mask)
SNR of segregated target
[Figure: (a) overall SNR (dB) and (b) SNR in unvoiced frames (dB) vs. mixture SNR, for the proposed system and spectral subtraction]
Compared to spectral subtraction assuming perfect speech pause detection
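For reference, a sketch of the evaluation metric, assuming the segregated target is resynthesized as a waveform and compared against the clean target; the frame length/shift values and the unvoiced-frame selection are my assumptions:

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of an estimated signal against a clean reference, in dB."""
    ref = np.asarray(reference, dtype=float)
    err = ref - np.asarray(estimate, dtype=float)
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

def snr_in_frames(reference, estimate, frames, flen=320, fshift=160):
    """SNR restricted to selected frames (e.g., unvoiced ones).

    frames: frame indices assumed valid for the signal length.
    """
    idx = np.concatenate([np.arange(m * fshift, m * fshift + flen)
                          for m in frames])
    return snr_db(np.asarray(reference, dtype=float)[idx],
                  np.asarray(estimate, dtype=float)[idx])
```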

Conclusion
Analysis of ideal binary mask as CASA goal
Formulation of the cocktail party problem as binary
classification
Segregation of unvoiced speech based on segment
classification
The proposed model represents the first systematic study of unvoiced speech segregation
Credits
Speech intelligibility tests of IBM: Joint with Ulrik
Kjems, Michael S. Pedersen, Jesper Boldt, and Thomas
Lunner, at Oticon
Unvoiced speech segregation: Joint with Guoning Hu
