
Listen and Look: Audio–Visual Matching Assisted Speech

Source Separation

Seminar report 2

Submitted by
Mane Pooja
(M190442EC)

MASTER OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
(Signal Processing)

DEPARTMENT OF ELECTRONICS AND COMMUNICATION


ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, CALICUT
NIT CAMPUS P.O., CALICUT
KERALA, INDIA 673601.
Contents

1 Introduction                                                              1
  1.1 Prior work: Audio-only Deep Approach . . . . . . . . . . . . . . . .  2

2 Proposed method: Audio–Visual Matching Assisted Speech Source Separation  3
  2.1 Methodology: . . . . . . . . . . . . . . . . . . . . . . . . . . . .  3
  2.2 AV Match Architecture: . . . . . . . . . . . . . . . . . . . . . . .  4
      2.2.1 Audio-Network . . . . . . . . . . . . . . . . . . . . . . . .  4
      2.2.2 Visual-Network . . . . . . . . . . . . . . . . . . . . . . . .  4
      2.2.3 Mask Correction using AV Match . . . . . . . . . . . . . . . .  5

3 Experiments                                                               6
  3.1 Dataset and Preprocessing . . . . . . . . . . . . . . . . . . . . . .  6

4 Results                                                                   8

5 Conclusion                                                               10
Abstract

The main aim of this paper is to separate individual voices from an audio mixture of
multiple simultaneous talkers. Earlier audio-only approaches do not give satisfactory
results and cannot resolve the source permutation problem. To eliminate source
permutation and achieve better speech source separation, we adopt a new model, AV
Match, which uses both audio and visual embeddings.

The proposed AV Match model computes audio embeddings with an audio network and
visual embeddings with a visual network, and measures the similarities between the
separated audio streams and the visual streams. Experimental results show that AV
Match outperforms the audio-only Deep Clustering separation method and alleviates the
source permutation problem to a great extent.
Chapter 1

Introduction

Speech Separation: The main aim of speech separation is to separate individual voices
from an audio mixture of simultaneous talkers. A major issue in this process is source
permutation, i.e., assigning separated signal snippets to the wrong sources over time.
Here we present a new AV Match model which uses both audio and visual cues for the
speech separation process.
Since there is a correlation between lip movements and the acoustic signal, the model
is designed to leverage both audio and visual cues for better speech separation. The
AV Match model computes audio and visual embeddings using an audio network and a
visual network, respectively. We compute the similarities of the audio and visual
embeddings by inner product and correct the predicted masks using an indicator
function.
The AV Match model alleviates the source permutation problems that audio-only methods
suffer from, particularly on same-gender speech mixtures.

Figure 1.1: Speech Separation Flowchart

1.1 Prior work: Audio-only Deep Approach


Deep Clustering (DC) is an audio-only speech separation model. Its aim is to learn an
embedding vector for each time-frequency (T-F) bin of the speech mixture. These
embeddings are then clustered to obtain a T-F mask for each speaker.
The source permutation problem occurs mainly when the speakers are of the same gender,
i.e., male-male (M-M) or female-female (F-F).
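To make the clustering step concrete, below is a minimal sketch (not the original
implementation) of how per-bin embeddings could be turned into binary speaker masks
with k-means; `embeddings` is a hypothetical (T*F, D) array produced by the DC
embedding network, and the two-speaker setting matches the experiments in Chapter 3.

    import numpy as np
    from sklearn.cluster import KMeans

    def dc_masks(embeddings: np.ndarray, T: int, F: int, n_speakers: int = 2) -> np.ndarray:
        """Cluster T-F bin embeddings and return binary masks of shape (n_speakers, T, F)."""
        # Assign each T-F bin to one of the speaker clusters.
        labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)  # (T*F,)
        labels = labels.reshape(T, F)
        # One binary mask per speaker.
        return np.stack([(labels == k).astype(np.float32) for k in range(n_speakers)])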

Chapter 2

Proposed method: Audio–Visual


Matching Assisted Speech Source
Separation

The AV Match model computes both audio and visual embeddings, using an Audio Network
and a Visual Network respectively, to achieve better speech separation.

2.1 Methodology:
The audio-visual mixture is fed to both the audio stream and the visual stream. The
log-magnitude STFT is given as input to the audio-only separation model (Deep
Clustering), which predicts the masks; its output is a D-dimensional embedding vector
for each T-F bin, constituting an embedding matrix.
Visual data, in the form of optical flow and gray images of the lip regions, are fed
to the AV Match model. If the masks predicted by the DC model suffer from source
permutation, the AV Match model corrects the predicted masks using audio-visual
similarities.
The AV Match model thus considers both audio and visual data: it computes audio and
visual embeddings using the audio and visual networks respectively, measures their
similarities by inner product, and corrects the predicted masks using an indicator
function. A high-level sketch of this pipeline is given below.
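The following is a high-level sketch of the correction pipeline under stated
assumptions: `audio_net` and `visual_net` are hypothetical stand-ins for the networks
of Fig. 2.2, `masks` holds the two masks predicted by the DC model, and all shapes are
illustrative.

    import numpy as np

    def av_match_correct(mixture_spec, masks, gray_frames, flow_frames, audio_net, visual_net):
        """Swap the two predicted masks whenever the swapped order matches the visual streams better.

        mixture_spec: (T, F) spectrogram; masks: (2, T, F) DC masks;
        audio_net / visual_net return (T, D) embeddings.
        """
        # Audio embeddings for each masked mixture stream: (2, T, D).
        a = np.stack([audio_net(mask * mixture_spec) for mask in masks])
        # Visual embeddings for each speaker's lip region: (2, T, D).
        v = np.stack([visual_net(g, f) for g, f in zip(gray_frames, flow_frames)])

        # Frame-wise inner-product similarities of the original and the swapped assignment.
        sim_keep = (a[0] * v[0]).sum(-1) + (a[1] * v[1]).sum(-1)   # (T,)
        sim_swap = (a[0] * v[1]).sum(-1) + (a[1] * v[0]).sum(-1)   # (T,)

        # Indicator per frame: True means keep the predicted order (smoothing, Sec. 2.2.3, omitted).
        indicator = sim_keep >= sim_swap
        corrected = np.where(indicator[None, :, None], masks, masks[::-1])
        return corrected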

Figure 2.1: AV Match speech separation

2.2 AV Match Architecture:

Figure 2.2: AV Match architecture

2.2.1 Audio-Network
Considering the top portion of Fig. 2.2, the t-th frame of the element-wise product of
the mixture spectrogram X and a mask Y is fed to fully connected layers with output
sizes of 256, 128, and D, each followed by a ReLU nonlinearity, to obtain a
D-dimensional embedding vector for t = 1, 2, ..., T.
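A minimal PyTorch sketch of this branch is given below; the 256/128/D layer sizes
follow the text, while the frame-wise input of F frequency bins and everything else
are assumptions rather than the authors' exact implementation.

    import torch
    import torch.nn as nn

    class AudioNet(nn.Module):
        """Frame-wise audio embedding branch: F -> 256 -> 128 -> D, each with ReLU."""

        def __init__(self, n_freq_bins: int, emb_dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_freq_bins, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
                nn.Linear(128, emb_dim), nn.ReLU(),
            )

        def forward(self, masked_frames: torch.Tensor) -> torch.Tensor:
            # masked_frames: (T, n_freq_bins) frames of mask * spectrogram -> (T, emb_dim).
            return self.net(masked_frames)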

2.2.2 Visual-Network
The bottom portion of Fig. 2.2 shows the visual network. Here we use a VGG-style CNN
as the feed-forward network, which captures local invariant features of the lip
regions. N consecutive gray images (H x W x N, N = 3) and optical-flow data
(H x W x N, N = 2) of the lip regions are fed as input to the early stages, i.e., the
conv1-1, conv1-2, conv2-1, and conv2-2 layers. The outputs of these layers are
concatenated and fed to the conv3 and conv4 layers. All convolutional layers have
3 x 3 filters with stride 1, followed by batch normalisation and ReLU nonlinearity.
Max-pooling layers with a kernel size of 2 and a stride of 2 follow conv1-2, conv2-2,
conv3, and conv4. An fc-in layer with an output size of 128 and a ReLU nonlinearity
comes next. On top of this feed-forward network we add a single bidirectional long
short-term memory (BLSTM) layer with a hidden size of 256 to model the contextual
information. The output of the BLSTM layer is fed to an fc-out layer with ReLU to
obtain the visual embeddings. We compute the similarities of the audio and visual
embeddings by inner product; the similarities obtained with the predicted masks are
used for the mask correction described in Section 2.2.3.

We exploit the relative similarity of the audio and visual streams by applying a
triplet loss for training: the matching audio-visual pair should be more similar than
the mismatched pair by at least a margin m, which we set to m = 1 empirically.
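A hedged sketch of such a triplet loss on inner-product similarities is shown below;
variable names are illustrative and only the margin m = 1 comes from the text.

    import torch

    def av_triplet_loss(a_emb: torch.Tensor, v_pos: torch.Tensor, v_neg: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
        """a_emb, v_pos, v_neg: (T, D) audio, matching visual, and mismatched visual embeddings."""
        s_pos = (a_emb * v_pos).sum(dim=-1)   # similarity with the correct speaker
        s_neg = (a_emb * v_neg).sum(dim=-1)   # similarity with the other speaker
        # Hinge: penalise frames where the correct pair does not win by at least the margin.
        return torch.clamp(margin - s_pos + s_neg, min=0.0).mean()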

2.2.3 Mask Correction using AV Match


An indicator function decides whether the masks predicted by the audio-only model
should be permuted, by comparing the audio-visual similarities of the two possible
speaker assignments. An indicator value of 1 implies that the predicted masks are in
the correct order; otherwise we swap the masks to correct the prediction. Since the
frame-wise indicator can be noisy, we apply a median filter to smooth the results.
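Below is a sketch of this smoothing step under simple assumptions: the frame-wise
keep/swap decision is derived from the similarities of the two assignments and then
median-filtered; the filter length of 35 frames is one of the values reported in
Chapter 4.

    import numpy as np
    from scipy.signal import medfilt

    def smooth_indicator(sim_keep: np.ndarray, sim_swap: np.ndarray, kernel: int = 35) -> np.ndarray:
        """Return a boolean (T,) array: True means keep the predicted mask order."""
        raw = (sim_keep >= sim_swap).astype(np.float32)   # noisy frame-wise indicator
        return medfilt(raw, kernel_size=kernel) > 0.5     # median-filtered decision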

Chapter 3

Experiments

We carried out experiments on 2-speaker mixtures to show the performance improvement
of the AV Match model compared with previous audio-only methods. We also show that
the proposed AV Match model performs well for same-gender mixtures compared to DC
audio-only separation.

3.1 Dataset and Preprocessing


We make use of 2-speaker mixtures from the GRID and WSJ0 datasets to show the
effectiveness of the AV Match model.

WSJ0 Dataset:
WSJ0 is an audio-only dataset. From it we generate:
Training set: 30 h
Validation set: 10 h
Test set: 5 h
This dataset is used to pretrain the audio-only model in some experiments.

GRID Dataset:
It provides both audio and visual data.
Number of speakers: 34 (17 males and 15 females).
Duration of each video: 3 seconds at 25 fps.
Image size: 726 x 576 (height x width).
Validation set: 3 males and 3 females, 2.5 hours.
Test set: 3 males and 3 females, 2.5 hours.

GRID-Extreme Dataset:
It also provides both audio and visual data.
We select the same testing speakers as in the GRID dataset. This dataset is used to
show the benefits of the proposed AV Match method in challenging situations.

Preprocessing and setup:
All audio recordings are downsampled to 8 kHz. The STFT uses a window size of 32 ms
and a hop size of 8 ms. The linear-amplitude spectrogram (X) and the log-amplitude
spectrogram (Xlog) are fed into the audio-visual and the audio-only networks,
respectively. Lip regions are detected with the Dlib library. A stack of three
consecutive gray images (N = 3) achieves the best performance. We set D = 128 as the
dimension of the audio-visual embedding space to reduce the dimensionality of the
outputs. All models are trained with Adam at a learning rate of 0.001, and training is
stopped if the validation loss does not decrease for five consecutive epochs.
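A sketch of this audio preprocessing, assuming the librosa library (not specified in
the text), is given below; the file path and variable names are illustrative.

    import numpy as np
    import librosa

    def preprocess(path: str, sr: int = 8000):
        """Resample to 8 kHz, take a 32 ms / 8 ms STFT, return linear and log magnitudes."""
        y, _ = librosa.load(path, sr=sr)                         # downsample to 8 kHz
        win, hop = int(0.032 * sr), int(0.008 * sr)              # 256-sample window, 64-sample hop
        stft = librosa.stft(y, n_fft=win, hop_length=hop, win_length=win)
        X = np.abs(stft)                                         # linear magnitude for the AV network
        X_log = np.log(X + 1e-8)                                 # log magnitude for the audio-only DC model
        return X, X_log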

Chapter 4

Results

GRID means the DC model is trained on the GRID dataset.
GRID+WSJ0 means the DC model is pretrained on WSJ0 and fine-tuned on GRID.
The superscript AV denotes results obtained with our proposed audio-visual method.
* denotes results with the optimal permutation.
The subscripts 35 and 115 indicate the lengths of the median filters (in audio time
frames), which are chosen empirically.

Figure 4.1: Comparison of separation quality on the GRID dataset with other models

It is clear that the AV Match method improves separation quality by a large margin on
the GRID dataset. We achieve an improvement of 0.75 dB in overall SDR; for same-gender
mixtures the improvement is 2.17 dB for F-F and 0.57 dB for M-M. This is because
same-gender mixtures have similar vocal characteristics, which makes them difficult
for an audio-only separation model. The AV Match method also resolves the source
permutation problem to a greater extent for same-gender mixtures.

When the DC model is pretrained on the WSJ0 dataset, it already achieves a high
overall SDR of 9.83 dB, which limits the improvement of our proposed AV Match approach
(9.88 dB). Nevertheless, the proposed method still matters in cases where the
audio-only method fails.

We can see that as the performance of the DC model degrades, the improvement due to
the AV Match model becomes more pronounced, reaching a 4.6 dB improvement in SDR when
the DC model's performance is 2.5 dB.

Figure 4.2: Improvement on SDR of AV Match approach against the DC baseline.

Results on GRID-Extreme Dataset

Fig. 4.3 shows the separation quality on the GRID-Extreme dataset in terms of SDR. We
obtain overall improvements of 1.65 and 1.68 dB over the DC model in the two cases.
These results indicate that when the audio-only model fails to separate speech
mixtures properly, the proposed AV Match method still maintains stable performance.

Figure 4.3: SDR on GRID-EXTREME Dataset

Chapter 5

Conclusion

The proposed AV Match model successfully corrects the permutation problems in the
masks predicted by the audio-only separation model.
The AV Match method is most effective when audio-only separation is poor.
The training procedure of the AV Match model is independent of the audio-only
separation model, allowing it to be combined with any mask-estimation-based audio-only
separation method.
The proposed method outperforms the audio-only Deep Clustering model, giving better
speech separation.

Bibliography

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discrimina-
    tive embeddings for segmentation and separation,” in Proc. 41st IEEE International
    Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[2] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single channel
    multi-speaker separation using deep clustering,” in Proc. Interspeech, 2016, pp.
    545–549.

[3] A. Torfi, S. M. Iranmanesh, N. Nasrabadi, and J. Dawson, “3D convolutional neural
    networks for cross audio-visual matching recognition,” IEEE Access, vol. 5, pp.
    22081–22091, 2017.

[4] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” in Proc.
    Interspeech, 2018, pp. 1170–1174.

