BEE, 4th Year
Department of Electrical Engineering
Jadavpur University
Kolkata 700032
Abstract: Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the information extracted from one modality to improve the recognition ability of the other modality by complementing the missing information. Speaker recognition models that use neural networks to map utterances into a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between speakers in the training set and unseen speakers. Here we employ time- and frequency-domain techniques to investigate the features of speech segments, which are then fed to a VGG model with a classification loss to estimate the speaker of each segment.

Index Terms—Convolutional Networks, Deep Learning, Audio-visual Recognition.

Figure 1: Architecture of the speaker recognition model

1. Introduction:

Speaker recognition under noisy and unconstrained conditions is an extremely challenging topic. Applications of speaker recognition are many and varied, ranging from authentication in high-security systems and forensic tests to searching for persons in large corpora of speech data. All such tasks require high speaker recognition performance under ‘real-world’ conditions. This is an extremely difficult task due to both extrinsic and intrinsic variations: extrinsic variations include background chatter and music, laughter, reverberation, and channel and microphone effects, while intrinsic variations are factors inherent to the speakers themselves, such as age, accent, emotion, intonation and manner of speaking, amongst others.

Deep Convolutional Neural Networks (CNNs) have given rise to substantial improvements in speech recognition, computer vision and related fields due to their ability to deal with real-world, noisy datasets without the need for handcrafted features. One of the most important ingredients for the success of such methods, however, is the availability of large training datasets. Speaker recognition from only a few short utterances is therefore a challenging problem. Many approaches have recently been proposed for speaker identification using deep neural networks, but they assume the availability of a large amount of training data. Moreover, the deep learning pipelines attempting to learn from short utterances are, in general, based on i-vectors or Mel-frequency cepstral coefficients (MFCCs). While MFCCs are known for their susceptibility to noise, the performance of i-vectors tends to suffer in the case of short utterances. It has been shown that convolutional neural networks (CNNs) are able to mitigate the noise susceptibility of i-vectors and MFCCs, and they have been used successfully for speaker recognition. Since CNNs are data-hungry and able to exploit structured information such as images very effectively, large-scale speaker recognition datasets have recently been made publicly available, on which benchmarks built on CNNs with spectrograms as input have been shown to perform very well. However, their effectiveness
to learn or generalize from a limited amount of data (few shots) and short utterances is not well established. Few-shot learning paradigms have recently been applied effectively to audio processing. However, their effectiveness for speech processing, and for speaker recognition in particular, is still unknown. We choose VGGNet as the base architecture and evaluate it under various settings. While CNNs have been shown to perform very well, they are not able to fully exploit the spatial relationships within an image; in particular, they cannot leverage the spatial information within spectrograms, such as the relationship between pitch and formants. We therefore utilize VGG networks for speech command recognition. However, there are two problems in applying a VGG network to speaker recognition. First, its applicability to complex data, and hence its generalization ability, is yet to be established. Second, it is extremely computationally intensive. The feature vectors projected from the embedding space are then subjected to an L2 loss to learn from the constrained data. The entire pipeline is trained end-to-end.

2. Related Work:

Phonemes are combined to create a spoken word, and a viseme is the corresponding visual equivalent of a phoneme. For recognizing full words, a Long Short-Term Memory (LSTM) classifier with Discrete Cosine Transform (DCT) and Deep Bottleneck Features (DBF) is trained [14]. Similarly, an LSTM with Histogram of Oriented Gradients (HOG) features is also used for phrase recognition [15]. One of the most challenging applications of audio-visual recognition is audio-video synchronization, for which audio-visual matching ability is required. The research efforts related to this paper span different audio-visual recognition tasks that try to find the correspondence between the two modalities. Different approaches have been employed for tackling the audio-visual matching problem: some are based on data-driven approaches, such as using DNN classifiers to determine the off-sync time [16], [17], while others are based on Canonical Correlation Analysis (CCA) [18] and Co-Inertia Analysis (CoIA).
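To make the CCA-based audio-visual matching idea mentioned above concrete, a minimal NumPy sketch of classical two-view CCA follows. This is an illustration only, not the implementation used in [18]: the function name, feature dimensions, regularization constant, and the synthetic "audio" and "visual" features are our own assumptions.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Classical two-view CCA: whiten each view, then take the SVD of the
    whitened cross-covariance. Returns projection matrices for the first
    view (A) and the second view (B), plus the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(K)
    A = inv_sqrt(Cxx) @ U[:, :k]   # audio-side projection
    B = inv_sqrt(Cyy) @ Vt[:k].T   # visual-side projection
    return A, B, s[:k]             # s: canonical correlations

# Synthetic demo: two views driven by a shared 2-D latent signal,
# standing in for audio and visual feature vectors of the same clip.
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))                                   # shared latent
Xa = z @ rng.standard_normal((2, 10)) + 0.05 * rng.standard_normal((500, 10))
Yv = z @ rng.standard_normal((2, 8)) + 0.05 * rng.standard_normal((500, 8))
A, B, s = cca(Xa, Yv, k=2)
# In-sync audio/visual pairs share the latent signal, so the top
# canonical correlations come out close to 1; off-sync pairs score lower.
```

For synchronization, one would compare the canonical correlations (or the similarity of the projected features) between candidate audio/visual alignments and pick the offset that scores highest.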
3. Methodology: