You are on page 1of 5

Some Investigations on Speaker Identification from Speech

Segments using Deep Learning


Abhik Pramanik Arka prabha Saha Ankan Kumar Bhunia

BEE,4th Year
Department of Electrical Engineering
Jadavpur University

Kolkata 700032

Abstract: Audio-visual recognition (AVR) has been effects; while intrinsic variations are factors inherent
considered as a solution for speech recognition tasks to the speaker themselves such as age, accent,
when the audio is corrupted, as well as a visual emotion, intonation and manner of speaking,
recognition method used for speaker verification in amongst others. Deep Convolutional Neural
multi-speaker scenarios. The approach of AVR Networks (CNNs) have given rise to substantial
systems is to leverage the extracted information from improvements in speech recognition, computer
one modality to improve the recognition ability of the vision and related fields due to their ability to deal
other modality by complementing the missing
with real world, noisy datasets without the need for
information. Speaker recognition models that utilize
handcrafted features. One of the most important
neural networks to map utterances where distances
ingredients for the success of such methods,
reflect similarity between speakers have driven the
recent progress in the speaker recognition task. however, is the availability of large training datasets
However, there is still a significant performance gap Speaker recognition with only a few and short
between training set and unseen speakers. Here we utterances is a challenging problem. Recently many
have employed certain time and frequency domain approaches have been proposed for speaker
techniques to investigate the features of the speech
identification using deep neural networks but they
segments which is then applied in a VGG model with
assume availability of large amount of training data.
classification loss to estimate the speaker of the
Moreover, the deep learning pipelines attempting to
speech segment.
learn from short utterances are, in general, based on
Index Terms—Convolutional Networks, Deep Learning, i-vectors or Mel-frequency cepstral coefficients
Audio-visual Recognition. (MFCCs). While MFCCs are known for susceptibility to
1. Introduction: noise, the performance of i-vectors tends to suffer in
case of short utterances. Moreover, it has been
Speaker recognition under noisy and unconstrained shown that convolutional neural networks (CNNs) are
conditions is an extremely challenging topic. able to mitigate the noise susceptibility of i-vectors,
Applications of speaker recognition are many and MFCCs and have been successfully used for speaker
varied, ranging from authentication in high-security recognition. Since convolutional neural networks are
systems and forensic tests, to searching for persons in data-hungry, and are able to exploit structured
large corpora of speech data. All such tasks require information such as images very effectively, recently
high speaker recognition performance under ‘real large-scale speaker recognition datasets have been
world’ conditions. This is an extremely difficult task made publicly available where benchmarks setup on
due to both extrinsic and intrinsic variations; extrinsic CNNs with spectrogram as input have been shown to
variations include background chatter and music, perform very well. However, their effectiveness
laughter, reverberation, channel and microphone
Figure 1 Architecture of the speaker recognition model

tolearn or generalize with limited amount of data combined to create a spoken word, and a viseme is its
(few-shots) and short utterances is not established corresponding visual equivalent. For recognizing full
very well. Few shot learning paradigms have recently words, a Long Short-Term Memory (LSTM) classifier
been effectively applied for audio processing. with Discrete Cosine Transform (DCT) and Deep
However, their effectiveness to speech processing Bottleneck Features (DBF) is trained [14]. Similarly,
especially speaker recognition is still unknown. We LSTM with Histogram of Oriented Gradients (HOG)
choose VGGNet as the base architectures and features are used also for phrase recognition [15].
evaluate them under various settings. While CNNs One of the most challenging applications of audio-
have shown to perform very well, but they are not visual recognition is the audio-video synchronization
able to exploit the spatial relationships within an for which the audio-visual matching ability is
image. CNNs are not able to leverage the spatial required. The research efforts related to this paper
information within spectrograms such as between consist of different audio-visual recognition tasks
pitch and formant. Therefore, we utilized VGG which try to find the correspondence between the
networks for speech command recognition. However, two modalities. Different approaches have been
there are two problems in applying VGG network for employed for tackling the audio-visual matching
speaker recognition. First is that their applicability to problem. Some are based on data-driven approaches,
complex data is yet to be established and hence the such as using DNN classifiers to determine the off-
generalization ability. Second, they are extremely sync time [16], [17] and some are based on Canonical
computationally intensive. The projected feature Correlation Analysis (CCA) [18] and Co-Inertia Analysis
vector from the embedding space are then subjected (CoIA).
to L2 loss to learn from the constrained data. The
entire pipeline is trained end-to-end.
3. Methodology:

2. Related Work: Our aim is to move from techniques that require


traditional hand-crafted features, to a CNN
Lip reading and audio-visual speech recognition architecture that can choose the features required for
(AVSR) are highly correlated such that the relevant the task of speaker recognition. This allows us to
information of one modality can improve the minimise the pre-processing of the audio data and
recognition of the other modality in any of the two hence avoid losing valuable information in the
aforementioned applications. DNNs have been process. We create a CNN by modifying an existing
employed for fusing speech and visual modalities for VGG architecture and train it on spectrograms from
audio visual automatic speech recognition (AV-ASR) 50 unique speakers.
[10]. Moreover, in [11], a connectionist HMM system
3.1 Architecture:
is introduced for AVSR and a pre-trained CNN is used
to classify phonemes. Some researchers have used
All audio is first converted to single-channel, 16-bit
CNNs to predict and generate phonemes [12] or
streams at a 16kHz sampling rate for consistency.
visemes [13] without considering the word level Spectrograms are then generated in a sliding window
prediction. Phonemes are the smallest fashion using a hamming window of width 25ms and
distinguishable unit of an audio stream which are step 10ms. This gives spectrograms of size 512 x 300
for 3 seconds of speech. Mean and variance constructed from the raw audio input i.e. no pre-
normalisation is performed on every frequency bin of processing such as silence removal etc. is performed.
the spectrum. No other speech-specific pre-
processing (e.g. silence removal, voice activity 4. Experiments:
detection, or removal of unvoiced speech) is used.
These short time magnitude spectrograms are then First, we will discuss about the dataset we have used
used as input to the CNN. Architecture. Since speaker and some of the salient implementation details. Then,
identification under a closed set can be treated as a we will present our results of our proposed model.
multiple-class classification problem, we base our
architecture on the VGG-M CNN, known for good 4.1 Datasets:
classification performance on image data, with
modifications to adapt to the spectrogram input. The For our experiments, we have used a simple dataset
fully connected fc6 layer of dimension 9 × 8 (support available on Kaggle. It consists of 50 speakers’ audio
in both dimensions) is replaced by two layers – a fully data with length more than 1 hour for each. This
connected layer of 9×1 (support in the frequency dataset can be used for speaker recognition kind of
domain) and an average pool layer with support 1 × problems. This dataset was scraped from YouTube and
n, where n depends on the length of the input speech Librivox. Each speaker has 80 – 90 tracks. For the
segment (for example for a 3 second segment, n = 8). training purpose we have used different numbers of
This makes the network invariant to temporal tracks i.e., 10, 20, 30, 40, 50 tracks per speaker and the
position but not frequency, and at the same time rest of the tracks were used to test our model.
keeps the output dimensions the same as those of the
original fully connected layer. This also reduces the 4.2 Implementation Details:
number of parameters from 319M in VGG-M to 67M
in our network, which helps avoid overfitting. We implement our framework in TensorFlow on a
server with Nvidia Titan X GPU. Optimization of the
network is done with Adam Optimizer. The model is
3.2 Spectrogram Construction: trained for 20k iterations with batch size 32 and
learning rate 0.001. The weight decay regularization
The Speech signal is a non-stationary signal. So the parameter is fixed to 5 × 10−4. The computational cost
normal Fourier analysis will not give proper result. increases with the length of the images resulting more
Thus we have to use STFT or Short Time Fourier time to converge. Also, batch size was reduced to 32
Transform that gives the frequency spectrum of the to accommodate the GPU’s memory. During
signal as a function of time as well as frequency. The evaluation we noticed that each file takes roughly
spectrum so obtained is called spectrogram. But STFT 85ms on average on GeForce Titan X.
has a disadvantage that the proper frequency
resolution and time resolution cannot be decided Table I. Experimental results using different number of
properly. So we can use wavelet transform using any training samples used per speaker.
wavelets like Harr, Mexican Hat, Daubechis, Mathieu
etc. The visual representation of the transform so Training data/speaker Accuracy (%)
obtained is called scalogram. 10 51.3
20 71.3
So, for spectrogram First we convert all audio to 30 81.3
single-channel, 16-bit streams at a 16 kHz sampling 40 85.3
rate for consistency. The spectrograms are then 50 94.2
generated by sliding window protocol by using a
hamming window. The width of the hamming window
is 25 ms with a step size of 10 ms. This gives 4.3 Results:
spectrograms of size 128 (number of FFT features) x
300 for 3 seconds of randomly sampled speech for In Table I, the results of the method is shown.
each audio. Subsequently, each frequency bin is According to the obtained data, it is evident that
normalized (mean, variance). the spectrograms are accuracy of our model decreases with decreasing
training samples. We have seen that if we use 10 set. Then from each class, we choose a subset of
samples per classes, the results become 51.3%. samples which are considered as support points while
the rest are considered as query points. The flow
5. Future Extension: diagram for few shot learning is shown in figure above.

The method has several challenges. First of all, in 6. Conclusion:


reality, we will have access to a very limited number of
samples per speaker. Thus, it is not possible to train In order to establish benchmark performance, we
the model with that limited examples. Secondly, the develop a VGG CNN architecture with the ability to
method uses an end-to-end training strategy. Thus, it deal with variable length audio inputs. We have shown
is difficult to use the model for new classes. There is a the results on a speech dataset collected from 50
requirement of such model that is easily scalable. speakers. We have discussed some of the
Keeping these disadvantages in our mind, we have disadvantages of the method. We also formulated our
formulated a few-shot learning techniques. This future extension of our work using few-shot learning
model can be used to obtain a feature representation technique.
of the audio files using some metric learning methods.
Now, we will discuss the theoretical foundation of our 7. References:
future work based on few-shot learning. We are
assuming that for each speaker, only a few very short
1. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey,
utterances are available. Specifically, we learn from 1
and S. Khudanpur, “X-vectors: Robust dnn
to 5 small utterances of 3 seconds each. Apart from
embeddings for speaker recognition,” in 2018
computational and technical improvisations obtained
IEEE International Conference on Acoustics,
by solving under such constraints, this setting also
Speech and Signal Processing (ICASSP). IEEE, 2018,
allows enrolment of speakers to be easier and more
pp. 5329–5333.
practical.
2. D. Snyder, D. Garcia-Romero, D. Povey, and S.
Khudanpur, “Deep neural network embeddings
for text-independent speaker verification.” in
5.1 Few shot learning using Prototypical Loss:
Interspeech, 2017, pp. 999–1003.
3. Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A
In few shot learning, at test time we have to classify
novel scheme for speaker recognition using a
test samples among K new classes given very few
phonetically-aware deep neural network,” in
labeled examples of each class. We are provided with
2014 IEEE International Conference on Acoustics,
D dimensional embeddings Xi ⋳ ℝD for each input Xi
Speech and Signal Processing (ICASSP). IEEE, 2014,
and its corresponding label Yi where
pp. 1695–1699.
4. J. Guo, N. Xu, K. Qian, Y. Shi, K. Xu, Y. Wu, and A.
Alwan, “Deep neural network based i-vector
Yi ⋳ {1, . . . ,K}. The objective is to compute a M
mapping for speaker verification using short
dimensional representation i.e. the prototype of each
utterances,” Speech Communication, vol. 105, pp.
class ak ⋳ ℝM. The embedding is computed via a
92–102, 2018.
function f𝜙 : ℝD → ℝM where f𝜙 indicates a deep
5. X. Zhao and D. Wang, “Analyzing noise robustness
neural network and _ indicates its parameters. The
of mfcc and gfcc features in speaker
prototype ak is the mean of the support points for a
identification,” in 2013 IEEE international
class. By using a distance function d, a distribution
conference on acoustics, speech and signal
over classes is learned for a query point q, and is given
processing. IEEE, 2013, pp. 7204–7208.
as
6. A. Kanagasundaram, R. Vogt, D. B. Dean, S.
Sridharan, and M. W. Mason, “I-vector based
exp(−𝑑(𝑓∅ (𝑞),𝑎𝑘 ))
𝑝∅ ( 𝑦 = (𝑘|𝑞) = ∑𝑘′ exp(−𝑑(𝑓∅ (𝑞),𝑎𝑘′ ))
speaker recognition on short utterances,” in
Proceedings of the 12th Annual Conference of the
At train time, we minimize negative log probability of International Speech Communication Association.
the positive class. The training data is generated by International Speech Communication Association
randomly selecting a smaller subset from the training (ISCA), 2011, pp. 2341– 2344.
7. M. McLaren, Y. Lei, N. Scheffer, and L. Ferrer,
“Application of convolutional neural networks to
speaker recognition in noisy conditions,” in
Fifteenth Annual Conference of the International
Speech Communication Association, 2014.
8. Z. Liu, Z. Wu, T. Li, J. Li, and C. Shen, “Gmm and
cnn hybrid method for short utterance speaker
recognition,” IEEE Transactions on Industrial
Informatics, vol. 14, no. 7, pp. 3244–3252, 2018.
9. Y. Mroueh, E. Marcheret, and V. Goel, “Deep
multimodal learning for audio-visual speech
recognition,” in 2015 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), April
2015, pp. 2130–2134.
10. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T.
Ogata, “Audio-visual speech recognition using deep
learning,” Applied Intelligence, vol. 42, no. 4, pp. 722–
737, Jun. 2015. [Online]. Available:
http://dx.doi.org/10.1007/s10489-014-0629-7
11. K. Noda, Y. Yamaguchi, K. Nakadai, H. Okuno, and T.
Ogata, Lip reading using convolutional neural network.
International Speech and Communication Association,
2014, pp. 1149–1153.
12. O. Koller, H. Ney, and R. Bowden, “Deep learning of
mouth shapes for sign language,” in 2015 IEEE
International Conference on Computer Vision
Workshop (ICCVW), Dec 2015, pp. 477–483.
13. S. Petridis and M. Pantic, “Deep complementary
bottleneck features for visual speech recognition,” in
2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), March 2016,
pp. 2304–2308.
14. M. Wand, J. Koutnk, and J. Schmidhuber, “Lipreading
with long shortterm memory,” in 2016 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP), March 2016, pp. 6115–
6119
15. J. S. Chung and A. Zisserman, “Out of time: automated
lip sync in the wild,” in Workshop on Multi-view Lip-
reading, ACCV, 2016.
16. E. Marcheret, G. Potamianos, J. Vopicka, and V. Goel,
“Detecting audiovisual synchrony using deep neural
networks,” in INTERSPEECH, 2015.
17. M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp,
“Audiovisual synchronization and fusion using
canonical correlation analysis,” IEEE Transactions on
Multimedia, vol. 9, no. 7, pp. 1396–1403, 2007.

You might also like