
Speaker Identification Based Proxy Attendance Detection System
Naman Gupta, Shikha Jain
Department of Computer Science and Information Technology
Jaypee Institute of Information Technology
Noida, India
namang500.ng@gmail.com, shi_81@rediffmail.com

Abstract—Recently, voice biometrics has gained popularity due to its large number of applications. This paper presents a Siamese Network and CIFAR-network-based architecture for speaker identification. It uses voice metrics and voice features independent of the uttered words. The model is trained on a dataset containing hundreds of audio clips of 10 speakers, and it achieved a test accuracy of 88% and a training accuracy of 98.5%. The proposed model is used for proxy attendance detection in the classroom.

Keywords: Voice Biometrics; Siamese Network; CIFAR Network; Speaker Identification; Proxy Attendance Detection

I. INTRODUCTION
Machine learning and artificial intelligence are rapidly transforming the world by automating various manual processes, and voice biometrics is one of them. It has gained much popularity in recent years. Voice biometrics is a technology used to match a personal voice pattern and verify a speaker's identity using only the voice. Speaker identification is an emerging research area in speech technology.

There are a large number of applications of speaker identification systems, such as banking, telecommunication systems, security devices, forensics, and many more. Current open fields for voice biometrics are government immigrant check-in, customer service authentication, and employee workforce management check-in.
For a human, it is very easy to verify a person just by hearing a known voice. But for machines, it is indeed a difficult task to achieve accuracy comparable to a human's. A lot of work [1-12] has already been done in this area. However, either good accuracy is not achieved or the model is text-dependent [1][4][6-7], i.e., dependent on the words a person utters. Therefore, building a speaker recognition model that achieves high accuracy independent of the speaker's gender and native language is still an open research problem. Most researchers [10][13] have used MFCC, delta, and delta-delta as the feature extraction methods for speech. These feature extraction methods have gained high popularity in the field of speaker verification and recognition, with beats and tempo counted as secondary features. Nevertheless, the use of a CNN in speaker verification simplifies the task of feature extraction by discovering the relevant patterns in human voices on its own. The Siamese Network architecture uses shared CNN weights to find similar and different features between recordings of the same person and of two different persons.

A CNN mainly consists of three kinds of layers: the convolutional layer, the pooling layer, and the dense layer. The convolutional layer finds important features by matching the input against the corresponding filter patterns. The pooling layer captures translational, rotational, and scaling invariance. The dense layer is the fully connected layer used for classification. Thus, a CNN is robust to many manipulations that can be carried out on images; its filters find the particular patterns in images that are important for classification, object detection, or recognition tasks.

Heigold et al. [6] used a triplet loss function on a dataset of triplet pairs consisting of an anchor, a positive, and a negative sample. The anchor is the sample that is tested against a positive voice recording of the same person. The negative samples are those that are hard for the model when verifying speaker identity. Thus, the model must be made to keep a large distance from the negative sample and, at the same time, a much closer distance to the positive sample. In this paper, we have used hundreds of audio recordings of ten different speakers, with five male and five female voices.
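As a concrete illustration of the convolutional, pooling, and dense layers described above, the following is a minimal, hypothetical Keras sketch of such a CNN. Layer sizes and the input shape are illustrative assumptions, not the paper's exact configuration.

    # Minimal CNN sketch: convolution -> pooling -> dense, as described above.
    # Layer sizes and input shape are illustrative assumptions.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu',
                      input_shape=(128, 128, 1)),   # convolutional layer: filter patterns
        layers.MaxPooling2D((2, 2)),                # pooling layer: small-shift invariance
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),       # dense layer: learned features
        layers.Dense(10, activation='softmax')      # one output per speaker
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')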

The dataset is thus constructed using various combinations of same-speaker and different-speaker pairs, irrespective of gender and language. It contains ten pairs where both audio clips are of the same speaker and ten pairs where the recordings are of different speakers.
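A hypothetical sketch of how such labeled pairs can be assembled is shown below; the clips mapping from speaker to file list is an assumption.

    # Sketch: building labeled (clip_a, clip_b, label) pairs for Siamese training.
    # `clips` maps each speaker name to a list of audio clips; it is an assumption.
    import itertools, random

    def make_pairs(clips):
        pairs = []
        speakers = list(clips)
        for s in speakers:
            # Same-speaker pairs (label 1)
            for a, b in itertools.combinations(clips[s], 2):
                pairs.append((a, b, 1))
        for s1, s2 in itertools.combinations(speakers, 2):
            # Different-speaker pairs (label 0)
            pairs.append((random.choice(clips[s1]), random.choice(clips[s2]), 0))
        random.shuffle(pairs)
        return pairs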
The paper mainly focuses on speaker identification based on voice metrics and voice features. The model is designed using a modified CIFAR network architecture as the shared-weight part of a Siamese network. The proposed model is used for proxy attendance detection in the classroom.

The paper is organized as follows: Section 2 discusses the related work, followed by a critical analysis in Section 3. The proposed model and the results are discussed in detail in Sections 4 and 5, respectively. The paper is concluded in Section 6, followed by future research directions in Section 7.
II. RELATED WORK

The task of a speaker recognition model is to extract the feature vector embeddings that are most needed to verify a person's identity. This has been an area of keen interest for many researchers, and there have been great improvements in the way machines are able to authenticate a person's identity. Earlier, many different approaches were used to tackle the problem. For instance, rigorous research was done on pattern matching of general text-dependent word samples spoken by familiar persons over a telephone line. A Hidden Markov Model (HMM) tested on 10 speakers produced an accuracy of around 68% to 76% [9]. Recently, the most advanced systems are accurate enough to achieve very good accuracy while acquiring around 200 voice-related features. Nammous et al. [3] proposed a linear prediction method that smooths speech signals transformed by the Fast Fourier Transform (FFT), which is used for the transition from the time-amplitude to the amplitude-frequency domain. Circulant (Toeplitz) matrices are calculated from these points to extract their minimal eigenvalues and eigenvectors, which are used to produce a unique feature vector for each audio sample. These features are then fed to a Radial Basis Function Neural Network.
Residual conv-nets have been used in text-independent speaker verification for training deep networks [4]. Around 19 MFCC (Mel Frequency Cepstral Coefficient) and log-energy features are first normalized along the mean and variance and then combined with tempo, delta, delta-delta, and rhythmic features to produce quality feature vectors. A residual architecture is an arrangement of several residual units. A residual unit is a mapping x_{l+1} = x_l + F(x_l, W_l), where x_l and x_{l+1} are the unit's input and output. The additive "shortcut connection" allows the network to satisfy a basic property: there is no harm in making the model deeper. Thus, it becomes easy to train large networks.
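A minimal sketch of such a residual unit in Keras follows, assuming 2-D convolutional feature maps; the residual branch F used here (two 3x3 convolutions) is an assumption and may differ from the exact F of [4].

    # Sketch of one residual unit x_{l+1} = x_l + F(x_l, W_l).
    # The residual branch F below is two 3x3 convolutions (an assumption).
    from tensorflow.keras import layers

    def residual_unit(x, filters=64):
        # Residual branch F(x_l, W_l)
        f = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
        f = layers.Conv2D(filters, (3, 3), padding='same')(f)
        # Additive shortcut connection: x_{l+1} = x_l + F(x_l, W_l).
        # Assumes x already has `filters` channels so the shapes match.
        out = layers.Add()([x, f])
        return layers.Activation('relu')(out)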
The model referred to in [14] was trained on a dataset of 300 different speakers (143 female and 157 male) recorded in different environments uttering the same text. The Adaptive Moment (Adam) optimizer with a learning rate of 10^-4 was used for training. The network is trained to discriminate between all speakers in the training set using a softmax layer and the categorical cross-entropy loss function. Earlier, 512-unit i-vectors were used to establish a person's authenticity and identity. Achieving high accuracy with a deep residual CNN requires large training data, deep layer architectures, and well-tuned hyper-parameters.

The following steps are used for feature extraction with MFCC (sketched in code below):

1. Split the original audio file into small frames (e.g., a 30 s clip into many frames of 20 ms each).
2. Calculate each frame's power spectral density.
3. Apply a filter bank and sum up the energy of each filter.
4. Take the logarithm of all the filter-bank energies.
5. Compute the Discrete Cosine Transform of the log filter-bank energies.
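As a concrete illustration of these steps, here is a hypothetical sketch using the librosa library, which bundles steps 1-5 into a single call and also provides the delta and delta-delta features mentioned above. The file name and sampling rate are illustrative assumptions.

    # Sketch: MFCC, delta, and delta-delta extraction with librosa.
    # 'speaker.wav' and the sampling rate are illustrative assumptions.
    import librosa

    y, sr = librosa.load('speaker.wav', sr=16000)
    # librosa.feature.mfcc internally performs the framing, power spectrum,
    # mel filter bank, logarithm, and DCT (steps 1-5 above).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19)
    delta = librosa.feature.delta(mfcc)             # delta features
    delta2 = librosa.feature.delta(mfcc, order=2)   # delta-delta features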

Recently, the VoxCeleb dataset was proposed [2]; it was extracted by running a pipeline over thousands of YouTube videos of different celebrities, matching the lips of the speaker using a LipNet model. Other researchers [1] give intuition about triplet-loss training examples. The triplets are the Anchor (test data), the Positive (a similar recording of the same person), and the Negative (a similar recording of a different person, which can closely mimic the anchor person). The main aim is to decrease the similarity distance between the anchor and the positive and to increase the dissimilarity distance between the anchor and the negative.
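A minimal sketch of this triplet loss follows, assuming L2 distances between fixed-size embeddings and an illustrative margin; this is the general idea, not the exact formulation of [1] or [6].

    # Sketch of triplet loss: pull anchor toward positive, push from negative.
    # The margin value is an illustrative assumption.
    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """anchor, positive, negative: 1-D embedding vectors."""
        d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
        d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
        # Loss is zero once the negative is at least `margin` farther away.
        return max(d_pos - d_neg + margin, 0.0)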
III. CRITICAL ANALYSIS

Speaker authentication is a problem that has been addressed by many researchers in recent times, and a variety of approaches have been adopted for it. Some researchers have used the VGG16 CNN model [2] to extract features from mel-spectrograms of voice samples. This model is robust to invariances and is capable of extracting important features. Yet the problem with the VGG16 model is its high time complexity: it requires a lot of computation and memory to load the model architecture and its weights, along with a heavy paired dataset. Researchers have been trying to improve the performance of the model, but the problem of high time complexity has not been tackled.
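For reference, a mel-spectrogram of the kind such models consume can be computed with librosa; this is a hypothetical sketch, with the file name and parameter values as illustrative assumptions.

    # Sketch: computing a log-scaled mel-spectrogram as CNN input.
    # 'speaker.wav' and the parameter values are illustrative assumptions.
    import librosa
    import numpy as np

    y, sr = librosa.load('speaker.wav', sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # log scale for dynamic range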
IV. PROPOSED ALGORITHM

This paper proposes a speaker identification model using a modified CIFAR network architecture as the shared-weight part of a Siamese network. The beauty of the model is that it uses minimal computation and little memory. For accurate identification of speakers, the first step is to extract the important features in the form of vector embeddings. These feature vectors represent the patterns in a speaker's voice, independent of the uttered words. The next step is to train the model to find the similarity between the voice patterns of two different speakers. For this, a dataset of ten different speakers with an equal number of male and female voices is used; it contains hundreds of audio recordings of the speakers in different languages.

The model is implemented on the Google Colab platform using a Tesla K80 GPU and a 7th-generation Intel Core i7 processor. The original CIFAR network has 10 neurons in its last layer. Here, instead of 10 neurons, 128 neurons are used to generate a more detailed feature vector (as shown in Figure 1). In the learning phase, a special type of neural network, the Siamese Neural Network (SNN), is used. The advantage of using an SNN is that it uses a distance-based loss and, as a result, generates better embeddings. Moreover, this kind of network learns to differentiate between two inputs rather than simply classifying them. The proposed architecture is shown in Figure 2.

Finally, the network outputs zero to indicate different speakers and one to indicate the same speaker. Based on the error, the weights are updated. The learning rate is set to 0.0001 for the initial hundred epochs; as the model approaches the optimum, the learning rate is decreased tenfold to 0.00001 so that it can reach the global optimum. The model is trained with the Adam optimizer, using batch normalization at each layer along with a dropout of 0.2 on the dense layer connected just after the convolutional layers, to minimize the chances of overfitting. A sketch of this setup follows Figure 1 below.

Fig 1: Modified CIFAR architecture
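The following is a minimal, hypothetical Keras sketch of this Siamese setup: a shared embedding tower ending in 128 units (a simplified stand-in for the modified CIFAR network of Fig 1, not its exact layer stack), an absolute-difference comparison, and a sigmoid output, compiled with Adam at the stated initial learning rate.

    # Sketch of the Siamese setup: shared CNN tower -> 128-d embeddings ->
    # distance -> sigmoid (1 = same speaker, 0 = different). The tower below
    # is a simplified assumption, not the paper's exact modified CIFAR stack.
    import tensorflow as tf
    from tensorflow.keras import layers, models, optimizers

    def embedding_tower(input_shape=(128, 128, 1)):
        inp = layers.Input(shape=input_shape)
        x = layers.Conv2D(32, (3, 3), activation='relu')(inp)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Conv2D(64, (3, 3), activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Flatten()(x)
        x = layers.Dropout(0.2)(x)                     # dropout after conv block
        emb = layers.Dense(128, activation='relu')(x)  # 128-d embedding layer
        return models.Model(inp, emb)

    tower = embedding_tower()                          # one tower, shared weights
    in_a = layers.Input(shape=(128, 128, 1))
    in_b = layers.Input(shape=(128, 128, 1))
    dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([tower(in_a), tower(in_b)])
    out = layers.Dense(1, activation='sigmoid')(dist)  # 1 if same speaker

    siamese = models.Model([in_a, in_b], out)
    siamese.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                    loss='binary_crossentropy', metrics=['accuracy'])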

Fig 2: Proposed architecture. Input A and Input B (images of voice mel-spectrograms) each pass through a CIFAR network with shared weights, producing a vector embedding for each input; a sigmoid activation then outputs 1 if inputs A and B are the same speaker, else 0.

Once the model has been trained, reached the global optimum, and achieved promising accuracy, the last layer of the model is popped off to expose the 128-hidden-unit layer, which acts as the feature vector embedding. The 1 x 128 vector embedding of a speaker captures the characteristic qualities of that particular speaker's voice. The vector embeddings of all the speakers are extracted using the Siamese model and stored in the database.

Once the embeddings are generated for each speaker and the model is trained to differentiate between any two speakers, the model is saved for the actual use of speaker identification. For identification, the mel-spectrogram of the speaker is passed as input to the trained model to generate its 1 x 128 vector embedding. Then, cosine similarity is used to compare the current speaker against all the already processed speakers in real time and identify the speaker. Here, we consider the threshold parameter to be 0.75. The proposed model is used for proxy attendance detection in the classroom, as sketched below.
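A hypothetical sketch of this verification step follows, assuming a mapping (here called enrolled) from speaker names to their stored 1 x 128 embeddings and the stated 0.75 cosine-similarity threshold; the helper names are illustrative.

    # Sketch: proxy detection by cosine similarity over enrolled embeddings.
    # `audio_embedding` stands for the trained tower's 1 x 128 output and
    # `enrolled` maps speaker names to stored embeddings; both are assumptions.
    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def check_attendance(audio_embedding, claimed_name, enrolled, threshold=0.75):
        # Find the enrolled speaker most similar to the incoming voice.
        scores = {name: cosine_similarity(audio_embedding, emb)
                  for name, emb in enrolled.items()}
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            return "Unknown speaker"
        if best == claimed_name:
            return f"Person Verified.... {best}"
        return f"Beware, It's a proxy.... This is {best}"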
V. RESULTS

To verify the speakers, the vector embeddings of the various speakers were compared. Figure 3 presents the generated classification report. It can be observed that the model identifies 9 of the speakers with a high F1-score; the exception is speaker 6, whose feature embeddings are similar to those of speaker 3, as can be seen by comparing the F1-scores.

Fig 3: Classification Report of ten speakers

Table 1 shows some of the test cases of speaker identification. It can be seen that the model is able to identify the speakers correctly; moreover, it is able to detect proxy attendance as well. The model achieved a decent test accuracy of 88% and a training accuracy of 98.5%. Table 2 shows the comparative performance of the CIFAR-based proposed model against a VGG16-based model: the proposed model outperforms the existing model in terms of both time complexity and accuracy.

Table 1: Test Cases

Input: Speaker_1.wav; name called for: Speaker_1
Output: expected speaker Speaker_1; prediction "Person Verified.... Speaker no. 1"

Input: Speaker_3.wav; name called for: Speaker_4
Output: expected speaker Speaker_3; prediction "Beware, It's a proxy.... This is Speaker no. 3"

Input: Speaker_8.wav; name called for: Speaker_8
Output: expected speaker Speaker_8; prediction "Person Verified.... Speaker no. 8"

Input: Speaker_6.wav; name called for: Speaker_6
Output: expected speaker Speaker_6; prediction "Person Verified.... Speaker no. 6"

Table 2: Comparative Performance

Parameter        VGG16-Based Model    Proposed Model
No. of epochs    3000                 2500
Training time    3 hrs                1.5 hrs
Accuracy         81%                  88%

VI. CONCLUSION

The paper presents a speaker identification model based on voice metrics and voice features independent of the uttered words. The model is designed using a modified CIFAR network architecture as the shared-weight part of a Siamese network, generating a feature vector of size 1 x 128. The model is trained on a dataset containing hundreds of audio clips of 10 speakers, and it achieved a decent test accuracy of 88% and a training accuracy of 98.5%. The model was thus accurate enough to verify 8 speakers independently; due to some correlation between the voices of speaker 3 and speaker 6, it was sometimes mistaken on some voice recordings. The model can be used in a number of applications such as banking, telecommunication systems, security devices, and forensics. Here, it is used as a proxy attendance detection model in the classroom.

VII. FUTURE WORK

In the future, we intend to work on an accuracy metric suitable for large-scale speaker identification with end-to-end deployment. This requires a different arrangement of the dataset, with anchor, positive, and negative samples trained on a triplet loss function, and a really deep neural network built on the latest architectures.

VIII. REFERENCES

[1] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu. "Deep Speaker: an end-to-end neural speaker embedding system." arXiv preprint arXiv:1705.02304, 2017.
[2] A. Nagrani, J. S. Chung, and A. Zisserman. "VoxCeleb: a large-scale speaker identification dataset." arXiv preprint arXiv:1706.08612, 2017.
[3] M. K. Nammous, A. Szczepański, and K. Saeed. "An exploratory research on text-independent speaker recognition." In International Conference on Hybrid Artificial Intelligence Systems, pp. 412-419. Springer, Berlin, Heidelberg, 2011.
[4] E. Malykh, S. Novoselov, and O. Kudashev. "On residual CNN in text-dependent speaker verification task." In International Conference on Speech and Computer, pp. 593-601. Springer, Cham, 2017.
[5] S. Narang and D. Gupta. "Speech feature extraction techniques: a review." International Journal of Computer Science and Mobile Computing 4, no. 3, 2015, pp. 107-114.
[6] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer. "End-to-end text-dependent speaker verification." In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115-5119. IEEE, 2016.
[7] D. K. Burton. "Text-dependent speaker verification using vector quantization source coding." IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, pp. 133, 1987.
[8] H. Salehghaffari. "Speaker Verification using Convolutional Neural Networks." arXiv preprint arXiv:1803.05427, 2018.
[9] U. Mahola, F. V. Nelwamondo, and T. Marwala. "HMM Speaker Identification Using Linear and Non-linear Merging Techniques." arXiv preprint arXiv:0705.1585, 2007.
[10] P. Annesi, R. Basili, R. Gitto, A. Moschitti, and R. Petitti. "Audio feature engineering for automatic music genre classification." In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pp. 702-711. Le Centre De Hautes Etudes Internationales D'Informatique Documentaire, 2007.
[11] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur. "Deep Neural Network Embeddings for Text-Independent Speaker Verification." In Interspeech, pp. 999-1003. 2017.
[12] K. Chen and A. Salman. "Extracting speaker-specific information with a regularized siamese deep network." In Advances in Neural Information Processing Systems, pp. 298-306. 2011.
[13] N. Dave. "Feature extraction methods LPC, PLP and MFCC in speech recognition." International Journal for Advance Research in Engineering and Technology 1, no. 6, 2013, pp. 1-4.
[14] S. Safavi and I. Mporas. "Improving performance of speaker identification systems using score level fusion of two modes of operation." In International Conference on Speech and Computer, pp. 438-444. Springer, Cham, 2017.