
Batch Normalized Siamese Network Deep Learning Based Image Similarity Estimation


2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT) | 978-1-6654-9360-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICECCT56650.2023.10179689

M. Shyamala Devi
Computer Science & Engineering
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
Chennai, Tamilnadu, India
shyamalapmr@gmail.com

J. Arun Pandian
School of Information Technology and Engineering
Vellore Institute of Technology
Vellore, India
aparunpandian@gmail.com

Aparna Joshi
Department of Information Technology
Army Institute of Technology, Pune
aparna.joshi82@gmail.com

Yeluri Praveen
Computer Science & Engineering
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
Chennai, Tamilnadu, India
vtu15374@veltech.edu.in

Abstract— The assessment of how alike two distinct images are is called image similarity and consistency. In other words, it measures how comparable the intensity patterns in two images are to one another. To achieve this, researchers examine the image descriptors recursively in order to identify descriptor pairs that are comparable. The two images are deemed similar if the number of related descriptors exceeds a predetermined threshold and both images show the very same entity. The computation of image similarity is used in various applications, where it proves to be a mandatory step in building the application. With this intent, the Fashion MNIST dataset from Kaggle is used for implementing image similarity estimation. This paper proposes a Batch Normalized Siamese Network (BNSN) deep learning based model for computing image similarity. The BNSN model is designed with two subnetworks that generate feature vectors of the two input images. Lambda batch normalization is performed with a single dense layer to predict the image similarity, with label 0 indicating identical images and label 1 denoting different images. The BNSN was fitted with 30,000 training images and tested with 30,000 images. The model was implemented in Python on a web server with a Geforce Tesla V100 NVidia graphics card, with a batch size of 64 and 30 training epochs. The training images are also tested with a traditional image similarity method, and the implementation of the proposed BNSN shows an accuracy of 91.91%, precision of 92.93%, recall of 90.72% and F-score of 91.81%.

Keywords— Siamese network, deep learning, image similarity, normalization.

I. INTRODUCTION

A Siamese neural network, often described as a twin neural network, is a type of artificial neural network that applies the same weights to two distinct input vectors simultaneously in order to compute comparable output vectors. The contrastive loss uses the final-layer outputs of both networks to determine how similar the two images are. Siamese networks are capable of learning powerful representations that can then be used for other computer vision tasks such as object identification or image segmentation. A neural network architecture called SimSiam makes use of Siamese networks to determine how similar two data points are. To achieve this, each image given to the network must pass through the same embedding function. Therefore, weights must be shared between both network branches. A single training sample is all that the one-shot classification model known as the Siamese network needs to make a prediction. Because it needs little training data, it is also robust to class imbalance. Humans are quite proficient at learning and recognizing patterns. In particular, researchers have observed that, when exposed to stimuli, people appear able to acquire novel concepts quickly and then recognize variations on these concepts in future percepts. In the field of supervised machine learning known as similarity learning, the objective is to learn a similarity function that measures the degree of similarity or relatedness between two items and returns a similarity score.

II. LITERATURE REVIEW

When the items are similar, the similarity score is higher; when they are distinct, the similarity score is lower. Simply by measuring the distance between the feature vectors, we can determine whether two items are similar or very distinct from one another. Determining the similarity between two given images, which can be done by learning a similarity criterion between the images, is one of the essential problems. Siamese Convolutional Neural Networks accomplish this readily: Siamese CNNs are capable of learning a similarity metric between different types of image pairs [1]. It has been demonstrated that convolutional neural networks perform well at stereo estimation. Modern architectures, on the other hand, rely on Siamese networks that combine the two branches and then apply additional processing layers, requiring a minute of GPU computation for each pair of images. In comparison, the matching network of [2] produced highly accurate results in less than a second of GPU processing [2].

Recursive neural networks, which are capable of processing directed ordered acyclic graphs, are used with a graph-based image representation that captures the connections between different regions within the image. Recursive neural networks can find the best representations for exploring the image data, while the graph-based description integrates structural and subsymbolic aspects of the image [3]. Using a Siamese neural network based gait recognition framework, robust and discriminative gait characteristics are automatically extracted for human identification. The Siamese network, in contrast to typical deep neural networks, can use distance metric learning to make the similarity metric large for pairs of gaits from different individuals and small for pairs from

Authorized licensed use limited to: T.C. Cumhurbaskanligi Kutuphanesi. Downloaded on August 09,2023 at 21:01:18 UTC from IEEE Xplore. Restrictions apply.
the same individual. In particular, gait energy images were used rather than the raw gait sequence, in order to build an effective model with less training data [4]. A Siamese convolutional neural network learns descriptors that aggregate image pixels and optical flow to encode local spatiotemporal patterns between the two input image regions. The resulting matching likelihood is then produced by combining the CNN output with a set of contextual features, extracted from the position and size of the compared input patches, using a gradient boosting classifier. This learning strategy is validated with a multi-person tracker based on linear programming, which demonstrates that, when fed with the learned matching probabilities, even a basic and efficient tracker can outperform considerably more complicated models [5]. Re-identification of people in multi-camera networks has been addressed with a Siamese Convolution Neural Network and a hierarchical structure. A convolution neural network is used to train a nonlinear transformation that projects pairs of person images into the same feature subspace. The loss function is minimized during the learning process, ensuring that the similarity distance for positive pairs is below a lower threshold and that for negative pairs is above an upper threshold. Because of the computation time, a dataset of modest scale was used [6].

The most recent deep convolutional neural networks for learning local image descriptors consist of a Siamese network trained by penalizing misclassification of pairs of input image patches. A triplet network can improve classification accuracy in a number of problems, according to recent machine learning results; however, this has not yet been shown for learning local image descriptors. Additionally, the stochastic gradient descent method used to train the current Siamese and triplet networks can lead to overfitting because it calculates the gradients from specific pairs or triplets of local image patches [7]. Numerous applications, such as object segmentation or image retrieval, require the evaluation of similarity. A new framework for measuring texture similarity based on contemporary deep learning methods has been designed. It compares patches taken from both homogeneous and non-homogeneous textures of real-world photographs. A Siamese Neural Network, which is intended to assess the similarity of image pairs, was used. The Siamese Neural Network learns to select the most distinctive features responsible for texture discrimination. Each of the twin networks generates a feature vector for the image it processes [8].

A Siamese time delay neural network is created by joining the outputs of two identical networks. During training, the network learns to compare the similarity of pairs of signatures. Only one half of the Siamese network is evaluated when it is used for verification. The output of this half network is the feature vector for the input signature. Verification consists of comparing this feature vector with a feature vector that is kept on file for the signer. Signatures are rejected as forgeries unless they are closer than a specified threshold to the stored representation [9]. The problem of arbitrary object tracking has traditionally been solved by learning a model of the object's appearance exclusively online, using the video itself as the only training data. Despite the success of these methods, the richness of the model they can learn is inherently limited by their online-only approach. Recently, various attempts have been made to take advantage of the expressive power of deep convolutional networks. To adjust the network weights when the object to track is unknown in advance, stochastic gradient descent must be performed online, which significantly slows down the system [10]. Texture feature extraction and similarity assessment are the two techniques that make up a full texture image retrieval system. To extract texture features, the non-subsampled contourlet transform and the dual-tree complex wavelet transform were employed [11]. An in-depth examination of human perception and technical specifications is necessary to determine the efficacy and efficiency of such assessments. Extensive subjective evaluations are necessary for the creation and testing of objective texture similarity metrics that agree with human judgments of texture similarity [12]. A Siamese network works by using two identical networks that receive different images and learning the absolute difference between the two feature vectors while calculating the similarity score between the two images. The best activation function for this task, however, is unclear [13]. When dealing with recently discovered malicious programs, it is not possible to train machine learning models on enough malware samples. The method converts malware samples into scaled grayscale images during the pre-processing step and groups them by average hash into the same category. Siamese networks are trained to rank the similarity of samples during the training and testing phases, and the accuracy is determined using one-shot tasks [14]. A two-branch Siamese neural network is employed for defect detection, on the assumption that objects of different categories may share certain structural similarities, as evidenced by learned pairwise image differences [15]. To decrease the modality gap, meaningful information from each image pair is extracted using the Siamese convolution network architecture [16].

III. RESEARCH METHODOLOGY

The process flow of the proposed BNSN framework is shown in Fig. 1. The main contribution of this research work is the design of the Batch Normalized Siamese Network deep learning based model for computing image similarity. The two feature vectors V(A) and V(B) are compared as in equation (1).

$$\mathrm{dist}(A, B) = \lVert V(A) - V(B) \rVert_2 \tag{1}$$

If the distance is small, the vectors are similar. The distance function dist gives the distance between the two vectors, and a loss function can be built on it. When A and B are a positive pair, that is, when they show the same object, the loss function is defined exactly as the L2 norm of the difference between the two vectors, as in equation (2).

$$\mathrm{Loss}(A, B) = \lVert V(A) - V(B) \rVert_2 \tag{2}$$

Therefore, when the loss function is reduced, the distance dist(A, B) is reduced. When the two images in a pair are a negative pair, a different type of loss function, called hinge loss, is employed. When the two images in a pair are different, we want V(A) and V(B) to have a distance greater than a margin max; once a negative pair is already further apart than max, there is no benefit in separating it further, so its loss is zero, as in equation (3).

$$\mathrm{Loss}(A, B) = \max\bigl(0,\ \mathit{max} - \lVert V(A) - V(B) \rVert_2\bigr) \tag{3}$$

The contrastive loss combines the two cases and is given by equation (4), where y is a pair indicator with y = 1 for a positive pair and y = 0 for a negative pair, so that equation (4) reduces to equation (2) or equation (3) respectively.

$$\mathrm{Loss}(A, B) = y\,\lVert V(A) - V(B) \rVert_2 + (1 - y)\,\max\bigl(0,\ \mathit{max} - \lVert V(A) - V(B) \rVert_2\bigr) \tag{4}$$
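The pairwise loss described above can be illustrated with a minimal Python sketch. It assumes the plain (unsquared) L2 distance for positive pairs and a hinge on the margin for negative pairs; the function names and the `margin` default of 1.0 are illustrative and are not taken from the paper. Many implementations square both terms instead.

```python
import math


def euclidean_distance(v_a, v_b):
    """L2 distance between two equal-length feature vectors."""
    if len(v_a) != len(v_b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_a, v_b)))


def contrastive_loss(v_a, v_b, is_positive, margin=1.0):
    """Positive pairs pay their distance; negative pairs pay a hinge on the margin.

    `margin` plays the role of the threshold called "max" in the text
    (the default value is an assumption made for this sketch).
    """
    d = euclidean_distance(v_a, v_b)
    if is_positive:
        return d
    # Negative pairs already further apart than the margin contribute no loss.
    return max(0.0, margin - d)


# Toy feature vectors, chosen only to make the arithmetic easy to follow.
positive_loss = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=True)
near_negative_loss = contrastive_loss([0.0, 0.0], [0.3, 0.4], is_positive=False)
far_negative_loss = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=False)
```

In this toy case the positive pair is penalized by its full distance, the nearby negative pair is penalized for sitting inside the margin, and the distant negative pair contributes nothing.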

[Flowchart steps, in order: Fashion MNIST dataset; Data Analysis; Labeling of Data; Training and Testing Data Separation (50:50); Model Training and Fitting; Two Input Layer Splitting; Functional Layer Splitting; Lambda Layer Layout; Batch Normalization; Dense Layer Layout; Model Fitting; Accuracy, Contrastive Loss, ROC Curve Analysis; Similarity Identification]

Fig. 1. Research Methodology

The BNSN architecture, with two input layers, a single functional layer, a lambda layer, a batch normalization layer, dense layers and an output layer for image similarity identification, is shown in Fig. 2.

Fig. 2. BNSN Layered architecture of Siamese Network

The Fashion MNIST dataset is split into 30,000 training images and 30,000 testing images, as shown in Fig. 3.

Fig. 3. Sample images of Fashion MNIST dataset

The data are labeled with Label 1 indicating different objects and Label 0 indicating the same object, as shown in Fig. 4, Fig. 5 and Fig. 6.

Fig. 4. Sample data Pairs for Training

Fig. 5. Sample data Pairs for Validation

Fig. 6. Sample data Pairs for Testing

The training and testing data are split as in equations (5) to (7).

$$Similarity = Training_{30000} + Testing_{30000} \tag{5}$$

$$Training_{30000} = \left\lfloor \bigcup_{T=1}^{30000} \left\{ \sum_{i=1}^{255} \sum_{j=1}^{255} training_{ij}^{t} \right\} \right\rfloor \tag{6}$$

$$Testing_{30000} = \left\lfloor \bigcup_{E=1}^{30000} \left\{ \sum_{i=1}^{255} \sum_{j=1}^{255} testing_{ij}^{e} \right\} \right\rfloor \tag{7}$$

The Fashion MNIST dataset images are standardized, and the significant features are extracted by the proposed BNSN model with a sigmoid cross entropy loss layer that reduces the difference between the probability distribution of the assigned dataset labels and that of the predicted labels, as in equation (8).

$$Loss_i = \sum_{L=1}^{C} tr_L(te_i) \log P(te_i = L \mid data_i; te_i) \tag{8}$$

where $tr_L(te_i)$ is the distribution of assigned dataset labels and $P(te_i = L \mid data_i; te_i)$ is the probability distribution of predicted labels. The learning rate of the proposed BNSN model is adapted by the Adadelta gradient descent function, as shown in equation (9).

$$LR_{i+1} = LR_i - \propto \, \frac{\delta}{\delta LR} \, \frac{\eta}{\sqrt{E[LR_i] + \epsilon}} \tag{9}$$

where $\eta$ is the learning rate, $\lambda$ is the decay rate, $\propto$ is the loss error and $E[LR_i]$ is the accumulative gradient learning rate. The training and testing Fashion MNIST data are fitted with the proposed BNSN model as given by equations (10) and (11).

$$Modelfit = Random\left( \bigcup_{T=1}^{30000} \left\{ \sum_{i=1}^{255} \sum_{j=1}^{255} training_{ij}^{t} \right\} \right) \tag{10}$$

$$Modelfit = Random\left( \bigcup_{E=1}^{30000} \left\{ \sum_{i=1}^{255} \sum_{j=1}^{255} testing_{ij}^{e} \right\} \right) \tag{11}$$
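The pair labeling convention used in this work (Label 0 for the same object, Label 1 for different objects) can be sketched in Python as follows. The paper does not state how the pairs were sampled, so the strategy here (one same-class and one different-class partner per image, drawn at random) is an assumption, and `make_pairs`, `seed` and the toy image names are hypothetical.

```python
import random


def make_pairs(images, labels, seed=0):
    """Build one same-class and one different-class pair for every image.

    Pair label 0 = same object, pair label 1 = different object.
    The sampling strategy is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    classes = sorted(by_class)
    if len(classes) < 2:
        raise ValueError("at least two classes are needed to form negative pairs")

    pairs, pair_labels = [], []
    for index, label in enumerate(labels):
        # Same-class partner (may be the image itself) -> label 0.
        partner = rng.choice(by_class[label])
        pairs.append((images[index], images[partner]))
        pair_labels.append(0)
        # Partner drawn from a different class -> label 1.
        other_class = rng.choice([c for c in classes if c != label])
        partner = rng.choice(by_class[other_class])
        pairs.append((images[index], images[partner]))
        pair_labels.append(1)
    return pairs, pair_labels


# Toy stand-ins for images; real inputs would be 28x28 Fashion MNIST arrays.
images = ["shirt_a", "shirt_b", "boot_a", "boot_b"]
labels = [0, 0, 1, 1]
pairs, pair_labels = make_pairs(images, labels)
```

Generating one pair of each kind per image keeps the two pair labels balanced, which is a convenient default when accuracy is used as a metric.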

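The layer sequence named for the BNSN (two inputs, a shared functional subnetwork, a lambda layer, batch normalization and a dense output) can be sketched as a NumPy forward pass. This is an illustration under stated assumptions, not the authors' implementation: the paper gives no layer sizes or activations, so the single ReLU dense layer with 16 features, the Euclidean distance in the lambda layer, and the untrained random weights are all hypothetical choices. Only the batch size of 64 and the 28x28 image size of Fashion MNIST are taken as given.

```python
import numpy as np

rng = np.random.default_rng(0)


def embed(x, w, b):
    """Shared subnetwork: one dense layer with ReLU (stand-in for the functional layer)."""
    return np.maximum(0.0, x @ w + b)


def euclidean_lambda(f_a, f_b):
    """Lambda layer: Euclidean distance between feature vectors, shape (batch, 1)."""
    return np.sqrt(np.sum((f_a - f_b) ** 2, axis=1, keepdims=True) + 1e-12)


def batch_norm(d, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization using the statistics of the current batch."""
    return gamma * (d - d.mean(axis=0)) / np.sqrt(d.var(axis=0) + eps) + beta


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bnsn_forward(x_a, x_b, params):
    """Score each pair in (0, 1); both branches use the same weights."""
    f_a = embed(x_a, params["w"], params["b"])
    f_b = embed(x_b, params["w"], params["b"])
    d = batch_norm(euclidean_lambda(f_a, f_b))
    # Single-unit dense layer with sigmoid activation.
    return sigmoid(d * params["w_out"] + params["b_out"])


batch, pixels, features = 64, 28 * 28, 16  # `features` is an illustrative size
params = {
    "w": rng.normal(0.0, 0.05, (pixels, features)),
    "b": np.zeros(features),
    "w_out": 1.0,
    "b_out": 0.0,
}
x_a = rng.random((batch, pixels))  # random stand-ins for flattened images
x_b = rng.random((batch, pixels))
scores = bnsn_forward(x_a, x_b, params)
```

With a positive output weight, a larger distance gives a score closer to 1, which matches the convention that Label 1 denotes different images. Because the two branches share weights and the distance is symmetric, swapping the inputs leaves the scores unchanged.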
IV. IMPLEMENTATION SETUP AND RESULTS

The Fashion MNIST dataset, split into 30,000 training images and 30,000 testing images, is used for estimating image similarity. The proposed BNSN architecture is designed with two input layers, a single functional layer, a lambda layer, a batch normalization layer, dense layers and an output layer for image similarity identification. Sample predictions on the Fashion MNIST dataset after fitting the proposed BNSN model are shown in Fig. 7. The Fashion MNIST testing images are fitted to the BNSN model and to a traditional image similarity method, and the resulting performance metrics are shown in Table I and Fig. 8.

Fig. 7. Sample Predictions of proposed BNSN model.

Fig. 8. Efficiency Analysis of proposed BNSN model.

TABLE I. PERFORMANCE COMPARISON OF PROPOSED BNSN.

Classifier              Accuracy (%)  Precision (%)  Recall (%)  F1-Score (%)
Traditional Method      83.43         83.34          84.65       84.54
Proposed BNSN method    91.91         92.93          90.72       91.81

After fitting the Fashion MNIST dataset to the proposed BNSN, the obtained training accuracy, contrastive loss and receiver operating characteristic are shown in Fig. 9, Fig. 10 and Fig. 11.

Fig. 9. Model Accuracy of proposed BNSN model.

Fig. 10. Contrastive Loss of proposed BNSN model.

Fig. 11. Receiver Operating Characteristic of proposed BNSN model.

V. CONCLUSION

This paper investigates the effectiveness of the proposed BNSN model in identifying image similarity. The novelty of the proposed BNSN model lies in adopting two subnetworks that generate feature vectors of the two input images. This paper attempts to show how well the proposed BNSN model performs when compared to the traditional image similarity method. Lambda batch normalization is performed with a single dense layer to predict the image similarity, with label 0 indicating identical images and label 1 denoting different images. The BNSN was fitted with 30,000 training images and tested with 30,000 images. The implementation of the proposed BNSN shows an accuracy of 91.91%, precision of 92.93%, recall of 90.72% and F-score of 91.81%.

REFERENCES

[1] S. Nandy, S. Haldar, Banerjee and S. Mitra, “A Survey on Applications
of Siamese Neural Networks in Computer Vision”, In Proceedings of
the International Conference for Emerging Technology, pp. 1-5, 2020.
doi: 10.1109/INCET49848.2020.9153977.
[2] W. Luo, A. G. Schwing and R. Urtasun, “Efficient Deep Learning for
Stereo Matching”, In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5695-5703, 2016. doi:
10.1109/CVPR.2016.614.
[3] C. de Mauro, M. Diligenti, M. Gori and M. Maggini, “Similarity
learning for graph-based image representations”, Pattern Recognition
Letters, vol. 24, pp. 1115-1122, 2003.

[4] C. Zhang, W. Liu, H. Ma and H. Fu, “Siamese neural network based
gait recognition for human identification”, In Proceedings of the IEEE
International Conference on Acoustics Speech and Signal Processing,
pp. 2832-2836, 2016.
[5] L. Leal-Taixe, C. Canton-Ferrer and K. Schindler, “Learning by
tracking: Siamese cnn for robust target association”, In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 33-40, 2016.
[6] KB. Low and UU. Sheikh, “Learning hierarchical representation using
siamese convolution neural network for human re-identification”, In
Proceedings of the International Conference on Digital Information
Management, pp. 217-222, 2015.
[7] BG. Kumar, G. Carneiro and I. Reid, “Learning local image descriptors
with deep siamese and triplet convolutional networks by minimising
global loss functions”, In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5385-5394, 2016.
[8] L. Hudec and W. Bencsova, “Texture Similarity Evaluation via
Siamese Convolutional Neural Network”, In Proceedings of the
International Conference on Systems, Signals and Image Processing,
pp. 1-5, 2018. doi: 10.1109/IWSSIP.2018.8439387.
[9] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun and C. Moore,
“Signature Verification Using a Siamese Time Delay Neural
Network”, International Journal of Pattern Recognition and Artificial
Intelligence, vol. 07, no. 04, pp. 669-688, 1993.
[10] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi and P. H. Torr,
“Fully-convolutional siamese networks for object tracking”, Lecture
Notes in Computer Science, vol. 9914 LNCS, pp. 850-865, 2016.
[11] Z. Zhu, C. Zhao and Y. Hou, “Research on Similarity Measurement for
Texture Image Retrieval”, PLoS ONE, vol. 7, no. 9, pp. e45302, 2012.
[12] J. Zujovic, T. N. Pappas, D. L. Neuhoff, R. Van Egmond and H. De
Ridder, “Effective and efficient subjective testing of texture similarity
metrics”, JOSA A, vol. 32, no. 2, pp. 329-342, 2015.
[13] A. A. R. Putra and S. Setumin, “The Performance of Siamese Neural
Network for Face Recognition using Different Activation Functions”,
In Proceedings of the International Conference of Technology, Science
and Administration, pp. 1-5, 2021.
[14] S. C. Hsiao, D. Y. Kao, Z. Y. Liu and R. Tso, “Malware image
classification using one-shot learning with siamese networks”,
Procedia Comput. Sci, vol. 159, pp. 1863-1871, 2019.
[15] C. Luan, R. Cui, L. Sun and Z. Lin, “A Siamese Network Utilizing
Image Structural Differences For Cross-Category Defect Detection”,
In Proceedings of the International Conference on Image Processing,
2020, pp. 778-782, doi: 10.1109/ICIP40778.2020.9191128.
[16] L. Fan, H. Liu and Y. Hou, “An Improved Siamese Network for Face
Sketch Recognition”, In Proceedings of the International Conference
on Machine Learning and Cybernetics, pp. 1-7, 2019. doi:
10.1109/ICMLC48188.2019.8949231.
