
Signature Object Detection based on YOLOv3

Nguyen Thi Phuong Thanh
Hanoi University
Hanoi, Vietnam
2001140049@s.hanu.edu.vn

Pham Thi Anh Duong
Hanoi University
Hanoi, Vietnam
2001140018@s.hanu.edu.vn
ABSTRACT
Since every individual's handwritten signature is unique, it has been utilized in identification and authentication. A plethora of research and implementations has reinforced the task of offline handwritten signature detection in order to replace the manual labor of sorting documents by hand. Developing a ready-to-deploy system for such a task remains challenging, since in the real world, handwritten signatures are often overlapped by lines, approval stamps, texts, etc. In this paper, a computer vision-based approach is proposed for dealing with this task, using the object detection model YOLOv3 trained on the Tobacco-800 dataset. After the model was trained, several inference tests were run to experiment with the proposed model. The outcome indicated that some difficulties still exist in the signature detection task.
1 INTRODUCTION
Despite the ever-evolving digitalized society, handwritten signatures remain one of the most widely used traditional biometric authentication techniques. From petition forms to bank notes, handwritten signatures are utilized as a way to "seal the deal" because they are simple and unique [10]. Traditionally, documents are scanned, and signatures are manually detected and verified by human evaluation. However, manually detecting signatures is a labor-intensive task. Therefore, it is necessary to develop an automated system for signature detection. Handwritten signature detection is an important task in the field of pattern recognition. Developing a robust model for handwritten signature detection opens the door to advancements in other OCR (Optical Character Recognition) tasks such as signature forgery detection, document classification, and information extraction. In real-world scenarios, an efficient system for signature detection is useful for the banking business, which relies heavily on user identification and authentication to handle finances.
2 LITERATURE REVIEW AND THEORETICAL BACKGROUND
An Artificial Neural Network (ANN) is a network system with many neural nodes. Specifically, an ANN simplifies and imitates the neural network of the human brain and shares several basic features with it [11]. However, the Convolutional Neural Network (CNN) was invented because ANNs did not always meet the requirements of the image processing area [3]. The CNN is an ANN structure based on the mechanism of natural vision (feature extraction): it can recognize visual patterns directly from the original pixels, and little pre-processing is required thanks to its built-in feature extraction [4]. CNN networks consist of many stacked convolutional layers, using activation functions such as tanh. Each layer, once activated, produces outputs that serve as input to the next layers, so each subsequent layer builds on the results of the previous one.

Through the training process, CNN layers automatically learn the values of the layer filters. The basic structure of a CNN includes three main parts: local receptive fields, shared weights and biases, and pooling [13]. The local receptive field filters the image data and information, selecting what is relevant. Shared weights and biases primarily serve to minimize the number of parameters in today's CNN networks [1]; each convolution produces different feature maps, and each feature map helps detect certain features in the image. The pooling layer is the last layer before producing the results: after the convolutional layers have been computed, the pooling layer simplifies and downsamples the output, discarding unnecessary information before producing the desired result (a minimal sketch of these operations follows).
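To make the three components above concrete, the following minimal NumPy sketch (our illustration, not code from the system described in this paper) applies one shared 3x3 filter over local receptive fields and then max-pools the resulting feature map:

    import numpy as np

    def conv2d(image, kernel):
        """Slide one shared kernel over every local receptive field."""
        h, w = image.shape
        kh, kw = kernel.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Each output value sees only a small local patch of the input,
                # and every patch is weighted by the same (shared) kernel.
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool(feature_map, size=2):
        """Downsample by keeping the maximum of each size x size block."""
        h, w = feature_map.shape
        return feature_map[:h - h % size, :w - w % size] \
            .reshape(h // size, size, w // size, size).max(axis=(1, 3))

    image = np.random.rand(8, 8)    # toy grayscale image
    kernel = np.random.rand(3, 3)   # one shared filter (learned during training)
    print(max_pool(np.tanh(conv2d(image, kernel))).shape)  # (3, 3)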
R-CNN was proposed in 2014, along with improved versions such as Fast R-CNN and Faster R-CNN, to improve CNN methods in image recognition [2]. R-CNN selects a large number of candidate bounding boxes via selective search, dividing the image into independent regions and using a CNN for feature extraction [6]. The R-CNN pipeline is trained as follows: first, candidate target regions are found by selective search and resized to a default size; second, for each image region a feature vector is generated through a CNN forward pass and fed into a binary SVM for classification; finally, a regression model is used to refine the bounding box positions and reduce localization errors.
You Only Look Once (YOLO) proposes an end-to-end neural network that predicts bounding boxes and class probabilities at the same time [9]. Instead of considering candidate regions one by one, YOLO divides the image into a square grid of size NxN. For each cell in the grid it tries m anchor boxes, and for each anchor box it predicts the class and the offsets of the object's bounding box. Finally, among the boxes whose confidence exceeds a threshold, an algorithm called Non-Maximum Suppression (NMS) retains only the best bounding boxes and outputs the results to the user (a minimal sketch is given below). In terms of speed, YOLO is quite fast, with a maximum test speed of 45 FPS [8]. The YOLO family has developed several models, from YOLOv1 to YOLOv8, each version building on and improving its predecessor. This article focuses on the YOLOv3 model.
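As an illustration only (not the authors' implementation), the following minimal NumPy sketch performs greedy NMS: boxes are sorted by confidence, and any box overlapping an already-kept box above an IoU threshold is discarded (IoU itself is defined in Section 2.2.4):

    import numpy as np

    def iou(a, b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy Non-Maximum Suppression; returns indices of kept boxes."""
        order = np.argsort(scores)[::-1]  # highest confidence first
        keep = []
        for i in order:
            # Keep a box only if it does not overlap any kept box too much.
            if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
                keep.append(i)
        return keep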
2.1 YOLOv3
YOLOv3 was proposed by Redmon and Farhadi in 2018 [10] and has evolved to incorporate several advanced features, yielding a more accurate and precise modern object detection algorithm. Since its introduction, several further versions (v4, v5, v6, v7, and v8) of the YOLO framework have been developed, often with increased depth through layer addition, while some versions mainly aim for lightweight architectures, faster detection speed, and improved mean average precision (mAP). YOLOv3 uses logistic regression to predict the objectness score of each bounding box and also changes the way the cost function is calculated: it uses independent logistic classifiers instead of the SoftMax function to compute the probability that the input belongs to a particular label, and for the classification loss it uses binary cross-entropy for each label rather than mean squared error. This reduces calculation complexity by avoiding the SoftMax operation, and YOLOv3 can thereby be up to three times faster. YOLOv3 also shows significant improvements in detecting small objects [2].

The YOLOv3 algorithm uses Darknet-53 as the backbone to extract features from the input image. Darknet-53 is a deep neural network written in C and Compute Unified Device Architecture (CUDA) whose convolutional layers are used for feature extraction. It adopts a Feature Pyramid Network (FPN) [6], which builds multi-scale feature maps from the input image, and it supports both CPU and GPU computation, making it easily accessible.

The YOLOv3 architecture uses a 1x1 detection kernel of depth B x (5 + C), where B is the number of bounding boxes that a feature-map cell can predict, '5' accounts for the four bounding-box offsets plus the objectness confidence, and C is the number of possible classes. YOLOv3 is notably effective at detecting smaller objects: to detect multiple objects of different sizes from the input image, three scales are deployed, where the 13x13 grid is responsible for detecting large objects, 26x26 for medium objects, and 52x52 for small objects, enabling effective object localization and detection in the YOLOv3 classifier.


2.2 Architecture in YOLOv3

Four major architectural updates in YOLOv3 include:
2.2.1 Residual Blocks and Skip Connections. To avoid vanishing or exploding gradients, YOLOv3 uses residual layers, in which a layer learns only the deviation from an identity mapping. This is necessary to prevent the convergence deterioration seen in very deep networks. With skip connections, the output of a layer is provided as input not only to the layer immediately following it but also to a layer further ahead in the network. Rather than learning a direct mapping, the model learns a residual on top of an input that is passed through unchanged, as the sketch below illustrates.
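For illustration (a sketch in the spirit of Darknet-53's residual units, assuming PyTorch; this is not the authors' code), a residual block adds the block's input back onto its convolutional output:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Darknet-53-style residual unit: 1x1 bottleneck, 3x3 conv, skip add."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels // 2)
            self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.act = nn.LeakyReLU(0.1)

        def forward(self, x):
            out = self.act(self.bn1(self.conv1(x)))
            out = self.act(self.bn2(self.conv2(out)))
            return x + out  # skip connection: gradients flow through the identity path

    x = torch.randn(1, 64, 52, 52)
    print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 52, 52])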

2.2.2 Grid Cells and Anchor Boxes. The purpose of the grid (maintained in YOLOv3) is to serve as the output of the architecture: B bounding boxes are predicted at each grid cell. Specifically, in YOLOv3 B = 3, so three bounding boxes are predicted in each grid cell, across three different prediction layers/grids covering 13×13, 26×26, and 52×52 grid cells. Each of the B = 3 boxes predicted at a grid cell is encoded as a vector of 85 values (four box offsets, one objectness score, and 80 class scores for the COCO dataset), so the required depth (number of feature maps) of the prediction layer is 85 × B = 255; the short calculation below makes this arithmetic explicit. To visualize this, we can first focus on how a given box of a given cell is encoded.
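As a quick check of this arithmetic (our illustration; the 85-value encoding assumes the 80-class COCO dataset, while a two-class signature detector needs a much shallower head):

    def head_depth(num_boxes, num_classes):
        """Feature-map depth of a YOLO prediction layer: B x (4 offsets + 1 objectness + C)."""
        return num_boxes * (4 + 1 + num_classes)

    print(head_depth(3, 80))  # 255 -> the 85 x 3 depth quoted above (COCO)
    print(head_depth(3, 2))   # 21  -> for this paper's two classes (signature / not signature)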
Figure 1: Darknet-53
2.2.3 Bounding-Box Regression. Bounding boxes provide outlines that highlight objects within a photo or video frame. YOLOv3 achieves a higher bounding-box-to-object ratio and improves object detection efficiency compared with previous architectural versions [7]. The attributes of a bounding box are described by

x, y, w (width), h (height), Pr (confidence)

where (bx, by) is the bounding box center, bw the bounding box width, bh the bounding box height, and Pr(C) the confidence probability of a class C such as person, car, traffic light, etc.

Figure 2: Transformations of Network Output Attributes of Anchors

The network output values (tx, ty, tw, th) are transformed into bounding box values (i.e., bx, by, bw, bh), where cx and cy denote the top-left coordinates of the grid cell and pw and ph represent the anchor (prior) dimensions.
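Written out explicitly (reconstructed here from the YOLOv3 paper [10], which Figure 2 depicts; these equations are not stated in this form in the original text, and \sigma denotes the logistic sigmoid):

    b_x = \sigma(t_x) + c_x
    b_y = \sigma(t_y) + c_y
    b_w = p_w \, e^{t_w}
    b_h = p_h \, e^{t_h}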
2.2.4 Intersection over Union (IoU). In object detection, Intersection over Union (IoU) is a metric used to evaluate the performance of an algorithm in detecting objects in an image. It is calculated as the ratio between the intersection of the predicted bounding box and the actual bounding box and the union of the two bounding boxes: a value of 1 represents perfect overlap, while a value of 0 represents no overlap. A predicted bounding box is considered "correctly" detected if its IoU score is higher than the specified threshold.

If two boxes overlap or intersect, we can calculate the IoU by finding the ratio of the area of overlap between the two boxes to the total area covered by the combined boxes:

    IoU = Area of Overlap / Area of Union    (1)
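For a concrete worked example (ours, not from the paper): for axis-aligned boxes A = (0, 0, 2, 2) and B = (1, 1, 3, 3), the overlap is the unit square (1, 1, 2, 2) with area 1, the union area is 4 + 4 - 1 = 7, and IoU = 1/7 ≈ 0.14, well below a typical detection threshold of 0.5.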
Object detection is performed in the YOLOv3 classifier using bounding boxes and the Intersection over Union (IoU) principle. During object detection, a score of 1 indicates that the predicted bounding box matches the ground truth box exactly, while a score of 0 implies that the predicted and ground truth boxes do not overlap at all. This mechanism is illustrated, for example, in Figure 2.

3 METHODOLOGY
3.1 Data Acquisition
The dataset used in this research is an extraction from the publicly available Tobacco-800 dataset [5, 12]. The Tobacco-800 dataset contains handwritten signatures from real-world documents and, unlike most publicly available signature datasets, it includes noise and artifacts on the signatures, such as stamps, handwritten texts, and ruling lines. The figure below shows example signatures of different users from the Tobacco-800 dataset.


Figure 3: Preview of handwritten signatures from four different individuals
3.2 Data Preparation
Our customized dataset contains a total of 936 images, which are split into two subsets for training and validation with a ratio of 85:15. The images are annotated and labeled according to two classes: not signature and signature. Each image's annotation metadata (class, box coordinates, height, width) is converted to YOLO format by normalization to the range (0, 1) and then stored in a *.txt file, as sketched below.

Figure 4: Example of the data for validation
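A minimal sketch of the annotation conversion described above (ours; it assumes pixel-coordinate boxes given as x_min, y_min, width, height, and the class index and image size in the example are hypothetical):

    def to_yolo(cls, x_min, y_min, box_w, box_h, img_w, img_h):
        """Convert a pixel-space box to a normalized YOLO label line:
        'class x_center y_center width height', all coordinates in (0, 1)."""
        xc = (x_min + box_w / 2) / img_w
        yc = (y_min + box_h / 2) / img_h
        return f"{cls} {xc:.6f} {yc:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"

    # e.g. a signature box (assuming class index 1 = signature)
    # on a hypothetical 1728 x 2292 page scan
    print(to_yolo(1, 410, 1530, 520, 140, 1728, 2292))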
4 RESULTS AND DISCUSSIONS

4.1 Training and Validation
The training process was conducted on Google Colaboratory (Google Colab), a web-based Jupyter Notebook service that offers a free and powerful GPU backend. The model was trained with a batch size of 12 for 52 epochs, and the results were displayed in real time during the training and validation process. Several metrics were used to evaluate the proposed model. Using the standard evaluation of mAP with an IoU threshold of 50%, mAP peaked at the 22nd epoch; the mAP score for IoU thresholds from 50% to 95% peaked at epoch 29.

Figure 5: Plots of box loss, objectness loss, classification loss, precision, recall and mean average precision (mAP) over the training epochs for the training and validation set.

In the figure above, three different loss metrics are shown: box loss, objectness loss, and classification loss. The box loss indicates how well the model can locate the center of the object and how well the predicted bounding box covers the ground truth object. Objectness loss is a measurement of the probability of an object being present in a proposed region of interest. As observed in the graph, our model's performance peaked during the first 20 epochs according to recall, precision, and mAP score (with a threshold of 0.50). After the 20th epoch, the model's performance stabilized.
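For reference (our sketch, not the authors' evaluation code), precision and recall at a fixed IoU threshold reduce to counts of true and false positives; the counts in the example are invented for illustration:

    def precision_recall(tp, fp, fn):
        """Precision = TP/(TP+FP); Recall = TP/(TP+FN).
        A detection counts as a true positive when its IoU with a
        ground-truth box exceeds the threshold (e.g. 0.5, Section 2.2.4)."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    print(precision_recall(tp=42, fp=8, fn=6))  # (0.84, 0.875)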

4.2 Model Inference
After training our model, an experimental testing approach was adopted using model inference. Several unseen images containing handwritten signatures from real-world documents were randomly selected for our model to predict whether a signature was present. As shown in Fig. 6 and Fig. 7, the model correctly predicted the signature class whenever no data noise or other pattern overwrote the signature.
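As an illustrative sketch only (the paper does not name its implementation; the repository, the 'custom' torch.hub entry point, the checkpoint name best.pt, and the test image are all assumptions in the style of the ultralytics YOLOv3 codebase):

    import torch

    # Load a custom-trained checkpoint via torch.hub (assumed ultralytics-style repo)
    model = torch.hub.load('ultralytics/yolov3', 'custom', path='best.pt')

    results = model('unseen_document.png')  # hypothetical unseen page scan
    results.print()   # class, confidence, and box for each detection
    results.save()    # writes the annotated image to disk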
Figure 6: An image from model inference results
However, the model had difficulty differentiating the not signature and signature classes when the signature was stamped over. On the other hand, the system performed relatively well with data noise such as overlapping text, as illustrated by Fig. 7.

Figure 7: An image from model inference results

5 CONCLUSION
Our proposed approach for biometric-based authentication uses the YOLOv3 model to detect handwritten signatures in real-world documents. The results show that the model still has many areas for improvement and is not yet ready for practical deployment. Further training and enhancement techniques, such as data augmentation, data cleaning, and oversampling, can be applied to the model for optimal performance.

REFERENCES
[1] Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-dujaili, Ye Duan, Omran Al-Shamma, José I. Santamaría, Mohammed Abdulraheem Fadhel, Muthana Al-Amidie, and Laith Farhan. 2021. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 8 (2021). https://api.semanticscholar.org/CorpusID:232434552
[2] Richeng Cheng. 2020. A survey: Comparison between Convolutional Neural Network and YOLO in image identification. Journal of Physics: Conference Series 1453 (01 2020), 012139. https://doi.org/10.1088/1742-6596/1453/1/012139
[3] Krzysztof J. Cios. 2017. Deep Neural Networks - A Brief History. CoRR abs/1701.05549 (2017). arXiv:1701.05549 http://arxiv.org/abs/1701.05549
[4] Zongjiang Gao, Yingjun Zhang, and Yuankui Li. 2020. Extracting features from infrared images using convolutional neural networks and transfer learning. Infrared Physics & Technology 105 (2020), 103237. https://doi.org/10.1016/j.infrared.2020.103237
[5] David D. Lewis, Gady Agam, Shlomo Engelson Argamon, Ophir Frieder, David A. Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (2006). https://api.semanticscholar.org/CorpusID:19516087
[6] Guiying Li, Junlong Liu, Chunhui Jiang, Liangpeng Zhang, Minlong Lin, and Ke Tang. 2017. Relief R-CNN: Utilizing Convolutional Features for Fast Object Detection. arXiv:1601.06719 [cs.CV]
[7] Kanyifeechukwu Jane Oguine, Ozioma Collins Oguine, and Hashim Ibrahim Bisallah. 2022. YOLO v3: Visual and Real-Time Object Detection Model for Smart Surveillance Systems (3s). arXiv:2209.12447 [cs.CV]
[8] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2015. You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640 http://arxiv.org/abs/1506.02640
[9] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You Only Look Once: Unified, Real-Time Object Detection. CoRR abs/1506.02640 (2015). arXiv:1506.02640 http://arxiv.org/abs/1506.02640
[10] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. ArXiv abs/1804.02767 (2018). https://api.semanticscholar.org/CorpusID:4714433
[11] Z.R. Yang and Z. Yang. 2014. 6.01 - Artificial Neural Networks. In Comprehensive Biomedical Physics, Anders Brahme (Ed.). Elsevier, Oxford, 1–17. https://doi.org/10.1016/B978-0-444-53632-7.01101-1
[12] Guangyu Zhu, Yefeng Zheng, David Doermann, and Stefan Jaeger. 2007. Multi-scale Structural Saliency for Signature Detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2007). 1–8.
[13] Yingmou Zhu, Hongming Chen, Wei Meng, Qing Xiong, and Yongjian Li. 2022. A wide kernel CNN-LSTM-based transfer learning method with domain adaptability for rolling bearing fault diagnosis with a small dataset. Advances in Mechanical Engineering 14, 11 (2022), 16878132221135745. https://doi.org/10.1177/16878132221135745
