layer addition, with some layers mainly seen to have a lightweight architecture, faster detection speed, and improved mean average precision (mAP). YOLOv3 uses logistic regression to predict the objectness score of each bounding box and also changes the way the cost function is calculated. It replaces the SoftMax function with independent logistic classifiers to calculate the probability that the input belongs to a particular label. In calculating classification loss, YOLOv3 does not use mean squared error but binary cross-entropy loss for each label, which reduces computational complexity by avoiding the SoftMax function. In YOLOv3, detection can be three times faster. YOLOv3 also shows significant improvements in detecting small objects [2]. The YOLOv3 algorithm uses Darknet-53 as the backbone to extract features from the input image. Darknet-53 is a deep neural network implemented in the Darknet framework, which is written in C and Compute Unified Device Architecture (CUDA); its convolutional layers are used for feature extraction. It adopts a Feature Pyramid Network (FPN) [6] structure, which provides feature maps extracted from the input image at multiple scales. It supports both CPU and GPU computation and is easily accessible.

The YOLOv3 architecture uses a detection kernel of shape 1 × 1 × (B × (5 + C)), where B is the number of bounding boxes that a cell of the feature map can predict, '5' accounts for the four bounding box offsets and one object confidence score, and C is the number of possible classes. A key advantage of YOLOv3 is its improved effectiveness in detecting smaller objects. To detect multiple objects in the input image, grids of different scales are deployed for effective object localization and detection in the YOLOv3 classifier: 13×13 for large objects, 26×26 for medium objects, and 52×52 for small objects.

2.2.2 Residual Box and Skip Connections. The purpose of this type of grid (maintained in YOLOv3) is to serve as the output of the architecture, so B bounding boxes are predicted at each grid cell. Specifically, in YOLOv3, B = 3, so three bounding boxes are predicted in each grid cell. The authors used three different prediction layers/grids covering 13×13, 26×26, and 52×52 grid cells, and B = 3 bounding boxes are predicted at each cell of each grid. Each box is encoded as a vector of 85 values. The required depth (number of feature maps) is therefore 85 × B = 255. To visualize this, we can first focus on how a given box of a given cell is encoded.

Figure 1: Darknet-53

2.2.3 Bounding-box Regression. Bounding boxes provide outlines that highlight objects within an image or video frame. YOLOv3 offers a higher bounding box to object ratio and improves object detection efficiency compared with previous versions of the architecture [7]. The attributes of a bounding box are

x, y, w (width), h (height), Pr (confidence)

where:
Bounding box center ($b_x$, $b_y$)
Bounding box width ($b_w$)
Bounding box height ($b_h$)
Confidence probability of a class ($\Pr(C)$), such as person, car, traffic light, etc.

Figure 2: Transformations of Network Output Attributes of Anchors
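As a rough illustration of the output layout in Section 2.2.2 and the transformation in Figure 2, the sketch below computes the detection depth and decodes one raw prediction into a bounding box. It assumes the standard YOLOv3 decoding rule from Redmon and Farhadi [10], $b_x = \sigma(t_x) + C_x$, $b_y = \sigma(t_y) + C_y$, $b_w = P_w e^{t_w}$, $b_h = P_h e^{t_h}$, which this paper does not spell out. The function names, the 80-class setting behind the value 85, and the sample anchor size are illustrative assumptions.

```python
import math

NUM_BOXES = 3             # B: boxes predicted per grid cell (from the text)
NUM_CLASSES = 80          # C: assumed class count, so that 5 + C = 85
GRID_SIZES = (13, 26, 52)  # the three prediction grids named in the text


def head_depth(num_boxes=NUM_BOXES, num_classes=NUM_CLASSES):
    """Depth of each detection feature map: B * (5 + C)."""
    return num_boxes * (5 + num_classes)


def total_boxes(grid_sizes=GRID_SIZES, num_boxes=NUM_BOXES):
    """Number of boxes predicted over all grids for one image."""
    return sum(g * g * num_boxes for g in grid_sizes)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw outputs (tx, ty, tw, th) to a box (bx, by, bw, bh).

    cx, cy: top-left corner of the grid cell, in grid-cell units.
    pw, ph: anchor width and height, in the same units (assumed values).
    """
    bx = sigmoid(tx) + cx      # centre stays inside the predicting cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # anchor size scaled by an exponential factor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


if __name__ == "__main__":
    print(head_depth())        # 255
    print(total_boxes())       # 10647
    print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=3.6, ph=2.8))
```

With all raw outputs at zero, the decoded box is centred in its cell and keeps the anchor's size, which makes the role of each term easy to check by hand.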
Signature Object Detection based on YOLOv3
Values of the network outputs ($t_x$, $t_y$, $t_w$, and $t_h$) are transformed into bounding box values (i.e., $b_x$, $b_y$, $b_w$, $b_h$). $C_x$ and $C_y$ denote the top-left coordinates of the grid cell, while $P_w$ and $P_h$ represent the width and height of the grid's anchors.

2.2.4 Intersection over Union (IoU). In object detection, Intersection over Union (IoU) is a metric used to evaluate the performance of an algorithm in detecting objects in an image. It is calculated as the ratio of the intersection of the predicted bounding box and the ground-truth bounding box to the union of the two boxes. A value of 1 represents perfect overlap, while a value of 0 represents no overlap. The predicted bounding box is considered "correctly" detected if its IoU score is higher than the specified threshold.

If two boxes overlap or intersect, we can calculate the IoU as the ratio of the area of overlap between the two boxes to the total area of the combined boxes:

\[ \mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \tag{1} \]

Object detection is performed in the YOLOv3 classifier using bounding boxes and the Intersection over Union (IoU) principle. During object detection, a score of 1 indicates that the predicted bounding box matches the ground-truth box exactly. A score of 0 implies that the predicted and ground-truth boxes do not overlap. This mechanism is illustrated, for example, in Figure 2.

3 METHODOLOGY

3.1 Data Acquisition
The dataset used in this research is extracted from the publicly available Tobacco-800 dataset [5, 12]. The Tobacco-800 dataset contains handwritten signatures from real-world documents and, unlike most publicly available signature datasets, it contains noise and artifacts, such as stamps, handwritten text, and ruling lines, on the signatures. Figure 3 shows example signatures of different users from the Tobacco-800 dataset.

Figure 3: Preview of handwritten signatures from four different individuals

3.2 Data Preparation
Our customized dataset contains a total of 936 images, which are split into two subsets for training and validation with a ratio of 85:15. The images are annotated and labeled according to two classes: not signature and signature. Each image's annotation metadata (class, box coordinates, height, width) is converted to YOLO format by normalization to the range (0, 1) and then stored in one *.txt file.

Figure 4: Example of the data for validation

4 RESULTS AND DISCUSSIONS

4.1 Training and Validation
The training process was conducted on Google Colaboratory (Google Colab), a web-based Jupyter Notebook service that offers a free and powerful GPU backend. The model was trained with a batch size of 12 for 52 epochs. The results were displayed in real time and could be viewed during the training and validation process. Several metrics were used to evaluate the proposed model. Using the standard mAP evaluation with an IoU threshold of 50%, mAP peaked at the 22nd epoch. The mAP score with IoU thresholds from 50% to 95% peaked at epoch 29.

In the figure above, three different loss metrics are shown: box loss, objectness loss, and classification loss. The box loss indicates how well the model can locate the center of the object and how well the predicted bounding box covers the ground-truth object. Objectness loss measures the probability that an object exists in a proposed region of interest. As observed in the graph, our model's performance peaked during the first 20 epochs according to recall, precision, and mAP score (with a threshold of 0.50). After the 20th epoch, the model's performance stabilized.
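The IoU thresholds behind the mAP scores above follow Eq. (1) in Section 2.2.4. A minimal sketch of that calculation is given below, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) corner coordinates. The function names and the box format are illustrative and are not taken from the authors' code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero
    # so that disjoint boxes give an empty intersection.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, truth_box, threshold=0.5):
    """A prediction counts as correct if its IoU is higher than the threshold."""
    return iou(pred_box, truth_box) > threshold


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # partial overlap: 1 / 7
    print(iou((0, 0, 1, 1), (2, 2, 3, 3)))   # no overlap
```

Evaluating mAP over thresholds from 50% to 95% amounts to repeating the correctness check with increasingly strict values of `threshold`.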
Nguyen Thi Phuong Thanh and Pham Thi Anh Duong
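The YOLO-format conversion mentioned in Section 3.2 can be sketched as follows. The sketch assumes the common YOLO label layout of one line per box, `class x_center y_center width height`, with all four box values normalized by the image size to the range (0, 1). The pixel-corner input format, the function name, and the use of class index 1 for signature are illustrative assumptions, since the paper does not give these details.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to one normalized YOLO label line.

    Input corners are in pixels; the output values are fractions of the
    image width and height, so they fall in the range (0, 1).
    """
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # Hypothetical signature box on a 1000 x 800 pixel page image.
    print(to_yolo_line(1, 100, 200, 300, 260, img_w=1000, img_h=800))
    # 1 0.200000 0.287500 0.200000 0.075000
```

Each line produced this way would be written to the *.txt label file that accompanies the image.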
4.2 Model Inference
After training our model, an experimental testing approach was adopted using model inference. Several unseen images containing handwritten signatures from real-world documents were randomly selected for our model to predict whether a signature was present. As shown in Fig. 6 and Fig. 7, the model correctly predicted the signature class if there was no data noise or other patterns overwriting the signature.

Figure 6: An image from model inference results

However, the model had difficulty differentiating between the not signature and signature classes when the signature was stamped over. On the other hand, the system performed relatively well with data noise such as overlapping text, as illustrated by Fig. 7.

5 CONCLUSION
Our proposed approach for biometric-based authentication is to use the YOLOv3 model to detect handwritten signatures in real-world documents. The results show that the model still has many areas for improvement and is not ready for practical implementation. Further training and enhancement techniques, such as data augmentation, data cleaning, and oversampling, can be applied to the model for optimal performance.

REFERENCES
[1] Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-dujaili, Ye Duan, Omran Al-Shamma, José I. Santamaría, Mohammed Abdulraheem Fadhel, Muthana Al-Amidie, and Laith Farhan. 2021. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 8 (2021). https://api.semanticscholar.org/CorpusID:232434552
[2] Richeng Cheng. 2020. A survey: Comparison between Convolutional Neural Network and YOLO in image identification. Journal of Physics: Conference Series 1453 (01 2020), 012139. https://doi.org/10.1088/1742-6596/1453/1/012139
[3] Krzysztof J. Cios. 2017. Deep Neural Networks - A Brief History. CoRR abs/1701.05549 (2017). arXiv:1701.05549 http://arxiv.org/abs/1701.05549
[4] Zongjiang Gao, Yingjun Zhang, and Yuankui Li. 2020. Extracting features from infrared images using convolutional neural networks and transfer learning. Infrared Physics & Technology 105 (2020), 103237. https://doi.org/10.1016/j.infrared.2020.103237
[5] David D. Lewis, Gady Agam, Shlomo Engelson Argamon, Ophir Frieder, David A. Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (2006). https://api.semanticscholar.org/CorpusID:19516087
[6] Guiying Li, Junlong Liu, Chunhui Jiang, Liangpeng Zhang, Minlong Lin, and Ke Tang. 2017. Relief R-CNN: Utilizing Convolutional Features for Fast Object Detection. arXiv:1601.06719 [cs.CV]
[7] Kanyifeechukwu Jane Oguine, Ozioma Collins Oguine, and Hashim Ibrahim Bisallah. 2022. YOLO v3: Visual and Real-Time Object Detection Model for Smart Surveillance Systems (3s). arXiv:2209.12447 [cs.CV]
[8] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2015. You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640 http://arxiv.org/abs/1506.02640
[9] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You Only Look Once: Unified, Real-Time Object Detection. CoRR abs/1506.02640 (2015). arXiv:1506.02640 http://arxiv.org/abs/1506.02640
[10] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. ArXiv abs/1804.02767 (2018). https://api.semanticscholar.org/CorpusID:4714433
[11] Z.R. Yang and Z. Yang. 2014. 6.01 - Artificial Neural Networks. In Comprehensive Biomedical Physics, Anders Brahme (Ed.). Elsevier, Oxford, 1–17. https://doi.org/10.1016/B978-0-444-53632-7.01101-1
[12] Guangyu Zhu, Yefeng Zheng, David Doermann, and Stefan Jaeger. 2007. Multi-scale Structural Saliency for Signature Detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2007). 1–8.
[13] Yingmou Zhu, Hongming Chen, Wei Meng, Qing Xiong, and Yongjian Li. 2022. A wide kernel CNN-LSTM-based transfer learning method with domain adaptability for rolling bearing fault diagnosis with a small dataset. Advances in Mechanical Engineering 14, 11 (2022), 16878132221135745. https://doi.org/10.1177/16878132221135745