Abstract
In this work, an efficient and accurate face recognition system based on edge processing using GPUs is developed. A complete pipeline containing a sequence of processing steps, including pre-processing, face feature extraction, and matching, is proposed. For these processing steps, lightweight deep neural models were developed and optimized so that they can be computationally accelerated on the embedded hardware of Nvidia's Jetson Nano. Besides the core processing pipeline, a database as well as a user application server were also developed to fully meet the requirements of readily commercialized applications. The experimental evaluation results show that our system achieves a very high accuracy on the BLUFR benchmark, with a precision of 98.642%. The system is also computationally efficient: the computing time to recognize an ID in a dataset of 1171 IDs with 10141 images on the Jetson Nano is only 165 ms. For the critical case, the system can process 4 camera streams and simultaneously recognize a maximum of 40 IDs within a computing time of 458 ms for each ID. With its high speed and accuracy, the developed system has high potential for practical applications.
Keywords: Face recognition, deep learning, GPUs, edge processing.
ISSN: 2734-9373
https://doi.org/10.51316/jst.171.ssad.2024.34.1.1
Received: February 15, 2023; accepted: September 27, 2023
JST: Smart Systems and Devices
Volume 34, Issue 1, January 2024, 001-008
Fig. 1. General processing steps of the whole face recognition system based on Jetson Nano
A key goal has been set that the system should be optimized so that the lightweight embedded hardware can process as many camera streams as possible. In conventional methods, a processing pipeline using the so-called GStreamer [10] is often used. Each camera stream is attached to a GStreamer-based pipeline, which contains a set of deep learning models. This approach has the disadvantage that the hardware utilization of the deep learning models is repeated across all camera streams, which limits the number of camera streams that can be processed on lightweight embedded hardware. Fortunately, Nvidia has released the DeepStream SDK [11], which allows multiple camera streams to be muxed into a single processing pipeline. This new approach is quite promising and thus should be further optimized and evaluated.

The DeepStream SDK from Nvidia is a complete streaming analytics toolkit based on GStreamer for AI-based multi-sensor processing and video, audio, and image understanding. DeepStream's multi-platform support gives us a faster, easier way to develop vision AI applications and services, which can be deployed on-premises, on the edge, and in the cloud. Fig. 2 depicts the entire platform. It is ideal for vision AI developers, software partners, startups, and OEMs who are developing IVA apps and services. Developers can create stream processing pipelines that incorporate neural networks and other complex processing tasks such as tracking, video encoding/decoding, and video rendering. DeepStream pipelines enable real-time video, image, and sensor data analytics. The platform provides extremely powerful tools with numerous advantages, including:

a) Powerful and flexible: the DeepStream SDK is suitable for a wide range of use cases across a wide range of industries.

b) Multiple programming options: using C/C++, Python, or the simple and intuitive UI of Graph Composer, you can create powerful vision AI applications.

c) Real-time insights: the platform can gain an understanding of rich, multimodal real-time sensor data at the edge.

d) Managed AI services: the platform can deploy AI services in Kubernetes-managed cloud-native containers.

e) Lower total cost of ownership: the platform increases stream density by training, adapting, and optimizing models with the TAO toolkit and deploying them with DeepStream.

In this work, an efficient and accurate face recognition system based on GPU-based embedded hardware is developed. The system has a complete processing pipeline that contains pre- and post-processing algorithms as well as deep convolutional neural models. The DeepStream SDK is exploited for the video processing, which includes the decoder and the mux. The decoder of DeepStream is already optimized for the hardware by Nvidia, so it is more efficient than building a GStreamer pipeline from scratch. The mux allows camera frames from several cameras to be batched into only one processing pipeline to reduce the utilization of hardware resources. Our contribution is that, instead of using the already-available plugins of deep models for face recognition, we develop and optimize the models ourselves. These models are designed to be lightweight while maintaining high accuracy. For this reason, pretrained models with the backbones of ViT-T [12], RetinaFace MNet [13], and MobileNetV2 [14] were fine-tuned using transfer learning techniques with a public dataset [15]. These models are then optimized and quantized so that they can be implemented and computationally accelerated on the Jetson Nano from Nvidia [16]. The accuracy of the system is evaluated via the BLUFR benchmark [17], and the computational efficiency is verified via several critical scenarios. For a complete system, a database that contains the data for recognizing people and a user application server were also developed.

2. System Description

The overall processing pipeline of the designed system is shown in Fig. 3. There are four main modules, each of which is responsible for a specific task. The first module is the face recognition pipeline, which contains a sequence of several processing steps. Video frames captured from cameras are streamed to the processing unit via the RTSP protocol. The video streams are decoded by the NVDEC. The decoded frames are converted to the RGBA format. These frames are routed to the batching system for further processing on the Jetson Nano's hardware CPU. All decoded frames from different cameras are pushed to a metadata batch and then sent to our self-developed plug-in, which performs the following processes:

a) Attaching the tracking identification (ID) for each camera stream to be managed.

b) Resizing image frames to 640x360 and converting them to the BGR format in order to satisfy the input requirement of the face detection model.

c) Performing the face detection inference to get the bounding box and landmarks of the detected face, which are consequently used for the tracking and alignment processes. Each detected face in the video frames is assigned a tracking ID. Face tracking enables us to avoid processing duplicates of a person's face and thus reduces the processing effort.

d) Using face anti-spoofing to eliminate fake cases in which someone intentionally generates the face image of a person of interest using images from smartphones, tablets, or printed paper, or even by wearing a mask [2].

e) Using the detected landmarks and an affine transformation to align the face. Subsequently, the face feature vector extraction model is performed. The extracted vectors are then matched with vectors from the database to find a person's ID [3-4].

The deep neural networks used for this processing are listed in Table 1. Most of these processes are implemented on the deep learning processing unit (GPU) of the Jetson Nano. To meet the requirements of real-time applications, we can use the GPU to accelerate the computation of the deep learning models through parallel computing.

Table 1. Technical details of the developed models.

  Model                         Backbone/specification
  Face anti-spoofing            MobileNetV2 [14]
  Face and landmark detection   RetinaFace MNet [13]
  Face feature extraction       ViT-T [12]
  Face alignment                Affine transformation [18]
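The alignment in step e) fits an affine transformation that maps the detected landmarks onto a fixed reference template. A minimal NumPy sketch is given below; the five-point 112x112 template coordinates (the ones commonly used with ArcFace-style models) and the least-squares fit are illustrative assumptions, since the paper only states that an affine transformation [18] is applied.

```python
# Least-squares estimate of the 2x3 affine matrix that maps detected
# landmarks onto a fixed reference template. The 112x112 five-point
# template below is the common ArcFace one -- an assumption, not a
# value taken from the paper.
import numpy as np

REF_112 = np.array([  # x, y of eyes, nose tip, and mouth corners
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float64)

def estimate_affine(src, dst=REF_112):
    """Solve dst ~= A @ [x, y, 1]^T for the 2x3 matrix A in least squares."""
    src = np.asarray(src, dtype=np.float64)
    ones = np.ones((src.shape[0], 1))
    X = np.hstack([src, ones])                   # (N, 3) homogeneous points
    A, *_ = np.linalg.lstsq(X, np.asarray(dst, dtype=np.float64), rcond=None)
    return A.T                                   # (2, 3) affine matrix

def warp_points(A, pts):
    """Apply the 2x3 affine matrix A to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=np.float64)
    return pts @ A[:, :2].T + A[:, 2]
```

In the actual pipeline, the estimated matrix would then be handed to an image-warping routine (e.g. OpenCV's cv2.warpAffine) to produce the aligned face crop for the feature extraction model.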
Fig. 3. Processing pipeline of the whole face recognition system based on Jetson Nano
The second module is the database, using PostgreSQL, which contains the face feature vectors of the corresponding images of each person's identification. This database is created by the third module, namely the server updater. This module also has the full deep learning models for face detection and face feature vector extraction. The extracted face feature vectors are then updated in the feature vector DB. The inputs to this module are the face images of persons of interest and the corresponding records of information such as their name, age, company, and email. These inputs can be imported from the so-called CIVAMS Web Server or from the image database. When a person is to be queried, their face feature vectors are matched with the feature vectors in the database to find the best-matched vector, i.e., the one with the maximum cosine similarity larger than a threshold. This process is handled by the system's ARM CPU, which is housed on the Jetson Nano chip. The fourth module is the user application, which contains the CIVAMS Web Server, the CIVAMS Application Servers, and other display devices and actuators for user-specific applications.

Fig. 4 describes a block diagram of the development and deployment of a deep learning model on the Jetson Nano platform. First, a deep neural network is trained based on the so-called PyTorch framework and a given dataset. After the training and testing process, a weight file "model.weight" is obtained. This model is then converted to the ONNX format to get "model.onnx". These files are then quantized and combined via the Nvidia TensorRT tool from the hardware provider to obtain "model.engine", which is stored on the SD card. For the hardware platform creation, several software frameworks are installed on the Jetson Nano, including Nvidia CUDA, Python, TensorRT, and OpenCV.
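The database query described above, which selects the stored vector with the maximum cosine similarity subject to a threshold, can be sketched as follows; the 0.5 threshold is illustrative, as the paper does not state the deployed value.

```python
# Cosine-similarity matching of a query face vector against the
# feature-vector database (sketch; the 0.5 threshold is illustrative,
# the paper does not report the deployed value).
import numpy as np

def match_id(query, db_vectors, db_ids, threshold=0.5):
    """Return the best-matching ID, or None if no similarity clears the threshold."""
    q = query / np.linalg.norm(query)
    db = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    sims = db @ q                  # cosine similarity to every stored vector
    best = int(np.argmax(sims))
    return db_ids[best] if sims[best] > threshold else None
```

In deployment, the rows of db_vectors would be the extracted face feature vectors and db_ids the corresponding person records held in the PostgreSQL database.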
Fig. 4. Development and deployment pipeline of deep learning models on Jetson Nano
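The conversion flow of Fig. 4 might be scripted as below. The file names follow the paper; torch.onnx.export and TensorRT's trtexec are real tools, but the input shape and the FP16 flag are assumptions rather than settings reported in the paper.

```python
# Sketch of the Fig. 4 model-conversion flow (PyTorch -> ONNX -> TensorRT
# engine). The 640x360 input shape and the --fp16 flag are assumptions.

def export_to_onnx(model, input_shape=(1, 3, 360, 640), path="model.onnx"):
    import torch  # deferred import: only needed on the training machine
    model.eval()
    dummy = torch.randn(*input_shape)
    torch.onnx.export(model, dummy, path,
                      input_names=["input"], output_names=["output"])

def trtexec_command(onnx_path="model.onnx", engine_path="model.engine",
                    fp16=True):
    # Build the conversion command; it should be run on the Jetson Nano
    # itself so the engine is tuned for its Maxwell GPU, and the resulting
    # model.engine is then stored on the SD card.
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # half-precision quantization
    return cmd
```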
The detailed technical specifications of the Jetson Nano from Nvidia can be found in [16]. This hardware is intended for the edge processing of deep neural networks. It is a complete multi-processor system on chip (MPSoC) that can be used for image processing. The hardware includes a 128-core Maxwell GPU, a quad-core ARM A57 CPU running at 1.43 GHz, and 4 GB of 64-bit LPDDR4 memory (25.6 GB/s). It supports video encoding of 4K at 30 fps, 4x 1080p at 30 fps, and 9x 720p at 30 fps (H.264/H.265), and video decoding of 4K at 60 fps, 2x 4K at 30 fps, 8x 1080p at 30 fps, and 18x 720p at 30 fps (H.264/H.265). The computational benchmark for this hardware can be found on the Nvidia homepage [16].

3. Results and Discussion

The face feature vector extraction model was developed based on the backbone of ViT-T [12] with a pretrained model. This model was fine-tuned on a public dataset, namely MS1M-ArcFace [15], which contains a set of 5.8M images of a total of 85k IDs. This dataset is additionally enriched by our self-collected dataset, which contains 200k images of 13k IDs. Importantly, these 200k images have challenging characteristics, including blur, side views, light mode (IR), long hair, and glasses, as illustrated in Fig. 5. The dataset is divided into two parts: 80% for the training set and 20% for the testing set. The fine-tuned model is then evaluated using a standard benchmark, namely BLUFR [17]. The evaluation results show that the developed model has an impressive accuracy of 98.642% and a true positive rate of 97.9% at a false acceptance rate of only 0.0001. The obtained accuracy is high enough for practical applications. ViT-T is a miniature version of the ViT, which helps to reduce the computational requirements of the inference process, allowing the system to be deployed on the Jetson Nano's lightweight hardware.

Aside from accuracy, the system's computational efficiency is the most important factor. Evaluation results of the computational performance on the Jetson Nano are shown in Table 2. The evaluation was performed on a test dataset containing 10141 images representing 1171 IDs that were collected from our real-life deployment systems. In this evaluation, the total processing time of the overall proposed pipeline is calculated under different scenarios. The number of camera streams and the number of IDs are varied to test the computing time of each processing step, including the detection time, the recognition time, the matching time, and, as a result, the total processing time for one ID. Critical cases are considered. It is seen that for the fastest case, one ID on only one camera stream, the total processing time is only 165 ms. For the critical case, where a total of 40 IDs from 10 camera streams are recognized with a total processing time of 458 ms for each ID, it can be concluded that the designed system has very high computational efficiency on the Jetson Nano, since the overall processing pipeline was optimized and only lightweight models were developed. This shows a high potential for practical applications.
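The reported figures (a true positive rate of 97.9% at a false acceptance rate of 0.0001) are of the TPR-at-fixed-FAR form used in BLUFR-style evaluation. A simplified sketch of how such a figure is computed from genuine and impostor similarity scores follows; it is an illustrative computation, not the official BLUFR protocol.

```python
# Verification rate in the style the paper reports: true positive rate
# at a fixed false acceptance rate. Simplified sketch, not the official
# BLUFR toolkit.
import numpy as np

def tpr_at_far(genuine, impostor, far=1e-4):
    """Pick the threshold at the given false-acceptance rate on impostor
    scores, then measure the fraction of genuine pairs exceeding it."""
    impostor = np.sort(np.asarray(impostor, dtype=np.float64))
    # choose the threshold so that at most `far` of impostor scores pass
    k = int(np.ceil(len(impostor) * (1.0 - far)))
    thresh = impostor[min(k, len(impostor) - 1)]
    genuine = np.asarray(genuine, dtype=np.float64)
    return float(np.mean(genuine > thresh)), float(thresh)
```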
In the future, the system will be extended in several directions. The use of the Jetson Nano has demonstrated a high potential for use as an edge processing device. Thus, a thorough evaluation of the device in the context of hybrid edge-central computing should be investigated.

Acknowledgement

This research is funded by the CMC Applied Technology Institute, CMC Corporation, Hanoi, Vietnam.

References

[1] Wang, Mei, and Weihong Deng, Deep face recognition: A survey. Neurocomputing, 429, 2021, pp. 215-244.
https://doi.org/10.1016/j.neucom.2020.10.081

[9] Xie, Y., Ding, L., Zhou, A. and Chen, G., An optimized face recognition for edge computing, presented at the 13th Int. Conf. on ASIC, Oct. 2019, pp. 1-4.
https://doi.org/10.1109/ASICON47005.2019.8983596

[10] GStreamer Plugin Overview - DeepStream 6.2 Release Documentation. [Online] Available: docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_Intro.html

[11] Nvidia DeepStream SDK. Nvidia Developer, 3 Feb. 2023. [Online] Available: developer.nvidia.com/deepstream-sdk

[12] Wu, K., Zhang, J., Peng, H., Liu, M., Xiao, B., Fu, J. and Yuan, L., TinyViT: Fast pretraining distillation for small vision transformers, in Proc. ECCV, Tel Aviv, Israel, October 23-27, 2022, Part XXI, pp. 68-85.
https://doi.org/10.1007/978-3-031-19803-8_5
[13] Deng, J., Guo, J., Ververas, E., Kotsia, I. and Zafeiriou, S., RetinaFace: Single-shot multi-level face localisation in the wild, in Proc. CVPR, Seattle, WA, USA, 2020, pp. 5202-5211.
https://doi.org/10.1109/CVPR42600.2020.00525

[14] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L. C., MobileNetV2: Inverted residuals and linear bottlenecks, in Proc. CVPR, Salt Lake City, UT, USA, 2018, pp. 4510-4520.
https://doi.org/10.1109/CVPR.2018.00474

[15] Guo, Y., Zhang, L., Hu, Y., He, X. and Gao, J., MS-Celeb-1M: A dataset and benchmark for large-scale face recognition, in Proc. ECCV, Amsterdam, The Netherlands, October 11-14, 2016, Part III, pp. 87-102.
https://doi.org/10.1007/978-3-319-46487-9_6

[16] Nvidia Jetson Nano Developer Kit. Nvidia Developer, 28 Sept. 2022. [Online] Available: developer.nvidia.com/embedded/jetson-nano-developer-kit

[17] Liao, S., Lei, Z., Yi, D. and Li, S. Z., A benchmark study of large-scale unconstrained face recognition, presented at the Int. Conf. on Biometrics, September 2014, pp. 1-8.
https://doi.org/10.1109/BTAS.2014.6996301

[18] Weisstein, Eric W. Affine Transformation. From MathWorld - A Wolfram Web Resource. [Online] Available: https://mathworld.wolfram.com/AffineTransformation.html