RESEARCH ARTICLE
Abstract  Innovations in Internet of Everything (IoE)-enabled systems are driving a change in the settings where we interact in smart units, recognized globally as smart city environments. Intelligent video-surveillance systems are critical to increasing the security of these smart cities. More precisely, in today's world of smart video surveillance, person re-identification (Re-ID) has gained increased consideration by researchers. Various researchers have designed deep learning-based algorithms for person Re-ID because such algorithms have achieved substantial breakthroughs in computer vision problems. In this line of research, we designed an adaptive feature refinement-based deep learning architecture to conduct person Re-ID. In the proposed architecture, the inter-channel and inter-spatial relationships of features between images of the same individual taken from nonidentical camera viewpoints are exploited by learning spatial and channel attention. In addition, a spatial pyramid pooling layer is inserted to extract multiscale, fixed-dimension feature vectors irrespective of the size of the feature maps. The model's effectiveness is validated on the CUHK01 and CUHK02 datasets. When compared with existing approaches, the approach presented in this paper achieves encouraging Rank 1 and Rank 5 scores of 24.6% and 54.8%, respectively.

Keywords  Internet of Everything (IoE), visual surveillance systems, big data, security systems, person re-identification (Re-ID), deep learning

Received January 25, 2022; accepted July 6, 2022
E-mail: smrho@cau.ac.kr; ssyeo@mokwon.ac.kr

1 Introduction
The foundation of smart cities can be traced to the benefits achieved in urban living, which have increased people's standard of living and the efficiency of resource usage [1]. These benefits are possible due to developments in Internet, communication, and information technology [2]. With the advancement of the Internet of Things (IoT), the smart city concept has been realized. Moreover, the emergence of the Internet of Everything (IoE) has been accompanied by the inclusion of individuals in the IoT paradigm, in which an interconnected system is consolidated in the IoE. As a result, the idea of a smart city combined with IoE foundations is invoked to support future potential applications [3]. In addition, the surveillance of these smart cities to limit different types of crime is a crucial requirement today. Hence, intelligent video surveillance is one of the major tools to empower the security of these smart cities.

Furthermore, researchers in the computer vision domain have focused on person re-identification (Re-ID) in recent years, as it is an effective tool for surveillance. The trend toward intelligent video-surveillance systems has risen in the contemporary era due to various security needs, such as the prevention of crimes and forensic examinations. Tracking and interpreting recorded footage is one of the most critical aspects of vision-based smart surveillance systems, yet manual monitoring of recorded videos by a person takes a great deal of time and energy. Person Re-ID is an essential activity in smart vision-based surveillance systems. It refers to the procedure of identifying the same subject over a collection of non-overlapping cameras located at various geographic locations, also known as multicamera surveillance systems [4,5].

Person Re-ID is a complex and demanding problem because videos are captured with non-overlapping cameras in diverse environments. Consequently, the use of fundamental data, such as a person's face, is ineffective for this purpose. Most researchers focus on the apparent features of the person, but these approaches face a great deal of uncertainty in visual appearance due to intra-class and inter-class challenges: the same individual might seem different, and different individuals might appear the same.

There are many reasons for these intra-class and inter-class variances, including changes in human body posture, lighting conditions, scene occlusions, background noise, and camera viewpoints over short and prolonged periods [6,7]. As
Front. Comput. Sci., 2023, 17(4): 174329
a result, the extraction of robust and discriminating features in varied and dynamic environments, and the mapping of features from similar groups to different groups to perform person Re-ID across multiple cameras, is an indispensable challenge.

Many deep learning-based algorithms have been proposed by researchers to address the challenge of person Re-ID and improve the outcomes [8]. Besides this, various research studies have combined deep learning and handcrafted approaches for person Re-ID [9,10]. Moreover, person Re-ID systems are categorized into three major classes [11]. Image-based person Re-ID is one of the classes and is based on image pairs only, each of which has single or multiple shots of an individual.

The second approach is person Re-ID using videos of individuals, which operates by examining various frames that depict the same individual. The last class is image-to-video-based person Re-ID, in which an algorithm searches a video sequence for the image of a person. Person Re-ID is also categorized into two types based on time. Short-term person Re-ID is when individuals appear on Camera 1 and, after a few seconds or minutes, appear on Camera 2. Similarly, if a long period is considered, an individual appears on Camera 1 and, after several days, appears on Camera 1 or 2. For person Re-ID over a short-term period, features based on visual appearance are the most widely used to re-identify the individual. Moreover, in some approaches, both visual and soft biometric-based features are used [12,13].

Considering the importance of person Re-ID in terms of applications and research areas, it has been well studied in various fields, such as metaphysics [14], psychology [4], and logic [4]. In a video-surveillance system, a person Re-ID system works by first presenting the image of an individual of interest as a query image to the system. The system then reveals whether this individual has been recorded at another location or time by another camera. From the standpoint of computer vision, the major and difficult challenge in person Re-ID is how two images that belong to the same individual are correctly matched amid intense visual changes, such as illumination, posture, and viewpoint; this challenge gives the problem significant scientific value. Therefore, due to the significance of person Re-ID in research and application areas, the research community has progressively adopted and designed advanced algorithms for stable and accurate person Re-ID. Some samples of paired images of the same subjects in distinct camera viewpoints are presented in Fig. 1.

In this paper, an adaptive feature refinement-based deep learning model is designed to extract the most relevant and discriminative features from the images to conduct person Re-ID. Traditional deep learning models frequently focus on feature extraction with significant discrimination while omitting promising features. Moreover, extracted fine-grained features may include redundancies that affect the accuracy of person Re-ID systems. When two images of the same person are captured with two different cameras, this mandates the design of a deep learning model that focuses on the most relevant and meaningful information in both images. Thus, more accurate and refined features provide better performance.

Hence, in this research study, an adaptive feature refinement block, also called a convolutional block attention module [15], is incorporated at every level of feature extraction. We aim to identify the most relevant and salient features through this refinement by learning inter-spatial and inter-channel relationships of features and diminishing the features of irrelevant areas of the images. For every extracted feature map, the proposed model makes inferences to learn the most discriminative features between two images of the same subject from different viewpoints. Channel and spatial attention maps exploit these inter-spatial and inter-channel relationships using three aggregation arrangements (i.e., sequential, parallel, and mixed). In addition, a spatial pyramid pooling (SPP) layer is deployed to optimally use the spatial information of the activations of convolutional operations and accumulate feature maps before the last fully
Fig. 1  Samples of paired images of the same person under different camera viewpoints
Muazzam MAQSOOD et al. Efficient deep learning-assisted person re-identification
connected (FC) layers. The major contributions of this article are listed below:

● An end-to-end holistic adaptive feature refinement-based lightweight deep learning model is proposed for effective person Re-ID in smart cities for surveillance purposes.
● The suggested deep learning model uses inter-spatial and inter-channel relationships of features to acquire attention maps, using spatial and channel attention to enhance the representation of relevant regions of interest and lower the extracted feature values of irrelevant locations.
● An SPP layer is incorporated before the final FC layers to fully employ the spatial information from the adaptive refinement of the features and acquire a feature representation with extensive spatial information.

The rest of the paper is partitioned into several sections. Section 2 presents existing work, and Section 3 details the proposed method. Next, Section 4 explains the experimental results, and Section 5 concludes the paper and provides future research directions.

2 Related work
Currently, person Re-ID using images requires additional consideration from researchers. Figure 2 depicts a generic overview of an image-based person Re-ID system. At testing time, a query image from the probe set is selected to be matched against all gallery images. The gallery and probe set images are from two different camera views. The person Re-ID system aims to rank the top N gallery images against the query image, as illustrated in Fig. 2. In addition, these images can be of any image type (i.e., RGB images, depth images, or infrared images).

Li et al. designed a deep learning algorithm to perform subject Re-ID [16]. Their model, DeepReID, introduced a filter pairing-based neural network (NN) consisting of six layers capable of handling misalignment, occlusions, noisy backgrounds, and geometric and photometric transformations. They evaluated the proposed model on the CUHK03 dataset, which comprises about 13,164 images of 1,360 subjects. A cumulative matching characteristic (CMC) curve was then established to analyze the efficiency, reporting a 26.51% Rank 1 recognition rate. Ahmed et al. designed another deep learning-based architecture to improve the performance [17]. Their algorithm used a special type of convolution, tied convolution, to extract the local relations between two input images. Moreover, a patch summary layer was integrated into the network, where neighboring differences were summarized. They evaluated the algorithms on the CUHK01, CUHK03, and VIPeR datasets.

Similarly, Cheng et al. designed another deep learning architecture based on improved feature extraction [18]. In their algorithm, a multichannel part-based framework was added along with an enhanced triplet loss function that draws on both the overall body and the body parts of a person in an image. Separate convolutional layers were designed to extract information from different body parts (e.g., one convolutional layer for the full body). The four datasets used for experimental purposes in their study were CUHK01, VIPeR, PRID2011, and i-LIDS.

The body portion that includes the individual's face and shoulders contributes most to the performance, whereas the lower body portions comprising the individual's legs and feet perform the worst. Some body parts have essential features to distinguish people; thus, Huang et al. proposed a part-based method called DeepDiff to learn distinctive features of individual parts [19]. This method splits the body parts and retrieves their feature representations by employing three NN-based architectures called subnets. Each subnet manages a specific kind of intra-class variation. The CUHK03, CUHK01, and VIPeR datasets were employed to validate the results, achieving 62.4%, 47.9%, and 43.2% Rank 1 scores, respectively. Their model helped accumulate better knowledge and derived distinct features in the area of local
Fig. 2 Pictorial representation of the internal working of the person re-identification system
Fig. 3 Internal architecture of the proposed deep learning model for re-identification of individuals
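To make the convolutional mapping of this section concrete, the following NumPy sketch implements the valid filter-bank convolution that the text formalizes as Eq. (1). The 60 × 160 × 3 input size and the 3 × 3 kernels follow the paper; the function names and the random filters are illustrative only, and, as in standard CNN libraries, the sliding-window product is implemented as cross-correlation.

```python
import numpy as np

def valid_conv2d(image, kernel):
    """Single-channel 'valid' convolution: the kernel slides only over
    positions fully inside the image, so an H x W input and a kh x kw
    kernel yield a (H - kh + 1) x (W - kw + 1) feature map."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def conv_layer(image, kernels):
    """Feature maps h_j as in Eq. (1): each filter's responses are
    linearly merged (summed) over the input channels."""
    maps = []
    for w_k in kernels:                      # one kh x kw kernel per channel
        h_j = sum(valid_conv2d(image[:, :, c], w_k[c])
                  for c in range(image.shape[2]))
        maps.append(h_j)
    return np.stack(maps)

# A 60 x 160 x 3 input I1 (as in the text) and two 3-channel 3 x 3 filters
rng = np.random.default_rng(0)
I1 = rng.standard_normal((60, 160, 3))
filters = rng.standard_normal((2, 3, 3, 3))
h = conv_layer(I1, filters)
print(h.shape)  # (2, 58, 158): two feature maps of size 58 x 158
```

The output height and width shrink by the kernel size minus one, which is exactly why the valid convolution of a 60 × 160 input with a 3 × 3 kernel yields 58 × 158 feature maps.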
recognition [38,39] and segmentation benchmarks [40,41]. A collection of filters {w_k}_{k∈K} convolves over the set of images of a subject {f_i}_{i∈I} taken by two distinct cameras and then generates another collection of images {H_j}_{j∈J}, referred to as feature maps or activation maps. The correspondence of input and output is maintained by a connection table CT (input i, filter k, and output j). The responses of the filters among inputs are linearly merged. This layer is responsible for the following mapping:

h_j(x) = \sum_{i,k \in CT_{i,j,k}} (f_i * w_k)(x).   (1)

In Eq. (1), the symbol * denotes the valid convolutional operation on images. The w_k of a certain layer has dimensions similar to the input and determines, together with the input size, the dimensions of the feature maps h_j. In addition, I_1 and I_2 are the two input images of dimensions 60 × 160 × 3 given as input to the first convolutional layer, on which the operation given in Eq. (1) is performed.

In the deep learning paradigm, convolutional features provide effective interpretations for a range of classification problems. As depicted in Fig. 3, both input images are operated on by the first convolution layer of kernel size 3 × 3. The obtained feature maps are processed by max-pooling layers with window dimensions of 2 × 2. After every level of feature representation by the convolutional and max-pooling layers, the adaptive feature refinement blocks are inserted to refine the features extracted from these layers. These refinement blocks are inserted in sequential, parallel, or mixed arrangements.

3.2 Adaptive feature refinement block
We must identify correlations between the two viewpoints to assess whether the subject in the two images is the same. Thus, to establish a correct match, person Re-ID requires the extraction of discriminative features, which must be done independently of difficulties such as occlusion, viewpoint, or illumination changes. The features extracted by the convolutional and max-pooling layers at every level are input into the adaptive feature refinement block. In this block, the irrelevant features from two different camera views are suppressed, and more attention is focused on discriminative features. This adaptive refinement of features is performed by incorporating the spatial and channel attention layers. Moreover, it is widely known that the attention mechanism plays an essential part in human perception [42]. One of the important characteristics of the human visual system is that humans cannot comprehend an entire visual scene at once [43]. Humans employ a chain of partial
glimpses, choosing and concentrating on essential areas of the scene to better capture the visual structure. Therefore, various researchers have attempted to incorporate these attention mechanisms into their deep learning models to increase their effectiveness.

In this research study, we used an attention mechanism in the proposed architecture that adaptively refines the features extracted from the convolutional layers. The intermediate feature maps F ∈ R^{C×H×W} resulting from the preceding layers are provided as input to the adaptive feature refinement block, which infers a 2D channel attention map M_c ∈ R^{C×1×1} and a 2D spatial attention map M_s ∈ R^{1×H×W}. The mathematical formulation of the adaptive feature refinement block is summarized in Eqs. (2) and (3) [15]:

F' = M_c(F) \otimes F,   (2)
F'' = M_s(F') \otimes F'.   (3)

In Eqs. (2) and (3), the symbol ⊗ represents element-by-element multiplication. With this multiplication, the attention values are broadcast accordingly: channel attention values are replicated along the spatial dimensions, and vice versa. The adaptively refined feature is denoted by F''.

For each attention module, Fig. 4 illustrates the computational process. Channel and spatial attention are generated by exploiting the inter-channel and inter-spatial relationships of features, respectively. The main difference between them is that channel attention maps emphasize "what" is most necessary in the image, whereas spatial attention maps emphasize "where" the informative regions of the image are. Both types of attention assist in learning which information to highlight or suppress, and ultimately the information flow throughout the network is improved. This method also helps learn the more meaningful and discriminative features of images from two non-overlapping camera views for person Re-ID. A detailed description of each attention map is given below.

3.2.1 Inter-channel relationships of features
A channel attention map is employed to exploit the inter-channel relationships of features between two images of the same subject captured from distinct angles. Every feature map channel acts as a feature detector; thus, channel-based attention focuses on what is relevant in a given input image. The spatial dimensions of the input feature map are squeezed to determine the channel attention effectively. Two spatial contexts are extracted, known as the average-pooled and max-pooled operations. These two types of spatial information, F^c_avg and F^c_max, are aggregated. An attention map based on the channel, M_c ∈ R^{C×1×1}, is then obtained by forwarding these two descriptors through a shared model. This shared model consists of a multilayer perceptron with one hidden layer.

Moreover, R^{C/r×1×1} is fixed as the size of the hidden activation to lessen the parameter overhead, where r is the reduction ratio. Following that, element-by-element summation is performed to integrate the two output feature vectors after applying the shared model to each spatial context, F^c_avg and F^c_max. A mathematical description of the channel attention used to draw inter-channel relationships is given below:

M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))).   (4)
In Eq. (4), σ represents the sigmoid function, W_0 ∈ R^{C/r×C}, and W_1 ∈ R^{C×C/r}. The weights W_0 and W_1 of the multilayer perceptron are shared for both inputs, and W_0 is followed by the rectified linear unit activation.

3.2.2 Inter-spatial relationship of features
We generated spatial attention maps using the inter-spatial relationship of features, as depicted in Fig. 4. In contrast to channel attention, spatial attention focuses on an informational component that complements channel attention. The spatial attention maps are generated along the channel axis by applying the average- and max-pooling operations and fusing them to create a productive feature descriptor. Applying pooling along the channel axis is proven to be efficient at emphasizing the informative areas of the images. A convolutional operation is applied on the fused feature descriptor, which encodes where to emphasize and where to inhibit, to compute the spatial attention map M_s ∈ R^{1×H×W}. Two 2D maps are generated by combining the channel information of the feature maps using two pooling operations (i.e., F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W}); these represent the average-pooled and maximum-pooled features over the channel. Later, a 2D attention map is computed by fusing these, followed by applying a standard convolutional layer. A mathematical description of the spatial attention is given below:

M_s(F) = \sigma(f^{7\times7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) = \sigma(f^{7\times7}([F^s_{avg}; F^s_{max}])).   (5)

In Eq. (5), σ represents the sigmoid function, and f^{7×7} symbolizes a convolutional procedure with filter dimensions of 7 × 7.

3.2.3 Aggregating channel and spatial attention maps
For every input image of a person, the two attention maps, channel and spatial, are aggregated to calculate the complementary attention, concentrating on what and where. These two attention maps are computed sequentially or in parallel, where multiplication is performed between the attention maps and the input feature map F ∈ R^{C×H×W} to refine the features adaptively. The aggregation of the spatial and channel attention maps in a sequential arrangement is illustrated in Fig. 4, whereas the parallel arrangement of their aggregation is displayed in Fig. 5. Moreover, a mixed approach is also employed, where both sequential and parallel configurations are used.

3.3 Spatial pyramid pooling layer
The aggregated attention maps resulting from the feature refinement of the last convolutional layer are passed as input to the SPP layer [44]. All features are pooled using this SPP layer, producing fixed-dimension outcomes, which are then sent to the FC layers. Generally, the SPP layer enables the CNN to take inputs of any size, increasing the model's scale invariance, suppressing overfitting, and permitting the extraction of local features from the data at various scales. By employing several distinctly sized pooling operations, this layer ensures a fixed eigenvector output for input of any scale.

This study incorporates an SPP layer into the proposed deep learning model. This layer is deployed between the convolutional and FC layers. In every spatial bin, the response of each filter is pooled. The SPP produces vectors of kM dimensions, with M specifying the number of bins (k denotes the number of filters used in the prior convolutional operation). The FC layer receives these fixed-dimensional vectors as input. One primary insight on employing the SPP layer in the proposed architecture is its significant characteristic of resilience in multilevel pooling toward object deformation [45]. The architecture of the SPP layer is presented in Fig. 6.

4 Experiment and discussion
4.1 Dataset and protocols
This section discusses the results of the designed model under different experimental settings, followed by a discussion and comparisons. To assess the efficacy of the proposed model, we used the CUHK02 dataset, a publicly available benchmark [46]. The CUHK02 dataset comprises 1,816 people. The images of individuals are captured from five pairs of camera viewpoints (10 viewpoints). These pairs are numbered from P1 to P5. More specifically, P1 contains 971 identities, P2 contains 306, P3 contains 107, P4 contains 193, and P5 contains 239. In every pair, there are two images per subject. Pair P1 belongs to the CUHK01 dataset, which is primarily used in existing research studies [47]. In addition, CUHK02 is a larger version of the CUHK01 dataset that includes more identities and angle points.

This database assesses the efficiency when using different camera viewpoints for training and testing sets. We used this dataset in different experimental settings, as listed in Table 1. Furthermore, the assessment metric used to analyze the effectiveness of the proposed model is the CMC curve with a single-gallery-shot setting. This metric finds the probability of
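A compact NumPy sketch may help make the adaptive feature refinement block of Section 3.2 concrete: it implements the channel attention of Eq. (4), the spatial attention of Eq. (5) with a 7 × 7 filter, the sequential aggregation of Eqs. (2) and (3), and a parallel variant. The random weights, the reduction ratio r = 4, and the fusion of the parallel branches by averaging are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """M_c of Eq. (4): squeeze H x W by average- and max-pooling, pass
    both (C,) descriptors through the shared MLP, sum, then sigmoid."""
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)   # ReLU follows W0
    return sigmoid(mlp(F.mean(axis=(1, 2))) + mlp(F.max(axis=(1, 2))))

def spatial_attention(F, kernel):
    """M_s of Eq. (5): pool along the channel axis, stack the two
    1 x H x W maps, and convolve with a 7 x 7 filter (zero padding)."""
    stacked = np.stack([F.mean(axis=0), F.max(axis=0)])   # (2, H, W)
    kh, kw = kernel.shape[1:]
    padded = np.pad(stacked, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = F.shape[1:]
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(padded[:, y:y + kh, x:x + kw] * kernel)
    return sigmoid(out)

def refine_sequential(F, W0, W1, k):
    """Eqs. (2)-(3): channel attention first, spatial attention second."""
    Fp = channel_attention(F, W0, W1)[:, None, None] * F
    return spatial_attention(Fp, k)[None, :, :] * Fp

def refine_parallel(F, W0, W1, k):
    """Parallel arrangement (Fig. 5): both maps attend to the same input
    F; averaging the two attended maps is an assumed fusion rule."""
    Fc = channel_attention(F, W0, W1)[:, None, None] * F
    Fs = spatial_attention(F, k)[None, :, :] * F
    return 0.5 * (Fc + Fs)

rng = np.random.default_rng(0)
C, H, W, r = 8, 10, 10, 4                    # r is the reduction ratio
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C))        # R^{C/r x C}
W1 = rng.standard_normal((C, C // r))        # R^{C x C/r}
k7 = rng.standard_normal((2, 7, 7))          # 7 x 7 filter over 2 maps

print(refine_sequential(F, W0, W1, k7).shape)  # (8, 10, 10)
print(refine_parallel(F, W0, W1, k7).shape)    # (8, 10, 10)
```

Either arrangement leaves the feature map shape unchanged, which is what allows the refinement block to be dropped in after every convolutional level.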
Fig. 5 Parallel arrangement of attention mechanisms in the adaptive feature refinement block
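The fixed-length behavior of the SPP layer (Section 3.3) can be sketched as follows. The pyramid levels (4 × 4, 2 × 2, 1 × 1) and the use of max-pooling per bin are assumptions in the spirit of the original SPP formulation [44]; the paper does not list its bin sizes here.

```python
import numpy as np

def spp(feature_maps, levels=(4, 2, 1)):
    """Spatial pyramid pooling: max-pool each of the k feature maps into
    an n x n grid for every pyramid level n, then concatenate. The
    output length k * M (M = total number of bins) is independent of
    the spatial size H x W of the input."""
    k, H, W = feature_maps.shape
    pooled = []
    for n in levels:
        # Bin edges cover the map even when H, W are not divisible by n
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                patch = feature_maps[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(patch.max(axis=(1, 2)))   # one (k,) vector per bin
    return np.concatenate(pooled)                       # length k * M

rng = np.random.default_rng(2)
M = 4 * 4 + 2 * 2 + 1                       # 21 bins for levels (4, 2, 1)
v1 = spp(rng.standard_normal((32, 15, 40)))
v2 = spp(rng.standard_normal((32, 7, 19)))  # different spatial size
print(v1.shape, v2.shape, 32 * M)  # both vectors have length 672 = 32 * 21
```

Both inputs, despite their different spatial dimensions, yield vectors of the same kM length, which is exactly the property that lets the FC layers follow feature maps of any size.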
Fig. 7  (a) Ranks 1 to 15 for the identification rates of the overall camera view pairs in the series setting of the adaptive feature refinement block, (b) Ranks 1 to 15 for the identification rates of the overall camera view pairs in the parallel setting of the adaptive feature refinement block, and (c) Ranks 1 to 100 for the identification rates over the P1 camera view pairs in both series and parallel settings of the adaptive feature refinement block
channel attention module, and following that, the result of this module is input into the spatial attention map. Furthermore, in Fig. 7(b), the results of all camera views (P1−P5) are presented. In this setup, the inter-channel and inter-spatial relationships are focused on parallel configurations of the input feature maps, as displayed in Fig. 7(b). The results of this type of arrangement in the model are also encouraging. The Rank 1 to 15 scores of P1 range from 12% to 57%. Similarly, the Rank 1 to 15 scores for camera pair P2 are 24% to 93.4%, and the Rank 1 to 15 scores of camera pair P3 are 33% to 100%. Furthermore, the camera pairs P4 and P5 also have encouraging Rank 1 to 15 scores of 13% to 78% and 13% to 73%, respectively. The parallel arrangement for acquiring the inter-channel and inter-spatial relationships works better than the series arrangement.

However, the number of subjects in camera pair P1 is greater than in the remaining pairs. Thus, rank scores from 1 to 100 with gaps of 5 are also plotted for this camera pair view. Figure 7(c) lists the values of Ranks 1 to 100 for both parallel and sequential arrangements of the adaptive feature refinement blocks for camera pair P1. More explicitly, Fig. 7(c) provides the rank scores for both arrangements, ranging from 10% to 94.9% for the sequential case and 12% to 95% for the parallel case. Again, the parallel configuration performs better than the sequential configuration in this case. Furthermore, the training and loss values for each camera pair view are also recorded in both layouts of these inter-channel and inter-spatial layers. Figures 8(a) and (b) provide the accuracy and loss values of all experiments during training. In Figs. 8(a) and (b), the y-axis indicates the values of accuracy and loss, and the x-axis denotes the number of epochs. Each curve provides the accuracy and loss values for each camera view (i.e., P1, P2, P3, P4, and P5) for both the spatial and channel attention settings (i.e., parallel and series). All experiments run for 100 epochs. After 20 epochs, the training accuracy exceeds 80%, and the loss values approach zero. Furthermore, hyperparameters play a critical role in model performance.

All the experiments in Figs. 7(a), (b), and (c) were run with the Adam optimizer using a learning rate of 0.0001, and the same set of experiments was also repeated using the stochastic gradient descent optimizer with a 0.001 learning rate, as displayed in Figs. 9(a), (b), and (c). Figure 9 has six graphs; the x-axis of each graph represents the rank values, whereas the y-axis represents the identification rates. Each curve in a graph represents the rank scores for a camera view (P1, P2, P3, P4, or P5). Furthermore, the model performance is assessed via a mixed configuration of the adaptive feature refinement blocks. In this approach, the adaptive feature refinement block is organized in parallel form in the first two convolutional blocks and in series form after the last convolutional block. With this model configuration, the experiments are run with different hyperparameter settings, such as Adam and stochastic gradient descent, with
Fig. 8 (a, b) Accuracy and loss curves during training for all camera views in both attention mechanism arrangements
0.001 and 0.0001 learning rates, respectively. The Rank 1 to 15 scores in this setup for all camera views are presented in Figs. 9(d), (e), and (f).

4.3 Person re-identification with unknown subject and viewpoint
To assess the precision of an automated real-time surveillance system, person Re-ID should be performed with unknown camera viewpoints and people. In the prior arrangement, the model is trained on known camera viewpoints, with the only difference being that the people are different. For example, in a public area, two different cameras are installed at two locations (viewpoints); if the image of a person enrolled in the system is captured by these two cameras, then the system should be able to identify the subject. Similarly, if an unknown person comes under these two camera viewpoints, the system should be able to recognize the unknown person (i.e., our previous experimental setup). In the present experimental setup, instead of only recognizing an unknown person from different camera viewpoints, the system should also recognize unknown people from two unknown camera viewpoints; hence, it is called person Re-ID with unknown subject and viewpoint.

This research study also investigated the model performance with this realistic experimental setup. As mentioned, the entire CUHK02 dataset comprises 10 camera views divided into pairs P1 to P5. The P1 pair belongs to the CUHK01 dataset, whereas the remaining pair views belong to the CUHK02 dataset. In this scenario, the model was trained on camera pair P1 and tested on the remaining pairs P2 to P5. This setting also presents a cross-dataset scenario.

Therefore, the details of the total number of training and testing subjects and the camera views are listed in Table 3. After the division of data according to this scenario, the positive and negative image pairs were generated using camera view pair P1, and the results were calculated using the remaining views separately. The Rank 1 to 15 scores in this experimental setup for camera view pair P2 range from 27% to 93%, as displayed in Table 2, row 1, with the sequential arrangement of the inter-channel and inter-spatial-based attention mechanisms. Following that, the results for camera pair views P3 to P5 are given in rows 2, 3, and 4. For camera pair view P3, the rank scores range from 19% to 57%. The number of subjects in this pair is considerably smaller; thus, the values of the highest ranks are not reported. Similarly, for camera pairs P4 and P5, the rank scores are 10% to 56% and 13% to 57%, respectively. In addition, the same set of experiments is repeated for the parallel configuration of the adaptive feature refinement blocks. In this case, the scores for Ranks 1, 4, 8, 10, 12, and 15 are listed in rows 6, 7, 8, and 9 of Table 2.

With camera view pair P2, the rank scores from 1 to 15 are 22% to 95%. Similarly, for pair P3, the scores are 23.1% to 52%, and for camera pair P4, the scores are 18% to 64%. Lastly, for camera pair P5, the Rank 1 to 15 scores are 13% to 62%. It seems that a model based on parallel configurations of inter-channel and inter-spatial relationships works better than series configurations. In addition, in this challenging experimental setup, the proposed model learns to perform
Fig. 9  (a, b, c) Ranks 1 to 15 of the identification rates for the overall camera view pairs in series and parallel settings for the adaptive feature refinement block with the stochastic gradient descent optimizer (lr = 0.001). (d, e, f) Ranks 1 to 15 of the identification rates for the overall camera view pairs in mixed configurations for sequential and parallel arrangements of adaptive feature refinement blocks
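The identification rates plotted against ranks in Figs. 7 and 9 are CMC scores. A minimal single-gallery-shot computation of such a curve might look like the sketch below; the toy distance matrix and the convention that query i matches gallery identity i are illustrative assumptions.

```python
import numpy as np

def cmc_curve(dist, max_rank=15):
    """Single-gallery-shot CMC: dist[q, g] is the distance between query
    q and gallery identity g, with query i matching gallery i. CMC(r) is
    the fraction of queries whose true match appears among the r closest
    gallery entries."""
    n = dist.shape[0]
    curve = np.zeros(max_rank)
    for q in range(n):
        order = np.argsort(dist[q])                # gallery sorted by distance
        rank = int(np.where(order == q)[0][0])     # position of the true match
        if rank < max_rank:
            curve[rank:] += 1                      # counts all ranks >= true rank
    return curve / n

# Toy example with 4 identities: queries 0 and 1 match at Rank 1,
# query 2 at Rank 2, and query 3 at Rank 3
dist = np.array([[0.1, 0.9, 0.8, 0.7],
                 [0.9, 0.2, 0.8, 0.7],
                 [0.3, 0.9, 0.5, 0.7],
                 [0.3, 0.4, 0.9, 0.6]], dtype=float)
print(cmc_curve(dist, max_rank=4))  # [0.5, 0.75, 1.0, 1.0]
```

By construction the curve is non-decreasing in the rank, which is why the Rank 15 scores reported in the text are always at least as high as the Rank 1 scores.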
Table 2  Rank-wise identification rates for each camera view pair with sequential and parallel arrangements of the adaptive feature refinement block

No.  Camera view  Rank 1/%  Rank 4/%  Rank 8/%  Rank 10/%  Rank 12/%  Rank 15/%
Sequential arrangement of adaptive feature refinement block
1    P2           27        55        80        86         91         93
2    P3           19        57        −         −          −          −
3    P4           10        40        56        −          −          −
4    P5           13        28        46        57         −          −
Parallel arrangement of adaptive feature refinement block
6    P2           22        59        75        85         88         95
7    P3           23.1      52        −         −          −          −
8    P4           18        43        64        −          −          −
9    P5           13        28        51        62         −          −
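The two arrangements compared in Table 2 differ only in how the channel and spatial attention maps are composed. The numpy sketch below illustrates the idea with parameter-free squeeze-and-gate stand-ins for the paper's learned attention blocks; the function names and the averaging fusion in the parallel branch are assumptions for illustration, not the authors' exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    # x: (C, H, W). Squeeze the spatial dims and gate each channel.
    gate = sigmoid(x.mean(axis=(1, 2)))       # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    # Gate each spatial position by its cross-channel average.
    gate = sigmoid(x.mean(axis=0))            # (H, W)
    return x * gate[None, :, :]

def refine_sequential(x):
    # Series arrangement: spatial attention refines the channel-attended map.
    return spatial_attention(channel_attention(x))

def refine_parallel(x):
    # Parallel arrangement: both branches see the same input; outputs are fused.
    return 0.5 * (channel_attention(x) + spatial_attention(x))

features = np.random.rand(4, 8, 8)            # a toy (C, H, W) feature map
assert refine_sequential(features).shape == features.shape
assert refine_parallel(features).shape == features.shape
```

The behavioral difference is visible even in this sketch: in the series form the spatial gate is computed from already channel-reweighted features, while in the parallel form both gates are computed from the raw input, which is the distinction the rows of Table 2 probe empirically.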
Fig. 10 Top five matches against query image from the gallery set
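The spatial pyramid pooling (SPP) layer used in the proposed architecture produces a fixed-length descriptor regardless of the feature map's spatial size by pooling over grids of several scales [44]. Below is a minimal sketch assuming max pooling and a {1×1, 2×2, 4×4} pyramid; the actual pyramid levels and pooling operator in the proposed model may differ.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Pool a (C, H, W) feature map into a fixed-length vector: each level l
    splits the map into an l x l grid and max-pools every cell, so the output
    length is C * sum(l*l) regardless of H and W."""
    C, H, W = feature_map.shape
    out = []
    for l in levels:
        hs = np.linspace(0, H, l + 1).astype(int)   # row boundaries of the grid
        ws = np.linspace(0, W, l + 1).astype(int)   # column boundaries
        for i in range(l):
            for j in range(l):
                cell = feature_map[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                out.append(cell.max(axis=(1, 2)))   # (C,) max per channel
    return np.concatenate(out)                      # length C * (1 + 4 + 16)

x = np.random.rand(8, 13, 9)     # arbitrary spatial size
print(spp(x).shape)              # (168,) = 8 * 21
```

Because the grid boundaries scale with H and W, images of different resolutions yield descriptors of identical length, which is what lets the network accept multi-scale inputs.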
aspects of the individual's visual appearance, such as clothing conditions, have significantly changed? In such a scenario, the performance of the proposed model might not be as good because substantial changes occur in the features of the images due to the changes in the visual appearance of the subjects.

5 Conclusion
With the innovation of the IoE, the concepts of smart cities, smart homes, and smart environments have emerged. However, the surveillance of these smart cities is one of the major challenges. Hence, this paper provides a deep learning-assisted solution for person Re-ID for surveillance and security purposes in these smart cities. The variability in the appearance of the same person under different camera viewpoints makes person Re-ID a very difficult and demanding task.
To address this task, the proposed deep learning model is based on adaptive feature refinement with SPP layers. The designed approach learns the image features by considering the inter-channel and inter-spatial associations using attention mechanisms and focuses more on the discriminative regions of the images. Moreover, spatial learning is further enhanced by using SPP layers that employ pooling windows of different scales to extract features. Furthermore, we have used several experimental situations in which unknown viewpoints and subjects are considered. The proposed framework achieves encouraging average Rank 1 and Rank 5 scores of 24.6% and 54.8%, respectively. According to the findings, when both views and subjects are varied, the deep learning model has difficulty in re-identifying a person. This is because the features of distinct subjects' images change significantly with a change in the camera view. However, when compared to existing methods, the overall findings in terms of average rank scores are acceptable. In the future, we intend to address the challenge of person Re-ID in long-term scenarios.

Acknowledgements This paper was supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008703, The Competency Development Program for Industry Specialist) and also by the MSIT (Ministry of Science and ICT), Republic of Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2018-0-01799) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

References
1. Neirotti P, De Marco A, Cagliano A C, Mangano G, Scorrano F. Current trends in smart city initiatives: some stylised facts. Cities, 2014, 38: 25–36
2. Vlacheas P, Giaffreda R, Stavroulaki V, Kelaidonis D, Foteinos V, Poulios G, Demestichas P, Somov A, Biswas A R, Moessner K. Enabling smart cities through a cognitive management framework for the internet of things. IEEE Communications Magazine, 2013, 51(6):
DA-net architecture for lung nodule segmentation. Mathematics, 2021, 9(13): 1457
42. Niu Z, Zhong G, Yu H. A review on the attention mechanism of deep learning. Neurocomputing, 2021, 452: 48–62
43. Guo M H, Xu T X, Liu J J, Liu Z N, Jiang P T, Mu T J, Zhang S H, Martin R R, Cheng M M, Hu S M. Attention mechanisms in computer vision: a survey. Computational Visual Media, 2022, 8(3): 331–368
44. He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904–1916
45. Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2006, 2169–2178
46. Li W, Wang X. Locally aligned feature transforms across views. In: Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. 2013, 3594–3601
47. Li W, Zhao R, Wang X. Human reidentification with transferred metric learning. In: Proceedings of the 11th Asian Conference on Computer Vision. 2012, 31–44
48. Köstinger M, Hirzer M, Wohlhart P, Roth P M, Bischof H. Large scale metric learning from equivalence constraints. In: Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012, 2288–2295
49. Zheng L, Shen L, Tian L, Wang S, Wang J, Tian Q. Scalable person re-identification: a benchmark. In: Proceedings of 2015 IEEE International Conference on Computer Vision. 2015, 1116–1124
50. Fan H, Zheng L, Yan C, Yang Y. Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(4): 83
51. Feng G, Liu W, Tao D, Zhou Y. Hessian regularized distance metric learning for people re-identification. Neural Processing Letters, 2019, 50(3): 2087–2100

Muazzam Maqsood is serving as an Assistant Professor at the Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan. He holds a PhD in software engineering with a keen interest in artificial intelligence and deep learning-based systems. His main research focus is to use the latest machine learning and deep learning algorithms to develop automated solutions, especially in the field of pattern recognition and data analytics. He has published various top-ranked impact factor papers in the areas of image processing, medical imaging, recommender systems, stock exchange prediction, and big data analytics. He is also a reviewer for many impact factor journals and a program committee member of various international conferences.

Sadaf Yasmin is currently working as an Assistant Professor at the Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan. She completed her MS and PhD in Computer Science from Capital University of Science and Technology, Pakistan, and her BS in Software Engineering from (APCOMS) NUML Islamabad, Pakistan. She has worked on several research projects during and after her PhD. She is also serving as a reviewer for various reputed journals. Her research interests include network protocol design, computer vision, medical imaging, and pattern recognition.

Saira Gillani received her PhD degree in Information Sciences from Corvinus University of Budapest, Hungary. She joined the COMSATS Institute of Information Technology, Pakistan in 2016. She also served as an assistant professor at Saudi Electronic University, Saudi Arabia. She is currently serving as an associate professor at Bahria University Lahore, Pakistan. Previously, she worked as a research scholar at Corvinno, Technology Transfer Center of Information Technology and Services in Budapest, Hungary and also worked as a research associate at CoReNet (Center of Research in Networks and Telecom), CUST, Pakistan. Her areas of interest include data sciences, text mining, data mining, machine learning, vehicular networks, mobile edge computing, and the Internet of Things.

Maryam Bukhari is pursuing her MS degree at COMSATS University Islamabad, Attock Campus, Pakistan. Her research areas include machine learning and image processing.

Seungmin Rho is currently an associate professor at the Department of Industrial Security at Chung-Ang University, Republic of Korea. His current research interests include databases, big data analysis, music retrieval, multimedia systems, machine learning, and knowledge management, as well as computational intelligence. He has published 300 papers in refereed journals and conference proceedings in these areas. He has been involved in more than 20 conferences and workshops as various chairs and more than 30 conferences/workshops as a program committee member. He has edited a number of international journal special issues as a guest editor, such as multimedia systems, information fusion, and engineering applications of artificial intelligence.

Sang-Soo Yeo received a PhD degree in Computer Science & Engineering from Chung-Ang University, Republic of Korea in 2005. He is a professor at the Department of Computer Engineering, Mokwon University, Republic of Korea. He worked for MOIS (Ministry of Interior and Safety) and for PIPC (Personal Information Protection Commission), Republic of Korea from Feb. 2020 to Jul. 2021. He is President of the Institution of Creative Research Professionals (ICRP), and Vice President of the ICT Platform Society (ICTPS). He is serving as Steering Chair of the PlatCon conference series, a comprehensive conference series on platform technology and services. Dr. Yeo's research interests include security, privacy, personal information protection, ubiquitous computing, multimedia service, embedded systems, and bioinformatics.