RESEARCH ARTICLE
Abstract  Innovations in Internet of Everything (IoE)-enabled systems are driving a change in the settings where we interact in smart units, recognized globally as smart city environments. Intelligent video-surveillance systems are critical to increasing the security of these smart cities. More precisely, in today's world of smart video surveillance, person re-identification (Re-ID) has gained increased consideration by researchers. Various researchers have designed deep learning-based algorithms for person Re-ID because such algorithms have achieved substantial breakthroughs in computer vision problems. In this line of research, we designed an adaptive feature refinement-based deep learning architecture to conduct person Re-ID. In the proposed architecture, the inter-channel and inter-spatial relationships of features between images of the same individual taken from nonidentical camera viewpoints are exploited by learning spatial and channel attention. In addition, a spatial pyramid pooling layer is inserted to extract multiscale, fixed-dimension feature vectors irrespective of the size of the feature maps. The model's effectiveness is validated on the CUHK01 and CUHK02 datasets. When compared with existing approaches, the approach presented in this paper achieves encouraging Rank 1 and Rank 5 scores of 24.6% and 54.8%, respectively.

Keywords  Internet of Everything (IoE), visual surveillance systems, big data, security systems, person re-identification (Re-ID), deep learning

Received January 25, 2022; accepted July 6, 2022
E-mail: smrho@cau.ac.kr; ssyeo@mokwon.ac.kr

1 Introduction
The foundation of smart cities can be traced to the benefits achieved in urban living, which have increased people's standard of living and the efficiency of resource usage [1]. These benefits are possible due to developments in Internet, communication, and information technology [2]. With the advancement of the Internet of Things (IoT), the smart city concept has been realized. Moreover, the emergence of the Internet of Everything (IoE) has been accompanied by the inclusion of individuals in the IoT paradigm, in which an interconnected system is consolidated in the IoE. As a result, the idea of a smart city combined with IoE foundations is invoked to support future potential applications [3]. In addition, the surveillance of these smart cities to limit different types of crime is a crucial requirement today. Hence, intelligent video surveillance is one of the major tools to empower the security of these smart cities.

Furthermore, researchers in the computer vision domain have focused on person re-identification (Re-ID) in recent years, as it is an effective tool for surveillance. The trend toward intelligent video-surveillance systems has risen in the contemporary era due to various security needs, such as the prevention of crimes and forensic examinations. Tracking and interpreting recorded footage is one of the most critical aspects of vision-based smart surveillance systems, yet manual monitoring of recorded videos by a person takes a great deal of time and energy. Person Re-ID is an essential activity in smart vision-based surveillance systems. It refers to the procedure of identifying the same subject over a collection of non-overlapping cameras located at various geographic locations, also known as multicamera surveillance systems [4,5].

Person Re-ID is a complex and demanding problem because videos are captured with non-overlapping cameras in diverse environments. Consequently, the use of fundamental data, such as a person's face, is ineffective for this purpose. Most researchers focus on the apparent features of the person, but these approaches face a great deal of uncertainty in visual appearance due to intra-class and inter-class challenges: the same individual might seem different, and different individuals might appear the same.

There are many reasons for these intra-class and inter-class variances, including changes in human body posture, lighting conditions, scene occlusions, background noise, and camera viewpoints over short and prolonged periods [6,7]. As
Front. Comput. Sci., 2023, 17(4): 174329
a result, the extraction of robust and discriminating features in varied and dynamic environments, and the mapping of features from similar groups to different groups to perform person Re-ID across multiple cameras, is an indispensable challenge.

Many deep learning-based algorithms have been proposed by researchers to address the challenge of person Re-ID and improve the outcomes [8]. Besides this, various research studies have combined deep learning and handcrafted approaches for person Re-ID [9,10]. Moreover, person Re-ID systems are categorized into three major classes [11]. Image-based person Re-ID is one of the classes and is based on image pairs only, each of which has single or multiple shots of an individual.

The second approach is person Re-ID using videos of individuals, which operates by examining various frames that depict the same individual. The last class is image-to-video-based person Re-ID, in which an algorithm searches a video sequence for the image of a person. Person Re-ID is also categorized into two types based on time. Short-term person Re-ID is when individuals appear on Camera 1 and, after a few seconds or minutes, appear on Camera 2. Similarly, if a long period is considered, an individual appears on Camera 1 and, after several days, appears on Camera 1 or 2. For person Re-ID over a short-term period, features based on visual appearance are the most widely used to re-identify the individual. Moreover, in some approaches, both visual and soft biometric-based features are used [12,13].

Considering the importance of person Re-ID in terms of applications and research areas, it has been well studied in various fields, such as metaphysics [14], psychology [4], and logic [4]. In a video-surveillance system, a person Re-ID system works by first presenting the image of an individual of interest as a query image to the system. The system then reveals whether this individual has been recorded at another location or time by another camera. From the standpoint of computer vision, the major and difficult challenge in person Re-ID is how two images that belong to the same individual are correctly matched amid intense visual changes, such as illumination, posture, and viewpoint; this challenge gives the problem significant scientific value. Therefore, due to the significance of person Re-ID in research and application areas, the research community has progressively adopted and designed advanced algorithms for stable and accurate person Re-ID. Some samples of paired images of the same subjects in distinct camera viewpoints are presented in Fig. 1.

In this paper, an adaptive feature refinement-based deep learning model is designed to extract the most relevant and discriminative features from the images to conduct person Re-ID. Traditional deep learning models frequently focus on feature extraction with significant discrimination while omitting promising features. Moreover, extracted fine-grained features may include redundancies that affect the accuracy of person Re-ID systems. When two images of the same person are captured with two different cameras, this mandates the design of a deep learning model that focuses on the most relevant and meaningful information in both images. Thus, more accurate and refined features provide better performance.

Hence, in this research study, an adaptive feature refinement block, also called a convolutional block attention module [15], is incorporated at every level of feature extraction. We aim to identify the most relevant and salient features through this refinement by learning inter-spatial and inter-channel relationships of features and diminishing the features of irrelevant areas of the images. For every extracted feature map, the proposed model makes inferences to learn the most discriminative features between two images of the same subject from different viewpoints. Channel and spatial attention maps exploit these inter-spatial and inter-channel relationships using three aggregation arrangements (i.e., sequential, parallel, and mixed). In addition, a spatial pyramid pooling (SPP) layer is deployed to optimally use the spatial information of the activations of convolutional operations and accumulate feature maps before the last fully
Fig. 1  Samples of paired images of the same person under different camera viewpoints
Muazzam MAQSOOD et al. Efficient deep learning-assisted person re-identification
connected (FC) layers. The major contributions of this article are listed below:

● An end-to-end holistic adaptive feature refinement-based lightweight deep learning model is proposed for effective person Re-ID in smart cities for surveillance purposes.
● The suggested deep learning model uses inter-spatial and inter-channel relationships of features to acquire attention maps, using spatial and channel attention to enhance the representation of relevant regions of interest and lower the extracted feature values of irrelevant locations.
● An SPP layer is incorporated before the final FC layers to fully employ the spatial information from the adaptive refinement of the features and acquire a feature representation with extensive spatial information.

The rest of the paper is partitioned into several sections. Section 2 presents existing work, and Section 3 details the proposed method. Next, Section 4 explains the experimental results, and Section 5 concludes the paper and provides future research directions.

2 Related work
Currently, person Re-ID using images requires additional consideration from researchers. Figure 2 depicts a generic overview of an image-based person Re-ID system. At testing time, a query image from the probe set is selected to be matched against all gallery images. The gallery and probe set images are from two different camera views. The person Re-ID system aims to rank the top N gallery images against the query image, as illustrated in Fig. 2. In addition, these images can be of any image type (i.e., RGB images, depth images, or infrared images).

Li et al. designed a deep learning algorithm to perform subject Re-ID [16]. Their model, DeepReID, introduced a filter pairing-based neural network (NN) consisting of six layers capable of handling misalignment, occlusions, noisy backgrounds, and geometric and photometric transformations. They evaluated the proposed model on the CUHK03 dataset, which comprises about 13,164 images of 1,360 subjects. A cumulative matching characteristic (CMC) curve was then established to analyze the efficiency, reporting a 26.51% Rank 1 recognition rate. Ahmed et al. designed another deep learning-based architecture to improve the performance [17]. Their algorithm used a special type of convolution, tied convolution, to extract the local relations between two input images. Moreover, a patch summary layer was integrated into the network, where neighboring differences were summarized. They evaluated the algorithms on the CUHK01, CUHK03, and VIPeR datasets.

Similarly, Cheng et al. designed another deep learning architecture based on improved feature extraction [18]. In their algorithm, a multichannel part-based framework was added along with an enhanced triplet loss function that draws on both the overall body and the body parts of a person in an image. Separate convolutional layers were designed to extract information from different body parts (e.g., one convolutional layer for the full body). The four datasets used for experimental purposes in their study were CUHK01, VIPeR, PRID2011, and i-LIDS.

The body portion that includes the individual's face and shoulders contributes most to the performance, whereas the lower body portions comprising the individual's legs and feet perform the worst. Some body parts have essential features to distinguish people; thus, Huang et al. proposed a part-based method called DeepDiff to learn distinctive features of individual parts [19]. This method splits the body parts and retrieves their feature representations by employing three NN-based architectures called subnets. Each subnet manages a specific kind of intra-class variation. The CUHK03, CUHK01, and VIPeR datasets were employed to validate the results, achieving 62.4%, 47.9%, and 43.2% Rank 1 scores, respectively. Their model helped accumulate better knowledge and derived distinct features in the area of local
Fig. 2 Pictorial representation of the internal working of the person re-identification system
Fig. 3 Internal architecture of the proposed deep learning model for re-identification of individuals
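To make the convolutional mapping of this section concrete, the following NumPy sketch implements the valid filter-bank convolution that the text formalizes as Eq. (1). The 60 × 160 × 3 input size and the 3 × 3 kernels follow the paper; the function names and the random filters are illustrative only, and, as in standard CNN libraries, the sliding-window product is implemented as cross-correlation.

```python
import numpy as np

def valid_conv2d(image, kernel):
    """Single-channel 'valid' convolution: the kernel slides only over
    positions fully inside the image, so an H x W input and a kh x kw
    kernel yield a (H - kh + 1) x (W - kw + 1) feature map."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def conv_layer(image, kernels):
    """Feature maps h_j as in Eq. (1): each filter's responses are
    linearly merged (summed) over the input channels."""
    maps = []
    for w_k in kernels:                      # one kh x kw kernel per channel
        h_j = sum(valid_conv2d(image[:, :, c], w_k[c])
                  for c in range(image.shape[2]))
        maps.append(h_j)
    return np.stack(maps)

# A 60 x 160 x 3 input I1 (as in the text) and two 3-channel 3 x 3 filters
rng = np.random.default_rng(0)
I1 = rng.standard_normal((60, 160, 3))
filters = rng.standard_normal((2, 3, 3, 3))
h = conv_layer(I1, filters)
print(h.shape)  # (2, 58, 158): two feature maps of size 58 x 158
```

The output height and width shrink by the kernel size minus one, which is exactly why the valid convolution of a 60 × 160 input with a 3 × 3 kernel yields 58 × 158 feature maps.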
recognition [38,39] and segmentation benchmarks [40,41]. A collection of filters {w_k}_{k∈K} convolves over the set of images of a subject {f_i}_{i∈I} taken by two distinct cameras and then generates another collection of images {H_j}_{j∈J}, referred to as feature maps or activation maps. The correspondence of input and output is maintained by a connection table CT (input i, filter k, and output j). The responses of the filters among inputs are linearly merged. This layer is responsible for the following mapping:

h_j(x) = \sum_{i,k \in CT_{i,j,k}} (f_i * w_k)(x).   (1)

In Eq. (1), the symbol * denotes the valid convolutional operation on images. The w_k of a certain layer has dimensions similar to the input and determines, together with the input size, the dimensions of the feature maps h_j. In addition, I_1 and I_2 are the two input images of dimensions 60 × 160 × 3 given as input to the first convolutional layer, on which the operation given in Eq. (1) is performed.

In the deep learning paradigm, convolutional features provide effective interpretations for a range of classification problems. As depicted in Fig. 3, both input images are operated on by the first convolution layer of kernel size 3 × 3. The obtained feature maps are processed by max-pooling layers with window dimensions of 2 × 2. After every level of feature representation by the convolutional and max-pooling layers, the adaptive feature refinement blocks are inserted to refine the features extracted from these layers. These refinement blocks are inserted in sequential, parallel, or mixed arrangements.

3.2 Adaptive feature refinement block
We must identify correlations between the two viewpoints to assess whether the subject in the two images is the same. Thus, to establish a correct match, person Re-ID requires the extraction of discriminative features, which must be done independently of difficulties such as occlusion, viewpoint, or illumination changes. The features extracted by the convolutional and max-pooling layers at every level are input into the adaptive feature refinement block. In this block, the irrelevant features from two different camera views are suppressed, and more attention is focused on discriminative features. This adaptive refinement of features is performed by incorporating the spatial and channel attention layers. Moreover, it is widely known that the attention mechanism plays an essential part in human perception [42]. One of the important characteristics of the human visual system is that humans cannot comprehend an entire visual scene at once [43]. Humans employ a chain of partial
glimpses, choosing and concentrating on essential areas of the scene to better capture the visual structure. Therefore, various researchers have attempted to incorporate these attention mechanisms into their deep learning models to increase their effectiveness.

In this research study, we used an attention mechanism in the proposed architecture that adaptively refines the features extracted from the convolutional layers. The intermediate feature maps F ∈ R^{C×H×W} resulting from the preceding layers are provided as input to the adaptive feature refinement block, which infers a 2D channel attention map M_c ∈ R^{C×1×1} and a 2D spatial attention map M_s ∈ R^{1×H×W}. The mathematical formulation of the adaptive feature refinement block is summarized in Eqs. (2) and (3) [15]:

F' = M_c(F) \otimes F,   (2)
F'' = M_s(F') \otimes F'.   (3)

In Eqs. (2) and (3), the symbol ⊗ represents element-by-element multiplication. With this multiplication, the attention values are broadcast accordingly: channel attention values are replicated along the spatial dimensions, and vice versa. The adaptively refined feature is denoted by F''.

For each attention module, Fig. 4 illustrates the computational process. Channel and spatial attention are generated by exploiting the inter-channel and inter-spatial relationships of features, respectively. The main difference between them is that channel attention maps emphasize "what" is most necessary in the image, whereas spatial attention maps emphasize "where" the informative regions of the image are. Both types of attention assist in learning which information to highlight or suppress, and ultimately the information flow throughout the network is improved. This method also helps learn the more meaningful and discriminative features of images from two non-overlapping camera views for person Re-ID. A detailed description of each attention map is given below.

3.2.1 Inter-channel relationships of features
A channel attention map is employed to exploit the inter-channel relationships of features between two images of the same subject captured from distinct angles. Every feature map channel acts as a feature detector; thus, channel-based attention focuses on what is relevant in a given input image. The spatial dimensions of the input feature map are squeezed to determine the channel attention effectively. Two spatial contexts are extracted, known as the average-pooled and max-pooled operations. These two types of spatial information, F^c_avg and F^c_max, are aggregated. An attention map based on the channel, M_c ∈ R^{C×1×1}, is then obtained by forwarding these two descriptors through a shared model. This shared model consists of a multilayer perceptron with one hidden layer.

Moreover, R^{C/r×1×1} is fixed as the size of the hidden activation to lessen the parameter overhead, where r is the reduction ratio. Following that, element-by-element summation is performed to integrate the two output feature vectors after applying the shared model to each spatial context, F^c_avg and F^c_max. A mathematical description of the channel attention used to draw inter-channel relationships is given below:

M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))).   (4)
In Eq. (4), σ represents the sigmoid function, W_0 ∈ R^{C/r×C}, and W_1 ∈ R^{C×C/r}. The weights W_0 and W_1 of the multilayer perceptron are shared for both inputs, and W_0 is followed by the rectified linear unit activation.

3.2.2 Inter-spatial relationship of features
We generated spatial attention maps using the inter-spatial relationship of features, as depicted in Fig. 4. In contrast to channel attention, spatial attention focuses on an informational component that complements channel attention. The spatial attention maps are generated along the channel axis by applying the average- and max-pooling operations and fusing them to create a productive feature descriptor. Applying pooling along the channel axis is proven to be efficient at emphasizing the informative areas of the images. A convolutional operation is applied on the fused feature descriptor, which encodes where to emphasize and where to inhibit, to compute the spatial attention map M_s ∈ R^{1×H×W}. Two 2D maps are generated by combining the channel information of the feature maps using two pooling operations (i.e., F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W}); these represent the average-pooled and maximum-pooled features over the channel. Later, a 2D attention map is computed by fusing these, followed by applying a standard convolutional layer. A mathematical description of the spatial attention is given below:

M_s(F) = \sigma(f^{7\times7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) = \sigma(f^{7\times7}([F^s_{avg}; F^s_{max}])).   (5)

In Eq. (5), σ represents the sigmoid function, and f^{7×7} symbolizes a convolutional procedure with filter dimensions of 7 × 7.

3.2.3 Aggregating channel and spatial attention maps
For every input image of a person, the two attention maps, channel and spatial, are aggregated to calculate the complementary attention, concentrating on what and where. These two attention maps are computed sequentially or in parallel, where multiplication is performed between the attention maps and the input feature map F ∈ R^{C×H×W} to refine the features adaptively. The aggregation of the spatial and channel attention maps in a sequential arrangement is illustrated in Fig. 4, whereas the parallel arrangement of their aggregation is displayed in Fig. 5. Moreover, a mixed approach is also employed, where both sequential and parallel configurations are used.

3.3 Spatial pyramid pooling layer
The aggregated attention maps resulting from the feature refinement of the last convolutional layer are passed as input to the SPP layer [44]. All features are pooled using this SPP layer, producing fixed-dimension outcomes, which are then sent to the FC layers. Generally, the SPP layer enables the CNN to take inputs of any size, increasing the model's scale invariance, suppressing overfitting, and permitting the extraction of local features from the data at various scales. By employing several distinctly sized pooling operations, this layer ensures a fixed eigenvector output for input of any scale.

This study incorporates an SPP layer into the proposed deep learning model. This layer is deployed between the convolutional and FC layers. In every spatial bin, the response of each filter is pooled. The SPP produces vectors of kM dimensions, with M specifying the number of bins (k denotes the number of filters used in the prior convolutional operation). The FC layer receives these fixed-dimensional vectors as input. One primary insight on employing the SPP layer in the proposed architecture is its significant characteristic of resilience in multilevel pooling toward object deformation [45]. The architecture of the SPP layer is presented in Fig. 6.

4 Experiment and discussion
4.1 Dataset and protocols
This section discusses the results of the designed model under different experimental settings, followed by a discussion and comparisons. To assess the efficacy of the proposed model, we used the CUHK02 dataset, a publicly available benchmark [46]. The CUHK02 dataset comprises 1,816 people. The images of individuals are captured from five pairs of camera viewpoints (10 viewpoints). These pairs are numbered from P1 to P5. More specifically, P1 contains 971 identities, P2 contains 306, P3 contains 107, P4 contains 193, and P5 contains 239. In every pair, there are two images per subject. Pair P1 belongs to the CUHK01 dataset, which is primarily used in existing research studies [47]. In addition, CUHK02 is a larger version of the CUHK01 dataset that includes more identities and angle points.

This database assesses the efficiency when using different camera viewpoints for training and testing sets. We used this dataset in different experimental settings, as listed in Table 1. Furthermore, the assessment metric used to analyze the effectiveness of the proposed model is the CMC curve with a single-gallery-shot setting. This metric finds the probability of
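A compact NumPy sketch may help make the adaptive feature refinement block of Section 3.2 concrete: it implements the channel attention of Eq. (4), the spatial attention of Eq. (5) with a 7 × 7 filter, the sequential aggregation of Eqs. (2) and (3), and a parallel variant. The random weights, the reduction ratio r = 4, and the fusion of the parallel branches by averaging are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """M_c of Eq. (4): squeeze H x W by average- and max-pooling, pass
    both (C,) descriptors through the shared MLP, sum, then sigmoid."""
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)   # ReLU follows W0
    return sigmoid(mlp(F.mean(axis=(1, 2))) + mlp(F.max(axis=(1, 2))))

def spatial_attention(F, kernel):
    """M_s of Eq. (5): pool along the channel axis, stack the two
    1 x H x W maps, and convolve with a 7 x 7 filter (zero padding)."""
    stacked = np.stack([F.mean(axis=0), F.max(axis=0)])   # (2, H, W)
    kh, kw = kernel.shape[1:]
    padded = np.pad(stacked, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = F.shape[1:]
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(padded[:, y:y + kh, x:x + kw] * kernel)
    return sigmoid(out)

def refine_sequential(F, W0, W1, k):
    """Eqs. (2)-(3): channel attention first, spatial attention second."""
    Fp = channel_attention(F, W0, W1)[:, None, None] * F
    return spatial_attention(Fp, k)[None, :, :] * Fp

def refine_parallel(F, W0, W1, k):
    """Parallel arrangement (Fig. 5): both maps attend to the same input
    F; averaging the two attended maps is an assumed fusion rule."""
    Fc = channel_attention(F, W0, W1)[:, None, None] * F
    Fs = spatial_attention(F, k)[None, :, :] * F
    return 0.5 * (Fc + Fs)

rng = np.random.default_rng(0)
C, H, W, r = 8, 10, 10, 4                    # r is the reduction ratio
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C))        # R^{C/r x C}
W1 = rng.standard_normal((C, C // r))        # R^{C x C/r}
k7 = rng.standard_normal((2, 7, 7))          # 7 x 7 filter over 2 maps

print(refine_sequential(F, W0, W1, k7).shape)  # (8, 10, 10)
print(refine_parallel(F, W0, W1, k7).shape)    # (8, 10, 10)
```

Either arrangement leaves the feature map shape unchanged, which is what allows the refinement block to be dropped in after every convolutional level.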
Fig. 5 Parallel arrangement of attention mechanisms in the adaptive feature refinement block
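The fixed-length behavior of the SPP layer (Section 3.3) can be sketched as follows. The pyramid levels (4 × 4, 2 × 2, 1 × 1) and the use of max-pooling per bin are assumptions in the spirit of the original SPP formulation [44]; the paper does not list its bin sizes here.

```python
import numpy as np

def spp(feature_maps, levels=(4, 2, 1)):
    """Spatial pyramid pooling: max-pool each of the k feature maps into
    an n x n grid for every pyramid level n, then concatenate. The
    output length k * M (M = total number of bins) is independent of
    the spatial size H x W of the input."""
    k, H, W = feature_maps.shape
    pooled = []
    for n in levels:
        # Bin edges cover the map even when H, W are not divisible by n
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                patch = feature_maps[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(patch.max(axis=(1, 2)))   # one (k,) vector per bin
    return np.concatenate(pooled)                       # length k * M

rng = np.random.default_rng(2)
M = 4 * 4 + 2 * 2 + 1                       # 21 bins for levels (4, 2, 1)
v1 = spp(rng.standard_normal((32, 15, 40)))
v2 = spp(rng.standard_normal((32, 7, 19)))  # different spatial size
print(v1.shape, v2.shape, 32 * M)  # both vectors have length 672 = 32 * 21
```

Both inputs, despite their different spatial dimensions, yield vectors of the same kM length, which is exactly the property that lets the FC layers follow feature maps of any size.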
Fig. 7  (a) Ranks 1 to 15 for the identification rates of the overall camera view pairs in the series setting of the adaptive feature refinement block, (b) Ranks 1 to 15 for the identification rates of the overall camera view pairs in the parallel setting of the adaptive feature refinement block, and (c) Ranks 1 to 100 for the identification rates over the P1 camera view pairs in both series and parallel settings of the adaptive feature refinement block
channel attention module, and following that, the result of this module is input into the spatial attention map. Furthermore, in Fig. 7(b), the results of all camera views (P1−P5) are presented. In this setup, the inter-channel and inter-spatial relationships are focused on parallel configurations of the input feature maps, as displayed in Fig. 7(b). The results of this type of arrangement in the model are also encouraging. The Rank 1 to 15 scores of P1 range from 12% to 57%. Similarly, the Rank 1 to 15 scores for camera pair P2 are 24% to 93.4%, and the Rank 1 to 15 scores of camera pair P3 are 33% to 100%. Furthermore, the camera pairs P4 and P5 also have encouraging Rank 1 to 15 scores of 13% to 78% and 13% to 73%, respectively. The parallel arrangement for acquiring the inter-channel and inter-spatial relationships works better than the series arrangement.

However, the number of subjects in camera pair P1 is greater than in the remaining pairs. Thus, rank scores from 1 to 100 with gaps of 5 are also plotted for this camera pair view. Figure 7(c) lists the values of Ranks 1 to 100 for both parallel and sequential arrangements of the adaptive feature refinement blocks for camera pair P1. More explicitly, Fig. 7(c) provides the rank scores for both arrangements, ranging from 10% to 94.9% for the sequential case and 12% to 95% for the parallel case. Again, the parallel configuration performs better than the sequential configuration in this case. Furthermore, the training and loss values for each camera pair view are also recorded in both layouts of these inter-channel and inter-spatial layers. Figures 8(a) and (b) provide the accuracy and loss values of all experiments during training. In Figs. 8(a) and (b), the y-axis indicates the values of accuracy and loss, and the x-axis denotes the number of epochs. Each curve provides the accuracy and loss values for each camera view (i.e., P1, P2, P3, P4, and P5) for both the spatial and channel attention settings (i.e., parallel and series). All experiments run for 100 epochs. After 20 epochs, the training accuracy exceeds 80%, and the loss values approach zero. Furthermore, hyperparameters play a critical role in model performance.

All the experiments in Figs. 7(a), (b), and (c) were run with the Adam optimizer using a learning rate of 0.0001, and the same set of experiments was also repeated using the stochastic gradient descent optimizer with a 0.001 learning rate, as displayed in Figs. 9(a), (b), and (c). Figure 9 has six graphs; the x-axis of each graph represents the rank values, whereas the y-axis represents the identification rates. Each curve in a graph represents the rank scores for a camera view (P1, P2, P3, P4, or P5). Furthermore, the model performance is assessed via a mixed configuration of the adaptive feature refinement blocks. In this approach, the adaptive feature refinement block is organized in parallel form in the first two convolutional blocks and in series form after the last convolutional block. With this model configuration, the experiments are run with different hyperparameter settings, such as Adam and stochastic gradient descent, with
Fig. 8 (a, b) Accuracy and loss curves during training for all camera views in both attention mechanism arrangements
0.001 and 0.0001 learning rates, respectively. The Rank 1 to 15 scores in this setup for all camera views are presented in Figs. 9(d), (e), and (f).

4.3 Person re-identification with unknown subject and viewpoint
To assess the precision of an automated real-time surveillance system, person Re-ID should be performed with unknown camera viewpoints and people. In the prior arrangement, the model is trained on known camera viewpoints, with the only difference being that the people are different. For example, in a public area, two different cameras are installed at two locations (viewpoints); if the image of a person enrolled in the system is captured by these two cameras, then the system should be able to identify the subject. Similarly, if an unknown person comes under these two camera viewpoints, the system should be able to recognize the unknown person (i.e., our previous experimental setup). In the present experimental setup, instead of only recognizing an unknown person from different camera viewpoints, the system should also recognize unknown people from two unknown camera viewpoints; hence, it is called person Re-ID with unknown subject and viewpoint.

This research study also investigated the model performance with this realistic experimental setup. As mentioned, the entire CUHK02 dataset comprises 10 camera views divided into pairs P1 to P5. The P1 pair belongs to the CUHK01 dataset, whereas the remaining pair views belong to the CUHK02 dataset. In this scenario, the model was trained on camera pair P1 and tested on the remaining pairs P2 to P5. This setting also presents a cross-dataset scenario.

Therefore, the details of the total number of training and testing subjects and the camera views are listed in Table 3. After the division of data according to this scenario, the positive and negative image pairs were generated using camera view pair P1, and the results were calculated using the remaining views separately. The Rank 1 to 15 scores in this experimental setup for camera view pair P2 range from 27% to 93%, as displayed in Table 2, row 1, with the sequential arrangement of the inter-channel and inter-spatial-based attention mechanisms. Following that, the results for camera pair views P3 to P5 are given in rows 2, 3, and 4. For camera pair view P3, the rank scores range from 19% to 57%. The number of subjects in this pair is considerably smaller; thus, the values of the highest ranks are not reported. Similarly, for camera pairs P4 and P5, the rank scores are 10% to 56% and 13% to 57%, respectively. In addition, the same set of experiments is repeated for the parallel configuration of the adaptive feature refinement blocks. In this case, the scores for Ranks 1, 4, 8, 10, 12, and 15 are listed in rows 6, 7, 8, and 9 of Table 2.

With camera view pair P2, the rank scores from 1 to 15 are 22% to 95%. Similarly, for pair P3, the scores are 23.1% to 52%, and for camera pair P4, the scores are 18% to 64%. Lastly, for camera pair P5, the Rank 1 to 15 scores are 13% to 62%. It seems that a model based on parallel configurations of inter-channel and inter-spatial relationships works better than series configurations. In addition, in this challenging experimental setup, the proposed model learns to perform
Fig. 9  (a, b, c) Ranks 1 to 15 of the identification rates for the overall camera view pairs in series and parallel settings for the adaptive feature refinement block with the stochastic gradient descent optimizer (lr = 0.001). (d, e, f) Ranks 1 to 15 of the identification rates for the overall camera view pairs in mixed configurations for sequential and parallel arrangements of adaptive feature refinement blocks
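The identification rates plotted against ranks in Figs. 7 and 9 are CMC scores. A minimal single-gallery-shot computation of such a curve might look like the sketch below; the toy distance matrix and the convention that query i matches gallery identity i are illustrative assumptions.

```python
import numpy as np

def cmc_curve(dist, max_rank=15):
    """Single-gallery-shot CMC: dist[q, g] is the distance between query
    q and gallery identity g, with query i matching gallery i. CMC(r) is
    the fraction of queries whose true match appears among the r closest
    gallery entries."""
    n = dist.shape[0]
    curve = np.zeros(max_rank)
    for q in range(n):
        order = np.argsort(dist[q])                # gallery sorted by distance
        rank = int(np.where(order == q)[0][0])     # position of the true match
        if rank < max_rank:
            curve[rank:] += 1                      # counts all ranks >= true rank
    return curve / n

# Toy example with 4 identities: queries 0 and 1 match at Rank 1,
# query 2 at Rank 2, and query 3 at Rank 3
dist = np.array([[0.1, 0.9, 0.8, 0.7],
                 [0.9, 0.2, 0.8, 0.7],
                 [0.3, 0.9, 0.5, 0.7],
                 [0.3, 0.4, 0.9, 0.6]], dtype=float)
print(cmc_curve(dist, max_rank=4))  # [0.5, 0.75, 1.0, 1.0]
```

By construction the curve is non-decreasing in the rank, which is why the Rank 15 scores reported in the text are always at least as high as the Rank 1 scores.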
Table 2  Rank-wise identification rates for each camera view pair with sequential and parallel arrangements of the adaptive feature refinement block

No.  Camera view  Rank 1/%  Rank 4/%  Rank 8/%  Rank 10/%  Rank 12/%  Rank 15/%
Sequential arrangement of adaptive feature refinement block
1    P2           27        55        80        86         91         93
2    P3           19        57        −         −          −          −
3    P4           10        40        56        −          −          −
4    P5           13        28        46        57         −          −
Parallel arrangement of adaptive feature refinement block
6    P2           22        59        75        85         88         95
7    P3           23.1      52        −         −          −          −
8    P4           18        43        64        −          −          −
9    P5           13        28        51        62         −          −
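The two arrangements compared in Table 2 differ only in how the channel and spatial attention maps are composed. The numpy sketch below illustrates the idea with parameter-free squeeze-and-gate stand-ins for the paper's learned attention blocks; the function names and the averaging fusion in the parallel branch are assumptions for illustration, not the authors' exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    # x: (C, H, W). Squeeze the spatial dims and gate each channel.
    gate = sigmoid(x.mean(axis=(1, 2)))       # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    # Gate each spatial position by its cross-channel average.
    gate = sigmoid(x.mean(axis=0))            # (H, W)
    return x * gate[None, :, :]

def refine_sequential(x):
    # Series arrangement: spatial attention refines the channel-attended map.
    return spatial_attention(channel_attention(x))

def refine_parallel(x):
    # Parallel arrangement: both branches see the same input; outputs are fused.
    return 0.5 * (channel_attention(x) + spatial_attention(x))

features = np.random.rand(4, 8, 8)            # a toy (C, H, W) feature map
assert refine_sequential(features).shape == features.shape
assert refine_parallel(features).shape == features.shape
```

The behavioral difference is visible even in this sketch: in the series form the spatial gate is computed from already channel-reweighted features, while in the parallel form both gates are computed from the raw input, which is the distinction the rows of Table 2 probe empirically.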
Fig. 10 Top five matches against query image from the gallery set
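The spatial pyramid pooling (SPP) layer used in the proposed architecture produces a fixed-length descriptor regardless of the feature map's spatial size by pooling over grids of several scales [44]. Below is a minimal sketch assuming max pooling and a {1×1, 2×2, 4×4} pyramid; the actual pyramid levels and pooling operator in the proposed model may differ.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Pool a (C, H, W) feature map into a fixed-length vector: each level l
    splits the map into an l x l grid and max-pools every cell, so the output
    length is C * sum(l*l) regardless of H and W."""
    C, H, W = feature_map.shape
    out = []
    for l in levels:
        hs = np.linspace(0, H, l + 1).astype(int)   # row boundaries of the grid
        ws = np.linspace(0, W, l + 1).astype(int)   # column boundaries
        for i in range(l):
            for j in range(l):
                cell = feature_map[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                out.append(cell.max(axis=(1, 2)))   # (C,) max per channel
    return np.concatenate(out)                      # length C * (1 + 4 + 16)

x = np.random.rand(8, 13, 9)     # arbitrary spatial size
print(spp(x).shape)              # (168,) = 8 * 21
```

Because the grid boundaries scale with H and W, images of different resolutions yield descriptors of identical length, which is what lets the network accept multi-scale inputs.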
aspects of the individual's visual appearance, such as clothing conditions, have significantly changed? In such a scenario, the performance of the proposed model might not be as good because substantial changes occur in the features of the images due to the changes in the visual appearance of the subjects.

5 Conclusion
With the innovation of the IoE, the concepts of smart cities, smart homes, and smart environments have emerged. However, the surveillance of these smart cities is one of the major challenges. Hence, this paper provides a deep learning-assisted solution for person Re-ID for surveillance and security purposes in these smart cities. The variability in the appearance of the same person under different camera viewpoints makes person Re-ID a very difficult and demanding task.
To address this task, the proposed deep learning model is based on adaptive feature refinement with SPP layers. The designed approach learns the image features by considering the inter-channel and inter-spatial associations using attention mechanisms and focuses more on the discriminative regions of the images. Moreover, spatial learning is further enhanced by using SPP layers that employ pooling windows of different scales to extract features. Furthermore, we have used several experimental situations in which unknown viewpoints and subjects are considered. The proposed framework achieves encouraging average Rank 1 and Rank 5 scores of 24.6% and 54.8%, respectively. According to the findings, when both views and subjects are varied, the deep learning model has difficulty in re-identifying a person. This is because the features of distinct subjects' images change significantly with a change in the camera view. However, when compared to existing methods, the overall findings in terms of average rank scores are acceptable. In the future, we intend to address the challenge of person Re-ID in long-term scenarios.

Acknowledgements This paper was supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008703, The Competency Development Program for Industry Specialist) and also by the MSIT (Ministry of Science and ICT), Republic of Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2018-0-01799) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

References
1. Neirotti P, De Marco A, Cagliano A C, Mangano G, Scorrano F. Current trends in smart city initiatives: some stylised facts. Cities, 2014, 38: 25–36
2. Vlacheas P, Giaffreda R, Stavroulaki V, Kelaidonis D, Foteinos V, Poulios G, Demestichas P, Somov A, Biswas A R, Moessner K. Enabling smart cities through a cognitive management framework for the internet of things. IEEE Communications Magazine, 2013, 51(6):
DA-net architecture for lung nodule segmentation. Mathematics, 2021, 9(13): 1457
42. Niu Z, Zhong G, Yu H. A review on the attention mechanism of deep learning. Neurocomputing, 2021, 452: 48–62
43. Guo M H, Xu T X, Liu J J, Liu Z N, Jiang P T, Mu T J, Zhang S H, Martin R R, Cheng M M, Hu S M. Attention mechanisms in computer vision: a survey. Computational Visual Media, 2022, 8(3): 331–368
44. He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904–1916
45. Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2006, 2169–2178
46. Li W, Wang X. Locally aligned feature transforms across views. In: Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. 2013, 3594–3601
47. Li W, Zhao R, Wang X. Human reidentification with transferred metric learning. In: Proceedings of the 11th Asian Conference on Computer Vision. 2012, 31–44
48. Köstinger M, Hirzer M, Wohlhart P, Roth P M, Bischof H. Large scale metric learning from equivalence constraints. In: Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012, 2288–2295
49. Zheng L, Shen L, Tian L, Wang S, Wang J, Tian Q. Scalable person re-identification: a benchmark. In: Proceedings of 2015 IEEE International Conference on Computer Vision. 2015, 1116–1124
50. Fan H, Zheng L, Yan C, Yang Y. Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(4): 83
51. Feng G, Liu W, Tao D, Zhou Y. Hessian regularized distance metric learning for people re-identification. Neural Processing Letters, 2019, 50(3): 2087–2100

Muazzam Maqsood is serving as an Assistant Professor at the Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan. He holds a PhD in software engineering with a keen interest in artificial intelligence and deep learning-based systems. His main research focus is to use the latest machine learning and deep learning algorithms to develop automated solutions, especially in the field of pattern recognition and data analytics. He has published various top-ranked impact factor papers in the areas of image processing, medical imaging, recommender systems, stock exchange prediction, and big data analytics. He is also a reviewer for many impact factor journals and a program committee member of various international conferences.

Sadaf Yasmin is currently working as an Assistant Professor at the Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan. She completed her MS and PhD in Computer Science from Capital University of Science and Technology, Pakistan, and her BS in Software Engineering from (APCOMS) NUML Islamabad, Pakistan. She has worked on several research projects during and after her PhD. She is also serving as a reviewer for various reputed journals. Her research interests include network protocol design, computer vision, medical imaging, and pattern recognition.

Saira Gillani received her PhD degree in Information Sciences from Corvinus University of Budapest, Hungary. She joined the COMSATS Institute of Information Technology, Pakistan in 2016. She also served as an assistant professor at Saudi Electronic University, Saudi Arabia. She is currently serving as an associate professor at Bahria University Lahore, Pakistan. Previously, she worked as a research scholar at Corvinno, Technology Transfer Center of Information Technology and Services in Budapest, Hungary and also worked as a research associate at CoReNet (Center of Research in Networks and Telecom), CUST, Pakistan. Her areas of interest include data sciences, text mining, data mining, machine learning, vehicular networks, mobile edge computing, and the Internet of Things.

Maryam Bukhari is pursuing her MS degree at COMSATS University Islamabad, Attock Campus, Pakistan. Her research areas include machine learning and image processing.

Seungmin Rho is currently an associate professor at the Department of Industrial Security at Chung-Ang University, Republic of Korea. His current research interests include databases, big data analysis, music retrieval, multimedia systems, machine learning, and knowledge management, as well as computational intelligence. He has published 300 papers in refereed journals and conference proceedings in these areas. He has been involved in more than 20 conferences and workshops as various chairs and more than 30 conferences/workshops as a program committee member. He has edited a number of international journal special issues as a guest editor, such as multimedia systems, information fusion, and engineering applications of artificial intelligence.

Sang-Soo Yeo received a PhD degree in Computer Science & Engineering from Chung-Ang University, Republic of Korea in 2005. He is a professor at the Department of Computer Engineering, Mokwon University, Republic of Korea. He worked for MOIS (Ministry of Interior and Safety) and for PIPC (Personal Information Protection Commission), Republic of Korea from Feb. 2020 to Jul. 2021. He is President of the Institution of Creative Research Professionals (ICRP), and Vice President of the ICT Platform Society (ICTPS). He is serving as Steering Chair of the PlatCon conference series, a comprehensive conference series on platform technology and services. Dr. Yeo's research interests include security, privacy, personal information protection, ubiquitous computing, multimedia service, embedded systems, and bioinformatics.