You are on page 1of 16

This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number

Real-time Human Tracking using Multi-features


Visual with CNN-LSTM and Q-Learning
Devira Anggi Maharani1, Carmadi Machbub1,2, Pranoto Hidaya Rusmin1, Lenni Yulianti 1
1
School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia
2
Institut Teknologi Sains Bandung, Indonesia

Corresponding author: Pranoto Hidaya Rusmin (e-mail: pranotohidayarusmin@gmail.com).


This work was supported by the School of Electrical Engineering and Informatics, Institut Teknologi Bandung (ITB).

ABSTRACT Various methods are employed in computer vision applications to identify individuals,
including using face recognition as a human visual feature helpful in tracking or searching for a person.
However, tracking systems that rely solely on facial information encounter limitations, particularly when
faced with occlusions, blurred images, or faces oriented away from the camera. Under these conditions, the
system struggles to achieve accurate tracking-based face recognition. Therefore, this research addresses this
issue by fusing descriptions of the face visual with body visual features. When the system cannot find the
target face, the CNN+LSTM hybrid method assists in multi-feature body visual recognition, narrowing the
search space and speeding up the search process. The results indicate that the combination of the
CNN+LSTM method yields higher accuracy, recall, precision, and F1 scores (reaching 89.20%, 87.36%,
91.02%, and 88.43%, respectively) compared to the single CNN method (reaching 88.84%, 74.00%,
67.00%, and 69.00% respectively). However, the combination of these two visual features requires high
computation. Thus, it is necessary to add a tracking system to reduce the computational load and predict the
location. Furthermore, this research utilizes the Q-Learning algorithm to make optimal decisions in
automatically tracking objects in dynamic environments. The system considers factors such as face and
body visual features, object location, and environmental conditions to make the best decisions, aiming to
enhance tracking efficiency and accuracy. Based on the conducted experiments, it is concluded that the
system can adjust its actions in response to environmental changes with better outcomes. It achieves an
accuracy rate of 91.5% and an average of 50 fps in five different videos, as well as a video benchmark
dataset with an accuracy of 84% and an average error of 11.15 pixels. Utilizing the proposed method speeds
up the search process and optimizes tracking decisions, saving time and computational resources.

INDEX TERMS Face and body visual features, CNN, LSTM, Q-learning, real-time

I. INTRODUCTION visual features. Nevertheless, somebody's visual features


Detecting and tracking moving objects in video sequences remained unrecognizable. Therefore, this research aims to
find their application in various fields, including security overcome this limitation by incorporating additional visual
surveillance systems, intelligent robotics, autonomous features of the body. Consequently, when the system fails to
vehicles, and human-computer interaction. In security detect the target face, it can utilize body visual features to
surveillance systems, moving object tracking serves multiple narrow down the search space [6], and partial information is
purposes, such as tracking specific individuals based on their known, as shown in Fig. 1.
faces and visual features. Valuable information, such as a In this research, the moving human detection and
person's identity, can be obtained by analyzing these face tracking system can encounter three failure cases. Firstly,
visual features. the system may fail to find the face's visual features.
In studies [1] and [2], face visual features were used as Secondly, it may be unable to detect the visual features of
tracker initialization. However, using face visual features the body. Thirdly, face and body detection failures may
presents several challenges, such as faces being blocked by occur when the target is fully occluded, partially occluded,
other objects or occluded, blurred images [3], and faces not or blurred.
facing the camera [4]. To overcome these problems, previous In specific video scenarios, the detection and tracking of
studies[2] and [5], attempted a combination of face and body targets can encounter various challenges, as illustrated in

VOLUME XX, 2017 1


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

Fig. 1(a) and Fig. 1(b). In case 1(a), even though part of the fusion of multiple visual features of the body and face
face is visible, facial visual features cannot be found. Thus, ('detect') and the tracking system ('track'). These actions
facial recognition cannot be performed. Conversely, facial compete with each other, and decisions are made based on
visual features are detected in case 1(b), but body visual past experiences, considering factors like the availability of
features cannot be successfully located. A multi-feature visual features, object location, movement speed, and
fusion system that combines facial and body visual features environmental conditions. Using the Q-Learning algorithm,
is required to address these issues and find the desired the system aims to enhance efficiency and accuracy in
target location. tracking the targeted object. The policy referred to is a series
of actions the system decides to achieve the goal of object
tracking efficiently and accurately in specific situations. The
optimal policy for each situation is determined based on the
maximum Q value obtained from the Q-Learning algorithm.
This process is continuously applied in every state to enable
the system to adapt its actions to changes in the environment
or the human being tracked. The proposed human movement
tracking system integrates face and body visual features by
combining CNN+LSTM and using the Q-Learning algorithm
to improve the object tracking accuracy and achieve a real-
time system implementation.
The primary contributions of this paper can be
summarized as follows:
1. The system identifies three failure cases that can occur
in the human movement tracking system, such as the
system's inability to detect face visual features, body
visual features when the target experiences occlusion,
FIGURE 1. (a) Cases of face visual features cannot be found, (b) Cases
of body visual features cannot be found, (c) cases of failure of face and blurry images, and faces facing away from the camera.
body detection when the target is in full occlusion, partial occlusion, A multi-feature visual fusion system, as represented in
and blur.
equation (12), that combines face and body visual
features addresses these issues and enhances tracking
There was a significant computational requirement in accuracy.
combining face and body visual features for human tracking 2. In this research, a tracking system is integrated to
[7], so this research proposed adding a tracking system. The address the high computational requirements of the
proposed tracking system can use the KCF method [8], [9], detection and recognition system. The study employs
or other tracking techniques [10]. Previous research [11] the Q-Learning algorithm for making optimal
suggests that confidence scores can be used to make more decisions in dynamic environments, considering
adaptive decisions in tracking. The addition of a tracking various factors such as face and body visual features,
system reduces the computation load [12] and enables the object location, and environmental conditions to
prediction of the following target location, as illustrated in enhance tracking efficiency and accuracy.
Fig. 1(c). For instance, when a target is successfully detected The remainder of this paper is organized as follows.
through visual multi-feature recognition in frame t, the Section 2 provides materials and methods, a key aspect of
system can predict its location in frame t+1, even if the the proposed method. Section 3 presents a detailed
probability of detecting the target is low or the target cannot description of the proposed method. In Section 4, the
be found due to occlusion and blur. experimental results obtained are analyzed and discussed.
Reinforcement Learning (RL) algorithms have developed Finally, Section 5 summarizes the significant findings of this
rapidly in several fields, such as game theory [13] [14], study.
information theory, simulation-based optimization, control
systems, image processing [15], and statistics [16]. RL II. MATERIALS AND METHOD
algorithm learns optimal policies by interacting with its Face and body detection, the first stage of our process,
environment model-free [17] [18]. Among RL algorithms, involves the computer locating and detecting faces in an
the Q-Learning algorithm stands out with its simple Q image or video. Following face detection, the system moves
function, forming the foundation for many other RL on to face recognition, trying to match the observed faces
algorithms [19]. with known people or previously recorded information. Our
In contrast to previous fusion-based methods, this research next addition is a body detection and recognition system,
introduces a decision framework for automated and which locates and detects more visual characteristics
intelligent tracking. The system employs a Q-Learning associated with the human body. The following is a
algorithm to determine the best course of action between the description of each process.

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

A. FACE DETECTION 𝑗 = input notation


Before tracking, the computer processes the face detection 𝑎 = anchor image
system using the Haar Cascade method. This method 𝑝 = positive image
employs machine learning and serves as the basis for object 𝑛 = negative image
detection applications, particularly face detection. Paul Viola 𝛼 = margin between positive and negative anchors
and Michael Jones [20] founded the machine learning 𝛵 = set of possible triplets in the training process
method, which serves as a fundamental technique for object
detection, especially in face detection. The image training The primary aim of (2) as the objective function is to
process in the Haar Cascade method involves several steps: minimize the distance between the anchor and positive
determining Haar features, using integral images for rapid images while maximizing the distance between the anchor
feature detection, conducting Adaboost machine learning and negative images. The parameter 𝛮 is crucial in achieving
training, and employing a cascading classifier that combines this goal, as it determines the cardinality that results in the
numerous features. minimal loss. The specific loss to be minimized is
In this research, the calculations were supported by represented as (3).
𝑁
utilizing the TensorFlow, SciPy, and OpenCV libraries [21] 𝑝 2 2
[22] [23]. The speed of Haarcascade [24] and MobileNet [25] 𝐿 = ∑ [‖𝑓(𝑥𝑗𝑎 ) − (𝑥𝑗 )‖ + 𝛼 − ‖𝑓(𝑥𝑗𝑎 ) − (𝑥𝑗𝑛 )‖ ] (3)
2 2
𝑖
in the face detection process was evaluated, as discussed in
The objective is to find a suitable triplet loss that fulfils the
the study [26]. Face detection results were then used for
constraints outlined in (3). Triplet loss is defined as per the
recognition, and the displacement of horizontal and vertical
equation provided. Here, 𝑁 represents the number of images
positions was determined by comparing the face detection
within a set, which includes all possible triplet pairs within
results between the current frame and the previous frame. To
the test data. The selection of these triplets aids in achieving
obtain the closest and minimum distance between the
faster convergence since it utilizes L2 distance between
midpoints at time 𝑡 and 𝑡 − 1 several types of distances were
image pairs, effectively measuring the similarity of the two
considered, including the cosine distance (1) where 𝑥̅ and 𝑦̅
images.
represent the average values of the 𝑥 and 𝑦 coordinates to
Face recognition begins with capturing facial images using
calculate the object's position displacement as in [22].
𝑥·𝑦 devices such as PTZ cameras. Subsequently, preprocessing
Cosine Distance = 1 − (1)
‖𝑥‖‖𝑦‖ steps, including resizing (96x96) and alignment, are applied
The Cosine Distance (1) calculates the displacement value to ensure consistent input. Features are then extracted,
of the target object's horizontal and vertical positions. Once generating a distinctive embedding that represents facial
the displacement value is obtained, the subsequent step characteristics. The similarity to database entries is quantified
involves conducting the face recognition process to identify using metrics like Euclidean distance. The training process
the individual present within the previously detected target yields a face embedding of 128 vectors using Triplet loss in
object. FaceNet, achieving training and validation accuracy up to
82.20% and 78.08%, respectively. The training accuracy of
B. FACE RECOGNITION 82.20% indicates strong performance in recognizing and
The face detection system has successfully detected faces distinguishing different faces.
even when a person is wearing a mask. For the recognition Upon completion of training, testing the model with 30
process in this study, the method used is based on previous randomly selected data faces across 3 different classes results
research [26], which uses Triplet loss FaceNet. The loss in a precision of approximately 0.8387. Precision represents
function is added to the Triplet loss process in the FaceNet the ratio of true positive predictions to all positive predictions
Triplet loss method developed by Google as face feature made by the model. The recall, approximately 0.8125,
extraction with the Inception Resnet v1 architecture [27]. indicates the model's ability to identify true positive instances
Triplet loss aims to minimize the distance or dissimilarity among all actual positive instances. The overall accuracy of
between similar faces and bring the similarity values closer the Triplet loss FaceNet model is approximately 0.8706,
together. During training, each input consists of three face indicating its ability to classify correctly across the dataset.
images: two of these images are face images with the same
class, serving as an anchor and positive images, while the C. BODY DETECTION SYSTEM
third image is a face image of a person from a different class, MobileNet is a CNN architecture specifically designed to
functioning as a negative image. The output of this address the need for efficient computing resources, making it
classification model consists of 128 face points, which are suitable for deployment on mobile phones and embedded
represented mathematically by the Triplet loss (2). systems. Researchers from Google [25] introduced
𝑝 2 2 (2)
‖𝑓(𝑥 𝑎 ) − (𝑥 )‖ + 𝛼 < ‖𝑓(𝑥 𝑎 ) − (𝑥 𝑛 )‖
𝑗 𝑗 𝑗 𝑗
MobileNet as a solution for optimizing CNNs for these
2 2
𝑎 𝑝 𝑛
resource-constrained devices. The key distinction between
∀ (𝑓(𝑥𝑗 ), 𝑓(𝑥𝑗 ), 𝑓(𝑥𝑗 )) 𝜖 𝛵 MobileNet and traditional CNN architectures lies in
𝑓(𝑥) is a function of 𝑥 as input depthwise and pointwise convolution. In traditional CNNs,

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

convolution layers typically have filters with a fixed size facilitate face verification. Additionally, [31] investigated
across all input channels. However, MobileNet breaks down pose alignment using a neural network to recognize human
the convolution into two separate operations: depthwise features such as age and image expression. However, these
convolution and pointwise convolution. Depthwise three studies focus on good-quality images, which may not
convolution involves applying a single filter to each input accurately represent real-world scenarios in surveillance
channel individually, and this step is followed by pointwise camera applications. In research [32], it was stated that the
convolution, which uses 1x1 filters to combine the output recognition of human visual features is challenging to
channels from depthwise convolution. This division allows achieve accurate results. For example, when an object is
MobileNet to reduce the number of computations while occluded, it is difficult to identify the features of the related
maintaining reasonable accuracy significantly. MobileNet body parts.
also incorporates Batch Normalization (BN) and Rectified- Other research on the recognition of visual features on
Linear units (ReLU) for both depthwise and pointwise surveillance cameras utilizes the SVM algorithm to recognize
convolutions, further enhancing the efficiency of the gender and use bags to help search for pedestrians [33].
network. Overall, MobileNet's deeply separable convolutions Research [34] provides a dataset for identifying pedestrian
contribute to its lightweight Deep Neural Network (DNN) visual features. However, the handcrafted feature was used in
architecture, making it well-suited for resource-efficient studies [35] and [34], which could not represent images
implementations. effectively on surveillance cameras. In research [36], the
This section will discuss the SSD (Single Shot Detector) Depthwise Separable Convolution method achieved a recall
[28] method for detecting objects. SSD employs a single value of 72.07 and an F1 score of 66.60. In the study [37],
layer to detect objects by associating predicted bounding box multi-visual feature recognition with multi-label focal loss
areas with a collection of default bounding boxes, using was carried out, producing 84.83% mA (mean accuracy),
different scales and ratios for each location in the feature 79.37% accuracy, 87.47% precision, 86.09% recall, and
map. During training, SSD compares objects with the default 86.77% F1-score. Research [38] uses deep visual features
bounding boxes at various ratios. Every default bounding box with CNN for food recognition and achieves better
(bb) with IoU > 0.5 is appropriate for the corresponding recognition accuracy than medium-level features and high-
object. The SSD method uses several layers at multiple level semantic features. In [39] research on face visual
scales that can provide the best results for detected objects. feature recognition using the decoupling method and Graph
This study uses the MobileNet architecture as the feature Convolutional Network (GCN), the method's effectiveness
extraction method in the SSD approach. The overall object was shown with the results of qualitative and quantitative
detection system is illustrated in Fig. 2. evaluations. Research [40] has carried out the recognition of
visual features of pedestrians using the CNN algorithm and
achieved an mA value of 80.56%, accuracy of 78.30%,
precision of 89.49%, recall of 84.36%, and F1 score of
86.85%. Then, [41] performed face-visual feature recognition
with a Deep Multi-Task Learning Approach (DMTL).
From the studies above, the CNN algorithm has proven to
have good performance in the image classification process,
FIGURE 2. Detection system block diagram with SSD [28]
so in this study, a CNN + LSTM hybrid method is proposed
to recognize visual features. In this CNN+LSTM method,
The SSD method utilizes six additional convolution layers
each visual feature is designed not to be related to each other
after passing the image through the MobileNet architecture.
and is classified as an independent component.
Three of these six extra layers can generate six predictions
for each cell. The SSD method can generate 8732
E. LONG SHORT-TERM MEMORY (LSTM)
predictions. The extra layer also produces feature maps of Long short-term memory (LSTM) is an evolution of the
various sizes to detect objects to provide better accuracy of Recurrent Neural Network (RNN) method, which can
objects in an image. overcome vanishing and exploding gradient problems [42].
The LSTM architecture [43] was introduced to address this
D. BODY VISUAL MULTI-FEATURE RECOGNITION
limitation. LSTM can remember a collection of information
Introducing visual multi-features is a crucial area of study
stored for a long time by removing irrelevant information.
in computer vision because it connects low-level features and
LSTM is more efficient in processing, predicting, and
high-level semantics, making it easier for humans to
classifying data based on time series. The LSTM method can
recognize objects. This approach has gained significant
add and combine information because it has various gates.
attention in surveillance camera applications. For instance,
There are four types of gates in the LSTM system: forget
research [29] utilizes probability techniques to explore low-
gates, input gates, input modulation gates, and output gates.
level visual features like 'striped' and 'spotted.' Another study
The four gates have functions and tasks in collecting,
[30] models face attributes using binary classification to

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

classifying, and processing data. The LSTM has an internal cell state 𝑆𝑡−1 ∈ ℝ𝑛×ℎ is retained. If the forget gate always
cell state that stores selection information from the previous has a value of 1 and the input gate always has a value of 0,
unit, as shown in Fig. 3. The Forget gate is used to forget the internal memory cell state 𝐶𝑡−1 will remain constant and
some irrelevant information from a system. The input gate unchanged in each subsequent time step. However, the input
can enter helpful information to support data accuracy and and forget gates allow the model to learn when to maintain
add information that has been previously selected through the these values as constant and when to update them in response
forget gate for one data output. The output gate is the last to new input. In practice, this design addresses the vanishing
gate to produce complete and actual data information and is gradient problem, resulting in a model that is easier to train,
the last gate for information. Finally, the information is especially when dealing with long sequences.
processed through the input gate in the next cell.
F. CONVOLUTIONAL NEURAL NETWORK (CNN)
CNN (Convolutional Neural Network) utilizes convolution
operations to automatically and adaptively learn spatial
features from an image, which differs from traditional pattern
recognition techniques that require human intervention to
extract features. In the architecture of a CNN, the most
significant component is the convolutional layer. It consists
of a set of convolutional filters (known as kernels). The input
image, expressed as an N-dimensional matrix, is convolved
with these filters to produce output feature maps [47].
The hierarchical structure of CNN can recognize patterns
FIGURE 3. LSTM algorithm structure [44]
from the simplest to the most complex in images. The layers
Fig. 3 illustrates the basic structure of an LSTM cell,
in CNN consist of convolutional, pooling, and fully
consisting of control gates for input 𝑋𝑡 and the previous
connected layers. Due to their efficiency in recognizing
short-term state ℎ𝑡−1 . There are ℎ hidden units, a batch size
visual features and ability to reduce data dimensions without
of 𝑛, and 𝑑 input features. According to reference [45] [46],
losing crucial information, CNNs have become the primary
the input is 𝑋𝑡 ∈ ℝ𝑛×𝑑 , and the hidden state from the
choice in various image recognition applications, including
previous time step is ℎ𝑡−1 ∈ ℝ𝑛×ℎ . The gates at time step 𝑡
face recognition, object detection, and medical image
are defined as follows: input gate 𝑖𝑡 ∈ ℝn×h , forget gate 𝑓𝑡 ∈
analysis. The CNN algorithm employs filters for feature
ℝn×h , output gate 𝑜𝑡 ∈ ℝn×h are computed as follows:
extraction in images with the formula [48], and the proposed
CNN algorithm is shown in Fig. 4.
Input gate:
𝑖𝑡 = 𝜎 (𝑋𝑡 . 𝑊𝑥𝑖 + ℎ𝑡−1 . 𝑊ℎ𝑖 + 𝑆𝑡−1 ⨀𝑊𝑐𝑖 + 𝑏𝑖 ) (4)
Forget gate:
𝑓𝑡 = 𝜎(𝑋𝑡 . 𝑊𝑥𝑓 + ℎ𝑡−1 . 𝑊ℎ𝑓 + 𝑆𝑡−1 ⨀𝑊𝑐𝑓 + 𝑏𝑓 ) (5)
New Candidate:
𝑆̃𝑡 = tanh (𝑋𝑡 . 𝑊𝑥𝑐 + ℎ𝑡−1 . 𝑊ℎ𝑐 + 𝑏𝑐 ) (6)
Cell state:
𝑆𝑡 = 𝑓𝑡 ⨀𝑆𝑡−1 + 𝑖𝑡 ⨀𝑆̃𝑡 (7) FIGURE 4. Proposed CNN architecture
Output gate:
𝑜𝑡 = 𝜎(𝑋𝑡 . 𝑊𝑥𝑜 + ℎ𝑡−1 . 𝑊ℎ𝑜 + 𝐶𝑡 ⨀𝑊𝑐𝑜 + 𝑏𝑜 ) (8) The CNN algorithm has a filter that is used for feature
Hidden state: extraction in images with a formula:

ℎ𝑡 = 𝑜𝑡 ⨀tanh (𝐶𝑡 ) (9) (10)
(𝑥 ∗ 𝑤)[𝑛] = ∑ 𝑥[𝑎]𝑤[𝑛 − 𝑎]
Where 𝑊𝑥𝑖 , 𝑊𝑥𝑓 , 𝑊𝑥𝑜 ∈ ℝ𝑑×ℎ dan 𝑊ℎ𝑖 , 𝑊ℎ𝑓 , 𝑊ℎ𝑜 ∈ ℝℎ×ℎ 𝑎=−∞
are weight parameters and 𝑏𝑖 , 𝑏𝑓 , 𝑏𝑜 ∈ ℝ1×ℎ are bias The calculation of the formula above involves changing
parameters. Furthermore, for the memory cell, the input node the discrete time index 𝑛 to 𝑎 in the signals 𝑥[𝑛] and 𝑤[𝑛],
𝑆̃𝑡 ∈ ℝ𝑛×ℎ is known. Its computation is similar to the input, resulting in a discrete-time function 𝑎. The CNN works
forget, and output gates, but it uses the tanh function as its similarly to MLP, but each neuron is represented in two
activation function, with a range of values between (-1,1). dimensions in CNN. The MLP, where each neuron is only
Parameter 𝑊𝑥𝑠 ∈ ℝ𝑑×ℎ and 𝑊ℎ𝑠 ∈ ℝℎ×ℎ represent weight one dimension in size. The Convolution Layer performs the
parameters, while 𝑏𝑠 ∈ ℝ1×ℎ represents the bias parameter. convolution operation on the output from the previous layer.
The ⨀ symbol represents the Hadamard product or the This layer is the primary process that underlies CNN.
elementwise product. In LSTM, the input gate 𝑖𝑡 controls Convolution is a mathematical term that means repeatedly
how much new data is considered through 𝑆̃𝑡 , while the applying one function to the output of another function. In
forget gate 𝑓𝑡 determines how much of the previous internal image processing, convolution means using a kernel to the

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

image at all possible offsets. The kernel moves from the top 𝑄(𝑠𝑡 , 𝑎𝑡 ) : value to update
left corner to the bottom right. The purpose of convolution on 𝛼 : learning rate with a range of 0-1
image data is to extract features from the input image. 𝑟𝑡+1 : rewards
Convolution will produce a linear transformation of the input 𝜆 : discount factor
data according to the spatial information in the data. The Max 𝑄(𝑠𝑡+1 , 𝑎) : estimated reward from the next action
weights at that layer specify the convolution kernel used so The use of the Q-Learning algorithm is extensive and
that the convolution kernel can be trained based on input to varied. This algorithm has been used in various fields and
the CNN. To evaluate the performance of the proposed applications. Research [45] used the Deep RL algorithm to
algorithm, we use accuracy [49], Precision, Recall, and F1- detect and predict object movements. Each object is used as
Score obtained from the confusion matrix. an agent to find the target location through a designed
decision network that displays the identity number for each
G. KCF (KERNELIZED CORRELATION FILTER) object. Research [54] used the Deep Recurrent RL
TRACKER [8] algorithm with the LSTM algorithm in the Q-network to
Due to its cyclic shift technique as well as simple analyze the possibility of failure in tracking with 6.3 fps.
concepts, the KCF tracker is nominated as a fast tracker in Research [55] proposes feature selection using the actor-
the performance category [49]. KCF was initially set up by critic RL method to select representative skeleton features
to increase the accuracy of human activity recognition,
[8] as a framework for a correlation filter and a
reaching 85.1%. However, the learning system used still
conventional way of discriminating. This collection of
requires high computational capacity.
techniques gains filtering skills from a set of training
From several studies on the Q-Learning algorithm,
samples. The cyclic shift technique makes high frame rates
agents can choose the action that gives the best results
possible, which is used to construct the KCF sample [51]. based on the processing and analysis of the observed visual
The two fundamental KCF procedures are training and features to improve tracking accuracy. The Q-Learning
detection. Throughout the training phase, the target, in this algorithm uses a table or function of Q values to represent
instance, is a binary classifier. The conventional tracking the estimated value of taking an action in a state. In the case
approach attempts to isolate a collection of objects and of multiple visual features, the Q value can be used to
address linear regression problems. Linear regression seeks estimate the reward value obtained from specific actions
to characterize the relationship to obtain the data. based on the observed visual features. It is intended that
The KCF tracker extracts object image patches using agents can always track targets accurately, and this process
linear or nonlinear (filter) tracking. The KCF method utilizes is applied continuously in each state.
ridge regression to obtain the solution 𝑤 from the function The Q-Learning algorithm comprises several
𝑓(𝑧) = 𝑤 𝑇 𝑧, allowing it to minimize the squared error components: agent, environment, reward, state, and action.
between samples 𝑥𝑖 and their corresponding targets 𝑦𝑖 The implementation of the Q-Learning algorithm utilizes
[8][51]. the Markov Decision Process (MDP) in a combination
The result of the detection process is in the form of target algorithm for tracking human movement with
location coordinates [52] using a training sample set 𝑠 𝜖 𝑆, 𝑎𝑐𝑡𝑖𝑜𝑛 𝜖 𝐴, transition state function 𝑠 ′ = 𝑓(𝑠, 𝑎) and
{(𝑋1 , 𝑦1 ), (𝑋2 , 𝑦2 ) … (𝑋𝑛 , 𝑦𝑛 )}. In the tracking process, reward function 𝑟(𝑠, 𝑎).
especially during the update phase that occurs within the • Agent
detection process, 𝑓(𝑧) represents the score generated for all Agents perform actions on state transitions from one state
cyclic shifts of the test image patch. The detected target to another. In this research, it is state ‘lost’ to state ‘tracked’
location is above the test image patch represented by the to track the target. The agent is responsible for learning the
maximum score 𝑓(𝑧), and then the bb is updated. Next, a optimal action decision based on the updated 𝑄 value.
new model is trained at the new position. To provide memory Agents interact with the environment, receive information
in tracking, alpha (𝛼) and 𝑥 are updated from the current and about the state, and perform actions.
previous states. • State
There are two states, ‘lost’ and ‘tracked’ in frames.𝐹𝑡−1
H. Q-LEARNING and 𝐹𝑡 Based on accuracy values that can influence agent
Q-Learning is a model-free RL algorithm proposed by action decisions.
Watkins in 1989 [53]. Q-Learning is an algorithm commonly • Action
used because it is simple and converges faster. The value of 𝑎𝑡 𝜖 [𝑑𝑒𝑡𝑒𝑐𝑡, 𝑡𝑟𝑎𝑐𝑘]
𝑄 describes the estimated optimal value of taking action in a Action is an action the agent takes in response to the
certain situation with the formula: current state. The ‘track’ action is an action for updating the
model and predicting the target location by the tracker.
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼[𝑟𝑡+1 + 𝜆𝑚𝑎𝑥
𝑎 𝑄(𝑠𝑡+1 , 𝑎) (11) Meanwhile, the action of ‘detect’ occurs when the current
− 𝑄(𝑠𝑡 , 𝑎𝑡 )] observation results are inadequate, or there is a possibility of
the target being lost, so re-detection and recognition are

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

FIGURE 5. Overall system block diagram with Q-Learning algorithm

needed to obtain new visual features of the target. The visual calculated by multiplying the visual feature recognition
features used include face and body features combined to accuracy with the object movement accuracy (assuming a
correctly identify the target using (12). constant object movement speed). Next, the product is
• Environment multiplied by the score of each face and body detection
The environment is the world in which the agent operates. result. The score represents the probability of an image being
The environment provides the agent with information about detected as a face or body and is typically given as a
the current state and receives actions from the agent. The percentage. This score is derived from the mAP value at a
environment provides feedback in the form of rewards to specific IoU limit, which can vary. When the
agents based on the quality of the actions taken by agents. 𝐴𝑐𝑐𝑓𝑒𝑎𝑡𝑢𝑟𝑒 𝑓𝑢𝑠𝑖𝑜𝑛 value exceeds 50% indicates that the
This study's environment includes visual information from detected object is the desired target to be tracked. This
resources such as cameras or videos. In the object tracking 𝐴𝑐𝑐𝑓𝑒𝑎𝑡𝑢𝑟𝑒 𝑓𝑢𝑠𝑖𝑜𝑛 result is then used to determine the reward
task, the agent can perform the fusion analysis of the detected value in the Q-Learning algorithm.
object's face and body visual features to identify and track the
𝐶𝑠𝑓 (𝐴𝑐𝑐𝐹 + 𝐴𝑐𝑐𝐷 ) + 𝐶𝑠𝑏 (𝐴𝑐𝑐𝐴 + 𝐴𝑐𝑐𝐷 ) (12)
target being observed. Based on this information, the agent 𝐴𝑐𝑐𝑓𝑒𝑎𝑡𝑢𝑟𝑒 𝑓𝑢𝑠𝑖𝑜𝑛 =
2(𝐶𝑠𝑓 + 𝐶𝑠𝑏 )
can select the action that is most likely to maintain tracking
of objects with high accuracy.
Where:
In the Q-Learning process, the agent's next state is
𝐶𝑠𝑓 = score of face detection
observed, and it receives an instant reward based on the
𝐶𝑠𝑏 = score of body detection
policy defined. This instant reward is then added to the total
𝐴𝑐𝑐𝐴 = Visual multi-features recognition accuracy
cumulative reward. Using the reward results, the agent
𝐴𝑐𝑐𝐷𝑥 = Target horizontal displacement accuracy 𝑥𝑡 - 𝑥𝑡−1
decides on the next action. The agent's current state is
𝐴𝑐𝑐𝐷𝑦 = target vertical displacement accuracy 𝑥𝑡 - 𝑥𝑡−1
updated to the observed next, and the Q-value is updated
𝐴𝑐𝑐𝐹 = Face recognition accuracy
accordingly. The entire process is repeated iteratively until a
𝐴𝑐𝑐𝑇 = Tracking accuracy
specific iteration limit is reached. In each iteration, the agent
observes the environment, decides on the best action to take, III. OUR APPROACH
updates its state based on the action taken, and updates the Q- Fig. 5 illustrates the research design aimed at
table using the given formula. The Q-table stores the implementing adaptive automated tracking decision-making.
estimated rewards for different state-action pairs, and its This study combines face and body visual features with a
continuous updates enable the agent to learn and improve its tracking system to enhance object tracking accuracy and
decision-making capabilities over time. This iterative achieve real-time performance. The picture shows several
learning process allows the agent to progressively enhance its interacting components. Firstly, in the initial frame (𝑡 frame),
performance and make better decisions in achieving its goals. a visual feature extraction process takes place. It is purpose is
The process begins with extracting face and body visual to retrieve visual features of the face and body from the
features from image or video data. Subsequently, these detected object. These features are then combined using (12).
features are combined through the fusion method, Secondly, the decision module uses Q-Learning. The Q-
represented by (12). The visual feature fusion accuracy is Learning process starts by defining the state based on the

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

accuracy of the results obtained. Then, an action is correct, and the accuracy of face and body recognition will
determined from the decision module to establish the next be higher.
step. When a new face appears in a scene, the first step taken is
When triggered, it activates detection and recognition. The to detect the presence of that face. The newly detected face is
agent will re-detect the object to obtain target information. then compared to an existing facial database. The FaceNet
One of the actions available is the 'detect' action. Meanwhile, model processes the facial image to generate a facial
the ‘track’ action involves updating the tracking model and embedding representation. The new facial embedding is
predicting the target's location using current information. compared to existing embeddings within the database using
After executing the 'detect' action, the decision results are distance metrics. If the face is not recognized, indicating its
utilized to calculate the accuracy of feature fusion or the similarity falls below the predetermined threshold, the system
accuracy of the tracking system. This accuracy assessment is can add the new facial image to the database and assign a
crucial in determining the reward value in the Q-Learning new identity label. However, the system will successfully
process. The following is a detailed explanation of each identify the individual if the face matches one in the
component. database.
Fig. 6 explains an example of a process illustration of the Combining multiple visual features with a person's identity
system built with Q-Learning, which aims to determine is achievable through face and body detection. The results of
‘detect’ or ‘track’ actions. The implementation process of Q- these detection processes offer valuable insights into several
Learning starts by defining the state based on the accuracy of visual characteristics that a person possesses, including:
recognizing multiple visual features, such as faces, bodies, or a) Name
tracking systems. In the initial stages, a reward system and b) 26 visual body features include: Female,
policies for agents were designed. AgeOver60, Age18-60, AgeLess18, Front, Side,
Back, Hat, Glasses, HandBag, ShoulderBag,
A. DESIGN OF FACE AND BODY VISUAL FEATURE Backpack, HoldObjectsInFront, ShortSleeve,
SYSTEM LongSleeve, UpperStride, UpperLogo, UpperPlaid,
The face and body detection process is performed using UpperSplice, LowerStripe, LowerPattern,
the Haar cascade and MobileNet SSD methods to get the LongCoat, Trousers, Shorts, Skirt&Dress, boots
score from face and body detection. Once a face is detected, from the dataset [56].
the identified face area is measured by its height, width, and The outcomes are classified using a multi-label
position relative to the image frame. The score detection is classification approach during the detection process. The
calculated based on the extent to which the results of these computer system could identify various visual features that
measurements match the appearance of the face and body. differentiate the detected objects. Several methods have
The higher the score is, the more likely the detection is been developed for recognizing visual features in humans.

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

In this study, the network design to be trained is based on B. VISUAL MULTI-FEATURE FUSION ACTION-DECISION
RGB features with an image size of 100x100 pixels. Fig. 7 DESIGN AND TRACKING SYSTEM WITH Q-LEARNING
illustrates the architecture, which comprises six convolution When a new face appears in a scene compared to an
layers. Each convolution layer is segmented into 3x3 existing facial database, the FaceNet model processes the
convolutions with 32, 64, 256, and 512 filters, respectively. facial image to generate a facial embedding representation.
These are followed by BN (Batch Normalization), ReLU The new facial embedding is compared to existing
non-linearity, and max-pooling layers. embeddings within the database using metrics such as
Fifty epochs were utilized to train the network, and Euclidean distance. If the face is not recognized, indicating
dropout layers were incorporated to prevent overfitting. A its similarity falls below the predetermined threshold, the
batch size of 100 was employed during the training process, system can add the new facial image to the database and
and tests were conducted using standard datasets, including assign a new identity label.
the PA-100K dataset [56]. This dataset contains 100,000 The rewards are contingent on the state and actions
images captured from real-time outdoor surveillance performed by the agent. When the agent is "tracked" and
cameras, with 26 binary visual features. It encompasses chooses the ‘detect’ action, it receives a reward of -1.
various pedestrian views, has many instances, and is the most Conversely, if the agent selects the ‘track’ action in the same
up-to-date dataset. Compared to previous collections, the PA- state, it gets a reward of +1. On the other hand, when the
100K dataset provides more informative data for pedestrian agent is in the "Lost" state and takes the ‘detect’ action, it
analysis. When combining the CNN and LSTM algorithms, receives a reward of +1. However, if the agent chooses the
the network architecture, as depicted in Fig. 7, was employed ‘track’ action in that state, the reward is -1. These reward
to process the dataset and achieve the desired recognition values aim to provide feedback to the agents, guiding them
results. toward making optimal decisions during the tracking process.
Fig. 7 illustrates the integration of the LSTM method into • Policy
the CNN architecture. The architecture consists of six 1. The rewards for the “Track” action:
convolution layers, four pooling layers, one fully connected 𝐶𝑠𝑓 (𝐴𝑐𝑐𝐹 +𝐴𝑐𝑐𝐷 )+𝐶𝑠𝑏 (𝐴𝑐𝑐𝐴 +𝐴𝑐𝑐𝐷 )
𝑟(𝑠, 𝑎) = 1 𝑖𝑓 > 50% or
layer, one LSTM layer, and one output layer with a binary 2(𝐶𝑠𝑓 +𝐶𝑠𝑏 )

cross-entropy function. Each convolution layer uses a 3×3 𝐴𝑐𝑐𝑇 > 50%
𝐶𝑠𝑓 (𝐴𝑐𝑐𝐹 +𝐴𝑐𝑐𝐷 )+𝐶𝑠𝑏 (𝐴𝑐𝑐𝐴 +𝐴𝑐𝑐𝐷 )
kernel to extract features and is activated by the ReLU 𝑟(𝑠, 𝑎) = −1 𝑖𝑓 < 50% or
2(𝐶𝑠𝑓 +𝐶𝑠𝑏 )
function. Subsequently, a 2×2 max pooling layer is applied to
𝐴𝑐𝑐𝑇 < 50% (lost target)
reduce the image dimensions. As the data progresses through
2. The reward for the “Detect” action:
the convolution block, the output size is (none, 4, 4, 512).
𝑟(𝑠, 𝑎) = 1 𝑖𝑓 𝐴𝑐𝑐𝑇 < 50%
𝑟(𝑠, 𝑎) = −1 𝑖𝑓 𝐴𝑐𝑐𝑇 > 50% (lost target)
The score describes the extent to which the algorithm
detects the image correctly and is expressed as a percentage.
The score is taken from the mean average precision at IoU
with different thresholds. The ‘detect’ action involves
reactivating the detection and recognition process to re-detect
the observed object. Meanwhile, the ‘track’ action consists of
updating the tracking model and predicting the target's
location based on current information. This illustration
visually represents how the two actions are performed in the
FIGURE 7. CNN+LSTM Architecture
By utilizing the reshape method, the input size of the dataset. In the ‘track’ action, the tracking system will display
LSTM layer can be changed to (16×512), next, through the the accuracy value taken from the highest response peak
fully connected layer followed by a dropout layer with a value.
dropout rate of 5% (to prevent overfitting) with the sigmoid
C. REAL-TIME CONTROL SYSTEM DESIGN
activation function. The system training is conducted for 50
All the software implementations were conducted on a
epochs with a batch size 100. During this training process,
Windows platform, utilizing Keras with TensorFlow v1.15 as
the visual features extracted from the face and body of the
the backend [21] [56], SciPy, and OpenCV libraries [21],
observed object are used. These visual features capture the
[22] [23] [58]. The hardware configuration consisted of an
unique characteristics and attributes of the object, and they
AMD Ryzen 5 3500X 6-Core Processor running on a 64-bit
play a crucial role in object tracking. Once the visual features
operating system. A single Nvidia GeForce GTX 1650 also
of the face and body are extracted, a feature fusion step is
served as the graphical processing unit (GPU).
performed. This feature fusion combines the extracted visual
The real-time system uses a multi-threading technique
features using (12). Feature fusion aims to produce a more
[59], enabling it to perform multiple tasks simultaneously.
comprehensive representation of the detected object, thus
These tasks include face recognition, body recognition, pan
facilitating the identification and tracking of targets.

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

tilt movement control, and object tracking. Each task can be data. Recall with the CNN algorithm was 67.00%, while with
executed independently in separate threads by implementing CNN+LSTM, this value increased to 87.36%. We employ
multi-threading, allowing the system to process these tasks the VGG16 model [62], published by Simonyan and others,
simultaneously. This approach significantly improves data augmented with LSTM algorithms for the classification
processing efficiency, especially in applications involving process. The VGG16+LSTM method performs similarly in
streaming images per second. accuracy (83.35%) but achieves a better balance with a
After the image processing section is tested and performs higher precision of 88.46% and a slightly higher F1-score of
well, the center point of the 𝑥, 𝑦 bounding box will be sent to 80.72%. Furthermore, we replace the CNN architecture with
the PTZ (Pan Tilt and Zoom) camera to ensure that the object InceptionV3 [63], as introduced by Szegedy and others, and
is always in the center of the frame. In this study, the PID include LSTM for the classification process. The results
(proportional integral derivative) [60] algorithm is designed indicate that the InceptionV3+LSTM model achieved an
as a closed-loop system controller. This tracking system is accuracy of 87.12%, with a high precision rate of 91.20%, a
tested with a real-time system on a static face. The object's recall (sensitivity) of 77.51%, and an F1-score of 80.92%.
unique features must be tracked continuously by adjusting The recall, precision, F1-score, and accuracy values achieved
the pan and tilt of the camera to keep an object in the center by CNN+LSTM were 87.36%, 91.02%, 88.43%, and
of the frame. The output data in the midpoint of the 𝑥, 𝑦 89.20%, respectively. The CNN + LSTM method is the most
coordinates are used to control the pan and tilt camera. These suitable choice for this classification task and ensures robust
coordinates will be calculated and given continuous performance in correctly identifying positive instances while
feedback, then compared with the set point, resulting in an minimizing false positives.
error value during iteration. Based on this error value, the
control algorithm will generate control signals for the camera
pan and tilt so that the target remains in the center of the
frame. The desired position must be met accurately and
quickly so that the resulting error value is near zero. PID is
used as a control algorithm on PTZ cameras.
This controller is able to increase the accuracy of the
system switching characteristics and determine the signal
output to be given to the PTZ camera motor, which functions
FIGURE 8. Mean accuracy for each class (Dataset PA100K)
as an actuator. The sensor will detect the motor position on
Fig. 8 shows the mean accuracy of the model with
the PTZ camera, which is then given feedback as a control
different methods. The lowest accuracy is found in female
signal input to reduce motor control errors.
recognition using the CNN+LSTM algorithm with a value of
IV. RESULTS AND DISCUSSION
74.00%, while the highest mean accuracy is located in the
Lower Stripe and Boots classes.
A. RESULTS OF THE MULTI-FEATURE VISUAL Using the CNN+LSTM method for visual feature
RECOGNITION SYSTEM WITH CNN+LSTM
recognition, the value is close to 1, indicating that the
Before performing face and body recognition, the
developed model is getting better at classification. The main
computer processes the detection system. In this study, body
objective of this study is to achieve good results in multi-
detection utilized the MobileNet SSD method, followed by
feature recognition. The results show that the proposed
the human face and body recognition systems. This visual
CNN+LSTM architecture is better than the single CNN
multi-feature recognition sub-chapter evaluates the
architecture.
recognition system's performance using four criteria:
precision, recall, F1 score, and accuracy. These four indices,
namely mA, P, R, and F1, were employed for a
comprehensive evaluation, as presented in Table 1.

TABLE 1 THE EVALUATION OF THE RECOGNITION MODEL


Methods Accuracy Precision Recall F1-
(sensitivity) score
DAFL[61] 83.54% - - 88.09%
VGG16 + LSTM 83.35% 88.46% 75.41% 80.72%
InceptionV3+LSTM 87.12% 91.20% 77.51% 80.92%
CNN 88.84% 74.00% 67.00% 69.00%
CNN+LSTM 89.20% 91.02% 87.36% 88.43%
FIGURE 9. Classification prediction results by CNN+LSTM on the PA-
The recall metric shows how many recognition results are 100K dataset
classified as positive by the model from all positive class

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3355785

Author Name: Preparation of Papers for IEEE Access (February 2017)

TABLE 2 PERFORMANCE METRICS FOR RECOGNITION AND TRACKING SYSTEM IN VARIOUS CONFIGURATIONS
Face and Body Face recognition Body recognition Face, body
recognition and tracker and tracker recognition, and
Face recognition Body recognition
tracker with Q-
Learning
P R F1 P R F1 P R F1 P R F1 P R F1 P R F1
Vid1 0.95 0.24 0.39 0.55 0.4 0.47 0.71 0.36 0.48 0.86 0.92 0.89 0.83 0.39 0.53 1,00 1,00 1,00
Vid2 0.36 0.51 0.42 0.32 0.28 0.3 0.63 0.43 0.51 0.76 0.82 0.79 0.72 0.5 0.59 0.98 0.97 0.97
Vid3 0.69 0.84 0.76 0.21 0.19 0.2 0.84 0.77 0.8 0.95 0.97 0.96 0.4 0.29 0.34 0.96 0.97 0.97
Vid4 0.49 0.67 0.57 0.29 0.19 0.23 0.64 0.54 0.59 0.68 0.68 0.68 0.83 0.69 0.75 0.78 0.8 0.79
Vid5 0.55 0.57 0.56 0.42 0.55 0.47 0.71 0.67 0.69 0.63 0.7 0.66 0.68 0.79 0.73 0.83 0.86 0.85

It is shown in Fig. 9 that the testing image successfully


predicts five classes due to the use of the sigmoid function on
the network so that the model can predict one image that can
be categorized into one or more classes. The correct
prediction is marked with a blue highlight.
B. ACTION DECISION RESULTS IN THE FUSION OF
VISUAL MULTI-FEATURES AND TRACKING SYSTEMS
WITH Q-LEARNING
The decision-making process for detection and tracking
involves randomly selecting an action with the highest Q
value. When the agent moves to the next state, this Q value is
updated according to the chosen policy.

FIGURE 11. The average accuracy of decision tracking and visual


feature fusion with Q-Learning with an adaptive weighting

In this research, a comparison has been conducted on the


usage of Face recognition (Triplet Facenet), Body
recognition (CNN+LSTM), Face and Body recognition, Face
recognition and Tracker, Body recognition and tracker, and
Face, Body recognition, and tracker with Q-learning. Table 2
shows that the face recognition and tracker system performs
better than body recognition and tracker, particularly when
FIGURE 10. 𝑸 Value Result facial features are apparent and the facial position is
relatively stable. Nevertheless, body recognition and tracking
The learning performance using the Q-Learning can help increasing the precision value in videos 4 and 5,
algorithm is shown in Fig. 10, where the system reaches especially in situations where the object's body movements
convergence at the 36th time step, which means that the Q- are consistent, and the face is not oriented towards the
Learning algorithm has effectively learned the optimal camera.
policy for decision-making in the given environment. When The system consistently improves its performance by
the system decides to do a 'detect' or ‘track’ action and is in incorporating face and body recognition and tracking
the lost and tracked state, the agent will receive an functionalities. It demonstrates good precision, recall, and
Then, Fig. 11 indicates that the system has achieved an accuracy rate of up to 91.5% across the five videos. This level of accuracy suggests that the system is highly effective at tracking specific individuals by combining multiple visual features with a tracker employing Q-Learning.

FIGURE 11. The average accuracy of decision tracking and visual feature fusion with Q-Learning using adaptive weighting

In this research, a comparison has been conducted between Face recognition (Triplet FaceNet), Body recognition (CNN+LSTM), Face and Body recognition, Face recognition and tracker, Body recognition and tracker, and Face and Body recognition and tracker with Q-Learning. Table 2 shows that the face recognition and tracker system performs better than body recognition and tracker, particularly when facial features are clearly visible and the facial position is relatively stable. Nevertheless, body recognition and tracking help increase the precision value in videos 4 and 5, especially in situations where the object's body movements are consistent and the face is not oriented towards the camera.

The system consistently improves its performance by incorporating face and body recognition together with tracking. It demonstrates good precision, recall, and F1-score results compared to alternative methods that focus solely on face or body recognition. The improvement in object identification accuracy and tracking precision is achieved by applying tracking techniques, with the tracker reducing the number of False Negatives (FN). This mechanism ensures that objects are tracked continuously and accurately across video frames.

The system's ability to adapt to changes in object positions and to utilize historical information is essential for maintaining effective tracking. By integrating Q-Learning into the decision-making process for object tracking and considering factors such as object position, confidence scores, and accuracy scores, the system can intelligently decide when


and how to use face and body recognition with the tracking system. This integration not only enhances efficiency but also elevates real-time object tracking accuracy.

After conducting tests using personal datasets, the performance of the proposed method was evaluated on benchmark test datasets [63].
TABLE 3 THE RESULTS OF THE TRACKING EVALUATION ON THE TB DATASET
Video | Recall | Precision | Accuracy | F1-Score | CLE (pixel) | CLE Previous Research (pixel)
David | 91% | 92% | 85% | 92% | 6.35 | 14.36 [64]
Girl | 93% | 85% | 83% | 89% | 8.07 | 10.82 [65], 12.00 [66]
Blur Face | 98% | 85% | 84% | 91% | 19.02 | 27.73 [64]
Avg | 94% | 87% | 84% | 91% | 11.15 | -

The tracking evaluation results on the TB dataset, presented in Table 3, demonstrate good performance metrics across the test videos. The system achieves a notable average recall of 94%, indicating its effectiveness in correctly identifying and tracking objects in the videos. The average precision, accuracy, and F1-Score values are 87%, 84%, and 91%, respectively. The Center Location Error (CLE) values, measured in pixels, are relatively low for each video, with an average CLE of 11.15 pixels, indicating that the system maintains precise localization of the tracked objects. Furthermore, the system achieves a lower CLE than earlier studies, demonstrating an improvement in tracking accuracy.
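For reference, the CLE reported in Table 3 can be computed per frame as the Euclidean distance between the predicted and ground-truth box centers; a minimal sketch with made-up boxes is given below.

```python
import numpy as np

def center(box):
    """Center (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def center_location_error(pred_box, gt_box) -> float:
    """Per-frame CLE: Euclidean distance between box centers, in pixels."""
    return float(np.linalg.norm(center(pred_box) - center(gt_box)))

# Toy example with made-up boxes (x, y, w, h); the average over all frames
# of a sequence gives the per-video CLE reported in Table 3.
print(center_location_error((100, 80, 40, 90), (104, 85, 42, 88)))
```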
Integrating tracking methods and recognition mechanisms with the Q-Learning algorithm can optimize the tracking of human objects in dynamic environments. With its proficiency in determining optimal policies through learning from its interactions with the environment, Q-Learning can make sound decisions regarding when to initiate the recognition process and when to rely on information from the tracking system. For instance, when the recognition system detects and identifies an object, this information can serve as a starting point for the tracker to follow the object through subsequent video frames. Subsequently, utilizing Q-Learning, the system can adaptively decide when re-recognition is necessary based on changes in the object's status or condition, such as a change in orientation or an occlusion. With this proposed method, further recognition, which involves a significant computational load, is performed only when required, based on the feedback and rewards from the environment as interpreted by the Q-Learning algorithm. Hence, the fusion of visual recognition, tracking, and decision-making optimized by Q-Learning provides a robust and adaptive framework for object tracking across various scenarios and environmental conditions, mitigating the computational burden and enhancing system efficiency and accuracy.

C. RESULTS OF TESTING THE REAL-TIME ASPECT OF THE TRACKING SYSTEM
The real-time system is designed with a multi-threading technique. This technique allows the system to perform multiple tasks simultaneously, including face and body recognition, pan-tilt movement control, and the tracking system. By implementing multi-threading, each task can be executed independently in a separate thread, enabling the system to run these tasks in parallel and improve overall data-processing efficiency. In the application of streaming images per second with queues and multi-threading [59], the fps increased to 2-4 fps.
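A minimal sketch of this queue-based multi-threading pattern is given below; the frame source and the recognition step are placeholders, not the system's actual implementation.

```python
import queue
import threading

frames = queue.Queue(maxsize=8)   # bounded queue decouples capture from processing
results = queue.Queue()

def capture_worker(num_frames: int = 100) -> None:
    """Producer: pushes (placeholder) frames into the queue."""
    for i in range(num_frames):
        frames.put(f"frame-{i}")          # a real system would put camera images here
    frames.put(None)                      # sentinel: no more frames

def recognition_worker() -> None:
    """Consumer: pulls frames and runs (placeholder) recognition/tracking."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.put((frame, "dummy-bounding-box"))

threads = [threading.Thread(target=capture_worker),
           threading.Thread(target=recognition_worker)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.qsize(), "frames processed")
```

In the same spirit, separate threads can host face recognition, body recognition, the tracker, and the pan-tilt controller, all exchanging data through thread-safe queues.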


After the image processing section has been tested and performs well, the center point (x, y) of the bounding box is sent to the PTZ camera to keep the object in the center of the frame. Data is sent continuously while the camera captures the target object. The PTZ camera movement is controlled using separate systems for the horizontal and vertical axes, each regulated by a pre-designed PID controller.

FIGURE 12. Pan and tilt response with PD controller

By using Kp = 0.2 and Kd = 0.02 for the pan and tilt PD controller, an RMSE of 22 pixels is achieved, as displayed in Fig. 12. The tracked object is approximately one meter away from the camera and typically moves around the room. The pan and tilt system effectively follows the object's movements. The camera monitoring system was successfully implemented by integrating a control system into the camera movement.

TABLE 4 SYSTEM PERFORMANCE WITH P, I, AND D IN PTZ CAMERA
Controller | Camera | Settling time (second) | Rise time (second) | Overshoot (pixel) | Ess (pixel)
P | Pan | 1.77 | 1.3 | 80 | 28
P | Tilt | 1.78 | 1 | 89 | 21
PI | Pan | 5 | 1.65 | 85 | 72
PI | Tilt | 5 | 1.37 | 260 | 250
PD | Pan | 1.695 | 1 | 23 | 3
PD | Tilt | 3.88 | 1.22 | 22 | 7
PID | Pan | 3.4 | 2.2 | 56 | 3
PID | Tilt | 1.41 | 1.23 | 203 | 5.67

Table 4 shows the performance results of the pan and tilt camera system using the P, I, and D components. The PD controller on pan and tilt produces steady-state error values of 3 pixels for pan and 7 pixels for tilt. Meanwhile, Table 5 shows the total RMSE values for objects moving at an average speed of 96 pixels/second at distances of 1 m, 2 m, 3 m, and 4 m. These results indicate that the farther the object is from the camera, the smaller the RMSE value. In this case, the farther the object's distance, the smaller the resulting error, so the pan and tilt camera system can produce good tracking results on objects moving at high speed.
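As a sketch of the pan-axis control law discussed above, the PD controller below turns the pixel error between the frame center and the target center into a correction; Kp = 0.2 and Kd = 0.02 follow the text, while the sampling period and the example pixel values are assumptions.

```python
class PDController:
    """Discrete PD controller acting on a pixel error signal."""

    def __init__(self, kp: float = 0.2, kd: float = 0.02, dt: float = 1 / 30):
        self.kp, self.kd, self.dt = kp, kd, dt   # dt assumes a ~30 fps control loop
        self.prev_error = 0.0

    def update(self, error: float) -> float:
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative

pan = PDController()
frame_center_x, target_x = 320, 410        # made-up pixel positions
correction = pan.update(target_x - frame_center_x)
print(f"pan correction: {correction:.1f} (arbitrary units)")
```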
TABLE 5 RMSE RESULTS WITH PD CONTROL
Distance | 1 meter | 2 meter | 3 meter | 4 meter
RMSE (pixel) | 22 | 17.63 | 14.52 | 11.52

Moreover, the real-time tracking capability, facilitated by a simple PD controller driven by the visual multi-features, adds to the system's effectiveness.

Based on Table 6, it can be concluded that using multi-threading combined with the visual multi-feature recognition system and a tracker effectively reduces the computational load. The face detection and recognition system produced 90 fps, while the body detection and visual multi-feature recognition system produced 27 fps. However, with the addition of a tracker, time consumption became more favorable at 36 fps. The average frame rate of 50 fps indicates that combining the visual multi-feature recognition system and a tracker with Q-Learning reduces the computational load while improving time performance and accuracy.
TABLE 6 TIME REQUIRED FOR EACH SYSTEM
System | fps | ms
Face detection and recognition | 90 | 11.1
Body detection and visual multi-feature recognition | 27 | 37.0
Tracker | 36 | 27.7
Pan and tilt control | 48 | 20.8
Average | 50 | 19.9
V. CONCLUSION
The system explores visual feature recognition, specifically face and body visual multi-feature recognition. The CNN+LSTM hybrid method was employed for body visual multi-feature recognition, achieving recall, precision, F1-score, and accuracy values of 87.36%, 91.02%, 88.43%, and 89.20%, respectively. In comparison, the single CNN method achieved recall, precision, F1-score, and accuracy values of 74.00%, 67.00%, 69.00%, and 88.84%, respectively. The system uses multi-threading techniques for the image flow to achieve real-time processing. Additionally, the combination of face and body visual features with the tracking system under the Q-Learning method is used to enhance target tracking in real time, even when the target is not directly facing the camera. The Q-Learning method improves adaptability by using rewards that depend on the accuracy of the face and body visual features, the object locations, and the environmental conditions to make informed decisions. With the Q-Learning method, the system demonstrates enhanced visual multi-feature fusion: it achieved an accuracy of 91.5% in test scenarios across five different videos, and an accuracy of 84% with an average center location error of 11.15 pixels on the video benchmark dataset. Moreover, the successful implementation of the proposed method on a PTZ camera resulted in a PD system for pan and tilt that achieved an RMSE of 11.52 pixels within a range of four meters. These results signify a good level of accuracy in real-time human movement tracking. However, a limitation is that the proposed method can track a specific object at an average speed of 133.38 pixels per second. To enhance facial and body feature recognition accuracy, expanding the training dataset with a more extensive variety covering different positions, lighting, and other conditions is recommended. Future work can also combine data from other sensors, such as depth or infrared sensors, to increase system robustness in poor lighting conditions or when occlusion occurs, and conduct further research on optimizing the Q-Learning algorithm to increase the speed and efficiency of decision-making under dynamic environmental conditions.

ACKNOWLEDGMENT
This work was supported by the School of Electrical Engineering and Informatics, Institut Teknologi Bandung.

REFERENCES
[1] Y. Xiang, A. Alahi, and S. Savarese, "Learning to track: Online multi-object tracking by decision making," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4705-4713.
[2] C. L. Hwang, D. S. Wang, F. C. Weng, and S. L. Lai, "Interactions between specific human and omnidirectional mobile robot using deep learning approach: SSD-FN-KCF," IEEE Access, vol. 8, pp. 41186-41200, 2020, doi: 10.1109/ACCESS.2020.2976712.
[3] A. Sadeghzadeh and H. Ebrahimnezhad, "Pose-invariant face recognition based on matching the occlusion free regions aligned by 3D generic model," IET Computer Vision, vol. 14, no. 5, pp. 268-277, 2020, doi: 10.1049/iet-cvi.2019.0244.
[4] B. Jiang, Q. Zhang, Z. Li, Q. Wu, and H. Zhang, "Non-frontal facial expression recognition based on salient facial patches," EURASIP Journal on Image and Video Processing, no. 1, 2021, doi: 10.1186/s13640-021-00555-5.
[5] D. Andriana, A. S. Prihatmanto, E. M. I. Hidayat, and C. Machbub, "Combination of face and posture features for tracking of moving human visual characteristics," International Journal on Electrical Engineering and Informatics, vol. 9, no. 3, pp. 616-631, 2017, doi: 10.15676/ijeei.2017.9.3.14.
[6] S. Banik, M. Lauri, and S. Frintrop, "Multi-label Object Attribute Classification using a Convolutional Neural Network," 2018. [Online]. Available: http://arxiv.org/abs/1811.04309
[7] U. Asif, D. Mehta, S. Von Cavallar, J. Tang, and S. Harrer, "DeepActsNet: A deep ensemble framework combining features from face, hands, and body for action recognition," Pattern Recognition, vol. 139, p. 109484, 2023, doi: 10.1016/j.patcog.2023.109484.


[8] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583-596, 2014, doi: 10.1109/TPAMI.2014.2345390.
[9] D. A. Maharani, C. Machbub, L. Yulianti, and P. H. Rusmin, "Deep features fusion for KCF-based moving object tracking," Journal of Big Data, vol. 10, no. 1, 2023, doi: 10.1186/s40537-023-00813-5.
[10] D. A. Maharani, C. Machbub, L. Yulianti, and P. H. Rusmin, "Real-time Human Tracking System using Histogram Intersection Distance in Firefly Optimization Based Particle Filter," International Journal on Electrical Engineering and Informatics, vol. 13, no. 4, pp. 853-872, 2021, doi: 10.15676/ijeei.2021.13.4.7.
[11] M. Liu, C. B. Jin, B. Yang, X. Cui, and H. Kim, "Online multiple object tracking using confidence score-based appearance model learning and hierarchical data association," IET Computer Vision, vol. 13, no. 3, pp. 312-318, 2019.
[12] N. H. Abdulghafoor and H. N. Abdullah, "A novel real-time multiple objects detection and tracking framework for different challenges," Alexandria Engineering Journal, vol. 61, no. 12, pp. 9637-9647, 2022.
[13] M. Ye, C. Tianqing, and F. Wenhui, "A single-task and multi-decision evolutionary game model based on multi-agent reinforcement learning," Journal of Systems Engineering and Electronics, vol. 32, no. 3, pp. 642-657, 2021, doi: 10.23919/jsee.2021.000055.
[14] B. M. Albaba and Y. Yildiz, "Driver modeling through deep reinforcement learning and behavioral game theory," IEEE Transactions on Control Systems Technology, vol. 30, no. 2, pp. 885-892, 2021, doi: 10.1109/TCST.2021.3075557.
[15] K. Song, W. Zhang, R. Song, and Y. Li, "Online decision based visual tracking via reinforcement learning," Advances in Neural Information Processing Systems, vol. 33, pp. 11778-11788, 2020.
[16] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-Learning Algorithms: A Comprehensive Classification and Applications," IEEE Access, vol. 7, pp. 133653-133667, 2019, doi: 10.1109/ACCESS.2019.2941229.
[17] J. Kober, J. Peters, and J. A. Bagnell, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, 2013.
[18] A. Kumar Shakya, G. Pillai, and S. Chakrabarty, "Reinforcement Learning Algorithms: A brief survey," Expert Systems with Applications, p. 120495, 2023.
[19] R. Dearden, N. Friedman, and S. Russell, "Bayesian Q-learning," AAAI/IAAI, pp. 761-768, 1998.
[20] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.
[21] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), 2016, pp. 265-283.
[22] P. Virtanen et al., "SciPy 1.0: fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, no. 3, pp. 261-272, 2020, doi: 10.1038/s41592-019-0686-2.
[23] G. Danuser, "Computer vision in cell biology," Cell, vol. 147, no. 5, pp. 973-978, 2011, doi: 10.1016/j.cell.2011.11.001.
[24] P. Wilson and J. Fernandez, "Facial feature detection using Haar classifiers," Journal of Computing Sciences in Colleges, vol. 21, no. 4, pp. 127-133, 2006.
[25] A. G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv preprint arXiv:1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[26] D. A. Maharani, C. Machbub, P. H. Rusmin, and L. Yulianti, "Improving The Capability of Real-time Face Masked Recognition using Cosine Distance," in 2020 6th International Conference on Interactive Digital Media (ICIDM), 2020.
[27] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[28] W. Liu et al., "SSD: Single shot multibox detector," in Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, 2016, pp. 21-37, doi: 10.1007/978-3-319-46448-0_2.
[29] V. Ferrari and A. Zisserman, "Learning Visual Attributes," in Advances in Neural Information Processing Systems, 2007. Accessed: Jun. 23, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2007/hash/ed265bc903a5a097f61d3ec064d96d2e-Abstract.html
[30] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar, "Describable visual attributes for face verification and image search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 1962-1977, 2011, doi: 10.1109/TPAMI.2011.48.
[31] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, "PANDA: Pose aligned networks for deep attribute modeling," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637-1644. [Online]. Available: http://arxiv.org/abs/1311.5591
[32] H. Fan, H.-M. Hu, S. Liu, W. Lu, and S. Pu, "Correlation graph convolutional network for pedestrian attribute recognition," IEEE Trans. Multimedia, vol. 24, pp. 49-60, 2020, doi: 10.1109/TMM.2020.3045286.


[33] R. Layne, T. Hospedales, S. Gong, and Q. Mary, "Person re-identification by attributes," BMVC, vol. 2, no. 3, p. 8, 2012, doi: 10.5244/C.26.24.
[34] J. Zhu, S. Liao, Z. Lei, D. Yi, and S. Z. Li, "Pedestrian attribute classification in surveillance: Database and evaluation," in Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia: IEEE, 2013, pp. 331-338, doi: 10.1109/ICCVW.2013.51.
[35] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at far distance," in Proceedings of the 22nd ACM International Conference on Multimedia, ACM, Nov. 2014, pp. 789-792, doi: 10.1145/2647868.2654966.
[36] I. N. Junejo and N. Ahmed, "Depthwise separable convolutional neural networks for pedestrian attribute recognition," SN Comput. Sci., vol. 2, pp. 1-11, 2021, doi: 10.1007/s42979-021-00493-z.
[37] Y. Li, F. Shi, S. Hou, J. Li, C. Li, and G. Yin, "Feature Pyramid Attention Model and Multi-Label Focal Loss for Pedestrian Attribute Recognition," IEEE Access, vol. 8, pp. 164570-164579, 2020, doi: 10.1109/ACCESS.2020.3010435.
[38] S. Jiang, W. Min, L. Liu, and Z. Luo, "Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition," IEEE Transactions on Image Processing, vol. 29, pp. 265-276, 2019, doi: 10.1109/TIP.2019.2929447.
[39] F. Nian, X. Chen, S. Yang, and G. Lv, "Facial attribute recognition with feature decoupling and graph convolutional networks," IEEE Access, vol. 7, pp. 85500-85512, 2019, doi: 10.1109/ACCESS.2019.2925503.
[40] K. Han, Y. Wang, H. Shu, C. Liu, C. Xu, and C. Xu, "Attribute aware pooling for pedestrian attribute recognition," arXiv, 2019. [Online]. Available: http://arxiv.org/abs/1907.11837
[41] H. Han, A. K. Jain, F. Wang, S. Shan, and X. Chen, "Heterogeneous face attribute estimation: A deep multi-task learning approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 11, pp. 2597-2609, 2017, doi: 10.1109/TPAMI.2017.2738004.
[42] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116, 1998, doi: 10.1142/S0218488598000094.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[44] Md. Z. Islam, Md. M. Islam, and A. Asraf, "A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using X-ray images," Informatics in Medicine Unlocked, vol. 20, p. 100412, 2020, doi: 10.1016/j.imu.2020.100412.
[45] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning. Cambridge University Press, 2021. [Online]. Available: https://d2l.ai
[46] G. Van Houdt, C. Mosquera, and G. Nápoles, "A review on the long short-term memory model," Artificial Intelligence Review, vol. 53, pp. 5929-5955, 2020, doi: 10.1007/s10462-020-09838-1.
[47] L. Alzubaidi et al., "Review of deep learning: concepts, CNN architectures, challenges, applications, future directions," Journal of Big Data, vol. 8, pp. 1-74, 2021.
[48] S. B. Damelin and W. Miller, The Mathematics of Signal Processing. Cambridge University Press, 2012, doi: 10.1017/CBO9781139003896.
[49] X. Zhou, S. Li, C. Liu, H. Zhu, N. Dong, and T. Xiao, "Non-Intrusive Load Monitoring Using a CNN-LSTM-RF Model Considering Label Correlation and Class-Imbalance," IEEE Access, vol. 9, pp. 84306-84315, 2021, doi: 10.1109/ACCESS.2021.3087696.
[50] T. Zhou, M. Zhu, D. Zeng, and H. Yang, "Scale Adaptive Kernelized Correlation Filter Tracker with Feature Fusion," Mathematical Problems in Engineering, vol. 2017, 2017, doi: 10.1155/2017/1605959.
[51] F. Yue and X. Li, "Improved kernelized correlation filter algorithm and application in the optoelectronic tracking system," International Journal of Advanced Robotic Systems, vol. 15, no. 3, p. 1729881418776582, 2018, doi: 10.1177/1729881418776582.
[52] X. Wang, G. Wang, Z. Zhao, Y. Zhang, and B. Duan, "An improved kernelized correlation filter algorithm for underwater target tracking," Applied Sciences (Switzerland), vol. 8, no. 11, p. 2154, 2018, doi: 10.3390/app8112154.
[53] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992, doi: 10.1023/A:1022676722315.
[54] B. Zhong, B. Bai, J. Li, Y. Zhang, and Y. Fu, "Hierarchical tracking by reinforcement learning-based searching and coarse-to-fine verifying," IEEE Trans. on Image Process., vol. 28, no. 5, pp. 2331-2341, 2018, doi: 10.1109/TIP.2018.2885238.
[55] Z. Xu, Y. Wang, J. Jiang, J. Yao, and L. Li, "Adaptive Feature Selection With Reinforcement Learning for Skeleton-Based Action Recognition," IEEE Access, vol. 8, pp. 213038-213051, 2020, doi: 10.1109/ACCESS.2020.3038235.
[56] X. Liu et al., "HydraPlus-Net: Attentive deep features for pedestrian analysis," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 350-359. [Online]. Available: http://arxiv.org/abs/1709.09930


[57] F. Chollet et al., "Keras." [Online]. Available: https://github.com/fchollet/keras
[58] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[59] G. Van Rossum et al., "Python 3 Reference Manual," Nature, vol. 585, no. 7825, pp. 357-362, 2009.
[60] J. G. Ziegler and N. B. Nichols, "Optimum settings for automatic controllers," Transactions of the American Society of Mechanical Engineers, vol. 64, no. 8, pp. 759-768, 1942.
[61] J. Jia, N. Gao, F. He, X. Chen, and K. Huang, "Learning disentangled attribute representations for robust pedestrian attribute recognition," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, pp. 1069-1077, 2022, doi: 10.1609/aaai.v36i1.19991.
[62] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: arXiv preprint arXiv:1409.1556
[63] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016, doi: 10.1109/CVPR.2016.308.
[64] Y. Wu, J. Lim, and M. H. Yang, "Online object tracking: A benchmark," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. [Online]. Available: http://cvlab.hanyang.ac.kr/tracker_benchmark/seq/
[65] P. G. Bhat, B. N. Subudhi, T. Veerakumar, V. Laxmi, and M. S. Gaur, "Multi-feature fusion in particle filter framework for visual tracking," IEEE Sensors Journal, vol. 20, no. 5, pp. 2405-2415, 2019, doi: 10.1109/JSEN.2019.2954331.
[66] S. D. Lin, J. J. Lin, and C. Y. Chuang, "Particle filter with occlusion handling for visual tracking," IET Image Processing, vol. 9, no. 11, pp. 959-968, 2015, doi: 10.1049/iet-ipr.2014.0666.
[67] C. Chen, S. Li, H. Qin, and A. Hao, "Real-time and robust object tracking in video via low-rank coherency analysis in feature space," Pattern Recognition, vol. 48, no. 9, pp. 2885-2905, 2015, doi: 10.1016/j.patcog.2015.01.025.
