Abstract—Human posture recognition is extensively studied for identifying human activities in a wide range of applications. The task remains difficult because of complex backgrounds, changing body positions, self- and object occlusion, poor image resolution, and varying lighting. To address these limitations and improve performance, a model for estimating human posture is developed that makes use of deep convolutional neural networks. The proposed method uses a bottom-up parsing technique to identify crucial locations on the human body. Moreover, it employs a non-parametric description of the vector field of key-point associations, allowing the anatomical key points of each individual to be grouped. The accuracy of the localized key points is further improved through multiple stages of refined prediction. The proposed method is trained and tested on the MPII Human Pose dataset and demonstrates superior accuracy compared to the latest state-of-the-art techniques. Moreover, by including an occlusion network in the feature representation process, it efficiently discovers occluded key points, resulting in a mean average precision of 90.4%.

Index Terms—Bottom-up parsing, occlusion, pose estimation, skeletal key points.

I. INTRODUCTION

The task of human posture estimation in computer vision involves recognizing and determining the positions and postures of humans in images and videos. The objective is to detect the positions and angles of crucial body joints or distinctive landmarks, ranging from the head, elbows, shoulders, wrists, hips, and knees to the ankles. It has diverse applications across areas such as robotics, motion capture, human-computer interaction, sports analysis, healthcare, and augmented reality, and new techniques for estimating human posture are introduced each year. Early methods identified the posture of a single person in an image containing only that person: a person detector is applied and, for each detection, single-person pose estimation is performed. These methods identify each constituent part separately before joining them to form the pose. Estimating the poses of multiple people in a photograph poses a greater challenge than the single-person case, because the number and positions of the individuals in the picture are unknown beforehand, which makes accurate inference harder for the system. Multi-person posture estimation typically follows either a top-down or a bottom-up strategy to address this problem [1].

The "human posture" problem refers to the difficulty of localizing human joints. It can be challenging to detect connections hidden by masking and to understand the scene when the body is only partially visible. Estimating a pose accurately is therefore a complex task that requires searching a vast space of potential postures, including partially hidden ones, which is especially challenging given the sheer number of possible positions and the need for repeated trial and error. The goal of this work is to put forward a model that can estimate multi-person poses and accurately locate, orient, and localize body components; it is used to estimate anatomically important points, or "parts," of individual subjects in the MPII dataset [2].

This work is an improved version of the joint learning structure [3]. The proposed model improves the output, detects occluded and hidden key points, and is more accurate than the existing cutting-edge approach. By incorporating an enhanced feature representation, hidden critical points can be revealed. Detecting, localizing, and estimating human posture from multi-stage feature information is not without challenges, and the model requires careful implementation to ensure accurate results. These difficulties with correct pose estimation are the focus of this work. The main contributions are:

• Incorporating features from multiple stages in a parallel structure, leading to a high accuracy of 90.4% average PCKh on the MPII benchmark.

• Extending branch 1's fundamental architecture with an occlusion branch to accurately detect obscured key locations.
II. LITERATURE REVIEW

Multi-person pose estimation has undergone significant progress in recent times, owing to the growing adoption of convolutional neural networks. In [4], the authors introduced a novel convolutional neural network (CNN) technique for posture estimation. The method employed multiple CNN pose regressors in a cascade structure, enabling a holistic approach to gathering information and analyzing poses. This technique offered the advantage of anticipating difficulties in localization and pose estimation.

In [5] and [6], exceptional results were obtained by using advanced convolutional neural networks to produce confidence scores for crucial key points. In [6], the authors introduced a U-shaped network, also referred to as an hourglass module, in which multiple hourglass modules were stacked to generate a prediction. This approach proved highly effective and led to significant improvements in accuracy.

The key-point association constraint network (KACNet), presented in [7], is a recently developed network designed to evaluate and analyze human posture efficiently. The first channel of this network identifies critical locations, with a distance loss function serving as a constraint on the system. Channel 2 is constrained by an association loss function that plays a crucial role in deciding the relationships between key points. False associations in both regular and unusual stances may limit this approach.

The authors of [8] offer ThermalPose, a portable neural network system. They developed a tool for processing thermal images and applied existing vision algorithms that had been trained to recognize human poses in RGB images. Compared to the vision-based pose estimator, the model is 26 percent smaller. Some restrictions remain, such as the dependence of thermal monitoring on the heat radiated by the person's body and the fact that thermal radiation cannot pass through obstacles.

The human pose estimation problem with mismatched data between distinct postures is addressed in [9]. The authors utilize the K-means clustering approach to identify and analyze unusual poses without extensive prior knowledge about them. To address limited data availability, they propose three solutions: duplicating rare poses, generating synthetic data based on rare poses, and adding uncommon poses to the dataset. These techniques aim to overcome data scarcity; however, given the small number of typical posture data points, the overall improvement in performance is unlikely to be significant.
III. METHODOLOGY

The main objective of human posture estimation is to identify and determine the 2D locations of a total of 16 key points (referred to as K) in photographs for each person. This involves accurately detecting and localizing specific joints and parts of the body. The proposed method, like [10], uses a bottom-up pipeline: first, a set of key points is identified for each person, and then all of those body parts are connected or grouped to generate individual instances of human posture.

A. Overview of the Proposed Workflow

The method for determining the poses of several individuals is represented visually in a schematic layout. The fundamental steps of the proposed multi-person pose estimation workflow are shown in Fig. 1.

The initial step in the pipeline involves preprocessing the input image to make it ready for subsequent processing; during this stage the input photos are resized, scaled, and normalized. The processed image is then fed into a CNN, which creates a collection of feature maps. Once the features have been extracted, they are sent along three different branches. Branch 1 produces a collection of confidence maps, one for each of the predetermined key points, which estimate the probability of each key-point location; branch 2 creates a set of cluster maps specific to each person featured in the photo; and, for the association, branch 3 generates a collection of affinity fields between pairs of body parts. The parsing process then yields a collection of bipartite matches for associating body-part candidates. Finally, every person in the image is assigned a full-body posture using the matched set of parts.

B. Detailed Explanation of Our Proposed Methodology

The steps of multi-person pose estimation and their implementation are described in the following subsections.

1) Data Preprocessing: Images used for deep learning training need to be of similar size. On a larger image the network must process more than four times as many pixels as on a smaller one, which increases training time, while shrinking an image preserves the important details. An image size of 256 x 256 is used, chosen according to the network and GPU capacity.

Because of gradient propagation concerns, the pixel values are rescaled to a predetermined range. Pixels originally range from 0 to 255, which can make computations challenging for neural networks; normalizing the data to a range of 0 to 1 reduces computational complexity.

In addition, a standardization step is applied, which is crucial for preventing issues during training. Non-standardized data can make training difficult and slow learning down. When numerical inputs have very different magnitudes, the proportional importance of each input during training becomes uncertain, and high values tend to outweigh smaller ones.
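To make the preprocessing step concrete, the following is a minimal Python sketch using OpenCV and NumPy; the 256 x 256 target size follows the paper, while the mean and standard-deviation values used for standardization are placeholder (ImageNet-style) assumptions, since the paper does not report them.

import cv2
import numpy as np

def preprocess(image_bgr, size=(256, 256),
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Resize, rescale to [0, 1], and standardize an input image.

    mean/std are placeholder (ImageNet-style) channel statistics; the paper
    does not specify which values were used for standardization.
    """
    # Resize so every training image has the same spatial dimensions.
    resized = cv2.resize(image_bgr, size, interpolation=cv2.INTER_LINEAR)
    # Rescale pixel intensities from [0, 255] to [0, 1].
    scaled = resized.astype(np.float32) / 255.0
    # Standardize each channel so all inputs have comparable magnitudes.
    return (scaled - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)

if __name__ == "__main__":
    dummy = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
    print(preprocess(dummy).shape)  # (256, 256, 3)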
each person. This involves accurately detecting and localizing volutional neural networks are evaluated to extract spatial
specific joints and parts of the body. This proposed method, characteristics from each image. Several designs, including
like [10], uses a bottom-up pipeline: first, a set of key points ResNet versions 50, 101, and 152, and HRNet versions w32,
is identified for each person, and then all of those body parts and w48, are used in this work. Both ResNet and HRNet
Fig. 1: Methodology of our proposed workflow.
Both ResNet and HRNet have unique characteristics that are worth noting. ResNet simplifies the training of deep neural networks and avoids the vanishing-gradient problem through identity mappings. In contrast, HRNet preserves high-resolution representations throughout the entire process, with repeated information exchange across resolutions, ensuring that the final output is accurate both spatially and semantically. This makes it a valuable architecture for image-processing tasks.
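As an illustration of how such a backbone can serve as the shared feature extractor, the sketch below wraps a pre-trained torchvision ResNet-50 and keeps only its convolutional stages; this is an assumption about the wiring, not the authors' implementation, and HRNet-w48 (the backbone ultimately adopted) would be plugged in the same way. A recent torchvision version is assumed for the weights argument.

import torch
import torch.nn as nn
from torchvision import models

class ResNetFeatureExtractor(nn.Module):
    """Wraps a pre-trained ResNet-50 and returns its spatial feature maps
    (the output of the last convolutional stage, before pooling)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the global average pool and the classification head so the
        # spatial layout of the features is preserved.
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x):
        return self.features(x)  # (N, 2048, H/32, W/32)

if __name__ == "__main__":
    extractor = ResNetFeatureExtractor().eval()
    with torch.no_grad():
        feats = extractor(torch.randn(1, 3, 256, 256))
    print(feats.shape)  # torch.Size([1, 2048, 8, 8])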
3) Part Detection, Grouping, and Part Association: An iterative prediction architecture is built using three branches to deliver more accurate predictions. Fig. 2 shows the structural layout of the branches.

a) Part Detection for Confidence Maps: The feature map is the input to Branch 1. This branch creates a set of heat maps, a matrix representation in which each pixel stores a key-point confidence score. The locations of the annotated 2D body joints are used to create the ground-truth confidence map S^{\prime *}_{j,k}(p) defined in (1).

S^{\prime *}_{j,k}(p) = \exp\left( -\frac{\| p - x_{j,k} \|_2^2}{\sigma^2} \right)    (1)

When the body part is visible, a peak appears for the specific visible part j of each individual k, and the spread of this peak is controlled by σ. The ground-truth location of that body part for person k in the picture is denoted x_{j,k}. The objective is to identify the point at which the map attains its peak value, which is why the greatest confidence score is required. At test time, non-maximum suppression (NMS) is used to extract the local maxima. A least-squares error loss is applied between the predicted and ground-truth confidence maps.
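A small worked example of building the ground-truth confidence maps of (1) is sketched below; the heatmap resolution and the value of σ are illustrative assumptions. At test time the peaks of the predicted maps would be recovered with non-maximum suppression.

import numpy as np

def confidence_maps(joints, visible, height=64, width=64, sigma=2.0):
    """Ground-truth confidence maps following Eq. (1).

    joints:  array of shape (K, 2) with (x, y) joint positions of one person
             on the heatmap grid.
    visible: boolean array of shape (K,) marking annotated/visible joints.
    Returns an array of shape (K, height, width); the map of an invisible
    joint is left as zeros.
    """
    K = joints.shape[0]
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((K, height, width), dtype=np.float32)
    for j in range(K):
        if not visible[j]:
            continue
        x, y = joints[j]
        d2 = (xs - x) ** 2 + (ys - y) ** 2   # squared distance ||p - x_{j,k}||^2
        maps[j] = np.exp(-d2 / sigma ** 2)   # Gaussian peak centred on the joint
    return maps

# When several people are present, per-person maps are usually combined with a
# pixel-wise maximum so each individual peak is preserved.
def merge_people(per_person_maps):
    return np.max(np.stack(per_person_maps, axis=0), axis=0)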
Fig. 2: The multistage architecture represents the structure of the branches. The parameters of the convolutional layers are denoted as "Conv⟨kernel size⟩-⟨number of output channels⟩".

In order to address the issue of missing information at critical points caused by occlusion, the Branch 1 architecture has been enhanced, as visualized in Fig. 3. This updated design includes a specialized occlusion-net branch that helps identify these hidden areas, allowing for a more complete and precise representation of the human body. By sharing information between the two levels and using a transposed convolution to split the representation, the network can effectively manage occlusions and provide additional information for subsequent processing, which leads to improved accuracy in identifying both exposed and hidden critical points.
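Since the paper describes the occlusion branch only at a high level, the following PyTorch module is a speculative sketch: a small branch that shares the backbone feature maps with Branch 1, upsamples them with a transposed convolution, and predicts one occlusion map per key point. The layer sizes and channel counts are assumptions, not the authors' configuration.

import torch
import torch.nn as nn

class OcclusionNet(nn.Module):
    """Hypothetical occlusion branch: shares the backbone features with
    Branch 1 and predicts per-keypoint maps for occluded joints."""

    def __init__(self, in_channels=256, num_keypoints=16):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # The transposed convolution doubles the spatial resolution,
            # splitting the shared representation into a finer occlusion map.
            nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(128, num_keypoints, kernel_size=1)

    def forward(self, shared_features):
        return self.head(self.refine(shared_features))

if __name__ == "__main__":
    branch = OcclusionNet()
    out = branch(torch.randn(1, 256, 64, 64))
    print(out.shape)  # torch.Size([1, 16, 128, 128])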
Fig. 3: An extension for branch detection, otherwise known as the Occlusion net, has been developed in order to locate key points that are obstructed or hidden from view.

b) Cluster Maps for Part Grouping: The Branch 2 component takes a feature map as its input and generates multiple cluster maps consisting of probability scores. These scores are instrumental in identifying the body parts belonging to each person in the image; this branch also yields the total number of people in the input image. A ground-truth cluster map Q^{\prime *}_{d,j}, defined in (2), is generated from the locations of the annotated 2D body joints.

Q^{\prime *}_{d,j} = \| x^{*}_{j,k} - x_{j,k} \|_2^2    (2)

Here x^{*}_{j,k} represents the ground-truth position of the neck key point of each person in the image, and x_{j,k} denotes the annotated 2D coordinates of the remaining body joints. To determine the key points associated with a particular individual, the minimum Euclidean (straight-line) distance between the neck and the other key points is calculated. Key points are assumed to belong to a specific individual, and by determining the set of key points associated with that individual, their identity can be established: the distances between the neck and the other key points of a body are computed, and the key points are grouped accordingly, which identifies each individual.
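A minimal sketch of the neck-distance grouping rule described above is given below: each detected key point is assigned to the person whose neck is nearest in Euclidean distance. This is an illustrative reconstruction, not the authors' code.

import numpy as np

def group_by_neck(necks, keypoints):
    """Assign each detected key point to the nearest neck.

    necks:     array of shape (P, 2), one (x, y) neck position per person.
    keypoints: array of shape (M, 2), candidate key points of any type.
    Returns a list of length M with the index of the person each key point
    is grouped with (minimum Euclidean distance to a neck).
    """
    assignments = []
    for kp in keypoints:
        dists = np.linalg.norm(necks - kp, axis=1)  # straight-line distances
        assignments.append(int(np.argmin(dists)))
    return assignments

if __name__ == "__main__":
    necks = np.array([[50.0, 40.0], [150.0, 42.0]])
    candidates = np.array([[55.0, 90.0], [148.0, 95.0], [160.0, 30.0]])
    print(group_by_neck(necks, candidates))  # [0, 1, 1]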
c) Affinity Fields for Part Association: The third branch of the model, Branch 3, receives the feature map as input and provides a non-parametric representation of the connections between key joints, encoding both the spatial location and the orientation of the limbs. The part affinity field is a 2D vector field that represents the connection of each limb. To train the model, the association loss function is applied to the predicted and ground-truth part affinity fields. The ground-truth part affinity vector field at each point p in the image is represented as L^{*}_{c}(p), defined in (3). Overall, this approach provides a detailed and informative understanding of key-point connections and limb positioning while effectively minimizing errors and inaccuracies during training.

L^{*}_{c}(p) = \begin{cases} \dfrac{x_{j_2,k} - x_{j_1,k}}{\| x_{j_2,k} - x_{j_1,k} \|_2} & \text{if the limb exists at } p \\ 0 & \text{otherwise} \end{cases}    (3)

In the above equation, (x_{j_2,k} - x_{j_1,k}) / \| x_{j_2,k} - x_{j_1,k} \|_2 is a unit vector directed along the limb. The ground-truth part affinity field is created by averaging the individual affinity fields of all people. To determine the confidence of a part association, the predicted part affinity field L^{\prime *}_{c}(p) is computed as defined in (4).

L^{\prime *}_{c}(p) = \frac{1}{n(p)} \sum_{k} L^{*}_{c}(p)    (4)

In this context, n(p) refers to the number of non-zero vectors at point p across the individuals depicted in the image.
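The sketch below illustrates equations (3) and (4): for one limb the ground-truth field stores the unit vector along the limb at the points lying on it, and the fields of several people are combined by dividing the sum by n(p). The grid size and limb width are illustrative assumptions.

import numpy as np

def limb_paf(p1, p2, height=64, width=64, limb_width=1.0):
    """Ground-truth part affinity field of one limb (Eq. (3)).

    p1, p2: (x, y) endpoints x_{j1,k} and x_{j2,k} of the limb.
    Returns an array of shape (2, height, width): the unit vector along the
    limb at points on the limb, zero elsewhere.
    """
    v = np.asarray(p2, dtype=np.float32) - np.asarray(p1, dtype=np.float32)
    length = np.linalg.norm(v)
    unit = v / (length + 1e-8)                 # (x_{j2,k} - x_{j1,k}) / ||.||_2
    ys, xs = np.mgrid[0:height, 0:width]
    rel_x, rel_y = xs - p1[0], ys - p1[1]
    # Distance along the limb and perpendicular to it.
    along = rel_x * unit[0] + rel_y * unit[1]
    perp = np.abs(rel_x * unit[1] - rel_y * unit[0])
    on_limb = (along >= 0) & (along <= length) & (perp <= limb_width)
    field = np.zeros((2, height, width), dtype=np.float32)
    field[0][on_limb], field[1][on_limb] = unit[0], unit[1]
    return field

def average_pafs(fields):
    """Combine per-person fields of one limb type (Eq. (4)): divide the sum
    by n(p), the number of non-zero vectors at each point."""
    stacked = np.stack(fields, axis=0)                        # (people, 2, H, W)
    summed = stacked.sum(axis=0)
    n_p = (np.linalg.norm(stacked, axis=1) > 0).sum(axis=0)   # n(p) per pixel
    return summed / np.maximum(n_p, 1)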
4) Post-processing:

a) Greedy Parsing Algorithm: High-quality parsed body poses are generated by a greedy parsing technique that finds the best match between the candidates of two part types. This is the well-known assignment problem of bipartite graph matching, in which the connections that maximize the total score must be found. Because the most evident and immediate association is required for posture estimation, key points are matched so as to create accurate connections. The greedy technique is applied to every limb and determines the weight for connecting that limb with other limbs or body parts. Furthermore, because the Branch 1 model provides the highest probability score for each joint, maximum-weight matching combined with greedy parsing is used to obtain correct pairwise connections between key joints.
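The greedy bipartite matching can be illustrated as follows: candidate detections of the two joint types of a limb are scored against each other, the pairs are sorted by score, and each candidate is used at most once. The score matrix here is a placeholder; in the proposed method it would come from the predicted affinity field weighted by the Branch 1 confidences.

import numpy as np

def greedy_match(scores):
    """Greedy assignment between two candidate sets for one limb type.

    scores: matrix of shape (A, B) where scores[a, b] is the association score
            between candidate a of the first joint type and candidate b of the
            second (e.g. an integrated PAF score weighted by confidences).
    Returns a list of (a, b, score) connections, each candidate used once.
    """
    pairs = [(a, b, scores[a, b]) for a in range(scores.shape[0])
                                  for b in range(scores.shape[1])]
    pairs.sort(key=lambda t: t[2], reverse=True)   # highest scores first
    used_a, used_b, connections = set(), set(), []
    for a, b, s in pairs:
        if a in used_a or b in used_b or s <= 0:
            continue
        connections.append((a, b, float(s)))
        used_a.add(a)
        used_b.add(b)
    return connections

if __name__ == "__main__":
    # Two shoulder candidates vs. three elbow candidates (toy scores).
    s = np.array([[0.9, 0.1, 0.2],
                  [0.2, 0.8, 0.3]])
    print(greedy_match(s))  # [(0, 0, 0.9), (1, 1, 0.8)]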
b) Merging: Merging is the final stage of converting detected connections into finished skeletons. Each connection is initially treated as belonging to a different person, forming sets; whenever sets share a body part, they are merged into one connected set. The output image shows the estimated skeleton of each individual and the locations of their major body parts, which aids in analyzing and understanding the movement and pose of each person in the image.
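A sketch of the merging step is shown below, treating each connection as a pair of parts and merging sets that share a part. The union-find structure is an implementation choice on our side, not something specified in the paper.

def merge_connections(connections):
    """Merge limb connections into per-person skeletons.

    connections: list of ((joint_type_a, candidate_a), (joint_type_b, candidate_b))
                 pairs produced by the parsing step.
    Returns a list of sets; each set holds the parts of one person.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:                  # path splitting for efficiency
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for part_a, part_b in connections:         # connections sharing a part end
        union(part_a, part_b)                  # up in the same set (skeleton)

    skeletons = {}
    for part in parent:
        skeletons.setdefault(find(part), set()).add(part)
    return list(skeletons.values())

if __name__ == "__main__":
    conns = [(("neck", 0), ("shoulder", 0)), (("shoulder", 0), ("elbow", 1)),
             (("neck", 1), ("shoulder", 1))]
    print(merge_connections(conns))
    # two skeletons: {neck 0, shoulder 0, elbow 1} and {neck 1, shoulder 1}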
IV. EXPERIMENTS AND RESULTS

Evaluating human pose estimation performance is difficult because many important factors must be taken into account. The Max-Planck Institute for Informatics (MPII) Human Pose dataset, used in a number of well-established and openly accessible benchmarks, is adopted here. This collection contains around 25K photos of individuals with about 40K annotated body joints.

TABLE I: The detection rate of key points observed with various optimizers and learning rates.

Optimizer Learning Rate Mean
RMSProp 0.0001 31.8
Adam 0.01 79.1
Adam 0.001 89.2
Adam 0.0001 90.4
SGD 0.0001 15.6
TABLE II: Model performance on the MPII dataset using pre-trained models.
Pretrained Model Head Shoul. Elbow Wrist Hip Knee Ankle Parameters GFLOPs Mean
ResNet 50 96.1 94.3 86.9 80.4 86.8 80.6 76.3 5.64M 90.45 86.5
ResNet 101 96.7 95.3 89.4 84.4 88.5 84.3 80.1 5.64M 90.45 88.9
ResNet 152 96.9 95.4 89.4 84.4 88.5 84.3 80.1 6.77M 91.5 88.9
HRNet-w32 96.7 95.9 90.3 84.1 89.1 85.1 80.7 2.93M 7.25 89.2
HRNet-w48 97.9 95.6 90.5 85.8 89.2 85.1 80.8 17.04M 11.34 90.4
The usefulness of the proposed model is tested by experiments that compare it to other methods.

The Adam optimizer is used to train all posture estimation models, with an initial learning rate of 1e-4, a weight decay of 1e-5, and a training batch size of 4. It usually takes about 1.5 days to train the ResNet 50, 101, and 152 and HRNet-w32 and -w48 based models, which have been pre-trained on the ImageNet dataset; NVIDIA GTX 1080 GPUs are used to carry out the 250 training epochs. All human key points are evaluated in terms of mAP (mean average precision) based on the PCKh threshold; on the MPII dataset, PCKh (head-normalized probability of correct keypoint) is the widely used metric for keypoint detection.
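For concreteness, the sketch below sets up the reported optimizer configuration and a simplified PCKh computation; `model` is a placeholder module, and the head-size handling is an assumption (PCKh normalizes the joint error by a fraction of the annotated head segment length).

import numpy as np
import torch

# Optimizer configuration reported in the paper: Adam, initial learning rate
# 1e-4, weight decay 1e-5, batch size 4. `model` is a placeholder module.
model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
batch_size = 4

def pckh(pred, gt, head_sizes, threshold=0.5):
    """Simplified PCKh: a predicted joint is correct when its distance to the
    ground truth is below `threshold` times the head segment length.

    pred, gt:   arrays of shape (N, K, 2) with joint coordinates.
    head_sizes: array of shape (N,) with per-person head segment lengths.
    Returns the fraction of correctly localized joints.
    """
    dists = np.linalg.norm(pred - gt, axis=2)            # (N, K)
    correct = dists <= threshold * head_sizes[:, None]   # normalize per person
    return correct.mean()

if __name__ == "__main__":
    gt = np.zeros((1, 16, 2))
    pred = gt + np.array([3.0, 4.0])                     # every joint off by 5 px
    print(pckh(pred, gt, head_sizes=np.array([20.0])))   # 1.0 (5 <= 0.5 * 20)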
In our experiments, a number of learning rates and optimizers are also examined. A study of learning rate and optimizer performance on the MPII dataset is shown in Table I. The table demonstrates that the Adam optimizer, with a learning rate of 0.0001, performs more effectively than the others and yields the highest detection rate. Only the MPII dataset, trained for more than 100 epochs, is used for this analysis, and ResNet50 is employed due to hardware limitations: it trains more quickly, fits better on the available hardware, and is smaller. Thanks to the ResNet "identity shortcut connection," the model can skip one or more layers, which allows networks of hundreds of layers to be trained without a loss in performance.
Since our model has multiple stages, a number of designs have been examined as the basic framework for those stages. Various backbone models, namely ResNet versions 50, 101, and 152, HRNet-w32, and HRNet-w48, are examined. As can be seen in Table II, ResNet 101 outperforms ResNet 50, ResNet 152 outperforms ResNet 101 very slightly, HRNet-w32 outperforms ResNet 152, and HRNet-w48 outperforms the others.
As more layers are added to an architecture, its performance steadily increases. Despite having the most parameters and the second-highest GFLOPs among the alternatives, HRNet-w48 achieves the highest accuracy. Therefore, HRNet-w48 is the recommended choice for the backbone architecture of the proposed model, owing to its exceptional accuracy and its ability to identify vital obscured locations.

We also investigated the improved HRNet-w48 network and tested several configurations for fine-tuning. Model II achieves the highest mAP of 90.4%, according to Table III. Table IV shows that performance increases with the number of stages in the multi-stage estimation process: the fifth-stage model improves performance from 88.0% to 90.4% compared to the first stage, demonstrating the effectiveness of the multi-stage refinement.
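As a hedged illustration of the fine-tuning configurations of Table III, the sketch below freezes every backbone parameter whose name does not match one of the stages selected for training; the stage names are placeholders, since the paper does not give the parameter names of its HRNet-w48 implementation.

import torch.nn as nn

def freeze_except(model: nn.Module, trainable_keywords):
    """Freeze every parameter whose name contains none of the given keywords.

    trainable_keywords: e.g. ("stage4",) for Model I (train only the last
    stage) or ("stage3", "stage4") for Model II (train the last two stages);
    these names are placeholders for the real module names.
    """
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
        if param.requires_grad:
            trainable += param.numel()
    return trainable  # number of trainable parameters, as reported in Table III

if __name__ == "__main__":
    # Tiny stand-in model with "stages" in its parameter names.
    toy = nn.Sequential()
    toy.add_module("stage3", nn.Conv2d(8, 8, 3, padding=1))
    toy.add_module("stage4", nn.Conv2d(8, 8, 3, padding=1))
    print(freeze_except(toy, ("stage4",)))  # trains only the last stage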
In our method for estimating poses, the Branch 1 architecture has been improved by incorporating an occlusion net, and the purpose here is to investigate the effect of recognizing occluded key points. To detect occluded joints, reliable ground-truth annotations are crucial for the network. Table V shows the difference between including and excluding the occlusion net: the method performs considerably worse without it, whereas the proposed method achieves exceptional accuracy when the occlusion net is used. Although the proposed method without the occlusion net requires fewer parameters and GFLOPs, it has lower accuracy and weaker detection ability on occluded key points. Fig. 4 visualizes the comparison between the proposed method with and without the occlusion net. Notably, the results show that the proposed method effectively detects both occluded and visible key points when trained with the occlusion net.

Table VI shows how the proposed approach for 2D human posture estimation compares to previous 2D methods. The data indicate that our work improves significantly upon prior research. To attain these outcomes, a pre-trained transfer model is employed to extract crucial features, and the parallel branch structure of the architecture further increases its effectiveness.

TABLE III: Different arrangements of the optimized models.

Models  Backbone Network & Trainable Layers  Trainable Parameters  GFLOPs  mAP
Model I  Freeze HRNet-w48 layers except the last stage  1,536,016  3.19  88.7
Model II  Freeze HRNet-w48 layers except the last 2 stages  17,037,328  11.34  90.4

TABLE IV: A comparison of multi-stage branch networks for HRNet-w48 training on the MPII dataset.

Total stages 1 2 3 4 5
mAP 88.0 89.1 89.5 90.0 90.4
TABLE V: The impact of adding or not adding Occlusion net to branch extension.
Occlusion Net. Head Shoul. Elbow Wrist Hip Knee Ankle Parameters GFLOPs Mean
Excluded 95.2 94.3 86.6 81.5 86.0 81.2 75.3 8.19M 9.07 86.9
Included 97.9 95.6 90.5 85.8 89.2 85.1 80.8 17.04M 11.34 90.4
TABLE VI: Comparing performance with other models on MPII’s test set using the PCKh@0.5.
Methods Head Shoul. Elbow Wrist Hip Knee Ank. Mean
Tompson et al. 95.8 90.3 80.5 74.3 77.6 69.7 62.8 79.6
Carreira et al. 95.7 91.7 81.7 72.4 82.8 73.2 66.4 81.3
Tompson et al. 96.1 91.9 83.9 77.8 80.9 72.3 64.8 82
Pishchulin et al. 94.1 90.2 83.4 77.3 82.6 75.7 68.6 82.4
Wei et al. 97.8 95 88.7 84.0 88.4 82.8 79.4 88.5
Newell et al. 97.6 95.4 90.0 85.2 88.7 85 80.6 89.4
Chen et al. 68.7 77.6 63.2 90.2 68.2 58.6 83.2 65.0
Our Model 97.9 95.6 90.5 85.8 89.2 85.1 80.8 90.4