You are on page 1of 7

Human Pose Estimation Using Convolutional Neural Networks

Anubhav Singh1, Shruti Agarwal2, Preeti Nagrath3, Anmol Saxena4, Narina Thakur5
1,3,4,5
Computer Science Department, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
2
Electronics and Communication Department, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
1
anubhav.singh2359@gmail.com, 2agarwalshruti2801@gmail.com
3
preeti.nagrath@bharatividyapeeth.edu, 4anmsaxena123@gmail.com, 5narinat@gmail.com

Abstract: Human pose estimation has always been a as firstly there's no compelling reason to design feature
challenging problem that holds great attention, it has the representations and identifiers for parts because the model and
widespread and extensive variety of uses from the its attributes are been demonstrated from the data itself.
classification of images to activity acknowledgment, main Secondly, the model that has been learned is comprehensive,
challenge is the detection and localization of the key points in where the last joint prediction is based on a complex non-
the variation of several body poses. To resolve this issue, linear change of the entire image, as restricted to local
substantial research work have been done in this area. This identifiers whose thinking is a solitary part and can just model
paper discusses the issues in human pose estimation and a little subset of between activities and between body parts.
gives the overview of considerable research work in pose There are some difficulties in the prediction of human pose
estimation, including deep learning approach and customary coordinates: the foreshortening of appendages, impediment of
image-based techniques. After analyzing several results and appendages, turn and introduction of the figure, and cover of
detecting the restrictions, the author has reconstructed a different subjects. For example, there are various cumbersome
simple model using convolutional neural network that poses to annotate: rotation, foreshortening, occlusion, and
estimates the poses and demonstrates the potential of CNN's. various figures this fluctuation in the info frame proposes that
The author concludes with a few promising bearings and the entire thinking given by CNN might be a great system. In
directions that have to be explored for future research. this venture, researchers investigate diverse CNN models for
demonstrating human posture estimation and activity
I. INTRODUCTION classification[2].
Human pose estimation holds extraordinary potential from
Pose estimation has ample uses and has numerous utilizations,
single, 2D pictures to aid an extensive variety of uses from the
response to the body to augmented reality to situation related
classification of images and recordings, activity
support, liveliness, wellness uses, and that's just the beginning.
acknowledgment, active investigation, and grabbed great
The authors in this paper believe the availability of this model
attention in computer vision and human PC interaction.
motivates more designers and producers to try and apply
However, human posture estimation has always been a
present recognition to their very own unique projects. While
challenging problem that acquires great attention. It involves
many substitute posture location frameworks have been
huge difficulty for the identification and localization of key
publicly released, all require specific equipment or potentially
points of the body that mainly includes various joints and body
cameras, and also a lot of framework setup. this human pose
movement forecast and also shares difficulties in detection, for
estimation can be utilized to gauge either a single pose or
example in clustering, lighting, perspective, and scale, are the
multiple poses, which means there is a form of the algorithm
significant troubles interesting to human postures[1].
that can recognize just a single individual in a picture and one
The detection of body key points has been a great problem due form that can identify numerous people in a picture.
to little joints, impediments, and the need to catch content.
Therefore for the advancement and solution for this
Hence convolutional neural networks have a remarkable
challenging problem MPII Human Pose dataset has been used
approach to image classification and object identification
for analyzing human postures. It involves around 25k images
issues. Fundamentally they are similar to conventional neural
of 40k people having various illustrated body joints .Every
systems that are comprised of neurons with predefined weights
picture was collected from YouTube videos and has former
and predispositions. However, neural systems do not scale well
and un-illustrated frames involved 3D torso and head
to larger pictures. The authors rapidly create countlessly and
movements.
wind up overfitting on the preparation set as every neuron in a
layer is completely associated with every one of the neurons in Therefore the proposed benchmark “MPII Human Posture”
the past layer. CNN took advantage of the fact that comprises fundamentally progresses in terms of appearance variability
mainly of pictures, so they constrain the engineering in a more and complexity, and incorporates in excess of 40,000 pictures
sensible manner which incomprehensibly lessens the number of individuals or full-body human posture estimation.
of parameters. CNN is engaging for human posture estimation

978-1-5386-9346-9/19/$31.00 ©2019 IEEE


While the methodology depends on convolutional systems to learn part locations and their relationship through two parts
(convnets), testing dataset, and furthermore an investigation of of the same consecutive forecast process.
what is required to make convnets work in estimation of
human pose. The model inputs a color image having size w*h This PAF approach present us a productive strategy for multi-
and outputs 2D positions of key points for each person present individual posture estimation having a high correct nesses in a
in the picture itself. The model has layers used for creation of few open benchmarks.
input image has two branch multistage CNN where first branch
forecast about the confidence maps of various body parts Convolutional Neural Networks (CNN) methodology is used
localization like elbow,knee etc. and second branch tells about for the estimation of 2D human postures from a single picture.
2D vector fields(L) and tells about the degree of correlation Recently, for the assessment of human postures, many
between parts and therefore produce the 2D key points for all techniques and procedures have been created that utilize
members present in the picture.The Multi-Person postures from physiologically propelled graphical models.
Dataset(MPII) is learned specialized model and has 15 outputs Arrangement of image patches takes as input in CNN and
points. determine the correct location of the joints in the picture
utilizing CNN approach, achieving both joint recognizable
Specifically, presenting a two-arrange sifting method whereby proof which decides whether an image comprises freeze body
the reaction maps of convnet part identifiers are denoised by a joint and joint constraint finds the right region of joint in the
additional procedure educated by the division pecking order image plot. Finally, these joint recognition/limitation results
.The proposed system is tried in the comparing parcels of the are combined for surveying the posture.
professionally presented benchmarks (i.e. in pictures that have
not been utilized for preparing), and the outcomes are Dual-Source Deep Convolutional Neural Network(DS-CNN)
contrasted with the first system, in this manner approving the [5] integrates both local(body) part appearance and holistic
assumption that profundity data helps to assess the human view of every local part to get a better human pose estimation.
body's joints. The exploratory approval demonstrates that it is Basically, in this paper the author is taking set of image
conceivable to exploit the halfway relative-depth patches as the input and by using holistic view learns the
estimation[3]. The recognition of human poses have appearances of each local part. By DS-CNN they are achieving
tremendous applications in various fields and do wonders in both joint detection and joint localization and then estimates
several areas, by recognizing the human activities through the human pose by combining the both.
poses the world would be more safer, healthier and prove as a
boon to industries in robotics by making them learn and The researcher [6] uses Convolution Neural Network for
familiar with human poses, in health and wellness by making human posture estimation where primary responsibility is a
any smart gadget into a scanners, and also recognizing the CNN fell structure especially planned for adjusting part
medical emergencies, helpful in recognizing pedestrian in associations and spatial contexts. The initial segment performs
autonomous vehicles and preventing several serious and death part identification heatmaps and second part performs
prone accidents, helpful in security purposes while tracking regression on these heatmaps managing the framework where
and identifying human poses, gestures while recognizing to focus in the image and enough encodes part requirements
doubtful activity and demanding immediate actions. and context. Likewise, author shows that the planned course is
sufficiently adaptable to expeditiously allow the blend of
II. PROBLEM STATEMENT
various CNN structures for both acknowledgmentand relapse.
The problem consists of human pose estimation. For
estimation of human postures, the network takes a raw image The authors [7,8] in their paper proposed named Location
as input and a vector of coordinates of the body key points as based or relapse based[9,10,11] verbalized human posture
outputs. The purpose is to identify x-y pixel coordinates for 15 estimation. The Location constructed techniques are depending
body joints. By training the regression CNN that compensate with respect to intense CNN-based part identifiers which are
and decrease the loss. then solidified using a graphical model. Relapse-based
strategies endeavor to take in a mapping from the picture and
III. THEORETICAL FRAMEWORK CNN features to part areas.
Part Affinity Fields(PAFs)[4] is a non-parametric portrayal
approach that is used to learn about various body parts of The authors in their paper uses CNN [12] which is trained for
individuals in the images distinguishing the 2D postures and working on scene marking, by characterizing a multi-
more effectively incorporating numerous individuals in an classification characterization for every pixel. Rather, they
image. This global context allows a greedy bottom-up describe location over sliding windows in the image. Since,
methodology keeping up a high accuracy independent of they allow their revelation over each window to contain
multiple individuals in an image. This engineering is intended various body parts, where every identification task is basically
a paired characterization job in a window. Network

947
optimization is been done with L2 loss without seeing the Correspondingly, prepared well trained convolutional systems
effect on outliers on the training process which would be for human posture estimation. As an alternative of growing the
affected by outliers. quantity of stages for enhancement, authors investigate how to
enhance the execution of a monocular regression system
The author has used a regression network with ConvNets that showing add-on task [19].
attains robustness against these outliers by reducing Tukey’s
bi-weight function an m-estimator robust to outliers. The IV. FORMULATED WORK
author illustrates quicker combination with improved
The proposed work objective is posed estimation. For the
speculation of the robust loss function for the estimating the
former task, the authors aim to identify x-y pixel coordinates
posesfrom images of faces.
for 14 body joints depicted in the Figure 1. For the latter, they
A significant Convolution neural system is utilized in aim to label the images based on the activity category and to
heterogeneous performing various tasks learning structure for estimate the poses using Convolutional Neural Network
human posture estimation from single images. This framework (CNN). This paper proposes the regression approach as a
takes in a posture joint regression and a sliding-window body- solution to human posture estimation problem that can be
part indicator in a deep organize design demonstrating that designed as a hereditary convolutional neural network. The
including the body-part location regularizes the system CNN outputs the pixel coordinate of each body key joints by
producing better solutions. The researcher reports a taking as input a full raw image of (96*96pixels).
competitive and state-of-arts results in manydatasets.[13]
Since 1980s this field of research has been considered
The authors in their paper, utilized a heterogeneous multi-task important by computer science communities because of it
regression task method [14] and the arrangement assignment having a strong point in giving modified support to other
that urged to have a similar sparsity design. They discovered applications. It has range of implementations from medicine,
that joint-preparing patterns to locate the most valuable human-computer interaction to sociology.
element in the contribution for the two assignments to have a
V. RESEARCH METHODOLOGY
similar component layer, which brings about learning shared
portrayal that is useful for both thetasks. In pose estimation, the authors identify the position and
orientation of an object mean and detect keypoints locations
The author also illustrated that deep neural system[15] forecast that describe the object. This paper will discuss the human
and identifies the location of the various body parts like head pose estimation where major parts/joints are detected and
and hands just by knowing the close by neighbors with learned localized in the body analyzing the state of the art in
lodge features by taking a pose-sensitive embedding with non- articulated human pose estimation using a new large-scale
linear NCA (neighborhood components analysis) regression. benchmark dataset.
Instead, likewise, from regression network systems the work
can directly yield the joint locations from the accessory The dataset includes 40,000 images collected through an
undertakings learning shared"posehighlights". recognized catalogue of human activities. The collected images
cover 410 human activities in total. Pictures are analyzed on
A multi-stage framework with the deep convolutional the basis of annotations including body pose, body part
network[16] is worked in this paper for foreseeing facial point occlusion, torso, and head viewpoint, and personal activity The
areas. They use an arrangement of neural frameworks that architecture takes an input of w*h size and producing an
emphasis on various areas of the input picture. output which includes of 2D locations of anatomically key
points for every person in the image.
The researchers estimates different CNN structures[17] with
regard to hand shape, joint visibility and articulation The network predicts a set of locations and a set of confidence
distributions which include isolated 3D hand pose estimation maps S which has body part locations and degree association
targeting low mean errors, 3D volumetric representations between parts in the form of 2D vector fields. Where there are
outperform 2D CNN capturing the spatial structure of the J confidence maps in set S and C vector fields in L.
depth data.
The procedure involves feeding of images into feedforward
The authors describes Human Mesh Recovery(HMR)[18] network which predicting a set of locations with a set 2D
which is an end to end framework for rebuilding a full 3D vector fields having degree association between parts. Where S
mesh of human body form an RGB image. This paper consists of J confidence maps and set L has C vector fields.
computes a much richer depiction that is parameterized by Beige color represents the confidence map and lowers branch
shape and 3D joint angles running in a real-time with person in represented as blue affinity fields.
bounding box.

948
MPII dataset outputs 15 points which are :

Fig. 1. MPII Dataset Images[20]

Figure 1 represent various human activities includes in the dataset

TABLE 1: Various joints and their keys

Joint Head Nec Rt. Rt. Rt. Lt. Lt. Lt. Rt. Rt. Rt. Lt. Lt. Lt. Chest Backgroun
k Shoul elbo wrist Shou Elbo Wris Hip Knee Ankl Hip Knee Ank d
der w lder w t e le

Key 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Link to download dataset is in [20].


The model inputs a color image of size w × h outputs the 2D locations of keypoints for every person in the picture. The
recognition procedure is in three stages:

Fig. 2. Architecture of the two-branch multistage CNN. After every stage the
predictions of previous branches and image feature are concatenated .[21]

949
In Stage 0 feature maps for the input image is created using the as the aim is to identify x-y pixel coordinates for 14 body
first 10 layers of the VGGNet as shown in Figure 2. joints depicted in the figures above. For the latter, aim is to
label the images based on the activity category and to estimate
The starting 10 layers of the VGGNet would create a feature the poses using Convolutional Neural Network
map. Then stage 1 consists of dual branches, the first branch (CNN).Formulating the human posture and estimating the
predicts a set of 2D confidence maps(S) of the location of body problem as a regression problem that can be designed as a
parts. Whereas, confidence maps of affinities encode the hereditary convolutional neural network. The CNN outputs the
degree of association between parts. Then stage 2 both the pixel coordinate of each body key joints by taking as input a
maps i.e., greedy inference parse confidence and affinity maps full raw image of (96*96 pixels).
to produce the 2D key points for all people in the image where
each stage where every subsequent stage concatenation of Where number of training examples are N, D is the train-set,k
predictions from both branches in the previous stage and the is the number of body key points, and wi is the weights learned
original image is performed. for the ith body keypoint.

VI. PROPOSED WORK VII. RESULT AND ANALYSIS


Backpropagation is being used optimizing weights w, then Figure 4 represents some results containing viewpoint and
performing mini-batch gradient descent on a batch of 128. appearance artifacts from model described in Figure 2.
Figure 3 shows the regression architecture for the keypoint
location estimation where the final layer outputs a 2D Whereas Figure 5 shows the body points that are been
dimensional vector representing the x and y coordinate of all predicted.
14 points.

Fig. 4. Results containing viewpoint and appearance variation,


occlusion, crowding, contact, and other common imaging
Fig. 3. Regression CNN architecture for keypoint location artifacts.
estimation.

950
Table 2 below shows the earlier state of the art significantly their ease and instinctive nature, and their lessened number of
exceeding previous state results on the MPII multi-person a parameter when diverged from the totally related model.
benchmark. It has some qualitative results.
CNN has the advantage of being a model that accepts the
TABLE 2: (a): Evaluation result on the testing subset. whole picture as an info motion for each body points rather
(b): Evaluation result on the whole testing subset. than nearby locators where they are obliged to a single part.
The subset of 288 images
The authors demonstrated that human posture estimation can
Method Hea Sho Elb Wri Hip Kn Ankl s/im be thrown a regression issue and demonstrated with a non-
e e age exclusive CNN. Their use of CNN to the issue of human
posture estimation accomplishes aggressive outcomes on a
Deepcut[ 73.4 71.8 57.9 39. 56. 44. 32.0 579 testing scholarly dataset with a basic model. They guess that
22] 9 7 0 95 they would have the capacity to accomplish shockingly better
outcomes with more processor power and space (the depth of
Iqbal et 70.0 65.2 56.4 46. 52. 47. 44.5 10
al.[23] 1 7 9 their relapse CNN was limited by the RAM of the GPU it was
prepared on.
Deeper 87.9 84.0 71.9 63. 68. 63. 58.1 230
Cut[24] 9 8 8 To diminish the hole among preparing and approval execution
additionally calibrate the hyperparameters of their
demonstrate. Interesting points incorporate altering the base
Full Testing Set learning rate and learning rate approach, trying distinctive
kinds of momentum updates, and tuning the regularization
Method Hea Sho Elb Wri Hip Kne Ank s/im
quality. Besides, display gatherings can be utilized to
age increment performance. They might likewise want to try
different things with a blend of joint estimation and action
Deepcut 73.4 72.5 60.2 51.0 57. 52.0 45.4 48.5 characterization undertakings to check whether knowing the
[24] 2 areas of joints in a picture enhances the air conditioner activity
arrangement execution of a CNN. They would first run the info
Iqbal et 70.0 53.9 44.5 35.0 42. 36.7 31.1 10 picture through the joint estimation relapse model to get the
al.[23] 2 human posture data and utilize this as an optional contribution
(notwithstanding the first info picture) into the characterization
demonstrate. This extra data may enable their model to decide
the movement is performed in the picture. Nonetheless, it is
conceivable that 2D pixel coordinates won't give adequately
valuable data regarding the genuine 3D posture estimation.

REFERENCES
[1] Ντίνου, Ι. (2018). Human pose estimation using convolutional
neural networks (Doctoral dissertation).
[2] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2016). Real-time
multi-person 2d pose estimation using part affinity fields. arXiv
preprint arXiv:1611.08050.
[3] Jain, A., Tompson, J., Andriluka, M., Taylor, G. W., &Bregler,
C. (2013). Learning human pose estimation features with
convolutional networks. arXiv preprint arXiv:1312.7302.
[4] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2016). Real-time
multi-person 2d pose estimation using part affinity fields. arXiv
preprint arXiv:1611.08050.
[5] Fan, X., Zheng, K., Lin, Y., & Wang, S. (2015). Combining
local appearance and holistic view: Dual-source deep neural
networks for human pose estimation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition
Fig. 5. 14 body points that aim to identify (pp. 1347-1355).
[6] Bulat, A., &Tzimiropoulos, G. (2016, October). Human pose
VIII. CONCLUSION AND FUTURE WORK estimation via convolutional part heatmap regression. In
European Conference on Computer Vision (pp. 717-732).
If the authors consider PC vision assignments convolutional Springer, Cham.
neural networks system are most demanding design due to

951
[7] Chen, X., Yuille, A.L.: Articulated pose estimation by a [16] Y. Sun, X. Wang, and X. Tang. Deep convolutional network
graphical model with image dependent pairwise relations. In: cascade for facial point detection. In CVPR, 2013
NIPS. (2014) [17] Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G., Yong
[8] Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of Chang, J., Mu Lee, K., ... & Yuan, J. (2018). Depth-based 3d
a convolutional network and a graphical model for human pose hand pose estimation: From current achievements to future
estimation. In: NIPS. (2014) goals. In The IEEE Conference on Computer Vision and Pattern
[9] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via Recognition (CVPR).
deep neural networks. In: CVPR. (2014) [18] Kanazawa, A., Black, M. J., Jacobs, D. W., & Malik, J. (2018,
[10] Pfister, T., Simonyan, K., Charles, J., & Zisserman, A. (2014, June). End-to-end recovery of human shape and pose. In The
November). Deep convolutional neural networks for efficient IEEE Conference on Computer Vision and Pattern Recognition
pose estimation in gesture videos. In Asian Conference on (CVPR).
Computer Vision (pp. 538-552). Springer, Cham. [19] A. Toshev and C. Szegedy. Deeppose: Human pose
[11] Pfister, T., Charles, J., Zisserman, A.: Flowing convnets for estimation via deep neural networks. In CVPR, 2014.
human pose estimation in videos. In: ICCV. (2015) [20] http://human-pose.mpi-inf.mpg.de/
[12] Belagiannis, V., Rupprecht, C., Carneiro, G., &Navab, N. [21] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2016).
(2015). Robust optimization for deep regression. In Proceedings Realtime multi-person 2d pose estimation using part affinity
of the IEEE International Conference on Computer Vision (pp. fields. arXiv preprint arXiv:1611.08050.
2830-2838). [22] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M.
[13] Li, S., Liu, Z. Q., & Chan, A. B. (2014). Heterogeneous multi- Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset
task learning for human pose estimation with the deep partition and labeling for multi person pose estimation. In
convolutional neural network. In Proceedings of the IEEE CVPR, 2016.
conference on computer vision and pattern recognition [23] U. Iqbal and J. Gall. Multi-person pose estimation with local
workshops (pp. 482-489). joint-to-person associations. In ECCV Workshops, Crowd
[14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning Understanding, 2016.
hierarchical features for scene labeling.IEEE TPAMI, 2013 [24] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and
[15] G. W. Taylor, R. Fergus, G. Williams, I. Spiro, and C. Bregler. B. Schiele. Deepercut: A deeper, stronger, and faster
Pose-sensitive embedding by nonlinear nca regression. In NIPS, multiperson pose estimation model. In ECCV, 2016.
pages 2280–2288, 2010

952

You might also like