
Multimed Tools Appl

https://doi.org/10.1007/s11042-018-5839-2

Integral customer pose estimation using body orientation and visibility mask

Jingwen Liu 1 & Yanlei Gu 2 & Shunsuke Kamijo 2

Received: 6 April 2017 / Revised: 12 January 2018 / Accepted: 26 February 2018


© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract The analysis of customer pose is one of the most important topics for marketing.
Using customer pose information, retailers can evaluate the customer's interest level in the
merchandise. However, pose estimation is not easy because of the problems of occlusion and
left-right similarity. To address these two problems, we propose an Integral Pose Network
(IntePoseNet) which incorporates the body orientation and visibility mask. Firstly, benefiting
from the simple gaits in a retail store, the body orientation gives global information about the
pose configuration. For example, if a person is facing right, the body orientation indicates the
occlusion of his or her left body. Similarly, if a person is facing the camera, his or her right
shoulder is probably on the left side of the image. The body orientation is fused with local joint
connections by a set of novel Orientational Message Passing (OMP) layers in a Deep Neural
Network. Secondly, the visibility mask models the occlusion state of each joint. It is tightly
related to the body orientation because the body orientation is the main cause of self-occlusion.
In addition, occluding object detection (e.g. the shopping basket in the retail store environment)
can also give clues for visibility mask prediction. Furthermore, the system models temporal
consistency by introducing optical flow and a Bi-directional Recurrent Neural Network.
Therefore, the global body orientation, local joint connections, customer motion, occluding
objects and temporal consistency are integrally considered in our system. Finally, we conduct a
series of comparison experiments to show the effectiveness of our system.

* Jingwen Liu
ljwqq0@kmj.iis.u-tokyo.ac.jp

Yanlei Gu
guyanlei@kmj.iis.u-tokyo.ac.jp
Shunsuke Kamijo
kamijo@iis.u-tokyo.ac.jp

1 Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
2 Institute of Industrial Science, The University of Tokyo, Tokyo, Japan

Keywords Pose estimation · Body orientation · Visibility mask · Deep neural network · Surveillance camera

1 Introduction

The analysis of customer pose is commercially very important for retail stores. Traditionally,
retailers use the records of cash registers or credit cards to analyze the shopping results of
customers. However, the shopping process of the customer inside the store remains a black box.
In practice, customer poses such as standing, stretching an arm or bending over can reveal
customer habits and the interest level in the merchandise. In order to extract this information,
we need to estimate customer poses from the surveillance camera. Therefore, customer pose
estimation from surveillance cameras draws more and more researchers' attention.
However, 2D human pose estimation is not an easy task in computer vision. One of the
difficulties is the occlusion problem, including both self-occlusion and inter-occlusion by other
objects. The main cause of self-occlusion is the body orientation. For example, if a standing
person is facing right in the camera view, his or her left body is probably occluded. On the other
hand, inter-occlusions may be caused by arbitrary objects. In the retail store environment,
inter-occlusions are typically caused by shopping baskets. Figure 1 shows examples of
self-occlusion and inter-occlusion by a shopping basket. Another difficulty of pose
estimation is the left-right similarity problem caused by the symmetry of the human body. For
example, the left shoulder of a person in the back view is very similar to the right shoulder in
Fig. 1 The examples of occlusions in the retail store. a The left body part is self-occluded. b The right wrist, left
hip and right hip are occluded by the shopping basket

front view. In order to address these problems, we introduce the body orientation and a visibility
mask of each joint.
The body orientation is a kind of global information about the pose configuration. The body
orientation is defined in 8 directions as in Fig. 2. The body orientation is useful for solving both
the self-occlusion problem and the left-right similarity problem. For example, if we know a person is
facing right, the body orientation indicates the occlusion of his or her left body. Similarly, if
we know a person is facing the camera, his or her right shoulder is probably on the left side
of the image. To this end, we propose novel Orientational Message Passing (OMP) layers on
top of a Fully Convolutional Network (FCN) to fuse the global body orientation and local joint
connections. These layers can be seen as an extension of the Deformable Parts Model on the FCN
structure [8, 46]. Note that the orientation we define in Fig. 2 is only applicable to the
simple poses seen in the store, such as standing, walking, bending over or squatting down. For more
complex poses, such as dancing, parkour or gymnastics, it is difficult to assign a body
orientation within the definition of 8 directions.
The visibility mask is a binary vector whose elements indicate the visibility of each joint.
The term visibility mask was introduced by Haque et al. [17]. However, they only applied the
visibility mask to top-view images for self-occlusions. We extend the application of the
visibility mask to more flexible view angles and to both self-occlusion and inter-occlusion
by objects. Other approaches, such as [29, 32], also used a binary vector to represent
the occlusion. Although these approaches modeled the occlusions in different ways, they did
not consider the relationship between joint occlusion and body orientation. We argue that
the body orientation is an indispensable part of the occlusion model. Besides solving the left-right

Fig. 2 The body orientation is divided into 8 directions by every 45 degrees (labels 1 through 8 correspond to 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315° respectively)

similarity problem, our proposed OMP layers can also pass different occlusion information
according to different body orientations. Additionally, we incorporate occluding object
detection into our system. It provides an obvious clue of inter-occlusion for visibility mask prediction.
From our observation, the occluding objects are mainly shopping baskets in retail stores.
(For a shopping cart, the main occluding part is also the basket on the cart.)
Furthermore, we also consider the temporal consistency of customer pose. The system
includes optical flow information to improve the estimation accuracy of highly variable joints
such as the wrists. We also employ a Bidirectional Recurrent Neural Network (BRNN) [34] on top of
the network to improve the accuracy by utilizing both the forward and backward sequences.
The contributions of our work can be summarized as threefold. First, we introduce the
body orientation in customer pose estimation in order to solve the left-right similarity problem
and indicate the self-occlusion. The body orientation is fused with local pair-wise joint connections by novel
Orientational Message Passing (OMP) layers on the FCN. Second, we model both self-
occlusion and inter-occlusion by the visibility mask. The visibility mask is also tightly related
to the body orientation. Third, in this system, the global body orientation, local joint connections,
customer motion, occluding objects and temporal consistency are all integrally considered.
Therefore, we name this system the Integral Pose Network (IntePoseNet).

2 Related works

To estimate the human pose, researchers have proposed many approaches based on different
kinds of information. One of these approaches was using global silhouette information to estimate
the pose [35, 38, 42]. Other approaches analyzed the parts of the human body to establish the
whole pose. For example, Pictorial Structures [10, 14, 37] detected the positions of body parts.
Deformable Part Models (DPM) [13, 26, 45] modeled the connections between neighboring
joints. Alternatively, Pishchulin et al. [31] introduced a poselet detector into the Pictorial
Structure to establish a more flexible model. These methods were all based on handcrafted
features, such as the Histogram of Oriented Gradients (HOG) [9].
In recent years, Convolutional Neural Networks (CNN) [24, 25] achieved high performance
in different computer vision tasks. For example, in application fields, CNNs were
adopted to estimate popularity [48] or venue category [7] based on micro-videos. In the pose
estimation task, a pioneering work was proposed by Toshev et al. [41]. They proposed
DeepPose to estimate human pose with multiple cascaded CNNs. The approaches of pose
estimation using CNNs can be roughly divided into two groups. The first group directly
outputs the joint positions [4, 17, 41]. However, it is difficult to optimize the results of this
kind of method because there is only one prediction per image. The second group of
methods produces probabilistic heatmaps of joints to increase the number of candidates. In order to
generate the heatmaps, [5, 32] employed joint classification with a sliding CNN. These
methods were very time-consuming in both training and testing because of the sliding CNN.
To reduce the processing time, another kind of method feeds the whole image to a Fully
Convolutional Network (FCN) and directly generates the joint heatmaps [39, 40].
Based on the joint heatmaps, more flexible optimizations can be applied, such as the Deformable
Parts Model (DPM) [15]. On top of an FCN, [8, 46] proposed different
implementations of the two-stream Message Passing (MP) algorithm in DPM. In this paper,
we extend the MP algorithm in [8] to OMP, which combines the body orientation and the pair-
wise joint model.

To deal with the occlusion problem, one kind of approach learns the occlusion pattern
of the occluded joints [6, 11]. Another kind of method uses a binary mask to model the self-
occlusion [17, 29] and the inter-occlusion by objects [32]. Although [17] introduced the term
'visibility mask', that paper only used it to model occluded joints in top-view images. As
far as we know, no previous work has modeled the relationship between occlusion and body
orientation. Our paper incorporates not only the body orientation for modeling self-occlusion
but also occluding object detection for modeling inter-occlusion.
Besides the spatial information above, temporal information can also be utilized in pose
estimation. Jain et al. [20] introduced optical flow [43] as an extra input to increase the
estimation accuracy. Inspired by their work, our system also adopts optical flow. Other
approaches employed a Recurrent Neural Network (RNN) to maintain the temporal consistency
of pose [1, 16]. Pfister et al. [30] proposed a temporal pooling method to refine the pose
estimation by combining both past and future heatmaps. In order to use both the forward and
backward sequences, our system applies a BRNN [34] on top of the network.

3 Integral pose network

3.1 Framework

The framework of the IntePoseNet is shown in Fig. 3. At the input end, taking advantage of
the surveillance camera, the background of the video can be easily subtracted by using the
Multi-Layer Background Subtraction algorithm [47].
Fig. 3 The framework of Integral Pose Network (IntePoseNet)



After the background subtraction, there are 3 streams.
In the first stream, the image is fed to an FCN to generate the initial joint heatmaps. Based on
these heatmaps, we construct a set of OMP layers to incorporate the body orientation with the
pair-wise joint model. The body orientation is obtained from the Semi-Supervised Learning
algorithm in our previous work [27]. It is a vector containing the probabilities of the 8 body
orientations. The second stream is the occluding object detection. In the retail store, the main
occluding object is the shopping basket. The basket heatmap is generated by the ResNet [18].
Because the ResNet is trained on the ImageNet dataset [33], it can classify the 'shopping basket'
object. Thus we can apply a sliding window method with the ResNet to generate the basket
heatmap. In the third stream, the optical flow is calculated by the FlowNet [12] using two
consecutive images. Then the optical flow, basket heatmap and the output of the OMP layers are
combined with 3 convolution + Rectified Linear Unit (ReLU) layers. On top of the
network, a BRNN is employed to produce the refined heatmaps by optimizing the temporal
consistency in both forward and backward sequences. Based on the output joint heatmaps, the
joint positions and visibility mask can be derived. The joint position is calculated as the
centroid of each heatmap. The visibility mask is a binary vector which indicates the visibility
of each joint. For one joint, its visibility is determined by the maximum value of its heatmap. If
the maximum value is larger than a threshold, the corresponding visibility value is 1.
Otherwise, the visibility value is 0.
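The decoding step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name decode_heatmaps, the array shapes and the use of NumPy are our assumptions; only the centroid rule and the visibility threshold come from the text.

import numpy as np

def decode_heatmaps(heatmaps, vis_threshold=0.5):
    """Derive joint positions and the visibility mask from output heatmaps.

    heatmaps: (14, H, W) array of per-joint heatmaps.
    Returns positions (14, 2) as heatmap centroids and a binary visibility
    mask; a joint is visible if its heatmap peak exceeds the threshold.
    """
    n_joints, h, w = heatmaps.shape
    positions = np.zeros((n_joints, 2))
    visibility = np.zeros(n_joints, dtype=int)
    ys, xs = np.mgrid[0:h, 0:w]
    for n in range(n_joints):
        hm = heatmaps[n]
        if hm.max() > vis_threshold:
            visibility[n] = 1
            positions[n] = [(hm * xs).sum() / hm.sum(),   # x centroid
                            (hm * ys).sum() / hm.sum()]   # y centroid
    return positions, visibility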
In our system, the human pose is represented by a tree structure of joints as in Fig. 4. The
structure includes 14 joints: head top, head bottom, left/right shoulders, left/right elbows, left/
right wrists, left/right hips, left/right knees and left/right ankles. Note that there are usually two
types of pose annotation: Observer-Centric (OC) annotation and Person-Centric (PC)
annotation. In the OC annotation, the limbs on the left/right side of the image are the left/right
limbs, while in the PC annotation, the left/right limbs of the person are the left/right limbs. Our
system applies the PC annotation. According to the PC annotation, the skeleton in Fig. 4 is
assumed to be in back view. Therefore, the pose is represented as a 28 × 1 vector [x1, y1, …, x14, y14]T,
where (xn, yn) is the coordinate of the nth joint. The visibility mask is a 14 × 1
vector [v1, v2, …, v14]T, where vn is the probability of visibility of the nth joint.

Fig. 4 The representation of pose in our system. The numbers beside the joints are the indices of the joints (1: head top, 2: head bottom, 3–8: left shoulder, elbow, wrist, hip, knee and ankle, 9–14: right shoulder, elbow, wrist, hip, knee and ankle)

vn = 1 means the joint is totally visible and vn = 0 means the joint is totally invisible. For the visibility
threshold, it is easy to see that if the threshold is 1.0, the accuracy of visibilities would be the
rate of occluded joints, and if the threshold is 0.0, the accuracy of visibilities would be the rate
of visible joints. In our experiments, the accuracy of visibilities reaches its maximum
when the threshold is between 0.5 and 0.55 for the different joints. Therefore, we let 0.5 be the visibility
threshold, because it is the natural mid-point between totally visible and totally occluded.
Note that although other approaches such as [17, 29, 32] used a binary mask to model the
occlusion, they did not output the mask but still output the predicted positions of the occluded
joints. We argue that the positions of occluded joints should not be output. These positions
have little practical meaning in the 2D image, because the areas at these positions actually show the
occluding objects or other body parts. Even though some open datasets, such
as the Leeds Sports Pose (LSP) dataset [22] and the MPII Human Pose Dataset [2], give ground truth
positions for the occluded joints, these are not real truth but the inference of a human
observer. Therefore, our system outputs the joint positions and the visibility mask
simultaneously. The joint position is meaningless if the corresponding visibility mask value is 0.
The output image does not show the occluded joints either, as in Figs. 1 and 3.

3.2 Joint heatmaps generation by FCN

As illustrated in Fig. 3, the customer image after background subtraction is fed to an FCN to
produce the initial joint heatmaps. In this paper, the FCN is based on the FCN-8s structure
[28]. The FCN-8s is originally used to segment 20 classes of objects. Its output is a heatmap with 21 channels
(20 objects + 1 background) and the same size as the input image. We modify the
output size and output channel number of FCN-8s for joint heatmap generation.
In our system, the size of the output heatmaps is set to 1/4 of the height and width of the input
image to reduce computation. The size of the input image of the FCN is fixed to the average size
200 × 366. Thus the size of the output heatmaps is 50 × 92. For comparison, the size of the output
heatmaps is 56 × 56 (1/8 of the input size) in [8] and 8 × 8 (1/16 of the input size) in [46]. The
channel number of the output heatmaps is N × T + 1, where N is the number of joints, T is the
number of Mixture-of-Parts types, and the additional one is the background. In our system, N = 14 and T = 13.
The Mixture-of-Parts will be explained in Section 3.3.

3.3 Mixture-of-parts

In order to model the relationship between neighboring joints, this paper applies the concept of
Mixture-of-Parts proposed by [45]. The idea of Mixture-of-Parts is to cluster each
joint into T types according to its relative position to its neighboring joint. For example, the
left shoulder in front view and in back view probably belongs to different clusters, because the
left shoulder is at the bottom right of the neck in front view and at the bottom left of the neck in back
view. This means that the type of Mixture-of-Parts gives an inference of the position of the neighboring
joint. Therefore, different types of joints can send different messages to their neighboring joints in
the message passing layers. The message passing layers will be discussed in Section 3.5.
In [8, 45, 46], the joints are clustered by the K-means algorithm. However, there are two
drawbacks of the K-means algorithm. The first one is that it only gives hard clusters. The second
one is that it tends to generate equal-sized clusters, so it does not work well on distributions
with unbalanced density. To solve the first problem, some extensions of K-means could be
used to generate soft clusters, such as the Fuzzy C-means algorithm and the Expectation

Maximization (EM) algorithm. However, the Fuzzy C-means algorithm still cannot solve the
second problem, because it is also based on distance, like K-means. In contrast,
the EM algorithm can solve the second problem by benefiting from Gaussian Mixture Models
(GMM). Therefore, we choose the EM algorithm for the joint clustering.
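A minimal sketch of such soft clustering, assuming scikit-learn's GaussianMixture as the EM/GMM implementation; the function name, T = 13 and the use of relative (x, y) offsets collected from visible training joints are our assumptions based on the description above.

import numpy as np
from sklearn.mixture import GaussianMixture

def soft_cluster_joint(rel_positions, T=13, seed=0):
    """Soft Mixture-of-Parts clustering of one joint's offsets to its parent.

    rel_positions: (num_samples, 2) relative positions collected from the
    visible training annotations.  Returns the fitted GMM and the soft
    cluster probabilities, which are later used as soft labels (Section 3.4).
    """
    gmm = GaussianMixture(n_components=T, covariance_type='full',
                          random_state=seed)
    gmm.fit(rel_positions)
    return gmm, gmm.predict_proba(rel_positions)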

3.4 Soft label and loss function

Based on the soft clusters, the labels of the training data also become soft labels, which can
produce more reasonable estimations. For each joint, the ground truth is not a position, but
heatmaps with T channels. Assume a 1 × 1 × T vector located at the joint position in the ground
truth heatmaps. The values of its T elements are given by the T probabilities of the soft clusters. At
other locations, the vector is an all-zero vector. Therefore, the overall ground truth of a training
image is a set of heatmaps with N × T + 1 channels. The additional channel is the background
channel. At the non-joint pixels, the label is a one-hot vector, where the value of the background
channel is 1. An occluded joint is treated as background if it is not overlapped by
another visible joint.
Furthermore, we extend the joint position from one pixel to a circular area, because we
consider a joint to be more like an area in the image rather than a single pixel. At the output
scale 50 × 92, we tested circle radii of 1 pixel (a single pixel), 3 pixels and 5
pixels in experiments. As a result, the Percentage of Correct Keypoints (PCK) with the 3-pixel radius
achieved the highest value. A correct keypoint is defined as a predicted joint whose
distance from the true joint is less than a given threshold. When the threshold is 0.09
times the body height, the PCK of the 3-pixel radius was 5% higher than the PCK of the 1-pixel
radius and 1% higher than the PCK of the 5-pixel radius. Therefore, we choose 3 pixels as the
radius of the joint area. If n joint circle areas overlap, the value of each
channel is divided by n, in order to keep the sum of probabilities equal to 1.
The tensor of ground truth can be seen as an extension of the pixel-wise labeling in Fig. 5. At
each pixel of our ground truth, there is a vector of length N × T + 1 along the z coordinate. In Fig. 5,
the positive integers n in the small boxes indicate the indices of visible joints. For joint n, the
values of the elements from index (n − 1) × T + 1 to index n × T are the probabilities of the T clusters.
Besides, the zeros in Fig. 5 mean the background pixels. For a background pixel, the value
of the (N × T + 1)th element is 1 and the values of the other elements are 0.
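The ground-truth construction described in this subsection can be sketched as follows. This is an illustrative sketch rather than the released code: the function name, the H × W heatmap orientation and the renormalization detail for overlapping joints are assumptions, while the 3-pixel radius, the soft cluster probabilities and the background channel follow the text.

import numpy as np

def make_ground_truth(joints, cluster_probs, visible,
                      H=92, W=50, N=14, T=13, radius=3):
    """Build the (H, W, N*T+1) ground-truth tensor with soft labels.

    joints[n] = (x, y) in heatmap coordinates, cluster_probs[n] is the
    T-vector of soft cluster probabilities, visible[n] in {0, 1}.
    Occluded joints are left as background.
    """
    gt = np.zeros((H, W, N * T + 1))
    gt[:, :, -1] = 1.0                                  # background channel
    ys, xs = np.mgrid[0:H, 0:W]
    for n in range(N):
        if not visible[n]:
            continue
        x, y = joints[n]
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        gt[mask, -1] = 0.0
        gt[mask, n * T:(n + 1) * T] += cluster_probs[n]
    # where joint areas overlap, renormalize so each pixel sums to 1
    gt /= np.clip(gt.sum(axis=2, keepdims=True), 1e-12, None)
    return gt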
For the probabilistic labels, the cross-entropy is suitable to evaluate the loss. Assume the
output heatmap is mapf. Then the probability map mapp is calculated by a softmax function as
follows, where l is a location in the map.
 
$$\mathrm{map}_p(l) = \frac{\exp\big(\mathrm{map}_f(l)\big)}{\sum_{l}\exp\big(\mathrm{map}_f(l)\big)} \qquad (1)$$

Thus the loss is calculated as follows, where c = {1, 2, …, N × T + 1} is the channel index of
the map.

$$\mathrm{loss} = -\sum_{l,c}\mathrm{label}(l,c)\,\log\big(\mathrm{map}_p(l,c)\big) \qquad (2)$$

The gradient of loss is calculated as Eq. (3).

$$e = \mathrm{map}_p - \mathrm{label} \qquad (3)$$
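A NumPy sketch of Eqs. (1)-(3), assuming the softmax is normalized over the locations of each channel exactly as Eq. (1) is written; the function name and the array layout are our assumptions.

import numpy as np

def heatmap_loss(map_f, label):
    """Cross-entropy loss and gradient over heatmaps (Eqs. 1-3).

    map_f, label: (H, W, C) arrays; the softmax of Eq. (1) is taken over
    all locations l within each channel.
    """
    H, W, C = map_f.shape
    flat = map_f.reshape(-1, C)
    flat = flat - flat.max(axis=0, keepdims=True)        # numerical stability
    map_p = (np.exp(flat) / np.exp(flat).sum(axis=0, keepdims=True)).reshape(H, W, C)
    loss = -np.sum(label * np.log(map_p + 1e-12))        # Eq. (2)
    grad = map_p - label                                 # Eq. (3)
    return loss, grad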



Fig. 5 An example of pixel-wise labeling (the head top area is labeled 1 around the manually labeled point, the occluded left elbow is labeled as background, and the right ankle area is labeled 14 around the manually labeled point). Our 3D ground truth tensor can be seen as an extension of this 2D labeling. For each pixel, there is a vector of length N × T + 1

3.5 Orientational message passing

The Orientational Message Passing layers in our system are inspired by [8]. In [8], the message
passing layers are implemented by a series of convolutions.
Generally, there are two streams in message passing: upstream and downstream,
which are shown in Fig. 6. Assume $A_i$ and $A_i'$ are the feature maps before and after the
upstream pass, and $B_i$ and $B_i'$ are the feature maps before and after the downstream pass,
where $i = \{1, 2, \ldots, 14\}$. Before message passing starts, $A_i = B_i$.

Fig. 6 The upstream and downstream of message passing

In [8], the message passing is calculated as follows,

$$A'_5 = A_5,\quad A'_8 = A_8,\quad A'_{11} = A_{11},\quad A'_{14} = A_{14} \qquad (4)$$

$$A'_{i\notin\{5,8,11,14\}} = f\Big(A_i + \sum_j A'_j \otimes w_{a_i,a_j}\Big) \qquad (5)$$

$$B'_1 = B_1 \qquad (6)$$

$$B'_{i\notin\{1\}} = f\big(B_i + B'_j \otimes w_{b_i,b_j}\big) \qquad (7)$$

$$\mathrm{score}_i = \big[A'_i,\,B'_i\big] \otimes w_s \qquad (8)$$

where $j$ is the child index of joint $i$, the operator $\otimes$ denotes convolution, $w_{a_i,a_j}$, $w_{b_i,b_j}$ and $w_s$ are
the convolution filters, $f(\cdot)$ is an activation function, $[A'_i, B'_i]$ is the concatenation of $A'_i$ and
$B'_i$, and $\mathrm{score}_i$ is the final result of message passing. Figure 7 illustrates an example process of the
message passing algorithm in [8], where the white points in the heatmaps are the predicted
positions of joints.

Fig. 7 An example process of message passing algorithm in [8]. The heatmap of right hip receives the messages
from its neighboring joints (right shoulder and right knee). The white points in the heatmaps are the predicted
positions of joints

In Fig. 7, the initial heatmap of the right hip is dispersed and the predicted
position is not accurate. After receiving the messages from its neighboring joints (right
shoulder and right knee), the heatmap of the right hip becomes more focused and the predicted
position is corrected.
In this paper, we incorporate the body orientation information into the message passing layers. As
discussed in Section 1, the orientational messages are strong clues for solving the left-
right similarity and self-occlusion problems. In order to construct the orientational message, the
relative position histogram of joints first needs to be counted from the training data.
The relative position histogram $h_p$ has $S \times S \times T \times 8$ bins, where $S$ is the size of the plane
which records the relative position to the parent joint, $T$ is the number of mixture types, and 8
is the number of orientations. The occluded joints are not counted, in order to produce a weak message
in the corresponding orientation. For each orientation, $h_p$ is normalized by the joint number and
smoothed by a 3D Gaussian filter. The calculation is shown in Eq. (9), where $h_p'$ is the
statistical relative heatmap, $g(\cdot)$ is a 3D Gaussian filter, $i = \{1, 2, \ldots, 8\}$ is the orientation label,
and $n_i$ is the number of joints in orientation $i$ including occluded ones.
 
$$h_p'(:,:,:,i) = g\!\left(\frac{h_p(:,:,:,i)}{n_i}\right) \qquad (9)$$

Then the orientational downstream filter and upstream filter can be calculated as follows,
where $o$ is the input orientation vector and the operator $\times$ denotes tensor multiplication. The size of
$w_{ou}$ and $w_{od}$ is $S \times S \times T$.

$$w_{ou} = h_p' \times o \qquad (10)$$

$$w_{od} = \mathrm{rotate}\big(h_p',\,180^{\circ}\big) \times o \qquad (11)$$

Thus the orientational messages are calculated as Eqs. (12) and (13), where $t = \{1, 2, \ldots, T\}$,
and $m_u$ and $m_d$ are the upstream and downstream orientational messages respectively. The sum-
mation over the mixture type dimension means that the orientational message is only related to the
position of the source joint, and does not consider the mixture type of the source joint.

$$m_u = \mathrm{sum}_t\big(A'(:,:,t)\big) \otimes w_{ou} \qquad (12)$$

$$m_d = \mathrm{sum}_t\big(B'(:,:,t)\big) \otimes w_{od} \qquad (13)$$

Then Eqs. (5) and (7) can be rewritten as Eqs. (14) and (15), where $\alpha$ and $\beta$ are the
relative weights of the local pair-wise message and the orientational message. Empirically, the
system achieves its best performance when $\alpha_a = \alpha_b = 1$ and $\beta_a = \beta_b = 5$.

$$A'_{i\notin\{5,8,11,14\}} = f\Big(A_i + \sum_j\big(\alpha_a\,A'_j \otimes w_{a_i,a_j} + \beta_a\,m_u\big)\Big) \qquad (14)$$

$$B'_{i\notin\{1\}} = f\big(B_i + \alpha_b\,B'_j \otimes w_{b_i,b_j} + \beta_b\,m_d\big) \qquad (15)$$

The right-hand sides of Eqs. (14) and (15) include 3 parts. The first part is the heatmap before
message passing ($A_i$ or $B_i$). The second part is the conventional pair-wise message ($A'_j \otimes w_{a_i,a_j}$
or $B'_j \otimes w_{b_i,b_j}$). The third part is the orientational message ($m_u$ or $m_d$). Figure 8 shows a detailed

example of orientational message passing, including these 3 parts, from the neck to the right
shoulder. In Fig. 8, the green point is the ground truth joint position. The statistical relative
heatmaps are counted from the training data. The probability of each body orientation is
obtained from the body orientation detection algorithm in our previous work [27]. Figure 9
illustrates the comparison of our proposed OMP with conventional message passing. In
Fig. 9, the message passing algorithm in [8] cannot correct the left-right similarity problem,
because 'the left shoulder is at the bottom left of the neck' is a totally reasonable configuration
as long as the person is in back view. Different from that, the OMP layers incorporate the
'front view' information. Therefore, the system can predict that the left shoulder should be at the
bottom right of the neck.
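The orientational filters and messages of Eqs. (10)-(13) can be sketched as below. The function name, the array shapes and the use of scipy's convolve are our assumptions; the weighting of the statistical heatmaps by the orientation probabilities, the 180° rotation for the downstream filter and the summation over mixture types follow the equations above.

import numpy as np
from scipy.ndimage import convolve

def orientational_message(feat, h_p_stat, o, upstream=True):
    """Orientational message (Eqs. 10-13), a sketch under assumed shapes.

    feat:     source-joint feature maps A' or B', shape (H, W, T)
    h_p_stat: statistical relative heatmaps h_p', shape (S, S, T, 8)
    o:        8-vector of body orientation probabilities from [27]
    """
    # Eq. (10)/(11): weight the per-orientation statistics by o
    w = np.tensordot(h_p_stat, o, axes=([3], [0]))             # (S, S, T)
    if not upstream:
        w = np.rot90(w, 2, axes=(0, 1))                         # rotate 180 degrees
    # Eq. (12)/(13): sum the source maps over mixture types, then convolve
    source = feat.sum(axis=2)                                   # (H, W)
    msg = np.stack([convolve(source, w[:, :, t], mode='constant')
                    for t in range(w.shape[2])], axis=2)        # (H, W, T)
    return msg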

3.6 Information combination of joint heatmaps, optical flow and basket heatmap

As shown in Fig. 3, the joint heatmaps, optical flow and basket heatmap are combined by 3
cascaded convolution + ReLU layers. The 3 inputs are stacked together and then fed to
the first convolutional layer. Firstly, we briefly introduce the generation of the optical flow and the
basket heatmap.
The optical flow is calculated by the FlowNet [12]. As in the flowchart in Fig. 3, the inputs of the
FlowNet are the current customer image and the previous customer image (both after
background subtraction). The output of the FlowNet is a tensor with 2 channels, which represent
the optical flow in the x direction and the y direction.

Fig. 8 A detailed example of orientational message passing from neck to right shoulder. The green point is the
ground truth joint position. The statistical relative heatmaps are counted from training data. The probability of
each body orientation is obtained from body orientation detection algorithm in our previous work [27]

Fig. 9 The heatmaps of left shoulder (a) before message passing layers (b) after the MP layers of [8] (c) after the
OMP layers. The green point is the ground truth joint position

The basket heatmap is generated by the ResNet [18] with sliding windows. We apply 16
different window sizes to generate 16 heatmaps. The value at each pixel in the heatmaps is the
ResNet output score for the object 'shopping basket'. These heatmaps represent the
probabilities of the 'center' of the shopping basket. In order to convert the basket 'center'
heatmaps into basket 'area' heatmaps, we apply an average filter to each heatmap, where the
filter size is equal to the corresponding sliding window size. Then these heatmaps are resized and
summed up into the final basket heatmap. The process is shown in Fig. 10.
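A sketch of the center-to-area conversion, assuming scipy's uniform_filter as the average filter; the function name and the list-based interface are our assumptions, while the per-window-size filtering and the final summation follow the text.

import numpy as np
from scipy.ndimage import uniform_filter

def basket_area_heatmap(center_heatmaps, window_sizes):
    """Convert basket 'center' heatmaps into one 'area' heatmap.

    center_heatmaps: list of (H, W) ResNet score maps for 'shopping basket',
    one per sliding-window size, assumed already resized to a common
    resolution.
    """
    area = np.zeros_like(center_heatmaps[0])
    for hm, size in zip(center_heatmaps, window_sizes):
        # average filter whose size equals the corresponding window size
        area += uniform_filter(hm, size=size, mode='constant')
    return area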
Then 3 cascaded convolution + ReLU layers are employed to combine these 3 inputs. The
effects of these convolutions are twofold. Firstly, they fuse the different information. Secondly,
they adapt the output channel number to be equal to the channel number of the input joint
heatmaps. Among the 3 inputs, the joint heatmaps have N × T + 1 = 183 channels, the optical flow
has 2 channels, and the basket heatmap has 1 channel. Thus the channel number of the stacked
input is 186. The filter sizes of the 3 convolutional layers are listed in Table 1. The output of
the third convolutional layer becomes 183 channels again. Figure 11 shows the effects of
adding the optical flow and basket heatmap. In the first row of Fig. 11, benefiting from the information of the

Fig. 10 The process of generating the basket heatmap (sliding windows over an image pyramid produce ResNet feature maps, which are resized and stacked, then filtered and summed up into the basket heatmap)

Table 1 The filter sizes of the 3 convolutional layers for information combination

Conv1 Conv2 Conv3

Filter size 5 × 5 × 185 × 512 3 × 3 × 512 × 256 1 × 1 × 256 × 183

optical flow, the predicted position of the left wrist is corrected to the moving part. In the second
row of Fig. 11, the basket heatmap indicates the occlusion of the left hip. Thus the resulting heatmap
of the left hip becomes 'colder'.
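As an illustration, the combination block of Table 1 could look like the following PyTorch sketch. The padding values (chosen to preserve the heatmap size) and the 186-channel stacked input (183 + 2 + 1, as in the text; Table 1 lists 185) are assumptions, and this is not the authors' code.

import torch
import torch.nn as nn

# Sketch of the 3 convolution + ReLU layers of Table 1
combine = nn.Sequential(
    nn.Conv2d(186, 512, kernel_size=5, padding=2),   # Conv1
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 256, kernel_size=3, padding=1),   # Conv2
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 183, kernel_size=1),              # Conv3
    nn.ReLU(inplace=True),
)

# stacked input: joint heatmaps (183) + optical flow (2) + basket heatmap (1)
x = torch.randn(1, 186, 92, 50)
refined = combine(x)                                 # (1, 183, 92, 50)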

3.7 Temporal modeling using BRNN

The output of the information combination is fed into a BRNN. The BRNN can be seen as two
RNNs stacked on top of each other. The structure of the BRNN is shown in Fig. 12. In Fig. 12,
sf and sb denote the hidden states of the forward RNN and backward RNN respectively, and scoret is the
output of the information combination at frame t. The output yt is calculated as follows, where Uf,
Wf, Ub, Wb and V are the learnable parameters.
$$sf_t = f\big(U_f\,\mathrm{score}_t + W_f\,sf_{t-1}\big) \qquad (16)$$

$$sb_t = f\big(U_b\,\mathrm{score}_t + W_b\,sb_{t+1}\big) \qquad (17)$$

$$y_t = V\,[sf_t,\,sb_t] \qquad (18)$$

The BRNN can be seen as a temporal message passing. The heatmap scoret at frame t
receives the forward message Wf sft−1 from frame t − 1 and the backward message Wb sbt+1
from frame t + 1. Combining both the forward and backward messages, the BRNN produces
the refined heatmap. Figure 13 shows an example process of this temporal message passing.
The white points in the heatmaps are the predicted joint positions.
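A NumPy sketch of the BRNN recurrence in Eqs. (16)-(18); the tanh activation, the zero initialization at the sequence boundaries and the flattened-heatmap input are our assumptions, while the forward/backward recursions and the concatenation follow the equations.

import numpy as np

def brnn(scores, Uf, Wf, Ub, Wb, V, f=np.tanh):
    """Single-layer BRNN over a sequence of score vectors (Eqs. 16-18)."""
    T = len(scores)
    d = Wf.shape[0]
    sf, sb = [np.zeros(d)] * T, [np.zeros(d)] * T
    for t in range(T):                                   # forward pass, Eq. (16)
        prev = sf[t - 1] if t > 0 else np.zeros(d)
        sf[t] = f(Uf @ scores[t] + Wf @ prev)
    for t in reversed(range(T)):                         # backward pass, Eq. (17)
        nxt = sb[t + 1] if t < T - 1 else np.zeros(d)
        sb[t] = f(Ub @ scores[t] + Wb @ nxt)
    # output, Eq. (18): concatenate forward and backward hidden states
    return [V @ np.concatenate([sf[t], sb[t]]) for t in range(T)]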

Fig. 11 The effects of adding the optical flow (first row, heatmap of the left wrist) and the basket heatmap (second row, heatmap of the left hip). The white points in the heatmaps are the predicted positions of joints

Fig. 12 The structure of BRNN

3.8 Visibility mask

The visibility mask is not explicitly applied in the above modules. However, it implicitly runs
through the whole model. In the ground truth labeling, the occluded joints are labeled as
background. This labeling defines the learning target for visible and occluded joints. In the
OMP layers, the occluded joints are not counted in the relative position histograms. This
means that, for some specific body orientations (e.g. 90° or 270°), the corresponding occluded
joints (e.g. the right body part or the left body part) will receive a weak orientational message
and then produce a relatively weak resulting heatmap. In the information combination part, the
basket heatmap indicates the probability of occlusion and results in the 'cooling down' of the
Fig. 13 An example process of BRNN. It can be seen as a temporal message passing between the forward and backward sequences that produces the refined heatmap. The white points in the heatmaps are the predicted joint positions

basket area. In the BRNN, the temporal transition between visibility and occlusion is also
learned by the network. Therefore, the visibility mask is a latent thread from the input to the
output. Note that these optimizations are established on the joint heatmaps. These methods are
difficult to apply in structures which directly generate the joint positions [4, 17, 41].

4 Experiment

In the experiments, we use two groups of datasets. One is our customer pose dataset augmented
by the Leeds Sports Pose (LSP) dataset [22] and the Leeds Sports Pose Extended Training (LSPET)
dataset [23]. The LSP and LSPET datasets are used to extend the human variability of the
customer pose dataset. Besides, in order to compare with other existing methods, we also
conduct experiments on the Joint-annotated Human Motion Data Base (JHMDB) [21]. This
dataset includes image sequences, which are suitable for the optical flow algorithm and the
BRNN.

4.1 Customer pose dataset augmented by LSP and LSPET

The customer pose dataset is shot by 4 surveillance cameras installed in 4
different retail stores, as shown in Fig. 14. The surveillance videos are selected from the surveillance
camera near the merchandise shelf. Occlusion between customers is not included in the
dataset. Following the annotations of LSP and LSPET, each customer image is labeled with the
positions and visibility masks of 14 joints: head top, head bottom, left/right shoulders, left/right
elbows, left/right wrists, left/right hips, left/right knees and left/right ankles. In addition, we
add a body orientation label for each image. As shown in Table 2, in our customer pose
dataset, there are 58 sequences/9703 images for training, and 21 sequences/3250 images for

Fig. 14 The example images from 4 different stores



Table 2 The details of customer dataset

Training data Testing data Total

Sequence number 58 21 79
Frame number 9703 3250 12,953

testing. The sequence length ranges from 60 to 360 frames. The body height of a customer is around
300-400 pixels.
The LSP dataset and the LSPET dataset are both widely used benchmarks for human pose
estimation. We use them to extend the variability of our customer dataset. The LSP dataset contains
1 K persons for training and 1 K persons for testing. The LSPET dataset contains 10 K persons for
training. Each image in LSP and LSPET is labeled with 14 joint positions and a visibility mask as
described above.
However, there is no body orientation label in the LSP and LSPET datasets. Actually, these two
datasets include many complex sports images, as in Fig. 15, which cannot simply be given an
orientation. Therefore, we manually pick a subset from LSP and LSPET respectively. In
these subsets, the body orientations can be categorized into one of the 8 directions. This means that in
the selected subsets the heads are generally at the top and the feet are generally at the bottom of the
images, which is similar to the customer poses in a retail store. Poses with the following conditions
are excluded: (I) the feet are higher than the waist, as in Fig. 15a, b; (II) the head is lower than the
waist, as in Fig. 15d; (III) the whole body is horizontal, as in Fig. 15b. These two subsets are
called Orientational LSP (OLSP) and Orientational LSPET (OLSPET) respectively. The
details of these two datasets are listed in Table 3. The backgrounds of OLSP and OLSPET
are subtracted by using the person segmentation of FCN-8s [28]. Then the person area is
dilated by a circular structuring element with a 25-pixel radius. Examples of the background subtraction
of OLSP and OLSPET are shown in Fig. 16.
In the experiments, the models are trained on the customer training data + OLSP training data +
OLSPET. All of the training data are augmented by left-right flipping. The testing is conducted
on two groups of data respectively: the first one is the customer testing data; the second one is
the OLSP testing data.
Based on the datasets above, a series of comparison experiments are conducted as below.

(a) Heatmap by HOG + MP (Conventional DPM [45]).


(b) Heatmap by HOG + MP + Orientation. We simply trained the model of (a) for each
orientation. In testing, the body orientation of the test data is first detected, and
then the corresponding model is chosen to estimate the pose.

Fig. 15 Examples in LSP and LSPET for which it is difficult to label a body orientation. (a) and (b) are images from
LSP, (c) and (d) are images from LSPET

Table 3 The details of OLSP and OLSPET dataset

Training data Testing data Total

OLSP 889 873 1762


OLSPET 4552 – 4552

(c) ResNet. We trained the ResNet-50 model in [18] to directly output the joint positions and
visibility mask.
(d) ResNet + Orientation. Based on the structure of (c), at the last FC layer of ResNet-50, we
stacked the body orientation vector as an additional input.
(e) Heatmap by FCN + MP [8].
(f) Heatmap by FCN + OMP.
(g) Heatmap by FCN + OMP + Optical Flow + Basket heatmap.
(h) Heatmap by FCN + OMP + Optical Flow + Basket heatmap + BRNN. This is our
proposed IntePoseNet.

For experiments (a) and (b), the joint heatmaps are generated by the latent Support Vector
Machine (latent SVM) based on the handcrafted feature (HOG). For experiments (c)~(h), the
network is trained for 100 epochs. The learning rate starts from $10^{-6}$ and decreases by half
every 10 epochs (see the sketch below). Note that experiments (g) and (h) cannot be trained on OLSP and
OLSPET, because these two datasets do not include image sequences. Therefore, the networks
in experiments (g) and (h) are fine-tuned only on the customer dataset, based on the resulting
network of (f). These networks are trained for 50 epochs with learning rate $10^{-8}$. The depth of the
BRNN is 6 in both directions.
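For clarity, the learning-rate schedule described above amounts to the following; the function name is ours.

def learning_rate(epoch, base_lr=1e-6, halve_every=10):
    # starts at 1e-6 and is halved every 10 epochs (1e-8 is used for fine-tuning)
    return base_lr * (0.5 ** (epoch // halve_every))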
In order to evaluate the pose estimation, we apply the Percentage of Correct Keypoints
(PCK), which is a widely used metric. A correct keypoint is defined as a predicted joint
whose distance from the true joint is less than a given threshold. However, our system does not
only output the joint positions but also the visibility masks. In order to evaluate the results of
both joint positions and visibility mask, we propose the Visibility-sensitive Percentage of
Correct Keypoints (VPCK), which is inspired by the Occlusion-sensitive Percentage of
Correctly detected Parts (OPCP) in [3]. The VPCK is defined as below, where Nv and Ninv
are the numbers of visible and invisible joints respectively, PCKv is the PCK of the visible joints,
and Accuracyinv is the detection rate of invisible joints.
Fig. 16 Examples of background subtraction in the OLSP and OLSPET datasets. (a) and (b) are images from
the OLSP dataset, (c) and (d) are images from the OLSPET dataset

Because 0 denotes totally occluded
and 1 denotes totally visible, we set the visibility threshold to 0.5 for all of the joints.
$$\mathrm{VPCK} = \frac{\mathrm{PCK}_v \times N_v + \mathrm{Accuracy}_{inv} \times N_{inv}}{N_v + N_{inv}} \qquad (19)$$
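A sketch of the VPCK computation of Eq. (19); the function name and array interface are our assumptions, and the sketch follows the distance-only criterion for visible joints since the text does not state whether a visible joint must also be predicted visible to count as correct.

import numpy as np

def vpck(pred_pos, pred_vis, gt_pos, gt_vis, body_height, alpha=0.09):
    """Visibility-sensitive PCK, Eq. (19), for one image (14 joints)."""
    thresh = alpha * body_height
    vis = gt_vis == 1
    dist = np.linalg.norm(pred_pos - gt_pos, axis=1)
    n_correct_vis = np.sum(vis & (dist < thresh))        # PCK_v * N_v
    n_correct_inv = np.sum(~vis & (pred_vis == 0))       # Accuracy_inv * N_inv
    return (n_correct_vis + n_correct_inv) / gt_vis.size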
The testing results on the customer dataset are listed in Table 4. The threshold of VPCK (or
PCK) is 0.09 of the body height in Table 4. The curves of results at different thresholds are
illustrated in Fig. 17. Note that in experiments (a) and (b), the DPM does not generate the
visibility mask. Thus for experiments (a) and (b), the values in Table 4 and Fig. 17 are
normal PCK values. Comparison examples are shown in Fig. 18.
These results show several interesting facts. (1) The IntePoseNet, which combines the
most information, achieves the highest performance. (2) The methods using handcrafted
features substantially underperform the deep learning methods. (3) The orientation information
generally improves the VPCK (or PCK). However, for the ResNet, the improvement from the
orientation is relatively small. The reason is that the ResNet only produces one prediction
per image. The orientation information can hardly correct the prediction without any other
candidates. Note that the result of the ResNet outperforms the result of heatmap + message
passing. However, after the optimization by body orientation, the heatmap-based methods
achieve better results. This means the heatmap-based method has greater potential for
optimization. (4) The orientational message passing layers can solve the problem of left-
right similarity, as in rows 1 and 4 of Fig. 18. The left (red/purple) and right
(blue/cyan) limbs are corrected from the reversed side. (5) The basket heatmap can help the
detection of occluded joints, such as the results at row 4, columns 7 and 8 in Fig. 18. (6) The
optical flow can improve the estimation of moving joints, such as the wrists at rows 1 and 5,
columns 7 and 8 in Fig. 18. In contrast, for the methods using the ResNet, it is difficult to
incorporate the optical flow or basket heatmap.
In summary, the IntePoseNet performs better for the following 3 reasons.

1. The network introduces the body orientation information to solve the left-right similarity
problem.
2. The network introduces the basket heatmap to improve the accuracy of occluded joints.
3. The network combines multiple good practices from related works, such as the optical flow, the pose
heatmap and the RNN.

Table 4 The VPCK (%) of testing results on customer dataset

Overall Head Shoulder Elbow Wrist Hip Knee Ankle

Heatmap by HOG + MP [45]* 34.2 45.2 39.9 33.4 28.0 37.9 33.5 30.9
Heatmap by HOG + MP + Orientation* 48.1 57.8 52.7 44.8 37.3 49.4 53.2 52.3
ResNet [18] 79.3 96.7 82.2 72.4 45.1 83.2 88.7 78.5
ResNet + Orientation 83.7 97.7 85.3 78.6 43.3 85.4 94.0 84.5
Heatmap by FCN + MP [8] 78.1 98.3 80.5 67.1 59.9 81.2 84.9 73.7
Heatmap by FCN + OMP 86.8 98.7 94.3 84.1 73.0 86.6 88.9 81.8
Heatmap by FCN + OMP + Optical flow + Basket heatmap 89.8 98.7 94.6 86.2 81.3 87.8 92.7 88.6
IntePoseNet 92.0 98.8 95.5 87.4 82.9 89.0 92.8 89.7

*The values in these rows are the normal PCKs



Fig. 17 The VPCK curves in different threshold on customer dataset



Fig. 18 Examples results on customer dataset. a Heatmap by HOG + MP [45]; b Heatmap by HOG + MP +
Orientation; c ResNet [18]; d ResNet + Orientation; e Heatmap by FCN + MP [8]; f Heatmap by FCN + OMP; g
Heatmap by FCN + OMP + Optical flow + Basket heatmap; h IntePoseNet

From Table 4, the VPCK of the wrist is far lower than the VPCKs of the other joints. The
difficulty is that the area of the wrist is too small. Given a small local image, even from
the human point of view, the wrist is difficult to recognize. Thus a human must recognize
the wrist according to the connection between the wrist and the elbow. In this paper,
although the OMP models this connection to some degree, it only gives the probable
wrist positions based on the elbow position, not based on a forearm model.
Therefore, as future work, the connections between joints need to be learned by
a better model.
The testing results on the OLSP dataset are listed in Table 5. Because the OLSP dataset does
not include image sequences, experiments (g) and (h) are not conducted on the OLSP
dataset. The threshold of VPCK (or PCK) is 0.2 of the body height in Table 5. Because the
models are trained with the same data as those in Table 4, the trends of the performances are similar
to those in Table 4. Example results of the 'Heatmap by FCN + OMP' structure are shown
in Fig. 19.

Table 5 The VPCK (%) of testing results on OLSP dataset

Overall Head Shoulder Elbow Wrist Hip Knee Ankle

Heatmap by HOG + MP [45]* 36.3 47.0 40.1 35.8 30.3 36.9 34.5 33.1
Heatmap by HOG + MP + Orientation* 49.8 58.8 52.3 46.8 34.6 51.0 54.8 53.6
ResNet [18] 81.7 98.6 89.5 74.7 46.7 84.3 89.4 79.4
ResNet + Orientation 86.1 99.1 91.7 79.4 46.9 89.3 96.1 86.8
Heatmap by FCN + MP [8] 79.4 99.2 88.4 69.2 60.7 82.4 85.9 74.9
Heatmap by FCN + OMP 88.6 99.9 94.3 85.0 73.8 90.8 89.8 83.9

*The values in these rows are the normal PCKs

4.2 JHMDB dataset

The JHMDB dataset is proposed by Jhuang et al. [21]. It provides 928 image sequences in 21
action classes. Each image includes a single people. Note that not all of the images show the
whole body of people in JHMDB. Fortunately, the authors also provide a subset named sub-
JHMDB, which contains full human body in every image. The sub-JHMDB includes 316
image sequences in 12 action categories. In the sub-JHMDB, every image is annotated the

Fig. 19 The example results on OLSP dataset with ‘Heatmap by FCN + OMP’ structure

Fig. 20 The example annotations of JHMDB. The picture is captured from the video on the JHMDB official website

positions of the following 15 joints: head, neck, left/right shoulders, left/right elbows, left/right wrists,
belly, left/right hips, left/right knees and left/right ankles. Moreover, the sub-JHMDB also
provides the body orientation of the person in 16 directions, which can be used in our orientational
message passing algorithm. Examples of the annotations of JHMDB are shown in Fig. 20.
In order to compare with other existing methods, the IntePoseNet also needs to be modified.
Firstly, we remove the basket heatmap part, because there is no basket in the sub-JHMDB.
Secondly, the occluded joints also generate position ground truth in the training. Thirdly,
the performance is measured by normal PCK rather than VPCK. We name this version of
IntePoseNet as IntePoseNetV2.
We conduct the experiments on sub-JHMDB as below. The data of sub-JHMDB are
augmented by left-right flipping.

(a) Heatmap by FCN + MP [8]


(b) Heatmap by FCN + OMP
(c) IntePoseNetV2

The experimental results are shown in Table 6 in comparison with other state-of-the-art methods.
The error threshold of the PCK is 0.2 of the body height for all of our experiments and referenced
methods. Example results are shown in Fig. 21.
From Table 6, our method slightly outperforms the state of the art. Similar to the results on the
customer pose dataset, the orientation information also significantly improves the performance on the
sub-JHMDB dataset. Besides, IntePoseNetV2 improves the results of the elbow and wrist

Table 6 The PCK (%) of testing results on sub-JHMDB dataset

Overall Head Shoulder Elbow Wrist Hip Knee Ankle

Nie et al. [44] 55.7 80.3 63.5 32.5 21.6 76.3 62.7 53.1
Iqbal et al. [19] 73.8 90.3 76.9 59.3 55.0 85.9 76.4 73.0
Song et al. [36] 92.1 97.1 95.7 87.5 81.6 98.0 92.7 89.8
Heatmap by FCN + MP [8] 84.1 93.9 86.5 77.3 74.0 93.7 80.6 83.1
Heatmap by FCN + OMP 90.4 99.0 89.9 83.9 79.7 98.3 93.2 88.5
IntePoseNetV2 92.3 99.1 93.5 88.9 82.6 98.4 94.1 89.6

Fig. 21 The example results on sub-JHMDB dataset with IntePoseNetV2

especially, because the temporal information helps to estimate the joints which move fast
and variably.

5 Conclusion and future works

We propose an Integral Pose Network (IntePoseNet) for customer pose estimation. This
network incorporates the body orientation and visibility mask to solve the left-right similarity
problem and the occlusion problem. In addition, the occluding object detection explicitly gives
clues of inter-occlusion by objects. In the temporal domain, this network includes the optical flow
information and the BRNN structure. Therefore, the global body orientation, local joint
connections, human motion information, occluding objects and temporal consistency are
integrally considered in one network structure. As future work, we plan to deal with
customer pose estimation under inter-occlusion between multiple customers. We will also try to
reduce the processing time of our system in order to deploy it in real retail stores.

Acknowledgments The authors thank Haitao Wang, Yongjie Liu and Qianlong Wang for their help with labeling
the data. The faces of customers are blurred for privacy purposes in this paper. This research is permitted by the
Compliance Committee of The University of Tokyo.

References

1. Achilles F, Ichim A-E, Coskun H, Tombari F, Noachtar S, Navab N (2016) Patient MoCap: human pose
estimation under blanket occlusion for hospital monitoring applications. In: Proceedings of the international
conference on medical image computing and computer-assisted intervention, pp 491–499
2. Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2D human pose estimation: new benchmark and
state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp 3686–3693
3. Azizpour H, Laptev I (2012) Object detection using strongly-supervised deformable part models. In:
European Conference on Computer Vision (ECCV), pp 836–849
4. Carreira J, Agrawal P, Fragkiadaki K, Malik J (2016) Human pose estimation with iterative error feedback.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4733–
4742
5. Chen X, Yuille AL (2014) Articulated pose estimation by a graphical model with image dependent pairwise
relations. In: Advances in Neural Information Processing Systems, pp 1736–1744
6. Chen X, Yuille AL (2015) Parsing occluded people by flexible compositions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp 3945–3954
7. Chen J, Song X, Nie L, Wang X, Zhang H, Chua T-S (2016) Micro tells macro: predicting the popularity of
micro-videos via a transductive model. In: Proceedings of the 2016 ACM on multimedia conference, New
York, pp 898–907
8. Chu X, Ouyang W, Li H, Wang X (2016) Structured feature learning for pose estimation. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4715–4723
9. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893
10. Dantone M, Gall J, Leistner C, Van Gool L (2013) Human pose estimation using body parts dependent joint
regressors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp 3041–3048
11. Desai C, Ramanan D (2012) Detecting actions, poses, and objects with relational phraselets. In: Proceedings
of the European Conference on Computer Vision (ECCV), pp 158–172
12. Dosovitskiy A, et al. (2015) Flownet: learning optical flow with convolutional networks. In: Proceedings of
the IEEE International Conference on Computer Vision (ICCV), pp 2758–2766
13. Eichner M, Ferrari V (2012) Appearance sharing for collective human pose estimation. In: Proceedings of
the Asian Conference on Computer Vision (ACCV), pp 138–151
14. Felzenszwalb PF, Huttenlocher DP (2005) Pictorial structures for object recognition. Int J Comput Vis
(IJCV) 61(1):55–79
15. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010) Object detection with discriminatively
trained part-based models. IEEE Trans Pattern Anal Mach Intell (PAMI) 32(9):1627–1645
16. Fragkiadaki K, Levine S, Felsen P, Malik J (2015) Recurrent network models for human dynamics. In:
Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4346–4354
17. Haque A, Peng B, Luo Z, Alahi A, Yeung S, Fei-Fei L (2016) Towards viewpoint invariant 3D human pose
estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 160–177
18. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
19. Iqbal U, Garbade M, Gall J (2017) Pose for action-action for pose. In: Proceedings of the IEEE International
Conference on Automatic Face & Gesture Recognition, pp 438–445
20. Jain A, Tompson J, LeCun Y, Bregler C (2014) MoDeep: a deep learning framework using motion features for
human pose estimation. In: Proceedings of the Asian Conference on Computer Vision (ACCV), pp 302–315
21. Jhuang H, Gall J, Zuffi S, Schmid C, Black MJ (2013) Towards understanding action recognition. In:
Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 3192–3199
22. Johnson S, Everingham M (2010) Clustered pose and nonlinear appearance models for human pose
estimation. In: Proceedings of the British Machine Vision Conference (BMVC)
23. Johnson S, Everingham M (2011) Learning effective human pose estimation from inaccurate annotation. In:
2011 I.E. Conference on Computer Vision and Pattern Recognition (CVPR), pp 1465–1472

24. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Handwritten digit
recognition with a back-propagation network. In: Neural Information Processing Systems (NIPS)
25. LeCun Y, Bengio Y (1995) Convolutional networks for images, speech, and time-series. In: The handbook
of brain theory and neural networks, vol. 3361, no. 10
26. Liu Z, Wang Z (2016) Action recognition with low observational latency via part movement model.
Multimed Tools Appl (MTAP) 76:26675–26693
27. Liu J, Gu Y, Kamijo S (2016) Customer behavior classification using surveillance camera for marketing.
Multimed Tools Appl (MTAP) 76:6595–6622
28. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3431–3440
29. Park D, Ramanan D (2015) Articulated pose estimation with tiny synthetic videos. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp 58–66
30. Pfister T, Charles J, Zisserman A (2015) Flowing convnets for human pose estimation in videos. In:
Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 1913–1921
31. Pishchulin L, Andriluka M, Gehler P, Schiele B (2013) Poselet conditioned pictorial structures. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 588–
595
32. Rafi U, Gall J, Leibe B (2015) A semantic occlusion model for human pose estimation from a single depth
image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Workshops, pp 67–74
33. Russakovsky O et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis (IJCV)
115(3):211–252
34. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):
2673–2681
35. Sminchisescu C, Telea A (2002) Human pose estimation from silhouettes. A consistent approach using
distance level sets. In: Proceedings of the International Conference on Computer Graphics, Visualization
and Computer Vision (WSCG)
36. Song J, Wang L, Van Gool L, Hilliges O (2017) Thin-slicing network: a deep structured model for pose
estimation in videos. arXiv preprint arXiv:1703.10898
37. Sun M, Savarese S (2011) Articulated part-based model for joint object detection and pose estimation. In:
Proceedings of the International Conference on Computer Vision (ICCV), pp 723–730
38. Tafazzoli F, Safabakhsh R (2010) Model-based human gait recognition using leg and arm movements. Eng
Appl Artif Intell 23(8):1237–1246
39. Tompson JJ, Jain A, LeCun Y, Bregler C (2014) Joint training of a convolutional network and a
graphical model for human pose estimation. In: Advances in Neural Information Processing Systems,
pp 1799–1807
40. Tompson J, Goroshin R, Jain A, LeCun Y, Bregler C (2015) Efficient object localization using
convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp 648–656
41. Toshev A, Szegedy C (2014) Deeppose: human pose estimation via deep neural networks. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1653–1660
42. Wagg DK, Nixon MS (2003) Model-based gait enrolment in real-world imagery. In: Proceedings of the
workshop on multimodal user authentication, pp 189–195
43. Weinzaepfel P, Revaud J, Harchaoui Z, Schmid C (2013) DeepFlow: large displacement optical flow with
deep matching. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp
1385–1392
44. Xiaohan Nie B, Xiong C, Zhu S-C (2015) Joint action recognition and pose estimation from video. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1293–
1301
45. Yang Y, Ramanan D (2013) Articulated human detection with flexible mixtures of parts. IEEE Trans Pattern
Anal Mach Intell (PAMI) 35(12):2878–2890
46. Yang W, Ouyang W, Li H, Wang X (2016) End-to-end learning of deformable mixture of parts and deep
convolutional neural networks for human pose estimation. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp 4715–4723
47. Yao J, Odobez J-M (2007) Multi-layer background subtraction based on color and texture. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1–8
48. Zhang J, Nie L, Wang X, He X, Huang X, Chua TS (2016) Shorter-is-better: Venue category estimation
from micro-video. In: Proceedings of the 2016 ACM on multimedia conference, pp 1415–1424

Jingwen Liu received his B.Eng. degree from South China University of Technology, China, in 2008. He
received his M.S. degree from the University of Bristol, United Kingdom, in 2009. He worked for ZTE Corporation
in China from 2010 to 2013 as an information security engineer. He is currently pursuing a Ph.D. degree at The
University of Tokyo, Japan.

Yanlei Gu received his M.E. degree from Harbin University of Science and Technology, China, in 2008, and his
Ph.D. degree in Computer Science from Nagoya University, Japan, in 2012. He has been a post-doctoral researcher in
the Institute of Industrial Science at The University of Tokyo since 2013. His research interests include computer
vision, machine learning and sensor integration for ITS.

Shunsuke Kamijo received his B.S. and M.S. degrees in Physics from The University of Tokyo in 1990 and
1992 respectively, and his Ph.D. degree in Information Engineering in 2001 from The University of Tokyo. He
worked for Fujitsu Ltd. from 1992 as a processor design engineer. He was an Assistant Professor from 2001 to 2002,
and has been an Associate Professor since 2002 at the Institute of Industrial Science, The University of Tokyo. His research
interests are computer vision for traffic surveillance by infrastructure cameras and onboard cameras, cooperative
systems using V2X communication, and GNSS and smartphone applications for ITS and marketing.
