
Body Joint Tracking in Low Resolution Video using Region-based Filtering

Binu M Nair (1), Kimberly D Kendricks (2), Vijayan K Asari (1), and Ronald F Tuttle (3)

(1) Department of ECE, University of Dayton, Dayton, OH, USA
(2) CHPSA Lab, Central State University, Wilberforce, OH, USA
(3) Air Force Institute of Technology, 2950 Hobson Wa, OH, USA

Abstract. We propose a region-based body joint tracking scheme to track and estimate continuous joint locations in low-resolution imagery, so that the estimated trajectories can be analyzed for specific gait signatures. The true transition between joint states is continuous and, specifically, follows a sinusoidal trajectory. Recent state-of-the-art techniques enable us to estimate the pose at each frame, from which joint locations can be deduced. At low resolution, however, these pose estimates are often noisy and discrete and hence unsuitable for further gait analysis. Our proposed 2-level region-based tracking scheme gives a good approximation to the true trajectory and obtains finer estimates. Initial joint locations are deduced from a human pose estimation algorithm, and subsequent finer locations are estimated and tracked by a Kalman filter. We test the algorithm on sequences of individuals walking outdoors and evaluate their gait using the estimated joint trajectories.

Keywords: Kalman filter, region-based tracking, local binary patterns, histogram of oriented gradients, low resolution


Introduction

Most research on tracking in surveillance video has been restricted to detecting
and tracking large objects in the scene, such as people in shopping malls,
players on a soccer or basketball court, or cars. Tracking body joints in a
scene, however, borders on human pose estimation in images and videos. The
research community currently tackles the human body pose estimation problem in
two scenarios: one that uses depth information and one that uses only images.
The former uses depth information from the Kinect (Shotton et al. [14]) and is
mainly suited to indoor applications such as gaming consoles and human
interactive systems. The latter is used in surveillance applications, which use
video feeds from multiple CCTV cameras

Nair et al.

(a) Manual annotation provided by point lights software [10]
(b) Human pose estimation using Articulated Models [15]

Fig. 1: Illustration of specific joints/body parts on the human body to be tracked.

monitoring a parking lot or a shopping mall. Some early research on tracking
motion and pose in surveillance videos tracks interest points detected on the
human body. The trajectories are then modeled to differentiate between human
actions [8]. More recently, Huang et al. [7] estimate and track human body pose
across the scene using information acquired by a multi-camera system. The pose
estimates obtained from such algorithms give continuous, smooth, sinusoidal-like
trajectories and are therefore deemed useful for gait analysis. One limitation,
however, is that they require high-resolution imagery for accurate estimation of
joint trajectories. Applying such algorithms to low-resolution video therefore
does not guarantee joint location estimates suitable for gait analysis, and a
pre-processing mechanism should be applied to these noisy, discrete estimates.
An illustration of the pose estimates obtained by a proprietary point-light
software and by articulated part-based models is shown in Figures 1a and 1b.

Related Work

One of the earlier and popular works that tracks human motion using only a
single video camera, without depth information, is by Markus Kohler [9]. There,
a Kalman filter is designed to track non-linear human motion by treating the
non-linearity as motion with constant velocity and a changing acceleration
modeled as white noise. In our proposed algorithm, we use a modification of this
Kalman filter and of its process noise covariance design to track the body
joints across the video sequence. Kaaniche et al. [8] used the extended Kalman
filter to track specific points or corners detected at every frame of the video
sequence for the purpose of gesture recognition.
In recent years, the problem of human body pose estimation has not been limited
to tracking points or corners or to using depth information. One of the
state-of-the-art methods for human pose estimation on static images is the
flexible mixture-of-parts model proposed by Yang and Ramanan [15]. Instead of
explicitly using a variety of oriented body part templates (parameterized by
pixel location and orientation) in a search-based template matching scheme, a
family of affine-warped templates is modeled, each template containing a mixture of

Body Joint Tracking using Region-based Filtering

non-oriented pictorial structures. Ramakrishna et al. [12] proposed an
occlusion-aware algorithm that tracks human body pose in a sequence, where the
human body is modeled as a combination of single parts, such as the head and
neck, and symmetric part pairs, such as the shoulders, knees and feet. The
important aspect of this algorithm is that it can differentiate between
similar-looking parts such as the left and right leg or arm, thereby giving a
suitable estimate of the human pose. Although these methods show increased
accuracy on datasets such as the Buffy dataset [5] and the Image Parse dataset
[13], their performance on very low-resolution imagery has not yet been
evaluated. Further processing of the human pose estimates can provide coarse
locations of a joint, which can form the basis of many tracking schemes. One
such work is by Burgos-Artizzu et al. [3], who propose a generalization of
non-maximum suppression post-processing schemes to merge multiple pose estimates
either in a single frame or across multiple consecutive frames of a video
sequence. We focus on an alternative problem, where we require smooth
trajectories of individual joints in low-resolution video for realistic, online
analysis of gait signatures. The work proposed in this paper is an alternative
to, and more accurate than, our preliminary model [10] for body joint tracking,
in which a combination of optical flow and LBP-HOG descriptors with a Kalman
filter was evaluated.


In this section, we explain the modules used in the proposed framework, namely
the region-based feature matching and the tracking scheme using the Kalman filter.

Region Descriptor Matching

The region descriptors Histogram of Oriented Gradients (HOG) [4] and Local Binary Patterns (LBP) [11] describe, respectively, the edge information and the textural content in a local region. Both can be very effective descriptors for region-based image matching at low resolution. The HOG descriptor [4] is a histogram of the pixels over edge orientation, in which each pixel is weighted by its edge magnitude. The gradient magnitude and direction are given by $\sqrt{G_x^2 + G_y^2}$ and $\tan^{-1}(G_y / G_x)$, where $G_x$, $G_y$ are the gradients in the $x$, $y$ directions.
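As an illustration of the magnitude-weighted orientation histogram described above, the following is a minimal sketch in Python with NumPy. It is not the implementation used in the experiments: it computes a single histogram over the whole patch, whereas the full HOG descriptor of [4] also uses cells and block normalisation. The function name `orientation_histogram` and the choice of 9 unsigned orientation bins are assumptions made for the example.

```python
import numpy as np

def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    Hypothetical helper for illustration; a single-cell simplification of HOG.
    """
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)            # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)           # sqrt(Gx^2 + Gy^2)
    angle = np.arctan2(gy, gx) % np.pi     # orientation folded into [0, pi)
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist  # normalise unless the patch is flat
```

For a 17 x 17 joint region, `orientation_histogram(region)` returns a 9-element vector that can be compared between frames with a histogram distance.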

The local binary pattern is an image coding scheme which brings out the textural features in a region. To represent a joint region and to associate a joint in successive frames, the texture of the region plays a vital part in addition to the edge information. The LBP considers a local neighborhood of $8 \times 8$ or $16 \times 16$ in a joint region and generates a coded value which represents the underlying texture in its local region. The LBP operator is defined as

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases} \tag{1}$$


where $(P, R)$ are the number of points around the local neighborhood and its radius, $g_c$ is the gray value of the center pixel and $g_p$ is the gray value of its $p$th neighbor. The textural representation of the joint region is then the histogram of these LBP-coded values. For our purpose, we use $P = 8$ with $R = 1$, which reduces to a local region of size $8 \times 8$. The matching between two joint regions represented either by HOG or by LBP is done using the Chi-squared metric [11] in Equation 2, where $f_1$, $f_2$ are the feature vectors corresponding to a certain joint in successive frames and $b$ indexes the histogram bins.

$$\chi^2(f_1, f_2) = \sum_{b} \frac{\left( f_1(b) - f_2(b) \right)^2}{f_1(b) + f_2(b)} \tag{2}$$
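The LBP coding and the Chi-squared matching can be sketched as follows. This is a minimal illustration, assuming $P = 8$, $R = 1$ with the eight neighbours taken directly on the pixel grid (no circular interpolation) and a plain 256-bin histogram. The function names `lbp_histogram` and `chi_squared` are hypothetical, and bins where both histograms are empty are skipped to avoid division by zero.

```python
import numpy as np

def lbp_histogram(region):
    """Normalised 256-bin histogram of LBP(8, 1) codes over a grayscale region."""
    g = np.asarray(region, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Eight neighbours at radius 1, in a fixed circular order (an assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbour >= center).astype(np.int64) << p  # s(g_p - g_c) * 2^p
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def chi_squared(f1, f2, eps=1e-12):
    """Chi-squared distance between two histograms."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    denom = f1 + f2
    mask = denom > eps                     # ignore bins that are empty in both
    return float(np.sum((f1[mask] - f2[mask]) ** 2 / denom[mask]))
```

A joint region from frame $t - 1$ and a candidate region from frame $t$ would then be compared with `chi_squared(lbp_histogram(a), lbp_histogram(b))`, with smaller values indicating a better match.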


Kalman Filter

The recursive Kalman filter can also be used for tracking, and in the literature it has been widely applied to tracking points in video sequences. In the proposed algorithm, we use the Kalman filter to track a specific body joint across the scene. The state of the process (here, the human body movement) is set to the $(x, y)$ coordinates of the joint along with its velocity $(v_x, v_y)$, giving a state vector $x_k \in \mathbb{R}^4$. The measurement vector $z_k = [x_o, y_o] \in \mathbb{R}^2$ is provided either by the coarse joint location estimates or by the region-based estimate. By approximating the motion of a joint over a small time interval by a linear function, we can design the transition matrix $A$ so that the next state is a linear function of the previous state. As done by Kohler [9], to account for the non-constant velocity often associated with accelerating image structures, we use the process noise covariance matrix $Q$ defined in Equation 3, where $a$ is the acceleration and $\Delta t$ is the time step determined by the frame rate of the camera.

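$$Q = \frac{a^2 \Delta t}{6} \begin{bmatrix} 2\Delta t^2 & 0 & 3\Delta t & 0 \\ 0 & 2\Delta t^2 & 0 & 3\Delta t \\ 3\Delta t & 0 & 6 & 0 \\ 0 & 3\Delta t & 0 & 6 \end{bmatrix} \tag{3}$$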
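A minimal sketch of such a tracker for one joint is given below, assuming the state ordering $[x, y, v_x, v_y]$, a constant-velocity transition matrix $A$, and the white-noise-acceleration form of $Q$ attributed to Kohler [9], written out explicitly here as an assumption. The measurement noise variance `meas_var` and the function names are hypothetical and are not specified in the text.

```python
import numpy as np

def make_model(dt=1.0, a=0.1, meas_var=1.0):
    """Constant-velocity model for the state [x, y, vx, vy]."""
    I2 = np.eye(2)
    A = np.block([[I2, dt * I2],
                  [np.zeros((2, 2)), I2]])          # position += velocity * dt
    H = np.hstack([I2, np.zeros((2, 2))])           # only (x, y) is measured
    Q = (a ** 2 * dt / 6.0) * np.block([[2 * dt ** 2 * I2, 3 * dt * I2],
                                        [3 * dt * I2, 6 * I2]])
    R = meas_var * I2                               # assumed measurement noise
    return A, H, Q, R

def predict(x_post, P_post, A, Q):
    """Prior (predicted) state and a-priori error covariance."""
    return A @ x_post, A @ P_post @ A.T + Q

def correct(x_prior, P_prior, z, H, R):
    """Posterior state and covariance after measurement z = [zx, zy]."""
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    return x_post, P_post
```

In use, `predict` is called once per frame and `correct` is then called with either the region-based estimate or the coarse joint location as `z`.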

Proposed Framework

A block schematic of the proposed tracking scheme is shown in Figure 2. It consists of two main stages: a) 2-level region-based matching using LBP/HOG and b) tracking of the region-based estimates using a Kalman filter. The steps of the proposed algorithm are as follows:
1: Extract the first frame (time instant $t = 1$) of the sub-trajectory. Compute dense optical flow within the foreground region to get the global velocity estimate (median flow).
2: Initialize the Kalman filter with the coarse joint location and the global velocity. The state of the tracker for each body joint is then $x_t = [x, y, v_x, v_y]$, where $(v_x, v_y)$ is the joint velocity, set to the global flow velocity estimate. This is considered the corrected state $\hat{x}_{t-1}$ at time $t = 1$.


Fig. 2: Block schematic of the proposed tracking scheme.

3: Update $t \leftarrow t + 1$ and predict the state (get the prior state) $\hat{x}_t^-$ of the Kalman filter. Using the predicted state $\hat{x}_t^-$, the posterior state $\hat{x}_{t-1}$ and the a-priori error covariance $P_t^-$, estimate the elliptical region $S_{reg1}(t)$ in which the joint location is likely to fall.
4: Extract the next frame. Find the region-based matching estimate of each joint between instants $t$ and $t - 1$, formulated as $\arg\min_{p \in S_{reg1}(t)} \chi^2(f_j, f_p)$, where $f_j$ is the joint descriptor updated at the previous time instant and $f_p$ is the region descriptor computed at the pixel $p$ within the elliptical search region $S_{reg1}(t)$. Also compute the dense optical flow and the global velocity of the foreground region.
5: Using this estimate and the coarse joint location estimate, predict the new elliptical search region $S_{reg2}(t)$. A constraint $S_{reg2}(t) \subseteq S_{reg1}(t)$ is enforced to prevent the growth of $S_{reg2}(t)$. If the constraint is satisfied, go to Step 6; otherwise go to Step 8.
6: Compute the region-based estimate given by $\arg\min_{p \in S_{reg2}(t)} \chi^2(f_j, f_p)$. Use this finer estimate of the joint location as the measurement vector $z = [z_x, z_y]$ to correct the Kalman tracker associated with that particular joint.
7: Update $t \leftarrow t + 1$. Set the joint velocity to the global velocity and predict the state (get the prior state) $\hat{x}_t^-$ and the elliptical search region $S_{reg1}(t)$. Go to Step 4.
8: Using the coarse joint location estimate as the measurement vector, perform the correction phase of the filter.
9: Update $t \leftarrow t + 1$. Set the joint velocity to the global velocity and predict the state (get the prior state) $\hat{x}_t^-$ and the elliptical search region $S_{reg1}(t)$. Go to Step 4.
10: Continue until all the frames of the sequence have been processed.
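The search inside an elliptical region (Steps 4 and 6) can be sketched as below. The text does not state how the ellipse is derived from the filter, so this example assumes it is a Mahalanobis gate on the position block of the a-priori error covariance, with a gate value of 9.21 (roughly the 99% point of a Chi-squared distribution with two degrees of freedom). The function names, the gate value and the `describe` and `distance` callables are all hypothetical.

```python
import numpy as np

def ellipse_pixels(center, cov, gate=9.21):
    """Integer pixels p with (p - c)^T cov^-1 (p - c) <= gate."""
    center = np.asarray(center, dtype=float)
    cov = np.asarray(cov, dtype=float)
    cov_inv = np.linalg.inv(cov)
    # Bounding box of the ellipse, padded by one pixel for the rounded centre.
    rx = int(np.ceil(np.sqrt(gate * cov[0, 0]))) + 1
    ry = int(np.ceil(np.sqrt(gate * cov[1, 1]))) + 1
    cx, cy = np.round(center).astype(int)
    xs, ys = np.meshgrid(np.arange(cx - rx, cx + rx + 1),
                         np.arange(cy - ry, cy + ry + 1))
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    d = pts - center
    maha = np.einsum("ni,ij,nj->n", d, cov_inv, d)   # squared Mahalanobis distance
    return pts[maha <= gate]

def best_match(pixels, f_joint, describe, distance):
    """Pixel whose region descriptor is closest to the stored joint descriptor."""
    scores = [distance(f_joint, describe(p)) for p in pixels]
    return pixels[int(np.argmin(scores))]
```

Here `describe(p)` would extract the HOG or LBP histogram of the neighborhood centred at pixel `p`, and `distance` would be the Chi-squared metric; the returned pixel serves as the measurement passed to the correction phase.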

Results and Experiments

The proposed tracking scheme has been tested on a private dataset provided by the Air Force Institute of Technology, Dayton OH. It consists of 12 subjects walking along an outdoor track across the face of a building. Each subject performs the walk twice, once wearing a loaded vest and once without a vest, giving a total of 24 video sequences. The area of focus is where the subject walks clockwise around the track and climbs a ramp. We set equal neighborhood sizes of $17 \times 17$ for each joint region and a constant acceleration $a = 0.1$ pixels/frame$^2$ for the corresponding Kalman filter. Figure 4 illustrates the proposed scheme on selected frames of a sequence. Sample joint trajectories are shown in Figure 5, where four different schemes are compared. All joint trajectories estimated by the different schemes are smoothed using a regression-based neural network. We see that the smoothed trajectories obtained by the proposed scheme using LBP or HOG give the closest approximation to the sinusoidal trajectory, with subtle variations.

(a) Covariance based trajectory measures.
(b) MOTA/MOTP scores

Fig. 3: Experimental results obtained with the proposed region-based Kalman tracking scheme using LBP descriptors. The numbering of points in the MOTP/MOTA scores refers to the names of the sequences listed in the left panel.

Co-variance-Based Trajectory Measure

It is a statistical measure of how close the tracked joint locations are to the coarse estimates of the joint location, for each sequence associated with a particular subject. This metric [6] is given by

$$d^2(K, K_m) = \sum_{i=1}^{n} \left( \log \lambda_i(K, K_m) \right)^2$$

where $K \in \mathbb{R}^{3 \times 3}$ is the covariance matrix of the tracked points, $K_m \in \mathbb{R}^{3 \times 3}$ is the covariance matrix of the coarse joint locations, $\lambda_i$ is the $i$th eigenvalue associated with $|\lambda K - K_m| = 0$, and $n$ is the number of eigenvalues. The lower the value, the closer the tracked points are to the coarse joint locations. This measure does not give the precision of the tracking scheme, but it indicates whether the tracked joint trajectory lies within the spatio-temporal neighborhood of the coarse joint trajectory. We see that most of the joint trajectories obtained from the proposed scheme have very low values. This shows that the proposed scheme obtains tracked estimates which are close to the pose estimates obtained from a pose detector.
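The measure can be computed in a few lines. The sketch below assumes both covariance matrices are symmetric positive definite, so that the generalized eigenvalues of $|\lambda K - K_m| = 0$ are real and positive, and obtains them as the eigenvalues of $K^{-1} K_m$. The function name `covariance_distance_sq` is hypothetical.

```python
import numpy as np

def covariance_distance_sq(K, Km):
    """Sum of squared log generalized eigenvalues of the pair (Km, K)."""
    K = np.asarray(K, dtype=float)
    Km = np.asarray(Km, dtype=float)
    # Roots of |lambda*K - Km| = 0 are the eigenvalues of K^-1 Km.
    lam = np.linalg.eigvals(np.linalg.solve(K, Km)).real
    return float(np.sum(np.log(lam) ** 2))
```

For trajectories stored as arrays of $(x, y, t)$ rows, the two inputs could be obtained with `np.cov(points, rowvar=False)`. The value is zero when the two covariances coincide and is unchanged when the arguments are swapped.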


Multiple Object Tracking Precision/Accuracy (MOTP/MOTA)

The MOTP/MOTA [2] metric is a widely used performance measure for multiple-object tracking mechanisms, where MOTP and MOTA give the precision and accuracy of the tracker by considering all the detected and tracked objects. We use an implementation of CLEAR-MOT [1] to obtain statistics such as the false positive rate, false negative rate, and the MOTA and MOTP scores. Multiple Object Tracking Precision (MOTP) refers to the closeness of a tracked point location to its true location (given as ground truth). Here, we measure the closeness as the overlap between the neighborhood regions occupied by the tracked point location and by the ground truth. The higher this overlap, the more precise the estimated location of the point. Multiple Object Tracking Accuracy (MOTA) gives the accumulated accuracy in terms of the fraction of tracked joints matched correctly without any misses or mismatches. We computed the MOTP, MOTA, false positive rate and false negative rate for each sequence by setting the threshold $T = 0.5$, with the same acceleration parameter $a = 0.1$ and a neighborhood size of $17 \times 17$ for each body joint. We use the coarse joint location estimates as the ground truth data, since no appropriate ground truth has been provided with this dataset. In Figure 3b, we see that all of the sequences have a moderately high precision of around 75% and a high accuracy of around 90%. This shows that the proposed tracking scheme is less noisy, and that the reduction in precision is due to the slight variation of the estimated joint locations with respect to the coarse joint location estimates.
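To make the two scores concrete, the following simplified sketch computes them for joint tracks with fixed identities. It is not the CLEAR-MOT implementation of [1]: it assumes that each tracker is always paired with the same ground-truth joint (so there are no identity mismatches), that the overlap is the intersection-over-union of two square neighborhoods, and that a pair below the threshold counts as both a miss and a false positive. The function names are hypothetical.

```python
import numpy as np

def box_overlap(p, q, size=17):
    """Intersection-over-union of two size x size squares centred at p and q."""
    dx = max(0.0, size - abs(float(p[0]) - float(q[0])))
    dy = max(0.0, size - abs(float(p[1]) - float(q[1])))
    inter = dx * dy
    return inter / (2.0 * size * size - inter)

def mota_motp(tracked, truth, size=17, threshold=0.5):
    """tracked, truth: arrays of shape (frames, joints, 2) holding (x, y)."""
    tracked = np.asarray(tracked, dtype=float)
    truth = np.asarray(truth, dtype=float)
    overlaps = np.array([[box_overlap(p, q, size) for p, q in zip(ft, fg)]
                         for ft, fg in zip(tracked, truth)])
    matched = overlaps >= threshold
    misses = int((~matched).sum())        # ground-truth joints left unmatched
    false_pos = misses                    # their trackers count as false positives
    mota = 1.0 - (misses + false_pos) / matched.size
    motp = float(overlaps[matched].mean()) if matched.any() else 0.0
    return mota, motp
```

With this definition, MOTP is the mean overlap over matched pairs, so small offsets between tracked and reference locations lower MOTP without affecting MOTA as long as the overlap stays above the threshold.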


Conclusions

We have proposed a body joint tracking algorithm that uses a region-based matching scheme together with a Kalman filter, for use in conjunction with state-of-the-art human pose estimation algorithms on low-resolution outdoor sequences. The algorithm combines effective region-based point tracking techniques using HOG or LBP with the predictive capability of the Kalman filter. After applying a post-processing GRNN-based smoothing scheme, qualitative evaluation shows that the proposed scheme provides a better approximation of the true sinusoidal trajectory than schemes using only the pose estimates. In terms of quantitative analysis, the precision and accuracy of the joint tracks obtained from the proposed scheme are higher. Future work will involve analyzing the joint trajectories to determine any characteristics embedded in them that are suitable for gait signature analysis, for people re-identification, or for human action and activity analysis.


(a) Elliptical search region in frame 1 for frame 2.
(b) Fine estimates of joint location in frame 2 obtained from tracking scheme.
(c) Elliptical search region computed in frame 3 for frame 4. Here, the shoulder and the ankle joint trackers are corrected with the coarse location while the other joint trackers are corrected with the region-based estimate.
(d) Finer estimates of joint locations in frame 4 obtained from tracking scheme (LBP).
(e) Elliptical search region computed at frame 7 for frame 8.
(f) Tracked joint locations at frame 9 based on the elliptical search regions.

Fig. 4: Illustration of elliptical search regions before tracking and joint location estimates after tracking. The coarse pose estimates are represented by purple color in each frame. The search regions and the finer joint estimates are given as shoulder (blue), elbow (green), wrist (red), waist (cyan), knee (yellow) and ankle (pink).


(a) Shoulder Joint

(b) Elbow Joint

(c) Wrist Joint

(d) Hip Joint

(e) Knee Joint

(f) Ankle Joint

Fig. 5: Estimated fine joint trajectories by different schemes for subject 11 wearing a coat in phase A. Color Key : Blue - Coarse joint locations from human
pose estimation, Purple - Coarse joint locations filtered by Kalman filter, Green
- HOG region based tracking, Red - LBP region based tracking



Acknowledgments. This work was done in collaboration with Central State University and is supported by National Science Foundation grant No. 1240734. We would like to thank the National Signature Program and the Air Force Institute of Technology for the dataset used in this research.

References

1. Bagdanov, A., Del Bimbo, A., Dini, F., Lisanti, G., Masi, I.: Posterity logging of face imagery for video surveillance. MultiMedia, IEEE 19(4), 48–59 (Oct 2012)
2. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. J. Image Video Process. 2008, 1:1–1:10 (Jan 2008)
3. Burgos-Artizzu, X., Hall, D., Perona, P., Dollar, P.: Merging pose estimates across space and time. In: Proceedings of the British Machine Vision Conference. BMVA Press (2013)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. vol. 1, pp. 886–893 (2005)
5. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8 (June 2008)
6. Forstner, W., Moonen, B.: A metric for covariance matrices (1999)
7. Huang, C.H., Boyer, E., Ilic, S.: Robust human body shape and pose tracking. In: 3DV-Conference, 2013 International Conference on. pp. 287–294 (2013)
8. Kaaniche, M., Bremond, F.: Tracking HOG descriptors for gesture recognition. In: Advanced Video and Signal Based Surveillance, 2009. AVSS '09. Sixth IEEE International Conference on. pp. 140–145 (2009)
9. Kohler, M.: Using the Kalman Filter to Track Human Interactive Motion: Modelling and Initialization of the Kalman Filter for Translational Motion. Forschungsberichte des Fachbereichs Informatik der Universität Dortmund, Dekanat Informatik, Univ. (1997)
10. Nair, B.M., Kendricks, K.D., Asari, V.K., Tuttle, R.F.: Optical flow based Kalman filter for body joint prediction and tracking using HOG-LBP matching. vol. 9026, pp. 90260H–90260H-14 (2014)
11. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24(7), 971–987 (2002)
12. Ramakrishna, V., Kanade, T., Sheikh, Y.: Tracking human pose by tracking symmetric parts. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. pp. 3728–3735 (2013)
13. Ramanan, D.: Learning to parse images of articulated bodies. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1129–1136. MIT Press (2007)
14. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1297–1304 (2011)
15. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1385–1392 (June 2011)