
A piece-wise Kalman-filter based body joint tracking scheme for low resolution interlaced imagery


Binu M Nair^a, Kimberly D Kendricks^b, Vijayan K Asari^a, Ronald F Tuttle^c
^a University of Dayton Vision Lab, 300 College Park, Dayton, OH-45469
^b Central State University, 1400 Brush Row Road, Wilberforce, OH-45384
^c Air Force Institute of Technology, 2950 Hobson Way, OH-45433

Abstract. We propose an efficient scheme to track joint locations in low resolution interlaced imagery based on primitive low-level image features. The proposed scheme consists of piece-wise linear trackers, where we model the sinusoidal joint trajectory as a series of linear sub-trajectories with an LBP-based region matching implemented at the boundaries. Each sub-trajectory is modeled by a Kalman filter which tracks the optical flow at the joint region in successive frames. When an optical flow mismatch occurs, signaling the end of a sub-trajectory, the tracking scheme switches to an LBP-based region matching. This region-based match, along with a coarse estimate of the joint location from either a human pose estimation or point light algorithm, provides us with a search region on which a second pass of region matching is performed. The corresponding Kalman filter is reinitialized with this finer joint location estimate and switches to tracking the optical flow along the next sub-trajectory. This tracking scheme is tested on a dataset containing 24 sequences of individuals walking across the face of a building. Three statistical measures are computed to describe its efficiency in terms of spatio-temporal closeness and multiple object tracking accuracy: Trajectory Covariance, MOTA/MOTP and precision/recall scores.
Keywords: Joint Tracking, Local Binary Patterns, Kalman Filter, Lucas Kanade Optical Flow, Region-based matching, Multiple Object Tracking, Sinusoidal trajectory modeling.
Address all correspondence to: Binu M Nair, University of Dayton Vision Lab, E-mail: nairb1@udayton.edu

1 Introduction
Detection, tracking and localization of objects of interest in a scene are important aspects of computer vision, and these three research areas have many applications in surveillance-based scenarios, ranging from target identification and tracking in aerial imagery to object detection and location estimation in videos from CCTV cameras. In recent years, research in object tracking from surveillance videos has largely been restricted to detecting and tracking large objects in the scene, such as people in shopping malls, players on a soccer/basketball court,1, 2 or cars. Some research on tracking points in a scene has also been developed, where interest points detected on the human body are tracked to obtain trajectories that differentiate between different actions.3-5 Research on joint tracking, which requires consistent estimation of a certain body part, has been tackled in scenarios such as indoor gaming applications, where depth sensors give accurate depth information and higher resolution video data captured at higher frame rates are available. However, the problem of body joint detection and tracking from surveillance videos captured by low resolution CCTV cameras, without any depth information and at low frame rates, has not been given much emphasis in today's research.
In this manuscript, we propose a novel tracking scheme which tracks some of the relevant body joints, such as the shoulder, elbow, wrist, waist, knee and ankle, across the scene given a coarse estimate of the joint location obtained from a human pose estimation algorithm or from a point light software. An illustration of the specific body joints used for tracking is shown in Figure 1a. The proposed scheme approximates the non-linear trajectory of a joint as a combination of successive linear sub-trajectories, where each one is modeled by a Kalman filter designed to track small non-linear variations between frames. The novel aspect of this scheme is that the boundaries of these sub-trajectories are not pre-defined and can in fact be determined on-line by detecting apparent mismatches of joint regions between frames along the linear sub-trajectory. In low resolution imagery, optical flow provides us with the best coarse measure to track regions or blobs moving in a linear (or slightly non-linear) fashion. By designing a Kalman filter to model the state transitions of the optical flow matches, we can detect a mismatch whenever the optical flow match in the current frame does not fall into the search region predicted by the Kalman filter. Using the Kalman filter in this way to track optical flow matches on a linear sub-trajectory, its boundaries can be determined. This brings us to the next issue: finding a suitable joint location to reinitialize the Kalman tracker. This finer estimate of the joint location can be determined by using a 2-level region matching scheme, where at each level a coarse measure of the joint location is used to define a search region for the next level, and at the finer level a minimum distance measure between LBP6 joint region descriptors is used. Figure 1b illustrates the conceptual modeling of the non-linear trajectory by piece-wise Kalman filters.

Fig 1: Joint illustration and its theoretical trajectory. (a) Illustration of specific joints on the human body to be tracked. (b) Illustration of a piecewise Kalman filter concept for joint tracking.

This paper is organized as follows: Section 2 reviews related work on tracking and briefly describes the problems and issues being tackled. Section 3 explains the theoretical background underlying the proposed tracking framework, which is described in Section 4. Finally, we provide results of tracking the specific body joints in Section 5 and conclude this paper in Section 6 with some proposals and ideas for further improvement of the algorithm.

2 Related Work
Joint tracking, or in other words the human body pose estimation problem, has been tackled by the research community in two different scenarios: one which uses depth information and one which uses only the images. The former uses depth information generated either by a depth sensor such as the Kinect or by a 3D reconstruction algorithm from multiple video sources. This is suitable for applications in indoor scenarios, such as gaming consoles or human interactive systems, where high resolution is available. The latter is used in surveillance scenarios which do not have any source of depth information, such as the video feed from CCTV cameras monitoring a parking lot or a shopping mall. One of the most recent and popular works was done by Shotton et al.7 for locating 3D joint positions in the human body from a depth image acquired by a Kinect sensor. They used a part-based recognition paradigm in which they converted a difficult pose estimation problem into an easier per-pixel classification problem and subsequently estimated the 3D joint locations irrespective of pose, body shape or clothing. In a more recent approach by Huang et al.,8 human body pose is estimated and tracked across the scene using information acquired by a multi-camera system. Here, both the skeletal joint positions and the surface deformations (body shape changes) are estimated by fitting a reference surface model to the 3D point reconstructions from the multi-camera system. This approach also makes use of a learning scheme which divides the point reconstructions into rigid body parts for accurate estimation. However, the above methods require high resolution imagery under controlled lighting or environmental conditions to work with high accuracy. Another restriction of these methods is their dependence on depth information, either for direct usage or for point cloud reconstruction.
One of the earlier and popular works which does not use depth information and uses only a single video camera to track human motion is by Markus Kohler.9 Here, he designs a Kalman filter to track human motion in such a way that non-linearity in motion is treated as motion with constant velocity and changing acceleration, the latter being modeled as white noise. The process noise co-variance of the Kalman filter is designed so as to incorporate this changing acceleration. In our proposed algorithm, we use a modification of this Kalman filter and of the design of the process noise co-variance to track the body joints along a sub-trajectory across the video sequence. Kaniche et al.4 used the extended version of the Kalman filter to track specific points or corners detected at every frame of the video sequence for the purpose of gesture recognition. Each point is described by a region descriptor such as the Histogram of Oriented Gradients (HOG), and the Kalman filter tracks the position of the corner by using a HOG-based region matching. For tracking specific joints, however, this methodology does not suffice, as any corner point which does not get matched with the previous frame is discarded. Bilimski et al.5 extended this methodology by incorporating the object speed and orientation to track multiple objects under occlusion.
In recent years, the problem of human body pose estimation has not been limited to tracking points or corners or to using depth information. One of the state-of-the-art methods for human pose estimation on static images is the flexible mixture of parts model proposed by Yang and Ramanan.10 Instead of explicitly using a variety of oriented body part model templates (parameterized by pixel location and orientation) in a search-based template matching scheme, a family of affinely-warped templates is modeled, each template containing a mixture of non-oriented pictorial structures. Owing to these approximations, there is no need to estimate multiple degrees of freedom of a limb. Ramakrishna et al.11 proposed an occlusion-aware algorithm which tracks human body pose in a sequence, where the human body is modeled as a combination of single parts, such as the head and neck, and symmetric part pairs, such as the shoulders, knees and feet. An important aspect of this algorithm is that it can differentiate between similar looking parts such as the left and right leg/arm, thereby giving a suitable estimate of the human pose. Although these methods show an increased accuracy on datasets such as the Buffy Dataset12 and the Image Parse dataset,13 their performance on very low-resolution imagery with interlacing has not yet been evaluated. One of the main advantages of these kinds of human pose estimation algorithms, however, is that in the post-processing stage the various body part detections can provide coarse estimates of a joint location, which can then be used to initialize tracking schemes and track joint locations in subsequent frames of a video sequence. One such work was done by Xavier et al.,14 who propose a generalization of the non-maximum suppression post-processing schemes to merge multiple pose estimates either in a single frame or in multiple consecutive frames of a video sequence. This merging of estimates is done by a robust and constrained K-means clustering15 along the spatial domain for a single frame and along the spatio-temporal domain for a video sequence. Again, one of the main concerns is its dependence on multiple pose estimates, which relies on the ability of state-of-the-art pose estimation algorithms to handle low-resolution interlaced imagery.
In our proposed joint tracking framework, we follow the track-by-detect scheme, where we use optical flow matches and a Kalman filter to track joint locations lying on an approximately linear sub-trajectory, with suitable re-initialization of the joint tracker using LBP-based region matching criteria. The coarse joint location estimates are used to re-initialize the tracker, and this happens only in certain frames or intervals which indicate the beginning/end of a sub-trajectory of a joint. In the next section, we describe the theory involved in the various modules of the tracking framework.

3 Theory
This section describes the theoretical background required for a deeper understanding of the proposed model for joint estimation and tracking. The main topics covered are: a) Lucas Kanade optical flow estimation, b) LBP-based region matching, c) the Kalman filter. Our proposed methodology is a combination of these techniques, designed to estimate and track joints in a low-resolution video given the coarse estimate of the joint locations.

Fig 2: Framework for computing optical flow and illustration. (a) Block schematic of optical flow computation to compute global velocity. (b) Optical flow illustration.


3.1 Lucas Kanade Optical Flow
Optical flow between two frames of a video sequence estimates the velocity of a point in the real world scene by finding a relationship between the projections of that point in the corresponding frames. In other words, optical flow measures the velocity or movement of a pixel or region between two time instances. In our case, the point of interest is the corresponding joint of a human body, and we need to estimate the velocity of that joint in the current frame given its location in the previous frame. There exist two main methods for computing this velocity: one is the Horn-Schunck method, which takes into consideration a global constraint (i.e. all the image pixels are used in the determination of the velocity of a single pixel), while the other is the Lucas Kanade16 method, which is more localized (i.e. it considers only a neighborhood region around the point of interest, thereby setting a local constraint). The optical flow in both methods is based on a single equation given by $I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t)$. Here, consider that a pixel $p = (x, y)$ at time $t$ has moved to a position $p' = (x + \Delta x, y + \Delta y)$ at time $t + \Delta t$. The equation then assumes that the brightness of the pixel remains constant through its movement. This is one of the major assumptions of optical flow. The other assumptions are spatial coherence, where the point describing an object region does not change shape with time, and temporal persistence, where the motion happening in a pixel or region is purely based on the motion of the object and not due to camera movement. For tracking joint regions of the human body, the localized regions remain rigid or do not change shape and thus do not violate the spatial coherence assumption. Since in our testing scenarios we use video sequences captured from a stationary camera with a constant background, the temporal persistence assumption is not violated either. So, for our purposes, we employ the Lucas Kanade (L-K) optical flow estimation technique, which uses a local constraint. The optical flow equation can be derived by using a Taylor series expansion of the basic equation and is given by
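\[
\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0
\quad \text{or} \quad
\nabla I \cdot \mathbf{v} + I_t = 0 \tag{1}
\]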

where $(v_x, v_y)$ is the optical flow velocity of a pixel $p = (x, y)$. As mentioned earlier, the L-K method uses a local constraint. A small window region (local neighborhood) around the point $p = (x, y)$ is considered, and within this neighborhood a weighted least squares error is minimized. This error is given by

\[
\sum_{x,y} W^2(x, y) \left[ \nabla I(x, y, t) \cdot \mathbf{v} + I_t(x, y, t) \right]^2 \tag{2}
\]

Minimizing the above expression, which is built on the optical flow constraint equation (Equation 1), yields a unique solution $\mathbf{v}$. The assumption here is that the optical flow within that local region is constant. But there are some issues when computing the Lucas Kanade optical flow. One issue is that the motion in the scene may not be small enough, in which case the higher order terms in the optical flow constraint equation would be needed. The alternative approach is to use a pyramidal iterative Lucas Kanade approach, where the image at a particular instant is down-sampled to form a Gaussian pyramid and optical flow is computed at each level. The other issue arises when a point in a local region does not move like its neighbors. This brings us back to our earlier assumption of spatial coherence, where the objects or points to be tracked should be rigid. So, one of the important design criteria is to determine the ideal window size (local region size) for computing the optical flow at a certain point. For the joint tracking problem, this window size depends on the resolution of the video, and thus for poor resolutions we use a window size of 7 × 7. An illustration of optical flow estimation on the points of a human body silhouette is shown in Figure 2. For the proposed algorithm, we use the optical flow estimation in two scenarios: one to compute the global velocity of the motion of the human body, and the other to find an estimate of the location of a particular body joint in the next instant. For the latter purpose, we compute the optical flow for every point surrounding the joint region using the L-K method and then compute the median flow.
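
As a rough illustration of the median-flow step described above, the following Python sketch solves the L-K least squares problem of Equation 2 on a single window and then takes the median over the points around a joint. It is a minimal sketch under stated assumptions and not the implementation used in this work: uniform weights (W = 1), a single pyramid level, central-difference gradients, and grayscale frames stored as nested lists. The names lucas_kanade_window, median_flow and pattern, as well as the radius parameter, are hypothetical.

```python
from statistics import median


def lucas_kanade_window(prev, curr, cx, cy, half=3):
    """Least squares flow (vx, vy) at (cx, cy) over a (2*half+1)^2 window.

    ASSUMPTION: uniform weights (W = 1) and a single pyramid level.
    Returns None when the 2x2 normal matrix is singular (aperture problem).
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # dI/dx
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # dI/dy
            it = curr[y][x] - prev[y][x]                  # dI/dt
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None
    # Solve [[sxx, sxy], [sxy, syy]] v = -[sxt, syt] in closed form.
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy


def median_flow(prev, curr, jx, jy, radius=2, half=3):
    """Median of the per-point flows in a (2*radius+1)^2 patch around a joint."""
    flows = []
    for y in range(jy - radius, jy + radius + 1):
        for x in range(jx - radius, jx + radius + 1):
            v = lucas_kanade_window(prev, curr, x, y, half)
            if v is not None:
                flows.append(v)
    if not flows:
        return None
    return median(v[0] for v in flows), median(v[1] for v in flows)


# Synthetic check: a smooth pattern shifted right by one pixel.
def pattern(x, y):
    return x * x + x * y + y * y


prev_frame = [[pattern(x, y) for x in range(24)] for y in range(24)]
curr_frame = [[pattern(x - 1, y) for x in range(24)] for y in range(24)]
vx, vy = median_flow(prev_frame, curr_frame, 12, 12)
print(round(vx, 2), round(vy, 2))  # close to a flow of one pixel in x
```

The default half=3 corresponds to the 7 × 7 window mentioned in the text. The median is taken per component, which is one simple way to suppress outlying per-point flows; a pyramidal variant would repeat the same solve at each pyramid level.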

3.2 Region Descriptors


Region descriptors such as Local Binary Patterns (LBP)6 describe the edge information and the textural content in a local region and can be very effective for region-based image matching. Many efficient image descriptors exist, such as SIFT, ORB and HOG, but they generally assume that the images are of high resolution. The LBP is very effective in describing an image region in spite of low resolution and interlacing. The local binary pattern is an image coding scheme which brings out the textural features in a region. For representing a joint region and associating a joint in successive frames, the texture of the region plays a vital part in addition to the edge information. The LBP considers a local neighborhood of 8 × 8 in a joint region and labels the neighborhood pixels by either a 1 or 0 based on the center pixel value. The coded value representing this local region is then the decimal representation of the neighborhood labels taken in a clockwise manner. Thus, for every pixel within the joint region, a coded value is generated which represents the underlying texture. The LBP operator is defined as

\[
LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad
s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases} \tag{3}
\]

where $P$ is the number of points around the local neighborhood, $R$ is its radius, $g_c$ is the gray value of the center pixel and $g_p$ is the gray value of the $p$-th neighbor. The textural representation of the joint region is then the histogram of these LBP-coded values. For our purposes, we use $P = 8$ with $R = 1$, which reduces to a local region of size 8 × 8. The matching between two joint regions is then given by the Chi-squared metric6

\[
\chi^2(f_1, f_2) = \sum_{b} \frac{\left( f_1(b) - f_2(b) \right)^2}{f_1(b) + f_2(b)} \tag{4}
\]

where $f_1$, $f_2$ are the LBP descriptors (histograms with bin index $b$) corresponding to a certain joint in successive frames.
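
The descriptor and the matching metric above can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the plain 256-bin LBP histogram for P = 8, R = 1, assumes a particular clockwise neighbor order starting at the top-left pixel, and skips empty bins in the Chi-squared sum to avoid division by zero. The function names and the toy images flat and ramp are hypothetical.

```python
def lbp_code(img, x, y):
    """LBP code for P = 8, R = 1 at pixel (x, y) of a grayscale image."""
    center = img[y][x]
    # ASSUMPTION: neighbours are visited clockwise, starting at the top-left.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for p, (dx, dy) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:  # s(g_p - g_c) = 1 when g_p >= g_c
            code += 1 << p                 # weight 2^p
    return code


def lbp_histogram(img, x0, y0, x1, y1):
    """Normalised 256-bin histogram of LBP codes over [x0, x1) x [y0, y1)."""
    hist = [0.0] * 256
    for y in range(y0, y1):
        for x in range(x0, x1):
            hist[lbp_code(img, x, y)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


def chi_squared(f1, f2):
    """Chi-squared distance between two histograms; empty bins are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(f1, f2) if a + b > 0)


flat = [[5] * 6 for _ in range(6)]                # constant brightness
ramp = [[x for x in range(6)] for _ in range(6)]  # brightness grows left to right
h_flat = lbp_histogram(flat, 1, 1, 5, 5)
h_ramp = lbp_histogram(ramp, 1, 1, 5, 5)
print(chi_squared(h_flat, h_flat), chi_squared(h_flat, h_ramp))  # prints: 0.0 2.0
```

Identical regions give a distance of 0, while two normalised histograms with no bins in common give the maximum value of 2, so smaller values indicate a better match between joint regions.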

3.3 Kalman Filter


The Kalman filter17, 18 is a minimum mean squared error estimator which estimates the true state of a system in an iterative procedure, where each iteration incorporates a noisy measurement taken at a time instant. The underlying model of the filter is a set of equations having a state-space representation and is given by
\[
x_{k+1} = A x_k + q_k ; \qquad z_k = H x_k + r_k \tag{5}
\]

where $x_k$ is the state vector at instant $k$, $A$ is the transition matrix, $H$ is the measurement matrix and $z_k$ is the measurement vector. $q_k$ and $r_k$ are random variables generated from a white noise process with co-variances given by $Q = E[q_k q_k^T]$ and $R = E[r_k r_k^T]$. We consider a prior estimate of the state $\hat{x}_k^-$, obtained from the knowledge of the system, and a posterior estimate of the state $\hat{x}_k$, obtained after knowing the current measurement $z_k$. The error $e_k$ is then defined as the difference between the true state and the posterior state, $e_k = x_k - \hat{x}_k$, and $P_k = E[e_k e_k^T]$ is the error co-variance matrix at time instant $k$. To obtain the true value of a response (or state) generated by a process or system, the iterative procedure first gets a prior estimate $\hat{x}_k^-$ at instant $k$ from the posterior estimate $\hat{x}_{k-1}$ at instant $k - 1$. Then, using the measured value of the response $z_k$, we compute the innovation or measurement residual $z_k - H\hat{x}_k^-$ and use this to obtain the posterior estimate $\hat{x}_k = \hat{x}_k^- + K_k (z_k - H\hat{x}_k^-)$. The Kalman gain $K_k$ at instant $k$ is given by

\[
K_k = P_k^- H^T \left( H P_k^- H^T + R \right)^{-1} \tag{6}
\]

where $P_k^- = E[e_k^- {e_k^-}^T]$, $e_k^- = x_k - \hat{x}_k^-$ and $P_k = (I - K_k H) P_k^-$. This iterative procedure can be divided into two stages: time update (prediction stage) and measurement update (correction stage). Thus, the Kalman filter can be thought of as a process which estimates the state at one instant and then obtains feedback in the form of noisy measurements of the response of the system.
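
The two stages can be written out directly from the equations above. The following Python sketch is a plain-list illustration of one prediction and one correction, assuming a constant-velocity transition matrix for a state [x, y, vx, vy] and a measurement matrix that observes only the position. The helper names (matmul, inv2, predict, correct) and the toy numbers are hypothetical and are not taken from the implementation behind this paper.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def add(a, b):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(a, b)]


def sub(a, b):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(a, b)]


def inv2(m):
    """Inverse of a 2x2 matrix (the innovation co-variance is 2x2 here)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def predict(x_post, P_post, A, Q):
    """Time update: prior state and prior error co-variance."""
    x_prior = matmul(A, x_post)
    P_prior = add(matmul(matmul(A, P_post), transpose(A)), Q)
    return x_prior, P_prior


def correct(x_prior, P_prior, z, H, R):
    """Measurement update using the Kalman gain K = P- H^T (H P- H^T + R)^-1."""
    S = add(matmul(matmul(H, P_prior), transpose(H)), R)
    K = matmul(matmul(P_prior, transpose(H)), inv2(S))
    innovation = sub(z, matmul(H, x_prior))
    x_post = add(x_prior, matmul(K, innovation))
    n = len(P_prior)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P_post = matmul(sub(identity, matmul(K, H)), P_prior)
    return x_post, P_post


dt = 1.0
A = [[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]]  # constant velocity
H = [[1, 0, 0, 0], [0, 1, 0, 0]]                                # observe (x, y) only
Q = [[0.0] * 4 for _ in range(4)]   # no process noise in this toy run
R = [[1.0, 0.0], [0.0, 1.0]]
x0 = [[0.0], [0.0], [1.0], [2.0]]   # column vector [x, y, vx, vy]
P0 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

x_prior, P_prior = predict(x0, P0, A, Q)                          # predicts (1, 2)
x_post, P_post = correct(x_prior, P_prior, [[4.0], [2.0]], H, R)  # measured (4, 2)
print(x_post)
```

In this toy run the corrected position lies between the predicted and the measured position, and the velocity estimate is pulled toward the measurement through the cross terms of the prior co-variance.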
Fig 3: Joint tracking algorithm using Kalman filter.

The recursive version of the Kalman filter can also be used for tracking purposes, and in the literature it has been widely applied for tracking points in video sequences. In the proposed algorithm, we use the Kalman filter to track a specific body joint across the scene. This is done by setting the state of the process (which in this case is the human body movement) as the $(x, y)$ coordinates of the joint along with its velocity $(v_x, v_y)$ to get a state vector $x_k \in \mathbb{R}^4$. The measurement vector $z_k = [x_o, y_o] \in \mathbb{R}^2$ will be provided by the coarse estimates obtained by using a human body pose estimation algorithm or point light software. By approximating the motion of a joint in a small time interval by a linear function, we can design the transition matrix $A$ so that the next state is a linear function of the previous states. As done by Kohler,9 to account for the non-constant velocity often associated with accelerating image structures, we use the process noise co-variance matrix $Q$ defined as

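\[
Q = \frac{a^2 \Delta t}{6}
\begin{bmatrix}
2(\Delta t)^2 & 0 & 3\Delta t & 0 \\
0 & 2(\Delta t)^2 & 0 & 3\Delta t \\
3\Delta t & 0 & 6 & 0 \\
0 & 3\Delta t & 0 & 6
\end{bmatrix} \tag{7}
\]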

where $a$ is the acceleration and $\Delta t$ is the time step determined by the frame rate of the camera. This design of the Kalman filter suits our scheme well, as any small non-linearity in the sub-trajectory can be accounted for as a non-constant velocity of the joint region. The modification of the Kalman filter recursive algorithm used for joint tracking is shown in Figure 3. As shown in the figure, the measurement is obtained from the optical flow estimate. There are a couple of scenarios which need to be tackled in order to use the optical flow as a reliable measurement vector. In the first scenario, the optical flow estimate falls in the elliptical search region computed during the prediction phase; this confirms the correctness of the optical flow and makes the optical flow estimate a suitable measurement vector. The elliptical search region is computed by using the posterior state and the predicted state as the two foci of an ellipse and computing the major and minor axes using the possible error values from the prior error co-variance matrix.4 In the second scenario, the optical flow estimate does not fall in the search region, which indicates that the optical flow estimate is noisy and is not suitable as a measurement. This signals the end of the current linear sub-trajectory and the beginning of the next linear sub-trajectory, where the associated Kalman filter must be re-initialized to track the optical flow matches in the next sub-trajectory.

Fig 4: Block schematic of tracking.
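
To make the process noise design and the search-region test more concrete, the following Python sketch builds a process noise co-variance for a constant-velocity state [x, y, vx, vy] driven by white-noise acceleration, and checks whether a point falls inside an ellipse whose foci are the previous posterior position and the predicted position. The exact rule for the ellipse axes is not spelled out here, so the sketch assumes a semi-minor axis of k standard deviations of the larger prior position variance. That rule, the parameter k, and the names process_noise and in_elliptical_gate are assumptions made for illustration only.

```python
import math


def process_noise(a, dt):
    """Process noise co-variance for the state [x, y, vx, vy].

    Per axis this equals a^2 * [[dt^3/3, dt^2/2], [dt^2/2, dt]], written
    with the common factor a^2 * dt / 6 pulled out.
    """
    s = a * a * dt / 6.0
    return [
        [s * 2 * dt * dt, 0.0, s * 3 * dt, 0.0],
        [0.0, s * 2 * dt * dt, 0.0, s * 3 * dt],
        [s * 3 * dt, 0.0, s * 6, 0.0],
        [0.0, s * 3 * dt, 0.0, s * 6],
    ]


def in_elliptical_gate(point, posterior_xy, predicted_xy, P_prior, k=3.0):
    """True when `point` lies inside the elliptical search region.

    ASSUMPTION: the semi-minor axis is k standard deviations of the larger
    prior position variance; the foci are the two given positions.
    """
    c = math.dist(posterior_xy, predicted_xy) / 2.0       # half focal distance
    b = k * math.sqrt(max(P_prior[0][0], P_prior[1][1]))  # semi-minor axis
    a = math.hypot(b, c)                                  # semi-major axis
    # Inside an ellipse, the distances to the two foci sum to at most 2a.
    total = math.dist(point, posterior_xy) + math.dist(point, predicted_xy)
    return total <= 2.0 * a


Q = process_noise(a=1.0, dt=2.0)
P_prior = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
inside = in_elliptical_gate((2.0, 1.0), (0.0, 0.0), (4.0, 0.0), P_prior)
outside = in_elliptical_gate((10.0, 0.0), (0.0, 0.0), (4.0, 0.0), P_prior)
print(inside, outside)  # prints: True False
```

An optical flow match that passes this test would be used as the measurement; one that fails it would mark the end of the current sub-trajectory.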

4 Proposed Framework
The proposed tracking scheme consists of two main stages: a) Kalman tracking of the optical flow matches on a sub-trajectory and b) reinitialization of the Kalman tracker using a region-based match. In the overall schematic shown in Figure 4, the first step is to compute the foreground/background model. The foreground mask provides us with an estimate of the global velocity to initialize/re-initialize the Kalman tracker associated with a joint. As we traverse each time step along the sub-trajectory, each joint region is described by a uniform LBP histogram. The coarse estimates of the joint location are provided by the Point Light Software. To demonstrate the tracking ability of the framework, we use the coarse estimated points at sub-trajectory boundaries to get a finer region-based estimate of the joint location.

Fig 5: Illustration of elliptical search regions before tracking and joint location estimates after tracking. The coarse pose estimates are represented by purple color in each frame. The search regions and the finer joint estimates are given as shoulder (blue), elbow (green), wrist (red), waist (cyan), knee (yellow) and ankle (pink). In frame 4, region-based matching is initiated and the corresponding tracker is re-initialized. (a) Coarse joint location estimated in frame 1. (b) Elliptical search region in frame 2. (c) Fine estimates of joint location in frame 2 after tracking. (d) Elliptical search region in frame 4 (wrist joint tracker is reinitialized). (e) Finer estimates of joint locations in frame 4 after tracking.

Fig 6: Illustration of elliptical search regions and fine joint estimates in certain frames when tracker is only corrected. (a) Elliptical search region for frame 9. Here, the ankle joint undergoes region matching and since the constraint is satisfied, the corresponding tracker is re-initialized. (b) Finer joint location estimations after the tracking scheme. (c) Elliptical search regions $S_{op}(t)$ and $S_{reg}(t)$ in frame 11. Here, for the knee joint, the constraint is not satisfied and the tracker is only corrected by the coarse joint location estimate given by the purple point. (d) Finer joint location estimates after the tracking scheme. (e) Elliptical search regions $S_{op}(t)$ and $S_{reg}(t)$ for both the knee and the ankle joints in frame 13. (f) Finer joint location estimate where the knee and ankle trackers are corrected by coarse joint location estimates.

The algorithm is given below:
1. Extract the first frame (time instant $t = 1$) of the sub-trajectory. Compute dense optical flow within the foreground region to get the global velocity estimate (median flow). Initialize/re-initialize the Kalman filter with the coarse joint location $(x_{cos}, y_{cos})$ or the finer region-based estimate $(x_{reg2}, y_{reg2})$ and the global velocity. The state of the tracker for each body joint is then $x_t = [x, y, v_x, v_y]$, where $(v_x, v_y)$ is the joint velocity, which is set to the global flow velocity estimate. This will be considered as the corrected state $\hat{x}_{t-1}$ at time $t = 1$. Update $t \leftarrow t + 1$ and predict the state (get prior state) $\hat{x}_t^-$ of the Kalman filter. Using the predicted state $\hat{x}_t^-$, the posterior state $\hat{x}_{t-1}$ and the a priori error co-variance $P_t^-$, estimate the elliptical region $S_{op}(t)$ where the joint location is likely to fall.
2. Extract the next frame of the sub-trajectory. Find the optical flow match $(x_{op}, y_{op})$ of each joint between instances $t$ and $t - 1$. Also compute the dense optical flow and the global velocity of the foreground region. Check if the optical flow joint location estimate falls within the predicted elliptical search region. If yes, go to Step 3. Else go to Step 4.
3. Using the joint location estimates provided by the optical flow as the measurement vector $z = [z_x, z_y]$, perform the correction phase of the filter to get the posterior state $\hat{x}_t$. Update $t \leftarrow t + 1$. Set the joint velocity as the global velocity and predict the state (get prior state) $\hat{x}_t^-$ and the elliptical search region. Repeat Steps 2 and 3 until the end boundary of the sub-trajectory, denoted by an optical flow mismatch.
4. Compute the joint location estimate $(x_{reg}, y_{reg})$ within the Kalman filter predicted search region using LBP-based region matching. This estimate is given by

\[
\arg\min_{p \in S_{op}(t)} \chi^2(f_j, f_p)
\]

where $f_j$ is the joint descriptor updated in the previous time instant and $f_p$ is the region descriptor computed at the pixel $p = (x_{reg}, y_{reg})$ within the elliptical search region $S_{op}(t)$. Using this estimate and the coarse joint location estimate, predict the new elliptical search region $S_{reg}(t)$. Since the new elliptical search region can be very large, a constraint that $S_{reg}(t)$ be no larger than $S_{op}(t)$ is used. Re-initialization occurs only if this constraint is satisfied. If it is satisfied, go to Step 5, else go to Step 6.
5. Compute the LBP-based region matching estimate given by $\arg\min_{p \in S_{reg}(t)} \chi^2(f_j, f_p)$, where $p = (x_{reg2}, y_{reg2})$. Use this finer estimate of the joint location to re-initialize the Kalman tracker associated with that particular joint. Update the joint velocity as the global velocity and predict the state (get prior state) $\hat{x}_t^-$ and the elliptical search region $S_{op}(t)$. Go to Step 2.
6. Use the coarse joint location estimates $(x_{cos}, y_{cos})$ as the measurement vector $z = [z_x, z_y]$ to correct the corresponding tracker.
7. Continue until all the frames of the sequence have been processed.
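
The branching in Steps 2 to 6 and the region matching of Steps 4 and 5 can be sketched as below. This is only an illustration of the control flow for one joint in one frame: the size comparison between the two search regions is assumed to be a scalar comparison (for example of ellipse areas), and the names best_region_match, choose_update and descriptor_at, as well as the toy descriptors, are hypothetical.

```python
def chi_squared(f1, f2):
    """Chi-squared distance between two histograms; empty bins are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(f1, f2) if a + b > 0)


def best_region_match(candidates, f_joint, descriptor_at):
    """Steps 4 and 5: candidate pixel whose descriptor is closest to f_joint."""
    return min(candidates, key=lambda p: chi_squared(f_joint, descriptor_at(p)))


def choose_update(flow_in_gate, sreg_size, sop_size):
    """Steps 2 to 6: decide how the tracker of one joint is updated.

    ASSUMPTION: the constraint between the two search regions is tested by
    comparing a scalar size (e.g. the ellipse area) of each region.
    """
    if flow_in_gate:
        return "correct_with_flow"    # Step 3: optical flow is the measurement
    if sreg_size <= sop_size:
        return "reinitialize"         # Step 5: second region match, then restart
    return "correct_with_coarse"      # Step 6: coarse estimate is the measurement


# Toy descriptors for three candidate pixels inside a search region.
descriptors = {
    (0, 0): [1.0, 0.0],
    (1, 0): [0.5, 0.5],
    (2, 0): [0.2, 0.8],
}
f_joint = [0.25, 0.75]
match = best_region_match(list(descriptors), f_joint, descriptors.get)
print(match)                              # prints: (2, 0)
print(choose_update(False, 40.0, 60.0))   # prints: reinitialize
print(choose_update(False, 90.0, 60.0))   # prints: correct_with_coarse
```

Keeping the decision separate from the filter update makes it easy to see that a failed constraint never restarts the tracker; it only feeds the coarse estimate into the usual correction step.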
We provide sample illustrations of the tracking scheme in Figures 5 and 6. In Figure 5, for frame 2, the optical flow matches of all the joints fall in their respective predicted elliptical search regions $S_{op}(t)$, and these matches correct the corresponding joint trackers. In frame 4, all of the joints except the wrist are still in the sub-trajectory, as there are no optical flow mismatches. For the wrist, however, the optical flow match does not fall into its predicted elliptical search region $S_{op}(t)$. Thus, within $S_{op}(t)$, an LBP-based region match is found. Using this match and the coarse estimated joint location, another elliptical region $S_{reg}(t)$ is obtained. Again, on $S_{reg}(t)$, the region-based match is obtained. This match re-initializes the Kalman tracker, as the constraint is satisfied, and signals the beginning of another sub-trajectory. This is not the case with the knee and ankle joints in frames 11 and 13 (Figure 6), where the elliptical region $S_{reg}(t)$ is much larger than $S_{op}(t)$. The constraint is violated either when the coarse joint location estimates are noisy (sometimes lying not on the body but on the background) or when the region-based LBP match fails and latches onto an edge in the background. In the proposed technique, we tackle this issue by using the coarse joint location estimate to correct the existing tracker rather than re-initialize it. This ensures that the tracker does not get caught on the background edges and keeps track only of the corresponding body joint.

5 Results and Experiments


The proposed tracking scheme has been tested on a private dataset provided by the Air Force Institute of Technology, Dayton OH. It consists of 12 subjects walking along an outdoor track across the face of a building with a staircase in the front; each subject performs the walk twice, giving a total of 24 video sequences. Each subject not only wears different colored clothing but also wears a coat vest on the second try during data capture. These video sequences are captured simultaneously using two cameras focused on the same area. So, when each sequence is divided into 5 phases A - E, the sequence for each phase is selected from either the left camera or the right camera depending on what area is being focused on for analysis. Thus, we do not consider which camera the sequence has been shot from. Each phase, illustrated in Figure 7, is described as follows.
Phase A: Subject is walking clockwise around the track. The frames of interest are of the subject walking on the cross over the platform.
Phase B: Subject is walking clockwise around the track. The frames of interest are of the subject walking on the grass after the ramp.
Phase C: Subject is walking clockwise around the track. The frames of interest are of the subject walking on the grass after the ramp on the side of the track away from the building.
Phase D: Subject is walking counter-clockwise around the track. The frames of interest are of the subject walking on the grass before the ramp.
Phase E: Subject is walking counter-clockwise around the track. The frames of interest are of the subject walking on the grass along the ramp.

Fig 7: Illustration of the scene and division of video sequence into five phases. (a) Background image. (b) Phase A. (c) Phase B. (d) Phase C. (e) Phase D. (f) Phase E.

5.1 Challenges of the dataset and evaluation strategies


In this manuscript, we report results obtained by testing the proposed tracking scheme on
all sequences in all the phases. Although the dataset was captured to analyze the difference in the
gait of an individual when wearing or not wearing a coat vest, it poses a good
number of challenges for testing the precision of the proposed tracking scheme. The dataset comes
with human pose estimates at every frame, obtained by the Point Light Software. These
pose estimates give us coarse joint location estimates which are noisy and inaccurate with regard
to the application of gait tracking at hand. The proposed tracking scheme makes use of the coarse
joint location estimates to give finer estimates of the joint location. The effect of the tracking
scheme on gait analysis algorithms is beyond the scope of this paper; we focus mainly on
the joint tracking aspect and the smoothness of the trajectory. One main challenge in this dataset
is the very low resolution imagery, where a 17 × 17 neighborhood around a single joint, say a
shoulder joint, will capture the entire upper body of the individual. This is illustrated in Figure
1a. The other challenge is the interlacing effect present in the video, which can render edge-based
region descriptors ineffective and affect the matching process. Apart from these global challenges,
there are certain characteristics associated with each phase which sometimes introduce a challenging
scenario for tracking. Some of these characteristics are
Phase A: There can be partial or complete occlusion of the lower-body joints, such as the knee
and ankle, due to the platform railings and staircase. The lowest resolution of the person is
captured in this phase, as the person is at the farthest distance from the camera.
Phase B: There is complete occlusion of the ankle by the tall grass, so joint region descriptors
cannot be computed. Moreover, the coarse estimates provided by the Point Light
Software are also very noisy and do not give robust estimates of the joint.
Phase C: The image region containing the person is of slightly higher resolution as the person is
closer to the camera. No occlusion of the ankle joint by the grass was noticed, which gives
much cleaner data for the tracking scheme.
Phase D: There is complete occlusion of the ankle due to the tall grass, and the same problems
as in phase B exist as well. The only difference is that the person is walking in the opposite
direction.
Phase E: Same challenges as in phase A, the difference being that the person is walking
in the opposite direction.
We set equal neighborhood sizes of 17 × 17 for each joint region and set a constant acceleration
$a = 0.1$ pixels/frame$^2$ in the process noise co-variance design of the corresponding Kalman
filter. To illustrate the effectiveness of the tracking scheme, we provide three different types of
measures and graphs which explain different aspects of tracking efficiency.
1. Co-variance-Based Trajectory Measure: A statistical measure of how close the
tracked joint locations are to the coarse estimates of the joint location for each sequence
associated with a particular subject. This statistical metric [19] is given by

$$d(K, K_m) = \sqrt{\sum_{i=1}^{n} \left(\log \lambda_i(K, K_m)\right)^2} \qquad (8)$$

where $K \in \mathbb{R}^{3 \times 3}$ is the co-variance matrix of the tracked points, $K_m \in \mathbb{R}^{3 \times 3}$ is the co-variance matrix
of the coarse joint locations, $\lambda_i$ is the $i$th eigenvalue associated with $|\lambda K - K_m| = 0$ and
$n$ is the number of eigenvalues. The lower the value, the closer the tracked points are
to the coarse joint location estimates. Although this measure does not provide us with the
precision of the tracking scheme, it indicates whether the tracked joint trajectory is
located within the spatio-temporal neighborhood of the coarse joint trajectory.
2. Multiple Object Tracking Precision/Accuracy (MOTP/MOTA): The MOTP/MOTA [20] metric is a widely used
efficiency measure for multiple-object tracking mechanisms, where the
MOTP/MOTA gives the precision and accuracy of the tracker by considering all the detected
and tracked objects. We use an implementation of CLEAR-MOT provided by the authors [21] to obtain
statistics such as the false positive rate, false negative rate, MOTA
and MOTP scores. These statistics are computed as follows

(a) Multiple Object Tracking Precision (MOTP): It refers to the closeness of a tracked
point location to its true location (given as ground truth). Here, we measure the closeness
by measuring the overlap between the neighborhood region occupied by the tracked
point location and the ground truth. The higher the value of this overlap, the more precise is
the estimated location of the point. This is given by

$$\mathrm{MOTP} = \frac{\sum_{i,t} o_t^i}{\sum_{t} c_t} \qquad (9)$$

where $o_t^i$ is the amount of overlap for the joint $i$ at frame $t$ of a sequence and $c_t$ is
the number of correct correspondences at frame $t$. Only those joints which satisfy the
criterion $o_t^i > T$ are included in the above equation.


(b) Multiple Object Tracking Accuracy (MOTA): It gives the accumulated accuracy in
terms of the fraction of the tracked joints matched correctly without any misses or
mismatches. It is given by

$$\mathrm{MOTA} = 1 - \frac{\sum_{t} (m_t + fp_t + mme_t)}{\sum_{t} g_t} \qquad (10)$$

where $m_t$, $fp_t$ and $mme_t$ are the number of misses, false positives and mismatches
respectively and $g_t$ is the number of points present at frame $t$. Thus the false negative
rate, false positive rate and rate of mismatches can be computed as
$\frac{\sum_t m_t}{\sum_t g_t}$, $\frac{\sum_t fp_t}{\sum_t g_t}$ and $\frac{\sum_t mme_t}{\sum_t g_t}$.

These statistics evaluate the tracking algorithm in terms of the overall accuracy and precision
achieved by accumulating the measures of all the joints of interest per video sequence.
3. Precision/Recall: The precision and recall for the multiple object tracking are computed as the
overall MOTP and MOTA scores. The precision, in contrast to its theoretical definition, is
computed by accumulating the overlaps $o_t^i$ and correct correspondences $c_t$ over all frames of all
the sequences and taking the ratio between them. The recall is computed by accumulating the
total number of misses, false positives and mismatches from all frames of all the sequences
and using the formula for MOTA. The precision and recall are computed for every possible
parameter set of the tracking scheme so that the best combination can be found for each
phase.
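As a rough illustration of how Equations 9 and 10 accumulate over a sequence, the following Python sketch computes MOTP and MOTA from per-frame counts. The `FrameStats` container and its field names are hypothetical and are not taken from the CLEAR-MOT implementation used in the experiments. The sketch assumes that the matching between tracked and ground-truth joints, including the overlap threshold T, has already been applied.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FrameStats:
    """Per-frame bookkeeping for one sequence (illustrative names only)."""
    overlaps: List[float]   # o_t^i for every matched joint i whose overlap exceeds T
    misses: int             # m_t
    false_positives: int    # fp_t
    mismatches: int         # mme_t
    ground_truth: int       # g_t, number of joints present in the frame


def motp(frames: Iterable[FrameStats]) -> float:
    """Sum of overlaps divided by the number of correct correspondences."""
    frames = list(frames)
    total_overlap = sum(sum(f.overlaps) for f in frames)
    total_matches = sum(len(f.overlaps) for f in frames)  # sum over t of c_t
    return total_overlap / total_matches if total_matches else 0.0


def mota(frames: Iterable[FrameStats]) -> float:
    """One minus the accumulated error count over the accumulated ground truth."""
    frames = list(frames)
    errors = sum(f.misses + f.false_positives + f.mismatches for f in frames)
    total_gt = sum(f.ground_truth for f in frames)
    return 1.0 - errors / total_gt if total_gt else 0.0
```

Passing the frames of every sequence in a phase to the same two functions gives the accumulated precision and recall described in item 3.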

Fig 8: Statistical measures (left) and percentage of sequences into different ranges (right) obtained for phase A. Panels: (a) Kalman filtered coarse joint location estimates; (b) corresponding range percentage for Kalman tracking scheme; (c) proposed tracking scheme; (d) corresponding range percentage for the proposed tracking scheme.

5.2 Covariance-Based Trajectory Analysis
The co-variance based trajectory measure is computed between the tracked points and the coarse
estimated points for each phase. We consider two variations of the tracking scheme: a) one which
simply applies the Kalman filter to the coarse estimates directly; b) the proposed tracking
scheme, which uses the image information in determining the fine joint location estimates. Thus,
we compute the trajectory measure of the two tracking schemes for each video sequence and for
each joint. Figures 8, 9, 10, 11 and 12 provide the tables containing the trajectory measures for
each phase and for each tracking scheme.

Fig 9: Statistical measures (left) and percentage of sequences into different ranges (right) obtained for phase B. Panels: (a) Kalman filtered manual point annotation; (b) corresponding range percentage for Kalman tracking scheme; (c) proposed tracking scheme; (d) corresponding range percentage for the proposed tracking scheme.
Fig 10: Statistical measures (left) and percentage of sequences into different ranges (right) obtained for phase C. Panels: (a) Kalman filtered coarse joint location estimates; (b) corresponding range percentage for Kalman tracking scheme; (c) proposed tracking scheme; (d) corresponding range percentage for the proposed tracking scheme.
Fig 11: Statistical measures (left) and percentage of sequences into different ranges (right) obtained for phase D. Panels: (a) Kalman filtered coarse joint location estimates; (b) corresponding range percentage for Kalman tracking scheme; (c) proposed tracking scheme; (d) corresponding range percentage for the proposed tracking scheme.
Fig 12: Statistical measures (left) and percentage of sequences into different ranges (right) obtained for phase E. Panels: (a) Kalman filtered coarse joint location estimates; (b) corresponding range percentage for Kalman tracking scheme; (c) proposed tracking scheme; (d) corresponding range percentage for the proposed tracking scheme.
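The trajectory measure of Equation 8 can be sketched in a few lines of Python with NumPy. This is a minimal illustration and not the authors' code. It assumes each trajectory is an array of (x, y, t) samples, which is our reading of the "spatio-temporal" description, and that both co-variance matrices are non-singular. The function names are hypothetical.

```python
import numpy as np


def trajectory_covariance(points):
    """3x3 co-variance of an (n, 3) array of (x, y, t) samples (assumed layout)."""
    return np.cov(np.asarray(points, dtype=float), rowvar=False)


def covariance_distance(K, K_m):
    """sqrt(sum_i log^2(lambda_i)), with lambda_i solving |lambda*K - K_m| = 0."""
    K = np.asarray(K, dtype=float)
    K_m = np.asarray(K_m, dtype=float)
    # The generalized eigenvalues of the pair (K_m, K) are the eigenvalues of K^{-1} K_m.
    # For positive definite inputs they are real and positive; .real drops round-off.
    eigvals = np.linalg.eigvals(np.linalg.solve(K, K_m)).real
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))
```

A value of zero means the two co-variance matrices coincide, and swapping the two arguments leaves the value unchanged, since the eigenvalues are inverted and their squared logarithms are the same.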
We can empirically determine ranges of the trajectory discrepancy measure over
which the finer estimates obtained by the tracking scheme can be judged acceptable or not; this
is done by visual inspection of the trajectory plots for each joint in each sequence. A sample
of the trajectory plots computed for subject 11 in phase A is shown in Figure 13. The ranges and the
corresponding acceptance levels, with explanations, are given below
Trajectory measure $d \in [0, 1)$: This denotes that the finer estimates obtained by a tracking
scheme are much closer to the coarse joint location estimates than required. In this scenario,
the finer estimates lean towards the noisy, discrete coarse estimates. Although the
tracking scheme gives better estimates than the coarse pose estimates, the finer estimates are
slightly noisy in nature and are not very smooth.
Trajectory measure $d \in [1, 3)$: This range of values is considered highly acceptable,
even though the estimates seem farther from the noisy coarse pose estimates. By observation,
we see that the finer estimates of the joint trajectory are smoother than the coarse joint
estimates and in fact resemble the actual sinusoidal trajectory of the joint.
Trajectory measure $d \in [3, 5)$: This range of values can be considered semi-acceptable:
the finer joint trajectory estimates obtained from the tracking are smooth but are
slightly far from the noisy coarse estimates. This is because either the coarse pose
estimates are noisy, or the scheme tracks a different point on the same body joint region and
maintains the wrong track. Sometimes the estimated fine trajectory might miss the joint or track some
other point in a sub-trajectory, and the corresponding trajectory measure falls in this range as
well.
Trajectory measure $d \in [5, \infty)$: This corresponds to wayward tracking by the tracking
scheme. This happens mainly because the coarse joint location estimates contain a large
error due to the failure of the human pose estimation algorithm. In this case, the finer joint
location estimates and the coarse estimates are drastically different.
Using these pre-defined ranges, we compute the percentage of sequences whose trajectory discrepancy
measure falls within each range for the two schemes mentioned earlier. For phase
A, a large percentage of sequences, around 65-75%, falls within the first range
$d \in [0, 1)$ for the Kalman filtered tracking scheme. As mentioned before, although this measure is
small, this tracking scheme gives more precedence to the coarse points and assumes
that these coarse points are noise free. Thus, it yields an acceptable estimated trajectory but not
a smooth one as required for gait analysis. For the proposed scheme, however, around 65-85%
of the sequences lie in the most acceptable region $d \in [1, 3)$, with the exception of the ankle joint,
where only 20% fall in it while the rest fall in the region $d \in [0, 1)$. This latter region is still
acceptable as far as tracking is concerned. Around 5-10% of sequences for the shoulder, elbow, wrist and hip
joints fall in the semi-acceptable region $d \in [3, 5)$.
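The range percentages reported for each joint amount to a simple histogram over the four half-open intervals defined above. A minimal Python sketch follows; the function names and the sample values in the usage example are made up for illustration and are not measurements from the dataset.

```python
from collections import Counter

# Half-open ranges of the trajectory measure d, as defined in the text.
RANGES = [
    ("[0, 1)", 0.0, 1.0),
    ("[1, 3)", 1.0, 3.0),
    ("[3, 5)", 3.0, 5.0),
    ("[5, inf)", 5.0, float("inf")),
]


def range_label(d):
    """Return the label of the range that contains the measure d."""
    for label, low, high in RANGES:
        if low <= d < high:
            return label
    raise ValueError("trajectory measure must be a non-negative number")


def range_percentages(measures):
    """Percentage of sequences whose measure falls in each range."""
    measures = list(measures)
    if not measures:
        raise ValueError("at least one trajectory measure is required")
    counts = Counter(range_label(d) for d in measures)
    return {label: 100.0 * counts[label] / len(measures) for label, _, _ in RANGES}
```

For example, `range_percentages([0.5, 1.2, 2.9, 3.0, 7.0])` assigns one of five values to the first range, two to the second, and one each to the last two.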
For phase B, the Kalman filtering scheme performs better, with most of the sequences falling
in the acceptable regions, divided equally between the ranges $d \in [0, 1)$ and $d \in [1, 3)$. Using
the proposed scheme, we get improved performance for the shoulder, knee and ankle joint and
comparable performances for the hip and knee joint. The elbow, however, has many sequences
in the semi-acceptable range $d \in [3, 5)$, with a couple of sequences falling in the bad range
$d \in [5, \infty)$ for the elbow, wrist and hip joints. This is mainly because the LBP descriptor of the wrist joint
was unable to capture the information, as there were not enough pixels for representation in this
low-resolution imagery. The bad region matching is also due to similar appearances between the
clothing and the background in this phase. Interestingly, although the ankle was occluded
by the grass for some sequences, the tracking scheme was able to pick up the ankle joint
from one of the coarse pose estimates and track it to a certain degree; thus, the
corresponding sequences fall in the acceptable regions.
All of the sequences in phase C fall in the acceptable regions, with around 65-100% falling in
the region $d \in [0, 1)$ for the Kalman filtered scheme. For the proposed scheme, these sequences are
distributed between the two acceptable regions, with the majority falling in the highly acceptable
region $d \in [1, 3)$. A similar distribution of the sequences is seen for phases D and E, with the proposed
scheme showing a larger number of sequences falling in the highly acceptable region for all the
joints.
Thus, we see that for all the phases, most of the sequences are distributed in the highly acceptable
region $d \in [1, 3)$, where gait analysis can be useful. This is also the region where the
estimated trajectories follow a smooth sinusoidal path. Some sequences, however, are distributed in
the region $d \in [0, 1)$ even with the proposed scheme, and will require post-processing of the fine
joint location estimates for gait analysis. This is because of the constraint of very low resolution
with interlacing effects, which makes region-based descriptor matching ineffective. When
the region matching fails, the proposed scheme becomes equivalent to the Kalman filtered tracking
scheme, thereby at least maintaining the joint track. This is useful because, in any region where
the region matching is effective, that portion of the track can be used to analyze the gait of an
individual.

5.3 MOTP/MOTA Analysis


We computed the MOTP, MOTA, false positive rate and false negative rate for each individual
sequence in each phase by setting the threshold T = 0.5, with the same acceleration parameter a = 0.1
and a neighborhood size of 17 × 17 for each body joint. The corresponding distributions in the
MOTA-MOTP space are shown in Figure 14, where the red stars are the sequences, labeled appropriately.
The Gaussian contours approximate the distribution of the sequences in the MOTP-MOTA
space. The more concentrated the distribution is towards the upper right corner, the
better the precision and accuracy of the tracking scheme. In Figure 14a, we see that all of the
sequences in phase A have moderately high precision and accuracy, with some achieving a high
accuracy of 90% with the corresponding precision being above 80%. However, two sequences
belonging to Subject 26 show a low accuracy of 60% or less with a precision of around 75%. This is
mainly because the hip and ankle joint tracks follow a different path from the ground
truth data. Another important factor contributing to the drop in accuracy for some sequences is
the noise in the ground truth data annotation provided by the Point Light Software.
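To make the role of the threshold T and the 17 × 17 neighborhood concrete, the sketch below scores the overlap between the window around a tracked joint and the window around its ground-truth location. It assumes an intersection-over-union normalization; the text only states that the overlap between the two neighborhood regions is measured, so the exact normalization used in the experiments may differ. The function names are hypothetical.

```python
def window_overlap(center_a, center_b, size=17):
    """Intersection-over-union of two axis-aligned size x size windows.

    Centers are (x, y) pixel coordinates. IoU is an assumed normalization.
    """
    dx = abs(center_a[0] - center_b[0])
    dy = abs(center_a[1] - center_b[1])
    # Equal-sized windows shifted by (dx, dy) intersect in a
    # (size - dx) x (size - dy) rectangle, clipped at zero.
    inter = max(0, size - dx) * max(0, size - dy)
    union = 2 * size * size - inter
    return inter / union


def is_correct_match(tracked, truth, size=17, threshold=0.5):
    """A joint counts as a correct correspondence when its overlap exceeds T."""
    return window_overlap(tracked, truth, size) > threshold
```

Under this assumption, a tracked point that is 3 pixels off along one axis still overlaps the ground-truth window by 0.7 and counts as a match at T = 0.5, whereas an offset of 10 pixels does not.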
For phase B, as shown in Figure 14b, most of the sequences have only a moderate precision of
around 70-75% and a moderate accuracy ranging from 50% to 75%. Some sequences belonging
to Subjects 3, 22 and 26 exhibit low accuracy of 50% or less. However, some sequences
belonging to Subjects 18 and 27 exhibit very high accuracy of 90% or more with good
precision of 80-85%. The wide distribution of the sequences in the MOTP-MOTA space is mainly
due to noisy ground truth annotations by the Point Light Software and the many occlusion-based
challenges present in this phase. Even in such a challenging scenario (with occlusions and
background similarities with interlacing), the tracking scheme performs moderately well.

Fig 13: Estimated fine joint trajectories by different schemes for subject 11 wearing a coat in phase A. Panels: (a) Shoulder joint; (b) Elbow joint; (c) Wrist joint; (d) Hip joint; (e) Knee joint; (f) Ankle joint.
Fig 14: Scatter plot showing where the sequences of each phase are distributed in the MOTP/MOTA space. Panels: (a) Phase A; (b) Phase B; (c) Phase C; (d) Phase D; (e) Phase E.

For phase C, although a few sequences show low accuracy, the majority of the
sequences have a moderate accuracy of around 60% or more. Some of the challenging scenarios
of phase B exist in this phase as well. However, because the person is captured at a slightly
better resolution, the tracking scheme performs much better in
phase C than in phase B.
Phases D and E, however, show many sequences with good accuracies of 75% or more. Similar
challenging scenarios exist, the difference being that the person moves counter-clockwise
around the track. Overall, for each phase, we notice that a considerable number
of sequences show good accuracies of 75% or more, with a minor portion exhibiting low
accuracies of 50% or less. Again, this is due to noise in the coarse joint location estimates
provided by the Point Light Software, which lowers the accuracy for some sequences. This noise
in fact contributes to the number of false positives, which may be incorrectly interpreted, thereby
reducing the accuracy during evaluation. However, for all the phases, a good precision
of 75% or more is achieved, and the tracking scheme is precise in providing us
with finer estimates of the joint location.

5.4 Precision/Recall for each body joint


The precision and recall are computed for each phase for particular values of the acceleration
parameter a in the Kalman filter, as illustrated in Figure 15. For phases A, C and D, we see
that the precision and recall reach their highest values of around 80% and 85% at an acceleration
value of a = 0.1. However, for phases B and E, an acceleration of a = 0.2 gives higher
values of precision and recall. This is due to differences in joint speed between
individuals, which suggests that an optimal value of the acceleration is required for each person.
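To show how the acceleration parameter a could enter the filter, the sketch below builds a process noise co-variance under a discrete white-noise acceleration model for a constant-velocity state. This model is an assumption chosen for illustration; this section does not restate the form of the process noise matrix used by the tracker. The state ordering [x, vx, y, vy], the default frame interval of one frame and the function names are likewise hypothetical.

```python
def process_noise_1d(accel, dt=1.0):
    """2x2 process noise for one image axis with state [position, velocity].

    Assumes an unmodeled acceleration of magnitude `accel` (pixels/frame^2)
    acting over one interval dt: Q = accel^2 * G * G^T with G = [dt^2/2, dt]^T.
    """
    var = accel ** 2
    return [
        [var * dt ** 4 / 4.0, var * dt ** 3 / 2.0],
        [var * dt ** 3 / 2.0, var * dt ** 2],
    ]


def process_noise_2d(accel, dt=1.0):
    """Block-diagonal 4x4 process noise for the assumed state [x, vx, y, vy]."""
    block = process_noise_1d(accel, dt)
    q = [[0.0] * 4 for _ in range(4)]
    for offset in (0, 2):           # one 2x2 block per image axis
        for r in range(2):
            for c in range(2):
                q[offset + r][offset + c] = block[r][c]
    return q
```

Under this model, raising a from 0.1 to 0.2 quadruples every entry of the matrix, so the filter trusts its linear motion prediction less and follows faster joint motion more readily.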

6 Conclusions and Future Work


We have proposed a body joint tracking algorithm for use in low-resolution imagery for outdoor
sequences. The algorithm is a combination of primitive but effective point tracking techniques
using optical flow and region-based matching using LBP, coupled with the learning ability
of the Kalman filter. Some joints such as the shoulder, elbow and hip are successfully tracked in
most of the sequences, along with the wrist joint. However, the knee and ankle joints have multiple
occurrences of re-initialization due to the mismatching of the optical flow caused by low-resolution
artifacts and interlacing effects. An important addition which we plan for future work is
to use the contextual relationship between the body joints. This crucial aspect is missing in the
proposed algorithm, which assumes that the movement of each joint is independent of the other joints,
a consequence of using an individual piece-wise tracking scheme for each joint.

Fig 15: Variation of precision and recall of tracking scheme with change in acceleration parameter. Panels: (a) Phase A; (b) Phase B; (c) Phase C; (d) Phase D; (e) Phase E.

References
1 H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Tracking multiple people under global appearance constraints," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 137-144, 2011.
2 J. Shao, S. Zhou, and R. Chellappa, "Tracking algorithm using background-foreground motion models and multiple cues," in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on, 2, pp. 233-236, 2005.
3 W.-L. Lu and J. Little, "Simultaneous tracking and action recognition using the pca-hog descriptor," in Computer and Robot Vision, 2006. The 3rd Canadian Conference on, pp. 66, 2006.
4 M. Kaaniche and F. Brémond, "Tracking hog descriptors for gesture recognition," in Advanced Video and Signal Based Surveillance, 2009. AVSS '09. Sixth IEEE International Conference on, pp. 140-145, 2009.
5 P. Bilinski, F. Brémond, and M. B. Kaaniche, "Multiple object tracking with occlusions using hog descriptors and multi resolution images," in Crime Detection and Prevention (ICDP 2009), 3rd International Conference on, pp. 1-6, 2009.
6 T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on 24(7), pp. 971-987, 2002.
7 J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1297-1304, 2011.
8 C.-H. Huang, E. Boyer, and S. Ilic, "Robust human body shape and pose tracking," in 3DV-Conference, 2013 International Conference on, pp. 287-294, 2013.
9 M. Kohler, Using the Kalman Filter to Track Human Interactive Motion: Modelling and Initialization of the Kalman Filter for Translational Motion, Forschungsberichte des Fachbereichs Informatik der Universität Dortmund, Dekanat Informatik, Univ., 1997.
10 Y. Yang and D. Ramanan, "Articulated pose estimation with flexible mixtures-of-parts," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1385-1392, June 2011.
11 V. Ramakrishna, T. Kanade, and Y. Sheikh, "Tracking human pose by tracking symmetric parts," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3728-3735, 2013.
12 V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive search space reduction for human pose estimation," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1-8, June 2008.
13 D. Ramanan, "Learning to parse images of articulated bodies," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, eds., pp. 1129-1136, MIT Press, 2007.
14 X. Burgos-Artizzu, D. Hall, P. Perona, and P. Dollár, "Merging pose estimates across space and time," in Proceedings of the British Machine Vision Conference, BMVA Press, 2013.
15 C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
16 B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'81, pp. 674-679, 1981.
17 T. Lacey, "Tutorial: The kalman filter," Georgia Institute of Technology.
18 G. Welch and G. Bishop, "An introduction to the kalman filter," 1995.
19 W. Förstner and B. Moonen, "A metric for covariance matrices," 1999.
20 K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The clear mot metrics," J. Image Video Process. 2008, pp. 1:1-1:10, Jan. 2008.
21 A. D. Bagdanov, A. Del Bimbo, F. Dini, G. Lisanti, and I. Masi, "Compact and efficient posterity logging of face imagery for video surveillance," 2012.