
Noname manuscript No.
(will be inserted by the editor)

3D tracking and classification system using a monocular camera

Stefanos Astaras · George Bardas · Sotirios Diamantas · Aristodemos Pnevmatikakis

Received: date / Accepted: date
Abstract This paper details a 3D tracking and recognition system using a single camera. The system is able to track and classify targets in outdoor and indoor scenarios, as long as they move (at least approximately) on a plane. The system first detects and validates targets and then tracks them in a state-space employing cylindrical models (horizontal and vertical position on the ground, their radius and height) utilising Particle Filters. The tracker fuses visual measurements that utilise the targets' foreground and colour models. Finally, the system classifies the tracked objects based on the visual metrics extracted by our algorithm. We have tested our model in an outdoor setting using humans and automobiles passing through the field of view of the camera at various speeds and distances. The results presented in this paper show the validity of our approach.

Keywords Visual tracking · 3D tracking · Visual measurements · Particle filters · Likelihood function · Fusion

Part of this work has been carried out in the scope of the EC co-funded projects ARGOS
(FP7-SEC-2012-1) and eWALL (FP7-610658).

Stefanos Astaras
Aalborg University, Center for TeleInFrastruktur
Fredrik Bajers Vej 7A,
9220 Aalborg, Denmark
E-mail: sast@ait.gr
George Bardas, Sotirios Diamantas, Aristodemos Pnevmatikakis
Athens Information Technology, Multimodal Signal Analytics
44, Kifisias Ave.,
15125 Marousi, Athens, Greece
E-mail: {sast,gbar,sodi,apne}@ait.gr

1 Introduction

Video 3D tracking and classification of targets is very important in security


applications, like critical infrastructure protection. There we need to know the
location of the target in the world coordinate system, in order to position
it in the monitored space and to match it with targets reported by different
sensors. 3D tracking has traditionally been carried out using multiple cameras
[1] or time-of-flight cameras. The former is expensive for setups that need
multiple observation areas, while the latter can only work indoors due to the
sun interfering with the infrared illuminator. Our approach is based on a single
optical camera, and the assumption that the ground around the infrastructure
is moderately flat.
Since our system is meant for operation in environments far from controlled
ones, it needs to cope with environmental changes that are not due to the tar-
gets. These are illumination variations (e.g. due to clouds or moving shadows),
background motion (e.g. minor camera motion and moderate foliage motion
due to wind) and rain. In these challenging conditions, our system utilises two
visual measurements that complement each other, resulting in target models
that are robust enough to discriminate the targets from the resulting clutter. These
models have the form of a measurement likelihood given the target state, are
trained at target initialisation (in some cases aided from assumptions based on
prior knowledge) and are updated to retain their discriminating ability over
time.
A tracker comprises (i) the target management for (re)initialization and
termination, (ii) the object model propagating the prior state to the current
time instance, (iii) the measurement likelihood for an observation given the
state and (iv) the tracking algorithm searching the state-space efficiently. The
innovations of this paper are:

– The 3D tracking algorithm yielding a 3D state from only one camera by mapping the 3D state onto the single camera image,
– The target initialisation that allows extracting the 3D state from the blob on the image comprising the target, and
– The target modeling by two measurement likelihood functions and the combination of the resulting likelihoods using an early fusion scheme into a particle filter tracker, suitable for our non-linear measurement models [2].

The remainder of this paper is organised as follows: In section 2 we discuss the background on depth estimation, background subtraction and target modeling for visual trackers. In section 3 we detail the relation between the 3D world and the 2D image sequence from our single camera. In section 4, the two likelihood functions are discussed under a unified framework, following our earlier approach in [21]. The tracking and classification system is outlined in section 5, followed by the results in section 6. Finally, the conclusions are drawn in section 7.

2 Background Work

2.1 Depth estimation

The majority of works that tackle the problem of depth estimation deal with
stereo (binocular) vision. A stereo vision algorithm with applications in obsta-
cle avoidance is presented in [24]. In [36] an overview is given of the taxonomy
of dense two-frame correspondence algorithms, categorized into local and global methods: the former are window-based methods favoring performance over accuracy, while the latter favor accuracy over performance. In [35] a depth estimation method is
presented using a monocular camera along with a supervised learning method
by taking into account the global structure of the scene. In this paper the
authors have collected a set of outdoor images with ground-truth depth maps
used to train their model and then apply supervised learning to predict the
depth map as a function of the image. Learning depth from monocular im-
ages using deep convolutional neural fields appears in [26]. In that research
the authors propose a deep structured learning scheme which learns the unary
and pairwise potentials of the continuous conditional random field in a unified
deep neural network framework. In [8] the authors present various methods for
estimating heights of objects using vanishing lines from the ground as well as
vertical vanishing point using an uncalibrated camera. In their method a refer-
ence object with known height is used in the image plane. In [7] a method for
measuring the heights of any feature based on an uncalibrated camera is pre-
sented. Their research is based on the Focus of Expansion (FOE) under pure
translation. In [30] the authors present a method for height estimation from
vanishing lines and points. In [10] a method for height estimation is presented
using optical flow methods from active vision.

Depth estimation with a monocular camera and by means of optical flow


and least squares is presented in [12, 11,9]. In this work the authors have
trained the system with a large number of varying optical flow vector magni-
tudes with respect to a range of speeds. Upon training, a regression analysis
has been utilized for the purpose of estimating depth based on a formula de-
rived from regression analysis. The model shows that the $R^2$ increases when a
smaller range of speed is utilized during training. A least squares strategy has
also been employed by taking snapshots of a landmark at various positions
and is compared against optical flow [12]. The results show that $R^2$ increases
along with the number of snapshots taken. A method that performs fusion of
the two strategies, namely optical flow and least squares appears in [11].

Our method is based on a calibrated camera and the assumption that the
target is on a planar ground and fully visible. It is detailed in Section 3.

2.2 Background subtraction

In order to separate the important objects in each frame from their surround-
ings, we employ a background subtraction algorithm. In the simplest case, the
background is constant, so it is memorized, and pixels that are different must
belong to the foreground objects. In a real-world application, the background
pixels sustain slow or repetitive variations, due to environmental factors (sun-
light variations, wind, shadows). A good background subtraction algorithm
should be able to discriminate between these expected variations and abrupt
changes that would denote a foreground object.
To this end, we use the Mixture of Gaussians (MOG) algorithm [37], with
modifications from [45] as implemented in OpenCV [32]. For each pixel, the
probability density function is estimated as a sum of Gaussian functions, which
are constantly updated. Foreground objects consist of pixels that have uncom-
mon colors at their position, so we are able to flag them accordingly. As for
background pixels, slow variations are mitigated because the Gaussian func-
tions of the mixture model shift and adapt to the new values as measure-
ments accumulate (without false positives, since they correspond to the
same Gaussian function as the previous background color), and oscillations
are mitigated because the multiple background colors eventually correspond
to multiple background Gaussian functions.
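
As an illustration of this building block, the following minimal sketch drives the OpenCV MOG2 subtractor referenced above on a video file; the history length, thresholds and the input file name are illustrative assumptions, not the settings of our system.

```python
import cv2

# Minimal sketch: the OpenCV MOG2 background subtractor applied frame by frame.
# The history length, variance threshold and input file name are illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
capture = cv2.VideoCapture("sequence.mp4")  # hypothetical input sequence
while True:
    ok, frame = capture.read()
    if not ok:
        break
    # learningRate=-1 lets OpenCV derive the rate from the history length;
    # the system described here instead adapts the rate per pixel (Section 4.1).
    mask = subtractor.apply(frame, learningRate=-1)
    # Shadow pixels are marked with value 127; keep only confident foreground.
    _, foreground = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
capture.release()
```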
Adaptive mixture models have continued to provide solutions for various applications. [19] and [45] are popular implementations of
the mixture-of-Gaussians algorithm, tuned to real-world applications. [14] uses
nonparametric kernels instead of Gaussian functions, and [16] uses the local
binary pattern operator, to create texture models instead of color models. [4]
and [15] use pixel color sample models, instead of estimating a probability
density function. [42] and [31] fuse the color model with the texture model
and the contour model (shape), respectively. There are also solutions outside
the concept of adaptive pixel models; [27] uses neural networks, while [28] uses
a fuzzy system.
Our background subtraction method is a variant of the MOG algorithm,
with spatio-temporal adaptation of the learning rate, as discussed in Section
4.1.

2.3 Target modelling and tracking

Tracking involves identifying the location of one or more objects of interest


in various successive video frames, and their respective trajectories. For real-
world applications, a good tracking system is able to adapt to various envi-
ronmental difficulties, like a noisy background, varying light conditions, and
occlusion that may happen between the tracked objects and the background,
or between the objects themselves.
A popular, well performing solution is the Kalman filter [40]. The Kalman
filter employs two steps, an estimation step where the object’s current state is

estimated using past measurements, and a correction step, where the object
model adapts to new measurements. This solution is very efficient, and per-
forms well under noise. However, it can be applied to mostly linear dynamic
models with Gaussian noise, and can only model a single hypothesis. [33]
uses an adaptive Kalman filter that predicts the searching point for following
frames.

Particle filtering [29] is a probabilistic (Monte Carlo) technique that uses


random weighted samples. The motivation behind particle filtering is to over-
come the limitations of the Kalman filter, namely to model nonlinear systems
with multiple hypotheses. The drawback of this method is its higher computational cost. The (re)sampling scheme is non-trivial, as the algo-
rithm tends to accumulate the weight to just a few samples over time (sample
impoverishment). Particle filtering has been successfully used in complex sce-
narios; [22] uses particle filtering to track multiple interacting targets, and [20]
describes tracking by fusing various visual descriptors. [25] uses a cascade par-
ticle filter for low frame rate videos. [43] uses a swarm intelligence sampling
method to mitigate sample impoverishment.

Target modeling approaches based on tracking by detection employ a dis-


criminative classifier. Such a classifier can be trained in a batch mode prior
to application in a tracking system. Typical examples include the Boosted
Cascades of Simple Features [39] and the Histogram of Oriented Gradients
[13] classifier. Adverse conditions result in bad framing of the target, which,
when accumulated, lead to tracker drift and finally target loss. On the other
hand the classifier can be initialised and subsequently updated. In this case
the appearance model of the target is adaptive, with typical example being
colour modeling [17,18]. Traditionally, adaptive modeling involves the use
of heuristically-derived forgetting factors. Recently online multiple instance
learning [3] overcame this problem by retraining a discriminative classifier.

Our approach to handling target appearance changes while keeping the appearance models updated is to use multiple models, as discussed in Section 4. We employ (i)
a colour model that is trained on the target and requires infrequent retraining,
and (ii) a foreground model that is not trained at all as it only looks for fore-
ground blobs that have background around them. These models are different
in terms of persistence and discriminating ability. The discriminative power
of colour models depends on the colours of the background and suffers from
illumination changes and target pose changes. Foreground segmentation on
the other hand distinguishes a moving target from the immobile background
well, but is sensitive to camera motion, background changes and prolonged
lack of motion. Although our stand-alone models suffer in persistence or dis-
criminating ability, their combination with our proposed tracker yields robust
performance.

3 The 3D model of the monitored space

The single camera 3D tracking system makes some assumptions about the targets it monitors:
– The world is planar. There exists a ground that is planar, without any
curves. The x and y axes of the world coordinate system define this plane
(see Fig. 1). This assumption is easily met in indoor scenarios (where the ground plane is the floor of the building). It is violated by stairs indoors and almost always outdoors, even on seemingly flat terrain.
– The targets are seen touching the ground. While touching the ground is
almost always true for humans (unless they jump) and always true for
vehicles (on the flat terrain), being seen by the camera doing so requires
that no background or foreground objects obstruct the camera view of the
bottom part of the target. Missing pixels of the lower part of the targets
can result in the tracker believing that the targets are further away from
the camera than they actually are.
– The targets are cylindrical. Hence their state x comprises the x and y
coordinates on the ground plane, the height h (which is measured along
the z axis of the world coordinates) and the radius r. The cylindrical ap-
proximation is a moderate one for humans, but a poor one for vehicles.
While the visible width of a human target might not change a lot based
on the angle, that of a vehicle does. Hence while tracking the width of the
targets, the system should allow for noisy measurements.
Each target state needs to be mapped onto a patch of the image plane in
order to measure evidence of its existence. Also, each image patch detected
needs to be mapped onto a state vector in order to initialise a new target.
These two mappings are considered in the following subsections.

3.1 State to image patch

A camera views the surrounding 3D space and projects it onto the image plane.
The geometry of this projection depends on two factors: the orientation of the camera and the lens.
The orientation of the camera relative to the world can be described using
two 3D Cartesian coordinate systems: that of the world and that of the camera
(see Fig. 1). The camera coordinate system has its origin at the centre of
projection of the camera, and its zc axis along the principal ray of the camera
[38]. The orientation of the other two axes approximately coincides with that
of the two image plane axes, apart from a possible skew considered later in
this section. Its offset from the world coordinate system is represented by
the vector Tc , termed translation vector. Its axes are also oriented differently
than those of the world coordinate system. They have to be rotated to match.
This is achieved by the rotation matrix Rc . The coordinates of a point in the
world coordinate system P are hence related to those in the camera coordinate

[Figure 1: diagram labelling the centre of projection, image plane, focal length, principal point, principal ray and the camera-to-world translation vector.]

Fig. 1: Viewing a scene with a camera to project it onto an image plane. Three coordinate systems are involved: the 3D world and camera, and the 2D image coordinate systems.

system $P_c$ by:

$$ P = R_c P_c + T_c \qquad (1) $$

The rotation matrix and translation vectors are the extrinsic parameters of
the camera.
The camera coordinates of a point are not the coordinates on the image
plane. It is the lens type and the way it is mounted on the camera that govern
the way a point in the camera coordinate system is projected onto the image
plane. The first part of this projection involves the non-linear distortion of the
camera coordinates. This non-linear distortion has two components [44]. The
first component is radial, i.e. a function of (even) powers of the distance from
(2) (4)
the principal point. A linear combination of the second (kr ), fourth (kr ) and
(6)
sixth (kr ) powers of the distance are used. The second distortion component
(1) (2)
is tangential and is determined by the two coefficients kt and kt as shown
below.
Each line connecting the origin of the camera coordinate system (centre of
projection) and a point on that system is projected onto the same point in the
image plane, giving rise to the depth uncertainty of single camera imaging.
For this reason the correspondence between the depth-normalised coordinates
of the camera coordinate system and the image plane coordinates is sought.
Starting from the camera coordinates $[x_c, y_c, z_c]^T$, the depth-normalised camera coordinates are:

$$ x_n \equiv \begin{bmatrix} x_n \\ y_n \end{bmatrix} = \begin{bmatrix} x_c/z_c \\ y_c/z_c \end{bmatrix} \qquad (2) $$
Setting $r^2 \equiv x_n^2 + y_n^2$, the (radial and tangential) distorted coordinates $x_d$ are [44, 5]:

$$ x_d \equiv \begin{bmatrix} x_d \\ y_d \end{bmatrix} = x_r + x_t \qquad (3) $$

where the radial distortion is modelled using three coefficients $k_r^{(i)}$ of order $i$ equal to two, four and six:

$$ x_r = \left( 1 + k_r^{(2)} r^2 + k_r^{(4)} r^4 + k_r^{(6)} r^6 \right) x_n \qquad (4) $$

and the tangential distortion is modelled using two coefficients $k_t^{(i)}$, $i = 1, 2$:

$$ x_t = \begin{bmatrix} 2 k_t^{(1)} x_n y_n + k_t^{(2)} \left( r^2 + 2 x_n^2 \right) \\ k_t^{(1)} \left( r^2 + 2 y_n^2 \right) + 2 k_t^{(2)} x_n y_n \end{bmatrix} \qquad (5) $$

The second part of the camera coordinate system projection onto the image
plane involves the linear projection of xd . It is determined by the following
parameters, taken from the pinhole camera model:
– Focal length: This is the distance of the principal point from the centre of
projection.
– Non-square pixels: The aspect ratio of the pixels together with the focal length result in the scaling (different along each axis) of the viewed objects. Together they are denoted by the focal length $2 \times 1$ vector $f_c = [f_x, f_y]^T$.
– Principal point offset: This is the $2 \times 1$ translation vector $c_c = [c_x, c_y]^T$ between the principal point on the image plane and the origin of the pixel coordinates.
– Skew: The skew coefficient $\alpha_c$ accounts for the fact that the camera coordinate system axes $x_c$ and $y_c$ are only approximately parallel to the respective axes of the image coordinate system. In most cases skew is approximated with zero.
Thus the pixel coordinates of the world point are a function of the distorted coordinates $x_d$, given by:

$$ x_p = \begin{bmatrix} f_x \cdot (x_d + \alpha_c y_d) \\ f_y \cdot y_d \end{bmatrix} + c_c \qquad (6) $$

The distortion coefficients and the pinhole model parameters form the in-
trinsic camera parameters. Given these intrinsic camera parameters, eqs. (1)
to (6) map any point in the 3D world to a pixel in the image plane of the
camera.
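
The following sketch implements eqs. (1) to (6) directly in Python/NumPy; the function and parameter names mirror the symbols of this section, and the extrinsic and intrinsic values themselves are assumed to come from an offline calibration.

```python
import numpy as np

def project_point(P, R_c, T_c, f_c, c_c, alpha_c, k_r, k_t):
    """Sketch of eqs. (1)-(6): map a 3D world point P to pixel coordinates.
    R_c, T_c follow the convention of eq. (1), P = R_c P_c + T_c, so the camera
    coordinates are recovered as P_c = R_c^T (P - T_c). k_r = (k_r2, k_r4, k_r6)
    and k_t = (k_t1, k_t2) are the distortion coefficients."""
    P_c = R_c.T @ (np.asarray(P, float) - T_c)     # invert eq. (1)
    x_n, y_n = P_c[0] / P_c[2], P_c[1] / P_c[2]    # eq. (2): depth normalisation
    r2 = x_n ** 2 + y_n ** 2
    radial = 1.0 + k_r[0] * r2 + k_r[1] * r2 ** 2 + k_r[2] * r2 ** 3   # eq. (4)
    x_t = 2 * k_t[0] * x_n * y_n + k_t[1] * (r2 + 2 * x_n ** 2)        # eq. (5)
    y_t = k_t[0] * (r2 + 2 * y_n ** 2) + 2 * k_t[1] * x_n * y_n
    x_d, y_d = radial * x_n + x_t, radial * y_n + y_t                  # eq. (3)
    u = f_c[0] * (x_d + alpha_c * y_d) + c_c[0]                        # eq. (6)
    v = f_c[1] * y_d + c_c[1]
    return np.array([u, v])
```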
The state x is mapped into a rectangle in the image plane that bounds the target being tracked. The mapping is done as follows: First the centre coordinates at the target height $P = [x, y, h]^T$ are projected onto the image to $x_p = \left[ x_t^{(i)}, y_t^{(i)} \right]^T$.
Then K points on a circle centred at the ground at $[x, y, 0]^T$ with radius r are selected for projection using:

$$ \left[ x + r \cos\left( \frac{2\pi}{K} k \right),\; y + r \sin\left( \frac{2\pi}{K} k \right),\; 0 \right]^T, \quad \text{where } k = 0, \ldots, K - 1 \qquad (7) $$

and are projected onto the image to $\left( x_c^{(i)}(k), y_c^{(i)}(k) \right)$.
The bounding rectangle is then approximated using the following coordinates:
– Top: $y_t^{(i)}$.
– Bottom: $\frac{1}{K} \sum_{k=0}^{K-1} y_c^{(i)}(k)$.
– Left: $\min_k x_c^{(i)}(k)$.
– Right: $\max_k x_c^{(i)}(k)$.
The minimum value of K is 2. In this work we use K = 4.
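
A possible implementation of this state-to-patch mapping is sketched below; the `project` argument stands for any world-to-pixel projection such as the eqs. (1)-(6) sketch above, and K = 4 follows the choice stated in the text.

```python
import numpy as np

def state_to_bbox(state, project, K=4):
    """Sketch of the state-to-patch mapping of Section 3.1. `state` is (x, y, r, h)
    and `project` is any world-point-to-pixel function (e.g. the eqs. (1)-(6)
    sketch). Returns (left, top, right, bottom) in pixel coordinates."""
    x, y, r, h = state
    top_pixel = project(np.array([x, y, h]))               # centre at target height
    angles = 2.0 * np.pi * np.arange(K) / K                 # eq. (7)
    circle = np.array([project(np.array([x + r * np.cos(a),
                                         y + r * np.sin(a), 0.0]))
                       for a in angles])
    left = circle[:, 0].min()                               # min of x_c(k)
    right = circle[:, 0].max()                              # max of x_c(k)
    top = top_pixel[1]                                      # y_t
    bottom = circle[:, 1].mean()                            # average of y_c(k)
    return left, top, right, bottom
```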

3.2 Image patch to state

The previous subsection began by considering the forward problem of mapping a real-world point into a pixel on the camera plane. The inverse problem is also interesting: map an image pixel into a real-world point. There are two difficulties in attempting to do so:
– The forward problem is non-linear, hence its analytic inversion is not possible. So the inverse problem is approximated numerically, resulting in the estimation of the normalised camera coordinates $x_n = [x_n, y_n]^T$.
– Even the numerical approximation does not lead to a single point in the real-world coordinate system due to depth uncertainty. Recall the depth normalisation of (2), which cannot be reversed, since depth information does not exist after it. Attempting to do so results in the depth uncertainty of the camera coordinate system, expressed as the line of points $P_c = [x_n z_c, y_n z_c, z_c]^T$, where $z_c$ is not specified.
Using (1), the world coordinates are then given by:

$$ P = R_c \begin{bmatrix} x_n z_c \\ y_n z_c \\ z_c \end{bmatrix} + T_c \qquad (8) $$

To find $z_c$ we assume that the height of the object H at that pixel is known. Then the third row of (8) yields:

$$ H = z_c R_c^{(3)} \begin{bmatrix} x_n \\ y_n \\ 1 \end{bmatrix} + T_c^{(3)} \qquad (9) $$

where $R_c^{(3)}$ is the third row vector of the rotation matrix $R_c$ and $T_c^{(3)}$ is the third element of the translation vector $T_c$. Then the distance from the camera is found as:

$$ z_c = \left( H - T_c^{(3)} \right) \left( R_c^{(3)} \begin{bmatrix} x_n \\ y_n \\ 1 \end{bmatrix} \right)^{-1} \qquad (10) $$

Thus any point on the image being at some known height H in the world coordinate system can be mapped onto the other two world coordinates (the ground ones) by substituting (10) into (8).
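
A minimal sketch of this back-projection, implementing eqs. (8) to (10) for a point at a known height H, is given below; the normalised camera coordinates are assumed to be available already.

```python
import numpy as np

def backproject_at_height(x_n, y_n, H, R_c, T_c):
    """Sketch of eqs. (8)-(10): world coordinates of a point seen at normalised
    camera coordinates (x_n, y_n), assuming it lies at known world height H.
    R_c and T_c are the extrinsics of eq. (1)."""
    d = np.array([x_n, y_n, 1.0])
    z_c = (H - T_c[2]) / (R_c[2, :] @ d)   # eq. (10): depth that puts the point at height H
    return R_c @ (z_c * d) + T_c           # eq. (8): back to world coordinates
```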
Tracker initialisation from an image patch with left, top, right and bottom coordinates $\left[ x_l^{(i)}, y_t^{(i)}, x_r^{(i)}, y_b^{(i)} \right]^T$ is performed as follows: First the two pixels at the bottom edge corners of the patch are assumed to be the bottom of the target touching the ground. The two image coordinates $x_l = \left[ x_l^{(i)}, y_b^{(i)} \right]^T$ and $x_r = \left[ x_r^{(i)}, y_b^{(i)} \right]^T$ are transformed into the normalised camera coordinates $x_{n,l}$ and $x_{n,r}$ respectively. Since these points are touching the ground, two sets of ground coordinates $[x_l, y_l]^T$ and $[x_r, y_r]^T$ are obtained by substituting (10) into (8) while setting H = 0.
The ground coordinates of the new target are then initialised at:

$$ x = \frac{x_l + x_r}{2} \quad \text{and} \quad y = \frac{y_l + y_r}{2} \qquad (11) $$

while the radius of the cylindrical approximation is:

$$ r = \sqrt{ (x_r - x_l)^2 + (y_r - y_l)^2 } \qquad (12) $$

The height of the target is found by assuming that the top-centre pixel $x_t = \left[ \left( x_l^{(i)} + x_r^{(i)} \right)/2,\; y_t^{(i)} \right]^T$, corresponding to the normalised camera coordinates $x_{n,t}$, is at the world coordinates $[x, y, h]^T$. Then (8) yields:

$$ \begin{bmatrix} x \\ y \\ h \end{bmatrix} = R_c \begin{bmatrix} x_{n,t} z_c \\ y_{n,t} z_c \\ z_c \end{bmatrix} + T_c \qquad (13) $$

Either of the first two rows of (13) yields the distance from the camera $z_c$, which is then substituted back into the third row of (13) to yield the new target's height. E.g. from the first row of (13):

$$ z_c = \left( x - T_c^{(1)} \right) \left( R_c^{(1)} \begin{bmatrix} x_{n,t} \\ y_{n,t} \\ 1 \end{bmatrix} \right)^{-1} \qquad (14) $$

and substituting $z_c$ back into the third row of (13) yields the new target's height:

$$ h = \left( x - T_c^{(1)} \right) \left( R_c^{(1)} \begin{bmatrix} x_{n,t} \\ y_{n,t} \\ 1 \end{bmatrix} \right)^{-1} R_c^{(3)} \begin{bmatrix} x_{n,t} \\ y_{n,t} \\ 1 \end{bmatrix} + T_c^{(3)} \qquad (15) $$

where $R_c^{(i)}$ is the i-th row vector of the rotation matrix $R_c$ and $T_c^{(i)}$ is the i-th element of the translation vector $T_c$.
Equations (11), (12) and (15) yield the new target’s state at initialisation.
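
The initialisation procedure can be summarised in code as follows; this is a sketch assuming the normalised camera coordinates of the bottom corners and the top-centre pixel have already been computed (the numerical undistortion step is not shown).

```python
import numpy as np

def init_state_from_patch(xn_l, xn_r, xn_t, R_c, T_c):
    """Sketch of eqs. (11), (12) and (15). xn_l, xn_r are the normalised camera
    coordinates of the bottom-left and bottom-right patch corners, xn_t those of
    the top-centre pixel. Returns the new target state (x, y, r, h)."""
    def ground_point(xn):                       # eqs. (8)-(10) with H = 0
        d = np.array([xn[0], xn[1], 1.0])
        z_c = -T_c[2] / (R_c[2, :] @ d)
        return R_c @ (z_c * d) + T_c

    P_l, P_r = ground_point(xn_l), ground_point(xn_r)
    x = 0.5 * (P_l[0] + P_r[0])                               # eq. (11)
    y = 0.5 * (P_l[1] + P_r[1])
    r = np.hypot(P_r[0] - P_l[0], P_r[1] - P_l[1])            # eq. (12)

    d_t = np.array([xn_t[0], xn_t[1], 1.0])
    z_c = (x - T_c[0]) / (R_c[0, :] @ d_t)                    # eq. (14)
    h = z_c * (R_c[2, :] @ d_t) + T_c[2]                      # eq. (15)
    return x, y, r, h
```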

4 Likelihood Functions

In this section we derive the two measurement likelihoods to be combined in


our 3D tracker. Given the state x (and dropping the time dependence for the sake of notation compactness) the likelihood functions are $p(y|x)$, where the measurement y can be either of:
– $y_{frg}$ for foreground measurements, and
– $y_{cm}$ for colour matching measurements.
The mismatch factor $M(y|x)$ of the measurement y given the state x quantifies how good the match is between what we measure at the state and the model of our target. It is $M(y|x) \in [0, 1]$, where 0 denotes a perfect match and 1 indicates a complete mismatch.
In the two likelihood functions we model the mismatch factor by the exponential distribution:

$$ p(y|x) \propto \exp\left( - \frac{M(y|x)}{2\sigma^2} \right) \qquad (16) $$

where $\sigma^2$ is the variance of the distribution.
The choice of the exponential distribution for the mismatch factor is jus-
tified as we want to penalise imperfect matching of our visual evidence with
our model of the target. How severe this penalty is depends on the variance.
Small values of σ 2 render the likelihood very selective, since a moderate mis-
match factor can result in a very large negative exponent and thus a negligible
likelihood p (y |x ).
The forms of the mismatch factors $M(y|x)$ for the foreground and colour matching measurements are discussed in the following two subsections.

4.1 Foreground matching

The foreground likelihood represents the matching evidence that is inferred


from the quality of a certain image patch being flagged as foreground. We use
the Mixture-of-Gaussians algorithm. As a result, the Pixel Persistence Map $I_{PPM}$ can be built, in which every pixel is represented by the weight of the
Gaussian from its Gaussian Mixture Model (GMM) that best describes its
current colour. The weights are modified using a learning rate, adapted per
pixel and time step. While in [34] it is only decreased to protect slowly moving
foreground and in [6] it is increased to cope with sudden illumination changes,
here we adapt the rate as follows:
– Increase rate to learn background flicker.

– Increase rate globally for global illumination changes.


– Increase rate locally for local illumination changes, like moving clouds.
– Decrease rate locally for slowly moving foreground.

Regions of the resulting Pixel Persistence Map with large values corre-
spond to pixels that have colours that appear there for a long time, hence are
background. On the contrary, regions with small values correspond to pixels
that have colours that appear there for a short time, hence are foreground.
The unfiltered foreground pixels are those with weights below a threshold.
Foreground pixels are subjected to shadow removal [41] and morphological
clean-up to obtain the binary foreground mask $I_{frg}$. An erosion filter reduces
foreground noise, followed by a closing filter that connects gaps and blobs that
possibly belong to the same object. Foreground contours are then detected
and filled, eliminating any object holes. The morphological filter kernels scale
with the position in the frame; objects closer to the camera are more crudely
filtered than those far away. $I_{frg}$ and the distinct blobs detected drive a per
pixel adaptation of the learning rate [34]: It is increased to learn flicker faster
in the background, while it is decreased to protect small immobile foreground
patches from being learnt too fast.
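
A rough sketch of this morphological clean-up using OpenCV primitives is shown below; the fixed 5x5 kernel is an illustrative simplification of the position-dependent kernels described above.

```python
import cv2
import numpy as np

def clean_foreground(raw_mask, kernel_size=5):
    """Sketch of the morphological clean-up: erosion against noise, closing to
    bridge gaps, then contour filling to remove holes. A fixed kernel is used
    here instead of the position-dependent kernels of the actual system."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.erode(raw_mask, kernel)                       # suppress speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # connect nearby blobs
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    filled = np.zeros_like(mask)
    cv2.drawContours(filled, contours, -1, 255, thickness=cv2.FILLED)  # fill holes
    return filled
```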
The foreground evidence image is then given by:

$$ I_{fev} = \alpha_{PPM} (1 - I_{PPM}) + \alpha_{frg} I_{frg} \qquad (17) $$

where $\alpha_{PPM}$ and $\alpha_{frg}$ are scaling constants. The PPM is the more robust term in the sum, while the addition of the binary mask amplifies the effect of motion of significant objects in the scene.
The state x of each target corresponds to an image patch $I_x$. To force the patch to be large enough to include all of the moving object, negative foreground evidence is also collected in a region surrounding $I_x$, designated $I'_x$. The background expansion is 20% of the target size along each direction. Asking for no foreground evidence in the expansion region ensures that the tracked target does not shrink to the denser parts of the actual target. The foreground evidence is then defined as:

$$ L_{frg}(x) = \frac{1}{A(I_x)} \sum_{i \in I_x} I_{fev}(i) - \frac{1}{A(I'_x)} \sum_{i \in I'_x} I_{fev}(i) \qquad (18) $$

where A(I) is the area in pixels of the image patch I, and i is a pixel index for an image patch, denoted one-dimensional for notation simplicity. Note that if the regions $I_x$ and $I'_x$ are non-rotated rectangles, then the foreground evidence is efficiently calculated using the integral image [39] of $I_{fev}$.
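
A possible integral-image implementation of eq. (18) is sketched below; here the expansion region $I'_x$ is taken as the 20%-expanded rectangle minus the inner patch, which is one reading of the definition above.

```python
import cv2
import numpy as np

def foreground_evidence(I_fev, box, expand=0.2):
    """Sketch of eq. (18): mean evidence inside the target patch minus mean
    evidence in the surrounding expansion region, via a summed-area table.
    `box` is (left, top, right, bottom) in integer pixel coordinates."""
    integral = cv2.integral(I_fev.astype(np.float64))   # (H+1, W+1) summed-area table

    def box_sum(l, t, r, b):
        return integral[b, r] - integral[t, r] - integral[b, l] + integral[t, l]

    l, t, r, b = box
    H, W = I_fev.shape[:2]
    w, h = r - l, b - t
    L, T = max(0, int(l - expand * w)), max(0, int(t - expand * h))
    R, B = min(W, int(r + expand * w)), min(H, int(b + expand * h))

    inner_area = max(1, (r - l) * (b - t))
    outer_area = max(1, (R - L) * (B - T) - inner_area)
    inner_sum = box_sum(l, t, r, b)
    outer_sum = box_sum(L, T, R, B) - inner_sum
    return inner_sum / inner_area - outer_sum / outer_area
```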
To define the foreground mismatch factor, we empirically define $L_{frg}^{(max)}$, the maximum foreground certainty. The theoretical maximum value of $L_{frg}$ is obtained when both the persistence term $(1 - I_{PPM})$ and the binary mask $I_{frg}$ are unity in $I_x$ and zero in $I'_x$. Then $L_{frg}^{(max)} = \alpha_{PPM} + \alpha_{frg}$, but in practice it is lower. We also define the saturated version of the foreground certainty:

$$ L'_{frg}(x) = \begin{cases} 0, & \text{if } L_{frg}(x) \leq 0 \\ L_{frg}(x), & \text{if } 0 < L_{frg}(x) \leq L_{frg}^{(max)} \\ L_{frg}^{(max)}, & \text{if } L_{frg}(x) > L_{frg}^{(max)} \end{cases} \qquad (19) $$

Using the saturated version of the foreground certainty, the foreground mismatch factor is given by:

$$ M(y_{frg}|x) = \frac{ L_{frg}^{(max)} - L'_{frg}(y|x) }{ L_{frg}^{(max)} } \qquad (20) $$

Substituting (20) into (16), the foreground likelihood is obtained:

$$ p(y_{frg}|x) \propto \exp\left( - \frac{ M_{frg}(y_{frg}|x) }{ 2\sigma_{frg}^2 } \right) \qquad (21) $$

where $\sigma_{frg}^2$ is the variance of the distribution. Again, small values of the variance increase the selectivity of the foreground likelihood.
The foreground likelihood does not depend on any model that is specific
to the foreground object being tracked. It only employs a general background
model that depends on the variations of the pixels in the video frames. On
the other hand, the background modeling process is not entirely agnostic to
objects, since the learning rate of the pixels corresponding to the tracked
objects is lowered to decelerate fading of foreground objects to the background.

4.2 Colour matching

Colour likelihood represents the degree of similarity between a colour model of


the image patch Ix corresponding to the state x and the colour model of the
target. Colour models are calculated in two regions: the target’s bounding box
and an expanded background area around that. They are calculated once for the reference image patch (at the target's initialisation) and then throughout the target's lifespan, at its current state. All models are represented as colour his-
tograms. Model similarities are enumerated by the Bhattacharyya coefficient
between any of the histograms.
To calculate the histograms, the red (R), green (G) and blue (B) colour components are first quantised to $N_h$ levels and then combined into a one-dimensional colour quantity c(R, G, B), defined as:

$$ c(R, G, B) \equiv \left\lfloor \frac{N_h}{256} R \right\rfloor + N_h \left\lfloor \frac{N_h}{256} G \right\rfloor + N_h^2 \left\lfloor \frac{N_h}{256} B \right\rfloor \qquad (22) $$

where $\lfloor a \rfloor$ denotes the largest integer smaller than or equal to a. The histogram of c(R, G, B) is then calculated. Since $\{R, G, B\} \in [0, 255]$, the one-dimensional colour quantity c(R, G, B) is an integer in the range $\left[0, \ldots, N_h^3 - 1\right]$, yielding $N_h^3$ histogram bins.
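
The quantisation and histogram construction of eq. (22) can be sketched as follows; the value $N_h = 8$ is only an illustrative choice, as the text does not fix $N_h$ here.

```python
import numpy as np

def colour_histogram(patch, N_h=8):
    """Sketch of eq. (22): quantise each channel to N_h levels, combine into the
    index c(R, G, B) and build the N_h^3-bin histogram, normalised to sum to one.
    `patch` is an H x W x 3 uint8 array in R, G, B channel order."""
    q = (patch.astype(np.int64) * N_h) // 256                 # floor(N_h * channel / 256)
    c = q[..., 0] + N_h * q[..., 1] + (N_h ** 2) * q[..., 2]  # eq. (22)
    hist = np.bincount(c.ravel(), minlength=N_h ** 3).astype(np.float64)
    return hist / max(1.0, hist.sum())
```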
As already stated, the following histograms are defined:

– $h_x$: colour histogram of the image patch denoted by the current state of the target x
– $h'_x$: colour histogram of the surrounding pixels (background) around the current state
– $h_{ref}$: reference foreground colour histogram
– $h'_{ref}$: colour histogram of the reference target's surrounding pixels
Since at initialisation the foreground colour model utilises a rectangle
bounding the entire target, it is expected that many background pixels are included. Also, there are foreground pixels that have colours similar to the background, adding nothing to the discriminating ability of the target model. In or-
der to increase the discriminating ability of the target model, the reference
histogram is trained by suppressing the effect of the colours in the immediate
background of the target.
After calculating the reference foreground $h_{ref}$ and background $h'_{ref}$ histograms, the new, background-suppressed reference histogram $h_{ref}^{(sup)}$ is constructed as follows. Define:

$$ a(i) \equiv \frac{ h'^{(min)}_{ref} }{ h'_{ref}(i) }, \quad \forall i \in \left[ 0, \ldots, N_h^3 - 1 \right] \qquad (23) $$

where $h'^{(min)}_{ref}$ is the minimum non-zero value of the histogram $h'_{ref}$. Then the bins are defined as:

$$ h_{ref}^{(sup)}(i) = \begin{cases} a(i)\, h_{ref}(i), & \text{if } h'_{ref}(i) > K h_{ref}(i) \\ h_{ref}(i), & \text{if } h'_{ref}(i) \leq K h_{ref}(i) \end{cases} \qquad (24) $$

where K governs the strength of the suppression, as discussed in section 6. The bins of the new histogram $h_{ref}^{(sup)}$ are normalised to sum to unity.
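
A sketch of the background suppression of eqs. (23) and (24) follows; K = 0.125 is the value favoured later in section 6.

```python
import numpy as np

def suppress_background(h_ref, h_ref_bg, K=0.125):
    """Sketch of eqs. (23)-(24): attenuate reference bins whose colour is also
    strong in the reference background histogram h_ref_bg."""
    a = np.ones_like(h_ref)
    nonzero = h_ref_bg[h_ref_bg > 0]
    if nonzero.size:
        # eq. (23): ratio of the minimum non-zero background bin to each bin
        a = np.where(h_ref_bg > 0, nonzero.min() / np.maximum(h_ref_bg, 1e-12), 1.0)
    suppressed = np.where(h_ref_bg > K * h_ref, a * h_ref, h_ref)   # eq. (24)
    total = suppressed.sum()
    return suppressed / total if total > 0 else suppressed
```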
Utilising these histograms, the colour similarity metric is defined by combining the following Bhattacharyya coefficients:

$$ L_{cm}(x|h_{ref}) = \sum_{i=0}^{N_h^3 - 1} \sqrt{ h_x(i)\, h_{ref}^{(sup)}(i) } + \sum_{i=0}^{N_h^3 - 1} \sqrt{ h'_x(i)\, h'_{ref}(i) } - \sum_{i=0}^{N_h^3 - 1} \sqrt{ h'_x(i)\, h_{ref}^{(sup)}(i) } \qquad (25) $$

where $h_x(i)$, $h'_x(i)$, $h_{ref}^{(sup)}(i)$ and $h'_{ref}(i)$ are the i-th bins of the respective histograms.
The first term is zero when no match is found between the colours of
the target and the image patch, and one when the histograms are identical.
The second term is zero when no match is found between the colours of the backgrounds of the reference target and the image patch, and one when the histograms are identical. Finally, the third term is zero when the background of the image patch and the foreground of the reference target have no match, and one when they are identical. The image patch background histogram matches the reference target foreground histogram when the candidate rectangle has slipped off the target, leaving the target in the surrounding pixels of the patch; this term is therefore subtracted from the colour similarity metric.
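
The colour similarity metric of eq. (25) then reduces to three Bhattacharyya coefficients over the stored histograms, as in the sketch below.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalised histograms."""
    return float(np.sum(np.sqrt(h1 * h2)))

def colour_similarity(h_x, h_x_bg, h_ref_sup, h_ref_bg):
    """Sketch of eq. (25): target-to-reference match plus background-to-background
    match, minus the term penalising a patch whose background matches the target."""
    return (bhattacharyya(h_x, h_ref_sup)
            + bhattacharyya(h_x_bg, h_ref_bg)
            - bhattacharyya(h_x_bg, h_ref_sup))
```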
To define the colour mismatch factor, we empirically define $L_{cm}^{(max)}$, the maximum colour matching certainty. The theoretical maximum value of $L_{cm}$ is obtained, as seen from above, when the first and second terms in (25) are maximised to unity and the third is minimised to zero. In practice, the maximum $L_{cm}$ is smaller than the theoretical value of two. We also define the saturated version of the colour matching certainty, to obtain a value in [0, 1]:

$$ L'_{cm}(x) = \begin{cases} 0, & \text{if } L_{cm}(x) \leq 0 \\ L_{cm}(x), & \text{if } 0 < L_{cm}(x) < L_{cm}^{(max)} \\ L_{cm}^{(max)}, & \text{if } L_{cm}(x) \geq L_{cm}^{(max)} \end{cases} \qquad (26) $$

Using the saturated version of the colour matching certainty, the colour mismatch factor is given by:

$$ M(y_{cm}|x) = \frac{ L_{cm}^{(max)} - L'_{cm}(y|x) }{ L_{cm}^{(max)} } \qquad (27) $$

By substituting (27) in the exponential distribution (16), the colour matching likelihood is obtained:

$$ p(y_{cm}|x) \propto \exp\left( - \frac{ M_{cm}(y_{cm}|x) }{ 2\sigma_{cm}^2 } \right) \qquad (28) $$

where $\sigma_{cm}^2$ is the variance of the distribution. Small values of the variance increase the selectivity of the colour matching likelihood, as discussed in section 6.

5 Tracking and Classification

The 3D tracker is initialised using the processed foreground blobs: Blobs


smaller than a certain size are removed, as they are considered to be noise
(too small to be an intruder). Blobs that are very close to each other are
combined, as they probably belong to the same object.
The 3D tracker is built utilising particle filters. The weight for particle
x is updated based on the measured likelihood $p(y|x)$, where y is either of the foreground and colour measurements introduced in the previous section. The filter is a Sequential Importance Resampling (SIR) one [2], where systematic resampling [23] is employed. As an SIR particle filter, the selected proposal distribution is the same as the object model. The object model is the measurement-assisted one described in [21]. The same holds for the target management system.
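
For reference, a minimal sketch of the systematic resampling step [23] used inside such an SIR filter could look as follows; the surrounding particle filter machinery (state propagation, weight computation from the fused likelihoods) is not shown.

```python
import numpy as np

def systematic_resample(weights, rng=None):
    """Sketch of systematic resampling [23]: one uniform draw, then evenly spaced
    thresholds compared against the cumulative normalised weights. Returns the
    indices of the particles to replicate."""
    rng = rng or np.random.default_rng()
    N = len(weights)
    positions = (rng.uniform() + np.arange(N)) / N
    cumulative = np.cumsum(np.asarray(weights, float) / np.sum(weights))
    cumulative[-1] = 1.0                      # guard against rounding error
    return np.searchsorted(cumulative, positions)
```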

The tracked state of each target contains the height and the radius of
the cylinder modelling it. Since the modelling of the radius depends a lot
on the viewpoint, only the height of the target is used in the classification.
The classifier tries to recognise cars and humans. The height of each class is
modelled as a random variable H, whose distribution is sought. Using datasets
of humans and cars, the height is fit utilising different distributions. The log-
normal yields the minimum fitting error, and hence is chosen. The cumulative
probability function is:

$$ F_H(h; \mu, \sigma) = \frac{1}{2}\, \mathrm{erfc}\left( - \frac{\ln h - \mu}{\sigma \sqrt{2}} \right) \qquad (29) $$

where erfc() is the complementary error function. The model fitting approach
yields the (µ, σ) pairs for the heights of cars and humans.
Since the estimation of the height is prone to errors, the probability of a range of heights around the nominal value given the class is sought:

$$ P^{(c)}(h) = F_H^{(c)}(h + \delta) - F_H^{(c)}(h - \delta) \qquad (30) $$

where c is the class (car or human) and $\delta$ quantifies the height uncertainty and is selected to be 2 cm.
For the classification, (30) is evaluated for the car and human classes and the decision is:

$$ c = \begin{cases} \text{car}, & \text{if } P^{(car)}(h) > 0.08 \text{ and } P^{(car)}(h) > 1.2\, P^{(human)}(h) \\ \text{human}, & \text{if } P^{(human)}(h) > 0.08 \text{ and } P^{(human)}(h) > 1.2\, P^{(car)}(h) \\ \text{unknown}, & \text{otherwise} \end{cases} \qquad (31) $$
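
A sketch of this classification rule, using the log-normal cumulative distribution of eq. (29) via the complementary error function, is given below; the fitted (µ, σ) pairs are not reproduced here and must be supplied.

```python
import math

def class_probability(h, mu, sigma, delta=0.02):
    """Sketch of eqs. (29)-(30): probability of a height within +/- delta metres
    of h under a log-normal model with parameters (mu, sigma)."""
    def cdf(v):                                   # eq. (29) via erfc
        return 0.5 * math.erfc(-(math.log(v) - mu) / (sigma * math.sqrt(2.0)))
    return cdf(h + delta) - cdf(h - delta)

def classify(h, params):
    """Sketch of the decision rule (31); `params` maps class name to a fitted
    (mu, sigma) pair, which is not reproduced here."""
    p_car = class_probability(h, *params["car"])
    p_human = class_probability(h, *params["human"])
    if p_car > 0.08 and p_car > 1.2 * p_human:
        return "car"
    if p_human > 0.08 and p_human > 1.2 * p_car:
        return "human"
    return "unknown"
```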

6 Results

The goal of this section is to evaluate the performance of the measurement


likelihoods introduced in section 4. For this reason we use a sequence of a hu-
man in an outdoor scenario. Figure 2 displays three frames from the sequence: one at the beginning of the lifespan of the target, where the colour model is initialised, another shortly after with the target in a similar position and body posture, and a third at a much larger distance from the camera and with a different body posture.
We evaluate the proposed colour measurement likelihood by considering
how sensitive the proposed likelihood functions are to offsets around the ideal
state. We consider the ’similar’ and ’different’ situations depicted in Figure
2. For this evaluation an offset is introduced in the horizontal direction from
the actual target position. The likelihoods should drop away from the actual position (offset 0), but should do so rather gradually, to be able to attract particles that are moderately close.
The first goal is to evaluate the effect of the three terms in the colour
similarity metric of (25), namely:

(a) The reference image

(b) The similar testing image

(c) The different testing image

Fig. 2: Three images to be used for testing the likelihood functions. The
two bounding boxes marked are target (green) and background around target
(red).

– $T_{(rf,cf)}$: Bhattacharyya coefficient term between reference and current foreground
– $T_{(rb,cf)}$: Bhattacharyya coefficient term between reference background and current foreground
– $T_{(rb,cb)}$: Bhattacharyya coefficient term between reference background and current background

From Figure 3 it is obvious that the first (blue solid line) and third (red dashed and dotted line) combinations of Bhattacharyya coefficient terms are excluded because of their low selectivity. Furthermore, the fourth combination of Bhattacharyya coefficient terms (black dotted line) is ruled out too, because it is not a monotonic function and has local maxima left and right of the global maximum. This is problematic for the particle filter tracker. So the second combination of Bhattacharyya coefficient terms (green dashed line) is selected, because it has high selectivity and is monotonic around zero offset.
We next examine the effect of the strength of background suppression in
the reference histogram. The results of varying K are shown in Figure 4. Peak
position, monotonicity and selectivity improve at K = 0.125.
Finally, we examine the effect of the exponential distribution variance on
the selectivity of the colour matching likelihood. The results are shown in
Figure 5. We select $\sigma_{cm} = 1/4$ to have selectivity but also to avoid a very abrupt function that would be unable to attract moderately offset particles to the correct location.
An example of the target classification follows in Figure 6, where three
different outdoor scenarios are depicted.

7 Conclusion

A 3D tracking and classification system based on a single calibrated camera is


presented, along with two visual measurement cues analysed under a common
framework to alleviate the problem of background clutter and lighting variations in outdoor tracking for surveillance applications. The measurements
differ in discriminating ability and persistence, hence combining them leads to
a robust tracking system. The targets are also classified into cars and humans
utilising a probabilistic model of their heights.

References

1. Andersen, M., Andersen, R., Katsarakis, N., Pnevmatikakis, A., Tan, Z.H.: Three-
dimensional adaptive sensing of people in a multi-camera setup. In: Person tracking
for assistive working and living environments, EUSIPCO 2010, pp. 964–968. Aalborg,
Denmark (2010)
2. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for on-
line non-linear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing
50(2), 174–188 (2002)
[Figure 3: two plots of colour matching likelihood versus horizontal offset (% of target width), for the Reference-Similar and Reference-Different image pairs, comparing T(rf,cf), T(rf,cf)-T(rb,cf), T(rf,cf)+T(rb,cb) and T(rf,cf)-T(rb,cf)+T(rb,cb).]

Fig. 3: Colour matching likelihoods of the four different combinations of Bhattacharyya coefficient terms as a function of the horizontal offset, for the 'similar' (top) and the 'different' (bottom) cases.

3. Babenko, B., Yang, M.H., Belongie, S.: Visual Tracking with Online Multiple Instance
Learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2009). Miami Beach, FL, USA (2009)
4. Barnich, O., Droogenbroeck, M.V.: Vibe: A universal background subtraction algorithm
for video sequences. IEEE Transactions on Image Processing 20(6), 1709–1724 (2011).
DOI 10.1109/TIP.2010.2101613. URL http://dx.doi.org/10.1109/TIP.2010.2101613
5. Bouguet, J.Y.: Camera calibration toolbox for matlab.
www.vision.caltech.edu/bouguetj/calib doc/htmls/parameters.html (2008)
6. Chen, Z., Ellis, T.: A self-adaptive Gaussian mixture model. Computer Vision and Image
Understanding 122(0), 35 – 46 (2014). DOI http://dx.doi.org/10.1016/j.cviu.2014.01.
004. URL http://www.sciencedirect.com/science/article/pii/S1077314214000113
[Figure 4: two plots of colour matching likelihood versus horizontal offset (% of target width), for the Reference-Similar and Reference-Different image pairs, comparing no suppression and K = 2, 1, 0.5, 0.25, 0.125.]

Fig. 4: Effect of the strength of background suppression in the reference histogram. Colour matching likelihoods for different K as a function of the horizontal offset, for the 'similar' (top) and the 'different' (bottom) cases.

7. Chen, Z., Pears, N., Liang, B.: A method of visual metrology from uncalibrated images.
Pattern Recognition Letters 27(13), 1447–1456 (2006)
8. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. International Journal of
Computer Vision 40(2), 123–148 (2000)
9. Diamantas, S.C.: Biological and metric maps applied to robot homing. Ph.D. thesis,
School of Electronics and Computer Science, University of Southampton (2010)
10. Diamantas, S.C., Dasgupta, P.: An active vision approach to height estimation with
optical flow. In: International Symposium on Visual Computing, pp. 160–170. Springer
(2013)
11. Diamantas, S.C., Oikonomidis, A., Crowder, R.M.: Depth computation using optical
flow and least squares. In: IEEE/SICE International Symposium on System Integration,
pp. 7–12. Sendai, Japan (2010)
[Figure 5: two plots of colour matching likelihood versus horizontal offset (% of target width), for the Reference-Similar and Reference-Different image pairs, comparing σ = 1, 0.5, 0.25 and 0.125.]

Fig. 5: Effect of the exponential distribution variance. Colour matching likelihoods for different σcm as a function of the horizontal offset, for the 'similar' (top) and the 'different' (bottom) cases.

12. Diamantas, S.C., Oikonomidis, A., Crowder, R.M.: Depth estimation for autonomous
robot navigation: A comparative approach. In: International Conference on Imaging
Systems and Techniques, pp. 426–430. Thessaloniki, Greece (2010)
13. Ding, X., Xu, H., Cui, P., Sun, L., Yang, S.: A cascade svm approach for head-shoulder
detection using histograms of oriented gradients. In: IEEE International Symposium on
Circuits and Systems (ISCAS 2009), pp. 1791–1794. Taipei, Taiwan (2009)
14. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground
modeling using nonparametric kernel density estimation for visual surveillance. In:
Proceeding of the IEEE, vol. 90, pp. 1151–1163 (2002)
15. Godbehere, A.B., Matsukawa, A., Goldberg, K.Y.: Visual tracking of human visitors
under variable-lighting conditions for a responsive audio art installation. In: American
Control Conference, ACC 2012, Montreal, QC, Canada, June 27-29, 2012, pp. 4305–4312

(a) Object classified as human

(b) Object classified as car

(c) Two objects classified as human and car

Fig. 6: Classification examples.



(2012)
16. Heikkilä, M., Pietikäinen, M.: A texture-based method for modeling the background
and detecting moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 657–662
(2006). DOI 10.1109/TPAMI.2006.68. URL http://doi.ieeecomputersociety.org/
10.1109/TPAMI.2006.68
17. Jaffré, G., Crouzil, A.: Non-rigid object localization from color model using mean shift.
In: IEEE International Conference on Image Processing (ICIP 2003), pp. 317–320.
Barcelona, Spain (2003)
18. Jones, M.J., Rehg, J.M.: Statistical color models with application to skin detection.
International Journal of Computer Vision 46(1), 81–96 (2002)
19. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model
for real-time tracking with shadow detection. In: Video-Based Surveillance Systems,
chap. 11, pp. 135–144. Springer US (2002)
20. Katsarakis, N., Pnevmatikakis, A., Tan, Z., Prasad, R.: Combination of multiple mea-
surement cues for visual face tracking. Wireless Personal Communications 78(3), 1789–
1810 (2014). DOI 10.1007/s11277-014-1900-2. URL http://dx.doi.org/10.1007/
s11277-014-1900-2
21. Katsarakis, N., Pnevmatikakis, A., Tan, Z.H., Prasad, R.: Combination of multiple
measurement cues for visual face tracking. Wireless Personal Communications 78(3),
1789–1810 (2014). DOI 10.1007/s11277-014-1900-2
22. Khan, Z., Balch, T.R., Dellaert, F.: Mcmc-based particle filtering for tracking a variable
number of interacting targets. IEEE Trans. Pattern Anal. Mach. Intell. 27(11), 1805–
1918 (2005). DOI 10.1109/TPAMI.2005.223. URL http://doi.ieeecomputersociety.
org/10.1109/TPAMI.2005.223
23. Kitagawa, G.: Monte carlo filter and smoother for non-gaussian nonlinear state space
models. Journal of Computational and Graphical Statistics 5(1), 1–25 (1996)
24. Nalpantidis, L., Kostavelis, I., Gasteratos, A.: Stereovision-based algorithm for obstacle
avoidance. In: Intelligent Robotics and Applications, vol. 5928, pp. 195–204 (2009)
25. Li, Y., Ai, H., Yamashita, T., Lao, S., Kawade, M.: Tracking in low frame rate video: A
cascade particle filter with discriminative observers of different life spans. IEEE Trans.
Pattern Anal. Mach. Intell. 30(10), 1728–1740 (2008). DOI 10.1109/TPAMI.2008.73.
URL http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.73
26. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using
deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine
Intelligence PP(99) (2015)
27. Maddalena, L., Petrosino, A.: A self-organizing approach to background subtraction for
visual surveillance applications. IEEE Transactions on Image Processing 17(7), 1168–
1177 (2008). DOI 10.1109/TIP.2008.924285. URL http://dx.doi.org/10.1109/TIP.
2008.924285
28. Maddalena, L., Petrosino, A.: A fuzzy spatial coherence-based approach to back-
ground/foreground separation for moving object detection. Neural Computing and
Applications 19(2), 179–186 (2010). DOI 10.1007/s00521-009-0285-8. URL http:
//dx.doi.org/10.1007/s00521-009-0285-8
29. Mihaylova, L., Brasnett, P., Canagarajah, N., Bull, D.: Object tracking by particle
filtering techniques in video sequences
30. Momeni-K., M., Diamantas, S.C., Ruggiero, F., Siciliano, B.: Height estimation from
a single camera view. In: Proceedings of the International Conference on Computer
Vision Theory and Applications, pp. 358–364. SciTePress (2012)
31. Noh, S., Jeon, M.: A new framework for background subtraction using multiple cues. In:
Computer Vision - ACCV 2012 - 11th Asian Conference on Computer Vision, Daejeon,
Korea, November 5-9, 2012, Revised Selected Papers, Part III, pp. 493–506 (2012).
DOI 10.1007/978-3-642-37431-9 38
32. OpenCV: http://opencv.org (2016)
33. Pan, J., Hu, B., Zhang, J.Q.: An efficient object tracking algorithm with adaptive
prediction of initial searching point. In: Advances in Image and Video Technol-
ogy, First Pacific Rim Symposium, PSIVT 2006, Hsinchu, Taiwan, December 10-13,
2006, Proceedings, pp. 1113–1122 (2006). DOI 10.1007/11949534 112. URL http:
//dx.doi.org/10.1007/11949534_112

34. Pnevmatikakis, A., Polymenakos, L.: Robust estimation of background for fixed cam-
eras. In: 15th International Conference on Computing (CIC ’06), pp. 37–42. Mexico
City, Mexico (2006)
35. Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In:
Y. Weiss, B. Schölkopf, J.C. Platt (eds.) Advances in Neural Information Processing
Systems 18, pp. 1161–1168. MIT Press (2006). URL http://papers.nips.cc/paper/
2921-learning-depth-from-single-monocular-images.pdf
36. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo corre-
spondence algorithms. International Journal of Computer Vision 47(1-3), 7–42 (2002)
37. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time track-
ing. In: 1999 Conference on Computer Vision and Pattern Recognition (CVPR ’99),
23-25 June 1999, Ft. Collins, CO, USA, pp. 2246–2252 (1999). DOI 10.1109/CVPR.
1999.784637. URL http://dx.doi.org/10.1109/CVPR.1999.784637
38. Talantzis, F., Pnevmatikakis, A., Constantinides, A.G.: Audio-Visual Person Tracking:
A Practical Approach. Imperial College Press (2012)
39. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple
features. In: IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR 2001), pp. 511–518. Kauai, HI, USA (2001)
40. Welch, G., Bishop, G.: An introduction to the kalman filter. Tech. rep., University of
North Carolina at Chapel Hill (2006)
41. Xu, L., Landabaso, J., Pardas, M.: Shadow removal with blob-based morphological
reconstruction for error correction. In: IEEE Int. Conf. on Acoustics, Speech, and
Signal Processing (ICASSP 2005). Philadelphia, PA, USA (2005)
42. Yao, J., Odobez, J.M.: Multi-layer background subtraction based on color and tex-
ture. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA (2007). DOI
10.1109/CVPR.2007.383497. URL http://dx.doi.org/10.1109/CVPR.2007.383497
43. Zhang, X., Hu, W., Maybank, S.J.: A smarter particle filter. In: Computer Vision
- ACCV 2009, 9th Asian Conference on Computer Vision, Xi’an, China, September
23-27, 2009, Revised Selected Papers, Part II, pp. 236–246 (2009). DOI 10.1007/
978-3-642-12304-7 23. URL http://dx.doi.org/10.1007/978-3-642-12304-7_23
44. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
45. Zivkovic, Z.: Improved adaptive gaussian mixture model for background subtraction.
In: 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK,
August 23-26, 2004., pp. 28–31 (2004). DOI 10.1109/ICPR.2004.1333992. URL http:
//dx.doi.org/10.1109/ICPR.2004.1333992
