Part of this work has been carried out in the scope of the EC co-funded projects ARGOS
(FP7-SEC-2012-1) and eWALL (FP7-610658).
Stefanos Astaras
Aalborg University, Center for TeleInFrastruktur
Fredrik Bajers Vej 7A,
9220 Aalborg, Denmark
E-mail: sast@ait.gr
George Bardas, Sotirios Diamantas, Aristodemos Pnevmatikakis
Athens Information Technology, Multimodal Signal Analytics
44, Kifisias Ave.,
15125 Marousi, Athens, Greece
E-mail: {sast,gbar,sodi,apne}@ait.gr
2 S. Astaras et al.
1 Introduction
2 Background Work
The majority of works that tackle the problem of depth estimation deal with stereo (binocular) vision. A stereo vision algorithm with applications in obstacle avoidance is presented in [24]. In [36] a taxonomy of dense two-frame stereo correspondence algorithms is given, categorizing them into local and global methods: the former are window-based and favor speed over accuracy, while the latter favor accuracy over speed. In [35] a depth estimation method using a monocular camera and supervised learning is presented, taking into account the global structure of the scene. The authors collected a set of outdoor images with ground-truth depth maps, used it to train their model, and then applied the learnt model to predict the depth map as a function of the image. Learning depth from monocular images using deep convolutional neural fields appears in [26]: the authors propose a deep structured learning scheme that learns the unary and pairwise potentials of a continuous conditional random field in a unified deep neural network framework. In [8] various methods are presented for estimating object heights from the ground vanishing line and the vertical vanishing point using an uncalibrated camera; a reference object of known height must be present in the image plane. In [7] a method for measuring the height of any feature with an uncalibrated camera is presented, based on the Focus of Expansion (FOE) under pure translation. In [30] a method for height estimation from vanishing lines and points is presented, and in [10] one using optical flow methods from active vision.
Our method is based on a calibrated camera and the assumption that the target is on a planar ground and fully visible. It is detailed in Section 3.
In order to separate the important objects in each frame from their surroundings, we employ a background subtraction algorithm. In the simplest case the background is constant, so it can be memorized, and any pixels that differ from it must belong to foreground objects. In a real-world application, the background pixels undergo slow or repetitive variations due to environmental factors (sunlight variations, wind, shadows). A good background subtraction algorithm should discriminate between these expected variations and the abrupt changes that denote a foreground object.
To this end, we use the Mixture of Gaussians (MOG) algorithm [37], with
modifications from [45] as implemented in OpenCV [32]. For each pixel, the
probability density function is estimated as a sum of Gaussian functions, which
are constantly updated. Foreground objects consist of pixels whose colors are uncommon at their position, so we can flag them accordingly. As for background pixels, slow variations are mitigated because the Gaussian functions of the mixture shift and adapt to the new values as measurements accumulate (without false positives, since the new values still correspond to the same Gaussian function as the previous background color), and oscillations are mitigated because the multiple background colors eventually correspond to multiple background Gaussian functions.
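The adaptation mechanism can be sketched in a few lines. The fragment below is an illustrative simplification with names of our choosing (a single Gaussian per pixel instead of a mixture), not the OpenCV implementation the system actually uses:

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """One step of a simplified per-pixel Gaussian background model.

    Pixels within k standard deviations of the current mean are treated as
    background, and the Gaussian adapts to them with learning rate alpha;
    the rest are flagged as foreground. (The full MOG algorithm keeps
    several Gaussians per pixel; one suffices to sketch the adaptation.)
    """
    frame = frame.astype(np.float64)
    dist2 = (frame - mean) ** 2
    foreground = dist2 > (k ** 2) * var      # uncommon colour at this pixel
    # Background pixels pull the Gaussian towards the new measurement.
    mean = np.where(foreground, mean, (1 - alpha) * mean + alpha * frame)
    var = np.where(foreground, var, (1 - alpha) * var + alpha * dist2)
    return mean, var, foreground
```

Slow illumination changes move `mean` gradually without raising the foreground flag, while a sudden colour change at a pixel is flagged immediately.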
Adaptive mixture models have continued to provide solutions for various applications. [19] and [45] are popular implementations of the mixture-of-Gaussians algorithm, tuned to real-world use. [14] uses nonparametric kernels instead of Gaussian functions, and [16] uses the local binary pattern operator to create texture models instead of color models. [4]
and [15] use pixel color sample models, instead of estimating a probability
density function. [42] and [31] fuse the color model with the texture model
and the contour model (shape), respectively. There are also solutions outside
the concept of adaptive pixel models; [27] uses neural networks, while [28] uses
a fuzzy system.
Our background subtraction method is a variant of the MOG algorithm,
with spatio-temporal adaptation of the learning rate, as discussed in Section
4.1.
The Kalman filter [40] operates in two steps: a prediction step, where the object state is estimated using past measurements, and a correction step, where the object model adapts to new measurements. This solution is very efficient and performs well under noise. However, it can only be applied to mostly linear dynamic models with Gaussian noise, and can only maintain a single hypothesis. [33] uses an adaptive Kalman filter that predicts the search point for subsequent frames.
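The predict/correct cycle can be sketched as a generic linear Kalman filter (textbook notation, for illustration only; this is not the adaptive variant of [33]):

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/correct cycle of a linear Kalman filter.

    x, P: state estimate and its covariance; z: new measurement;
    F, H: state-transition and measurement matrices;
    Q, R: process and measurement noise covariances.
    """
    # Prediction: propagate the state with the linear dynamic model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Correction: blend in the measurement via the Kalman gain.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

With a constant-velocity model (state = position and velocity), repeated calls track a target moving at a steady speed.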
Our approach to coping with target appearance changes is to use multiple models, as discussed in Section 4. We employ (i) a colour model that is trained on the target and requires infrequent retraining, and (ii) a foreground model that is not trained at all, as it only looks for foreground blobs that have background around them. These models differ in persistence and discriminating ability. The discriminative power of colour models depends on the colours of the background, and suffers from illumination changes and target pose changes. Foreground segmentation, on the other hand, distinguishes a moving target from the immobile background well, but is sensitive to camera motion, background changes and prolonged lack of motion. Although each stand-alone model suffers in persistence or discriminating ability, their combination with our proposed tracker yields robust performance.
The single camera 3D tracking system makes some assumptions about the targets it monitors:
– The world is planar. There exists a ground plane, without any curvature. The x and y axes of the world coordinate system define this plane (see Fig. 1). This assumption is easily met in indoor scenarios (where the ground plane is the floor of the building). It is violated by stairs indoors, and almost always outdoors, even on seemingly flat terrain.
– The targets are seen touching the ground. While touching the ground is almost always true for humans (unless they jump) and always true for vehicles (on flat terrain), being seen by the camera doing so requires that no background or foreground objects obstruct the camera's view of the bottom part of the target. Missing pixels of the lower part of a target can result in the tracker believing that the target is further away from the camera than it actually is.
– The targets are cylindrical. Hence their state x comprises the x and y coordinates on the ground plane, the height h (measured along the z axis of the world coordinates) and the radius r. The cylindrical approximation is a moderate one for humans, but a poor one for vehicles: while the visible width of a human target might not change much with viewing angle, that of a vehicle does. Hence, while tracking the width of the targets, the system should allow for noisy measurements.
Each target state needs to be mapped onto a patch of the image plane in
order to measure evidence of its existence. Also, each image patch detected
needs to be mapped onto a state vector in order to initialise a new target.
These two mappings are considered in the following subsections.
A camera views the surrounding 3D space and projects it onto the image plane. The geometry of this projection depends on two factors: the orientation of the camera, and the lens.
The orientation of the camera relative to the world can be described using two 3D Cartesian coordinate systems: that of the world and that of the camera (see Fig. 1). The camera coordinate system has its origin at the centre of projection of the camera, and its z_c axis along the principal ray of the camera [38]. The orientation of its other two axes approximately coincides with that of the two image plane axes, apart from a possible skew considered later in this section. The offset of the camera coordinate system from the world coordinate system is represented by the translation vector T_c, and its axes are rotated to match those of the world coordinate system by the rotation matrix R_c. The coordinates of a point in the
world coordinate system P are hence related to those in the camera coordinate
system P_c by:

P = R_c P_c + T_c    (1)

Fig. 1: Viewing a scene with a camera to project it onto an image plane. Three coordinate systems are involved: the 3D world and camera coordinate systems, and the 2D image coordinate system.
The rotation matrix and translation vectors are the extrinsic parameters of
the camera.
The camera coordinates of a point are not the coordinates on the image
plane. It is the lens type and the way it is mounted on the camera that govern
the way a point in the camera coordinate system is projected onto the image
plane. The first part of this projection involves the non-linear distortion of the
camera coordinates. This non-linear distortion has two components [44]. The first component is radial, i.e. a function of (even) powers of the distance from the principal point; a linear combination of the second (k_r^(2)), fourth (k_r^(4)) and sixth (k_r^(6)) powers of the distance is used. The second distortion component is tangential, and is determined by the two coefficients k_t^(1) and k_t^(2), as shown below.
Each line connecting the origin of the camera coordinate system (centre of
projection) and a point on that system is projected onto the same point in the
image plane, giving rise to the depth uncertainty of single camera imaging.
For this reason the correspondence between the depth-normalised coordinates
of the camera coordinate system and the image plane coordinates is sought.
Starting from the camera coordinates [x_c, y_c, z_c]^T, the depth-normalised camera coordinates are x_n = [x_n, y_n]^T = [x_c/z_c, y_c/z_c]^T. Denoting by r the distance of x_n from the principal point, the radial distortion scales x_n by (1 + k_r^(2) r^2 + k_r^(4) r^4 + k_r^(6) r^6), and the tangential distortion is modelled using the two coefficients k_t^(i), i = 1, 2:

x_t = [2 k_t^(1) x_n y_n + k_t^(2) (r^2 + 2 x_n^2),  k_t^(1) (r^2 + 2 y_n^2) + 2 k_t^(2) x_n y_n]^T    (5)

The distorted coordinates x_d are the sum of the radially scaled normalised coordinates and x_t.
The second part of the projection onto the image plane involves the linear projection of x_d. It is determined by the following parameters, taken from the pinhole camera model:
– Focal length: This is the distance of the principal point from the centre of projection.
– Non-square pixels: The aspect ratio of the pixels, together with the focal length, results in a scaling (different along each axis) of the viewed objects. Together they are denoted by the 2 × 1 focal length vector f_c = [f_x, f_y]^T.
– Principal point offset: This is the 2 × 1 translation vector c_c = [c_x, c_y]^T between the principal point on the image plane and the origin of the pixel coordinates.
– Skew: The skew coefficient α_c accounts for the fact that the camera coordinate system axes x_c and y_c are only approximately parallel to the respective axes of the image coordinate system. In most cases skew is approximated with zero.
Thus the pixel coordinates of the world point are a function of the distorted coordinates x_d = [x_d, y_d]^T, given by:

x_p = [f_x (x_d + α_c y_d), f_y y_d]^T + c_c    (6)
The distortion coefficients and the pinhole model parameters form the in-
trinsic camera parameters. Given these intrinsic camera parameters, eqs. (1)
to (6) map any point in the 3D world to a pixel in the image plane of the
camera.
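Eqs. (1) to (6) chain into a single world-to-pixel mapping. The sketch below is illustrative (the function name and argument layout are ours, not from an existing library):

```python
import numpy as np

def project_point(P, Rc, Tc, fc, cc, alpha_c=0.0,
                  kr=(0.0, 0.0, 0.0), kt=(0.0, 0.0)):
    """Map a 3D world point to pixel coordinates, following eqs. (1)-(6).

    Rc, Tc: extrinsics, with P = Rc @ Pc + Tc as in eq. (1).
    fc, cc: focal-length and principal-point 2-vectors; alpha_c: skew.
    kr, kt: radial and tangential distortion coefficients.
    """
    # Invert eq. (1) to obtain the camera coordinates.
    Pc = Rc.T @ (np.asarray(P, float) - Tc)
    xn, yn = Pc[0] / Pc[2], Pc[1] / Pc[2]          # depth-normalised coords
    r2 = xn ** 2 + yn ** 2
    radial = 1 + kr[0] * r2 + kr[1] * r2 ** 2 + kr[2] * r2 ** 3
    # Tangential distortion, eq. (5).
    xt = 2 * kt[0] * xn * yn + kt[1] * (r2 + 2 * xn ** 2)
    yt = kt[0] * (r2 + 2 * yn ** 2) + 2 * kt[1] * xn * yn
    xd, yd = radial * xn + xt, radial * yn + yt
    # Linear pinhole projection, eq. (6).
    return np.array([fc[0] * (xd + alpha_c * yd) + cc[0],
                     fc[1] * yd + cc[1]])
```

With identity extrinsics and no distortion, a point on the principal ray projects exactly to the principal point c_c, as expected.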
The state x is mapped into a rectangle in the image plane that bounds the target being tracked. The mapping is done as follows: first, the centre coordinates at the target height, P = [x, y, h]^T, are projected onto the image to x_p = [x_t^(i), y_t^(i)]^T.
Then K points on a circle centred on the ground at [x, y, 0]^T with radius r are selected for projection:

[x + r cos(2πk/K), y + r sin(2πk/K), 0]^T, where k = 0, ..., K − 1    (7)

and are projected onto the image to [x_c^(i)(k), y_c^(i)(k)]^T.
The bounding rectangle is then approximated using the following coordinates:
– Top: y_t^(i).
– Bottom: (1/K) Σ_{k=0}^{K−1} y_c^(i)(k).
– Left: min_k x_c^(i)(k).
– Right: max_k x_c^(i)(k).
The minimum value of K is 2. In this work we use K = 4.
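A sketch of this state-to-rectangle mapping, assuming a `project` callable that maps a 3D world point to (u, v) pixel coordinates (e.g. the camera model of the previous subsection); the names are illustrative:

```python
import numpy as np

def state_to_bbox(x, y, h, r, project, K=4):
    """Approximate the image bounding box of a cylindrical target state.

    `project` maps a 3D world point to (u, v) pixel coordinates. K points
    on the ground circle are projected, as in eq. (7). Returns the box as
    (left, top, right, bottom).
    """
    top = project((x, y, h))[1]                  # top-centre at height h
    ks = 2 * np.pi * np.arange(K) / K
    pts = [project((x + r * np.cos(t), y + r * np.sin(t), 0.0)) for t in ks]
    us, vs = zip(*pts)
    return min(us), top, max(us), np.mean(vs)    # left, top, right, bottom
```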
where R_c^(3) is the third row vector of the rotation matrix R_c and T_c^(3) is the third element of the translation vector T_c. Then the distance from the camera is found as:

z_c = (H − T_c^(3)) (R_c^(3) [x_n, y_n, 1]^T)^(−1)    (10)
Thus any point in the image that lies at some known height H in the world coordinate system can be mapped onto the other two world coordinates (the ground ones) by substituting (10) into (8).
Tracker initialisation from an image patch with left, top, right and bottom coordinates [x_l^(i), y_t^(i), x_r^(i), y_b^(i)]^T is performed as follows: first, the two pixels at the bottom edge corners of the patch are assumed to be the bottom of the target touching the ground. The two image coordinates x_l = [x_l^(i), y_b^(i)]^T and x_r = [x_r^(i), y_b^(i)]^T are transformed into the normalised camera coordinates x_{n,l} and x_{n,r} respectively. Since these points are touching the ground, two sets of ground coordinates [x_l, y_l]^T and [x_r, y_r]^T are obtained by substituting (10) into (8) while setting H = 0.
The ground coordinates of the new target are then initialised at:

x = (x_l + x_r)/2 and y = (y_l + y_r)/2    (11)

while the radius of the cylindrical approximation is:

r = √((x_r − x_l)^2 + (y_r − y_l)^2)    (12)
The height of the target is found by assuming that the top-centre pixel x_t = [(x_l^(i) + x_r^(i))/2, y_t^(i)]^T, corresponding to the normalised camera coordinates x_{n,t}, is at the world coordinates [x, y, h]^T. Then (8) yields:

[x, y, h]^T = R_c [x_{n,t} z_c, y_{n,t} z_c, z_c]^T + T_c    (13)

Either of the first two rows of (13) yields the distance from the camera z_c, which is then substituted back into the third row of (13) to yield the new target's height. E.g., from the first row of (13):

z_c = (x − T_c^(1)) (R_c^(1) [x_{n,t}, y_{n,t}, 1]^T)^(−1)    (14)

and substituting z_c back into the third row of (13) yields the new target's height:

h = (x − T_c^(1)) (R_c^(1) [x_{n,t}, y_{n,t}, 1]^T)^(−1) R_c^(3) [x_{n,t}, y_{n,t}, 1]^T + T_c^(3)    (15)
where R_c^(i) is the i-th row vector of the rotation matrix R_c and T_c^(i) is the i-th element of the translation vector T_c.
Equations (11), (12) and (15) yield the new target’s state at initialisation.
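The initialisation chain of eqs. (10) to (15) can be sketched as follows, assuming the normalised camera coordinates of the two bottom corners and the top-centre pixel are already available (function and variable names are ours):

```python
import numpy as np

def init_target(xn_l, xn_r, xn_t, Rc, Tc):
    """Initialise a target state from a detected patch, per eqs. (10)-(15).

    xn_l, xn_r: normalised camera coordinates of the bottom patch corners;
    xn_t: normalised coordinates of the top-centre pixel;
    Rc, Tc: camera extrinsics with world = Rc @ camera + Tc.
    Returns (x, y, r, h).
    """
    def ground_point(xn, H=0.0):
        v = np.array([xn[0], xn[1], 1.0])
        zc = (H - Tc[2]) / (Rc[2] @ v)           # eq. (10)
        return (Rc @ (v * zc) + Tc)[:2]          # ground (x, y) via (8)

    (xl, yl), (xr, yr) = ground_point(xn_l), ground_point(xn_r)
    x, y = (xl + xr) / 2, (yl + yr) / 2          # eq. (11)
    r = np.hypot(xr - xl, yr - yl)               # eq. (12)
    vt = np.array([xn_t[0], xn_t[1], 1.0])
    zc = (x - Tc[0]) / (Rc[0] @ vt)              # eq. (14)
    h = zc * (Rc[2] @ vt) + Tc[2]                # eq. (15)
    return x, y, r, h
```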
4 Likelihood Functions
Regions of the resulting Pixel Persistence Map (PPM) with large values correspond to pixels whose colours have appeared there for a long time, and hence are background. Conversely, regions with small values correspond to pixels whose colours have appeared there only for a short time, and hence are foreground. The unfiltered foreground pixels are those with weights below a threshold. Foreground pixels are subjected to shadow removal [41] and morphological clean-up to obtain the binary foreground mask I_frg. An erosion filter reduces foreground noise, followed by a closing filter that connects gaps and blobs that possibly belong to the same object. Foreground contours are then detected and filled, eliminating any object holes. The morphological filter kernels scale with the position in the frame: objects closer to the camera are more crudely filtered than those far away. I_frg and the distinct blobs detected drive a per-pixel adaptation of the learning rate [34]: it is increased to learn flicker in the background faster, and decreased to protect small immobile foreground patches from being learnt too fast.
The foreground evidence image is then given by:

I_fev = α_PPM (1 − I_PPM) + α_frg I_frg    (17)

where α_PPM and α_frg are scaling constants. The PPM is the more robust term in the sum, while the addition of the binary mask amplifies the effect of motion of significant objects in the scene.
The state x of each target corresponds to an image patch I_x. To force the patch to be large enough to include all of the moving object, negative foreground evidence is also collected in a region surrounding I_x, designated I'_x. The background expansion is 20% of the target size along each direction. Asking for no foreground evidence in the expansion region ensures that the tracked target does not shrink to the denser parts of the actual target. The foreground evidence is then defined as:

L_frg(x) = (1/A(I_x)) Σ_{i∈I_x} I_fev(i) − (1/A(I'_x)) Σ_{i∈I'_x} I_fev(i)    (18)

where A(I) is the area in pixels of the image patch I, and i is a pixel index for an image patch, denoted one-dimensional for notation simplicity. Note that if the regions I_x and I'_x are non-rotated rectangles, then the foreground evidence is efficiently calculated using the integral image [39] of I_fev.
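The integral image evaluation of eq. (18) can be sketched as below (illustrative; the surrounding region I'_x is taken here as the expansion ring excluding I_x, and `box` is (top, left, bottom, right) in pixels):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return S

def rect_sum(S, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] from the summed-area table S."""
    return S[bottom, right] - S[top, right] - S[bottom, left] + S[top, left]

def foreground_evidence(I_fev, box, expand=0.2):
    """L_frg of eq. (18): mean evidence inside the patch I_x minus the
    mean evidence in the surrounding expansion ring I'_x."""
    top, left, bottom, right = box
    h, w = bottom - top, right - left
    S = integral_image(I_fev)
    inner = rect_sum(S, top, left, bottom, right)
    # Expand the box by `expand` of the target size along each direction.
    t = max(0, int(top - expand * h)); l = max(0, int(left - expand * w))
    b = min(I_fev.shape[0], int(bottom + expand * h))
    r = min(I_fev.shape[1], int(right + expand * w))
    ring = rect_sum(S, t, l, b, r) - inner
    ring_area = (b - t) * (r - l) - h * w
    return inner / (h * w) - ring / ring_area
```

Each rectangle sum costs four table lookups, so the evidence of every candidate patch is evaluated in constant time after one pass over the frame.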
To define the foreground mismatch factor, we empirically define L_frg^(max), the maximum foreground certainty. The theoretical maximum value of L_frg is obtained when the evidence images I_df, I_PPM and I_frg are unity in I_x and zero in I'_x. Then L_frg^(max) = α_PPM + α_frg, but in practice it is lower. We also use a saturated version L'_frg of the foreground certainty, clipped to [0, L_frg^(max)]. Using the saturated version of the foreground certainty, the foreground mismatch factor is given by:

M(y_frg | x) = (L_frg^(max) − L'_frg(y | x)) / L_frg^(max)    (20)
Substituting (20) into (16), the foreground likelihood is obtained:

p(y_frg | x) ∝ exp(−M(y_frg | x) / (2σ_frg^2))    (21)

where σ_frg^2 is the variance of the distribution. Again, small values of the variance increase the selectivity of the foreground likelihood.
The foreground likelihood does not depend on any model that is specific
to the foreground object being tracked. It only employs a general background
model that depends on the variations of the pixels in the video frames. On
the other hand, the background modeling process is not entirely agnostic to
objects, since the learning rate of the pixels corresponding to the tracked
objects is lowered to decelerate fading of foreground objects to the background.
where h'_ref^(min) is the minimum non-zero value of the histogram h'_ref. Then the defined bins are:

h_ref^(sup)(i) = { a(i) h_ref(i), if h'_ref(i) > K h_ref(i)
                   h_ref(i),      if h'_ref(i) ≤ K h_ref(i) }    (24)

where K governs the strength of the suppression, as discussed in Section 6. The bins of the new histogram h_ref^(sup) are normalised to sum to unity.
Utilising these histograms, the colour similarity metric is defined by combining the following Bhattacharyya coefficients:

L_cm(x | h_ref) = Σ_{i=0}^{N_h^3−1} √(h_x(i) h_ref^(sup)(i)) + Σ_{i=0}^{N_h^3−1} √(h'_x(i) h'_ref(i)) − Σ_{i=0}^{N_h^3−1} √(h'_x(i) h_ref^(sup)(i))    (25)

where h_x(i), h'_x(i), h_ref^(sup)(i) and h'_ref(i) are the i-th bins of the respective histograms.
The first term is zero when no match is found between the colours of the target and the image patch, and one when the histograms are identical. The second term is zero when no match is found between the colours of the backgrounds of the reference target and the image patch, and one when the histograms are identical. Finally, the third term is zero when the background of the image patch and the foreground of the reference target have no match, and one when they are identical. The image patch background matches the reference target foreground when the candidate rectangle is offset so that the target falls in the patch's surrounding region; hence this term is subtracted from the colour similarity metric.
To define the colour mismatch factor, we empirically define L_cm^(max), the maximum colour matching certainty. The theoretical maximum value of L_cm is obtained, as seen from above, when the first and second terms in (25) are maximised to unity and the third is minimised to zero. In practice, the maximum L_cm is smaller than the theoretical value of two. We also define the saturated version of the colour matching certainty, to obtain a value in [0, 1]:

L'_cm(x) = { 0,              if L_cm(x) ≤ 0
             L_cm(x),        if 0 < L_cm(x) < L_cm^(max)(x)
             L_cm^(max)(x),  if L_cm(x) ≥ L_cm^(max)(x) }    (26)
Using the saturated version of the colour matching certainty, the colour mismatch factor is given by:

M(y_cm | x) = (L_cm^(max)(x) − L'_cm(y | x)) / L_cm^(max)(x)    (27)

By substituting (27) into the exponential distribution (16), the colour matching likelihood is obtained:

p(y_cm | x) ∝ exp(−M(y_cm | x) / (2σ_cm^2))    (28)

where σ_cm^2 is the variance of the distribution. Small values of the variance increase the selectivity of the colour matching likelihood, as discussed in Section 6.
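Eqs. (25) to (28) combine into a few lines. The sketch below assumes pre-computed histograms, flattened over the N_h^3 bins and normalised to sum to one (function and argument names are ours):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalised histograms."""
    return np.sum(np.sqrt(p * q))

def colour_likelihood(h_x, hb_x, h_ref_sup, hb_ref, L_max, sigma_cm):
    """Colour matching likelihood of eqs. (25)-(28), up to a constant.

    h_x, hb_x: candidate patch foreground/background histograms;
    h_ref_sup, hb_ref: suppressed reference and reference background
    histograms; L_max: the empirical maximum certainty L_cm^(max).
    """
    # Eq. (25): combine the three Bhattacharyya coefficient terms.
    L = (bhattacharyya(h_x, h_ref_sup)
         + bhattacharyya(hb_x, hb_ref)
         - bhattacharyya(hb_x, h_ref_sup))
    L_sat = np.clip(L, 0.0, L_max)               # saturation, eq. (26)
    M = (L_max - L_sat) / L_max                  # mismatch factor, eq. (27)
    return np.exp(-M / (2 * sigma_cm ** 2))      # likelihood, eq. (28)
```

A patch whose histograms match the reference scores close to one, while a fully mismatched patch is pushed towards zero by the exponential.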
The tracked state of each target contains the height and the radius of the cylinder modelling it. Since the estimate of the radius depends strongly on the viewpoint, only the height of the target is used for classification. The classifier tries to recognise cars and humans. The height of each class is modelled as a random variable H, whose distribution is sought. Using datasets of humans and cars, the heights are fit with different distributions. The log-normal yields the minimum fitting error, and hence is chosen. Its cumulative distribution function is:

F_H(h; µ, σ) = (1/2) erfc(−(ln h − µ)/(σ√2))    (29)

where erfc(·) is the complementary error function. The model fitting approach yields the (µ, σ) pairs for the heights of cars and humans.
Since the estimation of the height is prone to errors, the probability of a range of heights around the nominal value given the class is sought:

P^(c)(h) = F_H^(c)(h + δ) − F_H^(c)(h − δ)    (30)

where c is the class (car or human) and δ quantifies the height uncertainty; it is selected to be 2 cm.
For the classification, (30) is evaluated for the car and human classes and the decision is:

c = { car,     if P^(car)(h) > 0.08 and P^(car)(h) > 1.2 P^(human)(h)
      human,   if P^(human)(h) > 0.08 and P^(human)(h) > 1.2 P^(car)(h)
      unknown, otherwise }    (31)
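The decision rule of eqs. (29) to (31) can be sketched as below; the (µ, σ) pairs are placeholders of our choosing, since the fitted values are not quoted here:

```python
import math

# Hypothetical log-normal parameters (mu, sigma) per class; the actual
# values come from the model fitting described above.
PARAMS = {"human": (0.54, 0.07), "car": (0.37, 0.12)}

def height_cdf(h, mu, sigma):
    """Log-normal cumulative distribution function of eq. (29)."""
    return 0.5 * math.erfc(-(math.log(h) - mu) / (sigma * math.sqrt(2)))

def classify(h, delta=0.02, p_min=0.08, margin=1.2):
    """Decision rule of eqs. (30)-(31) on an estimated height h (metres)."""
    P = {c: height_cdf(h + delta, *p) - height_cdf(h - delta, *p)
         for c, p in PARAMS.items()}
    if P["car"] > p_min and P["car"] > margin * P["human"]:
        return "car"
    if P["human"] > p_min and P["human"] > margin * P["car"]:
        return "human"
    return "unknown"
```

Heights far from both fitted modes fail the probability threshold for both classes and fall through to "unknown".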
6 Results
Fig. 2: Three images to be used for testing the likelihood functions. The
two bounding boxes marked are target (green) and background around target
(red).
– T_(rf,cf): Bhattacharyya coefficient term between reference and current foreground
– T_(rb,cf): Bhattacharyya coefficient term between reference background and current foreground
– T_(rb,cb): Bhattacharyya coefficient term between reference background and current background
From Figure 3 it is obvious that the first (blue solid line) and third (red dash-dotted line) combinations of Bhattacharyya coefficient terms are excluded because of their low selectivity. Furthermore, the fourth combination (black dotted line) is ruled out too, because it is not monotonic: it has local maxima left and right of the global maximum, which is problematic for the particle filter tracker. So the second combination of Bhattacharyya coefficient terms (green dashed line) is selected, because it has good selectivity and is monotonic around zero offset.
We next examine the effect of the strength of background suppression in
the reference histogram. The results of varying K are shown in Figure 4. Peak
position, monotonicity and selectivity improve at K = 0.125.
Finally, we examine the effect of the exponential distribution variance on the selectivity of the colour matching likelihood. The results are shown in Figure 5. We select σ_cm = 1/4 to have selectivity, but also to avoid a very abrupt function that would be unable to attract moderately offset particles to the correct location.
An example of the target classification follows in Figure 6, where three
different outdoor scenarios are depicted.
7 Conclusion
References
1. Andersen, M., Andersen, R., Katsarakis, N., Pnevmatikakis, A., Tan, Z.H.: Three-
dimensional adaptive sensing of people in a multi-camera setup. In: Person tracking
for assistive working and living environments, EUSIPCO 2010, pp. 964–968. Aalborg,
Denmark (2010)
2. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for on-
line non-linear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing
50(2), 174–188 (2002)
Fig. 3: Colour matching likelihood vs. horizontal offset (% of target width) for the four combinations of Bhattacharyya coefficient terms (T_rf,cf; T_rf,cf − T_rb,cf; T_rf,cf + T_rb,cb; T_rf,cf − T_rb,cf + T_rb,cb), for reference-similar and reference-different images.
3. Babenko, B., Yang, M.H., Belongie, S.: Visual Tracking with Online Multiple Instance
Learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2009). Miami Beach, FL, USA (2009)
4. Barnich, O., Droogenbroeck, M.V.: Vibe: A universal background subtraction algorithm
for video sequences. IEEE Transactions on Image Processing 20(6), 1709–1724 (2011).
DOI 10.1109/TIP.2010.2101613. URL http://dx.doi.org/10.1109/TIP.2010.2101613
5. Bouguet, J.Y.: Camera calibration toolbox for Matlab. www.vision.caltech.edu/bouguetj/calib_doc/htmls/parameters.html (2008)
6. Chen, Z., Ellis, T.: A self-adaptive Gaussian mixture model. Computer Vision and Image
Understanding 122(0), 35 – 46 (2014). DOI http://dx.doi.org/10.1016/j.cviu.2014.01.
004. URL http://www.sciencedirect.com/science/article/pii/S1077314214000113
Fig. 4: Colour matching likelihood vs. horizontal offset (% of target width) for varying background suppression strength (no suppression, K = 2, 1, 0.5, 0.25, 0.125), for reference-similar and reference-different images.
7. Chen, Z., Pears, N., Liang, B.: A method of visual metrology from uncalibrated images.
Pattern Recognition Letters 27(13), 1447–1456 (2006)
8. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. International Journal of
Computer Vision 40(2), 123–148 (2000)
9. Diamantas, S.C.: Biological and metric maps applied to robot homing. Ph.D. thesis,
School of Electronics and Computer Science, University of Southampton (2010)
10. Diamantas, S.C., Dasgupta, P.: An active vision approach to height estimation with
optical flow. In: International Symposium on Visual Computing, pp. 160–170. Springer
(2013)
11. Diamantas, S.C., Oikonomidis, A., Crowder, R.M.: Depth computation using optical
flow and least squares. In: IEEE/SICE International Symposium on System Integration,
pp. 7–12. Sendai, Japan (2010)
Fig. 5: Colour matching likelihood vs. horizontal offset (% of target width) for varying exponential distribution σ (1, 0.5, 0.25, 0.125), for reference-similar and reference-different images.
12. Diamantas, S.C., Oikonomidis, A., Crowder, R.M.: Depth estimation for autonomous
robot navigation: A comparative approach. In: International Conference on Imaging
Systems and Techniques, pp. 426–430. Thessaloniki, Greece (2010)
13. Ding, X., Xu, H., Cui, P., Sun, L., Yang, S.: A cascade svm approach for head-shoulder
detection using histograms of oriented gradients. In: IEEE International Symposium on
Circuits and Systems (ISCAS 2009), pp. 1791–1794. Taipei, Taiwan (2009)
14. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground
modeling using nonparametric kernel density estimation for visual surveillance. In:
Proceeding of the IEEE, vol. 90, pp. 1151–1163 (2002)
15. Godbehere, A.B., Matsukawa, A., Goldberg, K.Y.: Visual tracking of human visitors
under variable-lighting conditions for a responsive audio art installation. In: American
Control Conference, ACC 2012, Montreal, QC, Canada, June 27-29, 2012, pp. 4305–4312
(2012)
16. Heikkilä, M., Pietikäinen, M.: A texture-based method for modeling the background
and detecting moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 657–662
(2006). DOI 10.1109/TPAMI.2006.68. URL http://doi.ieeecomputersociety.org/
10.1109/TPAMI.2006.68
17. Jaffré, G., Crouzil, A.: Non-rigid object localization from color model using mean shift.
In: IEEE International Conference on Image Processing (ICIP 2003), pp. 317–320.
Barcelona, Spain (2003)
18. Jones, M.J., Rehg, J.M.: Statistical color models with application to skin detection.
International Journal of Computer Vision 46(1), 81–96 (2002)
19. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model
for real-time tracking with shadow detection. In: Video-Based Surveillance Systems,
chap. 11, pp. 135–144. Springer US (2002)
20. Katsarakis, N., Pnevmatikakis, A., Tan, Z., Prasad, R.: Combination of multiple mea-
surement cues for visual face tracking. Wireless Personal Communications 78(3), 1789–
1810 (2014). DOI 10.1007/s11277-014-1900-2. URL http://dx.doi.org/10.1007/
s11277-014-1900-2
21. Katsarakis, N., Pnevmatikakis, A., Tan, Z.H., Prasad, R.: Combination of multiple
measurement cues for visual face tracking. Wireless Personal Communications 78(3),
1789–1810 (2014). DOI 10.1007/s11277-014-1900-2
22. Khan, Z., Balch, T.R., Dellaert, F.: Mcmc-based particle filtering for tracking a variable
number of interacting targets. IEEE Trans. Pattern Anal. Mach. Intell. 27(11), 1805–
1918 (2005). DOI 10.1109/TPAMI.2005.223. URL http://doi.ieeecomputersociety.
org/10.1109/TPAMI.2005.223
23. Kitagawa, G.: Monte carlo filter and smoother for non-gaussian nonlinear state space
models. Journal of Computational and Graphical Statistics 5(1), 1–25 (1996)
24. Nalpantidis, L., Kostavelis, I., Gasteratos, A.: Stereovision-based algorithm for obstacle
avoidance. In: Intelligent Robotics and Applications, vol. 5928, pp. 195–204 (2009)
25. Li, Y., Ai, H., Yamashita, T., Lao, S., Kawade, M.: Tracking in low frame rate video: A
cascade particle filter with discriminative observers of different life spans. IEEE Trans.
Pattern Anal. Mach. Intell. 30(10), 1728–1740 (2008). DOI 10.1109/TPAMI.2008.73.
URL http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.73
26. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using
deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine
Intelligence PP(99) (2015)
27. Maddalena, L., Petrosino, A.: A self-organizing approach to background subtraction for
visual surveillance applications. IEEE Transactions on Image Processing 17(7), 1168–
1177 (2008). DOI 10.1109/TIP.2008.924285. URL http://dx.doi.org/10.1109/TIP.
2008.924285
28. Maddalena, L., Petrosino, A.: A fuzzy spatial coherence-based approach to back-
ground/foreground separation for moving object detection. Neural Computing and
Applications 19(2), 179–186 (2010). DOI 10.1007/s00521-009-0285-8. URL http:
//dx.doi.org/10.1007/s00521-009-0285-8
29. Mihaylova, L., Brasnett, P., Canagarajah, N., Bull, D.: Object tracking by particle
filtering techniques in video sequences
30. Momeni-K., M., Diamantas, S.C., Ruggiero, F., Siciliano, B.: Height estimation from
a single camera view. In: Proceedings of the International Conference on Computer
Vision Theory and Applications, pp. 358–364. SciTePress (2012)
31. Noh, S., Jeon, M.: A new framework for background subtraction using multiple cues. In:
Computer Vision - ACCV 2012 - 11th Asian Conference on Computer Vision, Daejeon,
Korea, November 5-9, 2012, Revised Selected Papers, Part III, pp. 493–506 (2012).
DOI 10.1007/978-3-642-37431-9 38
32. OpenCV: http://opencv.org (2016)
33. Pan, J., Hu, B., Zhang, J.Q.: An efficient object tracking algorithm with adaptive
prediction of initial searching point. In: Advances in Image and Video Technol-
ogy, First Pacific Rim Symposium, PSIVT 2006, Hsinchu, Taiwan, December 10-13,
2006, Proceedings, pp. 1113–1122 (2006). DOI 10.1007/11949534 112. URL http:
//dx.doi.org/10.1007/11949534_112
34. Pnevmatikakis, A., Polymenakos, L.: Robust estimation of background for fixed cam-
eras. In: 15th International Conference on Computing (CIC ’06), pp. 37–42. Mexico
City, Mexico (2006)
35. Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In:
Y. Weiss, B. Schölkopf, J.C. Platt (eds.) Advances in Neural Information Processing
Systems 18, pp. 1161–1168. MIT Press (2006). URL http://papers.nips.cc/paper/
2921-learning-depth-from-single-monocular-images.pdf
36. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo corre-
spondence algorithms. International Journal of Computer Vision 47(1-3), 7–42 (2002)
37. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time track-
ing. In: 1999 Conference on Computer Vision and Pattern Recognition (CVPR ’99),
23-25 June 1999, Ft. Collins, CO, USA, pp. 2246–2252 (1999). DOI 10.1109/CVPR.
1999.784637. URL http://dx.doi.org/10.1109/CVPR.1999.784637
38. Talantzis, F., Pnevmatikakis, A., Constantinides, A.G.: Audio-Visual Person Tracking:
A Practical Approach. Imperial College Press (2012)
39. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple
features. In: IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR 2001), pp. 511–518. Kauai, HI, USA (2001)
40. Welch, G., Bishop, G.: An introduction to the kalman filter. Tech. rep., University of
North Carolina at Chapel Hill (2006)
41. Xu, L., Landabaso, J., Pardas, M.: Shadow removal with blob-based morphological
reconstruction for error correction. In: IEEE Int. Conf. on Acoustics, Speech, and
Signal Processing (ICASSP 2005). Philadelphia, PA, USA (2005)
42. Yao, J., Odobez, J.M.: Multi-layer background subtraction based on color and tex-
ture. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA (2007). DOI
10.1109/CVPR.2007.383497. URL http://dx.doi.org/10.1109/CVPR.2007.383497
43. Zhang, X., Hu, W., Maybank, S.J.: A smarter particle filter. In: Computer Vision
- ACCV 2009, 9th Asian Conference on Computer Vision, Xi’an, China, September
23-27, 2009, Revised Selected Papers, Part II, pp. 236–246 (2009). DOI 10.1007/
978-3-642-12304-7 23. URL http://dx.doi.org/10.1007/978-3-642-12304-7_23
44. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
45. Zivkovic, Z.: Improved adaptive gaussian mixture model for background subtraction.
In: 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK,
August 23-26, 2004., pp. 28–31 (2004). DOI 10.1109/ICPR.2004.1333992. URL http:
//dx.doi.org/10.1109/ICPR.2004.1333992