Professional Documents
Culture Documents
Robust Background Subtraction Under Rapidly Changing Illumination Conditions
Robust Background Subtraction Under Rapidly Changing Illumination Conditions
\
|
+ =
+ +
+ +
, |
256
1
3
1
1
. (2)
The statistical illumination model can be expressed in pixel
color instead of illumination ratio by dividing Eq. 1 by |J
i
|, the
determinant of the Jacobian of function l
i
to obtain
( ) ( )
i i i i i
J z l p z u p / , , | , , | = , (3)
where, |J
i
| = m
i
r
m
i
g
m
i
b
. By multiplying Eq.s 1, divided by |J
i
|,
and 2 we obtain the joint log likelihood function p(u, z | ),
where = {, , } are the GMM parameters.
A. Spatial Likelihood Model
Instead of modeling spatial dependencies in the joint log
likelihood with a prior model on z, the spatial likelihood
model tries to capture spatial dependencies among pixels by
using two features that can be computed very fast from the
luminance channel. The first feature f
i
1
is the normalized cross
correlation (NCC) for each window around pixel i between
Fig. 1. From left to right from top to bottom: Input frame, ratio between input
and background model frame, fi
1
, fi
2
, background histogram probability and
foreground histogram probability for each pixel.
background model frame and input frame. It is given by
( ) ( )
=
i i i i
i i i
w j w j
j j
w j w j
j j
w j w j
j j
w j
j j
i
m m n u u n
m u m u n
f
2
2
2
2
1
, (4)
where, n is the number of pixels in the window centered
around pixel i. However, NCC is meaningless in untextured
areas, due to the normalization in the denominator of Eq. 4,
therefore, it cannot make the difference between a uniform
background and a uniform foreground region. Therefore, the
second feature f
i
2
is a measure of the amount of texture. It is
given by
( ) ( )
+ =
i i i i
w j w j
j j
w j w j
j j i
m m u u f
2
2
2
2 2
. (5)
Fig. 1 shows f
i
1
and f
i
2
for each pixel in the input image.
In [5] the background and foreground distributions
p
BG
(f
i
1
, f
i
2
| u
i
) and p
FG
(f
i
1
, f
i
2
| u
i
) are independently modeled
by two dimensional histograms. The histograms are learned
from a set of manually segmented images and are denoted by
h
BG
and h
FG
. The background histogram corresponds to
background mixture components k=1K, the foreground
histogram corresponds to foreground mixture components
k=K+1K
FG
+1. To achieve real time performance integral
imaging techniques [19] can be applied. With integral images
the features in Eq. 4 and 5 can be computed with a complexity
which is linear w.r.t. the number of pixels in the image and
constant w.r.t. the window size. Furthermore the histogram
features of the background frame can be pre computed. The
background and foreground probabilities for each pixel in the
input frame according to the background and foreground
histograms are shown in Fig. 1.
B. Maximum Likelihood Parameter Estimation
By combining the statistical illumination model, the
foreground model and the spatial likelihood model and
marginalizing over z, we obtain the combined p(u, f | ). The
parameters = {, , } are estimated by maximum likelihood
(ML) from Eq. 6. In order to reduce the computational burden
of the model, it is assumed that all pixels are independent
identically distributed (i.i.d). The ML estimate of Eq. 6 is then
simplified to Eq. 7 which can be maximized by the EM
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
5
algorithm.
( )
)
`
z
z f u p
\
|
+ =
+ +
+ +
, |
256
1
3
1
1
.
The EM algorithm alternates between the E-and M-step. In the
E-step the posterior distribution p(z
i
k
| u
i
,
t
) =
t
i,k
for each
pixel i and for each mixture component k are calculated. In the
M-step the parameters are updated to maximize the log
likelihood. Depending on the initialization the EM-algorithm
will converge to a local or global maximum of the joint log
likelihood. The resulting update equations for the pdf in Eq. 7
are given in appendix Fig. A1. The probability that pixel i
belongs to the foreground is given by P
i
BG
. The segmentation
is performed by thresholding P
i
BG
for all pixels in the image.
Instead of the i.i.d. assumption of Eq. 7, the spatial pixel
dependencies can be modeled explicitly by spatial priors. In
[12], pixel dependencies in a 4-connected neighborhood are
modeled by Ising model to obtain a MAP-MRF framework. To
maximize the MAP-MRF framework, graph cut optimization
or other iterative approximations like iterated common modes
simulated annealing or mean field are required. This will result
in a higher computational complexity which will not allow real
time operation. The spatial likelihood model of SI, however,
models pixel dependencies in textured areas of the image
while keeping the computational complexity low.
IV. PROPOSED APPROACH
SI has three undesired properties: 1) the spatial likelihood
model requires training on manually segmented ground truth
data, 2) it does not operate real time 3) it cannot cope with
local illumination effects. In this section:
A. We modify the spatial likelihood model to mitigate the
need for manually segmenting video data.
B. We reduce the complexity of SI to allow for real time
operation.
C. We propose a post processing filter which improves the
spatio-temporal consistency in the foreground mask.
D. We combine EB with the modified SI to obtain the
proposed algorithm. This algorithm is more accurate and
robust to rapid local illumination changes than EB and SI.
A. Modification Spatial Likelihood Model
The spatial likelihood model significantly improves the
segmentation results because it captures interpixel
Fig. 2. Improvement of segmentation with spatial likelihood model. From left
to right: segmentation with spatial likelihood model, segmentation without
spatial likelihood model and input frame.
dependencies and removes false positives. This is illustrated in
Fig. 2. Manually segmenting video data in background and
foreground in order to train the spatial likelihood model, is a
tedious and time consuming task, especially because a large
training set is needed to include many possible illumination
effects. Furthermore the histograms wont remain valid if
drastic changes in lighting conditions occur. Here we propose
to calculate the spatial likelihood model directly from the input
frames by reliably detecting background and foreground
pixels. For each input frame the background and foreground
histograms are then updated with the histogram features of the
reliable background and foreground pixels respectively.
In [17] a method for template matching is proposed which is
robust to sudden illumination changes. It uses a measure to
determine the amount of order preservation between an image
template and a window surrounding a pixel in the input image.
Since the ordering of pixels in textured image areas will be
preserved even in the presence of heavy photometric
distortions [10], this measure is extremely robust to detect
reliable background pixels for our application.
The order preservation can be determined by looking at the
sign of the pixel differences in a neighborhood. If the sign of
the difference between two pixels does not change, even in the
presence of lighting changes, the order is preserved. To
evaluate the degree of order preservation for pixel i, the NCC
of the image differences between pixels in a window around i
in the input frame and background model frame is calculated.
The value returned by the NCC is in the range [0, 1] and it is
independent of the amplitude of the differences between the
input frame and template due to normalization in the
denominator of Eq. 9.
We modify the method in [17] to compare the order
preservation of pixels in the input frame with the
corresponding pixels in the background frame instead of a
template. We represent the intensity of a pixel by F(x
i
, y
i
),
where x
i
, y
i
are the horizontal and vertical image position of
pixel i, respectively. For each pixel i in the input and
background model frame a vector of image differences
IN
(x
i
,y
i
) and
BG
(x
i
,y
i
) is calculated by
( )
( ) ( )
( ) ( )
( ) ( )
( ) ( )
(
(
(
(
+
+
=
1 , 1 ,
, 1 , 1
, 1 ,
, , 1
,
i i i i
i i i i
i i i i
i i i i
i i IN
y x F y x F
y x F y x F
y x F y x F
y x F y x F
y x
,
( )
( ) ( )
( ) ( )
( ) ( )
( ) ( )
(
(
(
(
+
+
=
1 , 1 ,
, 1 , 1
, 1 ,
, , 1
,
i i i i
i i i i
i i i i
i i i i
i i BG
y x F y x F
y x F y x F
y x F y x F
y x F y x F
y x
.(8)
The matching function which returns the measure for the
amount of order preservation of pixel i is then given by Eq. 9,
where w
i
is a rectangular window around pixel i. Eq. 8 shows
that each pixel has unique intensity differences, to prevent the
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
6
( )
i i
y x MF ,
=
( ) ( )
( ) ( ) ( ) ( )
i i
i
w i
i i FG
T
i i FG
w i
i i BG
T
i i BG
w i
i i FG
T
i i BG
y x y x y x y x
y x y x
, , , ,
, ,
(9)
intensity difference between the same two pixels from
occurring multiple times in the calculation of MF(x
i
, y
i
). The
matching function of Eq. 9 tries to match high contrast regions
of the two blocks under comparison, since only high intensity
differences can provide a contribution to the matching score.
High contrast regions appear near edges and in textured image
regions. Uniform or clipped regions in the input frame have no
contrast and are therefore unreliable. They will not be detected
as background pixels. Pixels are considered reliable
background if their matching score is larger than a threshold.
Fig. 3 shows reliable background pixels detected after a
sudden illumination change. Generally during illumination
changes without clipping, all background pixels in textured
regions and near edges are detected as reliable background,
while the number of false positives is very low.
Since Eq. 9 like Eq. 4 and 5 represents a NCC, we
implemented it very efficiently with integral image tables. The
advantage of integral image tables is that the sum of pixels in
window can be calculated with only 3 addition operations and
4 table look up operations. The numerator in Eq. 9 for
example can then be calculated with only 4 multiplications, 20
addition and 16 table lookup operations. If we assume a
window size M, then without integral images 4M
2
multiplications, 20M
2
additions and 8M
2
table lookups are
required. The complexity with integral images thus is
independent of the window size M. The same holds for the
products in the denominator.
Foreground pixels of moving objects are detected by a 3-
DRS motion estimator [18]. 3-DRS is a fast an efficient
motion estimation algorithm which applies recursive search
block matching. Here we apply it to generate motion vectors
for each pixel in the input image. The sum of absolute
difference (SAD) is used as the criterion to select the most
likely candidate vector for each block of pixels in the current
frame. All blocks for which the absolute value of the motion
vector is larger than a threshold will be considered foreground.
To remove noisy isolated motion vectors and motion vectors
which exceed object boundaries, a block filter is repeatedly
applied. This filter removes all motion vectors which have
more than 2 neighbors with zero motion vectors. Experiments
indicate that most false positive motion vectors are removed
by repeating the block filter 3 times. This procedure will
remove some non noisy motion vectors as well. This is no
problem because the spatial likelihood model can still be
updated with the remaining motion vectors to obtain an
accurate foreground histogram. To eliminate the block
structures and to obtain a smoother motion vector field we
apply block erosion [18].
3-DRS cannot cope with rapid illumination changes because
they are treated as motion. Undesirably motion vectors will be
assigned to the pixels which undergo an illumination change.
Furthermore, 3-DRS suffers from ghosting effects, which are
Fig. 3. Reliable background pixel detection under rapidly changing
illumination conditions. The green pixels indicate the reliably detected
background.
caused by motion vectors of moving objects which extend into
uniform background areas. Therefore we dont use the results
of the motion estimator to update the foreground histogram if
the percentage of reliable foreground pixels which overlap the
reliable background pixels is larger than 5%. If the number of
detected foreground pixels is smaller than 0.5% of the total
number pixels in the input frame, we wont update the
foreground histogram either, since then only a small fraction of
the foreground object is detected. This prevents updating the
foreground histogram with unreliable foreground pixels
detected by 3-DRS. The thresholds 5% and 0.5% are
determined empirically.
After calculating reliable background and foreground pixels
the background and foreground histograms of the current
frame, h
*
BG
and h
*
FG
, are calculated. The bins b
l
of the spatial
likelihood histograms h
BG
and h
FG
are then update according to
( ) ( ) ( ) ( )
l BG BG l BG BG l BG
b h b h b h
+ = 1
BG l
h b ,
, and
( ) ( ) ( ) ( )
l FG FG l FG FG l FG
b h b h b h
+ = 1
FG l
h b ,
, (10)
where,
BG
and
FG
are the learning rate parameters of the
background and foreground respectively. They indicate the
update speed of the histograms. Initially h
BG
and h
FG
are
uniform. If a foreground object is motionless h
FG
remains
uniform until the object starts moving.
B. Complexity Reduction
The algorithm in [5] does not run real time for a standard
resolution of 640x480 pixels. Therefore we reduce its
complexity by first downscaling the frame, then applying SI on
the downscaled frame and finally upscaling the segmentation
result. Theres a tradeoff between speed vs. segmentation
accuracy. Downscaling implies more loss in segmentation
accuracy because edges and small details will be lost,
however, it will speed up the segmentation. Since the
foreground mask will be further processed by other algorithms
for information extraction, it is sufficient for our application to
indicate the location and size of foreground objects. In our
implementation we therefore downscale with a factor 5 to
allow real time operation. Furthermore downscaling improves
the signal to noise ratio (SNR) of the image. This makes SI
more robust to noise.
C. Post processing
To improve the performance of SI a median filter on the
segmented frame can be applied before upscaling. Fig. 4
shows this filter improves the segmentation results by
removing false positive detections in the background and false
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
7
Fig. 4. Left: segmentation without post processing. Middle: segmentation
with post processing. Right: input frame.
Fig. 5. Left: segmentation with SI. Middle: Input frame. Right: background
model frame.
negatives detections in foreground objects, which appear as
random noise in the segmented image. However, this filter
does not remove temporal inconsistencies from the foreground
mask. Temporal inconsistencies manifest themselves as
flickering or the sudden emergence or disappearance of a
group of foreground pixel in the foreground mask. To combat
temporal inconsistencies and obtain a smooth foreground mask
we propose the recursive spatio-temporal filter described by
( )
( )
>
=
otherwise
thd n r F
n r F
o
o
, 0
, , 1
,
*
where, r=[x, y]
T
and
n
I r
( )
= =
=
|
|
\
|
+
+
|
|
\
|
+
(
(
(
|
|
\
|
(
+
+
|
|
\
|
+
|
|
\
|
(
|
|
\
|
+
|
|
\
|
(
|
|
\
|
|
|
\
|
(
+
+
|
|
\
|
+
|
|
\
|
(
+
|
|
\
|
=
|
|
\
|
(
1
1
1
1
*
1
1
* * *
1 , ,
1 1
,
1
0
1
,
1
0
1
,
1 1
,
1 1
1 ,
i j
o
i
o o o
n
j y
i x
F
j
i
h n
y
i x
F
i
h
n
y
x
F h n
y
x
F h
n
y
i x
F
i
h n
y
i x
F
i
h n
y
x
F
(11)
where thd is threshold value in the range [0,1], F
o
(r,n){0,1}
is the output foreground mask at location r for frame n, I
n
is
the set of all pixel locations in frame n, F[0,1] denotes the
input foreground mask, h is a 3 by 3 filter kernel with
normalized weights and [0,1] is a constant which weighs
the contribution of the pixels in the foreground mask of the
previous frame n-1. First the filter is applied from top left to
bottom right. To remove temporal inconsistencies in the
opposite direction the filter is applied again from bottom right
to top left. From Eq. 11 it can be noted that both spatial and
temporal recursiveness is embedded in our filter because the
filter uses all available output pixels in its aperture. In this way
flickering and holes in foreground objects will be removed
from the foreground mask. This is demonstrated in section V,
where the proposed filter is analyzed.
D. Proposed Method
SI can cope well with sudden global uniform illumination
changes. However, it completely fails when the appearance of
the background scene significantly changes due to cast
shadows, highlights and clipping. This is illustrated in Fig. 5,
where an input image in which the background is affected by
local illumination effects is shown. Ideally the ratio between
the input and background image pixels belongs to one of the
constant illumination ratios in SI. Fig. 5 shows that in the
presence of different local lighting conditions this does not
hold and that the pixels in the background have many different
illumination ratios. SI cannot model this.
As mentioned in section II, the background reconstruction
method of EB can reconstruct all possible local and global
illumination effects which are present in the training set. It
exploits multiple background frames in which valuable
information about lighting conditions is stored. By combining
the modified SI, which only uses one background frame, with
EB we obtain a segmentation method which is robust to heavy
photometric changes under different lighting conditions. For
each input frame we let EB reconstruct the background model
image separately in the three RGB channels. The reconstructed
background image is then used in SI. This will make the ratios
between the background model frame and the background
areas of the input image more constant. This will improve the
performance of SI, because the background pixel ratios will fit
the background model better. A block scheme of the resulting
algorithm is shown in Fig. 6.
The advantage of combining EB with SI is that it allows for
more flexibility in classifying a pixel than the pixel wise
differencing and thresholding of EB. Compared to the pixel
wise differencing of EB, the proposed method models the
background pixel ratios with Gaussians distributions of which
its parameters are automatically adjusted by the EM algorithm
to satisfy the maximum likelihood criterion. Furthermore, SI
models the spatial relations among pixels with the spatial
likelihood model. The EM algorithm will automatically update
the GMM parameters to achieve convergence to an optimum
of the likelihood function in Eq. 7. This allows for better
adaptation to the frame content compared to the thresholding
in [4]. By using EB the reliable background pixel detection
will be improved too. The proposed method thus effectively
combines the flexibility and EM parameter update of SI and
the background reconstruction capabilities of EB. The
experiments of section VI indicate that the proposed method
gives better segmentation results than EB or SI individually.
V. EXPERIMENTAL RESULTS
In this section the performance of TA, SI, EB with fixed and
dynamic threshold and the proposed method are compared
qualitatively and quantitatively in terms of precision and
recall. The ABMM of [20] is used as reference. This method is
an improvement of the pixel based method described in [7]
and [8], because it learns faster and more accurately adapts to
changing environments. Furthermore the effectiveness of the
modified spatial likelihood model and the proposed recursive
spatio-temporal post processing filter are demonstrated.
No benchmark sequences are available to measure the
performance of background subtraction algorithms under
rapidly changing illumination conditions. Therefore, we
recorded a very challenging test and training sequence with a
Logitech webcam. The sequence shows a room with a person
in which rapidly varying illumination conditions occur by
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
8
randomly turning on different combinations of light sources
and opening and closing shades, which block daylight from
entering the room.
A. Parameter Settings
We empirically choose the image differencing threshold in
TA method to give the best quantitative segmentation results
for the processed sequence. The fixed image differencing
threshold for EB is 20% of the largest difference between
input frame and reconstructed background model frame. The
parameters for the dynamic threshold are optimized
empirically to obtain the best segmentation results. For EB, SI
and the proposed method we use a downscale factor of 5. We
construct an Eigen space model separately for each RGB
channel, from a training set of 500 frames in which a diversity
of possible lighting conditions is captured. For each input
frame 10 Eigen vectors are used to reconstruct the background
image. Following [5], we use 2 background and 2 foreground
components in SI. Experiments with the number of
background and foreground components in the proposed
method point out that only 1 background and 2 foreground
components in SI suffice. The number of EM iterations per
frame is determined empirically. We dont observe any more
noticeable improvement in segmentation after more than two
EM iterations per frame. Therefore, we use 2 EM iterations
per frame for SI and the proposed method. We empirically
choose
BG
and
FG
as 0.2 and 0.4 respectively.
A median filter is applied as a post processing step to each
foreground mask obtained by the five algorithms. The window
size of the median filter is adapted for each method to obtain
the best segmentation results in terms of precision and recall.
This leads to a windows size of 5 by 5, for the proposed
method, EB, SI and the ABMM. A window size of 9 by 9 is
selected for TA.
B. Analysis
We performed a quantitative analysis at pixel level, on 750
frames of an indoor sequence in which illumination conditions
change rapidly. The ground truth foreground mask for the
quantitative comparison is obtained by manual segmentation of
that sequence. The detection rates for the five algorithms are
plotted in Fig. A2 in terms of precision and recall per frame,
defined by
% 100
#
#
=
Detected Positives of Total
Positives True of
precision
, and
% 100
#
#
=
Pixels Foreground Total
Positives True of
recall
.
Here precision is the percentage of detected true positive
foreground pixels from the total number of detected
foreground pixels, while recall is the percentage of all
foreground pixels detected. A recall of 100% can easily be
achieved by detecting the whole frame as foreground.
However, the precision will then be very low due to the large
number of false positives. On the other hand if only 1 pixel is
detected and that pixel correctly belongs to the foreground a
Fig. 6. Block diagram of the proposed method.
precision of 100% is achieved while the recall will be very
low. Therefore both precision and recall are needed for the
evaluation.
From the plots of the precision and recall in Fig. A2 and
from Table II it can be seen that the overall detection accuracy
for the proposed method is significantly higher. This method
achieves the highest average precision and recall. Having a
standard deviation of only 5.8% the precision of the proposed
method stays relatively constant compared to the other
methods. This means that in the foreground mask of our
method significantly less false positives occur.
For some intervals in Fig. A2 the recall of TA is higher by a
small amount. This indicates that TA detects more foreground
pixels in those intervals. However, the corresponding precision
values of our method are still significantly higher. The
implication of this can be seen in Fig. A3, where snapshots of
the foreground mask of every method are compared. In the
foreground mask of frame 437 and 747, processed by TA, the
recall is higher since more foreground pixel are detected,
however, the precision is clearly lower because many of the
positives detected belong to the background. For frame 600 to
640 the ABMM achieves a higher precision than our method,
however, its recall is significantly worse. Frame 620 of Fig.
A3 illustrates this well. With ABMM, 98% of the positives in
the foreground mask are foreground pixels, versus 85% in our
method. However our method detects 70% of all foreground
pixels while ABMM only detects 30%.
From Fig. A2 it can be clearly seen that SI and TA do not
work well on our sequence. This is because both methods only
use a single background model image to model the
background. Therefore, local illumination effects cannot be
modeled and they result as misdetections in the foreground
mask. Frames 202, 253, 592 and 747 in Fig. A3 illustrate this.
In frames where the background model image corresponds
well to the background scene of the input frame and where the
illumination change only affects the scene globally, TA and SI
both perform comparable to the proposed method.
Similar to the proposed method EB can reconstruct local
illumination effects, therefore local illumination changes will
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
9
0 100 200 300 400 500 600 700 800
50
60
70
80
90
100
Frame Number
(a) Precision
0 100 200 300 400 500 600 700 800
0
20
40
60
80
100
Frame Number
(b) Recall
Fig. 7. Quantitative comparison of applying the recursive spatio-temporal
filter (blue line) versus a median filter (red line) on the foreground mask
yielded by our proposed method. The optimal window size of the median
filter was found to be 5 by 5.
not be detected as foreground. The ability to reconstruct local
illumination effects is reflected in the precision. EB with
dynamic threshold detects equally much foreground pixels as
TA, however, its precision is 12.6% higher.
ABMM completely fails when sudden illumination changes
occur. Due to the pixel based nature of ABMM, pixels which
undergo a significant change in intensity will be considered
foreground, since they do not match any background
distributions in their GMM. Frame 14, 25 and 437 illustrate
this. Furthermore in ABMM the GMM of each pixel is
updated progressively over time. Therefore, pixels of objects,
which do not move for a number of consecutive frames, will
become part of a background distribution in their GMM,
causing them to disappear into the background in the
foreground mask.
In Fig. A2(c)-(d) and Table II the proposed method and EB
with fixed and dynamic threshold are compared. A 15%
increase in recall is obtained by dynamic instead of fixed
thresholding, however, the precision for dynamic thresholding
drops with 6%. Even if dynamic thresholding is applied, the
proposed method will still outperform EB.
Table II shows the contribution of the spatial likelihood
model in the proposed method. By applying the spatial
likelihood model an increase in precision of 6% is achieved
while the recall is hardly affected.
For the tested sequence, the proposed method on average
detects 57.3% of the foreground pixels with a precision higher
than 90%. The quantitative (Fig. A2) and qualitative (Fig. A3)
comparisons in this section confirm that the proposed method
outperforms the other methods on the tested sequence, by
having a higher recall and a significantly higher precision. The
proposed method thus detects more foreground pixels with a
higher accuracy and less misdetections for the tested sequence.
C. Analysis Recursive Spatio-Temporal filter
Here the recursive spatio-temporal filter of Eq. 11 is
analyzed and compared with a median filter, both
quantitatively and qualitatively. We use =0.4 and thd=0.42.
In Fig. 7 the precision and recall of the foreground mask of the
method processed with a median filter and the proposed filter
are plotted. The average precision and recall are reported in
Table II. The window size of the median filter was 5 by 5.
From Table II it can be observed that on average compared
to median filtering the recall is increased by 5% while the
precision stays approximately the same. Fig. 7 clearly shows
that the recall is higher by applying our filter. For example,
from frame 100 to frame 230 the recall of the median filtered
sequence drops while the recall of the sequence filtered with
the proposed filter even slightly increases. The precision is not
affected at all. The spatio-temporal recursiveness of our filter
will oppose sudden transitions and therefore achieve a
spatially and temporally smooth foreground mask. This is
shown in Fig. A4. If suddenly a new foreground object
emerges the filter needs about 3 frames before convergence is
reached and the object is fully detected.
VI. CONCLUSIONS AND FUTURE WORK
This paper proposed a fast method for the task of
background subtraction under rapidly changing illumination
conditions. This method is a combination of Eigenbackground
(EB) [4] and the modified statistical illumination model (SI).
SI models the background and foreground distribution by
Gaussian mixture model components. Maximum likelihood
parameter estimates are obtained by the expectation
maximization (EM) algorithm. However, SI cannot deal with
local illumination effects like clipping, highlights and
shadows. EB can reconstruct these effects very well but its
fixed or dynamic pixel based thresholding do not make any
assumptions about the foreground distribution in obtaining the
foreground mask, resulting in degraded performance. By
combining both methods we benefit from the modeling
capabilities and EM parameter updates of SI and the
background reconstruction capabilities of EB.
The spatial likelihood model is modified to mitigate the
need for training its spatial likelihood model on ground truth
data by allowing an online spatial likelihood model update per
frame. The modified spatial likelihood model will improve the
precision of the proposed method by 6% on the tested
TABLE II
COMPARISON OF THE AVERAGE AND STANDARD DEVIATION OF THE PRECISION
AND RECALL OF THE SEQUENCE USED IN FIG. A2.
Method
Precision
(%)
Recall
(%)
Proposed (recursive filter) 91.6 4.7 57.3 21.0
Proposed (median filter) 90.5 5.8 52.3 19.4
Proposed(recursive filter no histograms) 85.8 8.7 59.7 24.5
Proposed(median filter no histograms) 85.8 9.9 55.4 23.1
Eigenbackground (fixed threshold) [4] 75.8 16.6 29.2 21.5
Eigenbackground (dynamic threshold) [4] 69.6 21.1 45.4 24.9
Statistical Illumination [5] 61.9 26.0 20.6 16.9
Tonal Alignment [3] 57.0 17.0 44.0 21.7
ABMM [23] 38.1 31.5 31.8 23.2
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
10
sequence. Furthermore we proposed a temporal recursive filter
to obtain a spatially and temporally smooth foreground mask.
This post processing filter improves the recall of our method
by 5%.
We compared the proposed method with a reference method
and 3 state of the art background subtraction methods, which
claim to be robust to sudden illumination changes. Because no
benchmark sequences were available we conducted the
experiments on a challenging test sequence which contains
sudden illumination changes over a broad range of lighting
conditions. The results indicate that our method significantly
outperforms these methods on the tested sequence. With the
proposed post processing filter an average precision and recall
of 91.6% and 57.3% are obtained respectively. Furthermore
the precision of the proposed method is more constant with a
standard deviation of 5%. With the complexity reduction we
achieved an average frame rate of 15 fps which easily meets
the real time requirements for the application in ambient
intelligence systems.
Although we made an honest attempt to test the algorithms
on challenging and relevant video material, final conclusions,
regarding the superiority of the proposed method, require
generally accepted, and perhaps more diverse, benchmarks not
currently available.
A. Future Work
The proposed method is very robust under rapidly changing
illumination conditions provided the appearance of the
background in the input frame can be reconstructed by a linear
combination of the Eigen vectors obtained from the training
set. However, like EB and SI the proposed method cannot
cope with moved background objects. The object will be
detected forever after the movement since its new location
does not correspond with its location in the background model
image. Another problem is that for some applications it is
complicated to obtain background model images under
different illumination conditions. These problems can be
tackled by extending the proposed method with an incremental
Eigen space update in which each pixel is weighted before
updating the Eigen space model [4]. In this way new
appearances of the background are integrated into the Eigen
space model during runtime.
Input frames with large foreground objects, cause
degradation in the Eigenbackground reconstruction. These
degradations occur as errors which are spread over the entire
reconstructed background image resulting in inaccurate
segmentation [21]. This problem can be circumvented by
extending the proposed method with recursive error
compensation as described in [21].
The reliable foreground pixel detection in the spatial
likelihood model can be improved by using normalized cross
correlation (NCC) instead of sum of absolute difference (SAD)
as a block match criterion in the 3-DRS motion estimator.
NCC is less vulnerable to the amplitude differences between
the previous and current frame, which are caused by
illumination changes, due to normalization in its denominator.
ACKNOWLEDGEMENTS
The author would like to acknowledge his gratitude for the
support provided by C. Shan and T. Gritti. Special thanks go
out to G. de Haan for his advice and enabling me to complete
my graduation project at Philips Research laboratories.
REFERENCES
[1] E. Aarts, R. Harwig, M. Schuurmans, Ambient Intelligence, in The
Invisible Future: The Seamless Integration Of Technology Into
Everyday Life, P. J. Denning, Ed. New York: McGraw-Hill, 2002, pp.
235-250.
[2] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, Wallflower:
Principles and practice of background maintenance," Proc. ICCV, vol.
1, pp. 255261, Kerkyra, Greece, September, 1999.
[3] L. Di Stefano, F. Tombari, S. Mattoccia Robust and Accurate Change
Detection under Sudden Illumination Variations, Proc. ACCV, vol. 1,
pp. 103-109, Tokyo, November, 2007.
[4] B. Han, R. Jain Real-Time Subspace-Based Background Modeling
Using Multi-channel Data, ISVC, Part II, LNCS 4842, pp. 162-172,
Lake Tahoe, USA, November, 2007.
[5] J.Pilet, C. Stretcha, P. Fua, Making Background Subtraction Robust to
Sudden Illumination Changes, ECCV, part IV, LNCS 5305, pp. 567-
580, Marseille, France, October, 2008.
[6] R.J. Radke, S. Andra, O. Al-Kofahi, B. Roysam, Image Change
Detection Algorithms A systematic Survey, IEEE Transactions on
Image Processing, vol. 14, pp. 294-307, 2005.
[7] N. Friedman, S. Russell, Image Segmentation in Video Sequences: A
Probabilistic Approach, Proc.UAI, pp. 175-181, Rhode Island, USA,
August, 1997.
[8] C. Stauffer, W. Grimson, Learning Patterns of Activity Using Real-
Time Tracking, IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 22, pp. 747-757, August, 2000.
[9] L. Li, M. K. H. Leung, Integrating Intensity and Texture Differences
for Robust Change Detection, IEEE Transactions on Image
Processing, vol. 11, pp. 105-112, February, 2002.
[10] B. Xie, V. Ramesh, T. Boult, Sudden Illumination Change Detection
Using Order Consistency, Image and Vision Computing, vol. 22, pp.
117-125, 2004.
[11] X. Zhao, W. He, S. Luo, L. Zhang, MRF-Based Adaptive Approach for
Foreground Segmentation under Sudden Illumination Change, Proc.
ICICS, pp. 1-4, Singapore, December, 2007.
[12] Y. Sheikh, M. Shah, Bayesian Modeling of Dynamic Scenes for Object
Detection, IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, pp. 1778-1792, 2005.
[13] P. Noriega, B. Bascle, O. Bernier, Local kernel color histograms for
background subtraction, In INSTICC Press, editor, VISAPP, vol. 1, pp.
213219, 2006.
[14] P. Noriega, O. Bernier Real Time Illumination Invariant Background
Subtraction Using Local Kernel Histograms, Proc. BMVC, vol. 3, pp.
979-988, Edinburgh, September, 2006.
[15] N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian Computer Vision
System for Modeling Human Interactions, IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 22, pp. 831-843, 2000.
[16] R. Gonzalez, R.Woods. Digital Image Processing, 2nd ed. Upper
Saddle River, New Jersey: Prentice Hall, 2002, pp. 88-103.
[17] F. Tombari, L. Di Stefano, and S. Mattoccia, A robust measure for
visual correspondence, Proc. ICIAP, pp. 376-381, Modena, Italy,
September, 2007.
[18] G. de Haan, P.W.A.C. Biezen, H. Huijgen, O. Ojo. True Motion
Estimation with 3-D Recursive Search Block Matching, IEEE
Transactions on Circuits and Systems for Video Technology, vol. 3, pp.
368-388, 1993.
[19] D.-M. Tsai C.-T. Lin, Fast Normalized Cross Correlation for Defect
Detection, Pattern Recognition Letters, vol. 24, pp. 2625-2631, 2003.
[20] P. Kaewtrakulpong, R. Bowden An Improved Adaptive Background
Mixture Model for Real-time Tracking with Shadow Detection,
AVBSS, Kingston upon Thames, UK, September, 2001.
[21] Z. Xu, P. Shi, I. Gu, An Eigenbackground Subtraction Method Using
Recursive Error Compensation, PCM, pp. 779-787, Hangzhou, China,
September, 2006.
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
11
APPENDIX
Fig. A1. Update equations for maximizing the log likelihood of Eq. 7 by means of the EM algorithm. P is the number of pixels in a frame
and
T
denotes the transposed of a vector.
0 100 200 300 400 500 600 700 800
0
20
40
60
80
100
Frame Number
(a) Precision
0 100 200 300 400 500 600 700 800
0
20
40
60
80
100
Frame Number
Proposed
Eigenbackground [4]
Statistical Illumination [5]
Tonal Allignment [3]
ABMM [20]
(b) Recall
0 100 200 300 400 500 600 700 800
0
20
40
60
80
100
Frame Number
Proposed
Eigen Background Dynamic Threshold
Eigen Background Fixed Threshold
(c) Precision
0 100 200 300 400 500 600 700 800
0
20
40
60
80
100
Frame Number
(d) Recall
Fig. A2. (a)-(b) Quantitative comparison of the proposed method, EB [4], SI [5], TA [3] and ABMM [20] in terms of precision and recall per frame. (c)-(d)
Quantitative comparison proposed method, EB with dynamic threshold and EB with fixed threshold [4].
E-step M-step
( ) ( )
i
BG
t
k
t
k
i
i
t
k
t
K k i
f h l N
J N
=
+
=
, |
1 1
1
... 1 ,
( ) ( )
i
FG
t
k
t
k
i
t
k
t
K K k i
f h u N
N
FG
=
+
+ =
, |
1
1
... 1 ,
( )
i
FG
t
t
K K i
f h
N
FG K K
FG
|
|
\
|
=
+ + +
+ + 3
1
1 ,
256
1 1
+ +
=
+
=
= =
1
1
1
,
1
,
,
FG
K K
k
t
k i
K
k
k i
i
BG
N where P
( )
( )
=
+
=
+
+
P
i
i
t
k i k
P
i
i
t
k i k t
k
u N
l N
1
1
,
1
1
1
,
1
1
K k
K k
>
,
,
( ) ( )( )
( ) ( )( )
=
+ + +
=
+ + +
+
P
i
T
t
k
i
t
k
i
t
k i k
P
i
T
t
k
i
t
k
i
t
k i k t
k
u u N
l l N
1
1 1 1
,
1
1
1 1 1
,
1
1
K k
K k
>
,
,
=
+
+ +
=
+
= =
P
i
t
k i k K K
k
k
k t
k
N where
N
N
FG 1
1
, 1
1
1
,
GRADUATION SYMPOSIUM, 27 OCTOBER 2009
12
Frame Number Input Ground-truth Proposed EB SI TA ABMM
frame 14
frame 25
frame 202
frame 253
frame 380
frame 437
frame 592
frame 620
frame 644
frame 747
Fig. A3. Qualitative comparison of the foreground masks yielded by the five methods. From left to right: Input frame, ground truth, our proposed method,
Eigenbackground [4], the Statistical Illumination Model [5], Tonal Alignment [3] and the Adaptive Background Mixture Model [20]. Snapshots of the
foreground masks of all five methods are depicted for some frames of the sequence used in the quantitative comparison of Fig. A2.
Fig. A4. Qualitative comparison of applying a median filter (top) versus the proposed recursive spatio-temporal filter (bottom) on
frame 635-641.