
Bayesian Loss for Crowd Count Estimation with Point Supervision

Zhiheng Ma1∗  Xing Wei1∗  Xiaopeng Hong1,2†  Yihong Gong1

1 Faculty of Electronic and Information Engineering, Xi'an Jiaotong University
2 Research Center for Artificial Intelligence, Peng Cheng Laboratory

mazhiheng@stu.xjtu.edu.cn  xingxjtu@gmail.com  {hongxiaopeng,ygong}@mail.xjtu.edu.cn

∗ Equal contribution.  † Corresponding author.

Abstract

In crowd counting datasets, each person is annotated by a point, which is usually the center of the head, and the task is to estimate the total count in a crowd scene. Most of the state-of-the-art methods are based on density map estimation, which converts the sparse point annotations into a "ground-truth" density map through a Gaussian kernel, and then uses it as the learning target to train a density map estimator. However, such a "ground-truth" density map is imperfect due to occlusions, perspective effects, variations in object shapes, etc. Instead, we propose Bayesian loss, a novel loss function which constructs a density contribution probability model from the point annotations. Rather than constraining the value at every pixel in the density map, the proposed training loss adopts a more reliable supervision on the count expectation at each annotated point. Without bells and whistles, the loss function makes substantial improvements over the baseline loss on all tested datasets. Moreover, our proposed loss function, equipped with a standard backbone network and without using any external detectors or multi-scale architectures, performs favorably against the state of the art. Our method outperforms previous best approaches by a large margin on the latest and largest UCF-QNRF dataset. The source code is available at https://github.com/ZhihengCV/Baysian-Crowd-Counting.

1. Introduction

Counting dense crowds using computer vision techniques has attracted remarkable attention in recent years. It has a wide range of applications, such as estimating the scale of, and counting the number of participants in, political rallies, civil unrest, and social and sport events. In addition, methods for crowd counting also have great potential for similar tasks in other domains, including estimating the number of vehicles in traffic congestion [29, 30, 56, 14, 28], counting cells and bacteria in microscopic images [20, 42, 44, 45, 8], and animal crowd estimation for ecological surveys [27, 1, 18], to name a few.

Crowd counting is a very challenging task because: 1) dense crowds often exhibit heavy overlaps and occlusions between people; 2) perspective effects may cause large variations in human size, shape, and appearance in the image. In the past decade, a number of crowd counting algorithms [22, 58, 21, 12, 5, 35, 20, 11, 7, 31] have been proposed in the literature. Recently, crowd counting methods using Convolutional Neural Networks (CNNs) have made remarkable progress [53, 46, 36, 6, 57, 59, 38, 9, 4, 32, 16]. The best performing methods are mostly based on density map estimation: they obtain the crowd count by predicting a density map for the input image and then summing over the estimated density map. Publicly available datasets [15, 57, 16] for training crowd count estimators only provide point annotations for each training image, i.e., only one pixel of each person is labeled (typically the center of the head). Currently, the most common approach for using these annotations is to first convert the point annotations of each training image into a "ground-truth" density map using the Gaussian kernel, and then train a CNN model by regressing the value at each pixel in this density map. With such strict pixel-level supervision, the accuracy of a CNN model is highly dependent on the quality of the obtained "ground-truth" density maps.

Obviously, "ground-truth" density maps obtained by applying a hypothetical Gaussian kernel to the point annotations can hardly be of top quality, due to occlusions, irregular crowd distributions, large variations in object size, shape, density, etc. Instead, we propose Bayesian loss, which constructs a density contribution probability model from the point annotations. The expected count at each annotated point is then calculated by summing the product of the contribution probability and the estimated density at each pixel, which can be reliably supervised by the ground-truth count value (apparently, one). Compared with previous loss functions that constrain the density value at every pixel, our proposed training loss instead supervises the count expectation at each annotated point.
Extensive experimental evaluations show that the proposed loss function substantially outperforms the baseline training loss on the UCF-QNRF [16], ShanghaiTech [57], and UCF_CC_50 [15] benchmark datasets. Moreover, our proposed loss function, equipped with the standard VGG-19 network [39] as backbone and without using any external detectors or multi-scale architectures, achieves state-of-the-art performance on all the benchmark datasets, with a particularly large improvement on the UCF-QNRF dataset compared to other methods.

2. Related Work

We review related works on crowd count estimation from the following respects.

Detection-then-counting. Most early works [22, 58, 21, 12] estimate the crowd count by detecting or segmenting individual objects in the scene. This kind of method faces two major challenges. Firstly, it produces more detailed results (e.g., bounding boxes or instance masks) than the overall count, which is computationally expensive and mostly suitable for lower-density crowds. In overcrowded scenes, clutter and severe occlusions make it unfeasible to detect every single person, despite the progress in related fields [19, 10, 47, 17, 33, 43, 61, 50, 48, 34, 49, 55, 60, 52]. Secondly, training object detectors requires bounding-box or instance mask annotations, which are much more labor-intensive to obtain in dense crowds. Thus most current counting datasets only provide a one-point label per object.

Direct count regression. To avoid the more complex detection problem, some researchers proposed to directly learn a mapping from image features to counts [5, 35, 7, 23, 46, 36, 6]. Earlier methods [5, 35, 7] in this category rely on hand-crafted features, such as SIFT, LBP, etc., and then learn a regression model. Chan et al. [5] proposed to extract edge, texture, and other low-level features of the crowds, and learn a Gaussian Process regression model for crowd counting. Chen et al. [7] proposed to transform low-level image features into a cumulative attribute space where each dimension has a clearly defined semantic interpretation that captures how the crowd count value changes continuously and cumulatively. Recent methods [46, 36, 6] resort to deep CNNs for end-to-end learning. Wang et al. [46] adopted an AlexNet architecture where the final fully connected layer of 4096 neurons is replaced by a single neuron for predicting the scalar count value. Shang et al. [36] proposed to first extract a set of high-level image features via a CNN, and then map the features to local counts using a Long Short-Term Memory (LSTM) unit. These direct regression methods are more efficient than detection-based methods; however, they do not fully utilize the available point supervision.

Density map estimation. These methods [20, 11, 31] take advantage of the location information to learn a map of density values for each training sample, and the final count estimate is obtained by summing over the predicted density map. Lempitsky and Zisserman [20] proposed to transform the point annotations into a density map by the Gaussian kernel as "ground-truth", and then train their model using a least-squares objective. This training framework has been widely used in recent methods [11, 31]. Furthermore, thanks to the excellent feature learning ability of deep CNNs, CNN-based density map estimation methods [53, 57, 51, 38, 26, 25, 4, 32, 16] have achieved state-of-the-art performance for crowd counting. One major problem of this framework is how to determine the optimal size of the Gaussian kernel, which is influenced by many factors. To make matters worse, the models are trained by a loss function that applies supervision in a pixel-to-pixel manner. Obviously, the performance of such methods highly depends on the quality of the generated "ground-truth" density maps.

Hybrid training. Several works observed that crowd counting benefits from mixture training strategies, e.g., multi-task, multi-loss, etc. Liu et al. [24] proposed DecideNet to adaptively decide whether to use a detection model or a density map estimation model. This approach takes advantage of a mixture-of-experts: a detection-based model can estimate crowds accurately in low-density scenes, while the density map estimation model is good at handling crowded scenes. However, this method requires external pre-trained human detection models and is less efficient. Some researchers proposed to combine multiple losses so that they assist each other. Zhang et al. [53] proposed to train a deep CNN by alternately optimizing a pixel-wise loss function and a global count regression loss. A similar training approach was adopted by Zhang et al. [54], in which they first train their model via the density map loss and then add a relative count loss in the last few epochs. Idrees et al. [16] proposed a composition loss, which consists of $\ell_1$, $\ell_2$, and $\ell_\infty$ norm losses for the density map and a count regression loss. Compared to these hybrid losses, our proposed single loss function is simpler and more effective.

3. The Proposed Method

3.1. Background and Motivation

Let $\{D(x_m) \geq 0 : m = 1, 2, \ldots, M\}$ be a density map, where $x_m$ denotes a 2D pixel location and $M$ is the number of pixels in the density map. Let $\{(z_n, y_n) : n = 1, 2, \ldots, N\}$ denote the point annotation map for a sample image, where $N$ is the total crowd count, $z_n$ is a head point position, and $y_n = n$ is the corresponding label. The point annotation map contains only one pixel for each person (typically the center of the head), which is sparse and contains no information about the object size and shape. It is difficult to directly use such point annotation maps to train a density map estimator. A common remedy to this difficulty is to convert them into a "ground-truth" density map using the Gaussian kernel:

$$D^{gt}(x_m) \overset{\text{def}}{=} \sum_{n=1}^{N} \mathcal{N}(x_m; z_n, \sigma^2 \mathbf{1}_{2\times 2}) = \sum_{n=1}^{N} \frac{1}{2\pi\sigma^2} \exp\Big(-\frac{\|x_m - z_n\|_2^2}{2\sigma^2}\Big), \qquad (1)$$

where $\mathcal{N}(x_m; z_n, \sigma^2 \mathbf{1}_{2\times 2})$ denotes a 2D Gaussian distribution evaluated at $x_m$, with the mean at the annotated point $z_n$ and an isotropic covariance matrix $\sigma^2 \mathbf{1}_{2\times 2}$.
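To make Eq. (1) concrete, the following is a minimal NumPy sketch of how such a "ground-truth" density map can be rendered from point annotations; the function name and the (x, y) point format are our own illustrative choices, not the authors' released code:

```python
import numpy as np

def gaussian_density_map(points, height, width, sigma=8.0):
    """Render the "ground-truth" density map of Eq. (1) from point annotations.

    points: (N, 2) array of annotated head positions (x, y); sigma: kernel width.
    Returns an (height, width) map whose sum approximates the crowd count N.
    """
    ys, xs = np.mgrid[0:height, 0:width]          # pixel coordinate grids
    density = np.zeros((height, width), dtype=np.float64)
    norm = 1.0 / (2.0 * np.pi * sigma ** 2)       # 2D isotropic Gaussian constant
    for x, y in points:
        sq_dist = (xs - x) ** 2 + (ys - y) ** 2   # ||x_m - z_n||^2 at every pixel
        density += norm * np.exp(-sq_dist / (2.0 * sigma ** 2))
    return density
```

Summing the returned map recovers approximately $N$, up to truncation of kernels near the image boundary.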
Many recent works use the above "ground-truth" density map as the learning target, and train a density map estimator using the following loss function:

$$\mathcal{L}^{baseline} = \sum_{m=1}^{M} \mathcal{F}\big(D^{gt}(x_m) - D^{est}(x_m)\big), \qquad (2)$$

where $\mathcal{F}(\cdot)$ is a distance function and $D^{est}$ is the estimated density map. If a fixed-size Gaussian kernel is adopted ($\sigma = const$), it is assumed that all the people in a dataset have the same head size and shape, which is obviously not true due to occlusions, irregular crowd distributions, perspective effects, etc. An alternative solution is to use an adaptive Gaussian kernel [57, 16] for each $n$: $\sigma_n \propto d_n$, where $d_n$ is a distance that depends on the nearest neighbors of $z_n$ in the spatial domain; this assumes that the crowd is evenly distributed. Some other methods utilize specific information such as camera parameters to obtain a more accurate perspective map, but in general such information is not available.

We argue that the point annotations in the available crowd counting datasets should rather be considered as weak labels for density map estimation. It is more reasonable to take such annotations as priors or likelihoods instead of as learning targets. Loss functions that impose strict, pixel-to-pixel supervision on the density map, as in Eq. (2), are not always beneficial to the count estimation accuracy when used to train a CNN model, because they force the model to learn inaccurate, or even erroneous, information.

3.2. Bayesian Loss

Let $x$ be a random variable that denotes the spatial location and $y$ be a random variable that represents the annotated head point. Based on the above discussion, instead of converting the point annotations into the "ground-truth" density maps generated by Eq. (1) as the learning targets, we propose to construct likelihood functions of $x_m$ given label $y_n$ from them:

$$p(x = x_m \,|\, y = y_n) = \mathcal{N}(x_m; z_n, \sigma^2 \mathbf{1}_{2\times 2}). \qquad (3)$$

To simplify the notation, we omit the random variables $x$ and $y$ in the following formulations, e.g., Eq. (3) becomes $p(x_m | y_n) = \mathcal{N}(x_m; z_n, \sigma^2 \mathbf{1}_{2\times 2})$. According to Bayes' theorem, given a pixel location $x_m$ in the density map, the posterior probability of $x_m$ having the label $y_n$ can be computed as:

$$p(y_n | x_m) = \frac{p(x_m | y_n)\, p(y_n)}{p(x_m)} = \frac{p(x_m | y_n)\, p(y_n)}{\sum_{n=1}^{N} p(x_m | y_n)\, p(y_n)} = \frac{p(x_m | y_n)}{\sum_{n=1}^{N} p(x_m | y_n)} = \frac{\mathcal{N}(x_m; z_n, \sigma^2 \mathbf{1}_{2\times 2})}{\sum_{n=1}^{N} \mathcal{N}(x_m; z_n, \sigma^2 \mathbf{1}_{2\times 2})}. \qquad (4)$$

In the above derivation, the third equality holds because we assume an equal prior probability $p(y_n)$ for each class label $y_n$, i.e., $p(y_n) = \frac{1}{N}$, without loss of generality. In practice, if we have the prior knowledge that crowds tend to appear at certain places, a tailored $p(y_n)$ can be applied here.

With the posterior label probability $p(y_n | x_m)$ and the estimated density map $D^{est}$, we derive the Bayesian loss as follows. Let $c_n^m$ denote the count that $x_m$ contributes to $y_n$, and $c_n$ the total count associated with $y_n$. We have the expectation of $c_n$ as:

$$E[c_n] = E\Big[\sum_{m=1}^{M} c_n^m\Big] = \sum_{m=1}^{M} E[c_n^m] = \sum_{m=1}^{M} p(y_n | x_m)\, D^{est}(x_m). \qquad (5)$$

Obviously, the ground-truth count $c_n$ at each annotation point is one; therefore we have the following loss function:

$$\mathcal{L}^{Bayes} = \sum_{n=1}^{N} \mathcal{F}(1 - E[c_n]), \qquad (6)$$

where $\mathcal{F}(\cdot)$ is a distance function, for which we adopt the $\ell_1$ distance in our experiments. A special case must be handled when there is no object in a training image; in that scenario we directly force the sum of the density map to zero. Our proposed loss function is differentiable and can be readily applied to a given CNN using the standard back-propagation training algorithm.
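Reading Eqs. (3)-(6) as code may be helpful. The sketch below is a minimal single-image PyTorch rendering of the basic Bayesian loss under our own simplifying assumptions: pixel coordinates are flattened into an (M, 2) tensor, and the Gaussian normalization constant is dropped because it cancels in Eq. (4), so the posterior reduces to a softmax over scaled negative squared distances. It is an illustration, not the authors' released implementation:

```python
import torch

def bayesian_loss(pred_density, pixel_coords, head_points, sigma=8.0):
    """Basic Bayesian loss of Eq. (6) for one image.

    pred_density: (M,) estimated density D^est at each pixel.
    pixel_coords: (M, 2) pixel locations x_m; head_points: (N, 2) annotations z_n.
    """
    if head_points.numel() == 0:
        # No-object case: force the predicted total count to zero.
        return torch.abs(pred_density.sum())
    # Squared distances ||x_m - z_n||^2, shape (M, N).
    sq_dist = ((pixel_coords[:, None, :] - head_points[None, :, :]) ** 2).sum(-1)
    # Posterior p(y_n | x_m) of Eq. (4); the Gaussian constant cancels, so a
    # softmax over -||x_m - z_n||^2 / (2 sigma^2) across the N labels suffices.
    posterior = torch.softmax(-sq_dist / (2.0 * sigma ** 2), dim=1)
    # Expected count E[c_n] of Eq. (5): sum_m p(y_n | x_m) D^est(x_m).
    expected_counts = (posterior * pred_density[:, None]).sum(dim=0)
    # Eq. (6) with the l1 distance: each person should contribute a count of one.
    return torch.abs(1.0 - expected_counts).sum()
```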
At the inference stage, we do not need to know the posterior label probability $p(y_n | x_m)$ in advance, because when we sum over the estimated density map, $p(y_n | x_m)$ is eliminated as follows:

$$C = \sum_{n=1}^{N} E[c_n] = \sum_{n=1}^{N} \sum_{m=1}^{M} p(y_n | x_m)\, D^{est}(x_m) = \sum_{m=1}^{M} \sum_{n=1}^{N} p(y_n | x_m)\, D^{est}(x_m) = \sum_{m=1}^{M} D^{est}(x_m). \qquad (7)$$

3.3. Background Pixel Modelling

For background pixels that are far away from all of the annotation points, it makes no sense to assign them to any head label $y_n$. To better model the background pixels, we introduce an additional background label $y_0 = 0$, in addition to the head labels $\{y_n = n : n = 1, 2, \ldots, N\}$. Then the posterior label probability can be rewritten as:

$$p(y_n | x_m) = \frac{p(x_m | y_n)\, p(y_n)}{\sum_{n=1}^{N} p(x_m | y_n)\, p(y_n) + p(x_m | y_0)\, p(y_0)} = \frac{p(x_m | y_n)}{\sum_{n=1}^{N} p(x_m | y_n) + p(x_m | y_0)}. \qquad (8)$$

The last equality is simplified with the assumption $p(y_n) = p(y_0) = \frac{1}{N+1}$, without loss of generality. Similarly, we have:

$$p(y_0 | x_m) = \frac{p(x_m | y_0)}{\sum_{n=1}^{N} p(x_m | y_n) + p(x_m | y_0)}. \qquad (9)$$

The expected counts for each person and for the entire background are then defined as:

$$E[c_n] = \sum_{m=1}^{M} p(y_n | x_m)\, D^{est}(x_m), \qquad (10)$$

$$E[c_0] = \sum_{m=1}^{M} p(y_0 | x_m)\, D^{est}(x_m). \qquad (11)$$

In this case, the summation over the whole density map $\sum_{m=1}^{M} D^{est}(x_m)$ consists of the foreground counts $\sum_{n=1}^{N} E[c_n]$ and the background count $E[c_0]$. Obviously, we would like the background count to be zero and the foreground count at each annotation point to equal one; thus we have the following enhanced loss function:

$$\mathcal{L}^{Bayes+} = \sum_{n=1}^{N} \mathcal{F}(1 - E[c_n]) + \mathcal{F}(0 - E[c_0]). \qquad (12)$$

To define the background likelihood, we construct a dummy background point for each pixel:

$$z_0^m = z_n^m + d\, \frac{x_m - z_n^m}{\|x_m - z_n^m\|_2}, \qquad (13)$$

where $z_n^m$ denotes the nearest head point of $x_m$, and $d$ is a parameter that controls the margin between the head points and the dummy background points. As illustrated in Fig. 1, with the defined dummy background point $z_0^m$, a pixel $x_m$ that is far away from the head points can be assigned to the background label instead.

Figure 1: Geometrical illustration of the dummy background point, where $x_m$ denotes a pixel in the density map, $z_n^m$ is its nearest head point, and $z_0^m$ is the defined dummy background point.

Here we also use the Gaussian kernel to define the background likelihood:

$$p(x_m | y_0) \overset{\text{def}}{=} \mathcal{N}(x_m; z_0^m, \sigma^2 \mathbf{1}_{2\times 2}) = \frac{1}{2\pi\sigma^2} \exp\Big(-\frac{(d - \|x_m - z_n^m\|_2)^2}{2\sigma^2}\Big). \qquad (14)$$

3.4. Visualization and Analysis

For visualization and analysis, we build the entropy map $Ent$ of the label assignment, computed for each pixel $x_m$ as follows:

$$Ent(x_m) = -\sum_{n=0}^{N} p(y_n | x_m) \ln p(y_n | x_m). \qquad (15)$$

The entropy measures the uncertainty about which label a pixel $x_m$ in the density map belongs to. We display entropy maps with different settings in Fig. 2 and make the following observations:

• The posterior can roughly find the boundaries between persons.

• Dense crowd areas have higher entropy values than sparse areas.

• The parameter σ controls the softness of the posterior label probability, comparing (b) and (c).

• Pixels far from crowds are handled better via background pixel modelling, comparing (b) and (e).

• The parameter d controls the margin between foreground and background, comparing (e) and (f).
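Before moving on to the experiments, we note that the background-augmented posterior of Eqs. (8), (9), (13), and (14) can be added to the earlier sketch by appending one extra logit column per pixel. The fragment below is again our own minimal reading (single image, unnormalized kernels), not the reference implementation; the resulting (M, N+1) posterior tensor is also what Eq. (15) reduces over to produce the entropy maps of Fig. 2:

```python
import torch

def bayesian_plus_loss(pred_density, pixel_coords, head_points,
                       sigma=8.0, d=100.0):
    """Enhanced Bayesian loss (Eq. (12)) with background pixel modelling."""
    if head_points.numel() == 0:
        return torch.abs(pred_density.sum())
    # Foreground log-likelihoods: -||x_m - z_n||^2 / (2 sigma^2), shape (M, N).
    sq_dist = ((pixel_coords[:, None, :] - head_points[None, :, :]) ** 2).sum(-1)
    fg_logits = -sq_dist / (2.0 * sigma ** 2)
    # Background log-likelihood via the dummy point of Eq. (13): its distance to
    # x_m is |d - ||x_m - z_n^m|||, with z_n^m the nearest head point (Eq. (14)).
    nearest_dist = sq_dist.min(dim=1).values.sqrt()              # ||x_m - z_n^m||
    bg_logits = -((d - nearest_dist) ** 2) / (2.0 * sigma ** 2)  # shape (M,)
    # Posterior over the N head labels plus the background label (Eqs. (8)-(9)).
    logits = torch.cat([fg_logits, bg_logits[:, None]], dim=1)   # (M, N + 1)
    posterior = torch.softmax(logits, dim=1)
    # Expected counts of Eqs. (10)-(11).
    expected = (posterior * pred_density[:, None]).sum(dim=0)    # (N + 1,)
    fg_counts, bg_count = expected[:-1], expected[-1]
    # Eq. (12): each person contributes one; the background contributes zero.
    return torch.abs(1.0 - fg_counts).sum() + torch.abs(bg_count)
```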
Figure 2: Visualization of the posterior label probability. We construct an entropy map using Eq. (15), which measures the uncertainty about which label a pixel in the density map belongs to. The warmer the color, the larger the value. (a) Input image. (b) Without $y_0$, σ = 16; (c) without $y_0$, σ = 32: entropy maps with different σ, without background pixel modelling. (e) With $y_0$, σ = 16, d = 100; (f) with $y_0$, σ = 16, d = 200: entropy maps with different d, with background pixel modelling. (d) Blend of the input image and the entropy map in (e).

4. Experiments

4.1. Evaluation Metrics

Crowd count estimation methods are evaluated by two widely used metrics, Mean Absolute Error (MAE) and Mean Squared Error (MSE), which are defined as follows:

$$MAE = \frac{1}{K} \sum_{k=1}^{K} |N_k - C_k|, \qquad (16)$$

$$MSE = \sqrt{\frac{1}{K} \sum_{k=1}^{K} |N_k - C_k|^2}, \qquad (17)$$

where $K$ is the number of test images, and $N_k$ and $C_k$ are the ground-truth count and the estimated count for the $k$-th image, respectively.

4.2. Datasets

Experimental evaluations are conducted on four widely used crowd counting benchmark datasets: UCF-QNRF [16], UCF_CC_50 [15], and ShanghaiTech [57] part A and part B. These datasets are described as follows.

UCF-QNRF [16] is the latest and largest crowd counting dataset, including 1,535 images crawled from Flickr with 1.25 million point annotations. It is a challenging dataset because it covers a wide range of counts, image resolutions, lighting conditions, and viewpoints. The training set has 1,201 images, and the remaining 334 images are used for testing.

ShanghaiTech [57] consists of part A and part B. In part A, there are 300 images for training and 182 images for testing. All the images are crawled from the Internet, and most of them depict very crowded scenes such as rallies and large sport events. Part B has 400 training images and 316 testing images captured from busy streets in Shanghai. Part A has a significantly higher density than part B.

UCF_CC_50 [15] contains 50 gray-scale images with different resolutions. The average count per image is 1,280, and the minimum and maximum counts are 94 and 4,532, respectively. Since this is a small-scale dataset with no defined train/test split, we perform five-fold cross validation to obtain the average test result.

4.3. Implementation Details

Network structure. We use a standard image classification network as our backbone, with the last pooling layer and the subsequent fully connected layers removed. In our experiments, we test two networks, VGG-19 [39] and AlexNet [17]. We upsample the output of the backbone to 1/8 of the input image size by bilinear interpolation, and then feed it to a regression header, which consists of two 3 × 3 convolutional layers with 256 and 128 channels respectively, followed by a 1 × 1 convolutional layer, to obtain the density map. The regression header is initialized by the MSRA initializer [13], and the backbone is pre-trained on ImageNet. The Adam optimizer with an initial learning rate of $10^{-5}$ is used to update the parameters.
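The described network maps onto a few lines of PyTorch. The sketch below is our own approximation of the VGG-19 variant: we truncate torchvision's vgg19 before its final max-pool (output stride 16) and upsample by a factor of 2 to reach 1/8 resolution. The ReLUs inside the header and the non-negativity clamp at the output are our assumptions (motivated by $D(x_m) \geq 0$ in Sec. 3.1), and the module names are ours rather than those of the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class BayesianCrowdNet(nn.Module):
    """VGG-19 backbone + regression header, following the description in Sec. 4.3."""
    def __init__(self):
        super().__init__()
        # Newer torchvision versions use the weights= argument instead.
        vgg = models.vgg19(pretrained=True)
        self.backbone = vgg.features[:-1]      # drop the final max-pool layer
        self.header = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),              # 1x1 conv -> one density channel
        )

    def forward(self, x):
        feat = self.backbone(x)                                  # 1/16 resolution
        feat = F.interpolate(feat, scale_factor=2, mode="bilinear",
                             align_corners=False)                # 1/8 resolution
        return torch.relu(self.header(feat))   # non-negative density map
```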
Datasets          UCF-QNRF      ShanghaiTechA   ShanghaiTechB   UCF_CC_50
Methods           MAE    MSE    MAE    MSE      MAE    MSE      MAE    MSE
CROWD-CNN [53]    -      -      181.8  277.7    32.0   49.8     467.0  498.5
MCNN [57]         277    426    110.2  173.2    26.4   41.3     377.6  509.1
CMTL [40]         252    514    101.3  152.4    20.0   31.1     322.8  341.4
SWITCH-CNN [3]    228    445    90.4   135.0    21.6   33.4     318.1  439.2
CP-CNN [41]       -      -      73.6   106.4    20.1   30.1     295.8  320.9
ACSCP [37]        -      -      75.7   102.7    17.2   27.4     291.0  404.6
D-CONVNET [38]    -      -      73.5   112.3    18.7   26.0     288.4  404.7
IG-CNN [2]        -      -      72.5   118.2    13.6   21.1     291.4  349.4
IC-CNN [32]       -      -      68.5   116.2    10.7   16.0     260.9  365.5
SANET [4]         -      -      67.0   104.5    8.4    13.6     258.4  334.9
CL-CNN [16]       132    191    -      -        -      -        -      -
BASELINE          106.8  183.7  68.6   110.1    8.5    13.9     251.6  331.3
Our BAYESIAN      92.9   163.0  64.5   104.0    7.9    13.3     237.7  320.8
Our BAYESIAN+     88.7   154.8  62.8   101.8    7.7    12.7     229.3  308.2

Table 1: Benchmark evaluations on four crowd counting datasets using the MAE and MSE metrics. The baseline and our methods are trained using VGG-19.

Figure 3: Density maps generated by (b) BASELINE, (c) our BAYESIAN, and (d) our BAYESIAN+ for three input images (a) with ground-truth counts 909, 2745, and 1616. The corresponding estimates are 1020.1 / 2452.3 / 1946.7 (BASELINE), 982.1 / 2455.5 / 1686.7 (BAYESIAN), and 978.5 / 2470.3 / 1602.3 (BAYESIAN+). The warmer the color, the higher the density. Note that in dense crowds, BASELINE often produces abnormal values, while in sparse areas it cannot localize each person well. In contrast, our methods give more accurate count estimates and localization.
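For reference, the MAE and MSE of Eqs. (16) and (17), used throughout Table 1 and the following tables, amount to a few NumPy lines (a hedged sketch; the variable names are ours):

```python
import numpy as np

def mae_mse(gt_counts, est_counts):
    """MAE (Eq. (16)) and MSE (Eq. (17)) over K test images."""
    diff = np.asarray(gt_counts, dtype=np.float64) - np.asarray(est_counts)
    mae = np.abs(diff).mean()
    mse = np.sqrt((diff ** 2).mean())  # the paper's MSE takes a square root
    return mae, mse
```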
Training details. We augment the training data using random cropping and horizontal flipping. We note that image resolutions in UCF-QNRF vary widely, from 0.08 to 66 megapixels; a regular CNN cannot deal with all of these scales due to its limited receptive field. Therefore, we limit the shorter side of each image in UCF-QNRF to within 2048 pixels. Images are then randomly cropped for training: the crop size is 256 × 256 for ShanghaiTechA and UCF_CC_50, where image resolutions are smaller, and 512 × 512 for ShanghaiTechB and UCF-QNRF. We set the Gaussian parameter σ in Eqs. (3) and (14) to 8, and the distance parameter d in Eq. (13) to 15% of the shorter side of the image. These parameters are selected on a validation set (120 images randomly sampled from the training set) of UCF-QNRF.

4.4. Experimental Evaluations

Quantitative results. We compare our proposed method with the baseline and the state-of-the-art methods on the benchmark datasets described in Sec. 4.2. To make a fair comparison, the baseline method (BASELINE) shares the same network structure (VGG-19) and training process as ours. We use Eq. (1) to generate the "ground-truth" density maps for the baseline method and follow previous works [57, 16] to select the parameters of the Gaussian kernel. Specifically, geometry-adaptive kernels are adopted for UCF-QNRF, ShanghaiTechA, and UCF_CC_50, while a fixed Gaussian kernel with σ = 15 is used for ShanghaiTechB. We study both our basic Bayesian loss (BAYESIAN) and the enhanced Bayesian loss with background pixel modelling (BAYESIAN+). We show the experimental results in Table 1; the highlights can be summarized as follows:

• BAYESIAN+ achieves state-of-the-art accuracy on all four benchmark datasets. On the latest and toughest UCF-QNRF dataset, it reduces the MAE and MSE values of the best prior method (CL-CNN) by 43.3 and 36.2, respectively. It is worth mentioning that our method does not use any external detection models or multi-scale structures.

• BAYESIAN+ consistently improves the performance of BAYESIAN by around 3% on all four datasets.

• Both BAYESIAN and BAYESIAN+ outperform BASELINE significantly on all four datasets. BAYESIAN+ brings improvements of 15% on UCF-QNRF, 9% on ShanghaiTechA, 8% on ShanghaiTechB, and 8% on UCF_CC_50, respectively.

Visualization of the estimated density maps. We visualize the density maps estimated under the different training losses in Fig. 3. From the close-ups we can see that BASELINE often predicts abnormal values in congested areas, whereas our BAYESIAN and BAYESIAN+ give more accurate estimates. Our methods benefit from the proposed probability model, which constructs soft posterior probabilities when a pixel is close to several head points. In sparse areas, on the other hand, BASELINE cannot recognize each person well, while our methods produce more accurate results in terms of both count estimation and localization.

Figure 4: Curves of testing results (MAE and MSE) for different losses w.r.t. the Gaussian parameter σ on UCF-QNRF.

4.5. Ablation Studies

Effect of σ. Both the proposed BAYESIAN and the BASELINE methods use the parameter σ for the Gaussian kernel. In our loss function, σ controls the softness of the posterior label probability, as shown in Fig. 2, while in BASELINE it directly determines the crowd density distribution. In this subsection, we study the effect of σ on the two methods by computing their MAE and MSE values w.r.t. different σ values on UCF-QNRF. As can be seen from the curves in Fig. 4:

• Our BAYESIAN performs well over a wide range of values of σ: its MAE and MSE stay below 98.0 and 180.0, respectively, as σ changes from 0.1 to 32.0.

• BASELINE is more sensitive to this parameter: its MAE and MSE vary from 118.4 to 136.2 and from 192.3 to 250.6, respectively.

Effect of d. The proposed BAYESIAN+ method introduces an additional parameter d to control the margin between foreground and background. Fig. 5 shows the performance of BAYESIAN+ w.r.t. d, from which we can conclude that:

• Our BAYESIAN+ method performs well over a wide range of values of d. BAYESIAN+ consistently outperforms BAYESIAN when d is from 3% to 100% of the shorter side of the image.

• The parameter d implies that the head size should not exceed d, so d should not be too small.

Robustness to annotation error. In this subsection, we discuss the robustness of the different loss functions w.r.t. annotation error. Labeling a person by a single point is ambiguous, because the person occupies an area in the image. Although most datasets place the annotation point at the center of each head, small errors from human labeling are inevitable. In this experiment, we simulate human labeling errors by adding uniform random noise to the original head positions, and test the performance of the different losses at several noise levels. Since σ determines the spatial spread of the Gaussian distribution, a larger σ helps tolerate such spatial noise. Therefore, we evaluate our BAYESIAN with σ = 1 and σ = 16, respectively, and BASELINE with σ = 16. As can be seen from Fig. 6, the proposed BAYESIAN performs better than BASELINE at all noise levels, even with a smaller σ value.
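The noise injection used in this experiment is simple to reproduce. A hedged NumPy sketch follows; the paper specifies only "uniform random noises" as a percentage of image height, so the symmetric per-coordinate sampling below is our assumption:

```python
import numpy as np

def perturb_annotations(points, image_height, deviation, rng=None):
    """Simulate labeling error: add uniform noise of up to `deviation`
    (a fraction of the image height, e.g. 0.05 for 5%) to each head point."""
    rng = np.random.default_rng() if rng is None else rng
    radius = deviation * image_height
    noise = rng.uniform(-radius, radius, size=np.shape(points))
    return np.asarray(points, dtype=np.float64) + noise
```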
Figure 5: The performance of our BAYESIAN+ method w.r.t. d (MAE and MSE on UCF-QNRF; the x axis is the percentage of image size, in log scale).

Figure 6: Robustness evaluations w.r.t. annotation error. We simulate human labeling errors by adding uniform random noise to the annotated point locations (deviation as a percentage of image height).

Cross-dataset evaluation. To further explore the generalization ability of the different loss functions, we conduct cross-dataset experiments with the VGG-19 network. In this experiment, models are trained on one dataset and tested on the others without any further fine-tuning. More specifically, we train models on the largest UCF-QNRF dataset, and test them on UCF_CC_50, ShanghaiTechA, and ShanghaiTechB, respectively. As can be seen from Table 2, our methods have a certain generalization ability and outperform BASELINE on all datasets.

UCF-QNRF →       ShanghaiTechA   ShanghaiTechB   UCF_CC_50
Methods          MAE    MSE      MAE    MSE      MAE    MSE
BASELINE         73.4   136.3    18.5   30.9     323.9  558.0
Our BAYESIAN     71.7   124.3    16.3   27.8     312.6  540.3
Our BAYESIAN+    69.8   123.8    15.3   26.5     309.6  537.1

Table 2: Quantitative results for cross-dataset evaluation. The models are trained on UCF-QNRF and tested on the other datasets.

Limiting the image resolution. We have found that the image resolutions of UCF-QNRF vary widely, and a single CNN model cannot handle such a large variation well. Therefore, we limit the image resolution to 2048 pixels, and this experiment performs an ablation study on this factor. As can be seen from Table 3, all methods benefit from resizing, and our methods outperform the baseline method in both settings.

                 With Resize      Without Resize
Methods          MAE    MSE       MAE    MSE
BASELINE         106.8  183.7     128.7  193.8
Our BAYESIAN     92.9   163.0     115.1  187.0
Our BAYESIAN+    88.7   154.8     112.6  181.1

Table 3: The effect of limiting the image resolution on UCF-QNRF.

Different backbones. Our proposed loss functions can be readily applied to any network structure to improve its performance on the crowd counting task. Here we apply the proposed losses to both VGG-19 and AlexNet and compare them with the baseline loss. The quantitative results in Table 4 indicate that our Bayesian loss functions outperform the baseline loss significantly on both networks.

Backbones        VGG-19 [39]      AlexNet [17]
Methods          MAE    MSE       MAE    MSE
BASELINE         106.8  183.7     130.8  221.0
Our BAYESIAN     92.9   163.0     121.2  202.5
Our BAYESIAN+    88.7   154.8     116.3  191.7

Table 4: Performance of models using different backbones on UCF-QNRF. Our proposed methods outperform the baseline method consistently.

5. Conclusions and Future Work

In this paper, we propose a novel loss function for crowd count estimation with point supervision. Different from previous methods that transform the point annotations into "ground-truth" density maps using the Gaussian kernel and apply pixel-wise supervision, our loss function adopts a more reliable supervision on the count expectation at each annotated point. Extensive experiments have demonstrated the advantages of our proposed methods in terms of accuracy, robustness, and generalization. The current form of our formulation is fairly general and can easily incorporate other knowledge, e.g., specific foreground or background priors, scale and temporal likelihoods, and other factors, to further improve the proposed method.

Acknowledgements. This work was supported by the National Basic Research Program of China (Grant No. 2015CB351705) and the National Major Project of China (Grant No. 2017YFC0803905).
References

[1] Carlos Arteta, Victor S. Lempitsky, and Andrew Zisserman. Counting in the wild. In ECCV, 2016.
[2] Deepak Babu Sam, Neeraj N. Sajjan, R. Venkatesh Babu, and Mukundhan Srinivasan. Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN. In CVPR, 2018.
[3] Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu. Switching convolutional neural network for crowd counting. In CVPR, 2017.
[4] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In ECCV, 2018.
[5] Antoni B. Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR, 2008.
[6] Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R. Selvaraju, Dhruv Batra, and Devi Parikh. Counting everyday objects in everyday scenes. In CVPR, 2017.
[7] Ke Chen, Shaogang Gong, Tao Xiang, and Chen Change Loy. Cumulative attribute space for age and crowd density estimation. In CVPR, 2013.
[8] Joseph Paul Cohen, Genevieve Boucher, Craig A. Glastonbury, Henry Z. Lo, and Yoshua Bengio. Count-ception: Counting by fully convolutional redundant counting. In ICCV Workshops, 2017.
[9] Diptodip Deb and Jonathan Ventura. An aggregated multicolumn dilated convolution network for perspective-free counting. In CVPR Workshops, 2018.
[10] Pedro F. Felzenszwalb, David A. McAllester, Deva Ramanan, et al. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[11] Luca Fiaschi, Ullrich Köthe, Rahul Nair, and Fred A. Hamprecht. Learning to count with regression forest and structured labels. In ICPR, 2012.
[12] Weina Ge and Robert T. Collins. Marked point processes for crowd counting. In CVPR, 2009.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[14] Meng-Ru Hsieh, Yen-Liang Lin, and Winston H. Hsu. Drone-based object counting by spatially regularized regional proposal network. In ICCV, 2017.
[15] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, 2013.
[16] Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak Shah. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, 2018.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[18] Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro, David Vazquez, and Mark Schmidt. Where are the blobs: Counting by localization with point supervision. In ECCV, 2018.
[19] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[20] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In NIPS, 2010.
[21] Min Li, Zhaoxiang Zhang, Kaiqi Huang, and Tieniu Tan. Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection. In ICPR, pages 1–4, 2008.
[22] Sheng-Fuu Lin, Jaw-Yeh Chen, and Hung-Xin Chao. Estimation of number of people in crowded scenes using perspective transformation. IEEE Trans. Systems, Man, and Cybernetics, Part A, 31(6):645–654, 2001.
[23] Bo Liu and Nuno Vasconcelos. Bayesian model adaptation for crowd counts. In ICCV, 2015.
[24] Jiang Liu, Chenqiang Gao, Deyu Meng, and Alexander G. Hauptmann. DecideNet: Counting varying density crowds through attention guided detection and density estimation. In CVPR, 2018.
[25] Lingbo Liu, Hongjun Wang, Guanbin Li, Wanli Ouyang, and Liang Lin. Crowd counting using deep recurrent spatial-aware network. In IJCAI, 2018.
[26] Xialei Liu, Joost van de Weijer, and Andrew D. Bagdanov. Leveraging unlabeled data for crowd counting by learning to rank. In CVPR, 2018.
[27] Zheng Ma, Lei Yu, and Antoni B. Chan. Small instance detection by integer programming on object density maps. In CVPR, 2015.
[28] Mark Marsden, Kevin McGuinness, Suzanne Little, Ciara E. Keogh, and Noel E. O'Connor. People, penguins and petri dishes: Adapting object counting models to new visual domains and object types without forgetting. In CVPR, 2018.
[29] T. Nathan Mundhenk, Goran Konjevod, Wesam A. Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. In ECCV, 2016.
[30] Daniel Oñoro-Rubio and Roberto Javier López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, 2016.
[31] Viet-Quoc Pham, Tatsuo Kozakaya, Osamu Yamaguchi, and Ryuzo Okada. COUNT forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, 2015.
[32] Viresh Ranjan, Hieu Le, and Minh Hoai. Iterative crowd counting. In ECCV, 2018.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[34] Weihong Ren, Di Kang, Yandong Tang, and Antoni B. Chan. Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In CVPR, 2018.
[35] David Ryan, Simon Denman, Clinton Fookes, and Sridha Sridharan. Crowd counting using multiple local features. In DICTA, 2009.
[36] Chong Shang, Haizhou Ai, and Bo Bai. End-to-end crowd counting via joint learning local and global count. In ICIP, 2016.
[37] Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, and Xiaokang Yang. Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, 2018.
[38] Zenglin Shi, Le Zhang, Yun Liu, Xiaofeng Cao, Yangdong Ye, Ming-Ming Cheng, and Guoyan Zheng. Crowd counting with deep negative correlation learning. In CVPR, 2018.
[39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[40] Vishwanath A. Sindagi and Vishal M. Patel. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, 2017.
[41] Vishwanath A. Sindagi and Vishal M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV, 2017.
[42] Korsuk Sirinukunwattana, Shan e Ahmed Raza, Yee-Wah Tsang, David R. J. Snead, Ian A. Cree, and Nasir M. Rajpoot. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans. Med. Imaging, 35(5):1196–1206, 2016.
[43] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.
[44] Matthias von Borstel, Melih Kandemir, Philip Schmidt, Madhavi K. Rao, Kumar T. Rajamani, and Fred A. Hamprecht. Gaussian process density counting from weak supervision. In ECCV, 2016.
[45] Elad Walach and Lior Wolf. Learning to count with CNN boosting. In ECCV, 2016.
[46] Chuan Wang, Hua Zhang, Liang Yang, Si Liu, and Xiaochun Cao. Deep people counting in extremely dense crowds. In ACM MM, 2015.
[47] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[48] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In CVPR, 2018.
[49] Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, and Nanning Zheng. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In ECCV, 2018.
[50] Xing Wei, Yue Zhang, Yihong Gong, and Nanning Zheng. Kernelized subspace pooling for deep local descriptors. In CVPR, 2018.
[51] Feng Xiong, Xingjian Shi, and Dit-Yan Yeung. Spatiotemporal modeling for crowd counting in videos. In ICCV, 2017.
[52] Yingyue Xu, Dan Xu, Xiaopeng Hong, Wanli Ouyang, Rongrong Ji, Min Xu, and Guoying Zhao. Structured modeling of joint deep feature and prediction refinement for salient object detection. In ICCV, 2019.
[53] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, 2015.
[54] Lu Zhang, Miaojing Shi, and Qiaobo Chen. Crowd counting via scale-adaptive convolutional neural network. In WACV, 2018.
[55] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In ECCV, 2018.
[56] Shanghang Zhang, Guanhang Wu, Joao P. Costeira, and Jose M. F. Moura. FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras. In ICCV, 2017.
[57] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, 2016.
[58] Tao Zhao and Ramakant Nevatia. Bayesian human segmentation in crowded situations. In CVPR, 2003.
[59] Zhuoyi Zhao, Hongsheng Li, Rui Zhao, and Xiaogang Wang. Crossing-line crowd counting with two-phase deep neural networks. In ECCV, 2016.
[60] Sanping Zhou, Jinjun Wang, Deyu Meng, Yudong Liang, Yihong Gong, and Nanning Zheng. Discriminative feature learning with foreground attention for person re-identification. IEEE Trans. Image Processing, 2019.
[61] Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. Point to set similarity based deep feature learning for person re-identification. In CVPR, 2017.
