Professional Documents
Culture Documents
6142
pixel, our proposed training loss supervises on the count ex- regression methods are more efficient than detection based
pectation at each annotated point, instead. methods, however, they do not fully utilized available point
Extensive experimental evaluations show that the pro- supervisions.
posed loss function substantially outperforms the baseline
training loss on UCF-QNRF [16], ShanghaiTech [57], and Density map estimation. This kind of methods [20, 11, 31]
UCF CC 50 [15] benchmark datasets. Moreover, our pro- take advantage of the location information to learn a map of
posed loss function equipped with the standard VGG-19 density values for each training sample and the final count
network [39] as backbone, without using any external de- estimation can be obtained by summing over the predicted
tectors or multi-scale architectures, achieves the state-of- density map. Lempitsky and Zisserman [20] proposed to
the-art performances on all the benchmark datasets, espe- transform the point annotations into a density map by the
cially with a magnificent improvement on the UCF-QNRF Gaussian kernel as “ground-truth”. Then they train their
dataset compared to other methods. models using a least-square objective. This kind of training
framework has been widely used in recent methods [11, 31].
2. Related Work Furthermore, thanks to the excellent feature learning ability
of deep CNNs, CNN based density map estimation meth-
We review related works in the literature on crowd count ods [53, 57, 51, 38, 26, 25, 4, 32, 16] have achieved the
estimation from the following respects.
state-of-the-art performance for crowd counting. One major
problem of this framework is how to determine the optimal
Detection-then-counting. Most of early works [22, 58, 21,
size of the Gaussian kernel which is influenced by many
12] estimate crowd count by detecting or segmenting indi-
factors. To make matters worse, the models are trained
vidual objects in the scene. This kind of methods has to
by a loss function which applies supervision in a pixel-to-
tackle great challenges from two respects. Firstly, they pro-
pixel manner. Obviously, the performance of such meth-
duce more accurate results (e.g. bounding-boxes or masks
ods highly depend on the quality of the generated “ground-
of instances) than the overall count which is computational
truth” density maps.
expensive and mostly suitable in lower density crowds. In
overcrowded scenes, clutters and severe occlusions make
it unfeasible to detect every single person, despite the pro- Hybrid training. Several works observed that crowd
counting benefits from mixture training strategies, e.g.,
gresses in related fields [19, 10, 47, 17, 33, 43, 61, 50, 48,
multi-task, multi-loss, etc. Liu et al. [24] proposed Deci-
34, 49, 55, 60, 52]. Secondly, training object detectors re-
quire bounding-box or instance mask annotations, which is deNet to adaptively decide whether to use a detection model
much more labor-intensive in dense crowds. Thus most of or a density map estimation model. This approach takes the
current counting datasets only provide a one-point label per advantage of mixture-of-experts where a detection based
object. model can estimate crowds accurately in low density scenes
while the density map estimation model is good at handling
Direct count regression. To avoid the more complex detec- crowded scenes. However, this method requires external
tion problem, some researchers proposed to directly learn pre-trained human detection models and is less efficient.
a mapping from image features to their counts [5, 35, 7, Some researchers proposed to combine multiple losses to
23, 46, 36, 6]. Former methods [5, 35, 7] in this category assist each other. Zhang et al. [53] proposed to train a deep
rely on hand-crafted features, such as SIFT, LBP etc., and CNN by alternatively optimizing a pixel-wise loss function
then learn a regression model. Chan et al. [5] proposed and a global count regression loss. A similar training ap-
to extract edge, texture and other low-level features of the proach was adopted by Zhang et al. [54], in which they first
crowds, and lean a Gaussian Process regression model for train their model via the density map loss and then add a
crowd counting. Chen et al. [7] proposed to transform low- relative count loss in the last few epochs. Idrees et al. [16]
level image features into a cumulative attribute space where proposed a composition loss, which consists of ℓ1 , ℓ2 , and
each dimension has clearly defined semantic interpretation ℓ∞ norm losses for the density map and a count regression
that captures how the crowd count value changes continu- loss. Compared to these hybrid losses, our proposed single
ously and cumulatively. Recent methods [46, 36, 6] resort loss function is simpler and more effective.
to deep CNNs for end-to-end learning. Wang et al. [46]
adopted an AlexNet architecture where the final fully con- 3. The Proposed Method
nected layer of 4096 neurons is replaced by a single neu-
3.1. Background and Motivation
ron for predicting the scalar count value. Shang et al. [36]
proposed to extract a set of high level image features via a Let {D(xm ) >= 0 : m = 1, 2, . . . , M } be a density
CNN firstly, and then map the features to local counts us- map, where xm denotes a 2D pixel location, and M is the
ing a Long Short-Term Memory (LSTM) unit. These direct number of pixels in the density map. Let {(zn , yn ) : n =
6143
1, 2, . . . , N } denote the point annotation map for a sam- converting point annotations into the “ground-truth” density
ple image, where N is the total crowd count, zn is a head maps generated by Eq. (1) as the learning targets, we pro-
point position and yn = n is the corresponding label. The pose to construct likelihood functions of xm given label yn
point annotation map contains only one pixel for each per- from them,
son (typically the center of the head), which is sparse, and
contains no information about the object size and shape. It p(x = xm |y = yn ) = N (xm ; zn , σ 2 12×2 ). (3)
is difficult to directly use such point annotation maps to train
To simplify the notations, we omit the random variable x
a density map estimator. A common remedy to this diffi-
and y in the following formulations, e.g., Eq. (3) becomes
culty is to convert it to a “ground-truth” density map using
p(xm |yn ) = N (xm ; zn , σ 2 12×2 ). According to Bayes’
the Gaussian kernel.
theorem, given a pixel location xm in the density map, the
N
X posterior probability of xm having the label yn can be com-
gt def
D (xm ) = N (xm ; zn , σ 2 12×2 ) puted as:
n=1
(1) p(xm |yn )p(yn ) p(xm |yn )p(yn )
N 2
X 1 kxm − zn k2 p(yn |xm ) = = N
= √ exp(− ), p(xm )
2σ 2
P
n=1
2πσ p(xm |yn )p(yn )
n=1
where N (xm ; zn , σ 2 12×2 ) denotes a 2D Gaussian distribu- p(xm |yn ) N (xm ; zn , σ 2 12×2 )
tion evaluated at xm , with the mean at the annotated point = N
= N
.
P P
zn , and an isotropic covariance matrix σ 2 12×2 . p(xm |yn ) N (xm ; zn , σ2 1 2×2 )
n=1 n=1
Many recent works use the above “ground-truth” density (4)
map as the learning target, and train a density map estimator In the above derivation, the third equality holds as we as-
using the following loss function: sume the equal prior probability p(yn ) for each class label
M
X yn , i.e. p(yn ) = N1 , without loss of generality. In prac-
baseline
F Dgt (xm ) − Dest (xm ) ,
L = (2) tice, if we know the prior that crowds are more or less tend
m=1 to appear at certain places, a tailored p(yn ) can be applied
here.
where F(·) is a distance function and Dest is the esti-
With the posterior label probability p(yn |xm ) and esti-
mated density map. If a fix-sized Gaussian kernel is adopted
def
mated density map Dest , we derive the Bayesian loss as
σ = const, it is assumed that all the people in a dataset have follows. let cmn denotes the count that xm contributes to yn ,
the same head size and shape, which is obviously not true and cn be the total count associated with yn , we have the
due to the occlusions, irregular crowd distributions, per- expectation of cn as:
spective effects, etc. An alternative solution is to use an
adaptive Gaussian kernel [57, 16] for each n: σn ∝ dn , M
X M
X
where dn is a distance that depends on its nearest neigh- E[cn ] = E[ cm
n] = E[cm
n]
bors in the spatial domain, which assumes that the crowd m=1 m=1
(5)
M
is evenly distributed. Some other methods utilize specific X
est
information such as camera parameters to get a more accu- = p(yn |xm )D (xm ).
m=1
rate perspective map, but in general such information is not
available. Obviously, the ground-truth count cn at each annotation
We argue that the point annotations in the available point is one, therefore we have the following loss function:
crowd counting datasets can rather be considered as weak
N
labels for density map estimation. It is more reasonable X
to take such annotations as priors or likelihoods instead of LBayes = F(1 − E[cn ]), (6)
n=1
the learning targets. Loss functions that impose such strict,
pixel-to-pixel supervisions as Eq. (2) on density map are where F(·) is a distance function and we adopt ℓ1 distance
not always beneficial to enhance the count estimation accu- in our experiments. A special case should be handled when
racy when used to train a CNN model, because it forces the there is no object in a training image. In such scenario we
model to learn inaccurate, or even erroneous information. directly force the sum of the density map to zero. Our pro-
posed loss function is differentiable and can be readily ap-
3.2. Bayesian Loss
plied to a given CNN using the standard back propagation
Let x be a random variable that denotes the spatial lo- training algorithm.
cation and y be a random variable that represents the anno- At the inference stage, we do not have to know the
tated head point. Based on the above discussions, instead of posterior label probability p(yn |xm ) in advance, because
6144
when we sum over the estimated density map, we eliminate 𝑑
p(yn |xm ) as follows:
N M
N X 𝒛𝑚
𝑛 𝒛𝑚
0
𝒙m
X X
est
C= E[cn ] = p(yn |xm )D (xm )
n=1 n=1 m=1 Figure 1: Geometrical illustration of the dummy back-
M
X N
X ground point, where xm denotes a pixel in the density map,
= p(yn |xm )Dest (xm ) (7) zm m
n is its nearest head point and z0 is the defined dummy
m=1 n=1 background point.
M
X
= Dest (xm ). To define the background likelihood, we construct a dummy
m=1 background point for each pixel,
3.3. Background Pixel Modelling
x m − zm
n
zm m
0 = zn + d , (13)
For background pixels that are far away from any of the k x m − zm
n k2
annotation points, it makes no sense to assign them to any
head label yn . To better model the background pixels, we where zm n denotes the nearest head point of xm , and d is
introduce an additional background label y0 = 0, in addi- a parameter that controls the margin between the head and
tion to the head labels {yn = n : n = 1, 2, . . . , N }. Then, the dummy background points. As illustrated in Fig. 1, with
the posterior label probability can be rewritten as: the defined dummy background point zm 0 , for a pixel xm
that is far away from head points, it can be assigned to the
p(xm |yn )p(yn ) background label instead. Here we also use the Gaussian
p(yn |xm ) = N
P
p(xm |yn )p(yn ) + p(xm |y0 )p(y0 ) kernel to define the background likelihood,
n=1
(8) def
p(xm |yn ) p(xm |y0 ) = N (xm ; zm 2
0 , σ 12×2 )
= .
N
(d − kxm − zm 2 (14)
1 n k2 )
P
p(xm |yn ) + p(xm |y0 ) = √ exp(− ).
2σ 2
n=1 2πσ
The last equation is simplified with the assumption p(yn ) =
p(y0 ) = N1+1 , without loss of generality. Similarly, we 3.4. Visualization and Analysis
have:
We build the entropy map Ent of label assignment for
p(xm |y0 ) visualization and analysis, which is calculated for each pixel
p(y0 |xm ) = N
. (9)
P xm as follows,
p(xm |yn ) + p(xm |y0 )
n=1
N
X
And the expected counts for each person and for the entire Ent(xm ) = − p(yn |xm ) ln p(yn |xm ). (15)
background are defined as: n=0
M
X The entropy measures the uncertainty on the label a pixel
E[cn ] = p(yn |xm )Dest (xm ), (10) xm in the density map belongs to. We display entropy maps
m=1 with different settings in Fig. 2 and have the following sum-
M marizes:
X
E[c0 ] = p(y0 |xm )Dest (xm ). (11)
• The posterior could find the boundary between persons
m=1
roughly.
In this
PMcase, the summation over the whole density
map m=1 Dest (xm ) consists of the foreground counts • Dense crowd areas have higher entropy values than
PN sparse areas.
n=1 E[cn ] and the background count E[c0 ]. Obviously,
we would like the background count to be zero and the fore- • The parameter σ controls the softness of the posterior
ground count at each annotation point equals to one, thus we label probability, comparing (b) and (c).
have the following enhanced loss function,
• Pixels far from crowds are handled better via back-
ground pixel modelling, comparing (b) and (e).
N
• The parameter d controls the margin between fore-
X
LBayes+ = F(1 − E[cn ]) + F(0 − E[c0 ]). (12)
n=1
ground and background, comparing (e) and (f).
6145
(a) Input image (b) Without y0 , σ = 16 (c) Without y0 , σ = 32
(d) Blend of (a,e) (e) With y0 , σ = 16, d = 100 (f) With y0 , σ = 16, d = 200
Figure 2: Visualization of the posterior label probability. We construct an entropy map using Eq. (15), which measures the
uncertainty on the label a pixel in the density map belongs to. The color is warmer, the value is larger. (a) Input image.
(b)-(c): Entropy maps with different σ, without background pixel modelling. (e)-(f): Entropy maps with different d, with
background pixel modelling. (d): Blend of the input image and the entropy map in (e).
6146
Datasets UCF-QNRF ShanghaiTechA ShanghaiTechB UCF CC 50
Methods MAE MSE MAE MSE MAE MSE MAE MSE
C ROWD -CNN [53] - - 181.8 277.7 32.0 49.8 467.0 498.5
MCNN [57] 277 426 110.2 173.2 26.4 41.3 377.6 509.1
CMTL [40] 252 514 101.3 152.4 20.0 31.1 322.8 341.4
S WITCH -CNN [3] 228 445 90.4 135.0 21.6 33.4 318.1 439.2
CP-CNN [41] - - 73.6 106.4 20.1 30.1 295.8 320.9
ACSCP [37] - - 75.7 102.7 17.2 27.4 291.0 404.6
D-C ONV N ET [38] - - 73.5 112.3 18.7 26.0 288.4 404.7
IG-CNN [2] - - 72.5 118.2 13.6 21.1 291.4 349.4
IC -CNN [32] - - 68.5 116.2 10.7 16.0 260.9 365.5
SAN ET [4] - - 67.0 104.5 8.4 13.6 258.4 334.9
CL-CNN [16] 132 191 - - - - - -
BASELINE 106.8 183.7 68.6 110.1 8.5 13.9 251.6 331.3
Our BAYESIAN 92.9 163.0 64.5 104.0 7.9 13.3 237.7 320.8
Our BAYESIAN + 88.7 154.8 62.8 101.8 7.7 12.7 229.3 308.2
Table 1: Benchmark evaluations on four benchmark crowd counting datasets using the MAE and MSE metrics. The baseline
and our methods are trained using VGG-19.
(a) Input image (b) BASELINE (c) Our BAYESIAN (d) Our BAYESIAN +
Figure 3: Density maps generated by (b) BASELINE, our (c) BAYESIAN, and (d) BAYESIAN +. The color is warmer, the density is higher.
Note that in dense crowds, BASELINE often produces abnormal values, while in sparse areas, it can not localize each person well. In
contrast, our methods give more accurate count estimate and localization, respectively.
6147
Training details. We augment the training data using ran- BASELINE BAYESIAN
dom crop and horizontal flipping. We note that image 140
MAE
resolutions in UCF-QNRF vary widely from 0.08 to 66 120
megapixels. However, a regular CNN can not deal with 100
80
images with all kinds of scales due to its limited receptive 0 4 8 12 16 20 24 28 32
field. Therefore, we limit the shorter side of each image 250
MSE
within 2048 pixels in UCF-QNRF. Images are then ran- 200
domly cropped for training, the crop size is 256 × 256 for
150
ShanghaiTechA and UCF CC 50 where image resolutions 0 4 8 12 16 20 24 28 32
are smaller, and 512 × 512 for ShanghaiTechB and UCF- σ
QNRF. We set the Gaussian parameter σ in Eqs. (3) and Figure 4: The curves of testing results for different losses w.r.t.
(14) to 8 and the distance parameter d in Eq. (13) to 15% the Gaussian parameter σ on UCF-QNRF.
of the shorter side of image. The parameters are selected
accurate estimations. Our methods benefit from the pro-
on a validation set (120 images randomly sampled from the
posed probability model which constructs soft posterior
training set) of UCF-QNRF.
probabilities if the pixel is close to several head points. In
4.4. Experimental Evaluations sparse areas, on the other hand, BASELINE can not recog-
nize each person well, while our methods predict more ac-
Quantitative results. We compare our proposed method
curate results both on count estimation and localization.
with the baseline and the state-of-the-art methods on the
benchmark datasets described in Sec. 4.2. To make a 4.5. Ablation Studies
fair comparison, the baseline method (BASELINE) shares
Effect of σ. Both the proposed BAYESIAN and the BASE -
the same network structure (VGG-19) and training pro-
LINE methods use the parameter σ for the Gaussian kernel.
cess as ours. We use Eq. (1) to generate the “ground-
In our loss function, σ controls the softness of the posterior
truth” density maps for the baseline method and follow
label probability as shown in Fig. 2, while it determines the
previous works [57, 16] to select parameters for the Gaus-
crowd density distribution directly in BASELINE. In this
sian kernel. Specifically, the geometry-adaptive kernels are
subsection, we study the effect of σ on the two methods
adopted for UCF-QNRF, ShanghaiTechA and UCF CC 50,
by computing their MAE and MSE values w.r.t. different σ
while a fixed Gaussian kernel with σ = 15 is used for
values on UCF-QNRF. As can be seen from the curves in
ShanghaiTechB. We study both our basic Bayesian loss
Fig. 4:
(BAYESIAN) and the enhanced Bayesian loss with the back-
ground pixel modelling (BAYESIAN +). We show the exper- • Our BAYESIAN performs well in a wide range of val-
imental results in Table 1 and the highlights can be summa- ues of σ. Our MAE and MSE is less than 98.0 and
rized as follows: 180.0 when σ changes from 0.1 to 32.0.
• BAYESIAN + achieves the state-of-the-art accuracy on • BASELINE is more sensitive to this parameter and its
all the four benchmark datasets. On the latest and the MAE and MSE vary from 118.4 to 136.2 and from
toughest UCF-QNRF dataset, it reduces the MAE and 192.3 to 250.6, respectively.
MSE values of the best method (CL-CNN) by 43.3
and 36.2, respectively. It is worth mentioning that our Effect of d. The proposed BAYESIAN + method introduces
method does not use any external detection models or an additational parameter d to control the margin between
multi-scale structures. foreground and background. Fig. 5 shows the performance
• BAYESIAN + consistently improves the performance of of BAYESIAN + w.r.t. d where we can conclude that:
BAYESIAN by around 3% on all the four datasets. • Our BAYESIAN + method performs well in a wide
• Both BAYESIAN and BAYESIAN + outperform range of of values of d. BAYESIAN + consistently out-
BASELINE significantly on all the four datasets. performs BAYESIAN when d is from 3% to 100% of
BAYESIAN + makes 15% improvements on UCF- the shorter side of the image.
QNRF, 9% on ShanghaiTechA, 8% on Shang- • The parameter d has the meaning that the had size
haiTechB, and 8% on UCF CC 50, respectively. would not exceed d so that d should not be too small.
Visualization of the estimated density maps. We visu-
alize the estimated density maps using different training Robustness to annotation error. In this subsection, we
losses in Fig. 3. From the close-ups we can see that BASE - discuss the robustness of different loss functions w.r.t. an-
LINE often predicts abnormally values in the congested ar- notation error. Labeling a person by a single point is am-
eas, in contrast, our BAYESIAN and BAYESIAN + give more biguous, because the person occupies an area in the image.
6148
BASELINE BAYESIAN BAYESIAN + UCF-QNRF → ShanghaiTechA ShanghaiTechB UCF CC 50
110 Methods MAE MSE MAE MSE MAE MSE
MAE
Although most of the datasets place the annotation point at With Resize Without Resize
the center of each head, small errors from human labeling is Methods MAE MSE MAE MSE
inevitable. In this experiment, we simulate human labeling BASELINE 106.8 183.7 128.7 193.8
Our BAYESIAN 92.9 163.0 115.1 187.0
errors by adding uniform random noises to the original head Our BAYESIAN + 88.7 154.8 112.6 181.1
positions, and test the performance of different losses at
several noise levels. Since σ is the spatial variance of Gaus- Table 3: The effect of limiting image resolution on UCF-QNRF.
sian distribution, a larger σ is helpful to tolerate such spatial
Different backbones. Our proposed loss functions can be
noises. Therefore, we evaluate our BAYESIAN with σ = 1
readily applied to any network structure to improve its per-
and σ = 16, respectively, and BASELINE with σ = 16. As
formance on the crowd counting task. Here we apply the
can be seen from Fig. 6, the proposed BAYESIAN performs
proposed losses to both VGG-19 and AlexNet and make
better than BASELINE in different noise levels, even with a
comparisons with the baseline loss. The quantitative results
smaller σ value.
from Table 4 indicate that our Bayesian loss functions out-
BASELINE BAYESIAN BAYESIAN
perform the baseline loss significantly on both the networks.
(σ = 16) (σ = 16) (σ = 1)
140
Backbones VGG-19 [39] AlexNet [17]
Methods MAE MSE MAE MSE
MAE
120
100 BASELINE 106.8 183.7 130.8 221.0
80 Our BAYESIAN 92.9 163.0 121.2 202.5
0% 1% 2% 3% 4% 5% Our BAYESIAN + 88.7 154.8 116.3 191.7
300
MSE
6149
References [18] Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro,
David Vazquez, and Mark Schmidt. Where are the blobs:
[1] Carlos Arteta, Victor S. Lempitsky, and Andrew Zisserman. Counting by localization with point supervision. In ECCV,
Counting in the wild. In ECCV, 2016. 1 2018. 1
[2] Deepak Babu Sam, Neeraj N. Sajjan, R. Venkatesh Babu,
[19] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Be-
and Mukundhan Srinivasan. Divide and grow: Capturing
yond bags of features: Spatial pyramid matching for recog-
huge diversity in crowd images with incrementally growing
nizing natural scene categories. In CVPR, 2006. 2
cnn. In CVPR, 2018. 6
[20] Victor Lempitsky and Andrew Zisserman. Learning to count
[3] Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu.
objects in images. In NIPS, 2010. 1, 2
Switching convolutional neural network for crowd counting.
In CVPR, 2017. 6 [21] Min Li, Zhaoxiang Zhang, Kaiqi Huang, and Tieniu Tan. Es-
[4] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale timating the number of people in crowded scenes by MID
aggregation network for accurate and efficient crowd count- based foreground segmentation and head-shoulder detection.
ing. In ECCV, 2018. 1, 2, 6 In ICPR, pages 1–4, 2008. 1, 2
[5] Antoni B. Chan, Zhang-Sheng John Liang, and Nuno Vas- [22] Sheng-Fuu Lin, Jaw-Yeh Chen, and Hung-Xin Chao. Esti-
concelos. Privacy preserving crowd monitoring: Counting mation of number of people in crowded scenes using per-
people without people models or tracking. In CVPR, 2008. spective transformation. IEEE Trans. Systems, Man, and Cy-
1, 2 bernetics, Part A, 31(6):645–654, 2001. 1, 2
[6] Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ram- [23] Bo Liu and Nuno Vasconcelos. Bayesian model adaptation
prasaath R. Selvaraju, Dhruv Batra, and Devi Parikh. Count- for crowd counts. In ICCV, 2015. 2
ing everyday objects in everyday scenes. In CVPR, 2017. 1, [24] Jiang Liu, Chenqiang Gao, Deyu Meng, and Alexander G.
2 Hauptmann. Decidenet: Counting varying density crowds
[7] Ke Chen, Shaogang Gong, Tao Xiang, and Chen Change through attention guided detection and density estimation.
Loy. Cumulative attribute space for age and crowd density In CVPR, 2018. 2
estimation. In CVPR, 2013. 1, 2 [25] Lingbo Liu, Hongjun Wang, Guanbin Li, Wanli Ouyang, and
[8] Joseph Paul Cohen, Genevieve Boucher, Craig A. Glaston- Liang Lin. Crowd counting using deep recurrent spatial-
bury, Henry Z. Lo, and Yoshua Bengio. Count-ception: aware network. In IJCAI, 2018. 2
Counting by fully convolutional redundant counting. In [26] Xialei Liu, Joost van de Weijer, and Andrew D. Bagdanov.
ICCV Workshops, 2017. 1 Leveraging unlabeled data for crowd counting by learning to
[9] Diptodip Deb and Jonathan Ventura. An aggregated mul- rank. In CVPR, 2018. 2
ticolumn dilated convolution network for perspective-free [27] Zheng Ma, Lei Yu, and Antoni B. Chan. Small instance de-
counting. In CVPR Workshops, 2018. 1 tection by integer programming on object density maps. In
[10] Pedro F Felzenszwalb, David A McAllester, Deva Ramanan, CVPR, 2015. 1
et al. A discriminatively trained, multiscale, deformable part
[28] Mark Marsden, Kevin McGuinness, Suzanne Little, Ciara E.
model. In CVPR, 2008. 2
Keogh, and Noel E. O’Connor. People, penguins and petri
[11] Luca Fiaschi, Ullrich Köthe, Rahul Nair, and Fred A. Ham- dishes: Adapting object counting models to new visual do-
precht. Learning to count with regression forest and struc- mains and object types without forgetting. In CVPR, 2018.
tured labels. In ICPR, 2012. 1, 2 1
[12] Weina Ge and Robert T. Collins. Marked point processes for
[29] T. Nathan Mundhenk, Goran Konjevod, Wesam A. Sakla,
crowd counting. In CVPR, 2009. 1, 2
and Kofi Boakye. A large contextual dataset for classifica-
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
tion, detection and counting of cars with deep learning. In
Delving deep into rectifiers: Surpassing human-level perfor-
ECCV, 2016. 1
mance on imagenet classification. In ICCV, 2015. 5
[30] Daniel Oñoro-Rubio and Roberto Javier López-Sastre. To-
[14] Meng-Ru Hsieh, Yen-Liang Lin, and Winston H. Hsu.
wards perspective-free object counting with deep learning.
Drone-based object counting by spatially regularized re-
In ECCV, 2016. 1
gional proposal network. In ICCV, 2017. 1
[15] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak [31] Viet-Quoc Pham, Tatsuo Kozakaya, Osamu Yamaguchi, and
Shah. Multi-source multi-scale counting in extremely dense Ryuzo Okada. COUNT forest: Co-voting uncertain number
crowd images. In CVPR, 2013. 1, 2, 5 of targets using random forest for crowd density estimation.
[16] Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong In ICCV, 2015. 1, 2
Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak [32] Viresh Ranjan, Hieu Le, and Minh Hoai. Iterative crowd
Shah. Composition loss for counting, density map estima- counting. In ECCV, 2018. 1, 2, 6
tion and localization in dense crowds. In ECCV, 2018. 1, 2, [33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
3, 5, 6, 7 Faster r-cnn: Towards real-time object detection with region
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. proposal networks. In NIPS, 2015. 2
Imagenet classification with deep convolutional neural net- [34] Weihong Ren, Di Kang, Yandong Tang, and Antoni B. Chan.
works. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Fusing crowd density maps and visual object trackers for
Weinberger, editors, NIPS. 2012. 2, 5, 8 people tracking in crowd scenes. In CVPR, 2018. 2
6150
[35] David Ryan, Simon Denman, Clinton Fookes, and Sridha ing of joint deep feature and prediction refinement for salient
Sridharan. Crowd counting using multiple local features. In object detection. In ICCV, 2019. 2
DICTA, 2009. 1, 2 [53] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang
[36] Chong Shang, Haizhou Ai, and Bo Bai. End-to-end crowd Yang. Cross-scene crowd counting via deep convolutional
counting via joint learning local and global count. In ICIP, neural networks. In CVPR, 2015. 1, 2, 6
2016. 1, 2 [54] Lu Zhang, Miaojing Shi, and Qiaobo Chen. Crowd counting
[37] Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, and via scale-adaptive convolutional neural network. In WACV,
Xiaokang Yang. Crowd counting via adversarial cross-scale 2018. 2
consistency pursuit. In CVPR, 2018. 6 [55] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and
[38] Zenglin Shi, Le Zhang, Yun Liu, Xiaofeng Cao, Yangdong Stan Z. Li. Occlusion-aware r-cnn: Detecting pedestrians
Ye, Ming-Ming Cheng, and Guoyan Zheng. Crowd counting in a crowd. In ECCV, 2018. 2
with deep negative correlation learning. In CVPR, 2018. 1, [56] Shanghang Zhang, Guanhang Wu, Joao P. Costeira, and Jose
2, 6 M. F. Moura. Fcn-rlstm: Deep spatio-temporal neural net-
[39] Karen Simonyan and Andrew Zisserman. Very deep convo- works for vehicle counting in city cameras. In ICCV, 2017.
lutional networks for large-scale image recognition. ICLR, 1
abs/1409.1556, 2014. 2, 5, 8 [57] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao,
[40] Vishwanath A. Sindagi and Vishal M. Patel. Cnn-based cas- and Yi Ma. Single-image crowd counting via multi-column
caded multi-task learning of high-level prior and density es- convolutional neural network. In CVPR, 2016. 1, 2, 3, 5, 6,
timation for crowd counting. In AVSS, 2017. 6 7
[41] Vishwanath A. Sindagi and Vishal M. Patel. Generating [58] Tao Zhao and Ramakant Nevatia. Bayesian human segmen-
high-quality crowd density maps using contextual pyramid tation in crowded situations. In CVPR, 2003. 1, 2
cnns. In ICCV, 2017. 6 [59] Zhuoyi Zhao, Hongsheng Li, Rui Zhao, and Xiaogang Wang.
[42] Korsuk Sirinukunwattana, Shan e Ahmed Raza, Yee-Wah Crossing-line crowd counting with two-phase deep neural
Tsang, David R. J. Snead, Ian A. Cree, and Nasir M. Rajpoot. networks. In ECCV, 2016. 1
Locality sensitive deep learning for detection and classifica- [60] Sanping Zhou, Jinjun Wang, Deyu Meng, Yudong Liang,
tion of nuclei in routine colon cancer histology images. IEEE Yihong Gong, and Nanning Zheng. Discriminative fea-
Trans. Med. Imaging, 35(5):1196–1206, 2016. 1 ture learning with foreground attention for person re-
[43] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. identification. IEEE Trans. Image Processing, 2019. 2
End-to-end people detection in crowded scenes. In CVPR, [61] Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong,
2016. 2 and Nanning Zheng. Point to set similarity based deep fea-
[44] Matthias von Borstel, Melih Kandemir, Philip Schmidt, ture learning for person re-identification. In CVPR, 2017.
Madhavi K. Rao, Kumar T. Rajamani, and Fred A. Ham- 2
precht. Gaussian process density counting from weak super-
vision. In ECCV, 2016. 1
[45] Elad Walach and Lior Wolf. Learning to count with CNN
boosting. In ECCV, 2016. 1
[46] Chuan Wang, Hua Zhang, Liang Yang, Si Liu, and Xiaochun
Cao. Deep people counting in extremely dense crowds. In
ACM MM, 2015. 1, 2
[47] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas
Huang, and Yihong Gong. Locality-constrained linear cod-
ing for image classification. In CVPR, 2010. 2
[48] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian
Sun, and Chunhua Shen. Repulsion loss: Detecting pedestri-
ans in a crowd. In CVPR, 2018. 2
[49] Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, and
Nanning Zheng. Grassmann pooling as compact homoge-
neous bilinear pooling for fine-grained visual classification.
In ECCV, 2018. 2
[50] Xing Wei, Yue Zhang, Yihong Gong, and Nanning Zheng.
Kernelized subspace pooling for deep local descriptors. In
CVPR, 2018. 2
[51] Feng Xiong, Xingjian Shi, and Dit-Yan Yeung. Spatiotem-
poral modeling for crowd counting in videos. In ICCV, 2017.
2
[52] Yingyue Xu, Dan Xu, Xiaopeng Hong, Wanli Ouyang, Ji
Rongrong, Xu Min, and Guoying Zhao. Structured model-
6151