
Person Head Detection Based Deep Model for People Counting in Sports Videos

Sultan Daud Khan 1, Habib Ullah 1, Mohib Ullah 2, Nicola Conci 3, Faouzi Alaya Cheikh 2, Azeddine Beghdadi 4

1 University of Ha’il, Saudi Arabia. 2 Norwegian University of Science and Technology, Norway.
3 University of Trento, Italy. 4 University of Paris 13, France.

Abstract

People counting in sports venues is emerging as a new domain in the field of video surveillance. People counting in these venues faces many key challenges, such as severe occlusions, few pixels per head, and significant variations in head size caused by the wide sports areas. We propose a deep model based method that works as a head detector and takes the scale variations of heads in videos into account. Our method is based on the observation that the head is the most visible body part in sports venues where large numbers of people are gathered. To cope with the problem of different scales, we generate scale-aware head proposals based on a scale map. The scale-aware proposals are then fed to a Convolutional Neural Network (CNN), which provides a response matrix containing the presence probabilities of people observed across scene scales. We then use non-maximal suppression to obtain accurate head positions. For the performance evaluation, we carry out extensive experiments on two standard datasets and compare the results with state-of-the-art (SoA) methods. The results in terms of Average Precision (AvP), Average Recall (AvR), and Average F1-Score (AvF-Score) show that our method outperforms the SoA methods.

1. Introduction

Gatherings of people are regularly observed at different sports events, which are mainly recreational activities. However, unlikely events or disasters can still happen. To ensure people's safety, it is important to analyze sports dynamics and congestion at sports scenes. Sports scene analysis can be used in various applications, for instance, detecting critical crowd levels, detecting anomalies, and tracking individuals or groups of individuals. To deal with these problems efficiently, the most important step is to count the number of people in the sports scene.

People counting in sports scenes has received significant attention from the research community. Counting information is useful for planning future sports events and for the design of sports venues [41]. With the counting information, one can determine the distribution of people in the sports venue, which is very important for safety and security personnel. People counting information can also be used to detect and track an individual in dense gatherings [14].

For public control and safety at sports events, it is important to accurately estimate the number of people attending the event. Acknowledging the importance of people counting, various counting methods have been reported in the literature. Most of these methods are based on regression models [1, 42, 21, 44]. However, these models are not capable of localizing individuals in the scene. Therefore, they cannot precisely provide the locations of pedestrians, which are sometimes crucial for event managers and organizers. Other traditional methods [10, 26] treat people counting as a pedestrian detection problem. These methods perform well in low-density scenes, where most of the human body is visible, but they suffer a setback in high-density situations, because most of the human body is hidden by severe occlusions and clutter. In high-density situations, where people stand close to each other and bodies are partially occluded, the head is the most visible part.

Although several attempts have been made at human head detection in recent years, head detection is still a challenging task due to the scale and appearance variations of heads in the scene, as illustrated in Fig. 1. To precisely detect human heads, it is imperative to capture these scale variations. To address this problem, recent head detection methods [31] use an object proposal generation strategy to guide the search for the object and capture different scales in the image. This strategy avoids an exhaustive search for the object across all locations of the image. EdgeBox [47] generates object proposals based on low-level image cues, such as saliency, gradients, and edge information. DeepBox [16] improves the object proposals by re-ranking the proposals generated by EdgeBox. MultiBox [19] generates object proposals by bounding box regression based on feature maps produced by convolution layers. These methods achieve good results in multi-object detection only when the objects are large. However, high-density images are usually complex: the objects of interest (human heads) are small and exhibit large scale variations. Consequently, these methods are less effective and produce low recall rates when applied to high-density situations. To overcome the limitations of the above-mentioned object proposal methods, we propose a multiple-scale strategy for generating object proposals to detect human heads. In our method, the number of detections represents the number of people in a scene.

Figure 1. We depict scale variations in the image. The size of the human head changes drastically due to perspective distortion. The head at the bottom (in yellow) is large compared to the heads in the green and red boxes.
Our method consists of three stages. The first stage is to train and employ a Convolutional Neural Network (CNN) that acts as a head detector. In the second stage, we generate a scale map and obtain scale-aware head proposals. Each proposal is fed to the CNN, which returns a classification score. After processing all proposals, a response matrix is obtained, where higher responses indicate higher probabilities of the presence of heads. Finally, non-maximal suppression is applied to the response matrix and the final results are rendered. The overall framework of our proposed method is depicted in Fig. 2. Compared to other state-of-the-art methods, our framework makes the following contributions. (1) In contrast to other methods, our method detects human heads in both low- and high-density crowd video frames. (2) It handles scale variations in the video frame by generating scale-aware object proposals. (3) Unlike other people counting methods that only estimate the count, our method counts and localizes human heads simultaneously.

The rest of the paper is organized as follows. In Section 2, we present related work. The details of the proposed people counting and localization method are presented in Section 3. Experiments and evaluations on two benchmark datasets, considering five different reference methods, are shown in Section 4. The conclusion is presented in Section 5.

2. Related Work
Deep learning models have achieved significant success in a wide range of applications in recent years. Different deep learning methods have been introduced for image and motion segmentation [2, 36, 35], detection [37, 29, 38], tracking [39], and classification, with outstanding performance. Due to this success, CNN models [15] have also been used in the literature to provide relevant information about the number of people in a scene. Deep learning based methods for people counting can be divided into two classes, namely regression-based methods and detection-based methods. In regression-based methods, counting is performed by regressing from image features to the crowd size; many of these methods produce density maps from the images and compute the count by integrating over the density map. In detection-based methods, individual people are localized explicitly and the count is given by the number of detections.
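To make the density-map formulation concrete, the toy sketch below (not part of the original paper) places a normalized Gaussian kernel at each annotated head point, so that summing the resulting map recovers the number of people; the kernel width and image size are arbitrary illustrative values.

```python
# Toy illustration of density-map based counting (not the method proposed in this paper):
# a unit Gaussian is placed at every annotated head location, so the integral
# (sum) of the density map equals the number of annotated heads.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, height, width, sigma=4.0):
    """Build a crowd density map from (row, col) head annotations."""
    impulses = np.zeros((height, width), dtype=np.float64)
    for r, c in head_points:
        impulses[int(r), int(c)] += 1.0            # one unit of "mass" per person
    return gaussian_filter(impulses, sigma=sigma)  # spread each unit with a Gaussian kernel

heads = [(40, 60), (42, 75), (100, 200), (180, 150)]   # toy annotations
dmap = density_map(heads, height=240, width=320)
print("estimated count:", dmap.sum())                   # ~4.0, the number of heads
```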
Kong et al. [13] described a viewpoint-invariant learning-based method for counting people in crowds from a single camera. Their method uses feature normalization to deal with perspective projection and different camera orientations. Zhang et al. [43] proposed a scale-adaptive CNN architecture with a backbone of fixed small receptive fields. They extract feature maps from multiple layers, adapt them to the same output size, and combine them to generate the final density map. The number of people is then estimated by integrating the density map. Ma et al. [20] introduced a multi-scale head detector. They use gradient differences to extract the foreground of the images and apply overlapping patches at different scales to split the input images. Zhang et al. [45] proposed a Multi-Column Convolutional Neural Network (MCNN) for counting people in a single image. For this purpose, they use three columns with filters of different receptive field sizes to compensate for perspective distortion. Zhang et al. [42] introduced a CNN regression model with two configurations that estimates the number of people in different crowd scenes. Sam et al. [27] investigated Switch-CNN, which considers multiple CNN-based crowd counting architectures and exploits a switching strategy to select one network at a time based on its performance. Sindagi et al. [34] introduced the Contextual Pyramid CNN, which finds the people count by producing high-quality crowd density maps using global and local contextual information of the scene images. For performance evaluation, Kang et al. [11] compared different density estimation methods.
Figure 2. We feed each proposal to the CNN to obtain a classification score. A response matrix is obtained, where higher responses indicate higher probabilities of heads. For the purpose of visualization, the boxes of the proposals are not overlapping.

Detection-based methods [33, 30] train object detectors to find the position of each individual person. Therefore, the people count is represented by the number of detections in the scene. Liu et al. [18] presented a hybrid method that uses both regression- and detection-based counting and adaptively decides the appropriate counting mode for different image locations. Our proposed method is inspired by Shami et al. [33]. However, unlike feeding general object proposals to the network as proposed in [33], we produce scale-aware proposals by exploiting a scale map, which represents an estimate of the object scales. We use the scale map to guide proposal generation rather than exhaustively searching all scales. We have found that producing scale-aware proposals is very effective and decreases the search space. This approach also filters out false positives at improper scales. Zhu et al. [46] estimated crowd density by considering different regression networks. It is worth noting that regression-based methods perform better in high-density situations, as they capture generalized density information from the crowd image, yet they suffer from a number of weaknesses. Firstly, their performance degrades when applied to low-density scenes because they tend to overestimate the count. Secondly, these methods do not localize human heads; therefore, they provide no information about the distribution of pedestrians in the scene, which is very crucial for safety and security personnel at sports events. Numerous object proposal methods have also been introduced. For example, Ghodrati et al. [8] generate object proposals by an inverse cascade from the final to the initial layer. Erhan et al. [6] and Liu et al. [19] extract object regions by bounding box regression based on CNN feature maps. In high-density scenes with heavy occlusions, people usually stand very close to each other. Therefore, heads are the most visible body parts. The small size of the heads in such scenes makes counting very difficult, and the current state-of-the-art region proposal methods are not very effective: they produce low recall rates when high-density videos are considered. To deal with this key problem, we propose a robust method for producing object proposals that encodes different ranges of scales for different object sizes.

3. Human heads counting and localization

In this paper, we propose a new method for detecting human heads in sports videos with large scale variations. It consists of three stages. In the first stage, the generation of object proposals serves as a pre-processing step. Object proposals are exploited to optimize the search for objects and to avoid an exhaustive search across all video frame locations.

To generate object proposals, the initial step is to compute a scale map M for the video frame F. To produce the scale map, the influence of perspective must be taken into account. Motivated by the work of Chan et al. [3], for each scene we choose a random number of individuals between the two extremes (top and bottom) of the video frame and mark their heads by drawing a horizontal straight line between two points on the head. This line represents the size of the head, as shown for a portion of the video frame in Fig. 3 (best viewed zoomed in). We then calculate the scale map by linearly interpolating between the two extremes of the video frame. The scale map is shown in Fig. 3, where red represents larger head sizes and blue represents smaller head sizes; the vertical bar shows the range of scales in the video frame. After producing the scale map M, we extract object proposals. For this purpose, we overlay a grid O of points on the video frame. Note that the grid O and the scale map M have the same resolution as the input video frame F. We assume that M(pi) represents the size of the head (in pixels) at location pi. For every point pi ∈ O, we produce a bounding box of size M(pi) with the point pi as its center. As we are interested in head detection, we keep square shapes (aspect ratio 1) for all bounding boxes and refer to them as locations of interest. From Fig. 3, it is clear that the proposals at the bottom of the video frame are larger, reflecting the larger head sizes there, and that the proposals become smaller as we move towards the top of the video frame.

Figure 3. To produce object proposals, we compute a scale map for a video frame: we choose a random number of individuals between the two extremes (top and bottom) of the video frame and mark their heads by drawing a horizontal straight line between two points on the head. For the purpose of visualization, the boxes of the proposals are not overlapping.
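A minimal sketch of this proposal-generation step is given below, assuming only two manual annotations (head size at the top and at the bottom of the frame) and a fixed grid stride; the function names, stride, and head sizes are illustrative choices rather than values prescribed by the paper.

```python
# Minimal sketch of scale-aware proposal generation (illustrative parameters).
import numpy as np

def linear_scale_map(frame_h, frame_w, head_size_top, head_size_bottom):
    """Scale map M: head size (in pixels) linearly interpolated from top row to bottom row."""
    per_row = np.linspace(head_size_top, head_size_bottom, frame_h)
    return np.tile(per_row[:, None], (1, frame_w))        # same value along each row

def scale_aware_proposals(scale_map, stride=16):
    """Square boxes (y0, x0, y1, x1), one per grid point, sized by the scale map."""
    h, w = scale_map.shape
    boxes = []
    for y in range(0, h, stride):
        for x in range(0, w, stride):
            half = int(round(scale_map[y, x] / 2.0))
            boxes.append((max(0, y - half), max(0, x - half),
                          min(h, y + half), min(w, x + half)))
    return boxes

M = linear_scale_map(frame_h=1080, frame_w=1920, head_size_top=14, head_size_bottom=60)
proposals = scale_aware_proposals(M, stride=16)
print(len(proposals), "proposals; first:", proposals[0], "last:", proposals[-1])
```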
Our head detector is based on the R-CNN model of Oquab et al. [22]. Their model [22] does not localize heads explicitly, but it can be adapted for this purpose. Additionally, we use scale-aware proposals instead of selective search proposals to encode heads at different scales. For each input proposal, we extend the bounding box by a small margin to encode the local video frame context around the head. The corresponding frame patch is then resized to 224 x 224 pixels to fit the input layer of the R-CNN. We extend the R-CNN model [22] with one fully connected layer of 2048 nodes, initialized randomly and followed by ReLU and Dropout. The R-CNN model [22] is trained on ImageNet and then fine-tuned on images of human faces extracted from the HollywoodHeads dataset [17]. In its original form, this head detector is of limited use to us, since we are working with very high-density videos in which a human head barely spans a few pixels and the face is almost unrecognizable. Therefore, we further fine-tuned the network on training images from the WorldExpo'10 [42] dataset.
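The sketch below shows how such an extension could be written in PyTorch. The ResNet-18 backbone is only a stand-in for the network of [22], and pretrained-weight loading is omitted; the added layer sizes follow the description above (a 2048-unit fully connected layer with ReLU and Dropout feeding a two-class head/background output).

```python
# Sketch of the classifier extension described above (backbone is a stand-in,
# not the exact network of [22]; in practice ImageNet-pretrained weights would be loaded).
import torch
import torch.nn as nn
import torchvision

class HeadDetector(nn.Module):
    def __init__(self, dropout=0.5):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)  # stand-in backbone
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()                            # keep only convolutional features
        self.backbone = backbone
        self.classifier = nn.Sequential(                       # extra randomly initialized layer
            nn.Linear(in_features, 2048),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(2048, 2),                                # head vs. background
        )

    def forward(self, patches):                                # patches: (N, 3, 224, 224)
        return self.classifier(self.backbone(patches))

model = HeadDetector()
scores = model(torch.randn(4, 3, 224, 224))                   # four resized proposal patches
print(scores.shape)                                            # torch.Size([4, 2])
```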
For training, we use stochastic gradient descent (SGD) with momentum to optimize the parameters by minimizing the sum of independent log-losses. We extract scale-aware patches that contain exactly one head and feed them to the network as positive samples. We generate negative samples from the background and from visible human torsos, since we want our head detector to give a high response only on heads and not on other body parts. Unlike the traditional R-CNN pipeline, which uses an SVM for a second pass of training, we use the output of the R-CNN directly to score the proposals. After feeding all the proposals to the R-CNN, the output is the response map R(pi), where R(pi) is the score of the proposal and pi ∈ O is its location in the video frame. Higher values of the response map indicate the presence of a head, while lower values represent the background.
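As a rough, self-contained sketch (with a tiny stand-in classifier and assumed hyper-parameters), one training step of this scheme could look as follows; the cross-entropy term with sum reduction corresponds to the sum of independent log-losses.

```python
# Illustrative fine-tuning step: SGD with momentum minimizing a sum of per-sample
# log-losses over head (label 1) and background/torso (label 0) patches.
# The tiny linear model and all hyper-parameters below are stand-ins, not the paper's values.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))   # stand-in for the head CNN
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
log_loss = nn.CrossEntropyLoss(reduction="sum")                     # sum of independent log-losses

def train_step(patches, labels):
    """patches: (N, 3, 224, 224) crops; labels: (N,) with 1 = head, 0 = background/torso."""
    model.train()
    optimizer.zero_grad()
    loss = log_loss(model(patches), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

patches = torch.randn(4, 3, 224, 224)          # toy batch: two head crops, two negatives
labels = torch.tensor([1, 1, 0, 0])
print("batch loss:", train_step(patches, labels))
```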
To compute the precise locations of the heads, we post-process the response map using non-maximal suppression [9][25], which finds local peaks (maxima) above a fixed threshold. The accuracy of the recovered head positions is mainly affected by the choice of this threshold value.
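A minimal sketch of this post-processing step is shown below: a grid point is kept when its response is the maximum within a local window and exceeds the threshold. The window size and threshold are illustrative values, not the settings used in the experiments.

```python
# Sketch of threshold-based non-maximal suppression on the response map
# (window size and threshold are illustrative, not the paper's settings).
import numpy as np

def peak_detections(response, window=9, threshold=0.5):
    """Return (row, col) positions that are local maxima of `response` above `threshold`."""
    h, w = response.shape
    half = window // 2
    peaks = []
    for y in range(h):
        for x in range(w):
            v = response[y, x]
            if v <= threshold:
                continue
            patch = response[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1]
            if v >= patch.max():                      # keep only the local maximum
                peaks.append((y, x))
    return peaks

rng = np.random.default_rng(0)
R = rng.random((60, 80)) * 0.4                         # toy response map, mostly background
R[20, 30], R[45, 70] = 0.9, 0.8                        # two strong "head" responses
heads = peak_detections(R, window=9, threshold=0.5)
print("people count:", len(heads), "positions:", heads)
```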
4. Evaluations and Results

In this section, we evaluate the performance of our proposed framework both qualitatively and quantitatively using two publicly available datasets, namely S-HOCK [32] and UCF-HDDC [10].

The S-HOCK dataset was collected during the 26th Winter Universiade held in Trento (Italy), which attracted 100,000 people from all over the world. The data was collected from four different matches, and five cameras were used to capture different parts of the spectator crowd. To cover the ice rink and a panoramic view of all the bleachers, a full HD camera (1920x1080, 30 fps, 4 mm focal length) was used, while three high-resolution cameras (1280x1024, 30 fps, 12 mm focal length) covered different parts of the spectator crowd. For each match, a pool of sequences was selected to represent a wide, uniform, and representative spectrum of situations, e.g., tens of instances of goals, shots on goal, saves, faults, and timeouts (each sequence contains more than one event). The dataset contains 75 video sequences in total, each with a duration of 31 seconds (930 frames). Sample frames from the dataset are shown in the first row of Fig. 4.

The UCF-HDDC dataset is one of the largest and most challenging datasets for human detection in high-density crowds. It contains 108 crowd images downloaded from Flickr, covering different crowd scenes with varying densities. Most of the images come from marathon events, where some of the people are spectators watching the marathon and others are participants. Similarly, half/full body occlusions and lighting conditions vary, which makes human head detection a challenging problem in this dataset. Sample frames from the UCF-HDDC dataset are shown in the bottom row of Fig. 4.

Figure 4. Sample frames from the S-HOCK [32] and UCF-HDDC [10] datasets. The first row depicts sample frames from the S-HOCK dataset and the second row depicts sample frames from the UCF-HDDC dataset.
We now evaluate and compare the performance of our method with other reference methods. On the S-HOCK dataset, we compare our results with five baseline methods. The first method (HOG + SVM) [42] learns HOG features from the images using a linear SVM classifier. The second method (HASC + SVM) [28] learns the Heterogeneous Auto-Similarities of Characteristics (HASC) [28] descriptor using an SVM classifier. In these two methods, a sliding window approach is used to generate response maps and detections. The third method, the Aggregate Channel Features (ACF) [4] detector, is based on the Viola-Jones framework. The Deformable Part Model (DPM) [7] combines templates of different parts of the human body and learns a latent SVM classifier. The fifth model is the Calvin Upper Body Detector (CUBD) [5], which combines DPM with the Viola-Jones face detector. As performance measures, we use the average precision, recall, and F1 scores. The results are reported in Table 1. From the table, it can be seen that our proposed method outperforms the other state-of-the-art methods.

Table 1. Performance of different methods using the S-HOCK [32] dataset.

Methods            AvP    AvR    AvF-Score
HOG + SVM [42]     0.74   0.56   0.63
HASC + SVM [28]    0.36   0.64   0.46
ACF [4]            0.49   0.62   0.54
DPM [7]            0.50   0.42   0.46
CUBD [5]           0.84   0.30   0.44
Proposed           0.87   0.72   0.77
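For clarity, the sketch below shows one common way to compute precision, recall, and F1 for point-wise head detections; the greedy distance-based matching and its pixel tolerance are assumptions made for illustration, since the exact matching rule is not spelled out here. Averaging such per-frame scores over the test frames would yield values of the AvP/AvR/AvF-Score type reported in the tables.

```python
# Illustrative precision/recall/F1 computation for head detections.
# The greedy distance-based matching (tolerance in pixels) is an assumption,
# not necessarily the exact evaluation protocol used in the paper.
import numpy as np

def detection_metrics(detections, ground_truth, tol=10.0):
    """detections, ground_truth: lists of (row, col) head centers."""
    unmatched_gt = list(ground_truth)
    tp = 0
    for d in detections:
        if not unmatched_gt:
            break
        dists = [np.hypot(d[0] - g[0], d[1] - g[1]) for g in unmatched_gt]
        j = int(np.argmin(dists))
        if dists[j] <= tol:                 # detection matches a still-unmatched head
            tp += 1
            unmatched_gt.pop(j)
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gt = [(20, 30), (45, 70), (50, 10)]
det = [(22, 31), (44, 69), (90, 90)]        # two correct detections, one false positive
print(detection_metrics(det, gt))           # -> (0.666..., 0.666..., 0.666...)
```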
In Table 2, we compare our results with other state-of-the-art methods using the UCF-HDDC dataset. These methods are CN-HOG [12], HOG-LBP [40], Joint Deep Network (JDN) [23], DPM [7], and CTF [24]. CN-HOG [12] concatenates the Color Name descriptor, as a color feature, with the HOG feature for object detection. HOG-LBP [40] computes HOG and LBP features for human detection in images.

Table 2. Performance of different methods using the UCF-HDDC [10] dataset.

Methods        AvP    AvR    AvF-Score
CN-HOG [12]    0.33   0.25   0.28
HOG-LBP [40]   0.14   0.19   0.15
JDN [23]       0.25   0.21   0.22
DPM [7]        0.31   0.39   0.34
CTF [24]       0.06   0.15   0.08
Proposed       0.60   0.57   0.58
From Table 2, it can be seen that our proposed method outperforms the state-of-the-art methods by a significant margin. The superior performance of our method is attributed to the fact that it produces scale-aware proposals that cover the range of head sizes in each video frame. To visualize the performance, we present qualitative results in Fig. 5 for both datasets. The first column shows sample video frames from the datasets. The second column depicts the response maps for the corresponding video frames. The third column shows the detection results, where each detected head is marked in red with a bounding box. It is obvious from Fig. 5 that our method performs well in both high- and low-density sports videos and is independent of the scene density.

Figure 5. Visualization of head detection results. The first column shows sample frames from both datasets (the first two rows are from S-HOCK and the last two rows belong to UCF-HDDC). The second column shows the response maps and the third column shows the head detections. The figure is best viewed in color.

From the experiments, we observe that most of the state-of-the-art methods work well in low-density situations. However, their performance deteriorates when applied to high-density images. The lower performance of these methods is attributed to the following factors: (1) low resolution, (2) small head size, and (3) scale variance. In low-resolution images with extreme clutter, humans become extremely blurred, resulting in weak edges that degrade the performance of HOG-based detectors. The small size of the human head also causes problems for the DPM detector, as it has a lower limit of 23 x 23 pixels on the detectable part size. The camera position relative to the human heads in the scene causes perspective distortion, which in turn causes a scale problem. All these problems are tackled by our proposed method in an efficient and effective way. The complexity of our method is proportional to the size of the grid, so there is a trade-off between performance and complexity: a denser grid increases the computational cost but yields better performance.
5. Conclusion

We proposed a method to count the number of people in sports videos by detecting human heads. We produced scale-aware head region proposals by exploiting perspective effects, which tremendously decreases the classification time and also improves the detection accuracy. We carried out experimental evaluations on two datasets and reported significant improvements in the results.

In our future work, we will adaptively compute the localization magnitude for the response maps; investigating an optimal strategy for localization is therefore an important research direction. We will also develop a dynamic method to generate the scale map without using manual head annotations.

6. Acknowledgment

We are very thankful to NVIDIA Corporation for supporting our research work with the donation of a Titan Xp GPU.

References

[1] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In European Conference on Computer Vision, pages 483–498. Springer, 2016.
[2] L. Bi, J. Kim, E. Ahn, A. Kumar, D. Feng, and M. Fulham. Step-wise integration of deep class-specific learning for dermoscopic image segmentation. Pattern Recognition, 85:78–89, 2019.
[3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–7, 2008.
[4] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
[5] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. Articulated human pose estimation and search in (almost) unconstrained still images. ETH Zurich, D-ITET, BIWI, Technical Report No. 272:22, 2010.
[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2147–2154, 2014.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2009.
[8] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. Van Gool. DeepProposal: Hunting objects by cascading deep convolutional layers. In IEEE International Conference on Computer Vision (ICCV), pages 2578–2586, 2015.
[9] J. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4507–4515, 2017.
[10] H. Idrees, K. Soomro, and M. Shah. Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(10):1986–1998, 2015.
[11] D. Kang, Z. Ma, and A. B. Chan. Beyond counting: Comparisons of density maps for crowd analysis tasks-counting, detection, and tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[12] F. S. Khan, R. M. Anwer, J. Van De Weijer, A. D. Bagdanov, M. Vanrell, and A. M. Lopez. Color attributes for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3306–3313, 2012.
[13] D. Kong, D. Gray, and H. Tao. A viewpoint invariant approach for crowd counting. In International Conference on Pattern Recognition (ICPR), volume 3, pages 1187–1190. IEEE, 2006.
[14] L. Kong, D. Huang, J. Qin, and Y. Wang. A joint framework for athlete tracking and action recognition in sports videos. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] W. Kuo, B. Hariharan, and J. Malik. DeepBox: Learning objectness with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 2479–2487, 2015.
[17] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[18] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann. DecideNet: Counting varying density crowds through attention guided detection and density estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5197–5206, 2018.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[20] T. Ma, Q. Ji, and N. Li. Scene invariant crowd counting using multi-scales head detection in video surveillance. IET Image Processing, 12(12):2258–2263, 2018.
[21] D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629. Springer, 2016.
[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717–1724, 2014.
[23] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. In IEEE International Conference on Computer Vision (ICCV), pages 2056–2063, 2013.
[24] M. Pedersoli, A. Vedaldi, J. Gonzalez, and X. Roca. A coarse-to-fine approach for fast deformable object detection. Pattern Recognition, 48(5):1844–1853, 2015.
[25] T. Q. Pham. Non-maximum suppression using fewer than two comparisons per pixel. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 438–451. Springer, 2010.
[26] V. Rabaud and S. Belongie. Counting crowded moving objects. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 705–711, 2006.
[27] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 6, 2017.
[28] M. San Biagio, M. Crocco, M. Cristani, S. Martelli, and V. Murino. Heterogeneous auto-similarities of characteristics (HASC): Exploiting relational information for classification. In IEEE International Conference on Computer Vision (ICCV), pages 809–816, 2013.
[29] E. Sangineto, M. Nabi, D. Culibrk, and N. Sebe. Self paced deep learning for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):712–725, 2019.
[30] M. Saqib, S. D. Khan, N. Sharma, and M. Blumenstein. Person head detection in multiple scales using deep convolutional neural networks. In International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2018.
[31] M. Saqib, S. D. Khan, N. Sharma, and M. Blumenstein. Crowd counting in low-resolution crowded scenes using region-based deep convolutional neural networks. IEEE Access, 7:35317–35329, 2019.
[32] F. Setti, D. Conigliaro, P. Rota, C. Bassetti, N. Conci, N. Sebe, and M. Cristani. The S-HOCK dataset: A new benchmark for spectator crowd analysis. Computer Vision and Image Understanding, 159:47–58, 2017.
[33] M. Shami, S. Maqbool, H. Sajid, Y. Ayaz, and S.-C. S. Cheung. People counting in dense crowd images using sparse head detections. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[34] V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In IEEE International Conference on Computer Vision (ICCV), pages 1879–1888, 2017.
[35] H. Ullah, A. B. Altamimi, M. Uzair, and M. Ullah. Anomalous entities detection and localization in pedestrian flows. Neurocomputing, 290:74–86, 2018.
[36] H. Ullah and N. Conci. Crowd motion segmentation and anomaly detection via multi-label optimization. In ICPR Workshop on Pattern Recognition and Crowd Analysis, 2012.
[37] H. Ullah, M. Ullah, H. Afridi, N. Conci, and F. G. De Natale. Traffic accident detection through a hydrodynamic lens. In IEEE International Conference on Image Processing (ICIP), pages 2470–2474, 2015.
[38] H. Ullah, M. Ullah, and M. Uzair. A hybrid social influence model for pedestrian motion segmentation. Neural Computing and Applications, pages 1–17, 2018.
[39] M. Ullah and F. Alaya Cheikh. A directed sparse graphical model for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1816–1823, 2018.
[40] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In IEEE International Conference on Computer Vision (ICCV), pages 32–39, 2009.
[41] M. Xu, Z. Ge, X. Jiang, G. Cui, B. Zhou, C. Xu, et al. Depth information guided crowd counting for complex crowd scenes. Pattern Recognition Letters, 2019.
[42] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–841, 2015.
[43] L. Zhang, M. Shi, and Q. Chen. Crowd counting via scale-adaptive convolutional neural network. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1113–1121, 2018.
[44] Y. Zhang, C. Zhou, F. Chang, and A. C. Kot. Multi-resolution attention convolutional neural network for crowd counting. Neurocomputing, 329:144–152, 2019.
[45] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 589–597, 2016.
[46] L. Zhu, C. Li, Z. Yang, K. Yuan, and S. Wang. Crowd density estimation based on classification activation map and patch density level. Neural Computing and Applications, pages 1–12, 2019.
[47] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
