CROWD TRACKING AND IDENTIFICATION

by
Kotha Abhishek Reddy
Under the supervision
of
Dr. D. P. Dogra

SCHOOL OF ELECTRICAL SCIENCES


INDIAN INSTITUTE OF TECHNOLOGY BHUBANESWAR
DECLARATION

I certify that

1. The work contained in the thesis is original and has been done by myself under
the general supervision of my supervisor.

2. The work has not been submitted to any other Institute for any degree or
diploma.

3. I have followed the guidelines provided by the Institute in writing the thesis.

4. I have conformed to the norms and guidelines given in the Ethical Code of
Conduct of the Institute.

5. Whenever I have used materials (data, theoretical analysis, and text) from other
sources, I have given due credit to them by citing them in the text of the thesis
and giving their details in the references.

6. Whenever I have quoted written materials from other sources, I have put them
under quotation marks and given due credit to the sources by citing them and
giving required details in the references.

Signature of the Student

ABSTRACT
It is estimated that the world population, 7.4 billion as of March 2016, may grow to 11.2 billion by the year 2100. In light of this continuous population growth, crowd analysis and monitoring has become an important research area in computer vision. Dynamic crowd management has been studied by researchers from social, computational, and psychological perspectives. The work presented here applies current computer vision techniques to track people in crowded scenes and obtain a count of them. The algorithms used are YOLOv3 and Deep SORT.
Contents

List of Figures

1 Introduction
  1.1 Overview of the Area
  1.2 Problem Statement
  1.3 Motivation for Crowd Tracking and Analysis

2 Literature Review

3 Methodology
  3.1 YOLOv3
      3.1.1 Bounding Box Prediction
      3.1.2 Predictions Across Scales
      3.1.3 Feature Extractor
  3.2 Deep SORT
      3.2.1 Kalman Filtering
      3.2.2 Linear Assignment
      3.2.3 Matching Cascade

4 Experiments and Results
  4.1 Implementation
      4.1.1 Training
      4.1.2 Output

5 Conclusion

6 References

List of Figures

1.1 Densely crowded scenes

3.1 Flow chart of the process

3.2 Bounding boxes with dimension priors and location prediction

3.3 Darknet-53

4.1 Quantitative analysis of the proposed system

4.2 One frame of a 4K video of people walking in Manhattan

4.3 One frame of a video shot on a OnePlus 7T

Chapter 1

Introduction

A crowd forms when a large number of people gather, usually in pursuit of a common goal. The word "crowd" evokes far more than a mere quantity of people: a crowd may be volatile, easygoing, or celebratory, and may unexpectedly erupt into acts of aggression. Crowd formation is fundamentally psychological, and the unpredictable behavior of a crowd must not be overlooked. Crowd scene analysis and monitoring is an evolving and active field of study. Crowd behavior is of immense concern in a variety of practical applications:

1. Crowd management: Crowd scene analysis and monitoring is applied to develop crowd handling and control techniques for popular, everyday events such as sports matches, concerts, and public exhibitions, in order to avoid crowd disasters and to facilitate emergency evacuations.

2. Virtual environments: Virtual environments are essential for developing mathematical models of crowds, in order to improve the simulation of crowd behavior and the human experience of it.

3. Visual surveillance: Such surveillance methods can help automatically detect irregular activities and raise alarms over a crowd. Visual tracking of individuals helps police identify and apprehend suspects.

4. Public space design: Through scene analysis, the layout of public spaces can be designed, which in turn helps optimize the use of large areas such as shopping malls, stadiums, and railway stations.

5. Intelligent environments: If crowd scene analysis is done intelligently, it can guide a whole crowd, or an individual, toward existing and desired crowd patterns.

1.1 Overview of the Area


Sociologists, psychologists, and civil engineers study crowd control and space design, whereas computer vision researchers focus more on virtual environments, surveillance, and intelligent environments. Monitoring large crowds is a very challenging task, currently done using surveillance cameras controlled manually by remote human operators. The number of video feeds is usually overwhelmingly large relative to the number of officers monitoring them, rendering such surveillance systems almost useless for real-time detection of threats. In the past, video surveillance systems were supported by computer vision algorithms that could not cope with densely crowded scenes, such as those illustrated in Fig. 1.1. As crowd density increases, there is always a noticeable degradation in performance in terms of object detection and tracking.

Figure 1.1: Densely crowded scenes

The standard technique for moving object detection with fixed cameras is background subtraction. Detecting moving objects in a video stream is the first step of information extraction in many computer vision applications, including video surveillance, people tracking, traffic monitoring, and semantic annotation of videos. Video cameras are used extensively in surveillance applications to monitor public areas such as train stations, airports, and shopping centres. When crowds are dense, automatically tracking individuals becomes a difficult task. Anomaly detection, also known as outlier detection, is likewise applicable in a variety of such applications.

1.2 Problem Statement
There are a few existing online trackers for crowd estimation and tracking. In high-density crowds, popular techniques such as background subtraction fail. We cannot deal with this problem the way we do simple object tracking, where the number of objects is small and most parts of each object are visible. Inter-object occlusion and self-occlusion in crowd situations make detection and tracking challenging. In such situations we also cannot perform segmentation, detection, and tracking separately; we should treat them as a single problem.

There are four main challenges in crowd detection and tracking:

1. Inter-object occlusion: a situation in which part of the target object is hidden behind another object. Inter-object occlusion becomes critical as the crowd density increases; in very high-density crowds, most of an object's parts may not be visible.

2. Self-occlusion: sometimes an object occludes itself; for example, when a person talks on a mobile phone, the phone and hand may hide part of the head. This type of occlusion is more temporary and short-term.

3. Size of the visible region of the object: as the density of a crowd increases, the size of the visible region of each object decreases. Detecting and tracking such an object in a dense scene is very difficult.

4. Appearance ambiguity: when target objects are small, their appearance tends to be less distinguishable.

The main objectives of this project are to develop a robust method to detect, track, and count humans in a crowded video in the presence of occlusion, and to evaluate the performance of the system on real videos.

1.3 Motivation for Crowd Tracking and Analysis
In video scene analysis and understanding, the focus is on object detection, tracking, and behavior recognition. Conventional methods are inappropriate, and often fail, for densely crowded scenes with severe occlusions, ambiguities, and extreme clutter, where undetected anomalous activities might lead to disastrous situations. A crowd has both dynamic and psychological characteristics, so behavior analysis is a very complex task. Human crowds are often goal-oriented, and it is very difficult to model crowd dynamics at an appropriate level. There is a need to detect, track, and analyze the behavior of crowds in most surveillance scenarios.

Tracking aims to pinpoint a designated object within a given time frame. The main goal of an object tracker is to propagate the trajectory of an object, localizing the same object across the frames of a video. In tracking, a complete object region is maintained at every time instant. Identifying an individual or group in dense places is very cumbersome, owing to interactions among the agents, inter-object occlusions, and the behavior-driven workings of the crowd; if it is done imperfectly, it can cause many difficulties for tracking an agent within the group. Proficient tracking algorithms can improve accuracy over given shape/object data. These shapes are learned from an open scene and help identify the likely actions of individuals in similar scenes. In the real world, it is difficult to identify and locate people against multiple, moving backgrounds. The biggest challenge faced by tracking methods is unrelenting occlusion in crowded images, as depicted in Fig. 1.1.

Chapter 2

Literature Review

Crowd analysis and scene understanding has drawn a lot of attention recently because
it has a broad range of applications in video surveillance. Besides surveillance, crowd
scenes also exist in movies, TV shows, personal video collections, and also videos
shared through social media. Since crowd scenes have a large number of people accu-
mulated with frequent and heavy occlusions, many existing technologies of detection,
tracking, and activity recognition, which are only applicable to sparse scenes, do not
work well in crowded scenes. Therefore a lot of new research works, especially tar-
geting crowd scenes, have been done in the past years. They cover a broad range
of topics, including crowd segmentation and detection, crowd tracking, crowd count-
ing, pedestrian traveling time estimation, crowd attribute recognition, crowd behavior
analysis, and abnormality detection in a crowd.
Many existing works on crowd analysis are scene-specific, i.e., models trained
from a particular scene can only be applied to the same scene. When switching to
a different scene, data needs to be collected from the new scene to train the model
again. It limits the applications of these works. Recently, people worked toward the
goal of scene-independent crowd analysis i.e., once a generic crowd model is trained,
it can be applied to different scenes without being retrained. This is nontrivial given
the inherently complex crowd behaviors observed across different scenes. As there
are so many crowd scenes, how to characterize and compare their dynamics is a big
challenge.
Many studies show that various crowd systems do share a set of universal prop-
erties because some general principles underlie different types of crowd behaviors.

Researchers proposed and estimated some generic crowd properties, such as density,
collectiveness, stability, uniformity, and conflict, from the computer vision point of
view. A more comprehensive set of 94 attributes to characterize the locations, sub-
jects, and events/actions of a crowd has also been proposed. In recent years, deep learning has achieved great success in many grand challenges of computer vision, and it has also been applied to crowd analysis. Deep convolutional neural networks (CNNs) are now widely used for scene-independent crowd counting, crowd density estimation, and crowd attribute recognition.
The key for the success of deep learning is the availability of large scale training
data. Existing crowd datasets are very limited in size, scene-diversity, and annota-
tions, and are not suitable for training generic deep neural networks applicable to
different scenes. Very recently, two large-scale crowd datasets were proposed: the Shanghai World Expo'10 crowd dataset and the WWW crowd dataset. The Shanghai World Expo'10 dataset contains 2,630 video sequences collected from 235 surveillance cameras with disjoint views at the Shanghai 2010 World Expo. Crowd segmentation and the general crowd properties of density, collectiveness, and cohesiveness were annotated on each crowd segment of this dataset. In addition, the locations of pedestrians in the crowd were annotated for the purpose of crowd counting. The WWW crowd dataset provides videos with over 8 million frames from 8,257 scenes, annotated with 94 crowd attributes. Both datasets are suitable for training deep neural networks for crowd analysis.
Although convolutional networks have achieved great success on image recognition,
it is much more challenging to learn dynamic feature representations from videos
for crowd analysis with CNN. The temporal dimension is different from the spatial
dimensions. A straightforward way of treating videos as 3D volumes and directly
applying CNN on them cannot get very good results. Moreover, the computational
complexity of the training process is much higher than that on images. New network
architectures and training strategies need to be developed for crowd analysis.
A CNN architecture that jointly estimates the pedestrian count and the crowd density has also been studied. It is trained alternately on crowd count and density, which helps the training escape local minima. To handle an unseen target crowd scene, a data-driven method is presented to fine-tune the trained CNN model for the target scene. Crowd tracking and counting has also been done with particle filtering, which can handle nonlinear, non-Gaussian systems. Following the latest trend, Deep SORT combined with a good detector, YOLO in this case, can be used to address the problem statement.

Chapter 3

Methodology

In multiple object tracking, each frame of a video contains more than one object to track. A generic method to solve this has two steps:

Detection: First, all the objects in the frame are detected. There can be single or multiple detections.

Association: Once we have detections for the frame, matching is performed against similar detections from the previous frame. The matched detections are followed through the sequence to obtain the track of each object.

Figure 3.1: Flow chart of the process

3.1 YOLOv3
YOLO is one of the fastest object detection algorithms available. Although it is no longer the most accurate, it is a very good choice when you need real-time detection without losing too much precision. YOLO uses a single convolutional network to simultaneously predict several bounding boxes and the class probabilities for those boxes.

3.1.1 Bounding Box Prediction

Following YOLO9000, the system predicts bounding boxes using dimension clusters as anchor boxes. The network predicts four coordinates for each bounding box: tx, ty, tw, th. If the cell is offset from the top-left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:

    bx = σ(tx) + cx
    by = σ(ty) + cy
    bw = pw * e^(tw)
    bh = ph * e^(th)

During training we use a sum of squared error loss. If the ground truth for some coordinate prediction is t̂*, our gradient is the ground truth value (computed from the ground truth box) minus our prediction, t̂* − t*. This ground truth value can be easily computed by inverting the equations above. YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold, we ignore the prediction. We use a threshold of 0.5. Our system assigns only one bounding box prior to each ground truth object. If a bounding box prior is not assigned to a ground truth object, it incurs no loss for coordinate or class predictions, only for objectness.
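For illustration, the decoding step can be written as a short NumPy sketch of the four equations above (the function and sample values are our own, not part of the YOLOv3 code):

    import numpy as np

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
        """Map raw network outputs (tx, ty, tw, th) to a box,
        given the cell offset (cx, cy) and prior size (pw, ph)."""
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        bx = sigmoid(tx) + cx          # center x, in grid-cell units
        by = sigmoid(ty) + cy          # center y, in grid-cell units
        bw = pw * np.exp(tw)           # width scales the prior
        bh = ph * np.exp(th)           # height scales the prior
        return bx, by, bw, bh

    # Illustrative values: cell (3, 4), prior of size 116 x 90
    print(decode_box(0.2, -0.1, 0.5, 0.3, cx=3, cy=4, pw=116, ph=90))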

Figure 3.2: Bounding boxes with dimension priors and location prediction. We predict
the width and height of the box as offsets from cluster centroids. We predict the center
coordinates of the box relative to the location of filter application using a sigmoid
function.

3.1.2 Predictions Across Scales

YOLOv3 predicts boxes at three different scales. Our system extracts features from those scales using a concept similar to feature pyramid networks. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO we predict 3 boxes at each scale, so the tensor is N x N x [3 * (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers previous and upsample it by 2x. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size. We repeat the same design one more time to predict boxes for the final scale. Thus our predictions for the third scale benefit from all the prior computation as well as fine-grained features from early in the network. We still use k-means clustering to determine our bounding box priors. We chose 9 clusters and 3 scales and divide the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326).
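For concreteness, the following small sketch computes that tensor depth and prints the grid sizes; the 416x416 input and strides of 32, 16, and 8 are illustrative assumptions, not specified above:

    num_anchors_per_scale = 3
    num_classes = 80                  # COCO
    # 4 box offsets + 1 objectness + class scores, per anchor
    depth = num_anchors_per_scale * (4 + 1 + num_classes)

    # For a 416x416 input, strides of 32, 16, 8 give these grids:
    for n in (13, 26, 52):
        print(f"{n} x {n} x {depth}")  # e.g. 13 x 13 x 255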

3.1.3 Feature Extractor

We use a new network for performing feature extraction. It is a hybrid of the network used in YOLOv2 (Darknet-19) and the residual network approach. The network uses successive 3 x 3 and 1 x 1 convolutional layers, but now has shortcut connections as well and is significantly larger. Because it has 53 convolutional layers, we call it Darknet-53.
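As an illustration of this pattern, here is a minimal PyTorch sketch of one such residual unit; this is our own rendering of the 1x1/3x3 shortcut block, not the reference Darknet implementation, and the channel counts are examples:

    import torch
    import torch.nn as nn

    class DarknetResidual(nn.Module):
        """One Darknet-53-style residual unit: a 1x1 conv halves the
        channels, a 3x3 conv restores them, and a shortcut adds the
        input back."""
        def __init__(self, channels):
            super().__init__()
            half = channels // 2
            self.block = nn.Sequential(
                nn.Conv2d(channels, half, kernel_size=1, bias=False),
                nn.BatchNorm2d(half),
                nn.LeakyReLU(0.1),
                nn.Conv2d(half, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1),
            )

        def forward(self, x):
            return x + self.block(x)   # shortcut connection

    x = torch.randn(1, 64, 52, 52)
    print(DarknetResidual(64)(x).shape)  # torch.Size([1, 64, 52, 52])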

Figure 3.3: Darknet-53.

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Each network is trained with identical settings and tested at 256x256, single-crop accuracy. Run times are measured on a Titan X at 256x256. Darknet-53 thus performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed: it is better than ResNet-101 and 1.5x faster, and it has similar performance to ResNet-152 while being 2x faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. ResNets, by contrast, have many more layers and are less efficient.

3.2 Deep SORT
Simple online and realtime tracking (SORT) is a much simpler framework that per-
forms Kalman filtering in image space and frame-by-frame data association using the
Hungarian method with an association metric that measures bounding box overlap.
This simple approach achieves favorable performance at high frame rates.
While achieving good overall performance in terms of tracking precision and accuracy, SORT returns a relatively high number of identity switches. This is because
the employed association metric is only accurate when state estimation uncertainty
is low. Therefore, SORT has a deficiency in tracking through occlusions as they typ-
ically appear in frontal-view camera scenes. We overcome this issue by replacing the
association metric with a more informed metric that combines motion and appear-
ance information. In particular, we apply a convolutional neural network (CNN) that
has been trained to discriminate pedestrians on a large-scale person re-identification
dataset. Through integration of this network we increase robustness against misses
and occlusions while keeping the system easy to implement, efficient, and applicable
to online scenarios.

3.2.1 Kalman Filtering

The track handling and Kalman filtering framework is mostly identical to the original formulation. We assume a very general tracking scenario where the camera is uncalibrated and where we have no ego-motion information available. While these circumstances pose a challenge to the filtering framework, this is the most common setup considered in recent multiple object tracking benchmarks. Our tracking scenario is therefore defined on the eight-dimensional state space (u, v, γ, h, ẋ, ẏ, γ̇, ḣ) that contains the bounding box center position (u, v), aspect ratio γ, height h, and their respective velocities in image coordinates. We use a standard Kalman filter with a constant-velocity motion model and a linear observation model, where we take the bounding box coordinates (u, v, γ, h) as direct observations of the object state.
For each track k we count the number of frames since the last successful mea-
surement association ak. This counter is incremented during Kalman filter prediction
and reset to 0 when the track has been associated with a measurement. Tracks that

exceed a predefined maximum age Amax are considered to have left the scene and are
deleted from the track set. New track hypotheses are initiated for each detection that
cannot be associated to an existing track. These new tracks are classified as tentative
during their first three frames. During this time, we expect a successful measure-
ment association at each time step. Tracks that are not successfully associated to a
measurement within their first three frames are deleted.
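As a sketch of this motion model, the following NumPy snippet builds the constant-velocity transition matrix F and the observation matrix H for the eight-dimensional state; the noise covariances and the update step are omitted, and the numbers are illustrative only:

    import numpy as np

    dim = 4                       # observed: u, v, gamma (aspect ratio), h
    # Transition F: each position component moves by its velocity (dt = 1).
    F = np.eye(2 * dim)
    F[:dim, dim:] = np.eye(dim)
    # Observation H: we measure (u, v, gamma, h) directly, not velocities.
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])

    x = np.array([320., 240., 0.5, 100., 2., 0., 0., 0.])  # example state
    x_pred = F @ x                # prediction step: center moves by (2, 0)
    print(H @ x_pred)             # predicted measurement: ~[322 240 0.5 100]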

3.2.2 Linear Assignment

A conventional way to solve the association between the predicted Kalman states and
newly arrived measurements is to build an assignment problem that can be solved
using the Hungarian algorithm. Into this problem formulation we integrate motion
and appearance information through combination of two appropriate metrics.
To incorporate motion information we use the (squared) Mahalanobis distance between predicted Kalman states and newly arrived measurements:

    d^(1)(i, j) = (dj − yi)^T Si^(−1) (dj − yi),

where we denote the projection of the i-th track distribution into measurement space by (yi, Si) and the j-th bounding box detection by dj. The Mahalanobis distance takes state estimation uncertainty into account by measuring how many standard deviations the detection is away from the mean track location. Further, using this metric it is possible to exclude unlikely associations by thresholding the Mahalanobis distance at a 95 percent confidence interval computed from the inverse χ² distribution. We denote this decision with an indicator

    b^(1)(i, j) = 1[d^(1)(i, j) ≤ t^(1)]

that evaluates to 1 if the association between the i-th track and j-th detection is admissible. For our four-dimensional measurement space the corresponding Mahalanobis threshold is t^(1) = 9.4877.
While the Mahalanobis distance is a suitable association metric when motion un-
certainty is low, in our image-space problem formulation the predicted state distribu-

tion obtained from the Kalman filtering framework provides only a rough estimate of
the object location. In particular, unaccounted camera motion can introduce rapid
displacements in the image plane, making the Mahalanobis distance a rather un-
informed metric for tracking through occlusions. Therefore, we integrate a second
metric into the assignment problem. For each bounding box detection dj we compute an appearance descriptor rj with ||rj|| = 1. Further, we keep a gallery Rk = {rk^(i)}, i = 1, ..., Lk, of the last Lk = 100 associated appearance descriptors for each track k. Then, our second metric measures the smallest cosine distance between the i-th track and j-th detection in appearance space:

    d^(2)(i, j) = min { 1 − rj^T rk^(i) | rk^(i) ∈ Rk }.

Again, we introduce a binary variable to indicate whether an association is admissible according to this metric:

    b^(2)(i, j) = 1[d^(2)(i, j) ≤ t^(2)],

and we find a suitable threshold for this indicator on a separate training dataset by comparing the distances of correct and false association hypotheses against provided ground truth. In practice, we apply a pre-trained CNN to compute bounding box appearance descriptors.
In combination, both metrics complement each other by serving different aspects of the assignment problem. On the one hand, the Mahalanobis distance provides information about possible object locations based on motion, which is particularly useful for short-term predictions. On the other hand, the cosine distance considers appearance information, which is particularly useful for recovering identities after long-term occlusions, when motion is less discriminative. To build the association problem we combine both metrics using a weighted sum

    c(i, j) = λ d^(1)(i, j) + (1 − λ) d^(2)(i, j),

where we call an association admissible if it is within the gating region of both metrics:

    b(i, j) = b^(1)(i, j) · b^(2)(i, j).

The influence of each metric on the combined association cost can be controlled through the hyperparameter λ. During our experiments we found that setting λ = 0 is a reasonable choice when there is substantial camera motion. In this setting, only appearance information is used in the association cost term. However, the Mahalanobis gate is still used to disregard infeasible assignments based on the possible object locations inferred by the Kalman filter.
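To illustrate, here is a minimal sketch of the gated, combined association using SciPy's Hungarian-algorithm solver. The two distance matrices are assumed precomputed, and the appearance threshold t2 below is an assumed placeholder; as described above, the actual value is found on a training set:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(d_motion, d_appearance, lam=0.0,
                  t1=9.4877, t2=0.2):           # t2 is an assumed value
        """Combine the two metrics, gate inadmissible pairs, and solve
        the assignment. Both inputs are (tracks x detections) matrices."""
        cost = lam * d_motion + (1.0 - lam) * d_appearance
        gate = (d_motion <= t1) & (d_appearance <= t2)
        cost = np.where(gate, cost, 1e5)        # large cost blocks gated pairs
        rows, cols = linear_sum_assignment(cost)
        return [(i, j) for i, j in zip(rows, cols) if gate[i, j]]

    d_m = np.array([[1.0, 20.0], [15.0, 2.0]])
    d_a = np.array([[0.05, 0.9], [0.8, 0.1]])
    print(associate(d_m, d_a))                  # [(0, 0), (1, 1)]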

3.2.3 Matching Cascade

Instead of solving for measurement-to-track associations in a global assignment problem, we introduce a cascade that solves a series of subproblems. To motivate this
approach, consider the following situation: When an object is occluded for a longer
period of time, subsequent Kalman filter predictions increase the uncertainty associ-
ated with the object location. Consequently, probability mass spreads out in state
space and the observation likelihood becomes less peaked. Intuitively, the associ-
ation metric should account for this spread of probability mass by increasing the
measurement-to-track distance. Counterintuitively, when two tracks compete for the
same detection, the Mahalanobis distance favors larger uncertainty, because it ef-
fectively reduces the distance in standard deviations of any detection towards the
projected track mean. This is an undesired behavior as it can lead to increased track
fragmentations and unstable tracks. Therefore, we introduce a matching cascade that
gives priority to more frequently seen objects to encode our notion of probability
spread in the association likelihood.
Listing 1 outlines our matching algorithm. As input we provide the sets of track indices T and detection indices D, as well as the maximum age Amax. In lines 1 and 2 we compute the association cost matrix and the matrix of admissible associations. We then iterate over track age n to solve a linear assignment problem for tracks of increasing age. In line 6 we select the subset of tracks Tn that have not been associated with a detection in the last n frames. In line 7 we solve the linear assignment between tracks in Tn and unmatched detections U. In lines 8 and 9 we update the set of matches and unmatched detections, which we return after completion in line 11. Note that this matching cascade gives priority to tracks of smaller age, i.e., tracks that have been seen more recently.
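The listing itself did not survive into this document, so the following Python sketch reconstructs its logic from the description above; min-cost matching is delegated to SciPy, and the cost and gate matrices are assumed precomputed:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def matching_cascade(cost, gate, track_ages, num_detections, max_age):
        """Solve a series of assignment subproblems by increasing track
        age, so recently seen tracks get priority (lines 1-11 above)."""
        matches, unmatched = [], set(range(num_detections))
        for n in range(1, max_age + 1):                  # line 5
            T_n = [i for i, a in enumerate(track_ages) if a == n]  # line 6
            U = sorted(unmatched)
            if not T_n or not U:
                continue
            sub = cost[np.ix_(T_n, U)]                   # line 7
            rows, cols = linear_sum_assignment(sub)
            for r, c in zip(rows, cols):
                i, j = T_n[r], U[c]
                if gate[i, j]:                           # lines 8 and 9
                    matches.append((i, j))
                    unmatched.discard(j)
        return matches, unmatched                        # line 11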

In a final matching stage, we run intersection-over-union association, as proposed in the original SORT algorithm [12], on the set of unconfirmed and unmatched tracks of age n = 1. This helps to account for sudden appearance changes, e.g., due to partial occlusion with static scene geometry, and to increase robustness against erroneous initialization.

Chapter 4

Experiments and Results

4.1 Implementation

4.1.1 Training

The proposed model is trained on the CrowdHuman dataset, which contains 15,000 images with various crowd densities. The hardware used for this implementation was an Intel Core i5-7200U CPU @ 2.50GHz × 4 with a GeForce 940MX GPU, on an Ubuntu 20.04 system. The experiment is divided into two stages: identification and tracking of objects. The project is Python-based, was evaluated on five different video sequences, and runs at a high frame rate.

4.1.2 Output

Once training is completed, the weights file is used for object detection in videos. The input video file is broken down into its frames, and each frame is passed to our trained object detector; once detection is done, the bounding box information is passed to the Deep SORT tracking algorithm and object tracking is performed. The model has been tested on three video sequences, with the following results.
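As an illustration of this flow, a hypothetical sketch follows; the detector and tracker objects stand in for the trained YOLOv3 model and the Deep SORT tracker and are not the actual project code:

    import cv2

    def run(video_path, detector, tracker):
        """Hypothetical glue code: detect people per frame, then hand the
        boxes to the tracker, which maintains IDs across frames."""
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Detector returns per-person boxes with confidence scores.
            boxes = detector.detect(frame)       # [(x, y, w, h, score), ...]
            # Tracker associates boxes with existing tracks, assigning IDs.
            tracks = tracker.update(boxes, frame)
            for t in tracks:
                x, y, w, h = map(int, t.box)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(frame, str(t.track_id), (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
        cap.release()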

Figure 4.1: Quantitative analysis of the proposed system

Figure 4.2: One frame of a 4K video of people walking in Manhattan

Figure 4.3: One frame of a video shot on a OnePlus 7T

The detection results above are frames obtained by running the model on the respective videos at 30 fps. There are two bounding boxes in the results: the white bounding box shows the person ID, and the green bounding box shows the detection of an individual person with its confidence score.

Chapter 5

Conclusion

In this report, visual object tracking is performed on videos by training a detector on a custom dataset consisting of 15,000 images. Moving object detection is done using the YOLO detector, with the Deep SORT tracker following the objects across consecutive frames. Accuracy and precision can be improved by training the system for more epochs and by fine-tuning the detector during training. The performance of the SORT tracker depends entirely on the detector's performance, as it follows a tracking-by-detection approach.

One drawback of this system is that the people count becomes inflated for a dense crowd under occlusion: a new ID is assigned to the same subject in consecutive frames, increasing the count. For future work, a better model could be designed that avoids the problem of severe occlusions, resulting in better accuracy.

Chapter 6

References

1. J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.

2. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.

3. N. Wojke and A. Bewley, "Deep cosine metric learning for person re-identification," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 2018, pp. 748-756, doi: 10.1109/WACV.2018.00087.

4. B. Yang and R. Nevatia, "An online learned CRF model for multi-target tracking," in CVPR, 2012, pp. 2034-2041.

5. Y. Kobayashi et al., "3D head tracking using the particle filter with cascaded classifiers," in BMVC, 2006.

6. R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, "Pedestrian detection at 100 frames per second," in CVPR. IEEE, 2012.

7. N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in 2017 IEEE International Conference on Image Processing (ICIP), 2017, doi: 10.1109/ICIP.2017.8296962.

8. S. Moon, J. Lee, D. Nam, H. Kim, and W. Kim, "A comparative study on multi-object tracking methods for sports events," in 2017 19th International Conference on Advanced Communication Technology (ICACT), 2017.

9. "Object tracking in deep learning," MissingLink.ai, 2019. [Online]. Available: https://missinglink.ai/guides/computervision/object-tracking-deep-learning/

10. R. Khandelwal, "Computer vision — a journey from CNN to Mask R-CNN and YOLO," Medium, 2019. [Online]. Available: https://medium.com/datadriveninvestor/computer-vision-a-journey-from-cnn-to-mask-r-cnn-and-yolo-1d141eba6e04

11. S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, "Joint probabilistic data association revisited," in ICCV, 2015, pp. 3047-3055.

12. T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in CVPR, 2017, pp. 2117-2125.

13. https://nanonets.com/blog/object-tracking-deepsort/

14. R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, "Online multi-target tracking with strong and weak detections," in European Conference on Computer Vision. Springer, 2016, pp. 84-99.

15. A. Andriyenko, K. Schindler, and S. Roth, "Discrete-continuous optimization for multi-target tracking," in CVPR, 2012, pp. 1926-1933.

16. C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, "Multiple hypothesis tracking revisited," in ICCV, 2015, pp. 4696-4704.