

1.1 Preamble
Object detection and tracking in video has significant scope in many computer
vision applications. The advent of powerful computers, the availability of
inexpensive video cameras, and the growing demand for automated video analysis
have generated great interest in moving object detection and tracking. Object
detection is the process of detecting instances of objects, and object tracking is
the process of locating moving objects in consecutive video frames. Automatic
detection and tracking of moving objects has remained an open research problem
for many years; it helps us understand and analyze objects in video without
relying on human operators to monitor the footage, which is a time-consuming and
tedious job. However, building an automated detection and tracking system is not
simple and involves many challenges. This research work focuses on the detection
and tracking of moving objects in video.

In this chapter, an overview of computer vision, pattern recognition and video
processing is presented, followed by moving object detection and tracking in video.
The motivation for the current research work is then discussed, the highlights of
the research work are summarized, and the outline of the thesis is drawn at the end.

1.2 Overview of Computer Vision and Pattern Recognition

We humans have a sense of vision, due to which we can see and perceive different
objects, perform different tasks and take decisions. This makes us intelligent,
autonomous and self-controlled. A computer, however, has no sense of vision
unless we bestow one upon it. Computer vision is an exciting and dynamic part of
cognitive and computer science that emulates the abilities of human vision by
electronically perceiving and understanding an image. Computer vision automates
tasks that human vision normally performs by extracting, analyzing and
understanding the valuable information in image frames. Extracting useful
information from images is very challenging, but not impossible.
The main challenge for any computer vision algorithm is the complexity of image
data. For example, in an image containing hundreds of objects in a scene with
little change in image intensity, it is very difficult to determine which objects
have left the scene and which have entered it, and it is also difficult to
differentiate foreground from background. A human cannot easily count the number
of people, trees and cars in a crowded scene, but an efficient computer vision
algorithm can perform this task.

Pattern recognition is the branch of machine learning and computer vision that
focuses on recognizing patterns and categorizing them into a number of classes.
Examples of pattern recognition problems include disease categorization,
fingerprint verification, face recognition, event detection, speech recognition
and optical character recognition. Pattern recognition aims to assign the most
likely class to each input, accounting for its variations during feature selection
and feature extraction. The feature extraction and classification strategies are
commonly founded on the availability of a training set that is already labeled.
Supervised learning is a form of machine learning that establishes the
input-output relationship of a system; it constructs an artificial system that can
learn the mapping between the input and the output. In unsupervised learning, by
contrast, the training data is not labeled. An algorithm that uses both labeled
and unlabeled training data is called a semi-supervised learning algorithm.
Variations in the patterns under analysis lead us to represent them with a set of
features instead of using the raw pattern itself.

1.3 Video Processing and Analysis - An Overview

Video is a sequence of frames, where each frame is a spatial distribution of
intensities that is constant with respect to time. Across the sequence, this
spatial intensity distribution varies over time, and analyzing such frequently
changing content is a challenging task. Video processing has been an open research
area for many years and has implications for a variety of applications. It can be
characterized as the analysis of video content and the understanding of the
potentially informative scene it depicts. Video processing has many significant
applications, including robotics, multimedia, surveillance and security. Video
analysis based methods are used to develop automated machine learning algorithms
that imitate the capabilities of the human visual system.

Video processing, computer vision and pattern recognition empower major real-world
applications, including:

• Optical Character Recognition (OCR): Recognizes printed or handwritten
characters and words.

• Biometrics: Automatic face recognition, interpretation of facial expressions,
fingerprint recognition and iris recognition.

• Medical Imaging: Visual representation of organs for analysis, interpretation
and diagnosis of disease.

• Industry Automation: Control systems such as robots and computers that replace
human beings by autonomously managing different processes.

• Gesture Recognition: Controlling or interacting with devices by sign alone,
without touching them physically. The identification and recognition of human
behavior also comes under gesture recognition, where machines communicate and
interact with humans by understanding body language.

• Object Tracking: Tracks an object in a video scene, such as a ball in a soccer
game, cars driving down a street, or people moving through a scene.

• Object Recognition: Identifying cars, humans, animals, etc., in a scene based
on shape, size, color, texture and motion information.

• Event Detection: Identifying certain phenomena or circumstances without
ambiguity, for example, the sudden scattering of people in a public place, a car
parking in a garage or parking lot, abandoned objects, etc.

• Surveillance and Security: Perceiving individuals and contributing an
exceptional sense of security using evident data.

• Traffic Monitoring: Inspecting the stream of vehicles and identifying accidents.

• Video Abstraction: Extracts key frames from a video sequence to obtain a
computerized description of videos and produce an object based outline.

• Intuitive Games: Offers a natural way of interacting with perceptive gaming
systems.

• Autonomous Vehicle Guidance: Unmanned vehicles that can be controlled and
navigated remotely, such as driverless cars, industrial transport, underwater
exploration, etc.

1.4 Moving Object Detection and Tracking in Video

Object detection and tracking in video has been one of the most popular open
research problems in computer vision for many years. The increasing use and
availability of inexpensive video cameras in recent years has created great
interest in moving object detection and tracking. Object detection is the process
of identifying and locating instances of objects in a video frame, and tracking
can be characterized as the problem of estimating the movement of an object in the
image plane. Detection and tracking are closely related: tracking normally begins
with detecting the objects, while repeatedly detecting an object in consecutive
frames is usually needed to support and verify tracking. It has been widely used
in many applications, such as real-time traffic monitoring (collecting traffic
statistics in real time and analyzing them to direct the flow of traffic);
automated video surveillance and security (monitoring a scene to detect suspicious
activities or unlikely events and raise an alarm using evident data); video
indexing (automatic annotation and retrieval of videos in large multimedia
databases); automated vehicle navigation (video based path planning and obstacle
avoidance by controlling and navigating vehicles remotely); and human-computer
interaction (gesture recognition, intuitive games, eye gaze tracking for data
input to computers, etc.).

A robust, accurate and high-performance approach to object detection and tracking
is still a great challenge today. Performance depends largely on how the object to
be detected and tracked is defined. If a feature such as color is used, there are
advantages and disadvantages: it is easy to locate all the objects of the same
color, but problems occur when foreground and background share the same color.
Illumination change is another challenge, since a change in illumination also
changes the apparent color of the object, leading to erroneous detection and
tracking. Detecting objects based on visual features such as color alone is
therefore not feasible in all cases.

The challenges for moving object detection and tracking include:

• Illumination Changes: The intensity of light varies during the day in outdoor
environments, for example during a changeover from a cloudy to a bright sunny day
and vice versa. This may also happen indoors when a light is suddenly switched on
or off. Sudden changes in illumination degrade the performance of detection and
tracking methods, producing high false positive rates.

• Complex Background: Video sequences with a complex background may contain
irregular or periodic background motion relative to the foreground, such as
swaying tree branches, waves, fountains and cloud movements.

• Occlusion: Occlusion, either partial or full, occurs when an object is hidden
by another object for a while (it disappears and reappears) and can disrupt the
tracking process. In many cases, partial occlusion is combined with variations in
appearance and sudden changes in the object's direction, which makes the object
even more difficult to track.

• Camouflage: Objects that barely differ from their background and surroundings
make the detection and tracking process complicated.

• Shadows: Shadows cast by moving objects are often detected as actual objects
because they share the same motion properties. This complicates detection and
tracking because, with the shadow attached, it is very difficult to extract the
shape and features of the object.

• Noisy Videos: Video sequences are generally superimposed with different types
of noise, which affects object detection approaches. It is difficult to separate
actual object pixels from the unwanted pixels and false positives produced by
noisy videos.

• Intermittent Object Motion: If an object moves very slowly or very fast, many
approaches fail to detect it. Intermittent motion, where objects move, stop for a
while and start moving again, leaves a trail of ghost regions in the detected
foreground mask.

• Weather Conditions: It is very difficult to detect and track moving objects in
rainy, foggy or snowy weather conditions.

• Low Contrast: In video frames with low contrast, it is difficult to separate
foreground objects from the background due to the low intensity values.

• Computational Time: Due to the complexity of the detection and tracking
process, it is very challenging to run in real time. Many approaches achieve
real-time performance at the cost of accuracy, and most accurate approaches fail
to run in real time.

Detection of moving objects is the fundamental step in extracting information from
video frames. The various object detection approaches in the literature are
generally categorized into feature based, template based, classifier based and
motion based approaches. In feature based object detection, features such as the
shape, size and color of the objects are extracted, and the objects are modeled in
terms of these features. In template based object detection, a template describing
the object is modeled, and video frames are analyzed based on the matching
features between the template and the object in a frame. There are two types of
template matching: fixed template matching and deformable template matching.
Fixed template matching is ideal when the shape of the object does not change
often; two popular fixed template matching methods are image subtraction and
correlation. Deformable template matching is ideal when the object's size and
shape vary due to rigid and non-rigid deformations. Classifier based object
detection separates moving objects from the background by building a set of
parameters based on knowledge of the object to be detected. Motion based object
detection approaches rely on detecting temporal changes in a video frame at the
pixel level. Usually, the foreground is detected by subtracting each frame from a
reference frame whenever foreground objects move. A popular motion based method
for detecting moving objects is background subtraction; optical flow and Gaussian
mixture models also make use of motion to separate the foreground from the
background.
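As a toy illustration of the background subtraction idea, the following sketch
labels pixels as foreground wherever they differ from a reference background by
more than a threshold. The threshold value and the synthetic 4x4 frames are
assumptions of this example, not taken from any method surveyed here:

```python
import numpy as np

def detect_foreground(frame, background, threshold=25):
    """Mark pixels as foreground where they differ from the reference background."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Synthetic 4x4 grayscale frames: a uniform background and one bright "object".
background = np.full((4, 4), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200                       # a 2x2 moving object
mask = detect_foreground(frame, background) # 1 where the object is, 0 elsewhere
```

In practice, the reference background is maintained as a running average or a
per-pixel statistical model rather than a single static frame.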

Object tracking approaches can be classified by different criteria. Based on the
techniques used, there are two categories: model based approaches and feature
based approaches. Model based approaches track objects by building a model using
the color or contour of the objects. In feature based approaches, objects are
tracked based on visual features such as Haar features [1.3], Histograms of
Oriented Gradients [1.2], etc. Object tracking approaches are also generally
classified [1.1] into point tracking, appearance tracking and silhouette tracking.
In point tracking, the detected object is represented by points, and tracking of
these points is based on the object's position and motion. Point tracking can be
categorized into deterministic approaches and probabilistic approaches:
deterministic systems consider no randomness when modeling future states, while
probabilistic approaches depend on the probability of object movements during
tracking. Appearance tracking, also called kernel tracking, uses an elliptical
shape or rectangular box to track objects by considering the affinity of their
appearances in consecutive video frames. Appearance tracking methods are divided
into two categories: single view (template and density) based and multi-view
based. In silhouette based tracking, the object is tracked by estimating the
object region in consecutive video frames. Silhouette tracking approaches extract
information encoded in the object region, in the form of appearance density and
shape models, and track either by shape matching or contour evolution.
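A minimal sketch of the kernel (appearance) tracking idea follows: the object's
rectangular window from the previous frame is slid over a small search region in
the current frame, and the best-matching window is kept. The sum-of-squared-
differences score, search radius, and toy frames are simplifications assumed for
this example, not tied to any specific method cited above:

```python
import numpy as np

def track_by_ssd(frame, template, prev_topleft, radius=2):
    """Slide the template around the previous position and return the top-left
    corner of the window with the smallest sum of squared differences."""
    th, tw = template.shape
    y0, x0 = prev_topleft
    best_ssd, best_pos = None, prev_topleft
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + th > frame.shape[0] or x + tw > frame.shape[1]:
                continue                    # skip windows outside the frame
            window = frame[y:y + th, x:x + tw].astype(np.int32)
            ssd = int(np.sum((window - template.astype(np.int32)) ** 2))
            if best_ssd is None or ssd < best_ssd:
                best_ssd, best_pos = ssd, (y, x)
    return best_pos

# A bright 2x2 object moves one pixel down and right between frames.
prev = np.zeros((8, 8), dtype=np.uint8); prev[2:4, 3:5] = 255
curr = np.zeros((8, 8), dtype=np.uint8); curr[3:5, 4:6] = 255
template = prev[2:4, 3:5]
new_pos = track_by_ssd(curr, template, (2, 3))
```

Real kernel trackers typically match color histograms or density estimates rather
than raw pixels, which makes them more robust to small deformations.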

1.5 Motivation
Many approaches to moving object detection and tracking in video are available in
the literature, yet there is still considerable scope for improvement. In
particular, the increasing need for autonomous video analysis systems and the
availability of inexpensive digital cameras have generated great interest in
moving object detection and tracking systems. The information present in a video
is significant, and the information about objects and background changes with
respect to time. Extracting valuable information for higher-level analysis of a
scene is a genuinely challenging and time-consuming task. Object detection and
tracking in video poses many challenges and offers many potential applications in
computer vision, as discussed in the sections above. Hence, this research work
addresses the problem of moving object detection and tracking in video by
developing a few novel approaches based on subspace and unsupervised learning.

1.6 Highlights of Research Work

The main highlights of the present work are as follows:

• An extensive literature review of moving object detection and tracking in video
has been carried out.

• The application of the cosine transform and principal components, which
efficiently generate features describing object characteristics, is introduced
for the detection of moving objects.

• The effectiveness of PCA and the particle filter in the wavelet domain for
object detection and tracking is explored.

• The value of wavelet features and of updating the model while building the
subspace during tracking is observed.

• The classification performance of a linear projective unsupervised subspace
learning approach for moving object detection is observed.

• A detailed analysis of variations in building the adjacency graph and weight
matrix for the linear projective approach is carried out as part of object
detection.

• A novel feature extraction technique known as orthogonalization is introduced
in the discriminant subspace to overcome the problem of redundancy in the Fisher
Linear Discriminant.

• The proposed object detection and tracking algorithms are compared with
well-known existing algorithms.

• Extensive experiments are carried out, and the results demonstrate the
effectiveness of the proposed approaches.
Chapter 2

Literature Survey
In the previous chapter, the scope of object detection and tracking in video and
its significance in various practical applications were addressed. In this
chapter, a thorough review of the literature is presented.

Satrughan Kumar et al., [2.41] proposed video object extraction and tracking using
background subtraction and an adaptive Kalman filter. The work focuses on
identifying the relevant moving blobs in the foreground by properly initializing
and updating the background model to improve tracking accuracy. The first step
deals with the background modeling and object extraction phase: initially, the
average of several initial frames is taken as the reference background. Because
purely temporal processing creates holes and ignores the spatial correlation among
moving pixels, an approximate motion field is derived using background subtraction
together with a temporal difference mechanism, generating an initial motion field
via spatio-temporal filtering of consecutive video frames. The block-wise entropy
is then evaluated over a certain range of pixels of the difference image in order
to extract the relevant moving pixels from the initial motion field. Finally, an
adaptive Kalman filter is integrated into the object extraction module to track
the object in the foreground. The scheme effectively eliminates ghosts and
aperture distortion and achieves high tracking accuracy, but it fails to handle
occlusion, and the computational cost of the tracking system is high.
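Kalman-filter tracking modules of this kind build on the standard predict-update
recursion. The sketch below is a minimal one-dimensional constant-velocity
example; the noise covariances and the toy measurement sequence are illustrative
placeholders, not the values used by Satrughan Kumar et al.:

```python
import numpy as np

# Constant-velocity Kalman filter, state = [position, velocity], dt = 1.
F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition
H = np.array([[1.0, 0.0]])              # we observe position only
Q = np.eye(2) * 1e-3                    # process noise covariance (assumed)
R = np.array([[1.0]])                   # measurement noise covariance (assumed)

def kalman_step(x, P, z):
    # predict the next state and its covariance
    x = F @ x
    P = F @ P @ F.T + Q
    # update with the measurement z
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

x = np.array([0.0, 0.0])
P = np.eye(2)
for z in [1.0, 2.1, 2.9, 4.0, 5.1]:     # noisy positions of an object moving ~1 px/frame
    x, P = kalman_step(x, P, z)
```

After a few steps the filter's estimated position and velocity settle close to the
underlying motion, which is what makes the prediction useful for bridging frames
where detection is unreliable.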

Shucheng Huang et al., [2.42] formulate multi-object tracking as a dynamic
incremental clustering process and propose a metric learning and multi-cue fusion
based hierarchical multiple hypotheses tracking method (MHMHT), which conducts
data association and incorporates additional temporal context information. The
approach is divided into three stages: reliable tracklet generation, tracklet
association, and object association at upper levels. In the first stage, a
dual-threshold association method is used to generate object tracklets. In the
second stage, the metric learning and multi-cue fusion based tracklet MHT conducts
tracklet association: appearance similarity and dynamic similarity are fused to
obtain the association affinity. The appearance similarity is calculated from the
distances between the salient templates of the generated tracks and the beginning
extremity templates of the tracklets, while the dynamic similarity is calculated
using a Kalman filter based on the end extremity dynamic states of the generated
tracks and the beginning extremity dynamic states of the tracklets. In the third
stage, the same strategy is adopted to conduct tracklet association at the upper
level; object tracks are thus generated gradually by tracklet association with the
proposed MHT. To make the appearance similarity more discriminative, the
spatio-temporal relationships of reliable tracklets within a sliding temporal
window are used as constraints to learn a discriminative appearance metric that
measures the distance between feature vectors and salient templates. Experimental
results on the challenging TUD, CAVIAR, ETH and PETS datasets are compared with
five state-of-the-art methods; MHMHT performs slightly better than several
existing methods, tracking multiple objects in both static and varying camera
viewpoints at different crowd densities. However, it fails to handle frequent full
occlusions, and tracking suffers under significant scale variations.

Zhiyong Li et al., [2.43] propose an adaptive multi-feature fusion strategy in
which the target appearance is modeled using a timed motion history image (tMHI)
with HSV color histogram features and edge orientation histogram features. First,
the tracking method initializes the target model in the first frame to obtain an
online template and an offline template based on the HSV color feature and the
edge orientation feature, respectively. Second, several potential candidate object
models for each succeeding frame are screened out using the tMHI method based on
the HSV color feature and the edge orientation feature. The variances of the
similarities between the candidate patches and the target templates are used to
adaptively adjust the weight of each feature. Next, the Bhattacharyya coefficient
is used to measure the similarities between the candidate models and the target
models, based on both the online and offline templates. To evaluate the robustness
and accuracy of the method, nine video sequences with challenging factors such as
scale variation, occlusion, illumination variation, blurred vision, background
clutter and deformation are considered. The success rate of the proposed method is
well above that of the state-of-the-art methods, but it fails to track multiple
targets, and the most discriminative features among several others could be added
to further increase the tracker's success rate.
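The Bhattacharyya coefficient mentioned above measures the overlap between two
normalized histograms, reaching 1.0 when they are identical. A minimal sketch
(the 4-bin toy histograms are assumptions of this example, not actual HSV
histograms from the paper):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two histograms; 1.0 means identical."""
    p = p / p.sum()                       # normalize to probability distributions
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))

target = np.array([10.0, 30.0, 40.0, 20.0])     # e.g. the template's histogram
candidate = np.array([12.0, 28.0, 38.0, 22.0])  # a nearby candidate's histogram
score = bhattacharyya(target, candidate)        # close to 1: a strong match
```

In a tracker, the candidate window with the highest coefficient (or smallest
Bhattacharyya distance) is selected as the target's new position.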
Shu Tian et al., [2.44] propose a multi-type multi-object tracking algorithm that
introduces online inter-feedback information, such as object type, size and
predicted position, between the detection and tracking processes of the
tracking-by-detection method. The tracking algorithm consists of two iterative
components: detection with feedback from tracking, and tracking based on
detection. In the detection step, objects are detected by detectors adjusted with
information from tracking. In the tracking step, a group tracking strategy based
on detection is used. To handle tracking scenarios of different complexity,
objects are classified into two categories, single objects and multiple objects,
which are dealt with using different strategies. Various evaluation metrics are
used to assess the performance of the method; it achieves 86% multi-object
tracking accuracy, 67.40% multi-object tracking precision, a 0.23% false alarm
rate, 181 false positives and 301 false negatives on a video sequence of the PETS
dataset. The time complexity of the method is compared with many state-of-the-art
methods. The advantage of this approach is that it detects and tracks objects in
almost real time and handles occlusion, but the accuracy of tracking multiple
objects is rather low, and the method suffers from too many false positives.

Runmin Wang et al., [2.45] presented a novel method to automatically detect
license plates in video sequences. The framework mainly integrates the cascade
detectors method and the Tracking-Learning-Detection (TLD) algorithm: cascade
detectors are used to detect license plates, and the TLD algorithm is adopted to
track the license plate regions. The license plates in the first frame are
detected by the cascade detectors to build the original tracking list; the
tracking results and detection results in subsequent frames are then compared, and
newly appearing license plate information is added to the tracking list.
Meanwhile, tracking results in the current tracking list are replaced by the
corresponding detection results when those have a higher degree of confidence. The
method is tested on various video sequences, including low-resolution sequences,
and compared with the methods of Faradji [2.1], Zheng [2.2] and Wu [2.3]. The
performance of the system is measured in terms of recall rate, precision rate and
F-measure, achieving 78.75%, 74.80% and 76.73%, respectively. The advantages of
this approach are that it detects license plates in low-resolution video sequences
and can detect multiple license plates at a time. However, some regions of the
image might not be scanned, leaving license plates undetected, and the method
fails to detect the number plates of white vehicles, resulting in too many false
positives.

Kaveh Ahmadi et al., [2.46] present a novel algorithm for detecting and tracking
small dim targets in infrared (IR) image sequences with a low signal-to-noise
ratio (SNR), based on frequency and spatial domain information. Using a Dual-Tree
Complex Wavelet Transform (DT-CWT), a CFAR detector is applied in the frequency
domain to find potential object positions in a frame. A Support Vector Machine
(SVM) classifier then accepts or rejects each potential point based on the spatial
domain information of the frame. The proposed system is compared with morphology
based, wavelet based, bilateral filter and Human Visual System (HVS) based
techniques. To assess the role of the SVM as a refinement step, a version of the
method without the SVM is also tested, with the Error Rate per Frame (EPF) and
Target-to-Clutter Ratio (TCR) measures calculated for the same 10 datasets.
Without SVM refinement, the method achieved an EPF of 3.25 and a correct target
detection rate of 65% on average; with the SVM, it achieved an EPF of 2.10 and a
detection rate of 95%. The time taken for tracking 30 frames is 22.21 seconds. The
method is capable of tracking small objects against complex backgrounds and of
detecting and tracking targets in infrared sequences with low SNR values (less
than 2 dB). Its shortcomings are that it fails to track objects under occlusion
and that multiple targets cannot be tracked simultaneously.

Faegheh Sardari et al., [2.47] propose an occlusion-free object tracking method
together with a simple adaptive appearance model. The appearance model, which is
updated at the end of each time step, includes three components: the first is a
fixed template of the target object, the second captures rapid changes in object
appearance, and the third maintains slow changes accumulated along the object's
path. The tracking method, based on a particle filter, detects and handles
occlusion and is also robust against changes in the object appearance model. A
meta-heuristic approach called the Modified Galaxy based Search Algorithm (MGbSA)
is used to reinforce the search for the optimum state in the particle filter state
space. The approach is tested on different video sequences with challenges
including sudden motion, partial and full occlusion, pose variation and
illumination changes, and its performance is measured in terms of average
coordinate error (ACE) and average scale error (ASE), with average values of 5.51
and 10.97, respectively. The method is compared with the geometric particle filter
[2.4], Wang [2.5], PF-PSO [2.6], Zhang [2.7], the Kalman filter and many other
existing methods with respect to ACE and ASE. However, the speed of the method is
not high enough, detection in the first frame is done manually, and the method
cannot track more than one object at any given instance.
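The particle filter underlying such trackers maintains a set of weighted state
hypotheses that are diffused by a motion model, reweighted by a measurement
likelihood, and resampled. The sketch below is a minimal one-dimensional version;
the Gaussian motion and measurement models, the particle count, and the toy
measurement sequence are assumptions of this example, and the MGbSA refinement is
not modeled:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_1d(measurements, n_particles=500, motion_std=1.0, meas_std=1.0):
    """Minimal 1-D particle filter: predict, weight by likelihood, resample."""
    particles = rng.normal(0.0, 5.0, n_particles)   # broad initial belief
    estimates = []
    for z in measurements:
        # predict: diffuse particles with the (random-walk) motion model
        particles = particles + rng.normal(0.0, motion_std, n_particles)
        # update: weight each particle by the Gaussian measurement likelihood
        weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
        weights /= weights.sum()
        # resample particles in proportion to their weights
        particles = rng.choice(particles, size=n_particles, p=weights)
        estimates.append(float(np.mean(particles)))
    return estimates

est = particle_filter_1d([1.0, 2.0, 3.0, 4.0, 5.0])
```

In a visual tracker the state is a bounding box rather than a scalar, and the
likelihood comes from an appearance model such as a color histogram similarity.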

Issam Elafi et al., [2.48] introduce a new real-time approach based on the
particle filter and background subtraction. The approach automatically detects and
tracks multiple moving objects without any learning phase or prior knowledge of
their size, nature or initial position. The first step calculates the subtracted
image to detect moving objects using background subtraction. In the second step,
if an object is detected for the first time, the particle filter seeks the white
pixels in order to estimate the object's location; otherwise, the particle filter
seeks the target color distribution calculated in the previous frame. In the third
step, the exact center and dimensions of the detected object are calculated. In
the fourth step, the color distribution of the target (moving object) is
calculated and stored for use in the next frame as the target color distribution.
An experimental study is performed over several video test sets, comparing the
approach with seven other tracking algorithms (CT, ORIA, SCM, MTT, MS, LOT, CXT).
The proposed approach outperforms the CT tracker by 49%, the CXT tracker by 77%,
and the MS and ORIA trackers by over 90% with respect to success rate. The
positive aspects of this method are that it tracks moving objects in real time and
automatically detects and tracks multiple moving objects without any learning
phase or prior knowledge of their size, nature or initial position. However, the
approach is prone to frequent full occlusions, and the tracker will drift away
from objects in the case of shadows.

Ivan Huerta et al.,
[2.49] address the detection of both penumbra and umbra shadow regions. First, a
novel bottom-up approach based on gradient and color models is presented, which
successfully discriminates between chromatic moving cast shadow regions and
regions detected as moving objects. In essence, regions corresponding to potential
shadows are detected based on edge partitioning and color statistics.
Subsequently, temporal similarities between textures and spatial similarities
between chrominance angle and brightness distortions are analyzed for each
potential shadow region to detect the umbra shadow regions. A tracking based
top-down approach then improves the performance of the bottom-up chromatic shadow
detection algorithm by properly correcting non-detected shadows. The shadow
detection rate and shadow discrimination rate of the method are compared with Qin
et al., [2.8], Martel-Brisson [2.9] and Martel-Brisson [2.10]; the method achieves
an 83.6% shadow detection rate and a 91.3% shadow discrimination rate, surpassing
the compared methods. However, the approach is prone to camouflage, which leads to
the erroneous classification of shadows, and its computational complexity is very
high.

Giyoung Lee et al., [2.50] present a fast and light algorithm, suitable for an
embedded real-time visual surveillance system, to effectively detect and track
multiple moving vehicles whose appearance and/or position changes abruptly at a
low frame rate. For effective tracking at low frame rates, a new matching
criterion based on greedy data association is proposed, using appearance and
position similarities between detections and trackers. To manage abrupt appearance
changes, manifold learning is used to calculate appearance similarity. To manage
abrupt changes in motion, the next probable centroid area of the tracker is
predicted using trajectory information, and the position similarity is then
calculated from the predicted next position and the progress direction of the
tracker. The detection performance of the approach is measured in terms of
accuracy and false alarm rate (FAR), with values of 98.05% and 17%, respectively.
Tracking performance is measured with respect to mostly tracked trajectories (MT),
mostly lost trajectories (ML), fragmentations (FRMT) and ID switches (IDS), and
the method is compared with Lee [2.11], Wang [2.12] and Zhang [2.13]. At a frame
rate of 2 frames per second, the approach achieves an MT of 40, ML of 1, FRMT of 4
and IDS of 0. The algorithm has some shortcomings: tracking requires trained
parameters, tracking performance is highly dependent on detection performance,
detection performance deteriorates because of incorrect matching errors, and the
method cannot handle occlusion.

Luqman et al., [2.51] propose an integrated technique for
object detection and tracking for video surveillance. First, pixels in the images are modeled with a Gaussian mixture model together with the K-means algorithm to separate the foreground from the background image. Then, morphological operations are performed to remove noise pixels. Objects are formed by spatial evaluation, with the color mean and contour chain code as features. Tracking is performed by temporal evaluation, i.e., comparison of inter-frame object features and distances. This technique performs well in object detection and tracking, with high true positives and low false negatives, but still suffers from false positives in dynamic background scenes. The proposed method achieves 95.2% precision, 90.90% recall and an F-measure of 93% on the PETS 2009 dataset, and 25.5% precision, 100% recall and an F-measure of 40.6% on the traffic-snow dataset. The computational cost with respect to speed is also very high compared to existing methods in the literature, and the method fails to handle dynamic backgrounds.
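The background-subtraction step described above can be illustrated with a deliberately simplified sketch. Rather than the full Gaussian mixture model with K-means used by Luqman et al., the following keeps a single running Gaussian per pixel; the update rate alpha and the threshold k are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.05, k=2.5):
    """One step of a running single-Gaussian background model.

    A pixel is foreground when it deviates from the background mean
    by more than k standard deviations; background statistics are
    updated only at background pixels (selective update).
    """
    fg = np.abs(frame - mean) > k * np.sqrt(var)    # foreground mask
    bg = ~fg
    # exponential running update of mean and variance at background pixels
    mean[bg] = (1 - alpha) * mean[bg] + alpha * frame[bg]
    var[bg] = (1 - alpha) * var[bg] + alpha * (frame[bg] - mean[bg]) ** 2
    return fg, mean, var

# toy example: static background near intensity 10 with one bright object pixel
mean = np.full((4, 4), 10.0)
var = np.full((4, 4), 4.0)
frame = np.full((4, 4), 10.0)
frame[1, 2] = 200.0                                 # moving object pixel
fg, mean, var = update_background(frame, mean, var)
print(fg[1, 2], fg[0, 0])   # True False
```

A full mixture model would keep several (mean, variance, weight) triples per pixel and match each incoming value against them, but the classification rule per component is the same thresholded deviation shown here.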

Olga Zoidi et al., [2.52] propose a visual object tracking framework which employs an appearance-based representation of the target object, based on local steering kernel descriptors and color histogram information. The framework takes as input the region of the target object in the previous video frame and a stored instance of the target object, and tries to localize the object in the current frame by finding the frame region that best resembles the input. As the object view changes over time, the object model is updated. Color histogram similarity between the detected object and the surrounding background is used for background subtraction. The proposed tracking scheme, Color Histogram-Local Steering Kernel (CH-LSK), is compared with a particle filter [2.14] and with color histogram and local steering kernel based methods [2.15]; CH-LSK succeeds in tracking objects under scale and rotation variations and partial occlusion, as well as in tracking slowly deforming articulated objects. The average tracking accuracy of CH-LSK is 0.6928 for case study 1 and 0.7745 for case study 7. However, this approach fails under full occlusion and loses the object when its direction or speed changes suddenly.
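The color histogram similarity used above to compare a detected region against the object model can be sketched with a Bhattacharyya coefficient between normalized histograms. This is a generic sketch, not the authors' exact descriptor; the bin count and the toy patches are illustrative assumptions:

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized intensity histogram of an image patch."""
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1 for identical histograms, 0 for disjoint ones."""
    return float(np.sum(np.sqrt(p * q)))

rng = np.random.default_rng(0)
target = rng.integers(0, 64, size=(16, 16))             # dark reference patch
candidate_same = rng.integers(0, 64, size=(16, 16))     # similar appearance
candidate_diff = rng.integers(192, 256, size=(16, 16))  # bright background patch

p = color_histogram(target)
sim_same = bhattacharyya(p, color_histogram(candidate_same))
sim_diff = bhattacharyya(p, color_histogram(candidate_diff))
print(sim_same > sim_diff)   # True
```

In a tracker, the candidate region maximizing such a similarity (possibly combined with a structural descriptor like the local steering kernel) is selected as the new object location.
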
Jean-Philippe et al., [2.53] present a system that uses background subtraction algorithms to detect moving objects. In order to build the object tracks, an object model is built and updated through time inside a state machine using feature points and spatial information. When an occlusion occurs between multiple objects, the positions of feature points at previous observations are used to estimate the positions and sizes of the individual occluded objects. Multiple Object Tracking Precision (MOTP) and Multiple Object Tracking Accuracy (MOTA) are used as metrics to evaluate the performance of the Urban Tracker (UT) [13] approach, which is compared with the Traffic Intelligence (TI) [15] tracker proposed by Saunier et al. Urban Tracker achieves a MOTA of 89.28% and a MOTP of 10.53 px, whereas Traffic Intelligence achieves a MOTA of 82.53% and a MOTP of 7.42 px on the Sherb video sequence, which consists of cars as objects. For a sequence consisting of pedestrians as moving objects, Urban Tracker achieves 68.09% MOTA and 6.64 px MOTP, whereas Traffic Intelligence achieves 1.41% MOTA and 11.98 px MOTP.
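The CLEAR-MOT metrics quoted above have simple closed forms: MOTA penalizes misses, false positives and ID switches relative to the number of ground-truth object instances, while MOTP averages the localization error over matched detection-object pairs. A minimal sketch (the counts below are toy numbers, not taken from the papers):

```python
def mota(misses, false_positives, id_switches, ground_truth_objects):
    """CLEAR-MOT multi-object tracking accuracy, with error counts summed
    over all frames of the sequence."""
    return 1.0 - (misses + false_positives + id_switches) / ground_truth_objects

def motp(total_distance, total_matches):
    """Mean localization error (here in pixels) over all matched pairs."""
    return total_distance / total_matches

# toy sequence: 1000 ground-truth object instances in total
print(round(mota(60, 40, 7, 1000), 3))   # 0.893
print(round(motp(2106.0, 200), 2))       # 10.53
```

Note that MOTA can be negative when the error counts exceed the number of ground-truth instances, which is why very poor trackers can report values such as 1.41%.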

Sheng Chen et al., [2.54] present a new approach to tracking people in crowded scenes, where people are subject to long-term (partial) occlusions and may assume varying postures and articulations. Temporal mid-level features (e.g., supervoxels or dense point trajectories) are used as a more coherent spatiotemporal basis for handling occlusion and pose variations. Tracking is formulated as labeling mid-level features with object identifiers, an approach called constrained sequential labeling (CSL). A key feature of this approach is that it allows flexible cost functions and constraints that capture complex dependencies which cannot be represented in standard network-flow formulations. The proposed approach is compared with state-of-the-art detection-based approaches, with results reported on the widely used pedestrian benchmark video PETS2009-S2L1, where people detectors are effective, and on the Volleyball dataset, which contains 38 videos of entire collegiate volleyball plays. The evaluation metrics used are miss detections (MD), false positives (FP), ID switches (IDS), multi-object tracking accuracy (MOTA), recall and precision. On the PETS2009 S2L1 dataset, the proposed CSL achieves a recall of 98.28%, precision of 91.07%, 6 IDS and a MOTA of 89.78%; Henriques et al., [2.16] achieve 94.03% recall, 92.40% precision, 10 IDS and a MOTA of 84.77%; and Huang et al., [2.17] achieve 96.45% recall, 93.64% precision, 8 IDS and a MOTA of 90.30%.

Tianzhu Zhang et al., [2.55] propose a novel tracking algorithm that models and detects occlusion through structured sparse learning, called Tracking by Occlusion Detection (TOD). This approach assumes that occlusion detected in previous frames can be propagated to the current one. The propagated information determines which pixels contribute to the sparse representation of the current track, i.e., pixels that were detected as part of an occlusion in the previous frame are removed from the target representation process. The tracker is tested on challenging benchmark sequences, such as sports videos, which involve heavy occlusion, drastic illumination changes and large pose variations. The proposed tracker, TOD, is compared with 6 other recent state-of-the-art trackers. Tracking performance is evaluated according to the average per-frame distance between the center of the tracking result and that of the ground truth, as used in [2.18, 2.19, 2.20]; clearly, this distance should be small. TOD consistently produces a smaller distance than the other trackers by accurately tracking the target despite severe occlusions and pose variations. However, this algorithm is sensitive to severe illumination changes, tracking is affected by background clutter, and the tracker sometimes drifts from the target.

A novel approach is introduced by Zhang et al., [2.56] to deal with dynamic scenes. A five-frame difference algorithm is combined with a background subtraction algorithm to obtain the complete contour of the moving object. The proposed approach is divided into three consecutive steps: pre-processing, target identification and rectangular contour modeling. In the first step, the video is pre-processed with a median filter for noise removal. Secondly, a five-frame differential technique, an enhanced version of the inter-frame differential technique, is applied. Thirdly, a background subtraction algorithm is applied to the actual image sequence, and the output is obtained from the dissimilarity between the current video frame and the assumed background model, followed by a binarization operation. In the last stage, a rectangular contour model is applied to eliminate the cast shadow effect. However, this approach fails to eliminate leaf-flutter noise and cannot detect multiple moving targets at once. Wang et al., [2.57] presented a three-step method based on temporal information for moving object detection. First, a temporal saliency map is generated from the continuous symmetric difference of adjacent frames. Second, the temporal saliency map is binarized, and candidate areas are obtained by calculating a threshold using the maximum entropy sum method. The most salient points are considered attention seeds, and based on these seeds a fuzzy approach is applied to the saliency map to grow the seeds until the entire contour of the moving objects is acquired. The effectiveness of the proposed method is tested on four sequences from dataset2014, and performance is evaluated in terms of recall (Re), specificity (Sp), False Positive Rate (FPR), False Negative Rate (FNR), Percentage of Bad Classifications (PBC), F-measure and precision. The proposed approach is also compared with ICSAP [2.21], BBM [2.24], UBA [2.22], SGMM [2.23], QCH [2.25], and GML [2.26]. The average results of the proposed method on dataset2014 are Re of 84.67%, Sp of 99.43%, FPR of 0.57, FNR of 15.33, PBC of 86.42%, precision of 77.57% and F-measure of 80.23. Positive aspects of this method are that it detects objects against dynamic backgrounds in both indoor and outdoor environments and requires no parameter or threshold tuning. However, it fails to handle shadows and cannot detect multiple closely spaced moving objects.
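The combination of inter-frame differencing and background subtraction used in the approaches above can be sketched as follows; for brevity this uses a three-frame difference instead of five, and the threshold value is an illustrative assumption:

```python
import numpy as np

def three_frame_difference(prev, curr, nxt, thresh=25):
    """Simplified inter-frame differencing (three frames instead of five):
    a pixel is moving when it differs from both of its temporal neighbours."""
    d1 = np.abs(curr.astype(int) - prev.astype(int)) > thresh
    d2 = np.abs(nxt.astype(int) - curr.astype(int)) > thresh
    return d1 & d2

def background_subtraction(curr, background, thresh=25):
    """A pixel is foreground when it deviates from the background model."""
    return np.abs(curr.astype(int) - background.astype(int)) > thresh

# combining the two masks (here with OR) yields a more complete object contour
prev = np.zeros((5, 5), np.uint8); curr = prev.copy(); nxt = prev.copy()
curr[2, 2] = 255                       # object visible only in the current frame
mask = three_frame_difference(prev, curr, nxt) | background_subtraction(curr, prev)
print(mask[2, 2], mask[0, 0])          # True False
```

Differencing alone tends to hollow out slow, uniform objects (interior pixels barely change between frames), which is why these methods fuse it with a background model before binarization.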

A framework using multiple support vector machines (SVMs) is proposed by Shunli Zhang et al., [2.58]. The multi-view SVM tracking method is constructed from multiple views of features and a novel combination strategy. Three different types of features, i.e., gray-scale values, the histogram of oriented gradients (HOG), and local binary patterns (LBP), are used to train the corresponding SVMs. These features represent the object from the perspectives of description, detection and recognition, respectively. A collaborative strategy with an entropy criterion is used to realize the combination of the SVMs. In addition, to learn the changes of the object and the scenario, a novel update scheme based on a subspace evolution strategy is used. The new scheme can control the model update adaptively and helps to address occlusion problems. A detailed comparison with several state-of-the-art trackers, including Frag [2.27], IVT [2.29], MIL [2.31], OAB [2.30], L1 [2.28], VTS [2.33], TLD [2.34], MTT [2.35], CT [2.36], VTD [2.32], Struck [2.37] and PartT [2.38], provides both quantitative and qualitative analysis. However, the method requires re-initialization of the tracking process when illumination changes. Indrabayu et al., [2.59]
propose an Intelligent Transport System (ITS) in which the Gaussian Mixture Model (GMM) method is applied for vehicle detection and the Kalman filter method for object tracking. The video sequences used consist of vehicles in two different conditions: light traffic and heavy traffic. To evaluate the performance of the detection system, Receiver Operating Characteristic (ROC) analysis is performed. The results show that the light traffic condition yields 100% precision, 94.44% sensitivity, 100% specificity and 97.22% accuracy, while the heavy traffic condition yields 75.79% precision, 88.89% sensitivity, 70.37% specificity and 79.63% accuracy. The average consistency of the Kalman filter for object tracking is 100%. However, the proposed approach fails to track objects under complete occlusion and suffers from frequent ID switches.
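A constant-velocity Kalman filter of the kind used for tracking here can be sketched for a single coordinate of an object centroid; the process and measurement noise covariances below are illustrative assumptions, not the paper's tuning:

```python
import numpy as np

# State x = [position, velocity]; z is the measured position from detection.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (dt = 1 frame)
H = np.array([[1.0, 0.0]])               # we observe position only
Q = np.eye(2) * 1e-2                     # process noise (assumed)
R = np.array([[1.0]])                    # measurement noise (assumed)

def kalman_step(x, P, z):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update with the measurement z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

x = np.array([[0.0], [0.0]]); P = np.eye(2)
for z in [1.0, 2.0, 3.0, 4.0]:           # object moving at 1 px/frame
    x, P = kalman_step(x, P, np.array([[z]]))
# after a few steps the velocity estimate approaches 1 px/frame
print(abs(x[1, 0] - 1.0) < 0.3)          # True
```

In a full tracker, one such filter runs per object (for both x and y), its prediction gates the data association with new detections, and the prediction also bridges short gaps such as partial occlusions.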

A novel and effective method to track moving objects against a static background is proposed by Sandeep et al., [2.60]. The method first executes preprocessing tasks to remove noise from the video frames. Then, a rectangular window is drawn to select the target object region in the first video frame (the reference frame). Next, the Laplacian operator is applied to the selected target objects for sharpening and edge detection. The algorithm then applies the DCT and selects a few high-energy coefficients. Subsequently, it computes the perceptual hash of the selected target objects using the mean of all the AC values of the block. The perceptual hash of a target object is used to find the similar object in subsequent frames of the video. The proposed method is tested on real indoor and outdoor video sequences and compared with perceptual hashing techniques, i.e., average hash (aHash) [2.39], perceptive hash (pHash) [2.39], difference hash (dHash) [2.39] and Laplacian hash (LHash) [2.40]. The average tracking accuracy of the proposed method is 76.32%. Positive aspects of this method are that it tracks moving objects with varying object size and a significant amount of noise, and handles illumination changes and slow/fast moving objects. However, it suffers from high computational cost and is capable of tracking only one target object at a time. Weina et al., [2.61] present a method to automatically detect small groups of individuals who are traveling together. These groups are discovered by bottom-up hierarchical clustering using a generalized, symmetric Hausdorff distance defined with respect to pairwise proximity and velocity. Weina et al. combine a pedestrian detector, a particle filter tracker, and a multi-object data association algorithm to extract long-term trajectories of people passing through the scene. The detector is run frequently (at least once per second); therefore, in addition to any new individuals entering the scene, people already being tracked are detected multiple times. For each detection, a particle filter tracker is instantiated to track that person through the next few seconds of video, yielding a short-term trajectory, or tracklet. The proposed approach is tested in both indoor and outdoor environments with different group sizes, handling occlusion and pose variations. However, it fails to track pedestrians under severe illumination changes and is prone to noise and jitter.
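The DCT-based perceptual hash at the core of Sandeep et al.'s tracker can be sketched as follows. This is a generic pHash-style signature (the low-frequency DCT corner thresholded against the AC mean), not the authors' exact implementation; the patch and corner sizes are assumptions:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def perceptual_hash(block):
    """pHash-style signature: DCT the block, keep the low-frequency 8x8
    corner, and threshold each AC coefficient against the AC mean."""
    D = dct_matrix(block.shape[0])
    coeffs = D @ block @ D.T           # 2-D DCT via two 1-D passes
    low = coeffs[:8, :8].ravel()       # high-energy, low-frequency corner
    ac = low[1:]                       # drop the DC term
    return (ac > ac.mean()).astype(np.uint8)

def hamming(h1, h2):
    """Hash distance used to match the target in subsequent frames."""
    return int(np.sum(h1 != h2))

rng = np.random.default_rng(1)
target = rng.random((32, 32))
shifted = target + 5.0                 # global illumination change
print(hamming(perceptual_hash(target), perceptual_hash(shifted)))  # 0
```

Because a uniform brightness change affects only the DC coefficient, the hash is unchanged here, which illustrates why such signatures tolerate illumination shifts; in the tracker, the window in the next frame with the smallest Hamming distance to the target hash is taken as the new object position.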

Jon et al., [2.62] present a probabilistic method for vehicle detection and tracking through the analysis of monocular images obtained from a vehicle-mounted camera. The method is designed to address the main shortcomings of traditional particle filtering approaches, namely Bayesian methods based on importance sampling, for use in traffic environments. These methods do not scale well when the dimensionality of the feature space grows, which creates significant limitations when tracking multiple objects. Alternatively, the proposed method is based on a Markov chain Monte Carlo (MCMC) approach, which allows efficient sampling of the feature space. The method involves important contributions in both the motion and the observation models of the tracker. Regarding the motion model, a new interaction treatment is defined based on Markov random fields (MRFs) that allows for the handling of possible inter-dependencies in vehicle trajectories. As for vehicle detection, the method relies on a supervised classification stage using support vector machines (SVMs). A new descriptor based on the analysis of gradient orientations in concentric rectangles is defined. This descriptor involves a much smaller feature space compared to traditional descriptors, which are too costly for real-time applications. Positive aspects of this approach are that it can track vehicles in a wide variety of driving situations and environmental conditions and handles vehicles entering and leaving a scene well. However, it requires high computational time, and the tracker drifts away when the number of particles is relatively small.
Chapter 3

Object Detection with Frequency Domain Features in Reduced Subspace
3.1 Background
In the previous chapters, the significance of object detection and tracking in various real-world applications was addressed through a detailed literature survey. Object detection in videos involves many challenges, including illumination variations, low-contrast sequences and complex backgrounds in both indoor and outdoor environments. To achieve accurate results, these challenges must be considered while detecting moving objects in video frames. Accurate detection of foreground objects is vital for subsequent stages such as tracking and recognition of moving objects. Hence, in this chapter, we introduce the application of the cosine transform and principal components for foreground object detection.

3.2 Related Work

Object detection has many potential real-world applications; video surveillance is one of the major applications and also a source of huge volumes of video data. In order to analyze the moving objects in a video, it is necessary to extract those objects from the video frames, and a popular method to do so is background subtraction, in which each frame is compared with a background model. Many background modeling methods have been proposed in the literature; basic methods make use of the average, the median or an analysis of the histogram over time.

In recent years, several object detection methods based on the compressed domain have been developed and attract a wide range of applications. Most of these techniques exploit the discrete cosine transform (DCT) [3.21]–[3.25] to detect moving objects in a video frame. Weiqiang et al., [3.1] represent the background using block-level Discrete Cosine Transform (DCT) coefficients, which are updated in order to adapt the background. This method is computationally effective in terms of speed and memory. The Kalman filter-based approach [3.2] has desirable computational speed and low memory requirements. Kilger et al., [3.3] and Koller et al., [3.4] describe a modified version of the Kalman filter-based approach, called Selective Running Average, which monitors real-time traffic. Wren et al., [3.5] model each background pixel in the Pfinder system using a Gaussian distribution, and Piccardi [3.6] classifies pixels using an adaptive threshold by introducing the standard deviation, a method named Running Gaussian Average. Another background modeling technique, used as an alternative to averaging, is median filtering, as in [3.7]–[3.9]. In order to evaluate the medoid, Cucchiara et al., [3.10] introduce an approximate implementation based on median filtering. McFarlane and Schofield [3.11] describe a median-based approach which adaptively varies the background value by approximately estimating the median.
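The running-average and approximate-median background updates discussed above are one-line recurrences; a minimal sketch (the learning rate alpha is an illustrative assumption):

```python
import numpy as np

def running_average(background, frame, alpha=0.05):
    """Running (exponential) average background update."""
    return (1 - alpha) * background + alpha * frame

def approximate_median(background, frame):
    """McFarlane-Schofield style approximate median: move the background
    estimate one intensity step toward the current frame."""
    return background + np.sign(frame - background)

bg = np.full((3, 3), 100.0)
frame = np.full((3, 3), 120.0)
bg_avg = running_average(bg, frame)
bg_med = approximate_median(bg, frame)
print(round(float(bg_avg[0, 0]), 6), bg_med[0, 0])   # 101.0 101.0
```

Both updates are cheap and memory-light (one value per pixel); the approximate median converges more slowly but is less disturbed by transient foreground values than the average.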

The simple median method described by Cheung and Kamath [3.12] has low complexity and low computational cost. Multimodal backgrounds are modeled using methods based on the popular Mixture of Gaussians technique [3.13], [3.14]. An extended version of [3.13] and [3.14], demonstrating non-parametric estimation with a larger number of Gaussian mixtures, is explained by Elgammal et al., [3.15]. [3.16] and [3.17] describe complex, extended versions of [3.15]. Toyama et al., [3.16] estimate the background using a Wiener prediction filter. Density modes are identified using a sequential density approximation approach, as presented in [3.17]; in order to designate the background, each density mode is assigned a Gaussian.

SACON, proposed by Wang et al., [3.18], computes the sample consensus to estimate the background model and reports encouraging results. In contrast to independent pixel-based approaches, [3.19] and [3.20] present a novel pixel block-based approach to estimate the background model by considering the spatial correlation among neighboring pixels. These approaches are not robust, because they fail to adaptively update the background model in real time due to the high computational complexity involved. Tang et al., [3.26] used a gradient vector flow snake based on the discrete cosine transform (DCT) for object boundary detection, computing the 2D discrete cosine transform in a local neighborhood of each pixel in the spatial domain. A contrast measure derived from the DCT coefficients is used to compute the gradient vector flow (GVF) field, which drives the snake to the object edges; DCT-GVF is robust to multiplicative noise. Tae-Hyun et al., [3.27] used a motion detection algorithm based on a change detection filter matrix derived from the Discrete Cosine Transform (DCT); abrupt changes in the vector are extracted by high-pass filtering in the frequency domain. Lin et al., [3.28] proposed an object tracking method based on a conventional particle filter, where feature vectors are extracted from the coefficient matrices of the Discrete Cosine Transform; features extracted from the DCT are robust to occlusion, scale changes and rotation. Ziad et al., [3.29] proposed a system that exploits the feature extraction capabilities of the DCT and invokes certain normalization techniques to increase robustness to variations in facial geometry and illumination. Radovan et al., [3.30] propose a method inspired by the histogram of gradients (HOG); instead of the histogram of gradient orientations, the gradient information is encoded by the discrete cosine transform (DCT), yielding a feature vector of smaller dimensionality than the conventional histogram of gradients.

Markus et al., [3.31] propose an algorithm combining the discrete cosine transform and phase correlation (PC) for object detection. The energy compaction property of the DCT is used, so the algorithm requires fewer coefficients than Fourier transform (FFT) based techniques to compute the PC. The Inverse Discrete Cosine Transform (IDCT) is used to obtain the coefficient matrix. Swamy et al., [3.32] extract the 2D histogram of oriented gradients (HOG) from the input image and apply the 2D Discrete Cosine Transform followed by low-pass filtering to detect vehicles in video, obtaining novel features called transform domain 2D-HOG (TD-2D-HOG). In order to decrease the cost of multi-scale scanning, the TD-2D-HOG is employed with a classifier pyramid.

Transforms of many two-dimensional image features, such as gray-scale images [3.34] and image silhouettes [3.35], are used in the literature as discriminant features for object detection. In order to take advantage of the properties of the Fourier transform, the object detection approaches [3.33] and [3.36] apply the Fourier transform to HOG or 2D-HOG features; the inverse Fourier transform is then applied to convert back from the frequency domain to the spatial domain. Naiel et al., [3.37] deal with the effect of image resampling on the feature responses by presenting a method to approximate the features extracted from an image pyramid, using a feature resampling technique in the two-dimensional discrete cosine transform domain. Many features have been used for object detection in video, including the histogram of oriented gradients (HOG) [1.2] and its variants, such as Integral Channel Features [3.40] and Deformable Part Models [3.41]. However, HOG and its variants are sensitive to changes in scale. Haar-like features [3.38], [1.3] and interest-point features [3.39] are also widely used for object detection.

Scale-invariant object detectors require extracting features at each scale of an image pyramid, which is computationally infeasible. In [3.43] and [3.44], a feature approximation technique approximates the feature responses at nearby scales, with the color features and gradient histogram features extracted at only one scale of the image pyramid. As a result, the feature pyramids are constructed at reduced computational cost, yielding a speedup of object detection over other methods in the literature, namely [3.40] and [3.42], at the cost of detection accuracy. To overcome the issue of extracting features from a tall image pyramid, a classifier pyramid [3.45] is used, which also improves the speed of object detection. However, the method of [3.45] requires large storage and a long training time because it uses multiple trained classifiers at different scales.

In recent years, object detection has used various transform-domain-based techniques. For example, to leverage the linearity property of the Fourier transform, 2D-HOG [3.46] is presented along with the DFT, contributing a speedup of the detection process compared with [3.41]; however, the performance of this method deteriorates due to high computational cost, mainly because of the exact computation of the feature pyramid. Feature pyramid approximation [3.32] based on forward and inverse DFTs offers high detection accuracy compared to [3.44], but the detection speed of the approximated feature pyramids decreases drastically compared to [3.44], mainly due to the complex multiplications required by the forward and inverse DFTs.
Ravi et al., [3.47] use a background model to detect moving objects in video based on a modified running average discrete cosine transform (RA-DCT), which is robust to illumination variations. In order to achieve high detection accuracy, RA-DCT includes a median filter and performs morphological operations to reduce the background mask noise. Sagrebin et al., [3.48] propose a background removal method which models each 4 x 4-pixel patch of an image through a set of low-frequency coefficient vectors, obtained by means of the discrete cosine transform, to extract the important information from the image. The background model is adapted continuously, depending on whether a new coefficient vector matches the background model or not. This method reduces computational cost, suppresses noise and is robust to sudden illumination changes. Kalirajan et al., [3.49] compress the input video frames with the 2D-DCT; key feature points are derived by calculating correlation coefficients, and matching feature points are classified as foreground or background based on the Bayesian rule. Embedding maximum-likelihood feature points over the input frames localizes the foreground feature points.

3.3 Proposed Methodology

3.3.1 Discrete Cosine Transform
The discrete cosine transform (DCT) represents an image as a sum of sinusoids of differing magnitudes and frequencies. The Discrete Cosine Transform is real, orthogonal and separable, and many approaches based on it demonstrate computational efficiency in terms of complexity and cost. The multidimensional transform can be broken down into a series of 1D DCTs along the relevant directions.

The DCT represents the input data in terms of frequency components, after transforming the input into a linear combination of weighted basis functions. The one-dimensional DCT transforms an image in the spatial domain into its respective frequency constituents, depicted as a set of coefficients. The Inverse Discrete Cosine Transform (IDCT) converts the frequency-domain samples back into the spatial domain, a step termed the reconstruction process. Initially, the 2D-DCT is applied to extract the elementary frequency components of the object in the video frame; then the PCA approach is applied to reduce the dimensionality and to extract features of the elementary frequency components. The detailed procedure of the proposed method is explained below.

The 2-D DCT and its inverse (IDCT) of an N x N block are defined as:

C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]    (3.1)

f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, C(u,v) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]    (3.2)

where \alpha(u) = \sqrt{1/N} for u = 0 and \alpha(u) = \sqrt{2/N} for u = 1, ..., N-1. The inverse discrete cosine transform in eq. (3.2) reconstructs a sequence from its discrete cosine transform (DCT) coefficients.

Figure 3.1: Visual representation of the generic 2D-DCT process.

To avoid the cost of computing the 2D-DCT directly, a factoring method is used: initially, the 1D-DCT is applied vertically on the columns, and then the 1D-DCT is applied row-wise on the result obtained from the columns, as depicted in Figure 3.1.
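The factoring just described (a column pass followed by a row pass of the 1-D DCT) can be sketched directly; the orthonormal DCT-II normalization is assumed:

```python
import numpy as np

def dct_1d(x):
    """Orthonormal 1-D DCT-II applied along the first axis."""
    n = x.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] /= np.sqrt(2)
    return np.sqrt(2.0 / n) * basis @ x

def dct_2d(block):
    """2-D DCT by factoring: 1-D DCT on the columns, then on the rows."""
    cols = dct_1d(block)             # column pass
    return dct_1d(cols.T).T          # row pass on the column result

block = np.ones((8, 8))
coeffs = dct_2d(block)
# a constant block compacts all energy into the DC coefficient
print(round(coeffs[0, 0], 4), round(float(np.abs(coeffs[1:, :]).max()), 10))
```

The constant test block illustrates the energy compaction property used throughout this chapter: every off-DC coefficient is (numerically) zero, so a few low-frequency coefficients suffice to describe smooth regions.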

Usually, the 2D-DCT is applied to 8x8 or 16x16 sub-images or blocks of pixels. The DCT assists in isolating the video frame into fragments of differing importance. The DCT gives, in general, good features for object description, and the properties of the DCT coefficient blocks make them very good for generating feature spaces. In a video, the intensities of the image sequence change with respect to time. A vector can be formed as the sequence of intensities at a certain pixel position along the time axis, and the elements of this vector have nearly constant values when the pixel belongs to the background over time. Sudden changes in intensity values emerge when the pixel lies on motion boundaries, and such changes in the vector can be easily extracted in the frequency domain. The frequency domain is also robust against variations in intensity and noise, and is effective in eliminating constant (motionless) values at the respective pixels.

The proposed method uses the 2-D DCT specified in eq. (3.1): the 1-D DCT is applied on the columns of the frame blocks, and then the 1-D DCT is applied on the rows of the resulting frame blocks. Likewise, all the input frames are processed with the 2-D DCT. The DCT coefficients of each key frame are used as local features. Usually, the dimensionality of the obtained local features is high; to overcome the issue of redundancy and to speed up the detection process, Principal Component Analysis (PCA) can also be used for feature dimensionality reduction.

3.3.2 Principal Component Analysis

The Principal Component Analysis (PCA) concept is widely used to reduce the dimensionality of data and to extract the significant representatives of a set of feature vectors. In PCA, these significant representatives are called principal components, and the principal components satisfy the condition of orthogonality. PCA is a mathematical procedure that converts a large number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component captures the largest possible variance in the data, and each succeeding component accounts for as much of the remaining variability as possible. This confirms that the first principal component is the best representative of the features. If the first principal component cannot represent the discriminative features, succeeding principal components are selected until the best representation of the data is obtained. PCA first centers the data by subtracting the mean from the data; then the covariance matrix is calculated to observe the relationships among the data; next, the eigenvalues and eigenvectors of the covariance matrix are computed. The eigenvector with the largest eigenvalue is chosen as the first principal component.
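The PCA steps just described (centering, covariance, eigen-decomposition, selection of the leading eigenvectors) can be sketched as follows; the toy data are an illustrative assumption:

```python
import numpy as np

def pca(X, k):
    """PCA by eigendecomposition of the covariance matrix.

    X: n observations x p variables. Returns the top-k principal
    directions (as columns) and the centered data projected onto them."""
    Xc = X - X.mean(axis=0)                     # 1. center the data
    S = np.cov(Xc, rowvar=False)                # 2. covariance matrix
    vals, vecs = np.linalg.eigh(S)              # 3. eigenvalues/eigenvectors
    order = np.argsort(vals)[::-1]              # 4. sort by decreasing eigenvalue
    W = vecs[:, order[:k]]                      # best k eigenvectors
    return W, Xc @ W                            # directions and projections

rng = np.random.default_rng(2)
# 2-D data varying mostly along the direction (1, 2)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])
W, scores = pca(X, 1)
ratio = abs(W[1, 0] / W[0, 0])
print(abs(ratio - 2.0) < 0.1)   # True: first PC points roughly along (1, 2)
```

For video frames, each frame (or DCT coefficient block) is flattened into one observation row, so the same four steps recover the dominant modes of variation across frames.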

Each principal component in Principal Component Analysis is a linear combination of the variables that maximizes variance. Let X be an n x p matrix of n observations on p variables, with covariance matrix S. Consider a linear combination

y_1 = a_{11} x_1 + a_{12} x_2 + ... + a_{1p} x_p = a_1^T x,

where x_i is the i-th variable and the a_{1i} are the linear combination coefficients; they can be written as a column vector a_1, normalized so that a_1^T a_1 = 1. The variance of y_1 is a_1^T S a_1, and the vector a_1 maximizing this variance is called the first principal component. The next principal component is found in the same way, by maximizing a_2^T S a_2 subject to the constraints a_2^T a_2 = 1 and a_2^T a_1 = 0, which gives a component orthogonal to the first one. The remaining principal components can be derived in a similar way. The coefficient vectors a_1, a_2, ... can be calculated from the eigenvalues and eigenvectors of the matrix S and are ordered according to their eigenvalues. The computational complexity and cost are decreased because the DCT-processed frames, rather than the raw frames, are introduced to PCA for dimensionality reduction. If the input frames are processed directly by PCA, the computational complexity and cost increase because of the high dimensionality of the input frames, which also results in erroneous detection of foreground objects.

The proposed method can be summarized as follows. Select two frames and apply the two-dimensional DCT to the input frames: the 1-D DCT is applied on the columns of the frame blocks and then on the rows of the resulting frame blocks. Likewise, all the input frames are processed with the 2-D DCT. Compute the average image and subtract the mean from the images. Calculate the eigenvectors and eigenvalues, and save the best K eigenvectors, selected according to their eigenvalues, in matrix form. The previous and current frames are then projected onto the eigenspace. Finally, the projection is mapped back from the eigenspace and the IDCT is performed.
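The summarized pipeline can be sketched end to end on toy data. This is a minimal interpretation of the steps above, using SVD for the eigen-decomposition; the frame size, the number of training frames and K = 3 are illustrative assumptions rather than the thesis settings:

```python
import numpy as np

def dct_mat(n):
    """Orthonormal DCT-II matrix: f_hat = D @ f @ D.T, f = D.T @ f_hat @ D."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return np.sqrt(2.0 / n) * m

rng = np.random.default_rng(3)
bg = rng.random((8, 8)) * 50                       # static background
frames = [bg + rng.normal(scale=0.5, size=(8, 8)) for _ in range(6)]
current = bg.copy()
current[4, 4] += 200.0                             # moving object in current frame

D = dct_mat(8)
to_dct = lambda f: (D @ f @ D.T).ravel()           # step 1: 2-D DCT per frame

A = np.stack([to_dct(f) for f in frames])          # frames in the DCT domain
mean = A.mean(axis=0)                              # step 2: mean image
_, _, Vt = np.linalg.svd(A - mean, full_matrices=False)
W = Vt[:3].T                                       # step 3: best K = 3 eigenvectors

v = to_dct(current) - mean
recon = mean + W @ (W.T @ v)                       # project onto eigenspace and back
diff = (to_dct(current) - recon).reshape(8, 8)
residual = np.abs(D.T @ diff @ D)                  # step 4: IDCT of the residual
print(np.unravel_index(residual.argmax(), residual.shape))
```

Background variation is captured by the mean and the leading eigenvectors, so it reconstructs well, while the moving object does not; the largest reconstruction residual therefore falls on the object pixel.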

3.4 Experimental Results and Discussion

To assess the performance of the proposed method, we use the openly accessible
standard PETS dataset and other datasets consisting of various video
sequences. The challenging features of these datasets are complex backgrounds,
small/distant objects with low contrast, and illumination variations in both
indoor and outdoor environments. The following comparative analysis evaluates
the efficiency of the proposed method against the conventional PCA method
[3.19]. Figure 3.2 shows the moving objects detected in all five video
sequences: original frames are shown in the 1st row, results achieved by the
proposed method in the 2nd row, and results of the PCA approach in the 3rd
row. Columns 1 and 3 of Figure 3.2 show the results for the standard PETS
dataset, in which a moving car is detected; the proposed method outperforms
the PCA approach, whose results contain too many false positives and fail to
retain the shape of the object. Columns 2, 4 and 5 of Figure 3.2 show the
results for other video sequences: in column 2, moving objects on a highway
are detected, and in column 5 a person walking in an underground metro station
along with the train is detected. In all cases the Principal Component
Analysis approach failed to retain the shape of the object and produced too
many false positives, which may lead to erroneous outcomes in subsequent
processing. With the proposed method, by contrast, the results are close to
the original frames (the shapes of the objects are preserved) and contain
relatively few false positives.

The proposed method was also tested on various video sequences, including
sunny-day sequences, low-contrast sequences, cloudy conditions, and
noisy/poor-quality videos, in both indoor and outdoor environments. In
contrast to the conventional Principal Component Analysis [3.19], the proposed
method obtains fair results. We also noticed that the detected clusters are
close to ideal and the false alarm rate is lower than that of the PCA
approach. As is evident from Figure 3.2 and Table 3.1, the false alarm rate,
detection rate, accuracy, and the false positive and false negative rates
achieved are satisfactory compared with the conventional PCA approach.

The advantage of the proposed method is that the DCT separates each video
frame into components of differing importance, and the properties of the DCT
coefficient blocks make them well suited for generating feature spaces that
best describe the object. The generated features are extracted using PCA, and
this step helps preserve the shape of the object. Conventional PCA, by
contrast, fails to guarantee the neighbour relationship among neighbouring
samples, which leads to failure in preserving the shape of the object. The
detection approach is also invariant to changes in illumination and to complex
backgrounds. In future, the proposed method can be extended to track the
detected objects, considering occlusion and other challenges.
Figure 3.2: Experimental results of DCT-PCA; input frames (Row 1); results
achieved by the proposed method (Row 2); results obtained from PCA (Row 3)

Table 3.1: Result analysis of DCT-PCA vs conventional PCA

Metrics                              Proposed Method   PCA [3.19]
False Alarm Rate (FP/(TP+FP))             31.30%         37.98%
Detection Rate (TP/(TP+FN))               80.20%         78.38%
Accuracy ((TP+TN)/TF)                     60.74%         56.67%
False Negative Rate (FN/(TP+FN))          19.70%         21.55%
False Positive Rate (FP/(FP+TN))          76.20%         82.69%
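The metrics in Table 3.1 follow directly from the confusion counts. A small sketch with hypothetical counts (illustrative only, not the thesis data) shows how each rate is computed:

```python
# Hypothetical confusion counts for one test sequence (not the thesis data).
TP, FP, TN, FN = 802, 365, 410, 197

false_alarm_rate    = FP / (TP + FP)                # FP/(TP+FP)
detection_rate      = TP / (TP + FN)                # TP/(TP+FN)
accuracy            = (TP + TN) / (TP + FP + TN + FN)
false_negative_rate = FN / (TP + FN)                # complement of detection rate
false_positive_rate = FP / (FP + TN)                # FP/(FP+TN)
```

Note that the detection rate and false negative rate always sum to one, which matches the approximately complementary pairs in Table 3.1.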

1. Background Search
Deep learning is the new big trend in machine learning. It has had many recent
successes in computer vision, automatic speech recognition and natural
language processing.
Classification using a machine learning algorithm has 2 phases:
 Training phase: In this phase, we train a machine learning algorithm
using a dataset comprised of the images and their corresponding labels.
 Prediction phase: In this phase, we utilize the trained model to predict
the labels of unseen images.
The training phase for an image classification problem has 2 main steps:
 Feature Extraction: In this phase, we utilize domain knowledge to extract new
features that will be used by the machine learning algorithm. HoG and SIFT are
examples of features used in image classification.
 Model Training: In this phase, we utilize a clean dataset composed of the images'
features and the corresponding labels to train the machine learning model.

In the prediction phase, we apply the same feature extraction process to the new
images and we pass the features to the trained machine learning algorithm to predict
the label.
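The two phases can be made concrete with a small sketch. Here a normalised intensity histogram stands in for HOG/SIFT features, and a nearest-centroid rule stands in for the trained model; both are simplifications chosen for brevity, not the methods discussed in the text.

```python
import numpy as np


def extract_features(image, bins=8):
    """Feature extraction: a hand-crafted feature (normalised intensity
    histogram), standing in for HOG/SIFT."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()


def train(images, labels):
    """Training phase: images + labels -> a nearest-centroid model
    (one mean feature vector per class)."""
    feats = np.stack([extract_features(im) for im in images])
    classes = sorted(set(labels))
    return {c: feats[[l == c for l in labels]].mean(axis=0) for c in classes}


def predict(model, image):
    """Prediction phase: apply the SAME feature extraction, then assign
    the class whose centroid is nearest."""
    f = extract_features(image)
    return min(model, key=lambda c: np.linalg.norm(f - model[c]))
```

The essential point is structural: the identical `extract_features` step is applied in both phases, which is exactly the hand-crafted feature engineering that deep learning replaces with learned features.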

The main difference between traditional machine learning and deep learning
algorithms is in the feature engineering. In traditional machine learning algorithms,
we need to hand-craft the features. By contrast, in deep learning algorithms feature
engineering is done automatically by the algorithm. Feature engineering is difficult,
time-consuming and requires domain expertise. The promise of deep learning is
more accurate machine learning algorithms compared to traditional machine
learning with less or no feature engineering.
Fig. 1. Machine learning phase

Fig. 2. Deep learning phase

2. Approach to the Solution

Computer vision is an interdisciplinary field that has been gaining a huge
amount of traction in recent years, particularly since the success of CNNs,
and self-driving cars have taken centre stage. Another integral part of
computer vision is object detection. Object detection aids pose estimation,
vehicle detection, surveillance, etc. The difference between object detection
algorithms and classification algorithms is that detection algorithms try to
draw a bounding box around each object of interest to locate it within the
image. Moreover, you might not draw just one bounding box in an object
detection case: there could be many bounding boxes representing different
objects of interest within the image, and you would not know how many
beforehand.
The major reason you cannot tackle this problem by building a standard
convolutional network followed by a fully connected layer is that the length
of the output layer is variable, not constant, because the number of
occurrences of the objects of interest is not fixed. A naive approach would be
to take different regions of interest from the image and use a CNN to classify
the presence of the object within each region. The problem with this approach
is that the objects of interest may have different spatial locations within
the image and different aspect ratios. Hence, you would have to select a huge
number of regions, which could blow up computationally. Therefore, algorithms
like R-CNN, YOLO, etc. have been developed to find these occurrences, and to
find them fast.
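The combinatorial blow-up of the naive approach is easy to see by enumerating the candidate regions a sliding-window detector would have to classify. The scales, aspect ratios and stride below are arbitrary illustrative choices:

```python
def sliding_window_regions(width, height, scales=(32, 64, 128),
                           aspect_ratios=(0.5, 1.0, 2.0), stride=8):
    """Enumerate every region a naive detector would classify with a CNN:
    every position x every scale x every aspect ratio."""
    regions = []
    for s in scales:
        for ar in aspect_ratios:
            # Box of area ~s*s with width/height ratio ar.
            w, h = int(s * ar ** 0.5), int(s / ar ** 0.5)
            for y in range(0, height - h + 1, stride):
                for x in range(0, width - w + 1, stride):
                    regions.append((x, y, w, h))
    return regions


# Even a modest 256x256 image yields thousands of candidate regions,
# each of which would need a full CNN forward pass.
n_regions = len(sliding_window_regions(256, 256))
```

Running one CNN classification per region is what makes the naive approach computationally infeasible, and is precisely the cost that R-CNN's region proposals, and later Fast R-CNN's shared feature map, were designed to cut.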

Problems with R-CNN

 It still takes a huge amount of time to train the network as you would have
to classify 2000 region proposals per image.
 It cannot be implemented real time as it takes around 47 seconds for each
test image.
 The selective search algorithm is a fixed algorithm. Therefore, no learning is
happening at that stage. This could lead to the generation of bad candidate
region proposals.

Fast R-CNN
Fig. 3. Fast R-CNN

Fast R-CNN solved some of the drawbacks of R-CNN to build a faster object
detection algorithm. The approach is similar to the R-CNN algorithm, but
instead of feeding the region proposals to the CNN, we feed the input image to
the CNN to generate a convolutional feature map. From the convolutional
feature map, we identify the region proposals, warp them into squares and,
using an RoI pooling layer, reshape them into a fixed size so that they can be
fed into a fully connected layer. From the RoI feature vector, we use a
softmax layer to predict the class of the proposed region and also the offset
values for the bounding box.
The reason Fast R-CNN is faster than R-CNN is that you do not have to feed
2000 region proposals to the convolutional neural network every time.
Instead, the convolution operation is done only once per image, and a feature
map is generated from it.
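The RoI pooling step can be sketched as follows: a variable-sized region of the shared feature map is divided into a fixed grid of bins and max-pooled, so every proposal yields a same-sized vector for the fully connected layers. A 2x2 output grid is used here for brevity (Fast R-CNN commonly uses a larger grid such as 7x7), and the region is assumed to be at least as large as the grid:

```python
import numpy as np


def roi_pool(feature_map, roi, output_size=(2, 2)):
    """RoI max pooling. roi = (x0, y0, x1, y1) in feature-map coordinates;
    the region must be at least output_size in each dimension."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    h, w = region.shape
    # Split the region into out_h x out_w roughly equal bins and take the
    # maximum activation in each bin.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Whatever the size or aspect ratio of the proposal, the output is always `output_size`, which is what lets a single fully connected head consume every region.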

Fig. 4.Comparison of object detection algorithms

From the above graphs, we can infer that Fast R-CNN is significantly faster
than R-CNN in both training and testing.
Fig. 5. Faster R-CNN