




Video Analysis (VA) is the process of analyzing video to detect and
track the activity of objects present in it. VA has a wide range of
applications in domains such as entertainment, health care, retail, automotive,
transport, home automation, flame and smoke detection, and safety and security.
VA is used mainly in surveillance systems, where the events occurring in a
scene monitored by one or several cameras need to be understood.

Research on VA has taken a key role as the importance of video
surveillance systems continues to rise, especially in public and home security.
The recognition of video analytics as a key security element provides both
opportunities and challenges for researchers. The challenge in VA relates to
the development of algorithms and models for analyzing the scenes in the
video. VA can be facilitated by software implementing these algorithms, or by
hardware, on general purpose machines equipped with a video processing unit.

In recent years, as the need for surveillance has increased, the utility of
video data has been magnified through the use of VA. Intelligent video
analytics tools for both real time and recorded video data are in demand.
Intelligent Video Analysis (IVA) enables forensic analysis of historical
data to identify patterns, trends and incidents. Development of IVA
concentrates on three directions (Figure 1.1) that are pursued in both the
signal processing and computer vision communities.

The first direction focuses on visual event modeling and algorithms
that can detect, track and classify objects and their events in the video.
The most challenging task is to bridge the gap between signal-level processing
(such as color, luminance and resolution) and semantic-level processing
(e.g., level crossing accidents).

Figure 1.1 Directions in Intelligent Video Analysis

The second direction is the study of video capturing camera networks.
The video data could be fed from a single camera or multiple cameras.
Fusion models of IVA, which combine multiple observations of the same
visual phenomenon, can improve system performance.

The third direction is the development of combined software and
hardware models to facilitate real time video analysis.


Video analysis has been carried out using computer vision techniques
(Figure 1.2). Earlier, an object based approach was practiced, in which analysis
was based on a two-state Markov chain model at each pixel. Object features,
namely shape, size and velocity, are extracted, and the direction of motion of
the object over multiple frames is observed. A holistic approach extracts
behavioral characteristics of the object from raw video, and statistical patterns
are recognized using hidden Markov models. The limitations of the object based
method were overcome by the holistic approach. The behavioral analysis model
involves understanding the entire scene in the video rather than focusing on an
individual object and its activity. Both approaches suffer from an increased rate
of false positives.

An alternative approach to VA first identifies the object of interest in
the video and tracks its motion, followed by anomalous behavior analysis of
the object. Tracking is the process of estimating the state of objects from the
sequence of visual observations in the video. A framework of probabilistic
graphical models and Bayesian inference was used for identification. Basic
tracking was done by estimating the trajectory of the object of interest and
consistently labeling the object across video frames.
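
Tracking as Bayesian state estimation can be illustrated with a minimal sketch.
It is not the framework referred to above: it assumes a one-dimensional
constant-position model with Gaussian noise (a scalar Kalman filter), and the
function name and noise variances are hypothetical values chosen for
illustration.

```python
def kalman_track(observations, process_var=1.0, obs_var=4.0):
    """Estimate a 1-D object position from noisy per-frame observations."""
    # Initialise the state from the first frame (assumed starting belief).
    estimate, variance = observations[0], obs_var
    track = [estimate]
    for z in observations[1:]:
        # Predict: the position is assumed unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend prediction and observation by their uncertainties.
        gain = variance / (variance + obs_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        track.append(estimate)
    return track
```

Each estimate is a weighted average of the previous estimate and the new
observation, so a single outlying observation pulls the track only part of the
way towards it.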

Figure 1.2 Approaches in Video Analytics

Recently, computer vision based frameworks have been developed for
VA that focus on the analysis of crowd scenes containing pedestrians in video
data, such as crowd density estimation, tracking the crowd population and
crowd behavior


VA in surveillance systems focuses mainly on pedestrian detection
and behavioral analysis of pedestrians. Pedestrian detection is the process of
determining the presence of a pedestrian and marking the direction of movement
in given video or image data. Locating pedestrians is a primary task
in many smart city applications such as forensic surveillance, home surveillance
and hospital surveillance of patients. Pedestrian detection is also widely
used in intelligent auxiliary driving, intelligent monitoring, intelligent robots
and many other fields where pedestrian activity analysis is required.

Owing to the increasing demand for pedestrian detection in real world
applications, much research has been carried out to attain an efficient
mechanism for identifying pedestrians. Developments in computer vision
technologies and improvements in various algorithms aim at achieving accurate
pedestrian detection systems. The main steps involved in pedestrian detection
are represented in Figure 1.3. Initially, an image or an image frame extracted
from the video is taken as input. The image contains a large number of
features, and the essential features need to be extracted to decide whether the
input is a pedestrian or not. Once the features are extracted, they are
classified according to the way the model was trained. Finally, the trained
model is able to classify the input image as pedestrian or non-pedestrian.

Figure 1.3 Basic process of pedestrian detection


However, the complexity of real world backgrounds, diverse pedestrian
postures and diverse shooting angles demand the development of new, precise
algorithms for pedestrian detection. It is also very challenging to achieve
high detection rates in critical scenes, such as those with pedestrians at a
distance. Despite great improvements in accuracy, pedestrian detection still
faces various difficulties that require more meticulous design and
optimization.



Pedestrian detection in an image or video can be done by extracting
the features that represent the pedestrian. The most commonly used feature
extraction methods for pedestrian detection are listed in Figure 1.4.

Scale Invariant Feature Transform (SIFT)

Speeded-Up Robust Features (SURF)

Haar wavelet

Histogram of Oriented Gradient (HOG)

Local Self Similarity (LSS)

Figure 1.4. Feature Extraction Methods


1.4.1 Scale Invariant Feature Transform

Scale Invariant Feature Transform (SIFT) is used to detect and
describe local features in an image. It uses the Difference of Gaussians (DoG),
an approximation of the Laplacian of Gaussian (LoG), to extract key points that
are invariant to scale and orientation. Key points of the target objects are
extracted from a set of images and stored in a database. When a new image is
given as input, the key points of the target object are extracted from the new
image and compared with the key points in the database. The comparison is
based on the Euclidean distance between the key points of the new image
and those in the database. From the set of matches, subsets of key points that
agree on the target object and its location, scale and orientation in the new
image are identified to filter the best matches. The probability that a
particular set of features indicates the presence of the target object is then
computed, given the accuracy of fit and the number of probable false matches.
Based on the result, the presence of the target object is decided.
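
The matching step can be sketched as a nearest-neighbour search under the
Euclidean distance. This is a minimal illustration, not a SIFT implementation:
descriptors are assumed to be plain tuples of numbers, the function name is
hypothetical, and the ratio test used to reject ambiguous matches is an
assumption that the text above does not describe.

```python
import math

def match_keypoints(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs of accepted matches."""
    matches = []
    for qi, q in enumerate(query):
        # Euclidean distance from this query descriptor to every stored one.
        dists = sorted((math.dist(q, d), di) for di, d in enumerate(database))
        best = dists[0]
        # Ratio test (assumed): keep a match only when it is clearly closer
        # than the second-best candidate.
        if len(dists) == 1 or best[0] < ratio * dists[1][0]:
            matches.append((qi, best[1]))
    return matches
```

A query descriptor that lies equally close to two stored descriptors is
rejected, which is one simple way of reducing probable false matches.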

1.4.2 Speeded-Up Robust Features

Speeded-Up Robust Features (SURF) is a speeded up version of
SIFT. It employs an integral image and box filters to improve and optimize SIFT
features. SURF approximates the LoG with box filters evaluated on the integral
image, which reduces the computational cost of the features. SURF is reported
to outperform SIFT in repeatability, distinctiveness and robustness, and is
claimed to be much faster than SIFT. To construct the scale space, SURF alters
the box filter sizes instead of the image size. The integral image is used to
simplify, approximately, the filtering of the image with Gaussian second-order
derivatives.

Hence, SURF drastically reduces the number of operations for the
simple box convolutions, and the cost is independent of the scale chosen. At
most three or four memory accesses are required to calculate the sum of
intensities of a selected rectangular region of any size shown in figure 1.2.
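
The constant-cost rectangle sum can be sketched as follows. The example assumes
a grayscale image stored as a list of rows; the function names are illustrative
and are not taken from any SURF library.

```python
def integral_image(img):
    """Summed-area table with one extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of all pixels above and to the left, inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 via four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

Once the table is built, `box_sum` touches exactly four entries whatever the
size of the rectangle, which is why the cost of a box filter does not depend on
its scale.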

1.4.3 Haar Wavelet

The Haar wavelet is a sequence of Haar functions that together form a
wavelet family, used in discrete image transforms and processing. Wavelet
analysis is similar to Fourier analysis in that it allows a target function
over an interval to be represented in terms of an orthonormal basis. Viola et
al. (2001) proposed rectangular Haar features, in which pixels are grouped into
rectangular regions. The sum of pixel values in each rectangular region is
calculated, and the Haar feature value is the difference between the sums of
adjacent rectangular regions. Haar feature templates include the linear
template, center template, diagonal template and edge template.

The feature template can be set to a sub-window of any size. Once the
template forms are identified, the number of features can be estimated from the
size of the rectangle templates and the size of the training sample images.
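
A two-rectangle (edge) template can be sketched as below. The sketch assumes the
usual definition of a Haar-like feature as the difference between the pixel sums
of two adjacent rectangles; the function names are hypothetical, and plain
summation is used instead of an integral image to keep the example short.

```python
def region_sum(img, top, left, height, width):
    """Sum of pixel values inside one rectangle."""
    return sum(img[y][x]
               for y in range(top, top + height)
               for x in range(left, left + width))

def haar_edge_feature(img, top, left, height, width):
    """Edge template: sum of the left half minus sum of the right half."""
    half = width // 2
    left_sum = region_sum(img, top, left, height, half)
    right_sum = region_sum(img, top, left + half, height, half)
    return left_sum - right_sum
```

The feature is zero on a uniform patch and large in magnitude where a vertical
intensity edge runs through the middle of the window.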

1.4.4 Histogram of Oriented Gradient

Histogram of Oriented Gradient (HOG) is a feature descriptor
used in image processing for object detection. The technique counts
occurrences of gradient orientations in localized portions of an image.
The image is divided into connected regions called cells, and a histogram of
gradient directions is compiled over the pixels within each cell. The feature
descriptor is the combination of these histograms. Because HOG operates on
local cells, it is invariant to illumination and shape transformations, except
for the orientation of the object.

HOG inherits the advantages of SIFT features and is robust
to changes in clothing, colors, body figure and height. Because the
rectangular detection window cannot handle rotational transformations, the
pedestrian must be in an upright position. Under that condition, the HOG
descriptor is particularly suited to human detection in images.
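
The per-cell histogram can be sketched as follows. This is a simplified
illustration rather than a full HOG implementation: it assumes central
difference gradients on interior pixels, nine unsigned orientation bins and
magnitude-weighted voting, and it omits block normalization. The function name
is hypothetical.

```python
import math

def cell_histogram(cell, bins=9):
    """Orientation histogram of one cell, weighted by gradient magnitude."""
    hist = [0.0] * bins
    h, w = len(cell), len(cell[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal gradient
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical gradient
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // (180.0 / bins)) % bins] += magnitude
    return hist
```

Concatenating such histograms over all cells of the detection window gives the
descriptor that is passed to the classifier.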

1.4.5 Local Self Similarity

Local Self Similarity (LSS) features are texture based descriptors
used for object detection. Instead of measuring features such as the color or
gradient of a pixel, LSS measures the similarity of an image block to its
neighboring pixels within a certain radius. The selected image block is
compared with its neighborhood, and the resulting sum of squared differences
is normalized and projected into a space that is divided into a number of
angular and radial intervals. The maximum value within each interval is taken
as the feature value.
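
The comparison step can be sketched as below. The sketch covers only the sum of
squared differences and its normalization into a similarity surface; the
projection into angular and radial intervals is omitted. The exponential
normalization, the `noise_var` parameter and the function names are assumptions
made for illustration.

```python
import math

def patch(img, cy, cx, r):
    """Flattened (2r+1) x (2r+1) patch centred on (cy, cx)."""
    return [img[y][x]
            for y in range(cy - r, cy + r + 1)
            for x in range(cx - r, cx + r + 1)]

def similarity_surface(img, cy, cx, patch_r=1, region_r=2, noise_var=25.0):
    """Similarity of the central patch to each neighbouring patch."""
    center = patch(img, cy, cx, patch_r)
    surface = {}
    for dy in range(-region_r, region_r + 1):
        for dx in range(-region_r, region_r + 1):
            other = patch(img, cy + dy, cx + dx, patch_r)
            ssd = sum((a - b) ** 2 for a, b in zip(center, other))
            # Map the SSD to (0, 1]; identical patches give 1.0.
            surface[(dy, dx)] = math.exp(-ssd / noise_var)
    return surface
```

The caller is responsible for choosing a center at least `patch_r + region_r`
pixels away from the image border.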


Pedestrian detection is the task of identifying whether an object in an
image or video is a pedestrian. In general, the process involves extraction and
classification of features from an image or video. This process is widely
carried out using three approaches, namely the traditional approach, the
machine learning approach and the deep learning approach (Figure 1.5).

Figure 1.5 Approaches for Pedestrian Detection


1.5.1 Traditional Approach

Traditional approaches are basic methods that comprise a variety of
algorithms for object detection. Among those algorithms, the ones most
commonly used for pedestrian detection are discussed as follows.

Eigen Faces

Eigen faces is an appearance based technique for face recognition.
A database of images with varied faces is collected. The structural properties
of each face are identified and converted to digital data. An image with N
pixels is treated as a vector in an N-dimensional space. The Eigen faces are
the eigenvectors of the covariance matrix of the face images. During the
recognition phase, an Eigen face value is calculated for the given new image,
and the Euclidean distance between this value and those of the faces in the
database is computed. The face with the smallest Euclidean distance is
considered to be the most resembling face. Eigen faces was the first working
model for facial recognition technology and was used in commercial
products. This method serves as the baseline for demonstrating the minimum
expected performance of present face recognition systems.

Local Binary Pattern Algorithm

Local Binary Pattern (LBP) is a pattern based approach in which each
neighboring pixel in the image is compared with a center pixel. Neighbors whose
intensity is at least that of the center pixel are marked with the binary value
1, and the others with 0. In this way simple circular point features are
extracted. The procedure is repeated to collect ring features throughout the
image, which are later unfolded into row vectors. A binomial weight is assigned
to each bit in the vector, and the vector is transformed into a decimal code,
the LBP code. Using the frequencies of the values in the LBP codes, a
one-dimensional representation of the target region in the image is formed.
LBP combined with the HOG descriptor was found to improve detection
performance on certain datasets. LBP was found to produce degraded results
under varied lighting and on blurred and noisy images.

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction method
used as a tool in exploratory data analysis and predictive modeling. PCA
transforms a large set of observations into a smaller set of uncorrelated
variables called principal components. The principal components are obtained
from the covariance matrix of the data: PCA can be carried out by eigenvalue
decomposition of the covariance matrix after a normalization step. The
normalization is carried out by subtracting the measured mean of each variable
from its data values, so that the empirical mean (average) is zero, and normalizing

point of PCA the projection on region of interest for detection can be obtained.
Pedestrian detection can be done by integrating PCA with HOG for improved
performance. PCA is sensitive to the relative scaling of the original
variables.

Independent Component Analysis

Independent Component Analysis (ICA) is a generative model used
for large databases. ICA extracts useful information or signals from image
or video data. It is considered to perform better than PCA for face detection.
ICA finds independent components, whereas PCA finds uncorrelated
components. ICA works by revealing the hidden factors underlying a set of
random variables or signals. The data variables are assumed to be linear
mixtures of latent variables. The latent variables are assumed to be
non-Gaussian and mutually independent, and they are called the independent
components of the observed data. ICA is used for pedestrian detection, but it
suffers from the problems of over-complete ICA and under-complete ICA.

1.5.2 Machine Learning Approach

Machine learning is the concept of making a machine learn from
the data or information that is fed to it. Machine learning algorithms are
categorized into three types, namely supervised learning, unsupervised learning
and reinforcement learning. Supervised algorithms build a model by training
on a mapping from input data to output data, and predict a new output
when a new input is given. Unsupervised algorithms build models based on
the distribution or pattern of the input data. Reinforcement learning builds
models by identifying patterns and learning from the environmental

Pedestrian detection can be done using a machine learning approach
for improved computational speed and accuracy, where the traditional
approaches suffer from more false detections. Various machine learning
algorithms can be used for detecting pedestrians. Among those, the ones most
commonly used for pedestrian detection are discussed as follows.

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning
algorithm that can be used for pedestrian detection. Generally, in computer
vision based techniques, images are viewed as matrices of pixels, a non-linear
representation of data points. The features of the target object are extracted
from the image and then classified. The features can be extracted using any of
the traditional feature extraction techniques. SVM is a binary classifier that
can be used for categorizing an image as pedestrian or non-pedestrian. The
general working principle of SVM is the separation of data points into one of
the two classes by defining a hyperplane between the non-linear data points of
the images. The larger the distance between the hyperplane and the nearest data
points, the higher the accuracy of classification. SVM can also be used for
selecting healthy features from the available data points of the images.

Genetic Algorithm

Genetic Algorithm (GA) is a random adaptive global search algorithm.
It is a heuristic optimization method that operates on a population of
individuals to produce better, optimized individuals. In GA, the individual
feature values in the population are considered to be chromosomes. The encoding
units that combine to form these chromosomes are called genes. During feature
selection, at every iteration a new population of features is created by
applying a fitness function to the features of the old population. New
combinations of features are determined using three operators, namely
selection, crossover and mutation.
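
A GA for feature selection can be sketched as follows. This is a minimal
illustration under stated assumptions: each chromosome is a list of 0/1 genes
marking which features are kept, the fitter half of the population is retained
as parents, new individuals are produced by one-point crossover and bit-flip
mutation, and `fitness` is a caller-supplied scoring function (in practice it
might score a classifier trained on the selected features). All names and
parameter values are hypothetical.

```python
import random

def evolve(fitness, n_features, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Return the fittest 0/1 feature mask found by a simple GA."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: join the head of one parent to the tail of another.
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [g ^ 1 if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

For example, `evolve(sum, n_features=8)` searches for the mask with the most
selected features, a toy objective used only to exercise the loop. Because the
parents survive unchanged, the best fitness in the population never decreases
from one generation to the next.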

The selection method selects chromosomes from the population
using some logic or rules. Crossover is done by exchanging sequences of genes
and combining two chromosomes in some order. Mutation makes a change in a
chromosome directly, based on a certain probability. These operators generate
new individuals and make them available for global optimization in pedestrian
detection.

XGBoost

XGBoost (Extreme Gradient Boosting) is an algorithm that builds
classification and regression tree models using gradient boosted decision
trees. In gradient boosting, new models are created sequentially to predict
the errors of the existing models. Finally, the created models are added
together to form the final model. XGBoost works well on standard tabular data,
and it can also be applied to pedestrian detection from image or video data.

AdaBoost Algorithms

AdaBoost is another boosting algorithm, developed primarily for binary
classification. It is used with short decision trees: the first decision tree
is created, and the performance of the tree on each training instance is used
to weight the training instances for the next tree. The general idea is to
build a strong classifier from a number of weak classifiers. This is done by
constructing a model from the training data and then creating new models that
attempt to correct the errors of the previous model. Model construction
continues until the training set is predicted accurately or the maximum number
of models has been reached. The integration of all the models gives the final
model for classification.
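
The construction of a strong classifier from weak ones can be sketched as
follows. The example assumes a single numeric feature, labels in {-1, +1} and
one-level decision trees (threshold stumps) as the weak classifiers; the
function names are hypothetical and the code is a toy illustration rather than
a production implementation.

```python
import math

def train_adaboost(xs, ys, rounds=5):
    """xs: 1-D feature values, ys: labels in {-1, +1}. Returns weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Weak learner: the threshold stump with the lowest weighted error.
        for thr in sorted(set(xs)):
            for sign in (1, -1):
                preds = [sign if x >= thr else -sign for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign, preds)
        err, thr, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this stump
        model.append((alpha, thr, sign))
        # Reweight: misclassified instances gain weight for the next round.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * (sign if x >= thr else -sign)
                for alpha, thr, sign in model)
    return 1 if score >= 0 else -1
```

The reweighting step is what makes each new stump concentrate on the instances
that the previous stumps classified incorrectly.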

AdaBoost is commonly used for pedestrian detection. The images
are divided into rectangular windows and features are computed. Windows are
selected in a sequence of some order, and the sequence is labeled as
pedestrian or non-pedestrian. The same procedure is repeated by selecting
windows in the same sample image in another order. The learning process
continues until a cascade of classification rules is constructed, in which the
first model discards clear non-pedestrians, the second model discards less
clear non-pedestrians, and so on; the windows that are not rejected by any
model are taken to be pedestrians.

Fuzzy Clustering Methods

Pedestrian detection from videos with complex scenes, varied
illumination and varied resolutions remains a challenging task. Algorithms for
pedestrian detection require subtraction of the background and detection of
foreground objects. The fuzzy clustering approach groups pixels into classes
based on the distance between a data point and the centroid of the cluster.
For pedestrian detection, fuzzy clustering requires segmentation of the image
as an important step, which incorporates geometric symmetry information for
classification.
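
The distance-based grouping can be sketched with the membership rule of fuzzy
c-means, which is assumed here as a representative fuzzy clustering method. The
example treats a pixel as a single intensity value; the fuzzifier `m` and the
function name are illustrative choices.

```python
def fuzzy_memberships(value, centroids, m=2.0):
    """Degree of membership of one pixel intensity in each cluster."""
    dists = [abs(value - c) for c in centroids]
    if 0.0 in dists:
        # The pixel coincides with a centroid: full membership there.
        hit = dists.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(dists))]
    power = 2.0 / (m - 1.0)
    # Closer centroids receive higher membership; memberships sum to 1.
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists)
            for d_i in dists]
```

Unlike hard clustering, a pixel midway between a background centroid and a
foreground centroid belongs partly to both, which can be useful near object
boundaries.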

1.5.3 Deep Learning Models

Machine learning algorithms work well with small data; when
applied to large data they suffer from issues such as under fitting, model
complexity and lack of resource optimization. To overcome these issues, deep
learning networks can be applied to big data for knowledge discovery,
knowledge-based prediction and knowledge application. Deep learning
enables models to learn directly from images, video or text.
Different deep learning architectures exist, which have helped achieve
remarkable performance compared with other machine learning methods as the data
size increases. Some widely used deep learning models are discussed in the
following section (Figure 1.6).

Deep Learning

Multilayer Perceptron

Restricted Boltzmann Machine

Convolution Neural Network

Figure 1.6. Deep Learning Models

Multilayer Perceptron

A Multilayer Perceptron (MLP) is an Artificial Neural Network (ANN) that
contains at least three layers of operation, namely an input layer, a hidden
layer and an output layer (Figure 1.7). MLP uses a supervised
learning technique with back propagation for training. Each layer is made up
of units called perceptrons, densely connected to the units of the adjacent
layers. Each layer other than the input layer uses a nonlinear activation
function.

Figure 1.7. Multilayer Perceptron with 2 hidden layers

The learning process of MLP takes place in the perceptrons: the
connection weights from each layer to the next are changed
based on the amount of error in the output when compared to the expected
result. MLP can be applied to supervised, unsupervised and reinforcement
learning purposes.

Restricted Boltzmann Machine

Restricted Boltzmann Machines (RBMs) were originally designed
for unsupervised learning. They are a type of energy-based undirected
graphical model that includes two groups of units, namely a visible layer and a
hidden layer. Each layer is made up of units called neurons (nodes).
Connections exist between the nodes of the visible layer and those of the
hidden layer, but there are no connections between nodes of the same layer,
hence the name restricted Boltzmann machine. Figure 1.8 represents the RBM
model.

Figure 1.8. Restricted Boltzmann Machine

Each visible node takes a low-level feature from an item in the
dataset to be learned. The value of each visible node is multiplied by a
random initial weight and added to a bias value. The result is fed to an
activation function at the hidden layer to produce the output for the input
node value. If multiple hidden layers are used, the calculated values of the
nodes are passed to the next hidden layer for processing until they reach the
final classifier (output) layer. In an RBM the hidden layer and visible layer
can affect each other, whereas in an MLP only the input layer can affect the
hidden layer. A stack of RBMs, called a Deep Belief Network (DBN), can perform
layer-wise training and achieve superior performance compared with MLPs in
many applications.

Auto-Encoders

Auto-Encoders (AEs) are unsupervised neural networks that try to
copy their input to their output. AEs are used for dimensionality reduction by
providing a compact data representation. An AE (Figure 1.9) consists of two
neural network layers, namely an encoder and a decoder. The input to encoder
layer is the
functionality is to encode this
input to a latent representation space z. The Gaussian distribution is used for
encoding and the output is the mean and variance of Gaussian distribution.

Figure 1.9. Auto Encoder

Similarly, the decoder accepts the latent representation as input and
outputs the parameters of the distribution of the input data points.

Convolution Neural Network Model

A Convolutional Neural Network (CNN) is a deep neural network
architecture widely used in computer vision technologies.
The building blocks of a CNN are the convolutional layer, pooling layer,
activation layer and connected layer. A deep CNN is an architecture made
up of several connected convolutional layers for end to end operation
(Figure 1.10). Filters form the core part of the convolutional layer. At the
convolutional layer, a small patch of pixels, say 3 X 3, from the input image
is passed through the filter.

Figure 1.10. Convolution Neural Network Layer Stack

The filter performs a dot product between the pixel values and the
weights defined in the filter, and the result is summed into one value
representing all the pixels given to the filter. Thus, the convolutional layer
generates a matrix of data points smaller than the original image. The matrix
is passed to the activation layer, which provides non-linearity; the network is
trained through back propagation. The pooling layer down-samples and further
reduces the size of the matrix produced by the filter. The pooling layer
selects one feature, the maximum, out of each group, and is therefore called a
max layer. The connected layer takes the output of the max layers and
produces a list of probabilities for the different possible labels of the
given image. The classification decision is based on the label with the highest
probability.
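
The layer operations described above can be sketched in plain Python. This is a
forward pass only, with a single hand-written filter; the function names are
hypothetical, training by back propagation is not shown, and ReLU and softmax
are assumed as the activation and the probability mapping.

```python
import math

def convolve(img, kernel):
    """Slide a k x k filter over the image (no padding, stride 1)."""
    k = len(kernel)
    out_h, out_w = len(img) - k + 1, len(img[0]) - k + 1
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for x in range(out_w)]
            for y in range(out_h)]

def relu(fmap):
    """Activation layer: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Pooling layer: keep the maximum of each size x size group."""
    return [[max(fmap[y + i][x + j]
                 for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]

def softmax(scores):
    """Turn label scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A 4 X 4 input passed through a 3 X 3 filter yields a 2 X 2 feature map, and
2 X 2 max pooling reduces it to a single value, which illustrates how each
stage shrinks the matrix.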


Pedestrian detection has numerous applications, such as person
identification, pedestrian tracking and counting, unusual event detection,
gender classification, crowd vicinity analysis, fall detection in elderly people
and autonomous driving systems.

1.6.1 Person identification

Person identification has received much attention in intelligent
surveillance systems. It is an important task in security based applications.
The task is to identify whether the pedestrian in a given image is present in
the gallery of images collected in the database. This system can also be
used for counting the total number of individual persons present in a given

1.6.2 Detection of pedestrians around Automated Guided Vehicles

A more challenging application in Automated Guided Vehicles
(AGVs) is the detection of pedestrians around the blind spot areas of the
trucks. Pedestrian detection in AGVs helps avoid collisions between the
vehicle and pedestrians, thus reducing the number of road accidents. The goal
of this system is to increase the safety of pedestrians around the vehicle by
alerting the driver to their presence, letting the vehicle slow down or
brake when a pedestrian is found nearby, and so on. Accuracy of detection is
important in this application to avoid false detections. However, missed
detections are more likely to lead to fatal accidents than false detections.

1.6.3 Automatic capturing of web lectures and presentations

During the video capture of web lectures and presentations, the
pedestrian detection system automatically follows the lecturer by tilting the
cameras, without the involvement of a cameraman.
The challenge in this application is the presence of occlusion, where
pedestrian detection needs to use multiple models to identify the human
body parts separately and then combine them for detection.

1.6.4 Fall detection in elderly people

Pedestrian detection can assist caretakers in remotely monitoring
the activities of elderly persons. This system involves pedestrian detection
and tracking of the elderly person's activity through a surveillance camera.
If an anomalous activity such as a fall is detected, an alarm or alert can be
sent to the caretakers. Activity detection remains complex and challenging in
this system.


Chapter 2 presents a literature survey of the different techniques used
for pedestrian detection. The objective of the review in chapter 2 is to
identify the existing problems in pedestrian detection and the opportunities
for implementing ML and DL models in computer vision applications. Various
traditional approaches, machine learning approaches and deep learning
approaches to pedestrian detection are surveyed in chapter 2.

In chapter 3, the performance of traditional approaches to pedestrian
detection is measured. The main objective of chapter 3 is to enhance the
efficiency of extracting image features and to improve the overall performance
of classifying them. A HOG based approach is proposed for extracting the
features, and SVM is proposed for classifying them. The SVM based pedestrian
detection model is validated by comparing its performance with the Naïve Bayes
approach.

In chapter 4, a novel Hybrid Meta-heuristic approach for Pedestrian
Detection (HMPD) is implemented for pedestrian classification. The objective
of chapter 4 is to overcome the limitations of ML in working with
high dimensional data. Since SVM is observed to be good at selecting the
healthy features and Genetic Algorithm (GA) is good at classification, SVM
and GA were used for the process of hybridization. In chapter 4, the HMPD
approach is proposed as a metaheuristic approach for pedestrian detection.
The performance of HMPD is analyzed by comparison with the SVM based
traditional ML approach.

The main objective of chapter 5 is to identify the best DL model for
pedestrian detection. VGG-16, a pre-trained deep learning
architecture based on the Convolution Neural Network (CNN) model, is proposed
for pedestrian detection. The performance of the proposed VGG-16
architecture is validated by comparison with the RESNET and HMPD models.

Chapter 6 aims to optimize the hyper-parameters of the pre-trained CNN
model to enhance the accuracy of pedestrian detection. A novel optimized
version of VGG-16 (OVGG-16) is proposed. In OVGG-16, various hyper-parameters
of the existing VGG-16 are optimized to enhance the overall performance of
pedestrian detection. The performance of the proposed OVGG-16 architecture is
validated by comparison with the VGG-16 model.

Chapter 7 concludes the findings of the proposed research work
and furnishes suggestions for future work.


In this chapter, an introduction to pedestrian detection and the process
involved in it is given. Different methods available for pedestrian detection
are also discussed, along with the importance of pedestrian detection and its
applications in different fields. Finally, an outline of the entire thesis is
given. The following chapter gives a detailed survey of
