Embedded Computer Vision System Detects Traffic Objects

An Embedded Computer-Vision System for Multi-Object Detection in Traffic Surveillance
Report On An Embedded Computer-Vision
System for Multi-Object Detection in Traffic Surveillance
AKASH H DEEPAK
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
BANGALORE INSTITUTE OF TECHNOLOGY, BANGALORE, INDIA
Abstract: Intelligent traffic systems for traffic surveillance and monitoring have become a
topic of great interest to cities around the world. Generally, the existing traffic surveillance
systems are made up of costly equipment with complicated operational procedures and have
difficulties with congestion, occlusion, and lighting transitions between day and night. In
this paper, we propose an embedded system for traffic surveillance that can be utilized under
these challenging conditions. This system analyses traffic and particularly focuses on the
problem of detecting and categorizing traffic objects in several traffic scenarios. Moreover, it
contains a robust detector produced by an original specialization framework. The proposed
specialization framework utilizes a generic deep detector so as to improve the detection
accuracy in a specific traffic scenario. The experiments demonstrate that the proposed
specialization framework presents encouraging results for multi-traffic object detection and
outperforms the state-of-the-art specialization frameworks on several public traffic datasets.
CONTENTS
1. INTRODUCTION
2. EXISTING SYSTEM
3. NEED FOR PROPOSED SYSTEM
4. PROPOSED SYSTEM
5. EXPERIMENTS
6. RESULTS AND ANALYSIS
7. CONCLUSION
8. REFERENCES
Department of Electronics and Communication (VLSI & ES)
1. INTRODUCTION
Vision-based traffic surveillance is still a complicated part of any traffic surveillance
system due to several factors such as illumination variations, camera calibration and
time-of-day conditions. Accordingly, such systems are no longer confined to prototypes in
research labs; they are exposed to real-world problems. This demand makes the
task of building such a system highly challenging, especially when accuracy and speed are
required. Over the past decade, there has been a significant effort dedicated to the
development of traffic surveillance systems, which are intended to raise safety by monitoring
the on-road environment.
Besides, we have seen the emergence of computing platforms geared towards parallelization,
such as graphics processing units and multi-core processors. Such hardware advances allow
computer vision approaches for traffic surveillance to run in real time.
However, the performance of the existing systems depends much on their
traffic object detector and it is notable that a traffic surveillance system becomes more
reliable if it has a robust detector. This paper provides an embedded system for traffic
surveillance which integrates recent advances in the computer vision domain such as transfer
learning and deep learning techniques in order to enhance the detection performances in both
day and night conditions. We suggest a deep detector adopted from the Faster R-CNN deep
model for multi-traffic object detection and we put forward a specialization framework
inspired from a Sequential Monte Carlo (SMC) Faster R-CNN so as to automatically
specialize a generic deep detector to a target scene.

Fig(1):-A General Synoptic of the Proposed Framework for an Embedded Computer Vision System
Fig(2):-Represents the Deep Detector in both Day and Night Conditions
2. EXISTING SYSTEM
This section describes related work on transfer learning and deep learning. Transfer
learning addresses the problem that arises when the distribution of the training data from the
source domain differs from that of the target domain. Transfer learning using deep
models has turned out to be effective in challenges such as traffic-object detection.
Deep learning is a recent advancement that has attracted growing attention in
computer vision and machine learning for its performance in various tasks such as face
recognition, action recognition, image classification and vehicle detection.
At present, vision-based vehicle detection is divided into traditional machine vision
methods and more complex deep learning methods. Traditional machine vision methods use the
motion of a vehicle to separate it from a fixed background image.
These methods can be divided into three categories:-
1. Background subtraction.
2. Continuous video frame difference.
3. Optical flow.
Using the video frame difference method, the variance is calculated according to the pixel
values of two or three consecutive video frames. By using this method and suppressing noise,
the stopping of the vehicle can also be detected. When the background image in the video is
fixed, the background information is used to establish the background model.
Then, each frame image is compared with the background model, and the moving object can
also be segmented. The method of using optical flow can detect the motion region in the
video. The generated optical flow field represents each pixel’s direction of motion and pixel
speed. Vehicle detection methods using vehicle features, such as the Scale Invariant Feature
Transform (SIFT) and Speeded Up Robust Features (SURF) methods, have been widely
used. For example, 3D models have been used to complete vehicle detection and
classification tasks. Using the correlation curves of 3D ridges on the outer surface of the
vehicle, the vehicles are divided into three categories: cars, SUVs, and minibuses.
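The frame difference and background subtraction methods described above can be sketched in a few lines. This is only an illustration on a synthetic greyscale scene: the threshold of 25, the running-average background update, and all function names are assumptions made for the example, not parameters taken from any of the surveyed systems.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose grey level changed by more than `threshold`."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

def update_background(background, frame, learning_rate=0.05):
    """Running-average background model (one simple choice among many)."""
    return (1.0 - learning_rate) * background + learning_rate * frame

def background_subtraction_mask(background, frame, threshold=25):
    """Mark pixels that differ from the background model."""
    return np.abs(frame.astype(np.float64) - background) > threshold

# Synthetic 8-bit scene: a bright 2x2 "vehicle" moves one pixel to the right.
prev_frame = np.zeros((6, 6), dtype=np.uint8)
curr_frame = np.zeros((6, 6), dtype=np.uint8)
prev_frame[2:4, 1:3] = 200
curr_frame[2:4, 2:4] = 200

# Frame difference only sees the pixels that changed between the two frames.
motion_mask = frame_difference_mask(prev_frame, curr_frame)

# Background subtraction sees the whole vehicle against an empty-road model.
background = np.zeros((6, 6), dtype=np.float64)
foreground_mask = background_subtraction_mask(background, curr_frame)
```

Note that the frame difference mask misses the part of the vehicle that overlaps its previous position, whereas the background model recovers the full object as long as the background is fixed.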
3. NEED FOR PROPOSED SYSTEM
Several systems have been proposed by research groups around the world to solve the problems of
traffic object detection and tracking in traffic surveillance systems. Some of them are
described in “A Survey of vision-based vehicle detection, tracking, and behavior
analysis” by Sivaraman and Trivedi, and in “A Computer Vision model for Vehicle Detection in
Traffic Surveillance” by Neelima and Mamidisetti. These systems use motion detection to
recognize traffic objects as moving blobs and to track those blobs over a number of subsequent
frames. Such systems can be divided into three categories. The first category involves
extracting specific features to represent the traffic object for classification. Here, the
image patch of a traffic object is described through its relatively consistent
structural components, using image strip features that represent
local traffic components such as bumpers and pillars. The second category focuses on
designing a classifier that can be used for multi-view traffic object
detection. The last category focuses on designing both a descriptor and a classifier using
Deep Convolutional Neural Network (DCNN) models. A DCNN is a multi-stage
architecture composed of convolution and pooling layers. These models can detect
pedestrians at different spatial scales by using a small-size subnetwork. In this system, we
propose a specialization framework that automatically generates a
specialized deep detector for multi-object traffic detection, usable in both day
and night conditions.
4. PROPOSED SYSTEM
The proposed deep detector is built on the Faster R-CNN model, chosen for its
superior performance and computational efficiency in detecting multiple objects. R-CNN
stands for Region-based Convolutional Neural Network; it is a family of machine learning
models for computer vision, and specifically for object detection. Table-1 lists the
differences between the Faster R-CNN deep detector and the MF R-CNN detector.
Specification/Architecture    Faster R-CNN       MF R-CNN
Number of pooling layers      4                  3
Types of pooling layers       Max-pooling        Stochastic
Objects                       General Objects    Traffic Objects
Some improvements over the previously proposed framework are:-
1. We have removed the 4th max-pooling layer present in the Faster R-CNN model,
for both training and testing, so that the network produces larger feature maps for
small-size object proposals.
2. We replace the max-pooling layers with stochastic pooling. These pooling
layers provide more information than the previous pooling strategies and also
give greater flexibility in choosing the output image size.
3. The suggested framework automatically specializes a generic detector to a target
scene for multi-object traffic detection, in contrast to state-of-the-art
specialization frameworks, which were limited to single-object traffic detection.
4. The developed framework can be used to specialize any deep detector for mobile
and stationary cameras.
5. The proposed framework is capable of specializing a
traffic object detector in both day and night conditions. This is not possible in
previous frameworks such as the SMC Faster R-CNN framework, because it uses a
likelihood function based on a background-subtraction spatio-temporal cue to favour
the selection of positive samples from a specific scene, and this cue is not
applicable in night-time conditions.
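The report does not spell out the stochastic pooling rule used in item 2. The sketch below assumes the commonly used formulation: during training, one activation per pooling region is sampled with probability proportional to its value, and at test time the probability-weighted average is used. The 2x2 window, the function name `stochastic_pool_2x2` and the toy feature map are illustrative assumptions, not details of MF R-CNN.

```python
import numpy as np

def stochastic_pool_2x2(feature_map, rng, training=True):
    """Stochastic pooling over non-overlapping 2x2 regions.

    Assumes non-negative activations (e.g. after a ReLU) and even height/width.
    """
    h, w = feature_map.shape
    # Rearrange into (h/2, w/2, 4): the 4 activations of each 2x2 region.
    regions = (feature_map.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    totals = regions.sum(axis=-1, keepdims=True)
    # All-zero regions get uniform probabilities to avoid dividing by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    probs = np.where(totals > 0, regions / safe_totals, 0.25)
    if not training:
        # Test-time rule: probability-weighted average of the activations.
        return (probs * regions).sum(axis=-1)
    pooled = np.empty((h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            idx = rng.choice(4, p=probs[i, j])  # sample one activation
            pooled[i, j] = regions[i, j, idx]
    return pooled

feature_map = np.array([[1.0, 0.0, 2.0, 2.0],
                        [0.0, 3.0, 2.0, 2.0],
                        [0.0, 0.0, 5.0, 0.0],
                        [0.0, 0.0, 0.0, 0.0]])
rng = np.random.default_rng(0)
train_out = stochastic_pool_2x2(feature_map, rng, training=True)
test_out = stochastic_pool_2x2(feature_map, rng, training=False)
```

Unlike max-pooling, which would always return 3 for the top-left region, the training pass returns 1 or 3 with probabilities 0.25 and 0.75, so weaker activations still contribute.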
The suggested framework improves on the existing SMC Faster R-CNN: it specializes
the proposed deep detector to each traffic scene for precise classification
in both day and night conditions.
The specialized MF R-CNN network combines a new architecture, which improves the detection of
small-sized objects, with a tracking algorithm that favors the selection of target samples.
Fig(3):- Architecture of the suggested MF R-CNN Network
Architecture/Specification    Existing Framework        Proposed Framework
Generic Detector              Faster R-CNN              MF R-CNN
Likelihood Function           Background Subtraction    Tracklets
Daytime Conditions            Day                       Day and Night
Objects                       Single Object             Multiple Objects
Table-2 describes the differences between the existing SMC Faster R-CNN framework and the
proposed framework. Figure(3) illustrates the architecture of the MF R-CNN network. The detector
contains two modules. The first module is a Region Proposal Network (RPN) that provides a
set of rectangular object proposals from an input image. The second module is a Fast
R-CNN model that takes the set of object proposals as input and classifies them.
An RPN is a fully convolutional network constructed by adding two
additional convolutional layers. One layer encodes each convolutional map position
into a short feature vector. The other layer outputs, at each convolutional map position, an
objectness score and regressed bounds for the region proposals relative to various scales and
aspect ratios at that location. The RPN shares the rest of the convolutional layers with the
Fast R-CNN network.
As shown in Figure(3), MF R-CNN passes the input image through several convolutional layers and
stochastic pooling layers to extract a feature map. Then the RPN, a fully convolutional network,
is specifically designed to localize traffic objects in the feature map produced by the last
convolutional layer. The Region of Interest (RoI) pooling layer pools the feature
map of each input object proposal into a fixed-length feature vector, which is fed into a
sequence of fully connected layers. At the end of the network, two output layers
produce two output vectors for each object proposal.
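To illustrate how RoI pooling turns proposals of different sizes into vectors of the same length, here is a minimal sketch on a single-channel feature map. The 2x2 output grid, the bin-splitting rule and the function name `roi_max_pool` are assumptions chosen for brevity; they are not the settings of the MF R-CNN network.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region `roi` of a 2-D feature map onto a fixed grid.

    roi = (x1, y1, x2, y2) in feature-map cells, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    # Bin edges split the region as evenly as whole cells allow.
    y_edges = np.linspace(y1, y2, out_h + 1)
    x_edges = np.linspace(x1, x2, out_w + 1)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            ys, ye = int(np.floor(y_edges[i])), int(np.ceil(y_edges[i + 1]))
            xs, xe = int(np.floor(x_edges[j])), int(np.ceil(x_edges[j + 1]))
            pooled[i, j] = feature_map[ys:ye, xs:xe].max()
    return pooled

feature_map = np.arange(36, dtype=np.float64).reshape(6, 6)
small = roi_max_pool(feature_map, (1, 1, 3, 3))   # 2x2 proposal
large = roi_max_pool(feature_map, (0, 0, 4, 6))   # 4 wide, 6 high proposal
# Both proposals now flatten to vectors of length 4.
small_vec, large_vec = small.ravel(), large.ravel()
```

Whatever the proposal size, the flattened output has the same length, which is what allows the fully connected layers to process every proposal.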
A. Specialization of MF R-CNN Network:-
Here we present the specialization framework for the MF R-CNN deep detector in a target
scene:-
1. Step-1:- A generic detector is trained on a generic dataset. Given the videos taken by a
stationary or mobile camera in specific scenes, at the first iteration (k = 0), the generic
detector is applied in a prediction step to detect traffic object candidates in each
individual image; these candidates may include many positive and negative detections.
2. Step-2:- A likelihood function based on the tracking algorithm assigns a
specific weight to each proposal sample from the specific scene.
3. Step-3:- A sampling function determines which samples are
included in the specialized dataset, using an IR algorithm derived from the Monte
Carlo filter. The IR algorithm converts each weight produced by the likelihood
function in the previous step into a number of repetitions: samples with
higher weights are repeated and take the place of samples with lower weights, which are
kept in smaller numbers.
$I_t = \{I^{(i)}\}_{i=1}^{I}$ is a set of unlabeled images extracted uniformly from a video sequence of a target
scene.
$D_k = \{X_k^{(n)}\}_{n=1}^{N_k}$ is the specialized dataset at iteration $k$, where each $X_k^{(n)}$ is a target object sample to be detected in the target images.
The target distribution can be approximated iteratively from equation (1):
$$p(X_k \mid Z_{0:k}) = C \, p(Z_k \mid X_k) \int p(X_k \mid X_{k-1}) \, p(X_{k-1} \mid Z_{0:k-1}) \, dX_{k-1} \qquad (1)$$
where $C$ is the normalization factor.
The formalism of the SMC filter is used to approximate the unknown joint distribution
between the features of the target samples and the associated class labels by a set of samples
that are initially unknown. The iterative process selects relevant samples
for the specialized dataset from one iteration to the next, converges toward the right target
distribution, and makes the resulting deep detector more and more efficient.
The resolution of equation (1) is divided into three steps: prediction, update and sampling.
1. Prediction:- The prediction step gives a set of suggestions composed of the traffic
proposals predicted by the output layers of the deep detector.
The prediction step is given by,
$$p(X_k \mid Z_{0:k-1}) = \int p(X_k \mid X_{k-1}) \, p(X_{k-1} \mid Z_{0:k-1}) \, dX_{k-1} \qquad (2)$$
In the output of this step, we keep only the samples which have a value of $F_{k-1}$ greater than a threshold $\alpha_s$.
Here we have fixed this threshold to 0.1.
2. Update:- A weight $\tilde{\pi}_k^{(n)}$ is estimated for each new target sample of the dataset
$\{\tilde{X}_k^{(n)}\}_{n=1}^{\tilde{N}_k}$ according to a likelihood function:
$$p(Z_k \mid X_k = \tilde{X}_k^{(n)}) \propto \tilde{\pi}_k^{(n)} \qquad (3)$$
The likelihood function employs visual information obtained by a tracking algorithm from
the target scene to assign a weight to each sample. The output of the update step is a set
of weighted target samples known as the “Weighted Target Dataset”:
$$\{(\tilde{X}_k^{(n)}, \tilde{\pi}_k^{(n)})\}_{n=1}^{\tilde{N}_k} \qquad (4)$$
3. Sampling:- The sampling step produces the new training dataset by deciding which
samples will be included in the produced dataset:
$$D_k = \{X_k^{(n)}\}_{n=1}^{N_k} = \mathrm{IR}\big(\{(\tilde{X}_k^{(n)}, \tilde{\pi}_k^{(n)})\}_{n=1}^{\tilde{N}_k}\big) \qquad (5)$$
This step generates the new dataset $D_k$ by drawing samples according to their weights $\tilde{\pi}_k^{(n)}$.
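The three steps can be put together in a short sketch of one specialization iteration. Everything concrete here is an assumption made for illustration: the proposal boxes, the scores, the weights (which in the report come from the tracklet-based likelihood function), the reading of the 0.1 threshold as a cut on the detector score, and the helper name `importance_resample`.

```python
import numpy as np

def importance_resample(samples, weights, n_out, rng):
    """Draw n_out samples with replacement, with probability proportional to weight."""
    weights = np.asarray(weights, dtype=np.float64)
    probs = weights / weights.sum()
    idx = rng.choice(len(samples), size=n_out, replace=True, p=probs)
    return [samples[i] for i in idx]

# Hypothetical detector outputs on target-scene images.
proposals = [
    {"box": (10, 20, 50, 60), "score": 0.92, "weight": 0.8},
    {"box": (12, 22, 52, 62), "score": 0.40, "weight": 0.6},
    {"box": (200, 40, 230, 70), "score": 0.15, "weight": 0.0},
    {"box": (5, 5, 15, 15), "score": 0.05, "weight": 0.9},
]
ALPHA_S = 0.1  # threshold value stated in the report

# Prediction: keep only proposals whose score exceeds the threshold.
predicted = [p for p in proposals if p["score"] > ALPHA_S]

# Update: each kept proposal carries a likelihood weight.
weights = [p["weight"] for p in predicted]

# Sampling: importance resampling builds the next specialized dataset.
rng = np.random.default_rng(0)
specialized = importance_resample(predicted, weights, n_out=10, rng=rng)
```

High-weight proposals appear several times in `specialized`, while a proposal with zero weight is never drawn, which matches the description of the IR step as converting weights into repetitions.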
B. Training Step:- The training step consists in fine-tuning the RPN and the Fast R-CNN
networks. Accordingly, we utilize a sliding window approach so as to generate T bounding
boxes for every position on the feature map that is produced by the last convolutional layer,
where every bounding box is centered on the sliding window and is associated with a scale
and an aspect ratio. Then, we calculate the Intersection-over-Union (IoU) overlap between
the boxes of the specialized dataset Dk and the bounding boxes generated by the sliding
window with different scales and ratios.
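As a rough sketch of this training step, the code below generates the bounding boxes for one sliding-window position and computes their IoU with a box from the specialized dataset. The three scales and three aspect ratios (so T = 9), the box coordinates and the function names are assumptions for illustration; the report does not state the values it uses.

```python
import numpy as np

def make_anchors(center_x, center_y, scales, ratios):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on one point.

    Each box has area scale**2 and a height/width ratio equal to `ratio`.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            width = scale / np.sqrt(ratio)
            height = scale * np.sqrt(ratio)
            boxes.append([center_x - width / 2, center_y - height / 2,
                          center_x + width / 2, center_y + height / 2])
    return np.array(boxes)

def iou_with_box(anchors, box):
    """IoU between each anchor (rows of x1, y1, x2, y2) and a single box."""
    ix1 = np.maximum(anchors[:, 0], box[0])
    iy1 = np.maximum(anchors[:, 1], box[1])
    ix2 = np.minimum(anchors[:, 2], box[2])
    iy2 = np.minimum(anchors[:, 3], box[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area_a + area_b - inter)

# T = 9 boxes for the sliding-window position (100, 100).
anchors = make_anchors(100, 100, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0))
sample_box = (68, 68, 132, 132)  # hypothetical box from the specialized dataset
overlaps = iou_with_box(anchors, sample_box)
best = int(np.argmax(overlaps))  # index of the best-matching box
```

The overlap values are what a training procedure would use to decide which sliding-window boxes correspond to samples of the specialized dataset.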

Fig(4):-Illustration of how Bounding Boxes are obtained by using a Sliding Window
C. Likelihood Function:- In order to choose the correct proposals, we put forward a likelihood
function based on a tracking method, which assigns a weight $\tilde{\pi}_k^{(n)}$ to each sample $\tilde{X}_k^{(n)}$. The
purpose of the likelihood function is to select the correct samples and reduce the risk of
including wrong proposal samples in the dataset. The output of this function is a set of
weighted target samples which approximates the probability function, with weights given by,
$$\tilde{\pi}_k^{(n)} = f_L(\tilde{X}_k^{(n)}) \qquad (6)$$
The likelihood function relies on a tracking method based on “tracklets” to assign weights to
target samples according to their importance.
The samples produced by the prediction step are tracked with a multi-object tracking
algorithm; the resulting short trajectories are referred to as tracklets. This
helps in the selection of positive samples. The tracklet tracking method is divided into 3 main
steps:-

1. Feature Extraction:- Each target sample produced by the prediction step is passed
through the feature extraction block to define its characteristics. The sample
is characterized by a position, produced by the output layer of our MF R-CNN
detector, and a characteristic vector that contains appearance information
determined by a color histogram.
2. Tracklet Generation:- After the feature extraction step, initial tracklets
(object trajectories) are constructed by associating samples. The association
between samples is done according to the IoU overlap and the appearance
similarity between successive frames. First, the IoU overlap is calculated
by comparing the bounding boxes of samples in successive frames. After that, we
compare the appearance of the overlapping samples in successive
frames; the appearance similarity is obtained by calculating the distance
between the two HSV histogram vectors associated with the overlapping samples.
3. Tracklet Association:- After the initial tracklets are constructed, we associate the tracklets
having similar signatures. The signature contains the appearance information
determined by a color histogram (HSV).
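The frame-to-frame association in the tracklet generation step can be sketched as below. The report does not give the histogram distance or the thresholds, so the L1 distance, the values `min_iou=0.3` and `max_hist_dist=0.5`, the greedy matching order and the three-bin histograms are all assumptions for illustration.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def hist_distance(h1, h2):
    """L1 distance between normalized histograms, in [0, 2] (assumed metric)."""
    h1 = np.asarray(h1, dtype=np.float64)
    h2 = np.asarray(h2, dtype=np.float64)
    return float(np.abs(h1 / h1.sum() - h2 / h2.sum()).sum())

def link_detections(prev_dets, curr_dets, min_iou=0.3, max_hist_dist=0.5):
    """Greedy association; returns {index in curr_dets: index in prev_dets}."""
    links, used_prev = {}, set()
    for j, cur in enumerate(curr_dets):
        best_i, best_iou = None, min_iou
        for i, prev in enumerate(prev_dets):
            if i in used_prev:
                continue
            overlap = iou(prev["box"], cur["box"])
            similar = hist_distance(prev["hist"], cur["hist"]) <= max_hist_dist
            if overlap >= best_iou and similar:
                best_i, best_iou = i, overlap
        if best_i is not None:
            links[j] = best_i
            used_prev.add(best_i)
    return links

# Hypothetical samples in two successive frames, with toy colour histograms.
prev_dets = [{"box": (10, 10, 50, 50), "hist": [8, 1, 1]},
             {"box": (100, 100, 140, 140), "hist": [1, 1, 8]}]
curr_dets = [{"box": (14, 10, 54, 50), "hist": [7, 2, 1]},
             {"box": (102, 100, 142, 140), "hist": [8, 1, 1]}]
links = link_detections(prev_dets, curr_dets)
```

In this toy case the first current sample is linked to the first previous one, while the second is rejected even though its box overlaps well, because its appearance differs too much.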
Fig(5):-Description of Training Step.
Fig(6):-Description of Tracklet Steps

5. EXPERIMENTS
Here we evaluate the proposed specialization framework on several public and private
datasets: the PASCAL VOC 2007 dataset, the CUHK Square dataset, the Logiroad Traffic
dataset and the Traffic Night dataset.
1. Datasets:-
PASCAL VOC 2007 dataset:- This is the source dataset, which is used
to fine-tune the proposed generic MF R-CNN. It consists of 5,011 trainval
images and 4,952 test images over 20 object categories. In our experiments, we use 713
annotated cars and 2,008 annotated people to fine-tune the generic MF R-CNN.
CUHK Square dataset:- This is a public video sequence of road traffic which lasts
60 minutes. 352 images, extracted uniformly from the video, are used for
specialization. 100 images, extracted from the last 30 minutes, are used for the test.
Logiroad Traffic dataset:- This is a public video sequence of road traffic which
lasts 20 minutes. We use 600 images for specialization, extracted uniformly from
the first 15 minutes of the video. 100 images, extracted from the last 5 minutes,
are used for the test.
Traffic Night dataset:- This is a private video sequence of road traffic at nighttime
which lasts 4 minutes. We use 300 images for specialization, extracted uniformly
from the first 3 minutes of the video. 100 images, extracted from the last minute,
are used for the test.
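Uniform extraction of the specialization images can be sketched as picking evenly spaced frame indices. The frame rate of 25 frames per second and the function name `uniform_frame_indices` are assumptions; the report does not state the frame rate of the videos.

```python
import numpy as np

def uniform_frame_indices(duration_s, fps, n_images):
    """Indices of n_images frames evenly spaced over the first duration_s seconds."""
    n_frames = int(duration_s * fps)
    return np.linspace(0, n_frames - 1, num=n_images).round().astype(int)

FPS = 25  # assumed frame rate, not given in the report

# Logiroad Traffic dataset: 600 images from the first 15 minutes.
logiroad_idx = uniform_frame_indices(15 * 60, FPS, 600)
# Traffic Night dataset: 300 images from the first 3 minutes.
night_idx = uniform_frame_indices(3 * 60, FPS, 300)
```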

Fig(7):-Graph shows the Receiver Operating Characteristic (ROC) curve of the specialization process on the Logiroad Traffic dataset.
6. RESULTS AND ANALYSIS
Here we compare the ROC curves of the generic MF R-CNN,
the specialized MF R-CNN and the state-of-the-art algorithms. The true detection
rate is compared at a constant false positive rate for several methods on several
datasets and annotations.
Approach/Architecture                    CUHK_WP    CUHK_MP
Generic Faster R-CNN                     0.60       0.69
SMC Faster R-CNN                         0.65       0.88
Generic MF R-CNN                         0.71       0.74
Specialized MF R-CNN                     0.75       0.90
Improvement/MF R-CNN and Faster R-CNN    18%        6%
Improvement/Generic MF R-CNN             6%         22%

Table-3 compares the detection rates for pedestrians with the state of the art. Table-4
compares the detection rates for cars with the state of the art.
Approach/Architecture                    Logiroad_MV
Generic Faster R-CNN                     0.40
SMC Faster R-CNN                         0.70
Generic MF R-CNN                         0.48
Specialized MF R-CNN                     0.75
Improvement/MF R-CNN and Faster R-CNN    20%
Improvement/Generic MF R-CNN             57%
The last two lines in each table give the improvement of the generic MF R-CNN over the
generic Faster R-CNN, and the improvement of the specialized MF R-CNN over the generic
MF R-CNN. Table-5 compares the detection rates on the Traffic Night dataset with the
state of the art.
Approach/Architecture                    Nightroad_MV
Generic Faster R-CNN                     0.27
SMC Faster R-CNN                         --
Generic MF R-CNN                         0.38
Specialized MF R-CNN                     0.60
Improvement/MF R-CNN and Faster R-CNN    40%
Improvement/Generic MF R-CNN             57%
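The improvement rows in the tables appear to be relative gains in detection rate. The sketch below recomputes them under that assumption for three of the columns; the function name and dictionary keys are illustrative, and since the report does not state its rounding convention, the recomputed values may differ from the printed percentages by about one point.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

# Detection rates copied from Table-3, Table-4 and Table-5.
rates = {
    "CUHK_WP": {"generic_faster": 0.60, "generic_mf": 0.71, "specialized_mf": 0.75},
    "Logiroad_MV": {"generic_faster": 0.40, "generic_mf": 0.48, "specialized_mf": 0.75},
    "Nightroad_MV": {"generic_faster": 0.27, "generic_mf": 0.38, "specialized_mf": 0.60},
}

for dataset, r in rates.items():
    # Gain from the new architecture alone (generic MF vs generic Faster R-CNN).
    arch_gain = relative_improvement(r["generic_faster"], r["generic_mf"])
    # Gain from specialization (specialized MF vs generic MF R-CNN).
    spec_gain = relative_improvement(r["generic_mf"], r["specialized_mf"])
    print(f"{dataset}: architecture {arch_gain:.1f}%, specialization {spec_gain:.1f}%")
```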


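Fig(8):-Graph shows the Receiver Operating Characteristic (ROC) curve of the specialization process on the CUHK_WP Traffic dataset.
Fig(9):-Graph shows the Receiver Operating Characteristic (ROC) curve of the specialization process on the Logiroad_MV Traffic dataset.
Fig(10):-Graph shows the Receiver Operating Characteristic (ROC) curve of the specialization process on the CUHK_MP Traffic dataset.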

Fig(11):-Image of hardware components of proposed embedded system.

The above image shows the embedded system used to implement multi-object
traffic detection. For the hardware components (Figure 11) of the proposed embedded
system, we use a recent mobile device, the Tegra TX1 board. The Tegra TX1 is an
embedded-system device developed by NVIDIA. It delivers the
performance required for the latest visual computing applications. It is built around the
NVIDIA Maxwell architecture with 256 CUDA cores delivering over 1 teraFLOPS of
performance, 64-bit CPUs, and a camera with 5 megapixels. Please refer to Appendix A for
more technical specifications of the hardware components.
Table-6 lists the different components and their specifications.
Component           Specification
Camera              5 megapixels
GPU                 1 TFLOP/s, 256-core
CPU                 64-bit ARM A57 CPU
Operating System    Ubuntu Linux 14.04 LTS
Memory              4 GB LPDDR4 | 25.6 GB/s
CSI                 Up to 6 cameras | 1400 Mpix/s
Connectivity        Connects to 802.11ac Wi-Fi
Networking          1 Gigabit Ethernet
Storage             16 GB eMMC, SDIO, SATA
7. CONCLUSION
Here we have demonstrated an embedded system for multi-object detection in traffic
surveillance, which includes a new architecture of a deep detector adapted from the Faster
R-CNN and an original specialization framework for a traffic object detector. Given a generic
detector and a target video sequence, this framework automatically provides a specialized
traffic-object detector. The experiments show that the proposed
approach produces a robust traffic object detector which performs well
in different scenes and in both day and night conditions. This detector has surpassed
the state-of-the-art performance on several challenging benchmarks. Our future work will
extend the algorithm to improve the likelihood function with more
complex visual cues, such as optical flow or contextual information, and to inject some
spatio-temporal information into our MF R-CNN network.
8. REFERENCES
1. A. Mhalla, T. Chateau, S. Gazzah, and N. E. Ben Amara, “An Embedded Computer-Vision System for Multi-Object Detection in Traffic Surveillance.”
2. A. Mhalla, H. Maâmatou, T. Chateau, S. Gazzah, and N. E. Ben Amara, “Faster R-CNN scene specialization with a sequential Monte-Carlo framework.”
3. S. Sivaraman and M. M. Trivedi, “Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis.”
4. G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical representations for face verification with convolutional deep belief networks.”

