
OBJECT DETECTION USING

TENSORFLOW

PROJECT REPORT
OF MAJOR PROJECT (EASYKART)

MASTER OF COMPUTER APPLICATIONS


2017-2019

Supervised By Submitted By
Reema Lalit Parul Kataria
Assistant Professor Roll No. 1252159

PANIPAT INSTITUTE OF ENGINEERING &
TECHNOLOGY, PANIPAT

DECLARATION

I, Parul, a student of Master of Computer Applications, in the Department of Computer


Applications, Panipat Institute of Engineering & Technology, Samalkha, under class Roll
No. 1252159 for the session 2017-2019, hereby declare that the Project Report entitled
“Easykart” has been completed by me in 6th semester during the six months project
training. I hereby declare that the matter embodied in this Project is my original work and
has not been submitted earlier for award of any degree or diploma in any college or
University.

Date: PARUL

CERTIFICATE

Department of Computer Applications


Panipat Institute of Engineering & Technology, Samalkha

It is certified that Ms. Parul, a student of Master of Computer Applications, under class
Roll No. 1252159 for the session 2017-2019, has completed the project entitled Easykart
under my supervision.
This project report is an authentic work of the candidate, as per his/her declaration and is
found to be fit for the award of Master’s degree in Computer Applications in accordance
with the rules and regulations of PIET, Panipat as per my opinion.

I wish him/her all success in all his/her endeavours.

Ms. Reema Lalit


Assistant Professor
BCA/MCA Department

ACKNOWLEDGEMENT

It has been rightly remarked, “Success is the satisfactory achievement of chosen and
desired objective. It is the attainment of the major objects; you earnestly desired and
worked for with burning enthusiasm and dedication.”
Before we get into things, I would like to share a few heartfelt words with the people who
were part of this project in numerous ways, people who gave unending support from the
beginning.
The successful completion of the project is a combined effort of a number of people, and
all of them have their own importance in the achievement of the objective.
I cannot miss this opportunity to thank Ms. Reema Lalit as a mentor for her timely
support and valuable guidance throughout the project.
Last but not the least I would also like to thank my parents & project mates for being a
pillar of strength and support in times of stress & difficulty throughout the project
duration.

(Parul)

COMPANY PROFILE

INFLUENCE TECHNOLABS PVT. LTD

Established in 2012, Influence Technolabs Pvt. Ltd. has always had the vision of coming up
with world-class products and key services related to online systems, i.e. travel,
insurance, ERP, CRM, SCM, etc. Founded by Ajay Kumar Jain and Malini Jain, the
company has grown to a strength of more than 200 specialists working in various state-of-
the-art development facilities located worldwide.

Ajay is an engineering graduate from Delhi College of Engineering with more than 12
years of experience. Malini Jain is a commerce graduate of Sri Ram College of Commerce
with more than 10 years of experience.

Today, Influence Technolabs Pvt. Ltd. is among the largest service providers of online
booking technology. We facilitate global clients with a wide product range and
technologies that have been crafted keeping the future in mind.

Our technology solutions are vast and include creative design, solution definition, mobile
application development, product development, hotel consolidators, extranet systems,
payment gateway integration, online booking engines, online insurance, POS,
ERP, CRM, SCM, automation and AI, and many others. This variety of services
helps us serve several clients, including travel agencies (both online and offline),
destination management companies, tour operators, consortia, consolidators,
insurance brokers, insurance agents, aggregators, manufacturing and trading companies, etc.

Our focus at Influence Technolabs Pvt. Ltd. has always been on reducing the clients'
'cost vs. capability' ratio while providing them with an easy interface to work on. All
our products integrate the right mix of technical finesse and business solutions – the key to
growth in the industry. We further provide customized technology solutions to suit the
unique requirements of our clients.

We at Influence Technolabs have been able to create a highly fulfilling workplace. Our
desks are rated among the most wanted in this industry, and this has given an
extra edge to our organization. We reach out to large and small companies alike,
providing solutions that enable them to grow in the online travel business with minimal
investment and no need for programmers, IT professionals, or expensive servers and
hosting.

Company Products

We provide IT solutions that are vast and include creative design, solution definition,
mobile application development, product development, hotel consolidator, extranet
systems, payment gateway integration, online booking engine, insurance API, insurance
Portal, AI based technology and many others.

• Sales Interface (B2B, B2C, B2E, B2B2B & B2B2C)


• Third party XML Integrations, White Label.
• GDS & LCC XML Integration (Amadeus, Galileo and Sabre etc.)
• Payment Gateway Integration
• Mid office to control the booking engine
• Mirror Ticketing
• Complete CRM Solution
• Convert Offline Document into Online Booking
• PNR Sync
• Customer Profiling

ABBREVIATIONS

CNN Convolutional Neural Network

CPU Central Processing Unit

FC Fully Connected (layer or network)

FCN Fully Convolutional Network

FPS Frames Per Second

GPU Graphics Processing Unit

NMS Non-maximum suppression

OiP Putting Objects in Perspective

R-CNN Convolutional Neural Network with Region proposals

RoI Region of Interest

RPN Region Proposal Network

SSD Single Shot MultiBox Detector

SVM Support Vector Machine

IoU Intersection Over Union

LIST OF FIGURES

Figure Figure Name Page no

Figure 2.4.2 Performance Graph 17

Figure 5.3.1 Zero level DFD 33

Figure 5.3.2 First level-pre processing 33

Figure 5.3.3 First level -processing 34

Figure 5.3.4 First level- recognition 34

Figure 5.3.5 First level-Testing image 35

Figure 5.3.6 Second level DFD 35

Figure 5.3.7 Use Case Diagram 36

Figure 5.3.8 Sequence Diagram 36

LIST OF TABLES

Table Table Name Page no

Table7.1 Train_labels.csv 53

Table7.2 Train_labels.csv 54

Table7.3 Test_labels.csv 54

Table7.4 Train_labels.csv 55

Table 9.3.2 Testing 66

ABSTRACT

There is an ever-increasing amount of image data in the world, and the rate of growth
itself is increasing. Infotrends estimates that in 2016 still cameras and mobile devices
captured more than 1.1 trillion images. According to the same estimate, in 2020 the
figure will increase to 1.4 trillion. Many of these images are stored in cloud services or
published on the Internet. In 2014, over 1.8 billion images were uploaded daily to the
most popular platforms, such as Instagram and Facebook.

Going beyond consumer devices, there are cameras all over the world that capture images
for automation purposes. Cars monitor the road, and traffic cameras monitor the same
cars. Robots need to understand a visual scene in order to smartly build devices and sort
waste. Imaging devices are used by engineers, doctors and space explorers alike.

To effectively manage all this data, we need to have some idea about its contents.
Automated processing of image contents is useful for a wide variety of image-related
tasks. For computer systems, this means crossing the so-called semantic gap between the
pixel level information stored in the image files and the human understanding of the same
images. Computer vision attempts to bridge this gap.

Objects contained in image files can be located and identified automatically. This is
called object detection and is one of the basic problems of computer vision. As we will
demonstrate, convolutional neural networks are currently the state-of-the-art solution for
object detection. The main task of this thesis is to review and test convolutional object
detection methods.

In the theoretical part, we review the relevant literature and study how convolutional
object detection methods have improved in the past few years. In the experimental part,
we study how easily a convolutional object detection system can be implemented in
practice, test how well a detection system trained on general image data performs in a

specific task and explore, both experimentally and based on the literature, how the
current systems can be improved.

TABLE OF CONTENTS

Declaration i
Certificate ii
Acknowledgement iii
Abstract vi

Ch. No. Contents Page No.
1 Introduction
1.1. Project Purpose 2
1.2. Project Scope 3
1.3. Project Objectives 3
1.4. Project Goals 4
2 System Requirement Specifications
2.1. Scope 5
2.2. Objectives 6
2.3. Product Description 6
2.4. Specific Requirements 7
2.5. Functional and Non-Functional Requirements 8
2.6. System Attributes 8
3 System Design
3.1. Use Case Diagram 10
3.2. Sequence Diagram 14
3.3. Activity Diagram 15
3.4. Data Flow Diagram 16
3.5. E-R Diagram 20
3.6. Data Modelling 21
4 Source Code 30
5 Output Screens 66
6 Testing 76
7 Conclusion and Discussion
7.1. Future Scope of the Project 85
7.2. Self-analysis of project viabilities 85
7.3. Problem encountered and possible solutions 85
7.4. Summary of the project work 85

CHAPTER 1
PROJECT INTRODUCTION

1.1 Objective

The goal of “object detection” is to find the location of an object in a given picture
accurately and mark the object with the appropriate category. To be precise, the problem
that object detection seeks to solve involves determining where the object is, and what it
is. However, solving this problem is not easy. Unlike the human eye, a computer processes
images in two dimensions. Furthermore, the size of the object, its orientation in the space,
its attitude, and its location in the image can all vary greatly.

1.2 Introduction

Object detection is a technologically challenging and practically useful problem in the field
of computer vision. Object detection deals with identifying the presence of various
individual objects in an image. Great success has been achieved on the object
detection/recognition problem in controlled environments, but the problem remains unsolved
in uncontrolled settings, in particular when objects are placed in arbitrary poses in cluttered
and occluded environments. As an example, it might be easy to train a domestic help robot
to recognize the presence of a coffee machine with nothing else in the image.

On the other hand, imagine the difficulty such a robot would have in detecting the machine
on a kitchen slab that is cluttered with other utensils, gadgets, tools, etc. The searching or
recognition process in such a scenario is very difficult. So far, no effective solution has been
found for this problem. A lot of research has been done in the area of object recognition and
detection during the last two decades. The research on object detection is multi-disciplinary
and often involves the fields of image processing, machine learning, linear algebra,
topology, statistics/probability, optimization, etc. The research innovations in this field
have become so diverse that getting a first-hand summary of most state-of-the-art
approaches is quite difficult and time consuming.

The approach used incorporates four computer vision and machine learning concepts:
sliding windows to extract sub-images from the image, feature extraction to get meaningful
data from the sub-images, Support Vector Machines (SVMs) to classify the objects in each
sub-image, and Principal Component Analysis (PCA) to improve efficiency. As a model
problem for the motivating application, we focused on the problem of recognizing objects
in images, in particular soccer balls and sunflowers. For this algorithm to be useful as a
real-time aid to the visually impaired, it would have to be enhanced to distinguish between
“close” and “far” objects, as well as provide information about the relative distance between
the user and the object, etc. We do not consider these complications in this project; we
focus on the core machine learning issues of object recognition. The training and testing of
the proposed algorithm was done using data sets.

Detecting objects
Fig. 1.2.1

1.3 Applications

1.3.1 Facial Recognition

Fig. 1.3.1

A deep learning facial recognition system called DeepFace has been developed by a
group of researchers at Facebook; it identifies human faces in digital images very
effectively. Google uses its own facial recognition system in Google Photos, which
automatically segregates all the photos based on the person in the image. Facial
recognition involves various components of the face, such as the eyes, nose, mouth and eyebrows.

1.3.2 People Counting

Fig.1.3.2

Object detection can also be used for people counting, for example to analyze store
performance or crowd statistics during festivals. This tends to be more difficult, as people
move out of the frame quickly.

1.3.3 Industrial Quality Check

Fig. 1.3.3

Object detection is also used in industrial processes to identify products. Finding a specific
object through visual inspection is a basic task that is involved in multiple industrial
processes like sorting, inventory management, machining, quality management, packaging
etc.

Inventory management can be very tricky as items are hard to track in real
time. Automatic object counting and localization allows improving inventory accuracy.

1.3.4 Self Driving Cars

Fig. 1.3.4

Self-driving cars are undoubtedly the future, but the technology behind them is very
tricky, as they combine a variety of techniques to perceive their surroundings, including radar,
laser light, GPS, and computer vision.

1.4 Purpose and need


The purpose of object detection is to detect all instances of objects from a known class,
such as people, cars or faces in an image etc. In the case of a fixed rigid object only one
example may be needed, but more generally multiple training examples are necessary to
capture certain aspects of class variability.

One of the best examples of why you need object detection is the high-level algorithm for
autonomous driving:
• In order for a car to decide what to do next (accelerate, apply brakes or turn), it
needs to know where all the objects are around the car and what those objects are.
• That requires object detection.
• You would essentially train the car to detect a known set of objects: cars,
pedestrians, traffic lights, road signs, bicycles, motorcycles, etc.

1.5 Hardware Specifications

GPU: For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080
Ti. If you use these cards, you should use 16-bit models. Otherwise, a GTX 1070, GTX 1080,
GTX 1070 Ti, or GTX 1080 Ti from eBay is a fair choice, and you can use these GPUs
with 32-bit (but not 16-bit) models. Be careful about the memory requirements when you pick your
GPU. RTX cards, which can run in 16-bit mode, can train models that are twice as big with
the same memory compared to GTX cards. As such, RTX cards have a memory advantage,
and picking an RTX card and learning how to use 16-bit models effectively will carry you a
long way. In general, the requirements for memory are roughly the following:

Suspect line-up
Fig.2.4.1
CPU

The main mistake that people make is paying too much attention to the PCIe lanes of
a CPU. You should not care much about PCIe lanes. Instead, just look up whether your CPU and
motherboard combination supports the number of GPUs that you want to run. We need a CPU
with plenty of RAM for running the large number of training steps.

Performance Graph
Fig.2.4.2

1.6 Software Specifications

• Python 3.7
• TensorFlow
• Anaconda Software
• Machine Learning Libraries

Conda is an open source, cross-platform, language-agnostic package manager and
environment management system that installs, runs, and updates packages and their
dependencies. It was created for Python programs, but it can package and distribute
software for any language.

1.7 Expected Outcome

Detection accuracy is usually measured on a given test set, where the expected outcome for
a detection sample is compared to the actual outcome of the object detection system. The
detection accuracy is the percentage of samples for which the expected outcome matches
the actual outcome of the detection system.

Expected outcome
Fig.2.6

CHAPTER 2
BACKGROUND

2.1 Machine learning

Learning algorithms are widely used in computer vision applications. Before considering
image related tasks, we are going to have a brief look at basics of machine learning.

Machine learning has emerged as a useful tool for modelling problems that are otherwise
difficult to formulate exactly. Classical computer programs are explicitly programmed by
hand to perform a task. With machine learning, some portion of the human contribution is
replaced by a learning algorithm. As availability of computational capacity and data has
increased, machine learning has become more and more practical over the years, to the
point of being almost ubiquitous.

2.1.1 Types

A typical way of using machine learning is supervised learning. A learning algorithm is


shown multiple examples that have been annotated or labelled by humans. For example,
in the object detection problem we use training images where humans have marked the
locations and classes of relevant objects. After learning from the examples, the algorithm
is able to predict the annotations or labels of previously unseen data. Classification and
regression are the most important task types. In classification, the algorithm attempts to
predict the correct class of a new piece of data based on the training data. In regression,
instead of discrete classes, the algorithm tries to predict a continuous output.

In unsupervised learning, the algorithm attempts to learn useful properties of the data
without a human teacher telling it what the correct output should be. A classical example of
unsupervised learning is clustering. More recently, especially with the advent of deep
learning technologies, unsupervised pre-processing has become a popular tool in
supervised learning tasks for discovering useful representations of the data [9].

2.1.2 Features

Some kind of pre-processing is almost always needed. Pre-processing the data into a new,
simpler variable space is called feature extraction. Often, it is impractical or impossible
to use the full-dimensional training data directly. Rather, detectors are programmed to
extract interesting features from the data, and these features are used as input to the
machine learning algorithm.

In the past, the feature detectors were often hand-crafted. The problem with this approach
is that we do not always know in advance which features are interesting. The trend in
machine learning has been towards learning the feature detectors as well, which enables
using the complete data.

2.1.3 Generalization

Since the training data cannot include every possible instance of the inputs, the learning
algorithm has to be able to generalize in order to handle unseen data points. A model that is
too simple can fail to capture important aspects of the true model. On the other hand,
a method that is too complex can overfit by modelling unimportant details and noise, which also
leads to poor generalization. Typically, overfitting happens when a complex method is
used in conjunction with too little training data. An overfitted model learns to model the
known examples but does not understand what connects them.

The performance of the algorithm can be evaluated from the quality and quantity of
errors. A loss function, such as mean squared error, is used to assign a cost to the errors.
The objective in the training phase is to minimize this loss.

2.2 Neural networks

Neural networks are a popular type of machine learning model. A special case of a neural
network called the convolutional neural network (CNN) is the primary focus of this
thesis. Before discussing CNNs, we will discuss how regular neural networks work.

2.2.1 Origins

Neural networks were originally called artificial neural networks, because they were
developed to mimic the neural function of the human brain. Pioneering research includes

the threshold logic unit by Warren McCulloch and Walter Pitts in 1943 and the
perceptron by Frank Rosenblatt in 1957.

Even though the inspiration from biology is apparent, it would be misleading to
overemphasize the connection between artificial neurons and biological neurons or
neuroscience. The human brain contains approximately 100 billion neurons operating in
parallel. Artificial neurons are mathematical functions implemented on more-or-less
serial computers. Research into neural networks is mostly guided by developments in
engineering and mathematics rather than biology.

Figure 2.1: An artificial neuron.

An artificial neuron based on the McCulloch-Pitts model is shown in Figure 2.1. The neuron
k receives m input parameters xj. The neuron also has m weight parameters wkj. The
weight parameters often include a bias term that has a matching dummy input with a fixed
value of 1. The inputs and weights are linearly combined and summed. The sum is then
fed to an activation function φ that produces the output yk of the neuron:

yk = φ( Σj wkj xj ), where the sum runs over j = 1, ..., m.

The neuron is trained by carefully selecting the weights to produce a desired output for
each input.
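
As a concrete illustration, the following short NumPy sketch (added here as an example; the input values, weights and threshold activation are made up) computes the output of a single neuron of this kind:

import numpy as np

def neuron_output(x, w, b, activation):
    # Linearly combine the inputs with the weights, add the bias, then apply the activation.
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([0.5, -1.0, 2.0])            # m = 3 inputs
w = np.array([0.8, 0.2, -0.5])            # matching weights w_kj
b = 0.1                                   # bias term (dummy input fixed to 1)
step = lambda z: 1.0 if z >= 0 else 0.0   # simple threshold activation
print(neuron_output(x, w, b, step))       # -> 0.0 for these values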

2.2.2 Multi-layer networks

Figure 2.2: A fully-connected multi-layer neural network.

A neural network is a combination of artificial neurons. The neurons are typically


grouped into layers. In a fully-connected feed-forward multi-layer network, shown in
Figure 2.2, each output of a layer of neurons is fed as input to each neuron of the next
layer. Thus, some layers process the original input data, while some process data received
from other neurons. Each neuron has a number of weights equal to the number of neurons
in the previous layer.

A multi-layer network typically includes three types of layers: an input layer, one or more
hidden layers and an output layer. The input layer usually merely passes data along
without modifying it. Most of the computation happens in the hidden layers. The output
layer converts the hidden layer activations to an output, such as a classification. A
multilayer feed-forward network with at least one hidden layer can function as a
universal approximator i.e. can be constructed to compute almost any function.
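
As an illustration, such a fully-connected feed-forward network can be written in a few lines with the tf.keras API that ships with TensorFlow; the layer sizes below (a flattened 28x28 input, one hidden layer of 128 neurons, 10 output classes) are assumed purely for the example:

import tensorflow as tf

# Input layer -> one hidden layer -> softmax output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),  # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')                     # output layer
])
model.summary()   # prints the layers and the number of trainable weights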

In this thesis, we will mostly discuss fully-connected networks and convolutional


networks. Convolutional networks utilize parameter sharing and have limited connections
compared to fully-connected networks. Other network types, such as recurrent networks,
are outside the scope of this thesis.

2.2.3 Back-propagation

A neural network is trained by selecting the weights of all neurons so that the network
learns to approximate target outputs from known inputs. It is difficult to solve the neuron
weights of a multi-layer network analytically. The back-propagation algorithm provides a

simple and effective solution to solving the weights iteratively. The classical version uses
gradient descent as optimization method. Gradient descent can be quite time-consuming
and is not guaranteed to find the global minimum of error, but with proper configuration
(known in machine learning as hyperparameters) works well enough in practice.

In the first phase of the algorithm, an input vector is propagated forward through the
neural network. Before this, the weights of the network neurons have been initialized to
some values, for example small random values. The received output of the network is
compared to the desired output (which should be known for the training examples) using
a loss function. The gradient of the loss function is then computed. This gradient is also
called the error value. When using mean squared error as the loss function, the output
layer error value is simply the difference between the current and desired output.

The error values are then propagated back through the network to calculate the error
values of the hidden layer neurons. The hidden neuron loss function gradients can be
solved using the chain rule of derivatives. Finally, the neuron weights are updated by
calculating the gradient of the weights and subtracting a proportion of the gradient from
the weights. This ratio is called the learning rate. The learning rate can be fixed or
dynamic. After the weights have been updated, the algorithm continues by executing the
phases again with different input until the weights converge.

In the above description, we have described online learning that calculates the weight
updates after each new input. Online learning can lead to "zig-zagging" behavior, where
the single data point estimate of the gradient keeps changing direction and does not
approach the minimum directly. Another way of computing the updates is full batch
learning, where we compute the weight updates for the complete dataset. This is quite
computationally heavy and has other drawbacks. A compromise version is mini-batch
learning, where we use only some portion of the training set for each update.
Mathematical descriptions of the algorithm are readily available in other works.
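
The weight-update rule and the mini-batch loop can still be sketched in a few lines of NumPy. This is an illustrative outline only: grad_fn is an assumed helper that returns the back-propagated gradient of the loss with respect to the weights.

import numpy as np

def sgd_update(weights, gradient, learning_rate=0.01):
    # Subtract a proportion (the learning rate) of the loss gradient from the weights.
    return weights - learning_rate * gradient

def train(weights, data, labels, grad_fn, batch_size=32, epochs=10, learning_rate=0.01):
    n = len(data)
    for epoch in range(epochs):
        order = np.random.permutation(n)               # shuffle the training set each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]      # indices of one mini-batch
            gradient = grad_fn(weights, data[idx], labels[idx])
            weights = sgd_update(weights, gradient, learning_rate)
    return weights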

2.2.4 Activation function types

The activation function φ determines the final output of each neuron. It is important to
select the function properly in order to create an effective network.

Early researchers found that perceptrons and other linear systems had severe drawbacks,
being unable to solve problems that were not linearly separable, such as the XOR-
problem. Sometimes, linear systems can solve these kinds of problems using hand-crafted
feature detectors, but this is not the most advantageous use of machine learning. Simply
adding layers does not help either, because a network composed of linear neurons
remains linear no matter how many layers it has.

A light-weight and effective way of creating a non-linear network is using rectified linear
units (ReLu). A rectified linear function generates the output using a ramp function such
as:

f(x) = max(0, x)

This type of function is easy to compute and differentiate (for back-propagation). The
function is not differentiable at zero, but this has not prevented its use in practice. ReLus
have become quite popular lately, often replacing sigmoidal activation functions, which
have smooth derivatives but suffer from gradient saturation problems and slower
computation.

For multi-class classification problems, the softmax activation function is used in the
output layer of the network:

softmax(z)k = exp(zk) / Σj exp(zj), for k = 1, ..., K.

The softmax function takes a vector of K arbitrarily large values and outputs a vector of
K values that range between 0...1 and sum to 1. The values output by the softmax unit
can be utilized as class probabilities.
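
Both functions are one-liners in NumPy; the short sketch below (with made-up scores) shows the ramp behaviour of the ReLu and that the softmax outputs lie between 0 and 1 and sum to 1:

import numpy as np

def relu(x):
    # Ramp function: pass positive values through, clamp negative values to zero.
    return np.maximum(0.0, x)

def softmax(z):
    # Subtracting the maximum improves numerical stability without changing the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(relu(np.array([-1.5, 0.0, 2.3])))      # -> [0.  0.  2.3]
print(softmax(np.array([2.0, 1.0, 0.1])))    # -> three probabilities that sum to 1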

2.2.5 Deep learning

Modern neural networks are often called deep neural networks. Even though multi-layer
neural networks have existed since the 1980s, several reasons prevented the effective
training of networks with multiple hidden layers.

One of the main problems is the curse of dimensionality. As the number of variables
increases, the number of different configurations of the variables grows exponentially. As

the number of configurations increases, the number of training samples should increase in
equal measure. Collecting a training dataset of sufficient size is time-consuming and
costly or outright impossible.

Fortunately, real-world data is not uniformly distributed and often involves a structure,
where the interesting information lies on a low-dimensional manifold. The manifold
hypothesis assumes that most data configurations are invalid or rare. We can decrease
dimensionality by learning to represent the data using the coordinates of the manifold.
Another way to improve generalization is to assume local constancy. This means
assuming that the function that the neural network learns to approximate should not
change much within a small region.

In the past ten years, neural networks have had a renaissance, mainly because of the
availability of more powerful computers and larger datasets. In early 2000s, it was
discovered that neural networks could be trained efficiently using graphics processing
units. GPUs are more efficient for the task than traditional CPUs and provide a relatively
cheap alternative to specialist hardware. Today, researchers typically use high-end
graphics cards, such as the NVIDIA Tesla K40.

Other more theoretical breakthroughs include replacing mean-squared error functions


with cross-entropy based functions and replacing sigmoidal activation functions with
rectified linear units.

With deep learning, there is less need for hand-tuned machine learning solutions that
were used previously. A classical pattern detection system, for example, includes a hand-
tuned feature detection phase before a machine learning phase. The deep learning
equivalent consists of a single neural network. The lower layers of the neural network
learn to recognize the basic features, which are then fed forward to higher layers of the
network.

2.3 Computer vision

Next, we are going to discuss computer vision in general and explore the primary subject
of this thesis, object detection, as a subproblem of computer vision.

2.3.1 Overview

Computer vision deals with the extraction of meaningful information from the contents of
digital images or video. This is distinct from mere image processing, which involves
manipulating visual information on the pixel level. Applications of computer vision

include image classification, visual detection, 3D scene reconstruction from 2D images,
image retrieval, augmented reality, machine vision and traffic automation.

Today, machine learning is a necessary component of many computer vision algorithms.


Such algorithms can be described as a combination of image processing and machine
learning. Effective solutions require algorithms that can cope with the vast amount of
information contained in visual images, and critically for many applications, can carry
out the computation in real time.

2.3.2 Object detection

Object detection is one of the classical problems of computer vision and is often
described as a difficult task. In many respects, it is similar to other computer vision tasks,
because it involves creating a solution that is invariant to deformation and changes in
lighting and viewpoint. What makes object detection a distinct problem is that it involves
both locating and classifying regions of an image. The locating part is not needed in, for
example, whole image classification.

To detect an object, we need to have some idea where the object might be and how the
image is segmented. This creates a type of chicken-and-egg problem, where, to recognize
the shape (and class) of an object, we need to know its location, and to recognize the
location of an object, we need to know its shape. Some visually dissimilar features, such
as the clothes and face of a human being, may be parts of the same object, but it is
difficult to know this without recognizing the object first. On the other hand, some
objects stand out only slightly from the background, requiring separation before
recognition.

Low-level visual features of an image, such as a saliency map, may be used as a guide for
locating candidate objects. The location and size are typically defined using a bounding
box, which is stored in the form of corner coordinates. Using a rectangle is simpler than
using an arbitrarily shaped polygon, and many operations, such as convolution, are
performed on rectangles in any case. The sub-image contained in the bounding box is

27
then classified by an algorithm that has been trained using machine learning. The
boundaries of the object can be further refined iteratively, after making an initial guess.

During the 2000s, popular solutions for object detection utilized feature descriptors, such
as scale-invariant feature transform (SIFT) developed by David Lowe in 1999 and
histogram of oriented gradients (HOG) popularized in 2005. In the 2010s, there has been
a shift towards utilizing convolutional neural networks.

Before the widescale adoption of CNNs, there were two competing solutions for
generating bounding boxes. In the first solution, a dense set of region proposals is
generated and then most of these are rejected. This typically involves a sliding window
detector. In the second solution, a sparse set of bounding boxes is generated using a
region proposal method, such as Selective Search. Combining sparse region proposals
with convolutional neural networks has provided good results and is currently popular.

2.4 Convolutional neural networks


Next, we are going to discuss why and how convolutional neural networks (CNN) are
used and describe their history.

2.4.1 Justification

The problem with solving computer vision problems using traditional neural networks is
that even a modestly sized image contains an enormous amount of information (see
section 2.2.5 on deep learning and the curse of dimensionality).

A monochrome 620x480 image contains 297,600 pixels. If each pixel intensity of this
image is input separately to a fully-connected network, each neuron requires 297,600
weights. A 1920x1080 full HD image would require 2,073,600 weights per neuron. If the images are
polychrome, the amount of weights is multiplied by the amount of color channels
(typically three). Thus, we can see that the overall number of free parameters in the
network quickly becomes extremely large as the image size increases. Models that are too
large cause overfitting and slow performance.

Furthermore, many pattern detection tasks require that the solution is translation
invariant. It is inefficient to train neurons to separately recognize the same pattern in the
left-top corner and in the right-bottom corner of an image. A fully-connected neural
network fails to take this kind of structure into account.

2.4.2 Basic structure

The basic idea of the CNN was inspired by a concept in biology called the receptive field.
Receptive fields are a feature of the animal visual cortex. They act as detectors that are
sensitive to certain types of stimulus, for example, edges. They are found across the
visual field and overlap each other.

Figure 2.3: Detecting horizontal edges from an image using convolution filtering.

This biological function can be approximated in computers using the convolution


operation. In image processing, images can be filtered using convolution to produce
different visible effects. Figure 2.3 shows how a hand-selected convolutional filter
detects horizontal edges from an image, functioning similarly to a receptive field.

The discrete convolution operation between an image f and a filter matrix g produces an
output h whose value at coordinates (x, y) is

h(x, y) = Σi Σj f(x + i, y + j) g(i, j),

where the indices i, j run over the rows and columns of the filter. In effect, the dot product
of the filter g and a sub-image of f (with the same dimensions as g) centered on coordinates
(x, y) produces the pixel value of h at coordinates (x, y). The size of the receptive field is
adjusted by the size of the filter matrix. Aligning the filter successively with every
sub-image of f produces the output pixel matrix h. In the case of neural networks, the
output matrix is also called a feature map (or an activation map after computing the
activation function). Edges need to be treated as a special case. If image f is not padded,
the output size decreases slightly with every convolution.
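
To make this concrete, the following NumPy sketch (an illustrative example; the horizontal-edge filter values are a standard Prewitt-style choice, not taken from this report) filters an image the way Figure 2.3 describes:

import numpy as np

def filter2d(image, kernel):
    # Dot product of the filter and every sub-image it fits over, as described above.
    # (Flipping the kernel for strict convolution is omitted, since the filter is hand-selected.)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

horizontal_edge = np.array([[-1, -1, -1],
                            [ 0,  0,  0],
                            [ 1,  1,  1]], dtype=float)

image = np.random.rand(8, 8)       # stand-in for a monochrome image
edges = filter2d(image, horizontal_edge)
print(edges.shape)                 # (6, 6): without padding the output shrinks slightly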

A set of convolutional filters can be combined to form a convolutional layer of a neural


network. The matrix values of the filters are treated as neuron parameters and trained
using machine learning. The convolution operation replaces the multiplication operation
of a regular neural network layer. Output of the layer is usually described as a volume.
The height and width of the volume depend on the dimensions of the activation map. The
depth of the volume depends on the number of filters.

Since the same filters are used for all parts of the image, the number of free parameters is
reduced drastically compared to a fully-connected neural layer. The neurons of the
convolutional layer mostly share the same parameters and are only connected to a local
region of the input. Parameter sharing resulting from convolution ensures translation
invariance. An alternative way of describing the convolutional layer is to imagine a fully-
connected layer with an infinitely strong prior placed on its weights. This prior forces the
neurons to share weights at different spatial locations and to have zero weight outside the
receptive field.

Successive convolutional layers (often combined with other types of layers, such as
pooling, described below) form a convolutional neural network (CNN). An example of a
convolutional network is shown in the figure. The back-propagation training algorithm,
described in section 2.2.3, is also applicable to convolutional networks. In theory, the
layers closer to the input should learn to recognize low-level features of the image, such
as edges and corners, and the layers closer to the output should learn to combine these
features to recognize more meaningful shapes. In this thesis, we are interested in studying
whether convolutional networks can learn to recognize complete objects.

2.4.3 Pooling and stride

To make the network more manageable for classification, it is useful to decrease the
activation map size in the deep end of the network. Generally, the deep layers of the
network require less information about exact spatial locations of features, but require

more filter matrices to recognize multiple high-level patterns. By reducing the height and
width of the data volume, we can increase the depth of the data volume and keep the
computation time at a reasonable level.

There are two ways of reducing the data volume size. One way is to include a pooling
layer after a convolutional layer. The layer effectively down-samples the activation maps.
Pooling has the added effect of making the resulting network more translation invariant
by forcing the detectors to be less precise. However, pooling can destroy information
about spatial relationships between subparts of patterns. A typical pooling method is max-
pooling. Max-pooling simply outputs the maximum value within a rectangular
neighborhood of the activation map.

Another way of reducing the data volume size is adjusting the stride parameter of the
convolution operation. The stride parameter controls whether the convolution output is
calculated for a neighborhood centered on every pixel of the input image (stride 1) or for
every nth pixel (stride n). Research has shown that pooling layers can often be discarded
without loss in accuracy by using convolutional layers with larger stride value. The stride
operation is equivalent to using a fixed grid for pooling.
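
A small NumPy sketch of max-pooling (the 2x2 window and stride of 2 are typical values, assumed here for illustration):

import numpy as np

def max_pool(activation_map, size=2, stride=2):
    # Output the maximum value within each (size x size) neighborhood,
    # moving the window 'stride' pixels at a time.
    h, w = activation_map.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = activation_map[y * stride:y * stride + size,
                                    x * stride:x * stride + size]
            out[y, x] = window.max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))   # 2x2 output: the 4x4 activation map is down-sampled by a factor of two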

2.4.4 Additional layers

The convolutional layer typically includes a non-linear activation function, such as a


rectified linear activation function (see subsection 2.2.4). Activations are sometimes
described as a separate layer between the convolutional layer and the pooling layer.

Some systems also implement a layer called local response normalization, which
is used as a regularization technique. Local response normalization mimics a function of
biological neurons called lateral inhibition, which causes excited neurons to decrease the
activity of neighboring neurons. However, other regularization techniques are currently
more popular and these are discussed in the next section.

The final hidden layers of a CNN are typically fully-connected layers. A fully-connected
layer can capture some interesting relationships that parameter-sharing convolutional layers
cannot. However, a fully-connected layer requires a sufficiently small data volume size
in order to be practical. Pooling and stride settings can be used to reduce the size of the
data volume that reaches the fully-connected layers. A convolutional network that does
not include any fully-connected layers is called a fully convolutional network (FCN).

If the network is used for classification, it usually includes a softmax output layer (see
also section 2.2.4). The activations of the topmost layers can also be used directly to
generate a feature representation of an image. This means that the convolutional network
is used as a large feature detector.

2.4.5 Regularization and data augmentation

Regularization refers to methods that are used to reduce overfitting by introducing


additional constraints or information to the machine learning system. A classical way of
using regularization in neural networks is adding a penalty term to the objective/loss
function that penalizes certain types of weights. The parameter sharing feature of
convolutional networks is another example of regularization.

There are several regularization techniques that are specific to deep neural networks. A
popular technique called dropout attempts to reduce the co-adaptation of neurons. This is
achieved by randomly dropping out neurons during training, meaning that a slightly
different neural network is used for each training sample or minibatch. This causes the
system not to depend too much on any single neuron or connection and provides an
effective yet computationally inexpensive way of implementing regularization. In
convolutional networks, dropout is typically used in the final fully-connected layers.

Overfitting can also be reduced by increasing the amount of training data. When it is not
possible to acquire more actual samples, data augmentation is used to generate more
samples from the existing data. For classification using convolutional networks, this can
be achieved by computing transformations of the input images that do not alter the
perceived object classes, yet provide additional challenge to the system. The images can
be, for example, flipped, rotated or subsampled with different crops and scales. Also,
noise can be added to the input images.
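
As an illustration, both techniques are available out of the box in tf.keras; the layer sizes and the 0.5 drop rate below are assumed, commonly used values rather than figures from this report:

import tensorflow as tf

# Dropout applied between two fully-connected layers of a classifier head.
classifier_head = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),                     # randomly drop half of the activations while training
    tf.keras.layers.Dense(1000, activation='softmax')
])

# Simple data augmentation: flip an input image horizontally with probability 0.5.
def augment(image):
    return tf.image.random_flip_left_right(image)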

2.4.6 Development

Convolutional neural networks were one of the first successful deep neural networks. The
Neocognitron, developed by Fukushima in the 1980s, provided a neural network model for
translation-invariant object recognition, inspired by biology. Le Cun et al. combined this
method with a learning algorithm, i.e. back-propagation. These early solutions were
mostly used for handwritten character recognition.

After providing some promising results, the neural network methods faded in prominence
and were mostly replaced by support vector machines. Then, in 2012, Krizhevsky et al.
achieved excellent results on the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) dataset by combining Le Cun’s method with recent fine-tuning methods for
deep learning. These results popularized CNNs and led to the development of the new
powerful object detection methods described in the next chapter.

For the 2014 ImageNet challenge, Simonyan and Zisserman explored the effect of
increasing the depth of a convolutional network on localization and classification
accuracy. The team achieved results that improved the then state-of-the-art by using
convolutional networks 16 and 19 layers deep. The 16-layer architecture includes 13
convolutional layers (with 3x3 filters), 5 pooling layers (2x2 neighborhood max-pooling)
and 3 fully-connected layers. All hidden layers use rectified (ReLu) activations. The
fully-connected layers reduce 4096 channels down to 1000 softmax outputs and are
regularized using dropout. This form of network is referred to as VGG-16 later in this
thesis.
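
The same 16-layer architecture is available as a ready-made model in tf.keras, which makes it a convenient pre-trained starting point; a short sketch (assuming a TensorFlow version that bundles tf.keras.applications):

import tensorflow as tf

# Load VGG-16 (13 convolutional layers with 3x3 filters, 5 max-pooling layers, 3 fully-connected layers).
# weights='imagenet' downloads weights pre-trained on the ImageNet dataset.
vgg16 = tf.keras.applications.VGG16(weights='imagenet', include_top=True)
vgg16.summary()   # lists the layer stack described above, ending in a 1000-way softmax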

The current (2016) winner of the object detection category in the ImageNet challenge is
also CNN-based. The method uses a combination of CRAFT region proposal generation,
gated bi-directional CNN, clustering, landmark generation and ensembling.

CHAPTER 3
CONVOLUTIONAL NEURAL NETWORKS

3.1 R-CNN

In 2012, Krizhevsky et al. achieved promising results with CNNs for the general image
classification task, as mentioned in section 2.4.6. In 2013, Girshick et al. published a
method generalizing these results to object detection. This method is called R-CNN
(“CNN with region proposals”).

3.1.1 General description

R-CNN forward computation has several stages, shown in figure 3.1. First, the regions of
interest are generated. The RoIs are category-independent bounding boxes that have a
high likelihood of containing an interesting object. In the paper, a separate method called
Selective Search is used for generating these, but other region generation methods can be
used instead. Selective Search, along with other region proposal generation techniques, is
discussed in further detail in section 3.3.

Next, a convolutional network is used to extract features from each region proposal. The
sub-image contained in the bounding-box is warped to match the input size of the CNN
and then fed to the network. After the network has extracted features from the input, the
features are input to support vector machines (SVM) that provide the final classification.

Figure 3.1: Stages of R-CNN forward computation.

The method is trained in multiple stages, beginning with the convolutional network. After
the CNN has been trained, the SVMs are fitted to the CNN features. Finally, the region
proposal generating method is trained.

3.1.2 Drawbacks

R-CNN is an important method, because it provided the first practical solution for object
detection using CNNs. Being the first, it has many drawbacks that have been improved
upon by later methods.

In his 2015 paper for Fast R-CNN, Girshick lists three main problems of R-CNN:

First, training consists of multiple stages, as described above. Second, training is


expensive. For both SVM and region proposal training, features are extracted from each
region proposal and stored on disk. This requires days of computation and hundreds of
gigabytes of storage space.

Third, and perhaps most important, object detection is slow, requiring almost a minute for
each image, even on a GPU. This is because the CNN forward computation is performed
separately for every object proposal, even if the proposals originate from the same image
or overlap each other.

3.2 Fast R-CNN

Fast R-CNN published in 2015 by Girshick provides a more practical method for object
recognition. The main idea is to perform the forward pass of the CNN for the entire
image, instead of performing it separately for each RoI.

3.2.1 General description

Figure 3.2: Stages of Fast R-CNN forward computation.

The general structure of Fast R-CNN is illustrated in figure 3.2. The method receives as
input an image plus regions of interest computed from the image. As in R-CNN, the RoIs
are generated using an external method. The image is processed using a CNN that
includes several convolutional and max pooling layers.

The convolutional feature map that is generated after these layers is input to a RoI
pooling layer. This extracts a fixed-length feature vector for each RoI from the feature
map. The feature vectors are then input to fully-connected layers that are connected to
two output layers: a softmax layer that produces probability estimates for the object
classes and a real-valued layer that outputs bounding box co-ordinates computed using
regression (meaning refinements to the initial candidate boxes).
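
TensorFlow has no built-in RoI pooling layer, but the idea can be approximated by cropping each RoI from the shared feature map and resizing it to a fixed grid with tf.image.crop_and_resize. The sketch below is illustrative only: the feature-map shape, the box coordinates and the 7x7 output size are assumed values, and TensorFlow 2.x eager execution is assumed.

import tensorflow as tf

# Shared convolutional feature map: (batch, height, width, channels).
feature_map = tf.random.normal([1, 38, 50, 512])

# Two RoIs as normalized [y1, x1, y2, x2] boxes, both referring to image 0 of the batch.
rois = tf.constant([[0.1, 0.1, 0.5, 0.4],
                    [0.3, 0.2, 0.9, 0.8]])
box_indices = tf.constant([0, 0])

# Crop each RoI from the feature map and resize it to a fixed 7x7 grid, producing
# a fixed-length feature volume per RoI for the fully-connected layers.
roi_features = tf.image.crop_and_resize(feature_map, rois, box_indices, crop_size=[7, 7])
print(roi_features.shape)   # (2, 7, 7, 512)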

3.2.2 Classification performance

According to the authors, Fast R-CNN provides significantly shorter classification time
compared to regular R-CNN, taking less than a second on a state-of-the-art GPU. This is
mainly due to using the same feature map for each RoI.

As the detection time decreases, the overall computation time begins to depend
significantly on the performance of the region proposal generation method. The RoI
generation can thus form a computational bottleneck. Additionally, when there are many
RoIs, the time spent on evaluating the fully-connected layers can dominate the evaluation
time of the convolutional layers. Classification time can be accelerated by approximately
30% if the fully-connected layers are compressed using truncated singular value decom-
position. This results in a slight decrease in precision, however.

3.2.3 Training

According to the original publication, Fast R-CNN is more efficient to train than R-CNN,
with nine-fold reduction in training time. The entire network (including the RoI pooling
layer and the fully-connected layers) can be trained using the back-propagation algorithm
and stochastic gradient descent. Typically, a pre-trained network is used as a starting
point and then fine-tuned. Training is done in mini-batches of N images. R/N RoIs are
sampled from each mini-batch image. The RoI samples are assigned to a class if their
intersection over union (see section 4.6) with a ground-truth box is over 0.5. Other RoIs
belong to the background class.

As in classification, RoIs from the same image share computation and memory usage.
For data augmentation, the original image is flipped horizontally with probability 0.5.
The softmax classifier and the bounding box regressors are fine-tuned together using a
multi-task loss function, which considers both the true class of the sampled RoI and the
offset of the sampled bounding box from the true bounding box.
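
Intersection over union, the overlap measure behind the 0.5 threshold above, is straightforward to compute; a small illustrative sketch with made-up box coordinates:

def intersection_over_union(box_a, box_b):
    # Boxes are given as [x1, y1, x2, y2] corner coordinates.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two boxes that overlap by half of their width: IoU = 50 / 150 = 0.33, below the 0.5 threshold.
print(intersection_over_union([0, 0, 10, 10], [5, 0, 15, 10]))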

3.3 Region proposal generation and use

To use R-CNN and Fast R-CNN, we need a method for generating the class-agnostic
regions of interest. Next, we are going to discuss general principles of RoI generation,
and have a closer look at two popular methods: Selective Search and Edge Boxes.

3.3.1 Overview

The aim of region proposal generation in object detection is to maximize recall i.e. to
generate enough regions so that all true objects are recovered. The generator is less
concerned with precision, since it is the task of the object detector to identify correct
regions from the output of the region proposal generator.

However, the number of proposals generated affects performance. As mentioned in


section 2.3.2 there are two main approaches to region generation: dense set generation
and sparse set generation.

Dense set solutions attempt to generate by brute force an exhaustive set of bounding
boxes that includes every potential object location. This can be achieved by sliding a

detection window across the image. However, searching through every location of the
image is computationally costly and requires a fast object detector. Additionally, different
window shapes and sizes need to be considered. Thus, most sliding window methods
limit the amount of candidate objects by using a coarse step-size and a limited number of
fixed aspect ratios.

Most region proposals in a dense set do not contain interesting objects. These proposals
need to be discarded after the object detection phase. Detection results can be discarded
if they fall below a predefined confidence threshold or if their confidence value is below
a local maximum (non-maximum suppression).
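
Non-maximum suppression is available as a ready-made operation in TensorFlow. The sketch below uses made-up boxes, scores and thresholds, and assumes TensorFlow 2.x eager execution for the .numpy() call:

import tensorflow as tf

# Candidate boxes as [y1, x1, y2, x2] with one confidence score each.
boxes = tf.constant([[0.0, 0.0, 1.0, 1.0],
                     [0.0, 0.1, 1.0, 1.1],    # heavily overlaps the first box
                     [0.0, 2.0, 1.0, 3.0]])
scores = tf.constant([0.9, 0.75, 0.6])

# Keep at most 10 boxes; suppress any box whose IoU with a higher-scoring kept box exceeds 0.5.
keep = tf.image.non_max_suppression(boxes, scores, max_output_size=10, iou_threshold=0.5)
print(keep.numpy())   # indices of the surviving boxes, here [0, 2]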

Instead of discarding the regions after the object detection stage, the region proposal
generator itself can rank the regions in a class-agnostic way and discard low-ranking
regions. This generates a sparse set of object detections. Similarly to dense set methods,
thresholding and non-maximum suppression can be implemented after the detection
phase to further improve the detection quality. Sparse set solutions can be grouped into
unsupervised and supervised methods.

One of the most popular unsupervised methods is Selective Search (see section 3.3.2)
which utilizes an iterative merging of superpixels. There are also other methods that use
the same approach. Another approach is to rank the objectness of a sliding window. A
popular example of this is Edge Boxes (see section 3.3.2) which calculates the objectness
score by calculating the number of edges within a bounding box and by subtracting the
number of edges that overlap the box boundary. There is also a third group of methods
based on seed segmentation.

Supervised methods treat region proposal generation as a classification or a regression


problem. This means using a machine learning algorithm, such as a support vector
machine. It is also possible to use a convolutional network to generate the regions of
interest. An example of using a CNN for calculating the bounding boxes is MultiBox.

Certain advanced object detection methods, such as Faster R-CNN, described in section 3.4.1, use
parts of the same convolutional network both for generating the region proposals and for
detection. We call these kinds of methods integrated methods.

3.3.2 Selective Search

Selective Search utilizes a hierarchical partitioning of an image to create a sparse set of


object locations. The main design philosophy is not to use a single strategy, but to
combine the best features of bottom-up segmentation and exhaustive search. The authors
had three main design considerations: the search should capture all scales, be diverse i.e.
not use any single strategy for grouping regions and be fast to compute.

The algorithm begins by creating a set of small initial regions using a method called
Graph Based Image Segmentation designed by Felzenszwalb and Huttenlocher. The
method creates a set of regions called superpixels. The superpixels are internally nearly
uniform. Combined, they span the entire image, but individually they should not span
different objects.

Selective Search then continues by iteratively grouping the regions together using a
greedy algorithm, beginning with the two most similar regions. Many complementary
measures are used to compute the similarity. These measures consider color similarity
(by computing a color histogram), texture similarity (by computing a SIFT-like measure),
size of the regions (small regions should be merged earlier) and how well the regions fit
together (gaps should be avoided). The grouping phase ends when every region has been
combined.

The hypothetical object locations thus generated are then ordered by the likelihood of the
location containing an object. In practice, the locations are ordered based on the order in
which they were grouped together by the different measures. A certain element of
randomness is added to prevent large objects from being favored too much. Lower-
ranking duplicates are removed.

Both the region generating method and the similarity measures were selected to be fast to
compute, making the method fast in general. In addition to using diverse similarity
measures, the search can be further diversified by using complementary color spaces (to
ensure lighting invariance) and using complementary starting regions.

3.3.3 Edge Boxes

As the name suggests, Edge Boxes is based on detecting objects from edge maps. The
main contribution of the authors of the method is the observation that the number of edge
contours wholly enclosed by a bounding box is correlated with the likelihood that the box
contains an object.

First, the edge map is calculated using a method by the same authors called Structured
Edge Detector. Then, thick edge lines are thinned using non-maximum suppression.
Instead of operating on the edge pixels directly, the pixels are grouped using a greedy
algorithm. An affinity measure is devised to calculate whether edge groups are part of the
same contour.

The region proposals are found by scanning the image using the traditional sliding
window method and calculating an objectness score at each position, aspect ratio and
scale. The score is calculated by summing the edge strength of edge groups that lie
completely within the box and subtracting the strength of edge groups that are part of a
contour that crosses the box boundary. Promising regions are then further refined.

3.4 Advanced convolutional object detection

In the experimental section of this thesis, we will focus mostly on Fast R-CNN. There
are, however, several state-of-the-art algorithms with an improved computation time or
accuracy. Next, we will describe two of these algorithms. See also chapter 7 for
discussion of improvements of convolutional object detection.

3.4.1 Faster R-CNN

Faster R-CNN by Ren et al. is an integrated method. The main idea is to use shared
convolutional layers for region proposal generation and for detection. The authors
discovered that feature maps generated by object detection networks can also be used to
generate the region proposals. The fully convolutional part of the Faster R-CNN network
that generates the feature proposals is called a region proposal network (RPN). The
authors used Fast R-CNN architecture for the detection network.

A Faster R-CNN network is trained by alternating between training for RoI generation
and detection. First, two separate networks are trained. Then, these networks are
combined and fine-tuned. During fine-tuning, certain layers are kept fixed and certain
layers are trained in turn.

The trained network receives a single image as input. The shared fully convolutional
layers generate feature maps from the image. These feature maps are fed to the RPN. The
RPN outputs region proposals, which are input, together with the said feature maps, to
the final detection layers. These layers include a RoI pooling layer and output the final
classifications.

Using shared convolutional layers, region proposals are computationally almost cost-free.
Computing the region proposals on a CNN has the added benefit of being realizable on a
GPU. Traditional RoI generation methods, such as Selective Search, are implemented
using a CPU.

For dealing with different shapes and sizes of the detection window, the method uses
special anchor boxes instead of using a pyramid of scaled images or a pyramid of
different filter sizes (see section 7.2 for discussion of scale invariance). The anchor boxes
function as reference points to different region proposals centered on the same pixel.
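For illustration only, a minimal sketch of how a set of anchor boxes could be generated around a single feature-map position; the three scales and three aspect ratios below are the commonly used defaults, not necessarily the values used in this project:

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Return (x1, y1, x2, y2) boxes of different shapes centred on the same pixel.
    boxes = []
    for s in scales:
        for r in ratios:            # r is treated as the width-to-height ratio
            w = s * r ** 0.5        # keeps the box area roughly equal to s * s
            h = s / r ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes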

3.4.2 SSD

The Single Shot MultiBox Detector (SSD) takes integrated detection even further. The
method does not generate proposals at all, nor does it involve any resampling of image
segments. It generates object detections using a single pass of a convolutional network.

Somewhat resembling a sliding window method, the algorithm begins with a default set
of bounding boxes. These cover different aspect ratios and scales. The object
predictions calculated for these boxes include offset parameters, which predict how much
the correct bounding box surrounding the object differs from the default box.

The algorithm deals with different scales by using feature maps from many different
convolutional layers (i.e. larger and smaller feature maps) as input to the classifier. Since
the method generates a dense set of bounding boxes, the classifier is followed by a non-
maximum suppression stage that eliminates most boxes below a certain confidence
threshold.
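A simplified sketch of how a predicted offset vector is applied to a default box; the actual SSD box coder also divides the offsets by fixed scale factors (compare the x_scale/y_scale values in the pipeline configuration in chapter 5):

import math

def decode_box(default_box, offsets):
    # default_box = (cx, cy, w, h); offsets = (dx, dy, dw, dh) predicted by the network.
    cx, cy, w, h = default_box
    dx, dy, dw, dh = offsets
    px = cx + dx * w            # shift the centre relative to the default box size
    py = cy + dy * h
    pw = w * math.exp(dw)       # width and height offsets are predicted in log-space
    ph = h * math.exp(dh)
    return (px, py, pw, ph)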

3.5 Comparing the methods

Above, we described how Fast R-CNN is faster and more accurate than regular R-CNN.
But how does Fast R-CNN perform compared to the above-mentioned advanced
methods?

Liu et al. compared the performance of Fast R-CNN, Faster R-CNN and SSD on the
PASCAL VOC 2007 test set (see section 4.5 for a discussion of the standard benchmarks).
When using networks trained on the PASCAL VOC 2007 training data, Fast R-CNN
achieved a mean average precision (mAP) of 66.9 (see section 4.6 for a discussion of
evaluation methods). Faster R-CNN performed slightly better, with a mAP of 69.9. SSD
achieved a mAP of 68.0 with input size 300 x 300 and 71.6 with input size 512 x 512. As
the standard implementations of Fast R-CNN and Faster R-CNN use 600 as the length of
the shorter dimension of the input image, SSD seems to perform better with similarly
sized images. However, SSD requires extensive use of data augmentation to achieve this
result. Fast R-CNN and Faster R-CNN only use horizontal flipping, and it is currently
unknown whether they would benefit from additional augmentation.

While the advanced methods are more precise than Fast R-CNN, the real improvements
come from speed. When most of the detections with a low probability are eliminated
using thresholding and non-maximum suppression (see section 4.6 for details), SSD512
can run at 19 FPS on a Titan X GPU. Meanwhile, Faster R-CNN with a VGG-16
architecture performs at 7 FPS. The original authors of Faster R-CNN report a running
time of 5 FPS, i.e. 0.2 s per image. Fast R-CNN has approximately the same evaluation
speed, but requires additional time for calculating the region proposals. Region
generation time depends on the method, with Selective Search requiring 2 seconds per
image on a CPU and Edge Boxes requiring 0.2 seconds per image.

CHAPTER 4
SYSTEM DESIGN

4.1 Data Flow Diagrams

A Data Flow Diagram is a representation of a system, at different levels of detail, using a
graphic network of symbols that represent data flows, data stores, processes and data end
points such as sources and destinations.

4.2 Design Notations

 Process
A procedure or process performs operations on the supplied arguments and produces an
output. Pure functions are considered low-level processes that have no side effects. In a
DFD, a process is represented as an ellipse.

 Data Flows

The connection from one process to another, or from one sub-entity to its parent, is
represented by an arrow labelled with the intermediate value that flows along it.

Graphical Representation

 Actors
The element that drives the data flow by taking the inputs and computing the
outputs is termed an actor.

 Data Store
Sometimes data needs to be stored and accessed later in the data flow; this is done by the
data store component of the DFD.

 External Entity

Any external entity that can access the flow in the DFD, such as a librarian, is called an
External Entity component. It is represented as a rectangle.

Graphical Representation

 Output Symbol

When the user interacts with the system, the DFD depicts the interaction in the form of
the polygon shown below.

Graphical Representation

4.3 Detailed Design

Zero level DFD – object identification system

[Fig. 4.3.1: Zero-level DFD — the input image is fed to the object detection system, which outputs the recognised objects.]

First Level DFD – Pre-processing

[Fig. 4.3.2: Pre-processing — the image is converted into a gray image, from which 10 different directions and 28 dimensions are extracted.]

Processing

[Fig. 4.3.3: Processing — the pre-processed image is examined to find the neighbourhood and the nodal points, the image is compared, and the result is passed on to testing.]
Recognition

[Fig. 4.3.4: Recognition — the processed image is compared with the retrieved stored image, the maximum matching percentage is selected, and the testing image is classified.]

Testing the Image

[Fig. 4.3.5: Testing — the test image is reported either as matched or as not matched.]
Second level DFD

[Fig. 4.3.6: Second-level DFD — the captured image is converted into a gray image and passed through pre-processing (gray conversion, 10 different directions, 28 dimensions), processing (finding the neighbourhood and nodal points, comparing the image) and recognition (retrieving the stored image, comparing both images, selecting the maximum matching percentage) before the image is finally tested.]

Use case

[Fig. 4.3.7: Use-case diagram — the input gray image passes through pre-processing, processing and recognition.]

Sequence Diagram

[Fig. 4.3.8: Sequence diagram — the user inputs a gray image; pre-processing extracts the gray level of each pixel; processing compares the calculated principal points with the test image; if the image is recognised, the result is shown to the user.]

Chapter 5
Implementation

5.1 Project File

The following software has to be downloaded and installed (example installation commands are sketched after the list):

o Python
o TensorFlow
o Tensorboard
o Protobuf v3.4 or above
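As an illustration only (package names and versions depend on the environment, and a GPU build of TensorFlow may be preferred), the requirements can be installed roughly as follows; TensorBoard is bundled with TensorFlow, and protoc is used to compile the Object Detection API protos:

pip install tensorflow pandas pillow opencv-python matplotlib
# From the models/research directory of the TensorFlow models repository:
protoc object_detection/protos/*.proto --python_out=.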

5.2 Creating Label Map

A label map assigns a numeric id and a human-readable display name to every class the detector can output. The entries below follow the format of the MS COCO label map (mscoco_label_map.pbtxt) referenced in the training configuration.

item {
name: "/m/01g317"
id: 1
display_name: "person"}
item {
name: "/m/0199g"
id: 2
display_name: "bicycle"}
item {
name: "/m/0k4j"
id: 3
display_name: "car"}
item {
name: "/m/04_sv"
id: 4
display_name: "motorcycle"}
item {
name: "/m/01940j"
id: 27
display_name: "backpack"}
item {
name: "/m/080hkjn"
id: 31
display_name: "handbag"}
item {
name: "/m/01c648"
id: 73
display_name: "laptop"}
item {
name: "/m/050k8"
id: 77
display_name: "cell phone"}
item {
name: "/m/0bt_c3"
id: 84
display_name: "book"}

5.3 Creating TensorFlow Records

5.3.1 Converting *.xml to *.csv

To do this we can write a simple script that iterates through all *.xml files in the
Training\Images\Train and Training\Images\Test folders and generates a *.csv file for
each of the two.

import os
import glob
import pandas as pd
import argparse
import xml.etree.ElementTree as ET


def xml_to_csv(path):
    print(path)
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text))
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class',
                   'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df


def main():
    parser = argparse.ArgumentParser(
        description="Sample TensorFlow XML-to-CSV converter")
    parser.add_argument("-i", "--inputDir", type=str,
                        help="path to the folder where the input .xml files are stored")
    parser.add_argument("-o", "--outputFile", type=str,
                        help="name of output .csv file (including path)")
    args = parser.parse_args()
    print(args)
    if args.inputDir is None:
        args.inputDir = os.getcwd()
    if args.outputFile is None:
        args.outputFile = args.inputDir + "/labels.csv"
    assert os.path.isdir(args.inputDir)
    xml_df = xml_to_csv(args.inputDir)
    xml_df.to_csv(args.outputFile, index=None)
    print('Successfully converted xml to csv')


if __name__ == '__main__':
    main()

5.3.2 Converting from *.csv to *.record

Now that we have obtained our *.csv annotation files, we will need to convert them into
TFRecords.
from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import io
import pandas as pd
import tensorflow as tf
import sys
sys.path.append("../../models/research")
from PIL import Image
from object_detection.utils import dataset_util
from collections import namedtuple, OrderedDict

flags = tf.app.flags
flags.DEFINE_string('csv_input',
                    '/tensorflow/workspace/training/annotation/test_labels.csv',
                    'path to the CSV input')
flags.DEFINE_string('output_path',
                    '/tensorflow/workspace/training/annotation/test.record',
                    'path to output TFRecord')
flags.DEFINE_string('label0', 'mobile', 'Name of class[0] label')
flags.DEFINE_string('label1', 'hand', 'Name of class[1] label')
flags.DEFINE_string('label2', 'book', 'Name of class[2] label')
flags.DEFINE_string('label3', 'pen', 'Name of class[3] label')
flags.DEFINE_string('label4', 'bag', 'Name of class[4] label')
flags.DEFINE_string('img_path',
                    '/tensorflow/workspace/training/images/test',
                    'path to image')
FLAGS = flags.FLAGS


def class_text_to_int(row_label):
    # Map a class name to the numeric id used in the label map.
    if row_label == FLAGS.label0:
        return 1
    elif row_label == FLAGS.label1:
        return 2
    elif row_label == FLAGS.label2:
        return 3
    elif row_label == FLAGS.label3:
        return 4
    else:
        return 5


def split(df, group):
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x))
            for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):
    # Read the image so its dimensions can be used to normalise the boxes.
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []
    for index, row in group.object.iterrows():
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example


def main(_):
    writer = tf.python_io.TFRecordWriter(FLAGS.output_path)
    path = os.path.join(os.getcwd(), FLAGS.img_path)
    examples = pd.read_csv(FLAGS.csv_input)
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())
    writer.close()
    output_path = os.path.join(os.getcwd(), FLAGS.output_path)
    print('Successfully created the TFRecords: {}'.format(output_path))


if __name__ == '__main__':
    tf.app.run()

5.3.3 Configuring a Training Pipeline

The model used in this project is ssd_inception_v2_coco, since it provides a relatively
good trade-off between performance and speed. The corresponding *.config file is
reproduced below.

model {
ssd {
num_classes: 1
image_resizer {
fixed_shape_resizer {
height: 300
width: 300
} }
feature_extractor {
type: "ssd_inception_v2"
depth_multiplier: 1.0
min_depth: 16
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 3.99999989895e-05
} }
initializer {
truncated_normal_initializer {
mean: 0.0
stddev: 0.0299999993294
} }
activation: RELU_6
batch_norm {
decay: 0.999700009823
center: true
scale: true
epsilon: 0.0010000000475
train: true
} } }
box_coder {
faster_rcnn_box_coder {
y_scale: 10.0
x_scale: 10.0
height_scale: 5.0
width_scale: 5.0
}}

matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
} }
similarity_calculator {
iou_similarity {
} }
box_predictor {
convolutional_box_predictor {
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 3.99999989895e-05
} }
initializer {
truncated_normal_initializer {
mean: 0.0
stddev: 0.0299999993294
} }
activation: RELU_6 }
min_depth: 0
max_depth: 0
num_layers_before_predictor: 0
use_dropout: false
dropout_keep_probability: 0.800000011921
kernel_size: 3
box_code_size: 4
apply_sigmoid_to_scores: false
} }
anchor_generator {
ssd_anchor_generator {
num_layers: 6
min_scale: 0.20000000298
max_scale: 0.949999988079

aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
aspect_ratios: 3.0
aspect_ratios: 0.333299994469
reduce_boxes_in_lowest_layer: true
} }
post_processing {
batch_non_max_suppression {
score_threshold: 0.300000011921
iou_threshold: 0.600000023842
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID }
normalize_loss_by_num_matches: true
loss {
localization_loss {
weighted_smooth_l1 {
} }
classification_loss {
weighted_sigmoid {
} }
hard_example_miner {
num_hard_examples: 3000
iou_threshold: 0.990000009537
loss_type: CLASSIFICATION
max_negatives_per_positive: 3
min_negatives_per_image: 0
}
classification_weight: 1.0
localization_weight: 1.0
}
}}
train_config: {
batch_size: 24
data_augmentation_options {
random_horizontal_flip {

} }
data_augmentation_options {
ssd_random_crop {
} }
optimizer {
rms_prop_optimizer {
learning_rate {
exponential_decay_learning_rate {
initial_learning_rate: 0.00400000018999
decay_steps: 800720
decay_factor: 0.949999988079
} }
momentum_optimizer_value: 0.899999976158
decay: 0.899999976158
epsilon: 1.0
} }
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: true
num_steps: 200000}
train_input_reader {
label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record"}
}
eval_config {
num_examples: 8000
max_evals: 10
use_moving_averages: false
}
eval_input_reader {
label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
shuffle: false
num_readers: 1
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record"
}}
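Before training, the PATH_TO_BE_CONFIGURED placeholders and a few other fields have to be adapted to the project. As a hypothetical illustration (the paths below are examples, not the exact ones used here), the typical edits are:

num_classes: 5                                   # number of object classes in the label map (illustrative value)
fine_tune_checkpoint: "pre-trained-model/model.ckpt"
label_map_path: "annotations/label_map.pbtxt"    # in train_input_reader and eval_input_reader
input_path: "annotations/train.record"           # train.record / test.record respectively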

5.3.4 Training the Model
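Training is started with the train.py script reproduced in the next section, pointing it at the pipeline configuration and a directory for checkpoints and summaries. A typical invocation (the directory and file names are assumptions, not the exact ones used here) looks like:

python train.py --logtostderr --train_dir=training/ \
    --pipeline_config_path=training/ssd_inception_v2_coco.config

The script periodically writes checkpoint files (model.ckpt-*) and event summaries into the given train_dir.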

5.3.5 train.py

import functools
import json
import os
import tensorflow as tf

from object_detection.builders import dataset_builder
from object_detection.builders import graph_rewriter_builder
from object_detection.builders import model_builder
from object_detection.legacy import trainer
from object_detection.utils import config_util

tf.logging.set_verbosity(tf.logging.INFO)

flags = tf.app.flags
flags.DEFINE_string('master', '', 'Name of the TensorFlow master to use.')
flags.DEFINE_integer('task', 0, 'task id')
flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy per worker.')
flags.DEFINE_boolean('clone_on_cpu', False,
                     'Force clones to be deployed on CPU. Note that even if '
                     'set to False (allowing ops to run on gpu), some ops may '
                     'still be run on the CPU if they have no GPU kernel.')
flags.DEFINE_integer('worker_replicas', 1, 'Number of worker+trainer replicas.')
flags.DEFINE_integer('ps_tasks', 0,
                     'Number of parameter server tasks. If None, does not use '
                     'a parameter server.')
flags.DEFINE_string('train_dir', '',
                    'Directory to save the checkpoints and training summaries.')
flags.DEFINE_string('pipeline_config_path', '',
                    'Path to a pipeline_pb2.TrainEvalPipelineConfig config '
                    'file. If provided, other configs are ignored')
flags.DEFINE_string('train_config_path', '',
                    'Path to a train_pb2.TrainConfig config file.')
flags.DEFINE_string('input_config_path', '',
                    'Path to an input_reader_pb2.InputReader config file.')
flags.DEFINE_string('model_config_path', '',
                    'Path to a model_pb2.DetectionModel config file.')
FLAGS = flags.FLAGS


@tf.contrib.framework.deprecated(None, 'Use object_detection/model_main.py.')
def main(_):
  assert FLAGS.train_dir, '`train_dir` is missing.'
  if FLAGS.task == 0:
    tf.gfile.MakeDirs(FLAGS.train_dir)
  if FLAGS.pipeline_config_path:
    configs = config_util.get_configs_from_pipeline_file(
        FLAGS.pipeline_config_path)
    if FLAGS.task == 0:
      tf.gfile.Copy(FLAGS.pipeline_config_path,
                    os.path.join(FLAGS.train_dir, 'pipeline.config'),
                    overwrite=True)
  else:
    configs = config_util.get_configs_from_multiple_files(
        model_config_path=FLAGS.model_config_path,
        train_config_path=FLAGS.train_config_path,
        train_input_config_path=FLAGS.input_config_path)
    if FLAGS.task == 0:
      # Copy the config files used for this run into train_dir.
      for name, config in [('model.config', FLAGS.model_config_path),
                           ('train.config', FLAGS.train_config_path),
                           ('input.config', FLAGS.input_config_path)]:
        tf.gfile.Copy(config, os.path.join(FLAGS.train_dir, name),
                      overwrite=True)

  model_config = configs['model']
  train_config = configs['train_config']
  input_config = configs['train_input_config']

  model_fn = functools.partial(
      model_builder.build,
      model_config=model_config,
      is_training=True)

  def get_next(config):
    return dataset_builder.make_initializable_iterator(
        dataset_builder.build(config)).get_next()

  create_input_dict_fn = functools.partial(get_next, input_config)

  env = json.loads(os.environ.get('TF_CONFIG', '{}'))
  cluster_data = env.get('cluster', None)
  cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None
  task_data = env.get('task', None) or {'type': 'master', 'index': 0}
  task_info = type('TaskSpec', (object,), task_data)

  # Parameters for a single worker.
  ps_tasks = 0
  worker_replicas = 1
  worker_job_name = 'lonely_worker'
  task = 0
  is_chief = True
  master = ''

  if cluster_data and 'worker' in cluster_data:
    worker_replicas = len(cluster_data['worker']) + 1
  if cluster_data and 'ps' in cluster_data:
    ps_tasks = len(cluster_data['ps'])

  if worker_replicas > 1 and ps_tasks < 1:
    raise ValueError('At least 1 ps task is needed for distributed training.')

  if worker_replicas >= 1 and ps_tasks > 0:
    # Set up distributed training.
    server = tf.train.Server(tf.train.ClusterSpec(cluster), protocol='grpc',
                             job_name=task_info.type,
                             task_index=task_info.index)
    if task_info.type == 'ps':
      server.join()
      return

    worker_job_name = '%s/task:%d' % (task_info.type, task_info.index)
    task = task_info.index
    is_chief = (task_info.type == 'master')
    master = server.target

  graph_rewriter_fn = None
  if 'graph_rewriter_config' in configs:
    graph_rewriter_fn = graph_rewriter_builder.build(
        configs['graph_rewriter_config'], is_training=True)

  trainer.train(
      create_input_dict_fn,
      model_fn,
      train_config,
      master,
      task,
      FLAGS.num_clones,
      worker_replicas,
      FLAGS.clone_on_cpu,
      ps_tasks,
      worker_job_name,
      is_chief,
      FLAGS.train_dir,
      graph_hook_fn=graph_rewriter_fn)


if __name__ == '__main__':
  tf.app.run()

5.3.6 Detect Objects Using Webcam

import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile
import cv2

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image

from utils import label_map_util
from utils import visualization_utils as vis_util

# Capture frames from the default webcam.
cap = cv2.VideoCapture(0)

# Path to the frozen detection graph, i.e. the actual model that is used
# for the object detection.
PATH_TO_CKPT = 'trained-inference-graphs/output_inference_graph_v1.pb/frozen_inference_graph.pb'
# Path to the label map.
PATH_TO_LABELS = 'annotations/label_map.pbtxt'

NUM_CLASSES = 4

# Load the frozen detection graph into memory.
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

# Load the label map and build the category index.
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)


def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)


with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        while True:
            # Read frame from camera
            ret, image_np = cap.read()
            # Expand dimensions since the model expects images to have
            # shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Extract image tensor
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Extract detection boxes
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Extract detection scores
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            # Extract detection classes
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            # Extract number of detections
            num_detections = detection_graph.get_tensor_by_name(
                'num_detections:0')
            # Actual detection.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)

            # Display output
            cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))

            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break

Chapter 6
Dataset

The dataset is stored as TFRecords, which hold the images used for training and testing.

6.1 Train_labels.csv

This file stores the annotations of all training images, one row per labelled object, with the
parameters filename, width, height, class, xmin, ymin, xmax and ymax.
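For illustration, a few hypothetical rows of such a file (the actual values come from the annotation tool; see Table 6.1) look like:

filename,width,height,class,xmin,ymin,xmax,ymax
img_001.jpg,640,480,mobile,120,80,340,410
img_001.jpg,640,480,hand,60,150,280,460
img_002.jpg,640,480,book,200,100,500,380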

Table. 6.1

Table. 6.2
6.2 Test_labels.csv

This file stores the annotations of the images used for testing, in the same format.

Table. 6.3

Table. 6.4

Chapter 7
Snapshots/Forms

7.1 Raw images

The first step is collecting images for our project. I downloaded them from Google and
ensured that the images were taken from multiple angles, brightness levels and scales, so
that the detector can work under different lighting conditions and viewpoints. Overall
100–150 pictures will suffice. Some sample images are shown below:

Fig.7.1

7.2 Labelling the image

I used labelImg to annotate the images. Annotations are created in the Pascal VOC format,
which is useful later on. The tool is written in Python and uses Qt for its interface; I used
Python 3 + Qt5 with no problems. An example of an annotated image is shown below.
Essentially we identify xmin, ymin, xmax and ymax for each object and pass these to the
model along with the image for training.
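A trimmed, hypothetical example of the Pascal VOC annotation that labelImg writes for one object (all values are illustrative):

<annotation>
  <filename>img_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>mobile</name>
    <bndbox>
      <xmin>120</xmin><ymin>80</ymin><xmax>340</xmax><ymax>410</ymax>
    </bndbox>
  </object>
</annotation>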

Fig. 7.2

7.3 Creating xml files

Fig. 7.3

Fig. 7.4
Another example of an annotated image; we use up to 100 images for each object.

Fig.7.5

7.4 Bounding Box

Fig. 7.6
7.5 Creating XML File

Fig. 7.7

7.6 Label Maps

After annotating the images we create a label map, which includes an item name, id and
display name; there is one entry for each object. We create a label.pbtxt file that is used to
convert a label name into a numeric id.
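For the five objects used in this project the label map could look as follows (the class names follow the labels defined in the TFRecord conversion script; the id assignment is illustrative):

item {
  id: 1
  name: 'mobile'
}
item {
  id: 2
  name: 'hand'
}
item {
  id: 3
  name: 'book'
}
item {
  id: 4
  name: 'pen'
}
item {
  id: 5
  name: 'bag'
}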

Fig. 7.8

7.7 Raw images and xml files

This shows all the images stored in the test and train folders. These images are used for
training and testing of the object detector.

Fig.7.9

7.8 Monitor Training Job Progress using TensorBoard

We check the training progress and the loss rate using TensorBoard, which shows the
reports as graphs. It visualises the checkpoints and how well the model is training.
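Assuming the checkpoints and summaries are written to a training/ directory (a hypothetical path), TensorBoard is started with:

tensorboard --logdir=training/

and the graphs can then be viewed in a browser at http://localhost:6006.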

Fig. 7.10

Fig. 7.11

Fig. 7.13

7.9 Target assignment

Fig. 7.14

7.10 Result
After running the program a new window will open, which can be used to detect objects in
real time.

Fig. 7.15

Fig. 7.16

Chapter 8
Testing

A set of activities carried out to check the functionality and stability of a system is termed
testing. These activities are planned and performed systematically so that they leave no
scope for rework or bugs. General characteristics of these strategies are:
1 Testing begins at the module level and works outward.
2 Different testing techniques are appropriate at different points in time.
3 Debugging and testing are altogether different procedures.
4 The developer of the software conducts the testing, and if the project is big there is a
dedicated testing team.
The system testing involved here is the most widely used testing procedure, consisting of
five stages as shown in the figure. In general, the sequence of testing activities is
component testing, integration testing and then user testing.

[Fig. 8.1: Testing stages — unit testing and module testing (component testing), sub-system testing and system testing (integration testing), and acceptance testing (user testing).]
8.1 Functional Testing

Once the system is completely developed and integrated, it is checked and evaluated for
its functionality as a whole against specific demands and requirements. This type of
testing falls under the category of black-box testing and does not require knowledge of the
in-depth workings and protocols of the system.

8.2 Structural Testing

In contrast to functional testing, structural testing checks the functionality of the different
modules of the system and how well they link with the other modules. This type of testing
requires full knowledge of the behaviour, protocols and workings of the system, both as a
whole and module by module. Knowledge of the system's code base and programming is
also required to perform this testing. The tester chooses inputs to exercise paths through
the code and determines the appropriate outputs.

8.3 Testing the model

To test the model, we first select a model checkpoint (usually the latest) and export it into
a frozen inference graph. Checkpoints are created while the model is being trained, and
with the help of a checkpoint we test the model. We divide our data so that 70% of the
images are used for training and 30% for testing, splitting them into test and train folders,
and we store around 100 images per object so that every angle of the object is covered.
Some test images are shown in Fig. 8.3.1; an example export command is sketched below.
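As a sketch (the script is part of the Object Detection API; the checkpoint number and paths are assumptions, not the exact ones used here), the export step looks like:

python export_inference_graph.py --input_type image_tensor \
    --pipeline_config_path training/ssd_inception_v2_coco.config \
    --trained_checkpoint_prefix training/model.ckpt-XXXX \
    --output_directory trained-inference-graphs/output_inference_graph_v1.pb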

Fig 8.3.1

We ran tests with databases built for 6, 12, 18 and 24 objects and obtained overall success
rates (correct classification on forced choice) of 99.6%, 98%, 97.4% and 97% respectively.
The worst cases were the book and the pen in the 24-object test, with 19/24 and 20/24
correct respectively.

Table 8.3.2
The time to identify an object depends more or less linearly on the number of key features
fed to the system and on the size of the database. At the moment, the overall recognition
time on a single processor is about 20 seconds for the 6-object database and about 2
minutes for the 24-object database. This could be improved substantially by working on
the indexing methods.

The program updates the video window with a new frame every 0.25 to 0.5 seconds,
which means an average of 2–4 FPS. In this project we detect objects live with the help of
a camera.

Fig. 8.3.3

It identifies me as a person with 95% confidence and the water bottle also with 95%
confidence, which shows the accuracy of the detector.

Fig. 8.3.4

Chapter 9
Maintenance & Evaluation

Maintenance is the term used to refer to modifications that are made to a software system
after its release. System maintenance is an ongoing activity which covers a wide variety of
tasks, including removing program and design errors, updating documentation and test
data, and updating user support. Maintenance can be broadly classified into the following
three classes:

9.1 Corrective maintenance

This is used to remove errors in the program, which occur when the product is delivered
as well as during maintenance. Thus, in corrective maintenance the product is modified to
fix the errors discovered after the software product has been delivered to the customer.

9.2 Adaptive maintenance

Adaptive maintenance is generally not requested by the client but is imposed by the
outside environment. It may include the following organizational changes:

 Change in the object


 Change in algorithms for faster performance
 Change in input frames, e.g. detecting on recorded video frames instead of live detection
 Change in system controls and security needs etc.

9.3 Perfective maintenance

It means changing the software to improve some of its qualities, such as adding new
functions, improving computational efficiency or making it easier to use. This type of
maintenance is used to respond to additional user needs, which may arise from changes
within or outside the organization. These changes include:

 Changes in software
 Economic and competitive conditions
 Changes in models

System evaluation is the process of checking the performance of the complete system to
estimate how it is likely to perform in live market conditions. It measures the performance
of the system and whether it can compete or not.

Chapter 10
Conclusion and Future Scope

The object detection system in images is a web-based application which mainly aims to
detect multiple objects in various types of images. To achieve this goal, shape and edge
features are extracted from the image. It uses a large image database for correct object
detection and recognition. The system provides an easy user interface to retrieve the
desired images. The system has an additional feature, sketch-based detection, in which the
user can draw a sketch by hand as the input. Finally, the system outputs images by
searching for those images that the user wants.

Scope of Object Detection and Recognition

The project has wide scope in multiple areas and can easily increase its utility by adding
more efficient algorithms. Some of the areas are as follows:

Medical diagnosis:
Object detection and recognition can be used in medical diagnosis, for example to
analyse X-ray reports and detect brain tumors.

Shape recognition:
Recognizing shapes from whole regions in images.

Cartography:
Cartography is the discipline dealing with the conception, production, dissemination and
study of maps.

Robotics:
In robotics, object detection is used for the movement of body parts and for motion
sensing.

Chapter 11
References

 He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image
recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (2016), pp. 770–778.
 Hoiem, D., Efros, A. A., and Hebert, M. Automatic photo popup. ACM transactions
on graphics (TOG) 24, 3 (2005), 577–584.
 Hoiem, D., Efros, A. A., and Hebert, M. Geometric context from a single image. In
Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on
(2005), vol. 1, IEEE, pp. 654–661.
 Hoiem, D., Efros, A. A., and Hebert, M. Putting objects in perspective.
International Journal of Computer Vision 80, 1 (2008), 3–15.
 Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural
networks 4, 2 (1991), 251–257.
 Huang, T. Computer vision: Evolution and promise. CERN EUROPEAN
ORGANIZATION FOR NUCLEAR RESEARCH-REPORTSCERN (1996), 21–
26.
 Hubel, D. H., and Wiesel, T. N. Receptive fields and functional architecture of
monkey striate cortex. The Journal of Physiology 195, 1 (1968), 215–243.
 Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. CoRR abs/1502.03167 (2015).
 Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,
