
OBJECT DETECTION USING MACHINE

LEARNING TECHNIQUES

Seminar Report submitted in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY

By

Mr. THOTA POORNITH SUNDAR (16UECN0065)
Mr. HEMANT KUMAR GAHLOT (16UECN0021)
Mr. V PRUDHVI KUMAR REDDY (16UECS0478)

Under the guidance of

Mrs. N. Beulah Jabaseeli, M.E.,


ASSISTANT PROFESSOR

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


SCHOOL OF COMPUTING

VEL TECH RANGARAJAN Dr. SAGUNTHALA R & D INSTITUTE OF SCIENCE AND
TECHNOLOGY, CHENNAI 600 062, TAMILNADU, INDIA
(Deemed to be University Estd u/s 3 of UGC Act, 1956)

October, 2019
BONAFIDE CERTIFICATE

This is to certify that the seminar entitled “OBJECT DETECTION USING MACHINE
LEARNING TECHNIQUES” submitted by T Poornith Sundar (16UECN0065), Hemant Kumar Gahlot
(16UECN0021) and V Prudhvi Kumar Reddy (16UECS0478) in partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology in Computer Science and Engineering is an
authentic work carried out by them under my supervision and guidance.
To the best of my knowledge, the matter embodied in the project report has not been submitted to
any other University/Institute for the award of any Degree or Diploma.

Signature of Supervisor Signature of Seminar Handling Faculty

Mrs. N. Beulah Jabaseeli, M.E., Dr. V. Dhilip Kumar, M.E.,

Asst. Professor, Asst. Professor,
Department of CSE, Department of CSE,
Vel Tech Rangarajan Dr. Sagunthala Vel Tech Rangarajan Dr. Sagunthala
R & D Institute of Science and Technology, R & D Institute of Science and Technology,
Avadi, Chennai-600062. Avadi, Chennai-600062.

Submitted in partial fulfillment for the award of the degree of Bachelor of Technology in
Computer Science and Engineering from Vel Tech Rangarajan Dr. Sagunthala R & D Institute of
Science and Technology (Deemed to be University, u/s 3 of UGC Act,1956).

DECLARATION

I declare that this written submission represents my ideas in my own words and where
others' ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action by
the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.

(Signature with date)

T Poornith Sundar
16UECN0065

(Signature with date)

Hemant Kumar Gahlot


16UECN0021

(Signature with date)

V Prudhvi Kumar Reddy


16UECS0478

APPROVAL SHEET

The Seminar report entitled OBJECT DETECTION USING MACHINE LEARNING
TECHNIQUES by T Poornith Sundar (16UECN0065), Hemant Kumar Gahlot
(16UECN0021) and V Prudhvi Kumar Reddy (16UECS0478) is approved for the degree of
B.Tech. in Computer Science and Engineering.

Signature of Supervisor Signature of HEAD & DEAN, SOC

Mrs. N. Beulah Jabaseeli, M.E., Dr. V. Srinivasa Rao, M.Tech., Ph.D.,

Asst. Professor, Professor,
Department of CSE, Department of CSE,
Vel Tech Rangarajan Dr. Sagunthala Vel Tech Rangarajan Dr. Sagunthala
R & D Institute of Science and Technology, R & D Institute of Science and Technology,
Avadi, Chennai-600062. Avadi, Chennai-600062.

Date :
Place:

ACKNOWLEDGEMENT

We express our deepest gratitude to our respected Founder Chancellor and President
Col. Prof. Dr. R. RANGARAJAN, B.E. (EEE), B.E. (MECH), M.S. (AUTO), D.Sc., our
Foundress President Dr. R. SAGUNTHALA RANGARAJAN, M.B.B.S., and our Chairperson
Managing Trustee and Vice President.

We are very grateful to our beloved Vice Chancellor Prof. V.S.S. KUMAR, Ph.D.,
for providing us with an environment to complete our project successfully.

We are indebted to our beloved Registrar Mrs. N.S. PREMA for providing immense
support in all our endeavors.

We are thankful to our esteemed Director Academics Dr. ANNE KOTESWARA
RAO, Ph.D., for providing a wonderful environment to complete our project successfully.

We record our indebtedness to our Head of the Department/Dean Dr. V. SRINIVASA
RAO, M.Tech., Ph.D., for the immense care and encouragement shown towards us throughout
the course of this project.

A special thanks to our Seminar Coordinator Mrs. S. HANNAH, M.E., for her valuable
guidance and support throughout the course of the project.

We also take this opportunity to express a deep sense of gratitude to our Internal
Supervisor Mrs. N. BEULAH JABASEELI, M.E., for her cordial support and guidance; she
helped us in completing this project through its various stages.

We thank our Seminar Handling Faculty Dr. V. DHILIP KUMAR, M.E., Ph.D., for the
valuable information shared on proceeding with our seminar.

We thank our department faculty, supporting staff, parents, and friends for their help and
guidance in completing this seminar.

T Poornith Sundar (VTU6928) (16UECN0065)


Hemant Kumar Gahlot (VTU7275) (16UECN0021)
V Prudhvi Kumar Reddy (VTU7978) (16UECS0478)

TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

BONAFIDE CERTIFICATE ii
DECLARATION iii
APPROVAL SHEET iv
ACKNOWLEDGEMENT v

1 ABSTRACT 1

2 INTRODUCTION 2

3 LITERATURE SURVEY 3 - 4

4 METHODOLOGY 5 - 10

5 RESULTS AND DISCUSSIONS 11

6 CONCLUSION AND FUTURE ENHANCEMENTS 12

APPENDICES 13
REFERENCES 14
CHAPTER – 1

ABSTRACT

Humans can easily detect and identify objects present in an image. The human visual
system is fast and accurate and can perform complex tasks like identifying multiple objects
and detecting obstacles with little conscious thought. With the availability of large amounts of
data, faster GPUs, and better algorithms, we can now easily train computers to detect and
classify multiple objects within an image with high accuracy.

An image classification or image recognition model simply detects the probability of
an object in an image. In contrast to this, object localization refers to identifying the
location of an object in the image. An object localization algorithm will output the
coordinates of the location of an object with respect to the image. In computer vision, the
most popular way to localize an object in an image is to represent its location with the help
of bounding boxes.

CHAPTER – 2

INTRODUCTION

Object detection is a computer technology related to computer vision and image
processing that deals with detecting instances of semantic objects of a certain class (such as
humans, buildings, or cars) in digital images and videos. Well-researched domains of object
detection include face detection and pedestrian detection. Object detection has applications in
many areas of computer vision, including image retrieval and video surveillance.

Every object class has its own special features that help in classifying it – for
example, all circles are round. Object class detection uses these special features. For example,
when looking for circles, objects that are at a particular distance from a point (i.e. the center)
are sought. Similarly, when looking for squares, objects that have perpendicular corners and
equal side lengths are needed. A similar approach is used for face identification, where eyes,
nose, and lips can be located and features like skin color and the distance between the eyes
can be measured.
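As an illustration of this idea, the circle case can be handled directly by the circular
Hough transform available in OpenCV. The following is a minimal sketch, not part of any
surveyed work; the file name scene.png and the parameter values are placeholders:

import cv2
import numpy as np

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 5)                 # suppress noise before voting

# each detected circle is (x_centre, y_centre, radius)
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=100, param2=40, minRadius=10, maxRadius=100)

if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"circle at ({x}, {y}) with radius {r}")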

CHAPTER – 3

LITERATURE SURVEY

[1] Histograms of Oriented Gradients for Human Detection

Papageorgiou et al. describe a pedestrian detector based on a polynomial SVM using
rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant.

Gavrila & Philomin take a more direct approach, extracting edge images and matching
them to a set of learned exemplars using chamfer distance. This has been used in a practical
real-time pedestrian detection system.

Viola et al. build an efficient moving person detector, using AdaBoost to train a chain of
progressively more complex region rejection rules based on Haar-like wavelets and space-time
differences.

Ronfard et al. build an articulated body detector by incorporating SVM-based limb
classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar
to those of Felzenszwalb & Huttenlocher and Ioffe & Forsyth.

Mikolajczyk et al. use combinations of orientation-position histograms with binary-
thresholded gradient magnitudes to build a parts-based method containing detectors for faces,
heads, and front and side profiles of upper and lower body parts. In contrast, the Dalal-Triggs
detector uses a simpler architecture with a single detection window, but appears to give
significantly higher performance on pedestrian images.

Fig 1 – HOG Detection workflow

[2] Rich feature hierarchies for accurate object detection

Fukushima's “neocognitron”, a biologically inspired hierarchical and shift-invariant model
for pattern recognition, was an early attempt at just such a process. The neocognitron, however,
lacked a supervised training algorithm.

LeCun et al. provided the missing algorithm by showing that stochastic gradient descent,
via backpropagation, can train convolutional neural networks (CNNs), a class of models that
extend the neocognitron. CNNs saw heavy use in the 1990s, but then fell out of fashion,
particularly in computer vision, with the rise of support vector machines.

In 2012, Krizhevsky et al. rekindled interest in CNNs by showing substantially higher
image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC). Their success resulted from training a large CNN on 1.2 million labeled images,
together with a few twists on LeCun's CNN (e.g., max(x, 0) rectifying non-linearities and
“dropout” regularization).

Unlike image classification, detection requires localizing (likely many) objects within an
image. One approach frames localization as a regression problem. However, work from Szegedy
et al., concurrent with the R-CNN work, indicates that this strategy may not fare well in practice
(they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by R-CNN).

An alternative is to build a sliding-window detector. CNNs have been used in this way for
at least two decades, typically on constrained object categories, such as faces and pedestrians. In
order to maintain high spatial resolution, these CNNs typically only have two convolutional and
pooling layers. The R-CNN authors also considered a sliding-window approach; however, units
high up in their network, which has five convolutional layers, have very large receptive fields
(195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise
localization within the sliding-window paradigm an open technical challenge.

CHAPTER – 4

METHODOLOGY

Machine learning is concerned with learning an appropriate set of parameters within a
model class from training data. The meta-level problems of determining appropriate model
classes are referred to as model selection or model adaptation. Supervised as well as
unsupervised learning approaches have been used in object detection. Supervised learning
approaches rely on labeled training data, i.e., a set of training examples. For machine learning
approaches, it is necessary to first define features using one of the methods below, and then use
a technique such as a support vector machine (SVM) to perform the classification (a minimal
sketch of this pipeline follows the list below).

Machine Learning approaches:

1. Viola–Jones object detection framework,
2. Scale-invariant feature transform (SIFT), and
3. Histogram of oriented gradients (HOG)
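As a concrete illustration, the sketch below pairs HOG features (method 3 above) from
scikit-image with a linear SVM from scikit-learn. It is a minimal sketch only: the random
images and labels are placeholders standing in for a real labeled training set.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
train_images = rng.random((20, 128, 64))     # placeholder 64x128 grayscale windows
train_labels = rng.integers(0, 2, 20)        # placeholder object / background labels

def extract_features(images):
    # one HOG descriptor per window: 9 orientation bins, 8x8-pixel cells,
    # 2x2-cell blocks (the usual Dalal-Triggs settings)
    return np.array([hog(im, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for im in images])

clf = LinearSVC(C=0.01)                      # linear SVM classifier
clf.fit(extract_features(train_images), train_labels)

# score a new window: positive scores mean "object present"
test_window = rng.random((128, 64))
print(clf.decision_function(extract_features([test_window])))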

[1] VIOLA–JONES OBJECT DETECTION FRAMEWORK

The Viola–Jones object detection framework, proposed in 2001 by Paul Viola and
Michael Jones, was the first object detection framework to provide competitive object detection
rates in real time. Although it can be trained to detect a variety of object classes, it was
motivated primarily by the problem of face detection.

The features sought by the detection framework universally involve the sums of image
pixels within rectangular areas. However, since the features used by Viola and Jones all rely on
more than one rectangular area, they are generally more complex. Fig 2 below illustrates the
four different types of features used in the framework. The value of any given feature is the sum
of the pixels within the clear rectangles subtracted from the sum of the pixels within the shaded
rectangles. Rectangular features of this sort are primitive when compared to alternatives such as
steerable filters: although they are sensitive to vertical and horizontal structure, their response is
considerably coarser.

Fig 2 – Viola Jones property masks

• A few properties common to human faces:
o The eye region is darker than the upper cheeks.
o The nose bridge region is brighter than the eyes.
• Composition of properties forming matchable facial features:
o Location and size: eyes, mouth, bridge of nose
• Rectangle features:
o Value = Σ (pixels in black area) - Σ (pixels in white area)
o For example: the difference in brightness between the white & black rectangles
over a specific area (see the sketch below)
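The rectangle sums above can be evaluated in constant time with an integral image
(summed-area table), the device Viola and Jones rely on for speed. A minimal sketch, with an
illustrative 24×24 window and hand-picked rectangle coordinates:

import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; zero-padded so lookups need no bounds checks
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, x, y, w, h):
    # sum of the w-by-h rectangle with top-left corner (x, y), in four lookups
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.random.rand(24, 24)              # synthetic 24x24 detection window
ii = integral_image(img)

# horizontal two-rectangle feature: dark strip minus bright strip
white = rect_sum(ii, x=4, y=4, w=8, h=4)
black = rect_sum(ii, x=4, y=8, w=8, h=4)
print(black - white)                      # Value = sum(black) - sum(white)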

[2] SCALE-INVARIANT FEATURE TRANSFORM (SIFT)

The scale-invariant feature transform (SIFT) is a feature detection algorithm in computer
vision to detect and describe local features in images. It was patented in Canada by the
University of British Columbia and published by David Lowe in 1999. Applications include
object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture
recognition, video tracking, individual identification of wildlife, and match moving.

SIFT keypoints of objects are first extracted from a set of reference images and stored in a
database. An object is recognized in a new image by individually comparing each feature from
the new image to this database and finding candidate matching features based on Euclidean
distance of their feature vectors. From the full set of matches, subsets of key-points that agree on
the object and its location, scale, and orientation in the new image are identified to filter out
good matches.
The determination of consistent clusters is performed rapidly by using an efficient hash
table implementation of the generalized Hough transform. Each cluster of 3 or more features that
agree on an object and its pose is then subject to further detailed model verification, and
subsequently outliers are discarded. Finally, the probability that a particular set of features
indicates the presence of an object is computed, given the accuracy of fit and the number of
probable false matches. Object matches that pass all these tests can be identified as correct with
high confidence. After scale-space extrema are detected (their locations shown in the uppermost
image of Fig 3), the SIFT algorithm discards low-contrast key-points (the remaining points
shown in the middle image) and then filters out those located on edges.
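The matching stage just described can be sketched with OpenCV's SIFT implementation
and a brute-force matcher on Euclidean distance. This assumes a modern OpenCV build (where
SIFT_create is in the main namespace) and two placeholder image files; the 0.75 ratio threshold
is the commonly quoted value from Lowe's work:

import cv2

ref = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)
qry = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(ref, None)
kp2, des2 = sift.detectAndCompute(qry, None)

# nearest-neighbour matching on Euclidean distance between 128-D descriptors
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)

# ratio test: keep a match only if it clearly beats the second-best candidate
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])
print(f"{len(good)} candidate matches")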

Fig 3 – Key-Points detection and reduction

Fig 4 – Detection mechanism in SIFT

[3] HISTOGRAM OF ORIENTED GRADIENTS (HOG)

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision
and image processing for the purpose of object detection. The technique counts occurrences of
gradient orientation in localized portions of an image. This method is similar to that of edge
orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but
differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local
contrast normalization for improved accuracy.

Gradient computation

The first step of calculation in many feature detectors is image pre-processing to
ensure normalized color and gamma values; Dalal and Triggs note, however, that this step
can be omitted in HOG computation, as the later descriptor normalization achieves the
same effect. The first step is therefore the gradient computation itself. The most common
method is to apply the 1-D centered, point discrete derivative mask in one or both of the
horizontal and vertical directions. Specifically, this method requires filtering the color or
intensity data of the image with the following filter kernels:

[-1, 0, 1] and [-1, 0, 1]^T

Dalal and Triggs tested other, more complex masks, such as the 3×3 Sobel mask or
diagonal masks, but these masks generally performed more poorly in detecting humans in
images. They also experimented with Gaussian smoothing before applying the derivative
mask, but similarly found that omission of any smoothing performed better in practice.
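A minimal sketch of this filtering step with SciPy, applied to a synthetic image; the
magnitude and unsigned orientation computed here feed the binning step described next:

import numpy as np
from scipy.ndimage import correlate1d

img = np.random.rand(128, 64)                 # placeholder grayscale window

gx = correlate1d(img, [-1, 0, 1], axis=1)     # horizontal: [-1, 0, 1]
gy = correlate1d(img, [-1, 0, 1], axis=0)     # vertical:   [-1, 0, 1]^T

magnitude = np.hypot(gx, gy)
orientation = np.degrees(np.arctan2(gy, gx)) % 180   # "unsigned" gradient, 0-180
print(magnitude.shape, orientation.min(), orientation.max())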

Orientation binning

The second step of calculation is creating the cell histograms. Each pixel within the
cell casts a weighted vote for an orientation-based histogram channel based on the values
found in the gradient computation. The cells themselves can either be rectangular or radial
in shape, and the histogram channels are evenly spread over 0 to 180 degrees or 0 to 360
degrees, depending on whether the gradient is “unsigned” or “signed”.
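A minimal sketch of the voting step for one 8×8 cell with nine unsigned orientation
channels follows. Note that real implementations, including Dalal and Triggs', split each vote
bilinearly between the two nearest bins; this sketch uses hard assignment for brevity:

import numpy as np

def cell_histogram(mag, ang, n_bins=9):
    # mag, ang: 8x8 gradient magnitudes and unsigned orientations (0-180 degrees)
    bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m                      # magnitude-weighted vote
    return hist

cell_mag = np.random.rand(8, 8)
cell_ang = np.random.rand(8, 8) * 180
print(cell_histogram(cell_mag, cell_ang))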

Descriptor blocks

To account for changes in illumination and contrast, the gradient strengths must be
locally normalized, which requires grouping the cells together into larger, spatially
connected blocks. The HOG descriptor is then the concatenated vector of the components
of the normalized cell histograms from all of the block regions. These blocks typically
overlap, meaning that each cell contributes more than once to the final descriptor. Two
main block geometries exist: rectangular R-HOG blocks and circular C-HOG blocks. R-
HOG blocks are generally square grids, represented by three parameters: the number of
cells per block, the number of pixels per cell, and the number of channels per cell
histogram.
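A sketch of how overlapping 2×2-cell R-HOG blocks are assembled into the final
descriptor, using placeholder cell histograms; the geometry corresponds to a 64×128 window of
8×8-pixel cells, and plain L2 normalization stands in for whichever scheme is chosen (see the
block normalization subsection below):

import numpy as np

n_cells_y, n_cells_x, n_bins = 16, 8, 9               # 64x128 window, 8x8 cells
cells = np.random.rand(n_cells_y, n_cells_x, n_bins)  # placeholder histograms

blocks = []
for y in range(n_cells_y - 1):                        # blocks overlap by one cell
    for x in range(n_cells_x - 1):
        block = cells[y:y + 2, x:x + 2].ravel()       # 2x2 cells -> 36 values
        block = block / np.sqrt(np.sum(block ** 2) + 1e-5)  # L2 block normalization
        blocks.append(block)

descriptor = np.concatenate(blocks)
print(descriptor.shape)                               # (3780,) = 15 * 7 * 36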

The R-HOG blocks appear quite similar to the scale-invariant feature transform
(SIFT) descriptors; however, despite their similar formation, R-HOG blocks are computed
in dense grids at some single scale without orientation alignment, whereas SIFT
descriptors are usually computed at sparse, scale-invariant key image points and are rotated
to align orientation. In addition, the R-HOG blocks are used in conjunction to encode
spatial form information, while SIFT descriptors are used singly.

Circular HOG blocks (C-HOG) can be found in two variants: those with a single,
central cell and those with an angularly divided central cell. In addition, these C-HOG
blocks can be described with four parameters: the number of angular and radial bins, the
radius of the center bin, and the expansion factor for the radius of additional radial bins.

Block normalization

The normalization factor f can be one of the following schemes, where v is the
non-normalized vector containing all the histograms in a given block, ||v||_k is its k-norm
for k = 1, 2, and e is a small constant:

L2-norm: f = v / sqrt(||v||_2^2 + e^2)

L2-hys: L2-norm followed by clipping

L1-norm: f = v / (||v||_1 + e)

L1-sqrt: f = sqrt(v / (||v||_1 + e))

In addition, the scheme L2-hys can be computed by first taking the L2-norm,
clipping the result, and then renormalizing.
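Expressed in NumPy, L2-hys is a three-step operation. A minimal sketch, where the 0.2
clipping threshold is the value used by Dalal and Triggs:

import numpy as np

def l2_hys(v, eps=1e-5, clip=0.2):
    v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)     # L2-norm
    v = np.minimum(v, clip)                        # clip large components
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # renormalize

print(l2_hys(np.random.rand(36)))                  # one 2x2-cell block (4 x 9 bins)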

Object recognition

HOG descriptors may be used for object recognition by providing them as features
to a machine learning algorithm. However, HOG descriptors are not tied to a specific
machine learning algorithm.
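For pedestrian detection specifically, OpenCV bundles a HOG descriptor with a
pre-trained linear SVM, so the whole recognition pipeline can be sketched in a few lines (the
image file name is a placeholder):

import cv2

hog = cv2.HOGDescriptor()                  # default 64x128 people window
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.png")
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)

for (x, y, w, h), score in zip(rects, weights.ravel()):
    print(f"person at ({x}, {y}) size {w}x{h}, SVM score {score:.2f}")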

Fig 5 – HOG descriptor and detector working steps

CHAPTER – 5

RESULTS AND DISCUSSIONS

Overall, there are several notable findings in this work. The fact that HOG greatly
outperforms wavelets, and that any significant degree of smoothing before calculating gradients
damages the HOG results, emphasizes that much of the available image information comes from
abrupt edges at fine scales, and that blurring this in the hope of reducing sensitivity to spatial
position is a mistake. Instead, gradients should be calculated at the finest available scale in the
current pyramid layer, rectified or used for orientation voting, and only then blurred spatially.
Given this, relatively coarse spatial quantization suffices (8×8 pixel cells / one limb width). On
the other hand, at least for human detection, it pays to sample orientation rather finely: both
wavelets and shape contexts lose out significantly here.

Secondly, strong local contrast normalization is essential for good results, and traditional
centre-surround style schemes are not the best choice. Better results can be achieved by
normalizing each element (edge, cell) several times with respect to different local supports, and
treating the results as independent signals. In the standard Dalal-Triggs detector, each HOG cell
appears four times with different normalizations, and including this 'redundant' information
improves performance from 84% to 89% at 10^-4 FPPW.

CHAPTER – 6

CONCLUSION AND FUTURE ENHANCEMENTS

[1] CONCLUSION

Machine learning approaches are a good starting point for understanding how objects are
detected within an image, but deep learning techniques are now the stronger option for object
detection and face recognition. HOG classification remains the primary baseline algorithm
used by many authors.

[2] FUTURE ENHANCEMENTS

Although the linear SVM detector of Dalal and Triggs is reasonably efficient – processing
a 320×240 scale-space image (4000 detection windows) in less than a second – there is still room
for optimization, and to further speed up detection it would be useful to develop a coarse-to-fine
or rejection-chain style detector based on HOG descriptors.

Further work includes HOG-based detectors that incorporate motion information using
block matching or optical flow fields. Finally, although the current fixed-template-style detector
has proven difficult to beat for fully visible pedestrians, humans are highly articulated, and
including a parts-based model with a greater degree of local spatial invariance should help to
improve detection results in more general situations.

Appendix 1

LIST OF FIGURES

S. NO. FIG. NO. TITLE PAGE NO.


1 1 HOG detection workflow 3
2 2 Viola Jones property masks 6
3 3 Key-Points detection and reduction 7
4 4 Detection mechanism in SIFT 7
5 5 HOG descriptor and detector working steps 10

REFERENCES

[1] S. Belongie, J. Malik, and J. Puzicha. “Matching shapes” The 8th ICCV, Vancouver, Canada,
pages 454–461, 2001.

[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool.
“Efficient pedestrian detection: a test case for SVM based categorization” Workshop on
Cognitive Vision, 2002.

[3] P. Felzenszwalb and D. Huttenlocher. “Efficient matching of pictorial structures” CVPR,
Hilton Head Island, South Carolina, USA, pages 66–75, 2000.

[4] W. T. Freeman and M. Roth. “Orientation histograms for hand gesture recognition” Intl.
Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich,
Switzerland, pages 296–301, June 1995.

[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. “Computer vision for computer games”
2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT,
USA, pages 100–105, October 1996.

[6] D. M. Gavrila. “The visual analysis of human movement: A survey” CVIU, 73(1):82–98,
1999.

[7] D. M. Gavrila, J. Giebel, and S. Munder. “Vision-based pedestrian detection: the
PROTECTOR system” Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.

[8] D. M. Gavrila and V. Philomin. “Real-time object detection for smart vehicles” CVPR, Fort
Collins, Colorado, USA, pages 87–93, 1999.

[9] S. Ioffe and D. A. Forsyth. “Probabilistic methods for finding people” IJCV, 43(1):45–68,
2001.
