
Object Detection Using Transformers
Team:
A. Supriya (172W1A0501)
A. Bindu Priya (17X41A0561)
E. Sai Teja (18X45A0516)
CH. Mounika (17X41A0568)
Project Guide: M. Naresh Babu
Project Coordinator: Dr. B. Asha Latha
H.O.D: Dr. D. Haritha
ABSTRACT
We present a new method that views object detection as a direct set prediction
problem. Our approach streamlines the detection pipeline, effectively removing the
need for many hand-designed components like a non-maximum suppression procedure
or anchor generation that explicitly encode our prior knowledge about the task. The
main ingredients of the new framework, called DEtection TRansformer or DETR, are
a set-based global loss that forces unique predictions via bipartite matching, and a
transformer encoder-decoder architecture. Given a fixed small set of learned object
queries, DETR reasons about the relations of the objects and the global image context
to directly output the final set of predictions in parallel. The new model is
conceptually simple and does not require a specialized library, unlike many other
modern detectors. DETR demonstrates accuracy and run-time performance on par
with the well-established and highly-optimized Faster R-CNN baseline on the
challenging COCO object detection dataset.
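The set-based loss described above depends on a bipartite matching step that pairs each ground-truth object with exactly one prediction. A minimal sketch of that step using SciPy's Hungarian-algorithm solver (the cost values below are illustrative toy numbers, not DETR's actual class-plus-box matching cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative cost matrix: rows = 4 predicted boxes, cols = 2 ground-truth
# objects. Lower cost = better match (in DETR the cost combines class
# probability and bounding-box distance).
cost = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.6],
    [0.5, 0.4],
])

# Hungarian algorithm: one-to-one assignment minimizing total cost, so each
# ground-truth object gets a unique prediction (this is what "forces unique
# predictions" in the loss).
pred_idx, gt_idx = linear_sum_assignment(cost)
pairs = sorted(zip(pred_idx.tolist(), gt_idx.tolist()))
print(pairs)  # [(0, 1), (1, 0)]
```

Because the matching is one-to-one, duplicate detections of the same object are penalized during training, which is why DETR can drop non-maximum suppression.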
INTRODUCTION
Object detection is a computer vision technique that allows us to identify and locate objects in an image or video. With this kind of identification, object detection can be used to count objects in an image or video and determine and track their precise locations, all while accurately labeling them.
WHAT WE ARE WORKING ON
[Block diagram: pipeline components]
- Convolutional Neural Network (ResNet-50) backbone
- DETR: Detection Transformers with a bipartite matching loss function
- Transformer encoder-decoder
- Feed-forward network (FFN)
- COCO object detection dataset
Convolutional neural networks
RESNET-50
ResNet-50 is a convolutional neural network that is 50 layers deep. You can load a pretrained version of the network trained on more than a million images from the ImageNet database. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224.
ResNet-50 is a variant of the ResNet model which has 48 convolution layers along with 1 max-pool and 1 average-pool layer. It requires about 3.8 × 10^9 floating-point operations. It is a widely used ResNet model, and we have explored the ResNet-50 architecture in depth.
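The idea that lets ResNet go 50 layers deep is the residual (skip) connection: each block learns a correction f(x) that is added back to its input. A toy NumPy sketch of that idea (the two-layer transform and dimensions here are illustrative, not the exact ResNet-50 bottleneck design):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = relu(x + f(x)): the skip connection means the block can fall
    back to (near-)identity when the learned transform f contributes little,
    which keeps very deep networks trainable."""
    f = relu(x @ w1) @ w2
    return relu(x + f)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = rng.standard_normal((8, 8)) * 0.01  # near-zero weights: f(x) ~ 0
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)

# With tiny weights the block approximately passes its input through (after
# ReLU), instead of destroying the signal as a plain deep stack would.
print(np.allclose(y, relu(x), atol=1e-2))  # True
```

Stacking 16 such bottleneck blocks (plus the stem and pooling layers) is what gives ResNet-50 its 50-layer depth.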
Transformers
Transformers are a deep learning architecture that has gained popularity in recent years. They rely on a simple yet powerful mechanism called attention, which enables AI models to selectively focus on certain parts of their input and thus reason more effectively. Transformers have been widely applied to problems with sequential data, in particular natural language processing (NLP) tasks such as language modeling and machine translation, and have also been extended to tasks as diverse as speech recognition, symbolic mathematics, and reinforcement learning.
So, what exactly is a Transformer?
● In 2017, Vaswani et al. introduced the Transformer and thereby gave birth to transformer-based encoder-decoder models.
● Analogous to RNN-based encoder-decoder models, transformer-based encoder-decoder models consist of an encoder and a decoder which are both stacks of residual attention blocks.
● Not relying on a recurrent structure allows transformer-based encoder-decoders to be highly parallelizable, which makes the model orders of magnitude more computationally efficient than RNN-based encoder-decoder models on modern hardware.
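The attention mechanism the bullets above refer to can be written in a few lines. A minimal NumPy sketch of scaled dot-product attention (single head, no masking, random toy data):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every query attends to every key at
    once, so the whole sequence is processed in parallel (no recurrence)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))  # 4 query positions, dimension 16
K = rng.standard_normal((6, 16))  # 6 key/value positions
V = rng.standard_normal((6, 16))
out, w = attention(Q, K, V)
print(out.shape)  # (4, 16): one weighted summary of V per query
```

This parallel, all-pairs structure is what makes transformers both highly parallelizable and able to use global context, the property DETR exploits over the whole image.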
DETR
The Detection Transformer (DETR) is an important new approach to object detection and panoptic segmentation. DETR completely changes the architecture compared with previous object detection systems. It is the first object detection framework to successfully integrate Transformers as a central building block in the detection pipeline.
DETR matches the performance of state-of-the-art methods, such as the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset, while also greatly simplifying and streamlining the architecture.
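DETR always emits a fixed-size set of predictions (one per object query), each a class distribution that includes an extra "no object" class plus a bounding box; keeping the real detections at inference time is a simple filter, with no NMS. A hedged NumPy sketch of that filtering step (the class count, the 5 queries, and the logit values are made up for illustration):

```python
import numpy as np

NO_OBJECT = 3  # index of the extra "no object" class (illustrative)

# Fake logits for 5 object queries over 3 real classes + "no object".
logits = np.array([
    [4.0, 0.1, 0.2, 0.1],   # confident in class 0
    [0.1, 0.2, 0.1, 5.0],   # "no object"
    [0.2, 3.5, 0.1, 0.3],   # confident in class 1
    [0.1, 0.1, 0.2, 4.5],   # "no object"
    [0.3, 0.2, 0.1, 6.0],   # "no object"
])

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(logits)
labels = probs.argmax(axis=-1)

# Keep only queries whose top class is a real object. No NMS is needed:
# the bipartite-matching loss already discourages duplicate predictions.
keep = labels != NO_OBJECT
print(np.flatnonzero(keep))  # queries 0 and 2 survive
```

In the real model each kept query also carries its predicted box coordinates, which are output in parallel by the decoder's feed-forward heads.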
BLOCK DIAGRAM
Input dataset → Pre-processing → Transformer encoder → Transformer decoder (FFN) → Output: object detection (class, bounding box) and object segmentation
DATASET

COCO 2017 dataset (123,287 images)


CONCLUSION
We presented DETR, a new design for object detection systems based on transformers and a bipartite matching loss for direct set prediction. The approach achieves results comparable to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects, likely due to the processing of global information performed by self-attention. This new design for detectors also comes with new challenges, in particular regarding training, optimization, and performance on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.
Thank you