
VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY
INTERNATIONAL UNIVERSITY
SCHOOL OF ELECTRICAL ENGINEERING

3D Object Detection from Images for


Autonomous Driving

NGUYỄN HOÀNG DUY BẢO _ EEACIU17036

A SENIOR PROJECT SUBMITTED TO THE SCHOOL OF ELECTRICAL


ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF BACHELOR OF ELECTRICAL ENGINEERING

HO CHI MINH CITY, VIET NAM


2022
3D Object Detection from Images for Autonomous Driving

BY

NGUYEN HOANG DUY BAO

Under the guidance and approval of the committee, and approved by its members, this

thesis has been accepted in partial fulfillment of the requirements for the degree.

Approved:

________________________________
Chairperson

_______________________________
Committee member

________________________________
Committee member

________________________________
Committee member

________________________________
Committee member

HONESTY DECLARATION

My name is Nguyễn Hoàng Duy Bảo. I declare that, apart from the acknowledged references, this thesis does not use language, ideas, or other original material from anyone else, and has not been previously submitted to any other educational or research program or institution. I fully understand that any writing in this thesis that contradicts the above statement will automatically lead to rejection from the EE program at the International University – Vietnam National University Ho Chi Minh City.

Date:

Student’s Signature

(Full name)

TURNITIN DECLARATION

Name of Student: Nguyễn Hoàng Duy Bảo

Date: 08/06/2022

Advisor Signature Student Signature

ACKNOWLEDGMENT

It is with deep gratitude and appreciation that I acknowledge the professional

guidance of Dr. Nguyen Ngoc Hung. His constant encouragement and support helped me

to achieve my goal.


TABLE OF CONTENTS

HONESTY DECLARATION.............................................................................................ii

TURNITIN DECLARATION............................................................................................iii

ACKNOWLEDGMENT.......................................................................................iv

TABLE OF CONTENTS....................................................................................................v

LIST OF TABLES............................................................................................................vii

LIST OF FIGURES..........................................................................................................viii

ABBREVIATIONS AND NOTATIONS..........................................................................ix

ABSTRACT........................................................................................................................x

CHAPTER I INTRODUCTION.........................................................................................1

1.1.Overview...................................................................................................................1

1.2. Objectives.................................................................................................................1

1.3. Report organization..................................................................................................2

CHAPTER II DESIGN SPECIFICATIONS AND STANDARDS...................................3

2.1. Hardware description...............................................................................................3

2.2. Software installation................................................................................................3

CHAPTER III PROJECT MANAGEMENT......................................................................4


3.1. Budget and Cost Management Plan.........................................................................4

3.2 Project Schedule........................................................................................................4

CHAPTER IV LITERATURE REVIEW............................................................................5

4.1. Autonomous Vehicles..............................................................................................5

4.1.1. Background......................................................................................................5

4.1.2. How do autonomous vehicles work?...............................................................7

4.1.3. Challenges of self-driving cars........................................................................7

4.2. 3D Object Detection from Images for Autonomous Driving concept....................8

4.2.1 The overview.....................................................................................................8

4.2.2. 3D Object Detection from Images for Autonomous Driving methods............9

4.2.2.1. Methods based on 2D features................................................................10

4.2.2.1.1. Region-based methods.....................................................................11

4.2.2.1.2. Single-shot methods.........................................................................12

4.2.2.2. Methods Based on 3D Features..............................................................12

4.2.2.2.1. Feature lifting-based methods.........................................................12

4.2.2.2.2. Data lifting-based methods.............................................................14

4.3. Dataset....................................................................................................................15

CHAPTER V METHODOLOGY.....................................................................................17

5.1 Methods...................................................................................................................17

5.2. 2D Feature..............................................................................................................17

5.2.1. Estimating the object's center point................................................................18

5.2.2. Estimate the center's deviation.......................................................................19

5.2.3. Estimate the size of the object's bounding box...............................................20

5.2.4. Architecture and Loss function......................................................................20

5.3. 3D Detection..........................................................................................................22

5.4. Dataset....................................................................................................................22

5.5. Training..................................................................................................................23

CHAPTER VI EXPECTED RESULTS............................................................................25

CHAPTER VII CONCLUSION AND FUTURE WORK................................................28

7.1. The conclusion.......................................................................................................28

7.2 The future work.......................................................................................................28

REFERENCES..................................................................................................................29

LIST OF TABLES

Table 2.1 Inside Hardware................................................................................................3

Table 6.1 KITTI evaluation............................................................................................22

LIST OF FIGURES

Figure 4.1 Levels of driving automation...........................................................................5

Figure 4.2 3D bounding box.............................................................................................8

Figure 4.3 3D Object Detection from Images for Autonomous Driving..........................9

Figure 4.4 An illustration of the feature lifting methods................................................12

Figure 4.5 Pipeline for image-based 3D-object detection using pseudo-LiDAR...........13

Figure 5.1 Illustration of the monocular 3D object detection task.................................15

Figure 5.2 Operation diagram.........................................................................................16

Figure 5.3 Create ground truth heatmap.........................................................................16

Figure 5.4 CenterNet Architecture..................................................................................18

Figure 5.5 Backbone of the model..................................................................................19

Figure 5.6 Dataset feature...............................................................................................20

Figure 6.1 The input image.............................................................................................23

Figure 6.2 The keypoint heat map..................................................................................23

Figure 6.3 2D detection...................................................................................................23

Figure 6.4 3D detection with bounding boxes.................................................................24

ABBREVIATIONS AND NOTATIONS

SAE: The Society of Automotive Engineers

AVs: Autonomous Vehicles

SC: Stereo CenterNet

3D: Three-dimensional

2D: Two-dimensional

AD: Autonomous Driving

ABSTRACT

In recent years, both industry and academia have paid more attention to 3D object detection from images, one of the fundamental and difficult challenges in autonomous driving. Image-based 3D detection has made notable advances thanks to the rapid development of deep learning algorithms. In particular, more than 200 publications covering a wide range of theories, methods, and applications explored this subject between 2015 and 2021. In some recent studies, this task has been achieved by applying a pre-trained model. In this senior project, an image-based 3D object detection method will be proposed for the thesis project.

CHAPTER I

INTRODUCTION

1.1. Overview

Object detection is a computer vision task that is now used in a wide range of consumer

applications, including surveillance and security systems, autonomous driving, mobile text

recognition, and disease diagnosis using MRI/CT scans. The objective of this senior project is

3D Object Detection from Images for Autonomous Driving (AD).

By lowering travel time, energy use, and pollution, AD has the potential to greatly improve people's lives. As a result, both research and industry have made significant efforts in the last decade to develop self-driving vehicles. One of the most important aspects of autonomous driving, and a key technology for the development of autonomous vehicles, is 3D object detection. Deep learning-based 3D detection approaches have recently gained popularity.

Object detection is one of the main components of AD. Autonomous cars rely on perception of their environment to enable safe and efficient driving. This vision system employs object detection algorithms to precisely identify nearby objects, such as pedestrians, other cars, traffic signs, and fences. Deep learning-based object detectors are required to identify and localize these objects in real time. This report discusses the current state of object detectors and the problems that stand in the way of integrating them into AD.

1.2. Objectives

The main goal of this senior project is to study 3D Object Detection from Images for AD with deep learning and computer vision, then to suggest suitable models that can be used in real applications, and finally to train and test a model and run a demo to verify that it works properly.

1.3. Report organization

The organization of this thesis report is described below:

 Chapter I: An explanation of this senior project's goals and motivation is provided.

 Chapter II: The project's design specifications and standards will be presented.

 Chapter III: The senior project's management, planning, project timelines, and resource

planning will be presented.

 Chapter IV: A review of common methods will be provided.

 Chapter V: This project's recommended methodologies will be outlined.

 Chapter VI: This chapter will present the project's expected results.

 Chapter VII: This chapter will present the conclusion and future work of the project.

CHAPTER II

DESIGN SPECIFICATIONS AND STANDARDS

2.1. Hardware description

Table 2.1. Inside Hardware.

Edition            Ubuntu 16.04
Processor          Intel(R) Core(TM) i5-10400F CPU @ 2.90 GHz (up to 4.3 GHz)
Installed RAM      16.0 GB
System type        64-bit operating system, x64-based processor

2.2. Software installation


All of the experiments in this study were carried out on the personal computer of student Duy Bao. The PC runs the most recent version of Anaconda, and all of the experiments are run on Jupyter Notebook, which is provided by Anaconda. Some additional libraries must also be installed in order to make this study work:

 OpenCV
 Cython
 Numba
 Matplotlib
 SciPy

CHAPTER III

PROJECT MANAGEMENT

3.1. Budget and Cost Management Plan

The total cost of this study is about 15 million VND, corresponding to the personal computer used for the experiments.

3.2 Project Schedule

The project has 3 parts:

 The first part: finding and understanding all the methods (7 weeks)

 The second part: building, fixing, and compiling the model (4 weeks)

 The last part: obtaining results from the model and writing the report (4 weeks).

CHAPTER IV

LITERATURE REVIEW

4.1. Autonomous Vehicles

Autonomous vehicles (AVs) are automobiles that employ technology to replace a human

driver partially or completely in navigating a vehicle from point A to point B while avoiding

road dangers and adjusting to traffic conditions. The Society of Automotive Engineers (SAE) has

developed a six-level rating system based on the degree of human involvement. The National

Highway Traffic Safety Administration (NHTSA) in the United States employs this

categorization system.

Figure 4.1: Levels of driving automation

4.1.1. Background

General Motors presented the concept of AD in 1939. Since then, the self-driving

automobile has undergone a full transition and is now an autonomous vehicle. To drive without

human involvement, AVs employ a mix of sensors, artificial intelligence, radars, and cameras.

This sort of vehicle is still in the development stage since numerous components must be

considered to ensure the safety of its passengers.

The AVs' priority is to recognize objects. On the road, there are both moving and stationary objects such as people, other vehicles, traffic signals, and more. To avoid incidents while moving, the car must recognize numerous objects. AVs collect data and create 3D maps using sensors and cameras. This aids in identifying and detecting objects on the road while driving and ensures the safety of the passengers.

Deep neural network-based 3D object identification systems are becoming a key

component of self-driving cars. 3D object recognition helps to understand the geometry of

physical objects in 3D space, which is necessary for predicting future object movements. While

image-based 2D object recognition and instance segmentation have made significant advances,

3D object detection has received less attention in the literature.

Detecting three-dimensional (3D) objects is an essential but difficult job in many

domains, including AD and robotics [1,2]. Several contemporary mainstream 3D detectors rely

on light detection and ranging (LiDAR) sensors to collect accurate 3D data, and the application

of LiDAR data is viewed as vital to the success of 3D detectors. Stereo cameras, on the other

hand, which function similarly to human binocular vision, are less costly and offer higher

resolutions. Hence, they have received significant interest in academia and business.

4.1.2. How do autonomous vehicles work?

AVs employ sensors, actuators, machine learning systems, complicated algorithms, and

powerful processors to execute software.

Autonomous vehicles construct and maintain a map of their environment using a variety

of sensors placed throughout the vehicle. Radar sensors monitor the activities of surrounding

cars. Video cameras detect traffic signals while simultaneously reading road signs, tracking other

cars, and looking for people. By bouncing light pulses off the car's surroundings, LiDAR sensors

estimate distances, detect road boundaries, and recognize lane markers.

Following the analysis of all this sensory data, advanced software designs a course and

delivers commands to the car's actuators, which regulate acceleration, braking, and steering.

Thanks to hard-coded rules, obstacle avoidance algorithms, predictive modeling, and object recognition, the software obeys traffic laws and navigates around obstacles.

4.1.3. Challenges of self-driving cars

Recently, 3D detection based on images has evolved significantly; nonetheless, most modern approaches solve this problem using anchor-based 2D detection or depth estimation. However, the significant computational cost of these technologies prevents them from attaining real-time performance. In this report, I would like to introduce a solution to the vision problem for AD: 3D Object Detection from Images for AD.

4.2. 3D Object Detection from Images for Autonomous Driving concept

AD has the potential to significantly improve people's lives by improving mobility while

reducing travel time, pollution, and energy consumption. Unsurprisingly, both academia and industry have made major efforts in the recent decade to build self-driving automobiles. As one of the primary enabling technologies for autonomous driving, 3D object detection has received

a lot of attention. Deep learning-based 3D object recognition systems have gained popularity

recently.

Existing 3D object recognition systems may be classified into two categories based on

whether the incoming data is images or LiDAR signals. Approaches that estimate 3D bounding

boxes from photos alone confront a far bigger difficulty than LiDAR-based algorithms, since it is

difficult to extract 3D information from 2D input data. Despite this inherent complexity, image-

based 3D object recognition algorithms have evolved in the field of computer vision during the

previous six years. More than 80 articles in top-tier conferences and journals in this sector have

resulted in significant gains in detection accuracy and inference speed.

4.2.1 The overview

The goal of image-based 3D object detection is to identify and localize objects of interest given RGB images and camera parameters. Each object is represented by its category and a 3D bounding box. The 3D bounding box is parameterized by:

 Dimension [h,w,l]

 Location [x,y,z]

 Orientation [θ, φ, ψ]

In most autonomous driving scenarios, just the heading angle θ around the up-axis (yaw

angle) is considered.
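To make this parameterization concrete, the snippet below defines a minimal container for one detection. The class and its field names are purely illustrative and not part of any particular library.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Minimal illustrative container for one image-based 3D detection."""
    category: str   # object class, e.g. "Car"
    h: float        # height in meters
    w: float        # width in meters
    l: float        # length in meters
    x: float        # location of the box center in camera coordinates (m)
    y: float
    z: float
    yaw: float      # heading angle theta around the up-axis (rad)

# Example: a car roughly 15 m in front of the camera (made-up values)
box = Box3D(category="Car", h=1.5, w=1.6, l=3.9, x=1.0, y=1.6, z=15.0, yaw=0.1)
print(box)
```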

Figure 4.2: 3D bounding box

4.2.2. 3D Object Detection from Images for AD methods

CAD models, LiDAR signals, external data, temporal sequences, and stereo images are examples of auxiliary data used by image-based 3D detectors. Stereo images provide the most valuable information for image-based 3D detection, and algorithms that use this sort of data outperform the others at the inference stage.

Up to now, 3D Detection from Images for AD methods are categorized as:

+ Methods based on 2D features: Using the input RGB images, these methods begin by estimating 2D positions, orientations, and dimensions and then recover the 3D locations from the 2D features. As a result, these approaches are often known as 'result lifting-based methods'.

+ Methods based on 3D features: The primary benefit of these algorithms is that they first construct 3D features from the images before estimating all components of the 3D bounding boxes, including the 3D locations, in 3D space. They are divided into two categories: feature lifting methods and data lifting methods. Figure 4.3 below illustrates this categorization.


Figure 4.3: Taxonomy of 3D Object Detection from Images for AD. Methods based on 3D features are divided into feature lifting and data lifting approaches; methods based on 2D features are also known as result lifting approaches.

4.2.2.1. Methods based on 2D features.

This approach calculates the orientations, 2D locations, and dimensions of the objects in an input picture (see Figure 4.2 for a visual representation of these quantities) and then recovers the 3D positions from these estimates. As a result, these approaches are often known as 'result lifting-based methods'. An intuitive and widely used technique for obtaining an object's 3D position [x, y, z] is to use CNNs to estimate the depth value d, and then lift the 2D detection into 3D space using Eq. (4.1) (a small sketch in code is given after the equation):

z = d,
x = (u − C_x) × z / f,        (4.1)
y = (v − C_y) × z / f,

with:

 (C_x, C_y) is the principal point,

 (u, v) is the 2D location of the object,

 f is the focal length.
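As a minimal sketch of Eq. (4.1), assuming the focal length and principal point are known from the camera calibration (the numeric values below are placeholders, not real calibration data):

```python
import numpy as np

def backproject_center(u, v, d, f, cx, cy):
    """Lift a 2D object center (u, v) with estimated depth d into 3D camera
    coordinates using the pinhole model of Eq. (4.1)."""
    z = d
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.array([x, y, z])

# Placeholder intrinsics and a detected center estimated to be 20 m away
print(backproject_center(u=650.0, v=190.0, d=20.0, f=720.0, cx=620.0, cy=187.0))
```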

Furthermore, unlike methods that need full depth maps, such as Pseudo-LiDAR [3], these approaches only require the depths of the objects' centers. In addition, while these approaches are broadly equivalent to 2D detectors, I divide them into two sub-categories for ease of presentation: single-shot methods and region-based methods.

4.2.2.1.1. Region-based methods

The region-based techniques follow the high-level concept of the R-CNN series in 2D object detection [4], [5], [6]. After generating category-independent region proposals from the input pictures, CNNs [6], [7] extract features from these regions. Finally, R-CNN uses these features to refine the proposals and assign them to categories. The novel region-based framework designs for image-based 3D detection are summarized here.

Proposal generation: Unlike the commonly used 2D detection proposal generation methods [11], [12], a simple way to generate proposals for 3D detection is to tile 3D anchors (proposal shape templates) on the ground plane and then project them to the image plane as proposals.

These architectures, however, generally result in substantial processing overhead. Chen et al. [8], [9], [10] pioneered the Mono3D and 3DOP techniques for monocular and stereo-based algorithms, respectively, which reduce the search space by rejecting proposals with low confidence using domain-specific priors (e.g., shape, height, position distribution, etc.).

The Region Proposal Network (RPN) [6] enables detectors to construct 2D proposals using features from the final shared convolutional layer rather than external techniques, removing the majority of the computational cost and allowing it to be adopted by multiple image-based 3D detectors.

4.2.2.1.2. Single-shot methods

The single-shot object detectors estimate class probabilities and regress the additional 3D box parameters directly from each feature point. As a result, these systems can make predictions more quickly than region-based techniques, which is critical in the context of AD.

Because single-shot approaches employ only CNN layers, they are also easier to deploy on a variety of hardware architectures. Furthermore, several relevant publications [13], [14], [15], [26] have shown that single-shot detectors can also yield promising results.

4.2.2.2. Methods Based on 3D Features

The fundamental advantage of these approaches is that they first build 3D features from images and then directly estimate all elements of the 3D bounding boxes, including the 3D positions, in 3D space. To obtain 3D features, there are two categories: 'feature lifting-based methods' and 'data lifting-based methods'.

4.2.2.2.1. Feature lifting-based methods

The feature lifting approaches work by converting 2D image features in the image coordinate system into 3D voxel features in the world coordinate system. Furthermore, previous feature lifting-based algorithms [17], [18], [22], [19], [20] further collapse the 3D voxel features along the vertical direction to yield BEV features before estimating the final results.

The main issue is figuring out how to convert the 2D image features into 3D voxel features. This issue is discussed further below.

Feature lifting for monocular methods: To achieve feature lifting, Roddick et al. [17] presented OFTNet, a retrieval-based detection model. They obtain the voxel feature by aggregating the 2D features over the region of the front-view image feature map corresponding to the projection of each voxel, bounded by its top-left (u1, v1) and bottom-right (u2, v2) corners:

V(x, y, z) = (1 / ((u_2 − u_1)(v_2 − v_1))) · Σ_{u=u_1}^{u_2} Σ_{v=v_1}^{v_2} F(u, v)        (4.2)

where V(x, y, z) denotes the feature of voxel (x, y, z) and F(u, v) denotes the image feature at pixel (u, v).
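The sketch below mirrors Eq. (4.2) for a single voxel: it averages the image feature map over the 2D box (u1, v1)–(u2, v2) that the voxel projects to. The projection step itself is omitted and the array shapes are assumptions for illustration.

```python
import numpy as np

def voxel_feature(F, u1, v1, u2, v2):
    """Average the image features F (H x W x C) over the projected voxel
    region, as in Eq. (4.2)."""
    region = F[v1:v2, u1:u2, :]          # rows index v, columns index u
    return region.mean(axis=(0, 1))      # 1/((u2-u1)(v2-v1)) * sum of F(u, v)

F = np.random.rand(96, 320, 64)          # assumed feature map: H=96, W=320, C=64
V_xyz = voxel_feature(F, u1=100, v1=40, u2=120, v2=60)
print(V_xyz.shape)                       # (64,)
```

In the original OFTNet paper, these averages are computed efficiently with integral images rather than one region at a time.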

Reading et al. [22] accomplish feature lifting by back-projection [21]. To begin, they discretize the continuous depth space into bins and treat depth estimation as a classification task. Instead of a single number, the depth estimation result is a distribution D over these bins. Then, to construct the 3D frustum feature G(u, v), the pixel feature F(u, v) is weighted by its corresponding depth bin probabilities in D(u, v):

G(u, v) = D(u, v) ⊗ F(u, v)        (4.3)

where ⊗ denotes the outer product. Because this geometrical feature lives in the image-depth coordinate system (u, v, d), it must be aligned to the 3D world coordinate system (x, y, z) using the camera parameters before the voxel feature is created. These two techniques are depicted in Figure 4.4.
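Eq. (4.3) can be read as a per-pixel outer product between the depth-bin distribution and the feature vector. The sketch below performs it for a whole feature map; the shapes are assumptions for illustration.

```python
import numpy as np

# Assumed shapes: H x W spatial grid, D depth bins, C feature channels
H, W, D, C = 96, 320, 80, 64
depth_dist = np.random.rand(H, W, D)
depth_dist /= depth_dist.sum(axis=-1, keepdims=True)   # per-pixel probability over bins
features = np.random.rand(H, W, C)

# G(u, v) = D(u, v) (outer product) F(u, v): one C-dim feature per depth bin,
# weighted by that bin's probability
frustum = np.einsum('hwd,hwc->hwdc', depth_dist, features)
print(frustum.shape)   # (96, 320, 80, 64)
```

The resulting frustum grid is then resampled into a regular 3D voxel grid using the camera calibration, which is the alignment step mentioned above.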

Figure 4.4. A diagram of the feature lifting approaches. Left: 3D features are obtained by aggregating image features over the corresponding regions. Right: image features are weighted by their depth distribution to lift 2D features into 3D space. From [17] and [22].

Feature lifting for stereo methods: Because of well-developed stereo matching algorithms, creating 3D features from stereo pairs is easier than creating them from monocular images.

Chen et al. [18] presented the Deep Stereo Geometry Network (DSGN), which lifts features using stereo pictures as input. They begin by extracting features from the stereo pairs and then construct a 4D plane-sweep volume using the conventional plane sweeping technique [23], [24], [25] by joining the left image feature with the reprojected right image feature at equally spaced depth values. The 4D volume is then transformed into 3D world space before being used to generate the BEV map, which is used to predict the final results.

4.2.2.2.2. Data lifting-based methods

The 2D images are transformed into 3D data using data lifting algorithms. The obtained

data is then used to extract 3D characteristics. The pseudo-LiDAR pipeline is an excellent

illustration of this strategy.

The pseudo-LiDAR pipeline was presented to build a bridge between image-based approaches and LiDAR-based methods, using well-studied depth estimation, disparity estimation, and LiDAR-based 3D object detection. In this pipeline, we must first estimate dense depth maps [27], [28] from the pictures (or disparity maps [26], [24], which are then turned into depth maps [34]). The 3D position (x, y, z) of the pixel (u, v) may then be calculated using

z = d,
x = (u − C_x) × z / f,        (4.4)
y = (v − C_y) × z / f,

The pseudo-LiDAR signal {(x_n, y_n, z_n)}_{n=1}^{N} is created by back-projecting all pixels into 3D coordinates, where N is the number of pixels. The pseudo-LiDAR signal may then be used as input for LiDAR-based detection algorithms [31], [32], [33].
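A minimal sketch of this back-projection step, assuming a dense depth map and pinhole camera intrinsics are available (the shapes and values are placeholders):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, f, cx, cy):
    """Back-project every pixel of a dense depth map (H x W) into a 3D point,
    producing an N x 3 pseudo-LiDAR point cloud via Eq. (4.4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # per-pixel coordinates
    z = depth
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((384, 1280), 20.0)                   # placeholder depth map
cloud = depth_to_pseudo_lidar(depth, f=720.0, cx=640.0, cy=192.0)
print(cloud.shape)                                   # (491520, 3)
```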

Figure 4.5. Image-based 3D detection pipeline based on pseudo-LiDAR

4.3. Dataset

The most widely utilized are nuScenes [36], KITTI 3D [35], and Waymo Open [37], which have considerably boosted the development of 3D detection.

For the last decade, KITTI 3D was the sole dataset available to aid in the development of

3D detectors based on images. Front-view photos at a resolution of 1280×384 pixels are provided

by KITTI 3D. The nuScenes and Waymo Open datasets were introduced in 2019. Six cameras

are utilized in the nuScenes dataset to provide a 360 degree image with a resolution of 1600x900

pixels. Similarly, Waymo Open captures 360 degrees of vision with five synchronized cameras,

and the image resolution is 1920x1280 pixels.

The KITTI 3D dataset contains 7,481 images for training and 7,518 images for testing,

and it is standard practice to divide the training data into a training set and a validation set [15],

[1], [16]. 3DOP's split [1], the most often used, has 3,713 and 3,770 pictures for training and

validation, respectively. The large-scale nuScenes and Waymo Open give around 40K and 200K

annotated frames, respectively, and employ numerous cameras to record each frame's panoramic

perspective. NuScenes, in particular, offers 28,140 frames, 6,020 frames, and 6,009 frames for training, validation, and testing, respectively (six images per frame). The largest, Waymo Open, provides

122,300 frames for training, 31,407 frames for validation, and 41,077 frames for testing (five

images per frame).

CHAPTER V

METHODOLOGY

The proposed method of 3D Object Detection from Images for AD is given in this

chapter.

5.1 Methods

The method used in this project is based on 2D features; it estimates the 2D locations, orientations, and dimensions (see Figure 5.1 for a visual representation of these quantities) from the 2D features and then recovers the 3D locations from these estimates.

Figure 5.1. Illustration of the monocular 3D object detection task.

CenterNet, an anchor-free single-shot detector, was introduced and expanded to image-

based 3D detection by Zhou et al. [38] in 2019. This framework, in particular, encodes the object

as a single point (the center point of the object) and finds it via key-point estimation.
Furthermore, numerous parallel heads are utilized to estimate the object's other attributes, such as

depth, size, position, and orientation.

5.2. 2D Feature

This model addresses three major tasks:

 Estimating the object's center point (based on heatmaps – heatmap head).

 Estimating the center's deviation (offset head).

 Estimating the size of the object's bounding box (dimension head).

5.2.1. Estimating the object's center point

Figure 5.2. Operation diagram

CenterNet borrows ideas from the keypoint estimation problem: it treats each object's center point as a keypoint and predicts this center point using a heatmap.

Each pixel in this heatmap indicates the probability that the cell contains an object's center. CenterNet then extracts the local maxima of the heatmap to locate the centers of the objects in the picture. To create the heatmap, the model optimizes the network parameters to produce a heatmap output.

This architecture includes a backbone for feature extraction and a head (in this case, a heatmap head) for transforming the features into heatmaps.

Given an input image I ∈ R^{W×H×3}, the output is a heatmap Ŷ ∈ [0, 1]^{(W/R)×(H/R)×C}, where R is the output stride and C is the number of classes.

Figure 5.3. Create ground truth heatmap

The ground-truth heatmap is created as follows (a small code sketch is given after these steps):

 First, initialize the heatmap Y ∈ [0]^{(W/R)×(H/R)×C} with zeros.

 From the bounding box of each object of class c, determine the keypoint as the object's center p_c ∈ R^2.

 Determine the corresponding point of p_c on the heatmap: p̃_c = ⌊p_c / R⌋ (integer part).

 Set Y_xyc = exp(−((x − p̃_x)^2 + (y − p̃_y)^2) / (2σ_p^2)), where σ_p is an object-size-adaptive standard deviation.
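Below is a small sketch of this construction for a single object. It assumes the output stride R and uses a fixed Gaussian standard deviation for simplicity, whereas CenterNet derives σ_p from the object size.

```python
import numpy as np

def draw_center(heatmap, center, cls_id, R=4, sigma=2.0):
    """Splat one object's center onto the ground-truth heatmap (H/R x W/R x C)
    with an unnormalized Gaussian, keeping the maximum where objects overlap."""
    cx, cy = int(center[0] / R), int(center[1] / R)      # integer part of p / R
    h, w = heatmap.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap[:, :, cls_id] = np.maximum(heatmap[:, :, cls_id], g)
    return heatmap

Y = np.zeros((96, 320, 3))                               # e.g. 3 classes at stride R=4
Y = draw_center(Y, center=(600.0, 180.0), cls_id=0)      # a "Car" centered at pixel (600, 180)
```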

The only remaining step is to define the loss function so that the model can learn to produce a heatmap that fits the ground-truth heatmap.

5.2.2. Estimate the center's deviation

Because only the integer part of p̃_c = p_c / R is kept, an offset is created between the actual center point and the center point on the heatmap.

To learn this offset, CenterNet uses one more head, the offset head, whose output is a tensor representing the horizontal and vertical offsets: Ô ∈ R^{(W/R)×(H/R)×2}.

5.2.3. Estimate the size of the object's bounding box

In parallel with estimating the position of the center, the model estimates the size of the object, namely its width and height.

Similar to the offset estimation, CenterNet uses another head, the dimension head, to estimate the width and height of the object. The output of this head is a tensor Ŝ ∈ R^{(W/R)×(H/R)×2}.

5.2.4. Architecture and Loss function

As mentioned previously, the model uses three head networks (three small CNNs) to learn three separate tasks: the heatmap head, the offset head, and the dimension head. Figure 5.4 below illustrates the architecture.

Figure 5.4. CenterNet Architecture

The overall loss function is a weighted sum of the three sub-losses corresponding to the three heads, with λ_size = 0.1 and λ_off = 1 (a code sketch of these losses follows the definitions):

L_det = L_k + λ_size · L_size + λ_off · L_off        (5.1)

with:

 Heatmap loss (L_k) for the heatmap head: a focal loss is used (due to the imbalance between keypoint pixels and background pixels). Here, α and β are hyperparameters of the focal loss (the authors use α = 2, β = 4) and N is the number of keypoints in the image:

L_k = −(1/N) · Σ_{xyc} { (1 − Ŷ_xyc)^α · log(Ŷ_xyc),                     if Y_xyc = 1
                         (1 − Y_xyc)^β · (Ŷ_xyc)^α · log(1 − Ŷ_xyc),     otherwise        (5.2)

 Offset loss (L_off) corresponding to the offset head: an L1 loss,

L_off = (1/N) · Σ_p | Ô_p̃ − (p/R − p̃) |        (5.3)

 Size loss (L_size) corresponding to the dimension head: an L1 loss,

L_size = (1/N) · Σ_{k=1}^{N} | Ŝ_pk − s_k |        (5.4)
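A compact sketch of the three losses of Eqs. (5.2)–(5.4) on dense prediction maps is given below. The keypoint indexing is simplified to boolean masking on the ground-truth heatmap, and the inputs are plain NumPy arrays rather than real network outputs.

```python
import numpy as np

def focal_loss(Y_hat, Y, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss of Eq. (5.2) over the heatmap."""
    pos = (Y == 1)
    N = max(pos.sum(), 1)
    pos_loss = ((1 - Y_hat[pos]) ** alpha * np.log(Y_hat[pos] + eps)).sum()
    neg_loss = ((1 - Y[~pos]) ** beta * Y_hat[~pos] ** alpha
                * np.log(1 - Y_hat[~pos] + eps)).sum()
    return -(pos_loss + neg_loss) / N

def l1_loss(pred, target, mask):
    """L1 loss of Eqs. (5.3)/(5.4), averaged over the N annotated centers."""
    N = max(mask.sum(), 1)
    return np.abs(pred[mask] - target[mask]).sum() / N

# Tiny demo with random predictions and one ground-truth center
Y = np.zeros((96, 320, 3)); Y[45, 150, 0] = 1.0
Y_hat = np.clip(np.random.rand(96, 320, 3), 1e-3, 1 - 1e-3)
print(focal_loss(Y_hat, Y))
# Total loss, Eq. (5.1): L_det = L_k + 0.1 * L_size + 1.0 * L_off
```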

Regarding the backbone, CenterNet evaluates four backbones to trade off accuracy and speed: ResNet-18, ResNet-101, DLA-34, and Hourglass-104.

Figure 5.5. Backbone of the model

5.3. 3D Detection

3D detection requires three additional attributes per center point: depth, 3D dimensions, and orientation.

The depth d is a single scalar per center point. However, regressing depth directly is difficult. Instead, I employ the output transformation of Eigen et al. [39] and compute d = 1/σ(d̂) − 1, where σ is the sigmoid function and d̂ is the raw network output. The depth is computed as an additional output channel D̂ ∈ R^{(W/R)×(H/R)} of the keypoint estimator, again using two convolutional layers separated by a ReLU. Unlike the previous output modalities, the inverse sigmoidal transformation is applied at the output layer, and the depth estimator is trained with an L1 loss in the original depth domain after the sigmoidal transformation.

The 3D dimensions of an object are three scalars. The model regresses directly to their absolute values in meters using a separate head F̂ ∈ R^{(W/R)×(H/R)×3} and an L1 loss.

By default, the orientation is a single scalar. However, it can be difficult to regress to directly. Following the lead of Mousavian et al. [7], I express the orientation with two bins and in-bin regression. The orientation is represented by 8 scalars, with 4 scalars for each bin. For each bin, two scalars are used for softmax classification, and the remaining two scalars regress to an angle within the bin.
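To illustrate how these extra heads are decoded at one center point, the sketch below applies the inverse-sigmoid depth transform and a simplified two-bin orientation decode. The bin centers and the logit layout are assumptions for illustration and may differ from the exact CenterNet implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_depth(d_hat):
    """Recover metric depth from the raw output: d = 1 / sigmoid(d_hat) - 1."""
    return 1.0 / sigmoid(d_hat) - 1.0

def decode_orientation(bin_logits, bin_sincos, bin_centers=(-np.pi / 2, np.pi / 2)):
    """Simplified two-bin decode: pick the bin whose 'inside' logit wins its softmax,
    then add the in-bin angle atan2(sin, cos) to that bin's (assumed) center."""
    inside = [np.exp(l[1]) / np.exp(l).sum() for l in bin_logits]   # per-bin softmax
    b = int(np.argmax(inside))
    s, c = bin_sincos[b]
    return bin_centers[b] + np.arctan2(s, c)

print(decode_depth(0.2))                                  # raw output 0.2 -> ~0.82 m
print(decode_orientation(np.array([[0.1, 2.0], [1.0, 0.3]]),
                         np.array([[0.0, 1.0], [1.0, 0.0]])))
```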

5.4. Dataset

KITTI is a collection of vision benchmarks captured with an autonomous driving platform. The whole benchmark includes several tasks such as stereo, optical flow, visual odometry, and so on. The object detection dataset, comprising monocular pictures and bounding boxes, is included in this benchmark. The dataset includes 7,481 training images that have been labeled with 3D bounding boxes. The readme of the object development kit on the KITTI website gives a detailed overview of the annotations.

Figure 5.6. Dataset feature

5.5. Training

I train with a 512×512 input resolution. This results in an output resolution of 128×128 for all models. As data augmentation, we utilize random flipping, random scaling (between 0.6 and 1.3), cropping, and color jittering, and we use Adam [40] to optimize the overall objective. We do not apply augmentation to the 3D estimation branch since cropping or scaling alters the 3D measurements.

I train the residual networks and DLA-34 using a batch size of 128 (on 8 GPUs) and a learning rate of 5e-4 for 140 epochs, with the learning rate reduced by a factor of 10 at epochs 90 and 120, respectively (following [55]). For Hourglass-104, we utilize a batch size of 29 (on 5 GPUs, with a master GPU batch size of 4) and a learning rate of 2.5e-4 for 50 epochs, with the learning rate reduced by a factor of 10 at the 40th epoch. To save computation, we fine-tune the Hourglass-104 from ExtremeNet [61] for detection. The down-sampling layers of ResNet-101 and DLA-34 are initialized with ImageNet pretraining, whereas the up-sampling layers are randomly initialized. On 8 TITAN-V GPUs, ResNet-101 and DLA-34 train in 2.5 days, whereas Hourglass-104 takes 5 days.
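As a rough sketch of this schedule, assuming a PyTorch training loop (the model and the training pass are placeholders; only the optimizer and the learning-rate drops correspond to the settings described above):

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)                     # placeholder for the real detector
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Drop the learning rate by 10x at epochs 90 and 120, over 140 epochs in total
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)

for epoch in range(140):
    # ... one pass over the training set, computing L_det and calling optimizer.step() ...
    scheduler.step()
```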

CHAPTER VI

EXPECTED RESULTS

On the KITTI dataset [35], which comprises carefully annotated 3D bounding boxes for automobiles in driving scenarios, we perform 3D bounding box estimation experiments. KITTI contains 7,481 training images, and we adhere to the conventional training and validation splits described in the literature [10, 54]. The average precision for cars at 11 recall points (0.0 to 1.0 with a 0.1 increment) at an IoU threshold of 0.5, as in 2D object detection [14], is used as the assessment metric. We evaluate the 2D bounding box AP, the average orientation score (AOS), and the bird's-eye-view bounding box AP (BEV AP). For both training and testing, we preserve the original picture resolution and pad to 1280×384. The training completes in 70 epochs, with the learning rate decreasing at epochs 45 and 60, respectively.
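For reference, the 11-point interpolated AP used here can be sketched as follows, given precision/recall values that have already been computed at the 0.5 IoU threshold (the example arrays are made up):

```python
import numpy as np

def ap_11_point(recalls, precisions):
    """11-point interpolated average precision: mean of the maximum precision
    at recall >= r for r in {0.0, 0.1, ..., 1.0}."""
    ap = 0.0
    for r in np.arange(0.0, 1.1, 0.1):
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

# Made-up precision/recall curve for illustration
recalls = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
precisions = np.array([1.0, 0.9, 0.8, 0.6, 0.4])
print(ap_11_point(recalls, precisions))
```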

Because the number of recall thresholds is small, the validation AP varies by up to 10% AP. As a result, we train five models and report the average with standard deviation. Based on their specific validation splits, we compare against the Slow-RCNN-based Deep3DBox [38] and the Faster-RCNN-based Mono3D [9]. As demonstrated in Table 6.1, our technique performs similarly to its competitors in AP and AOS and somewhat better in BEV. CenterNet also runs orders of magnitude faster than both approaches.

Table 6.1: KITTI evaluation.

Table 6.1 shows the 2D bounding box AP, average orientation score (AOS), and bird's-eye-view (BEV) AP on different validation splits. Higher is better.

Figure 6.1: The input image

The results after applying the model:

Figure 6.2. The keypoint heat map

Figure 6.3. 2D detection

Figure 6.4: 3D detection with bounding boxes

CHAPTER VII

CONCLUSION AND FUTURE WORK

7.1. The conclusion


The research process in this senior project mainly focuses on the 3D Object Detection from Images for Autonomous Driving task and clarifies its difficulties. Some advanced 3D Object Detection from Images for Autonomous Driving methods and their performance have also been discussed in this report. Then, the suitable dataset for the 3D Object Detection from Images for Autonomous Driving task is stated.

This senior project has successfully categorized 3D Object Detection from Images for Autonomous Driving methods. Finally, a 3D Object Detection from Images for Autonomous Driving architecture based on a working system has also been proposed in terms of theoretical architecture.

7.2 The future work


In the future, several steps are required to obtain a practical 3D Object Detection from Images for Autonomous Driving system:

 Collecting more datasets for training and testing.

 Continuously testing the performance of CenterNet with various modifications and datasets to obtain the best accuracy or speed.

 Creating a pre-trained network to deploy in real-life situations with Python and OpenCV.

REFERENCES

[1] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection in monocular video,” in
ECCV, 2020.
[2] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the
world’s imagery,” in CVPR, 2016.
[3] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection
and semantic segmentation,” in CVPR, 2014.
[5] R. Girshick, “Fast r-cnn,” in ICCV, 2015.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region
proposal networks,” in NeurIPS, 2015.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
[8] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun,“Monocular 3d object detection for
autonomous driving,” in CVPR, 2016.
[9] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals for
accurate object class detection,” in NeurIPS, 2015.
[10] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals using stereo imagery
for accurate object class detection,” T-PAMI, 2017.
[11] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object
recognition,” IJCV, 2013.
[12] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
[13] J. Feng, E. Ding, and S. Wen, “Monocular 3d object detection via feature domain adaptation,” in ECCV,
2020.
[14] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[15] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in CVPR,
2019.
[16] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval, 1999.
[17] T. Roddick, A. Kendall, and R. Cipolla, “Orthographic feature transform for monocular 3d object
detection,” 2019.
[18] Y. Chen, S. Liu, X. Shen, and J. Jia, “Dsgn: Deep stereo geometry network for 3d object detection,” in CVPR, 2020.
[19] X. Guo, S. Shi, X. Wang, and H. Li, “Liga-stereo: Learning lidar geometry aware representations for
stereo-based 3d detector,” in ICCV, 2021.
[20] J. Huang, G. Huang, Z. Zhu, and D. Du, “Bevdet: High-performance multi-camera 3d object detection in
bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
[21] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly
unprojecting to 3d,” in ECCV, 2020.
[22] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for
monocular 3d object detection,” in CVPR, 2021.
[23] R. T. Collins, “A space-sweep approach to true multi-image matching,” in CVPR, 1996.
[24] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the
world’s imagery,” in CVPR, 2016.
[25] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,”
in ECCV, 2018.
[26] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train
convolutional networks for disparity, optical flow, and scene flow estimation,” in CVPR, 2016.
[27] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right
consistency,” in CVPR, 2017.
[28] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for
monocular depth estimation,” in CVPR, 2018.
[29] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2d detection to 6d pose and metric
shape,” in CVPR, 2019.
[30] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.

[31] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3d proposal generation and object
detection from view aggregation,” in IROS, 2018.
[32] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in CVPR,
2018.
[33] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in
CVPR, 2019.
[34] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019.
[35] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark
suite,” in CVPR, 2012.
[36] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O.
Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020.
[37] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine
et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in ICCV, 2020.
[38] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[39] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep
network. In NIPS, 2014.
[40] P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.
