HO CHI MINH CITY
INTERNATIONAL UNIVERSITY
SCHOOL OF ELECTRICAL ENGINEERING
BY
Under the guidance and approval of the committee, and approved by its members, this
thesis has been accepted in partial fulfillment of the requirements for the degree.
Approved:
________________________________
Chairperson
_______________________________
Committee member
________________________________
Committee member
________________________________
Committee member
________________________________
Committee member
HONESTY DECLARATION
My name is Nguyen Hoàng Duy Bao. I would like to declare that, apart from the
acknowledged references, this thesis does not use language, ideas, or other original
material from anyone else, and has not been previously submitted to any other educational or
research program or institution. I fully understand that any writing in this thesis that
contradicts the above statement will automatically lead to rejection from the EE program.
Date:
Student’s Signature
(Full name)
TURNITIN DECLARATION
Date: 08/06/2022
ACKNOWLEDGMENT
I would like to express my sincere gratitude for the guidance of Dr. Nguyen Ngoc Hung. His constant encouragement and support helped me
to achieve my goal.
TABLE OF CONTENTS
HONESTY DECLARATION.............................................................................................ii
TURNITIN DECLARATION............................................................................................iii
ACKNOWLEDGMENT....................................................................................iv
TABLE OF CONTENTS....................................................................................................v
LIST OF TABLES............................................................................................................vii
LIST OF FIGURES..........................................................................................................viii
ABSTRACT........................................................................................................................x
CHAPTER I INTRODUCTION.........................................................................................1
1.1. Overview..................................................................................................1
1.2. Objectives.................................................................................................................1
4.1.1. Background......................................................................................................5
4.3. Dataset....................................................................................................................15
CHAPTER V METHODOLOGY.....................................................................................17
5.1 Methods...................................................................................................................17
5.2. 2D Feature..............................................................................................................17
5.2.1. Estimating the object's center of gravity........................................................18
5.3. 3D Detection..........................................................................................................22
5.4. Dataset....................................................................................................................22
5.5. Training..................................................................................................................23
REFERENCES..................................................................................................................29
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS AND NOTATIONS
3D: Three-dimensional
2D: Two-dimensional
ABSTRACT
In recent years, both industry and academia have paid more attention to 3D
object detection from images, one of the fundamental and difficult challenges in
computer vision, owing to the rapid development of deep learning algorithms. In particular, more than 200
publications covering a wide range of theories, methods, and applications explored this
subject between 2015 and 2021. In some recent studies, this task can be achieved by
applying a pre-trained model. In this senior project, a 3D object detection from images approach for autonomous driving, based on CenterNet, is studied and evaluated.
CHAPTER I
INTRODUCTION
1.1 Overview
Object detection is a computer vision task that is now used in a wide range of consumer
applications, including surveillance and security systems, autonomous driving, mobile text
recognition, and disease diagnosis using MRI/CT scans. The objective of this senior project is to study 3D object detection from images for autonomous driving (AD).
By lowering travel time, energy use, and pollution, AD has the potential to greatly improve
people's lives. One of the most important aspects of autonomous driving is 3D object detection.
As a result, both research and industry have made significant efforts in the last decade to develop
self-driving vehicles. A key technology for the development of autonomous vehicles is 3D object detection.
Object detection is one of the main components of AD. Autonomous cars rely on
perception of their environment to enable safe and efficient driving. This vision system employs
object detection algorithms to precisely identify nearby objects, such as pedestrians, other
cars, traffic signs, and fences. Object detectors based on deep learning are required to identify
and localize these objects in real time. This report discusses the current state of object detectors
and the problems that stand in the way of integrating them into AD.
1.2. Objectives
The main goal of this senior project is to study 3D Object Detection from Images for AD
with Deep Learning and Computer Vision, then to suggest suitable models that can be used for real
applications, and finally to create a training and testing model and run a demo to check whether it works
properly.
The remainder of this report is organized as follows:
Chapter II: The project's specifications and standards will be presented.
Chapter III: The senior project's management, planning, project timelines, and resource allocation will be presented.
Chapter IV: This chapter reviews the literature on 3D object detection from images.
Chapter V: This chapter presents the methodology used in this project.
Chapter VI: This chapter will present the project's expected results.
Chapter VII: This chapter will present the conclusion and future work of the project.
CHAPTER II
SPECIFICATIONS AND STANDARDS
Processor: Intel(R) Core(TM) i5-10400F CPU @ 2.90 GHz (up to 4.30 GHz)
System type: 64-bit operating system, x64-based processor
The PC is running the most recent version of Anaconda, and all the experiments are run on Jupyter Notebook, which is provided by Anaconda. The following libraries are also required:
OpenCV
Cython
Numba
Matplotlib
SciPy
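As a quick sanity check, the following cell (a minimal sketch; no specific package versions are assumed) can be run in Jupyter Notebook to verify that these libraries are installed:

```python
# Minimal environment check for the libraries listed above.
# Module names are the standard import names; versions are not pinned here.
import importlib

required = ["cv2", "Cython", "numba", "matplotlib", "scipy"]

for module_name in required:
    try:
        module = importlib.import_module(module_name)
        version = getattr(module, "__version__", "unknown")
        print(f"{module_name}: OK (version {version})")
    except ImportError:
        print(f"{module_name}: NOT INSTALLED")
```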
CHAPTER III
PROJECT MANAGEMENT
The total cost of this study is about 15 million VND, mainly for the personal computer. The work is divided into three parts:
The first part: finding and understanding all the methods (7 weeks).
The second part: building, fixing, and compiling the model (4 weeks).
The last part: getting the results from the model and writing the report (4 weeks).
CHAPTER IV
LITERATURE REVIEW
Autonomous vehicles (AVs) are automobiles that employ technology to replace a human
driver partially or completely in navigating a vehicle from point A to point B while avoiding
road dangers and adjusting to traffic conditions. The Society of Automotive Engineers (SAE) has
developed a six-level rating system based on the degree of human involvement required. The National
Highway Traffic Safety Administration (NHTSA) in the United States employs this
categorization system.
4.1.1. Background
General Motors presented the concept of AD in 1939. Since then, the self-driving
automobile has undergone a full transition and is now an autonomous vehicle. To drive without
human involvement, AVs employ a mix of sensors, artificial intelligence, radars, and cameras.
This sort of vehicle is still in the development stage, since numerous components must be perfected.
The AVs' first priority is to recognize objects. On the road, there are both moving and
immovable items, such as people, other vehicles, traffic signals, and more. To avoid incidents
when moving, the car must recognize numerous objects. AVs collect data and create 3D maps
using sensors and cameras. This aids in identifying and detecting objects on the road while driving.
In particular, the vehicle must localize physical objects in 3D space, which is necessary for predicting future object movements. While
image-based 2D object recognition and instance segmentation have made significant advances,
3D object detection remains challenging and is essential in many domains, including AD and robotics [1,2]. Several contemporary mainstream 3D detectors rely
on light detection and ranging (LiDAR) sensors to collect accurate 3D data, and the application
of LiDAR data is viewed as vital to the success of 3D detectors. Stereo cameras, on the other
hand, which function similarly to human binocular vision, are less costly and offer higher
resolutions. Hence, they have received significant interest in academia and industry.
4.1.2. How do autonomous vehicles work?
Autonomous vehicles construct and maintain a map of their environment using a variety
of sensors placed throughout the vehicle. Radar sensors monitor the activities of surrounding
cars. Video cameras detect traffic signals while simultaneously reading road signs, tracking other
cars, and looking for people. By bouncing light pulses off the car's surroundings, lidar sensors measure the distances to nearby objects.
Following the analysis of all this sensory data, advanced software designs a course and
delivers orders to the car's actuators, which regulate acceleration, braking, and steering.
Hard-coded rules, obstacle avoidance algorithms, predictive modeling, and object identification help the program observe traffic laws and navigate obstacles.
Most modern image-based approaches solve the 3D detection problem using anchor-based 2D detection or depth estimation.
Nonetheless, the significant computational cost of these technologies prevents them from
attaining real-time performance. In this report, I would like to introduce a solution to this vision-based detection problem.
AD has the potential to significantly improve people's lives by improving mobility while
reducing travel time, pollution, and energy consumption. Unsurprisingly, both academia and
industry have made major efforts in the recent decade to build self-driving automobiles. As one
of the primary enabling technologies for autonomous driving, 3D object detection has received
a lot of attention. Deep learning-based 3D object recognition systems have gained popularity
recently.
Existing 3D object recognition systems may be classified into two categories based on
whether the incoming data is images or LiDAR signals. Approaches that estimate 3D bounding
boxes from photos alone confront a far bigger difficulty than LiDAR-based algorithms, since it is
difficult to extract 3D information from 2D input data. Despite this inherent complexity, image-
based 3D object recognition algorithms have evolved in the field of computer vision during the
previous six years, with more than 80 articles published in top-tier conferences and journals in this field.
The goal of image-based 3D object detection is to identify and find objects of interest
given RGB images and camera parameters. Each object is represented by its category and a 3D bounding box, which is parameterized by:
Dimension: [h, w, l]
Location: [x, y, z]
Orientation: [θ, φ, ψ]
In most autonomous driving scenarios, just the heading angle θ around the up-axis (yaw
angle) is considered.
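For illustration only, this parameterization can be captured in a small data structure (the class and field names below are mine, not from any particular codebase):

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """A 3D bounding box parameterized as described above."""
    h: float    # height (m)
    w: float    # width  (m)
    l: float    # length (m)
    x: float    # location along the camera x-axis (m)
    y: float    # location along the camera y-axis (m)
    z: float    # depth along the camera z-axis (m)
    yaw: float  # heading angle theta around the up-axis (rad)

# Example: a car roughly 10 m in front of the camera.
car = Box3D(h=1.5, w=1.6, l=3.9, x=0.0, y=1.6, z=10.0, yaw=0.0)
```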
Figure 4.2: 3D bounding box
CAD models, LiDAR signals, external data, temporal sequences, and stereo images are
examples of auxiliary data applications. Stereo images provide the most important information to
image-based 3D detection, and algorithms that use this sort of data outperform others at the
inference stage.
+ Methods based on 2D features: Starting from the input RGB images, these methods first estimate the 2D
locations, orientations, and dimensions of the objects from the 2D features and then recover the 3D
locations from these estimates. As a result, these approaches are often known as 'result lifting-based methods.'
+ Methods based on 3D features: The primary benefit of these algorithms is that they first
build 3D features from the images and then estimate the 3D bounding boxes directly from them. Depending on how the 3D features are obtained, they can be divided into
two categories: feature lifting methods and data lifting methods.
For the methods based on 2D features, an intuitive and widely used technique for obtaining an object's 3D position [x,
y, z] is to use CNNs to estimate the depth value d and then lift the 2D detection into 3D space
using:
\[
\begin{cases}
z = d, \\
x = (u - C_x) \times z / f, \\
y = (v - C_y) \times z / f,
\end{cases}
\tag{4.1}
\]
with:
\((u, v)\) the 2D location of the object in the image,
\((C_x, C_y)\) the principal point of the camera,
\(f\) the focal length.
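Equation (4.1) can be implemented directly. Below is a minimal sketch that lifts a 2D point with an estimated depth into 3D camera coordinates (the intrinsics values in the example are illustrative only):

```python
def lift_to_3d(u, v, d, fx, fy, cx, cy):
    """Lift a 2D image point (u, v) with estimated depth d into 3D camera
    coordinates using the pinhole model of Eq. (4.1).

    fx, fy : focal lengths in pixels (Eq. (4.1) uses a single f; separate
             values are kept here for generality).
    cx, cy : principal point in pixels.
    """
    z = d
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Example with KITTI-like intrinsics (values are illustrative only).
x, y, z = lift_to_3d(u=640.0, v=200.0, d=15.0, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(x, y, z)
```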
Furthermore, unlike methods that need full depth maps, such as Pseudo-LiDAR [3], these
approaches only require the depths of the objects' centers. While these approaches
are broadly equivalent to 2D detectors, I divide them into two sub-categories for ease of
presentation: region-based methods and single-shot methods.
Region-based methods follow the high-level concept of the R-CNN series [4], [5], [6]. After generating category-independent region proposals
from the input pictures, CNNs [6], [7] extract features from these regions. Finally, R-CNN uses these
features to refine the proposals and assign them to categories. Besides the classic proposal generation
methods [11], [12], a simple way to generate proposals for 3D detection is to tile 3D anchors
(proposal shape templates) on the ground plane and then project them to the image plane as proposals.
Chen et al. [8], [9], [10] pioneered the Mono3D and 3DOP techniques for monocular and stereo-based
detection, respectively, which reduce the search space by rejecting proposals with low confidence scores.
The Region Proposal Network (RPN) [6] enables detectors to construct 2D proposals
using features from the last shared convolutional layer rather than external techniques, removing
most of the computational cost and allowing it to be adopted by multiple image-based 3D detectors.
Single-shot object detectors directly estimate the class probabilities and regress the 2D/3D box
attributes from each feature-map location in a single stage. As a result, these systems can
make predictions more quickly than region-based techniques, which is critical in the context of
AD.
Because single-shot approaches employ only CNN layers, they are also easier to deploy
on a variety of hardware architectures. Furthermore, several relevant publications [13], [14], [15],
[26] have shown that single-shot detectors can also yield promising results.
The fundamental advantage of the methods based on 3D features is that they first build 3D features
from the images and then directly estimate all elements of the 3D bounding boxes, including the
3D positions, in 3D space. Depending on how the 3D features are obtained, there are two categories: feature lifting-based methods and data lifting-based methods.
The feature lifting approaches work by converting the 2D image features in the image plane into 3D voxel features in world space.
Furthermore, previous feature lifting-based algorithms [17], [18], [22], [19], [20] further
collapse the vertically oriented 3D voxel features to yield BEV (bird's-eye-view) features before estimating the final results.
The main issue is figuring out how to convert the 2D image features into 3D voxel features.
Feature lifting for monocular methods: To achieve feature lifting, Roddick et al. [17]
presented OFTNet. They obtain the voxel feature by
aggregating the 2D features over the region of the front-view image feature map bounded by
the projection of each voxel, with top-left corner (u1, v1) and bottom-right corner (u2, v2):
\[
V(x, y, z) = \frac{1}{(u_2 - u_1)(v_2 - v_1)} \sum_{u = u_1}^{u_2} \sum_{v = v_1}^{v_2} F(u, v)
\tag{4.2}
\]
where V(x, y, z) denotes the feature of voxel (x, y, z) and F(u, v) denotes the feature of pixel (u, v).
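A minimal NumPy sketch of Eq. (4.2) is given below, assuming the voxel corners have already been projected to integer image coordinates (u1, v1) and (u2, v2):

```python
import numpy as np

def voxel_feature(feature_map, u1, v1, u2, v2):
    """Average-pool the 2D feature map over the image region [u1, u2] x [v1, v2]
    that the voxel projects to, as in Eq. (4.2).

    feature_map : array of shape (H, W, C) holding the pixel features F(u, v).
    Returns the C-dimensional voxel feature V(x, y, z).
    """
    h, w, _ = feature_map.shape
    # Clip the projected box to the image bounds.
    u1, u2 = max(0, u1), min(w - 1, u2)
    v1, v2 = max(0, v1), min(h - 1, v2)
    region = feature_map[v1:v2 + 1, u1:u2 + 1, :]   # rows index v, columns index u
    return region.mean(axis=(0, 1))

# Example: a random 2D feature map and one projected voxel region.
F = np.random.rand(96, 320, 64)
V = voxel_feature(F, u1=100, v1=40, u2=120, v2=60)
print(V.shape)  # (64,)
```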
Reading et al. [22] accomplish feature lifting by back-projection [21]. To begin, they
discretize the continuous depth space into bins and treat depth estimation as a classification task.
Instead of a single number, the depth estimate is a distribution D over these bins. Then, to
construct the 3D frustum feature G(u, v), the pixel feature F(u, v) is weighted by its depth distribution:
\[
G(u, v) = D(u, v) \otimes F(u, v)
\tag{4.3}
\]
where the outer product is denoted by ⊗. Because this frustum feature lives in the image-depth
coordinate system (u, v, d), it must be aligned to the 3D world coordinate system (x, y, z) using
the camera parameters before creating the voxel feature. These two techniques are illustrated in Figure 4.4.
Figure 4.4. A diagram of the feature lifting approaches. Right: to lift 2D features into 3D space, the image features are weighted by their estimated depth distributions.
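A small NumPy sketch of this depth-distribution weighting (tensor shapes are illustrative; D is assumed to already be a normalized distribution over depth bins):

```python
import numpy as np

# Pixel features F: (H, W, C) and per-pixel depth distributions D: (H, W, B),
# where B is the number of discretized depth bins.
H, W, C, B = 48, 160, 64, 80
F = np.random.rand(H, W, C)
D = np.random.rand(H, W, B)
D = D / D.sum(axis=-1, keepdims=True)   # normalize so each pixel's bins sum to 1

# Frustum feature G(u, v) = D(u, v) (outer product) F(u, v): shape (H, W, B, C).
G = np.einsum('hwb,hwc->hwbc', D, F)
print(G.shape)  # (48, 160, 80, 64)
```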
Feature lifting for stereo-based methods: Compared with monocular algorithms, creating 3D features from stereo pairs is easier than developing them from
monocular images.
Chen et al. [18] presented the Deep Stereo Geometry Network (DSGN), which lifts features
using stereo pictures as input. They begin by extracting features from the stereo pairs and then
construct a 4D plane-sweep volume with the conventional plane-sweeping technique [23], [24],
[25], by joining the left image feature with the reprojected right image feature at equally spaced
depth values. The 4D volume is then transformed into 3D world space before being used to
generate the BEV map, which is used to predict the final results.
The 2D images are transformed into 3D data by the data lifting algorithms, and the obtained 3D data are then processed by LiDAR-based detectors.
The pseudo-LiDAR pipeline was presented to build a bridge between image-based
approaches and LiDAR-based methods, using well-studied depth estimation, disparity estimation,
and LiDAR-based 3D object detection. In this pipeline, dense depth maps [27, 28] must first be estimated
from the pictures (or disparity maps [26, 24], which are then turned into depth maps
[34]). The 3D position (x, y, z) of each pixel (u, v) can then be calculated using:
\[
\begin{cases}
z = d, \\
x = (u - C_x) \times z / f, \\
y = (v - C_y) \times z / f,
\end{cases}
\tag{4.4}
\]
By back-projecting all pixels into 3D coordinates, the pseudo-LiDAR signal \(\{(x_n, y_n, z_n)\}_{n=1}^{N}\) is created, where N is the number of pixels. The pseudo-LiDAR
signals may then be used as input for LiDAR-based detection algorithms [31], [32], [33].
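Below is a minimal sketch of this data lifting step, assuming a dense depth map and pinhole intrinsics are available (in the actual pipeline [3], [34] the resulting points are then fed to a LiDAR-based detector):

```python
import numpy as np

def depth_map_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project every pixel of a dense depth map into 3D camera coordinates,
    producing an (N, 3) pseudo-LiDAR point cloud as in Eq. (4.4)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example with a random depth map and illustrative intrinsics.
depth = np.random.uniform(1.0, 80.0, size=(384, 1280))
cloud = depth_map_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(cloud.shape)  # (491520, 3)
```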
4.3. Dataset
The most widely utilized datasets are KITTI 3D [35], nuScenes [36], and Waymo Open [37], which are described below.
For the last decade, KITTI 3D was the sole dataset available to aid in the development of
image-based 3D detectors. KITTI 3D provides front-view images at a resolution of 1280x384 pixels.
The nuScenes and Waymo Open datasets were introduced in 2019. Six cameras
are utilized in the nuScenes dataset to provide a 360-degree view with a resolution of 1600x900
pixels. Similarly, Waymo Open captures 360 degrees of vision with five synchronized cameras.
The KITTI 3D dataset contains 7,481 images for training and 7,518 images for testing,
and it is standard practice to divide the training data into a training set and a validation set [15],
[1], [16]. The 3DOP split [1], the most often used, has 3,712 and 3,769 images for training and
validation, respectively. The large-scale nuScenes and Waymo Open provide around 40K and 200K
annotated frames, respectively, and employ multiple cameras to record each frame's panoramic
view. NuScenes, in particular, offers 28,140 frames, 6,020 frames, and 6,009 frames for
training, validation, and testing (six images per frame). The largest, Waymo Open, provides
122,300 frames for training, 31,407 frames for validation, and 41,077 frames for testing (five images per frame).
CHAPTER V
METHODOLOGY
The proposed method of 3D Object Detection from Images for AD is given in this
chapter.
5.1 Methods
The method used in this project is based on 2D features: it estimates the 2D
locations, orientations, and dimensions (see Figure 5.1 for a visual representation of these quantities)
from the 2D features and then recovers the 3D locations from these estimates.
The framework adopted is CenterNet, which was proposed for image-based 3D detection by Zhou et al. [38] in 2019. This framework, in particular, encodes the object
as a single point (the center point of the object) and finds it via keypoint estimation.
Furthermore, numerous parallel heads are utilized to estimate the object's other attributes, such as its size, local offset, depth, and orientation.
5.2. 2D Feature
Estimating the object's center of gravity (Based on heat maps – heatmap head).
CenterNet borrows principles from the keypoint estimation problem and treats each object's center as a keypoint to be predicted on a heatmap.
Each pixel in this heatmap gives the probability that the corresponding cell contains an object's center.
Following that, CenterNet filters the local maximum points on the heatmap to locate the centers of the
objects in the picture. To create the heatmap, the model optimizes a fully convolutional network architecture.
This design includes a backbone for feature extraction and a head (in this case, a heatmap head).
With an input image \(I \in \mathbb{R}^{W \times H \times 3}\), the output is a heatmap \(\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}\), where R is the output stride and C is the number of object classes. The ground-truth heatmap is constructed as follows:
First, create a heatmap \(Y \in [0]^{\frac{W}{R} \times \frac{H}{R} \times C}\) filled with zeros.
From the bounding box of the object, determine the keypoint \(p_c\) as the center of the bounding box.
Determine the corresponding point of \(p_c\) on the heatmap: \(\tilde{p}_c = \lfloor p_c / R \rfloor\) (integer part).
Replace \(Y_{xyc} = \exp\left(-\dfrac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)\), where \(\sigma_p\) is an object size-adaptive standard deviation.
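For illustration, a small NumPy sketch of this ground-truth heatmap construction is given below (σ is fixed to a constant here for simplicity; CenterNet actually derives the object size-adaptive σ_p from the bounding box):

```python
import numpy as np

def draw_center_heatmap(heatmap, center_xy, sigma):
    """Splat a Gaussian around the (quantized) object center onto one channel of the
    ground-truth heatmap, keeping the element-wise maximum where objects overlap."""
    H, W = heatmap.shape
    cx, cy = int(center_xy[0]), int(center_xy[1])     # integer part (tilde p_c)
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap[:] = np.maximum(heatmap, gaussian)
    return heatmap

# Example: one 128x128 heatmap channel with an object center at (40.7, 60.2)
# in output (stride-R) coordinates.
Y = np.zeros((128, 128), dtype=np.float32)
Y = draw_center_heatmap(Y, center_xy=(40.7, 60.2), sigma=3.0)
print(Y.max(), Y[60, 40])  # both 1.0
```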
The only thing left to do is to define the loss functions so that the model can learn these predictions.
By taking only the integer part in \(\tilde{p}_c = \lfloor p_c / R \rfloor\), an offset is created between the actual center point and its quantized location on the heatmap.
To learn this offset, CenterNet uses one more head, the offset head, whose output is a tensor \(\hat{O} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}\) representing the horizontal and vertical offsets.
In parallel with estimating the position of the center, the size of the object is also estimated.
Similar to offset estimation, CenterNet uses another head, the dimension (wh) head, to estimate the
width and height of the object. The output of the wh head is a tensor \(\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}\).
As mentioned previously, the model uses three head networks (three CNNs) to learn three
separate tasks: the heatmap head, the offset head, and the dimension head. The architecture is illustrated in Figure 5.4 below.
Figure 5.4. CenterNet Architecture
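To make this multi-head design concrete, the following is a minimal PyTorch-style sketch of the three output heads attached to a shared backbone feature map (the backbone itself is omitted and the layer widths are illustrative assumptions, not the exact CenterNet implementation):

```python
import torch
import torch.nn as nn

class CenterHeads(nn.Module):
    """Three parallel heads on top of a backbone feature map:
    heatmap (C channels), offset (2 channels), and width/height (2 channels)."""
    def __init__(self, in_channels=64, num_classes=3):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1),
            )
        self.heatmap_head = head(num_classes)
        self.offset_head = head(2)
        self.wh_head = head(2)

    def forward(self, feat):
        # feat: backbone output of shape (B, in_channels, H/R, W/R)
        heatmap = torch.sigmoid(self.heatmap_head(feat))  # values in [0, 1]
        offset = self.offset_head(feat)
        wh = self.wh_head(feat)
        return heatmap, offset, wh

# Example: a dummy 128x128 feature map for a 512x512 input with stride R = 4.
heads = CenterHeads(in_channels=64, num_classes=3)
hm, off, wh = heads(torch.randn(1, 64, 128, 128))
print(hm.shape, off.shape, wh.shape)
```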
In Figure 5.4, the loss terms are defined as follows:
Heatmap loss (\(L_k\)) for the heatmap head: a focal loss is used (due to the imbalance between points that are keypoints and points that are background). Here, α and β are the hyperparameters of the focal loss (CenterNet uses α = 2 and β = 4) and N is the number of keypoints in the image:
\[
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
\left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\
\left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise}
\end{cases}
\tag{5.2}
\]
Offset loss (\(L_{off}\)) for the offset head: an L1 loss between the predicted offset and the ground-truth sub-pixel offset,
\[
L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|
\tag{5.3}
\]
Size loss (\(L_{size}\)) for the dimension head: an L1 loss between the predicted and ground-truth sizes \(s_k\) of the N objects,
\[
L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|
\tag{5.4}
\]
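The heatmap and L1 losses above can be sketched in PyTorch as follows (a simplified sketch; the λ weights used to combine the terms and the exact index gathering of the official implementation are omitted):

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss of Eq. (5.2) over a predicted heatmap in [0, 1]."""
    pos = gt.eq(1).float()                       # keypoint locations (Y = 1)
    neg = gt.lt(1).float()                       # all other locations
    pred = pred.clamp(eps, 1 - eps)              # avoid log(0)
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1)             # N = number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos

def l1_loss_at_centers(pred_at_centers, target):
    """L1 losses of Eqs. (5.3)/(5.4): predictions and targets gathered at the N
    object centers, both of shape (N, K) with K = 2 (offset or width/height)."""
    return torch.abs(pred_at_centers - target).sum(dim=1).mean()

# Example with dummy tensors on a 128x128 output map with 3 classes.
pred_hm = torch.rand(1, 3, 128, 128)
gt_hm = torch.zeros(1, 3, 128, 128)
gt_hm[0, 0, 60, 40] = 1.0
print(heatmap_focal_loss(pred_hm, gt_hm).item())
print(l1_loss_at_centers(torch.rand(5, 2), torch.rand(5, 2)).item())
```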
5.3. 3D Detection
3D detection requires three extra qualities per center point: depth, 3D dimension, and
orientation.
The depth d is a single scalar per center point. However, regressing depth directly is difficult.
Instead, the output transformation of Eigen et al. [39] is employed, \(d = \frac{1}{\sigma(\hat{d})} - 1\), where σ is the sigmoid
function. The depth is computed as an additional output channel \(\hat{D} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R}}\) of the keypoint
estimator. This head again employs two convolutional layers separated by a ReLU. Unlike the other
outputs, it applies the inverse sigmoidal transformation at the output layer. The depth estimator is
trained with an L1 loss in the original depth domain, after the sigmoidal transformation.
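The inverse-sigmoid output transform amounts to a single line; a small sketch:

```python
import torch

def decode_depth(raw_depth):
    """Convert the network's raw depth output d_hat into metric depth using
    d = 1 / sigmoid(d_hat) - 1, following the transform of Eigen et al. [39]."""
    return 1.0 / torch.sigmoid(raw_depth) - 1.0

print(decode_depth(torch.tensor([-3.0, 0.0, 3.0])))  # approx. [20.1, 1.0, 0.05]
```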
The 3D dimensions of an object are three scalars. They are regressed directly to their absolute
values in meters using a separate head \(\hat{F} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 3}\) and an L1 loss.
Following the lead of Mousavian et al. [7], the orientation is expressed using two bins with in-bin
regression. The orientation is represented by 8 scalars, with 4 scalars for each bin. Two scalars
are used for softmax classification in each bin, and the remaining two scalars regress to an angle within the bin.
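As an illustration, one way such an 8-scalar encoding could be decoded is sketched below (the scalar layout and the bin centers of ±π/2 are assumptions in the spirit of the Multi-Bin formulation, not necessarily the exact CenterNet layout):

```python
import math

def decode_orientation(ori8):
    """Decode an 8-scalar orientation output into a yaw angle.

    ori8 = [b1_in, b1_out, b1_sin, b1_cos, b2_in, b2_out, b2_sin, b2_cos]
    Bin centers of -pi/2 and +pi/2 are assumed; the layout of the 8 scalars
    is an illustrative assumption.
    """
    bin_centers = (-math.pi / 2, math.pi / 2)
    use_bin2 = ori8[4] > ori8[0]          # pick the bin with the higher "inside" score
    idx = 4 if use_bin2 else 0
    sin_off, cos_off = ori8[idx + 2], ori8[idx + 3]
    return bin_centers[1 if use_bin2 else 0] + math.atan2(sin_off, cos_off)

print(decode_orientation([2.0, 0.1, 0.0, 1.0, 0.5, 0.3, 0.0, 1.0]))  # approx. -pi/2
```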
5.4. Dataset
KITTI is a collection of vision tasks built using an autonomous driving platform. The
whole benchmark includes several tasks such as stereo, optical flow, visual odometry, and so on.
The object detection dataset, comprising monocular images and bounding boxes, is included in
this benchmark. The dataset includes 7,481 training images that have been labeled with 3D bounding
boxes. The readme file of the object development kit on the KITTI webpage contains a detailed description of the data format.
Figure 5.6. Dataset feature
5.5. Training
I train with a 512x512 input resolution, which results in an output resolution of 128x128 for
all models. As data augmentation, I utilize random flip, random scaling (between 0.6 and 1.3),
cropping, and color jittering, and use Adam [40] to optimize the overall objective. Augmentations that alter the 3D geometry are not applied when training the 3D estimation branch.
I train residual networks and DLA-34 using a batch size of 128 (on 8 GPUs) and a
learning rate of 5e-4 for 140 epochs, with the learning rate reduced by a factor of 10 at epochs 90 and 120
(following [55]). For Hourglass-104, I use a batch size of 29 (on 5 GPUs, with a
master GPU batch size of 4) and a learning rate of 2.5e-4 for 50 epochs, with a 10x learning rate
reduction at the 40th epoch. To save computation, the Hourglass-104 is fine-tuned from
ExtremeNet [61] for detection. The downsampling layers of ResNet-101 and DLA-34 are initialized
with ImageNet pretraining, whereas the up-sampling layers are randomly initialized. On 8 TITAN-V
GPUs, ResNet-101 and DLA-34 train in 2.5 days, whereas Hourglass-104 takes 5 days.
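For reference, the hyperparameters above for the residual-network/DLA-34 setting can be collected into a single configuration mapping (values copied from the description above; they may differ from the released code):

```python
# Training hyperparameters as described above (residual network / DLA-34 setting).
train_config = {
    "input_resolution": (512, 512),
    "output_resolution": (128, 128),
    "augmentation": ["random_flip", "random_scaling_0.6_to_1.3", "cropping", "color_jitter"],
    "optimizer": "Adam",
    "batch_size": 128,            # on 8 GPUs
    "learning_rate": 5e-4,
    "epochs": 140,
    "lr_drop_epochs": [90, 120],  # learning rate divided by 10 at these epochs
}
print(train_config)
```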
CHAPTER VI
EXPECTED RESULTS
The method is evaluated on the KITTI dataset [35], which comprises properly annotated 3D bounding boxes for
7,481 training images, following the conventional training and validation splits described
in the literature [10, 54]. The assessment metric is the average precision for cars at 11 recall points (0.0 to 1.0 with 0.1
increments) at an IOU threshold of 0.5, as in 2D object detection [14].
IOUs are assessed using the 2D bounding box (AP), orientation (AOS), and bird's-eye-view bounding
box (BEV AP). For both training and testing, the original picture resolution is preserved and padded
to 1280x384. Training completes in 70 epochs, with the learning rate decreasing at epochs 45 and 60.
Because the number of recall thresholds is small, the validation AP varies by up to 10% AP.
As a result, five models are trained and the average with standard deviation is presented. The
comparison is made against existing image-based baselines, including the monocular-based Mono3D [9]. As demonstrated in Table 6.1, the technique performs similarly to its
competitors in AP and AOS and somewhat better in BEV AP, while CenterNet is also considerably faster at inference.
Table 6.1 shows the 2D bounding box AP, average orientation score (AOS), and bird's-eye-view AP (BEV AP) results.
Figure 6.4: 3D detection with bounding boxes
CHAPTER VII
CONCLUSION AND FUTURE WORK
This report has introduced the 3D Object Detection from Images for Autonomous Driving task and clarified its difficulties. Some advanced
3D Object Detection from Images for Autonomous Driving methods and their performances
have also been discussed in this report. Then, a suitable dataset (KITTI) for the 3D Object Detection from Images for Autonomous Driving task has been selected.
This senior project has successfully categorized 3D Object Detection from Images for
Autonomous Driving methods. Finally, a 3D Object Detection from Images for Autonomous
Driving architecture based on a successfully working system (CenterNet) has also been proposed in terms of
theoretical architecture.
For future work, the performance of CenterNet will be consecutively tested with various modifications and datasets to further improve the results.
REFERENCES
[1] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection in monocular video,” in
ECCV, 2020.
[2] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the
world’s imagery,” in CVPR, 2016.
[3] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection
and semantic segmentation,” in CVPR, 2014.
[5] R. Girshick, “Fast r-cnn,” in ICCV, 2015.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards realtime object detection with region
proposal networks,” in NeurIPS, 2015.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
[8] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun,“Monocular 3d object detection for
autonomous driving,” in CVPR, 2016.
[9] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals for
accurate object class detection,” in NeurIPS, 2015.
[10] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals using stereo imagery
for accurate object class detection,” T-PAMI, 2017.
[11] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object
recognition,” IJCV, 2013.
[12] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
[13] J. Feng, E. Ding, and S. Wen, “Monocular 3d object detection via feature domain adaptation,” in ECCV,
2020.
[14] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[15] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in CVPR,
2019.
[16] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval, 1999.
[17] T. Roddick, A. Kendall, and R. Cipolla, “Orthographic feature transform for monocular 3d object
detection,” 2019.
[18] Y. Chen, S. Liu, X. Shen, and J. Jia, “Dsgn: Deep stereo geometry network for 3d object detection,” in CVPR, 2020.
[19] X. Guo, S. Shi, X. Wang, and H. Li, “Liga-stereo: Learning lidar geometry aware representations for
stereo-based 3d detector,” in ICCV, 2021.
[20] J. Huang, G. Huang, Z. Zhu, and D. Du, “Bevdet: Highperformance multi-camera 3d object detection in
bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
[21] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly
unprojecting to 3d,” in ECCV, 2020.
[22] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for
monocular 3d object detection,” in CVPR, 2021.
[23] R. T. Collins, “A space-sweep approach to true multi-image matching,” in CVPR, 1996.
[24] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the
world’s imagery,” in CVPR, 2016.
[25] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,”
in ECCV, 2018.
[26] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train
convolutional networks for disparity, optical flow, and scene flow estimation,” in CVPR, 2016.
[27] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right
consistency,” in CVPR, 2017.
[28] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for
monocular depth estimation,” in CVPR, 2018.
[29] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2d detection to 6d pose and metric
shape,” in CVPR, 2019.
[30] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
[31] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3d proposal generation and object
detection from view aggregation,” in IROS, 2018.
[32] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in CVPR,
2018.
[33] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in
CVPR, 2019.
[34] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019.
[35] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark
suite,” in CVPR, 2012.
[36] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O.
Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020.
[37] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine
et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in ICCV, 2020.
[38] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[39] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in NIPS, 2014.
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.