HO CHI MINH CITY
INTERNATIONAL UNIVERSITY
SCHOOL OF ELECTRICAL ENGINEERING
BY
Under the guidance and approval of the committee, and approved by its members, this
thesis has been accepted in partial fulfillment of the requirements for the degree.
Approved:
________________________________
Chairperson
_______________________________
Committee member
________________________________
Committee member
________________________________
Committee member
________________________________
Committee member
HONESTY DECLARATION
My name is Nguyen Hoàng Duy Bao. I would like to declare that, apart from the
acknowledged references, this thesis does not use language, ideas, or other original
material from anyone else, and has not been previously submitted to any other educational or
research program or institution. I fully understand that any writing in this thesis that
contradicts the above statement will automatically lead to rejection from the EE program.
Date:
Student’s Signature
(Full name)
TURNITIN DECLARATION
Date: 08/06/2022
ACKNOWLEDGMENT
I would like to express my sincere gratitude for the guidance of Dr. Nguyen Ngoc Hung. His constant encouragement and support helped me
to achieve my goal.
TABLE OF CONTENTS
HONESTY DECLARATION.............................................................................................ii
TURNITIN DECLARATION............................................................................................iii
ACKNOWLEDGMENT....................................................................................iv
TABLE OF CONTENTS....................................................................................................v
LIST OF TABLES............................................................................................................vii
LIST OF FIGURES..........................................................................................................viii
ABSTRACT........................................................................................................................x
CHAPTER I INTRODUCTION.........................................................................................1
1.1. Overview..................................................................................................1
1.2. Objectives.................................................................................................................1
4.1.1. Background......................................................................................................5
4.3. Dataset....................................................................................................................15
CHAPTER V METHODOLOGY.....................................................................................17
5.1 Methods...................................................................................................................17
5.2. 2D Feature..............................................................................................................17
5.2.1. Estimating the object's center of gravity........................................................18
5.3. 3D Detection..........................................................................................................22
5.4. Dataset....................................................................................................................22
5.5. Training..................................................................................................................23
REFERENCES..................................................................................................................29
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS AND NOTATIONS
3D: Three-dimensional
2D: Two-dimensional
ABSTRACT
In recent years, both industry and academia have paid more attention to 3D
object detection from images, one of the fundamental and difficult challenges in
computer vision, owing to the rapid development of deep learning algorithms. In particular, more than 200
publications covering a wide range of theories, methods, and applications explored this
subject between 2015 and 2021. In some recent studies, this task can be achieved by
applying a pre-trained model. In this senior project, a 3D object detection from images approach for autonomous driving, based on CenterNet, is studied and evaluated.
CHAPTER I
INTRODUCTION
1.1 Overview
Object detection is a computer vision task that is now used in a wide range of consumer
applications, including surveillance and security systems, autonomous driving, mobile text
recognition, and disease diagnosis using MRI/CT scans. The objective of this senior project is to study 3D object detection from images for autonomous driving (AD).
By lowering travel time, energy use, and pollution, AD has the potential to greatly improve
people's lives. One of the most important aspects of autonomous driving is 3D object detection.
As a result, both research and industry have made significant efforts in the last decade to develop
self-driving vehicles. A key technology for the development of autonomous vehicles is 3D object detection.
Object detection is one of the main components of AD. Autonomous cars rely on
perception of their environment to enable safe and efficient driving. This vision system employs
object detection algorithms to precisely identify nearby objects, such as pedestrians, other
cars, traffic signs, and fences. Object detectors based on deep learning are required to identify
and localize these objects in real time. This report discusses the current state of object detectors
and the problems that stand in the way of integrating them into AD.
1.2. Objectives
The main goal of this senior project is to study 3D Object Detection from Images for AD
with Deep Learning and Computer Vision, then to suggest suitable models that can be used for real
applications, and finally to create a training and testing model and run a demo to check whether it works
properly.
The remainder of this report is organized as follows:
Chapter II: The project's specifications and standards will be presented.
Chapter III: The senior project's management, planning, project timelines, and resource allocation will be presented.
Chapter IV: This chapter reviews the literature on 3D object detection from images.
Chapter V: This chapter presents the methodology used in this project.
Chapter VI: This chapter will present the project's expected results.
Chapter VII: This chapter will present the conclusion and future work of the project.
CHAPTER II
SPECIFICATIONS AND STANDARDS
Processor: Intel(R) Core(TM) i5-10400F CPU @ 2.90 GHz (up to 4.30 GHz)
System type: 64-bit operating system, x64-based processor
The PC is running the most recent version of Anaconda, and all the experiments are run on Jupyter Notebook, which is provided by Anaconda. The following libraries are also required:
OpenCV
Cython
Numba
Matplotlib
SciPy
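As a quick sanity check, the following cell (a minimal sketch; no specific package versions are assumed) can be run in Jupyter Notebook to verify that these libraries are installed:

```python
# Minimal environment check for the libraries listed above.
# Module names are the standard import names; versions are not pinned here.
import importlib

required = ["cv2", "Cython", "numba", "matplotlib", "scipy"]

for module_name in required:
    try:
        module = importlib.import_module(module_name)
        version = getattr(module, "__version__", "unknown")
        print(f"{module_name}: OK (version {version})")
    except ImportError:
        print(f"{module_name}: NOT INSTALLED")
```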
CHAPTER III
PROJECT MANAGEMENT
The total cost of this study is about 15 million VND, mainly for the personal computer. The work is divided into three parts:
The first part: finding and understanding all the methods (7 weeks).
The second part: building, fixing, and compiling the model (4 weeks).
The last part: getting the results from the model and writing the report (4 weeks).
CHAPTER IV
LITERATURE REVIEW
Autonomous vehicles (AVs) are automobiles that employ technology to replace a human
driver partially or completely in navigating a vehicle from point A to point B while avoiding
road dangers and adjusting to traffic conditions. The Society of Automotive Engineers (SAE) has
developed a six-level rating system based on the degree of human involvement required. The National
Highway Traffic Safety Administration (NHTSA) in the United States employs this
categorization system.
4.1.1. Background
General Motors presented the concept of AD in 1939. Since then, the self-driving
automobile has undergone a full transition and is now an autonomous vehicle. To drive without
human involvement, AVs employ a mix of sensors, artificial intelligence, radars, and cameras.
This sort of vehicle is still in the development stage, since numerous components must be perfected.
The AVs' first priority is to recognize objects. On the road, there are both moving and
immovable items, such as people, other vehicles, traffic signals, and more. To avoid incidents
when moving, the car must recognize numerous objects. AVs collect data and create 3D maps
using sensors and cameras. This aids in identifying and detecting objects on the road while driving.
In particular, the vehicle must localize physical objects in 3D space, which is necessary for predicting future object movements. While
image-based 2D object recognition and instance segmentation have made significant advances,
3D object detection remains challenging and is essential in many domains, including AD and robotics [1,2]. Several contemporary mainstream 3D detectors rely
on light detection and ranging (LiDAR) sensors to collect accurate 3D data, and the application
of LiDAR data is viewed as vital to the success of 3D detectors. Stereo cameras, on the other
hand, which function similarly to human binocular vision, are less costly and offer higher
resolutions. Hence, they have received significant interest in academia and industry.
4.1.2. How do autonomous vehicles work?
Autonomous vehicles construct and maintain a map of their environment using a variety
of sensors placed throughout the vehicle. Radar sensors monitor the activities of surrounding
cars. Video cameras detect traffic signals while simultaneously reading road signs, tracking other
cars, and looking for people. By bouncing light pulses off the car's surroundings, lidar sensors measure the distances to nearby objects.
Following the analysis of all this sensory data, advanced software designs a course and
delivers orders to the car's actuators, which regulate acceleration, braking, and steering.
Hard-coded rules, obstacle avoidance algorithms, predictive modeling, and object identification help the program observe traffic laws and navigate obstacles.
Most modern image-based approaches solve the 3D detection problem using anchor-based 2D detection or depth estimation.
Nonetheless, the significant computational cost of these technologies prevents them from
attaining real-time performance. In this report, I would like to introduce a solution to this vision-based detection problem.
AD has the potential to significantly improve people's lives by improving mobility while
reducing travel time, pollution, and energy consumption. Unsurprisingly, both academia and
industry have made major efforts in the recent decade to build self-driving automobiles. As one
of the primary enabling technologies for autonomous driving, 3D object detection has received
a lot of attention. Deep learning-based 3D object recognition systems have gained popularity
recently.
Existing 3D object recognition systems may be classified into two categories based on
whether the incoming data is images or LiDAR signals. Approaches that estimate 3D bounding
boxes from photos alone confront a far bigger difficulty than LiDAR-based algorithms, since it is
difficult to extract 3D information from 2D input data. Despite this inherent complexity, image-
based 3D object recognition algorithms have evolved in the field of computer vision during the
previous six years, with more than 80 articles published in top-tier conferences and journals in this field.
The goal of image-based 3D object detection is to identify and find objects of interest
given RGB images and camera parameters. Each object is represented by its category and a 3D bounding box, which is parameterized by:
Dimension: [h, w, l]
Location: [x, y, z]
Orientation: [θ, φ, ψ]
In most autonomous driving scenarios, just the heading angle θ around the up-axis (yaw
angle) is considered.
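For illustration only, this parameterization can be captured in a small data structure (the class and field names below are mine, not from any particular codebase):

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """A 3D bounding box parameterized as described above."""
    h: float    # height (m)
    w: float    # width  (m)
    l: float    # length (m)
    x: float    # location along the camera x-axis (m)
    y: float    # location along the camera y-axis (m)
    z: float    # depth along the camera z-axis (m)
    yaw: float  # heading angle theta around the up-axis (rad)

# Example: a car roughly 10 m in front of the camera.
car = Box3D(h=1.5, w=1.6, l=3.9, x=0.0, y=1.6, z=10.0, yaw=0.0)
```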
Figure 4.2: 3D bounding box
CAD models, LiDAR signals, external data, temporal sequences, and stereo images are
examples of auxiliary data applications. Stereo images provide the most important information to
image-based 3D detection, and algorithms that use this sort of data outperform others at the
inference stage.
+ Methods based on 2D features: Starting from the input RGB images, these methods first estimate the 2D
locations, orientations, and dimensions of the objects from the 2D features and then recover the 3D
locations from these estimates. As a result, these approaches are often known as 'result lifting-based methods.'
+ Methods based on 3D features: The primary benefit of these algorithms is that they first
build 3D features from the images and then estimate the 3D bounding boxes directly from them. Depending on how the 3D features are obtained, they can be divided into
two categories: feature lifting methods and data lifting methods.
For the methods based on 2D features, an intuitive and widely used technique for obtaining an object's 3D position [x,
y, z] is to use CNNs to estimate the depth value d and then lift the 2D detection into 3D space
using:
\[
\begin{cases}
z = d, \\
x = (u - C_x) \times z / f, \\
y = (v - C_y) \times z / f,
\end{cases}
\tag{4.1}
\]
with:
\((u, v)\) the 2D location of the object in the image,
\((C_x, C_y)\) the principal point of the camera,
\(f\) the focal length.
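Equation (4.1) can be implemented directly. Below is a minimal sketch that lifts a 2D point with an estimated depth into 3D camera coordinates (the intrinsics values in the example are illustrative only):

```python
def lift_to_3d(u, v, d, fx, fy, cx, cy):
    """Lift a 2D image point (u, v) with estimated depth d into 3D camera
    coordinates using the pinhole model of Eq. (4.1).

    fx, fy : focal lengths in pixels (Eq. (4.1) uses a single f; separate
             values are kept here for generality).
    cx, cy : principal point in pixels.
    """
    z = d
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Example with KITTI-like intrinsics (values are illustrative only).
x, y, z = lift_to_3d(u=640.0, v=200.0, d=15.0, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(x, y, z)
```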
Furthermore, unlike methods that need full depth maps, such as Pseudo-LiDAR [3], these
approaches only require the depths of the objects' centers. While these approaches
are broadly equivalent to 2D detectors, I divide them into two sub-categories for ease of
presentation: region-based methods and single-shot methods.
Region-based methods follow the high-level concept of the R-CNN series [4], [5], [6]. After generating category-independent region proposals
from the input pictures, CNNs [6], [7] extract features from these regions. Finally, R-CNN uses these
features to refine the proposals and assign them to categories. Besides the classic proposal generation
methods [11], [12], a simple way to generate proposals for 3D detection is to tile 3D anchors
(proposal shape templates) on the ground plane and then project them to the image plane as proposals.
Chen et al. [8], [9], [10] pioneered the Mono3D and 3DOP techniques for monocular and stereo-based
detection, respectively, which reduce the search space by rejecting proposals with low confidence scores.
The Region Proposal Network (RPN) [6] enables detectors to construct 2D proposals
using features from the last shared convolutional layer rather than external techniques, removing
most of the computational cost and allowing it to be adopted by multiple image-based 3D detectors.
Single-shot object detectors directly estimate the class probabilities and regress the 2D/3D box
attributes from each feature-map location in a single stage. As a result, these systems can
make predictions more quickly than region-based techniques, which is critical in the context of
AD.
Because single-shot approaches employ only CNN layers, they are also easier to deploy
on a variety of hardware architectures. Furthermore, several relevant publications [13], [14], [15],
[26] have shown that single-shot detectors can also yield promising results.
The fundamental advantage of the methods based on 3D features is that they first build 3D features
from the images and then directly estimate all elements of the 3D bounding boxes, including the
3D positions, in 3D space. Depending on how the 3D features are obtained, there are two categories: feature lifting-based methods and data lifting-based methods.
The feature lifting approaches work by converting the 2D image features in the image plane into 3D voxel features in world space.
Furthermore, previous feature lifting-based algorithms [17], [18], [22], [19], [20] further
collapse the vertically oriented 3D voxel features to yield BEV (bird's-eye-view) features before estimating the final results.
The main issue is figuring out how to convert the 2D image features into 3D voxel features.
Feature lifting for monocular methods: To achieve feature lifting, Roddick et al. [17]
presented OFTNet. They obtain the voxel feature by
aggregating the 2D features over the region of the front-view image feature map bounded by
the projection of each voxel, with top-left corner (u1, v1) and bottom-right corner (u2, v2):
\[
V(x, y, z) = \frac{1}{(u_2 - u_1)(v_2 - v_1)} \sum_{u = u_1}^{u_2} \sum_{v = v_1}^{v_2} F(u, v)
\tag{4.2}
\]
where V(x, y, z) denotes the feature of voxel (x, y, z) and F(u, v) denotes the feature of pixel (u, v).
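A minimal NumPy sketch of Eq. (4.2) is given below, assuming the voxel corners have already been projected to integer image coordinates (u1, v1) and (u2, v2):

```python
import numpy as np

def voxel_feature(feature_map, u1, v1, u2, v2):
    """Average-pool the 2D feature map over the image region [u1, u2] x [v1, v2]
    that the voxel projects to, as in Eq. (4.2).

    feature_map : array of shape (H, W, C) holding the pixel features F(u, v).
    Returns the C-dimensional voxel feature V(x, y, z).
    """
    h, w, _ = feature_map.shape
    # Clip the projected box to the image bounds.
    u1, u2 = max(0, u1), min(w - 1, u2)
    v1, v2 = max(0, v1), min(h - 1, v2)
    region = feature_map[v1:v2 + 1, u1:u2 + 1, :]   # rows index v, columns index u
    return region.mean(axis=(0, 1))

# Example: a random 2D feature map and one projected voxel region.
F = np.random.rand(96, 320, 64)
V = voxel_feature(F, u1=100, v1=40, u2=120, v2=60)
print(V.shape)  # (64,)
```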
Reading et al. [22] accomplish feature lifting by back-projection [21]. To begin, they
discretize the continuous depth space into bins and treat depth estimation as a classification task.
Instead of a single number, the depth estimate is a distribution D over these bins. Then, to
construct the 3D frustum feature G(u, v), the pixel feature F(u, v) is weighted by its depth distribution:
\[
G(u, v) = D(u, v) \otimes F(u, v)
\tag{4.3}
\]
where the outer product is denoted by ⊗. Because this frustum feature lives in the image-depth
coordinate system (u, v, d), it must be aligned to the 3D world coordinate system (x, y, z) using
the camera parameters before creating the voxel feature. These two techniques are illustrated in Figure 4.4.
Figure 4.4. A diagram of the feature lifting approaches. Right: to lift 2D features into 3D space, the image features are weighted by their estimated depth distributions.
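A small NumPy sketch of this depth-distribution weighting (tensor shapes are illustrative; D is assumed to already be a normalized distribution over depth bins):

```python
import numpy as np

# Pixel features F: (H, W, C) and per-pixel depth distributions D: (H, W, B),
# where B is the number of discretized depth bins.
H, W, C, B = 48, 160, 64, 80
F = np.random.rand(H, W, C)
D = np.random.rand(H, W, B)
D = D / D.sum(axis=-1, keepdims=True)   # normalize so each pixel's bins sum to 1

# Frustum feature G(u, v) = D(u, v) (outer product) F(u, v): shape (H, W, B, C).
G = np.einsum('hwb,hwc->hwbc', D, F)
print(G.shape)  # (48, 160, 80, 64)
```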
Feature lifting for stereo-based methods: Compared with monocular algorithms, creating 3D features from stereo pairs is easier than developing them from
monocular images.
Chen et al. [18] presented the Deep Stereo Geometry Network (DSGN), which lifts features
using stereo pictures as input. They begin by extracting features from the stereo pairs and then
construct a 4D plane-sweep volume with the conventional plane-sweeping technique [23], [24],
[25], by joining the left image feature with the reprojected right image feature at equally spaced
depth values. The 4D volume is then transformed into 3D world space before being used to
generate the BEV map, which is used to predict the final results.
The 2D images are transformed into 3D data by the data lifting algorithms, and the obtained 3D data are then processed by LiDAR-based detectors.
The pseudo-LiDAR pipeline was presented to build a bridge between image-based
approaches and LiDAR-based methods, using well-studied depth estimation, disparity estimation,
and LiDAR-based 3D object detection. In this pipeline, dense depth maps [27, 28] must first be estimated
from the pictures (or disparity maps [26, 24], which are then turned into depth maps
[34]). The 3D position (x, y, z) of each pixel (u, v) can then be calculated using:
\[
\begin{cases}
z = d, \\
x = (u - C_x) \times z / f, \\
y = (v - C_y) \times z / f,
\end{cases}
\tag{4.4}
\]
By back-projecting all pixels into 3D coordinates, the pseudo-LiDAR signal \(\{(x_n, y_n, z_n)\}_{n=1}^{N}\) is created, where N is the number of pixels. The pseudo-LiDAR
signals may then be used as input for LiDAR-based detection algorithms [31], [32], [33].
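Below is a minimal sketch of this data lifting step, assuming a dense depth map and pinhole intrinsics are available (in the actual pipeline [3], [34] the resulting points are then fed to a LiDAR-based detector):

```python
import numpy as np

def depth_map_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project every pixel of a dense depth map into 3D camera coordinates,
    producing an (N, 3) pseudo-LiDAR point cloud as in Eq. (4.4)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example with a random depth map and illustrative intrinsics.
depth = np.random.uniform(1.0, 80.0, size=(384, 1280))
cloud = depth_map_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(cloud.shape)  # (491520, 3)
```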
4.3. Dataset
The most widely utilized datasets are KITTI 3D [35], nuScenes [36], and Waymo Open [37], which are described below.
For the last decade, KITTI 3D was the sole dataset available to aid in the development of
image-based 3D detectors. KITTI 3D provides front-view images at a resolution of 1280x384 pixels.
The nuScenes and Waymo Open datasets were introduced in 2019. Six cameras
are utilized in the nuScenes dataset to provide a 360-degree view with a resolution of 1600x900
pixels. Similarly, Waymo Open captures 360 degrees of vision with five synchronized cameras.
The KITTI 3D dataset contains 7,481 images for training and 7,518 images for testing,
and it is standard practice to divide the training data into a training set and a validation set [15],
[1], [16]. The 3DOP split [1], the most often used, has 3,712 and 3,769 images for training and
validation, respectively. The large-scale nuScenes and Waymo Open provide around 40K and 200K
annotated frames, respectively, and employ multiple cameras to record each frame's panoramic
view. NuScenes, in particular, offers 28,140 frames, 6,020 frames, and 6,009 frames for
training, validation, and testing (six images per frame). The largest, Waymo Open, provides
122,300 frames for training, 31,407 frames for validation, and 41,077 frames for testing (five images per frame).
CHAPTER V
METHODOLOGY
The proposed method of 3D Object Detection from Images for AD is given in this
chapter.
5.1 Methods
The method used in this project is based on 2D features: it estimates the 2D
locations, orientations, and dimensions (see Figure 5.1 for a visual representation of these quantities)
from the 2D features and then recovers the 3D locations from these estimates.
The framework adopted is CenterNet, which was proposed for image-based 3D detection by Zhou et al. [38] in 2019. This framework, in particular, encodes the object
as a single point (the center point of the object) and finds it via keypoint estimation.
Furthermore, numerous parallel heads are utilized to estimate the object's other attributes, such as its size, local offset, depth, and orientation.
5.2. 2D Feature
Estimating the object's center of gravity (Based on heat maps – heatmap head).
CenterNet borrows principles from the keypoint estimation problem and treats each object's center as a keypoint to be predicted on a heatmap.
Each pixel in this heatmap gives the probability that the corresponding cell contains an object's center.
Following that, CenterNet filters the local maximum points on the heatmap to locate the centers of the
objects in the picture. To create the heatmap, the model optimizes a fully convolutional network architecture.
This design includes a backbone for feature extraction and a head (in this case, a heatmap head).
With an input image \(I \in \mathbb{R}^{W \times H \times 3}\), the output is a heatmap \(\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}\), where R is the output stride and C is the number of object classes. The ground-truth heatmap is constructed as follows:
First, create a heatmap \(Y \in [0]^{\frac{W}{R} \times \frac{H}{R} \times C}\) filled with zeros.
From the bounding box of the object, determine the keypoint \(p_c\) as the center of the bounding box.
Determine the corresponding point of \(p_c\) on the heatmap: \(\tilde{p}_c = \lfloor p_c / R \rfloor\) (integer part).
Replace \(Y_{xyc} = \exp\left(-\dfrac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)\), where \(\sigma_p\) is an object size-adaptive standard deviation.
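For illustration, a small NumPy sketch of this ground-truth heatmap construction is given below (σ is fixed to a constant here for simplicity; CenterNet actually derives the object size-adaptive σ_p from the bounding box):

```python
import numpy as np

def draw_center_heatmap(heatmap, center_xy, sigma):
    """Splat a Gaussian around the (quantized) object center onto one channel of the
    ground-truth heatmap, keeping the element-wise maximum where objects overlap."""
    H, W = heatmap.shape
    cx, cy = int(center_xy[0]), int(center_xy[1])     # integer part (tilde p_c)
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap[:] = np.maximum(heatmap, gaussian)
    return heatmap

# Example: one 128x128 heatmap channel with an object center at (40.7, 60.2)
# in output (stride-R) coordinates.
Y = np.zeros((128, 128), dtype=np.float32)
Y = draw_center_heatmap(Y, center_xy=(40.7, 60.2), sigma=3.0)
print(Y.max(), Y[60, 40])  # both 1.0
```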
The only thing left to do is to define the loss functions so that the model can learn these predictions.
By taking only the integer part in \(\tilde{p}_c = \lfloor p_c / R \rfloor\), an offset is created between the actual center point and its quantized location on the heatmap.
To learn this offset, CenterNet uses one more head, the offset head, whose output is a tensor \(\hat{O} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}\) representing the horizontal and vertical offsets.
In parallel with estimating the position of the center, the size of the object is also estimated.
Similar to offset estimation, CenterNet uses another head, the dimension (wh) head, to estimate the
width and height of the object. The output of the wh head is a tensor \(\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}\).
As mentioned previously, the model uses three head networks (three CNNs) to learn three
separate tasks: the heatmap head, the offset head, and the dimension head. The architecture is illustrated in Figure 5.4 below.
Figure 5.4. CenterNet Architecture
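To make this multi-head design concrete, the following is a minimal PyTorch-style sketch of the three output heads attached to a shared backbone feature map (the backbone itself is omitted and the layer widths are illustrative assumptions, not the exact CenterNet implementation):

```python
import torch
import torch.nn as nn

class CenterHeads(nn.Module):
    """Three parallel heads on top of a backbone feature map:
    heatmap (C channels), offset (2 channels), and width/height (2 channels)."""
    def __init__(self, in_channels=64, num_classes=3):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1),
            )
        self.heatmap_head = head(num_classes)
        self.offset_head = head(2)
        self.wh_head = head(2)

    def forward(self, feat):
        # feat: backbone output of shape (B, in_channels, H/R, W/R)
        heatmap = torch.sigmoid(self.heatmap_head(feat))  # values in [0, 1]
        offset = self.offset_head(feat)
        wh = self.wh_head(feat)
        return heatmap, offset, wh

# Example: a dummy 128x128 feature map for a 512x512 input with stride R = 4.
heads = CenterHeads(in_channels=64, num_classes=3)
hm, off, wh = heads(torch.randn(1, 64, 128, 128))
print(hm.shape, off.shape, wh.shape)
```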
In Figure 5.4, the loss terms are defined as follows:
Heatmap loss (\(L_k\)) for the heatmap head: a focal loss is used (due to the imbalance between points that are keypoints and points that are background). Here, α and β are the hyperparameters of the focal loss (CenterNet uses α = 2 and β = 4) and N is the number of keypoints in the image:
\[
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
\left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\
\left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise}
\end{cases}
\tag{5.2}
\]
Offset loss (\(L_{off}\)) for the offset head: an L1 loss between the predicted offset and the ground-truth sub-pixel offset,
\[
L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|
\tag{5.3}
\]
Size loss (\(L_{size}\)) for the dimension head: an L1 loss between the predicted and ground-truth sizes \(s_k\) of the N objects,
\[
L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|
\tag{5.4}
\]
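The heatmap and L1 losses above can be sketched in PyTorch as follows (a simplified sketch; the λ weights used to combine the terms and the exact index gathering of the official implementation are omitted):

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss of Eq. (5.2) over a predicted heatmap in [0, 1]."""
    pos = gt.eq(1).float()                       # keypoint locations (Y = 1)
    neg = gt.lt(1).float()                       # all other locations
    pred = pred.clamp(eps, 1 - eps)              # avoid log(0)
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1)             # N = number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos

def l1_loss_at_centers(pred_at_centers, target):
    """L1 losses of Eqs. (5.3)/(5.4): predictions and targets gathered at the N
    object centers, both of shape (N, K) with K = 2 (offset or width/height)."""
    return torch.abs(pred_at_centers - target).sum(dim=1).mean()

# Example with dummy tensors on a 128x128 output map with 3 classes.
pred_hm = torch.rand(1, 3, 128, 128)
gt_hm = torch.zeros(1, 3, 128, 128)
gt_hm[0, 0, 60, 40] = 1.0
print(heatmap_focal_loss(pred_hm, gt_hm).item())
print(l1_loss_at_centers(torch.rand(5, 2), torch.rand(5, 2)).item())
```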
5.3. 3D Detection
3D detection requires three extra qualities per center point: depth, 3D dimension, and
orientation.
The depth d is a single scalar per center point. However, regressing depth directly is difficult.
Instead, the output transformation of Eigen et al. [39] is employed, \(d = \frac{1}{\sigma(\hat{d})} - 1\), where σ is the sigmoid
function. The depth is computed as an additional output channel \(\hat{D} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R}}\) of the keypoint
estimator. This head again employs two convolutional layers separated by a ReLU. Unlike the other
outputs, it applies the inverse sigmoidal transformation at the output layer. The depth estimator is
trained with an L1 loss in the original depth domain, after the sigmoidal transformation.
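The inverse-sigmoid output transform amounts to a single line; a small sketch:

```python
import torch

def decode_depth(raw_depth):
    """Convert the network's raw depth output d_hat into metric depth using
    d = 1 / sigmoid(d_hat) - 1, following the transform of Eigen et al. [39]."""
    return 1.0 / torch.sigmoid(raw_depth) - 1.0

print(decode_depth(torch.tensor([-3.0, 0.0, 3.0])))  # approx. [20.1, 1.0, 0.05]
```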
The 3D dimensions of an object are three scalars. They are regressed directly to their absolute
values in meters using a separate head \(\hat{F} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 3}\) and an L1 loss.
Following the lead of Mousavian et al. [7], the orientation is expressed using two bins with in-bin
regression. The orientation is represented by 8 scalars, with 4 scalars for each bin. Two scalars
are used for softmax classification in each bin, and the remaining two scalars regress to an angle within the bin.
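As an illustration, one way such an 8-scalar encoding could be decoded is sketched below (the scalar layout and the bin centers of ±π/2 are assumptions in the spirit of the Multi-Bin formulation, not necessarily the exact CenterNet layout):

```python
import math

def decode_orientation(ori8):
    """Decode an 8-scalar orientation output into a yaw angle.

    ori8 = [b1_in, b1_out, b1_sin, b1_cos, b2_in, b2_out, b2_sin, b2_cos]
    Bin centers of -pi/2 and +pi/2 are assumed; the layout of the 8 scalars
    is an illustrative assumption.
    """
    bin_centers = (-math.pi / 2, math.pi / 2)
    use_bin2 = ori8[4] > ori8[0]          # pick the bin with the higher "inside" score
    idx = 4 if use_bin2 else 0
    sin_off, cos_off = ori8[idx + 2], ori8[idx + 3]
    return bin_centers[1 if use_bin2 else 0] + math.atan2(sin_off, cos_off)

print(decode_orientation([2.0, 0.1, 0.0, 1.0, 0.5, 0.3, 0.0, 1.0]))  # approx. -pi/2
```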
5.4. Dataset
KITTI is a collection of vision tasks built using an autonomous driving platform. The
whole benchmark includes several tasks such as stereo, optical flow, visual odometry, and so on.
The object detection dataset, comprising monocular images and bounding boxes, is included in
this benchmark. The dataset includes 7,481 training images that have been labeled with 3D bounding
boxes. The readme file of the object development kit on the KITTI webpage contains a detailed description of the data format.
Figure 5.6. Dataset feature
5.5. Training
I train with a 512x512 input resolution, which results in an output resolution of 128x128 for
all models. As data augmentation, I utilize random flip, random scaling (between 0.6 and 1.3),
cropping, and color jittering, and use Adam [40] to optimize the overall objective. Augmentations that alter the 3D geometry are not applied when training the 3D estimation branch.
I train residual networks and DLA-34 using a batch size of 128 (on 8 GPUs) and a
learning rate of 5e-4 for 140 epochs, with the learning rate reduced by a factor of 10 at epochs 90 and 120
(following [55]). For Hourglass-104, I use a batch size of 29 (on 5 GPUs, with a
master GPU batch size of 4) and a learning rate of 2.5e-4 for 50 epochs, with a 10x learning rate
reduction at the 40th epoch. To save computation, the Hourglass-104 is fine-tuned from
ExtremeNet [61] for detection. The downsampling layers of ResNet-101 and DLA-34 are initialized
with ImageNet pretraining, whereas the up-sampling layers are randomly initialized. On 8 TITAN-V
GPUs, ResNet-101 and DLA-34 train in 2.5 days, whereas Hourglass-104 takes 5 days.
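For reference, the hyperparameters above for the residual-network/DLA-34 setting can be collected into a single configuration mapping (values copied from the description above; they may differ from the released code):

```python
# Training hyperparameters as described above (residual network / DLA-34 setting).
train_config = {
    "input_resolution": (512, 512),
    "output_resolution": (128, 128),
    "augmentation": ["random_flip", "random_scaling_0.6_to_1.3", "cropping", "color_jitter"],
    "optimizer": "Adam",
    "batch_size": 128,            # on 8 GPUs
    "learning_rate": 5e-4,
    "epochs": 140,
    "lr_drop_epochs": [90, 120],  # learning rate divided by 10 at these epochs
}
print(train_config)
```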
CHAPTER VI
EXPECTED RESULTS
The method is evaluated on the KITTI dataset [35], which comprises properly annotated 3D bounding boxes for
7,481 training images, following the conventional training and validation splits described
in the literature [10, 54]. The assessment metric is the average precision for cars at 11 recall points (0.0 to 1.0 with 0.1
increments) at an IOU threshold of 0.5, as in 2D object detection [14].
IOUs are assessed using the 2D bounding box (AP), orientation (AOS), and bird's-eye-view bounding
box (BEV AP). For both training and testing, the original picture resolution is preserved and padded
to 1280x384. Training completes in 70 epochs, with the learning rate decreasing at epochs 45 and 60.
Because the number of recall thresholds is small, the validation AP varies by up to 10% AP.
As a result, five models are trained and the average with standard deviation is presented. The
comparison is made against existing image-based baselines, including the monocular-based Mono3D [9]. As demonstrated in Table 6.1, the technique performs similarly to its
competitors in AP and AOS and somewhat better in BEV AP, while CenterNet is also considerably faster at inference.
Table 6.1 shows the 2D bounding box AP, average orientation score (AOS), and bird's-eye-view AP (BEV AP) results.
Figure 6.4: 3D detection with bounding boxes
CHAPTER VII
CONCLUSION AND FUTURE WORK
This report has introduced the 3D Object Detection from Images for Autonomous Driving task and clarified its difficulties. Some advanced
3D Object Detection from Images for Autonomous Driving methods and their performances
have also been discussed in this report. Then, a suitable dataset (KITTI) for the 3D Object Detection from Images for Autonomous Driving task has been selected.
This senior project has successfully categorized 3D Object Detection from Images for
Autonomous Driving methods. Finally, a 3D Object Detection from Images for Autonomous
Driving architecture based on a successfully working system (CenterNet) has also been proposed in terms of
theoretical architecture.
For future work, the performance of CenterNet will be consecutively tested with various modifications and datasets to further improve the results.
REFERENCES
[1] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection in monocular video,” in
ECCV, 2020.
[2] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the
world’s imagery,” in CVPR, 2016.
[3] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection
and semantic segmentation,” in CVPR, 2014.
[5] R. Girshick, “Fast r-cnn,” in ICCV, 2015.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards realtime object detection with region
proposal networks,” in NeurIPS, 2015.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
[8] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun,“Monocular 3d object detection for
autonomous driving,” in CVPR, 2016.
[9] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals for
accurate object class detection,” in NeurIPS, 2015.
[10] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals using stereo imagery
for accurate object class detection,” T-PAMI, 2017.
[11] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object
recognition,” IJCV, 2013.
[12] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
[13] J. Feng, E. Ding, and S. Wen, “Monocular 3d object detection via feature domain adaptation,” in ECCV,
2020.
[14] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[15] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in CVPR,
2019.
[16] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval, 1999.
[17] T. Roddick, A. Kendall, and R. Cipolla, “Orthographic feature transform for monocular 3d object
detection,” 2019.
[18] Y. Chen, S. Liu, X. Shen, and J. Jia, “Dsgn: Deep stereo geometry network for 3d object detection,” in CVPR, 2020.
[19] X. Guo, S. Shi, X. Wang, and H. Li, “Liga-stereo: Learning lidar geometry aware representations for
stereo-based 3d detector,” in ICCV, 2021.
[20] J. Huang, G. Huang, Z. Zhu, and D. Du, “Bevdet: Highperformance multi-camera 3d object detection in
bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
[21] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly
unprojecting to 3d,” in ECCV, 2020.
[22] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for
monocular 3d object detection,” in CVPR, 2021.
[23] R. T. Collins, “A space-sweep approach to true multi-image matching,” in CVPR, 1996.
[24] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the
world’s imagery,” in CVPR, 2016.
[25] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,”
in ECCV, 2018.
[26] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train
convolutional networks for disparity, optical flow, and scene flow estimation,” in CVPR, 2016.
[27] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right
consistency,” in CVPR, 2017.
[28] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for
monocular depth estimation,” in CVPR, 2018.
[29] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2d detection to 6d pose and metric
shape,” in CVPR, 2019.
[30] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
[31] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3d proposal generation and object
detection from view aggregation,” in IROS, 2018.
[32] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in CVPR,
2018.
[33] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in
CVPR, 2019.
[34] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019.
[35] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark
suite,” in CVPR, 2012.
[36] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O.
Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020.
[37] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine
et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in ICCV, 2020.
[38] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[39] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in NIPS, 2014.
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.