
SSA3D: Semantic Segmentation Assisted One-Stage Three-Dimensional Vehicle Object Detection

Shangfeng Huang, Guorong Cai, Zongyue Wang, Qiming Xia, and Ruisheng Wang, Senior Member, IEEE

Abstract— One-stage 3D object detection using mobile light detection and ranging (LiDAR) has developed rapidly in recent years. Specifically, one-stage methods have attracted attention because of their high efficiency and light weight compared with two-stage methods. Inspired by this, we present semantic segmentation assisted one-stage three-dimensional vehicle object detection (SSA3D), a network for the rapid detection of objects that keeps the advantages of the semantic segmentation module used in two-stage methods without adding redundant computational load. First, we modified farthest point sampling to improve the quality of the sampling points. This helps to reduce the sampling of outlier points and bad points whose surrounding spatial structure information is difficult to perceive. Second, a neighbor attention group module selectively adds extra weight to neighbor points, because neighbor points differ in importance for the corresponding sampling point. Correctly increasing these weights helps to obtain richer spatial structure information. Finally, a delicate box generation module is included as a voted center point layer based on the generalized Hough voting method and an anchor-free regression. We used the feature aggregation module as the backbone and the feature propagation module as the auxiliary network to achieve efficiency. At the same time, the auxiliary network retains the ability of state-of-the-art semantic segmentation networks to extract point-wise features. In experiments, we evaluated and tested SSA3D on the common KITTI dataset and achieved improved accuracy on the car class.

Index Terms— Autonomous driving, 3D object detection, LiDAR point clouds, computer vision.

Manuscript received February 24, 2021; revised October 14, 2021; accepted November 17, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 41971424 and Grant 61701191; in part by the Key Technical Project of Xiamen Ocean Bureau under Grant 18CZB033HJ11; in part by the Key Technical Project of Xiamen Science and Technology Bureau under Grant 3502Z20191018, Grant 3502Z20201007, Grant 3502Z20191022, and Grant 3502Z20203057; in part by the Science and Technology Project of Education Department of Fujian Province under Grant JAT190321, Grant JAT190318, and Grant JAT190315; and in part by the Natural Science Foundation of China under Grant 42071443.6. The Associate Editor for this article was Z. Duric. (Corresponding authors: Guorong Cai; Ruisheng Wang.)

Shangfeng Huang, Guorong Cai, Zongyue Wang, and Qiming Xia are with the Department of the Computer Engineering College, Jimei University, Xiamen 361021, China (e-mail: shangfenghuang@jmu.edu.cn; guorongcai.jmu@gmail.com; wangzongyue@jmu.edu.cn; qimingxia96@163.com).

Ruisheng Wang is with the Department of Geomatics Engineering, Schulich School of Engineering, University of Calgary, Calgary, AB T2N 1N4, Canada (e-mail: ruiswang@ucalgary.ca).

Digital Object Identifier 10.1109/TITS.2021.3133476

I. INTRODUCTION

INTELLIGENT transportation systems (ITS) have emerged in many countries around the world, which may be regarded as a successful exploration of future transportation systems. A complete and reliable ITS includes the analysis of road objects, especially vehicle detection and tracking. In particular, LiDAR has become the most popular device for object detection in autonomous vehicle systems due to its ability to acquire 3D data in all weather conditions. In other words, progress in object detection algorithms based on LiDAR point clouds is conducive to promoting the development of ITS.

For example, Muhammad et al. [1] introduced automatic driving algorithms with a focus on road, lane, vehicle, pedestrian, drowsiness detection, collision avoidance, and traffic sign detection through sensing and vision-based deep learning methods. In our work, we focus on 3D object detection in front of vehicles. 3D object detection can also be used in robot trajectory planning and virtual reality.

According to a survey on 3D detection [2], 3D object detection methods are mainly divided into image-based object detection, point cloud-based object detection, and image-point cloud combination-based object detection. When using LiDAR data to detect objects, the methods are more concerned with the location, size, and orientation of the objects.

Image-based 3D object detection has its unique advantages. Image pixels are regular and ordered compared with the point cloud, and they carry rich RGB (red, green, blue) information. The network can directly utilize a Convolutional Neural Network (CNN) to extract this information. Specifically, earlier work [3], [4] used deep CNNs to directly estimate 3D boxes and object poses from a single image for the first time. GS3D [5] makes full use of visual features to capture the structure information of objects and subsequently estimates 3D boxes. Previous studies [6], [7] present deep neural networks with coarse-to-fine layers to generate 3D boxes. Chen et al. [8], [9] encode the size of boxes with an energy minimization function from images. However, for autonomous driving we not only need to know whether there is a vehicle in front of the car but also its precise location. In image-based 3D object detection, it is challenging to obtain the precise location and size of the object. However, these details can be obtained directly from the point cloud. In other words, utilizing the spatial structure information of the point cloud easily yields a series of 3D data, typically more accurate location information. Intuitively speaking, the locations of vehicles can be used to prevent accidents and regulate traffic in the ITS.

Unlike the image-based methods, point cloud-based 3D object detection can be classified into two categories. The first category [10]–[14] converts the whole point cloud into voxels.

However, these methods not only lose abundant spatial structure information but also increase the computational complexity of the 3D CNN that learns features from the voxels. VoxelNet [11] utilizes PointNet [15] after voxelization to extract features from voxels, while SECOND [16] replaces this step with sparse convolution layers [17]. Further, PointPillars [14] utilizes pillars instead of voxels to reduce the amount of computation caused by voxelization. The recent work [18] proposed by Hao Li uses voxels or pillars to extract features and then maps the unstructured point cloud to a single-channel 2D heatmap to detect objects. These methods have all achieved state-of-the-art performance, and we ourselves voxelized the point cloud to estimate 3D bounding boxes in previous work. That work showed us that voxelizing the point cloud makes it more difficult to capture detailed spatial structure information.

The other category directly utilizes the raw point cloud as the input without any changes. Point cloud semantic segmentation methods have become more and more efficient. For example, PointNet [15] and PointNet++ [19] can directly take the raw point cloud as input, and the method [20] proposed by Huan Luo is able to utilize colorized point clouds to segment semantic labels of road scenes. Since many semantic segmentation methods show the power of learning structural features of point clouds, more and more object detection methods process the raw point cloud directly and learn point cloud structural features with their help. Specifically, PointRCNN [21], a two-stage network, utilizes PointNet++ [19] as the semantic segmentation backbone network. This method succeeds in distinguishing foreground points from background points with PointNet++ [19]. Next, 3D bounding boxes are estimated based on the foreground points. STD [22], like PointRCNN [21], utilizes PointNet++ [19] to learn features and converts the internal point features of proposals from a sparse representation to a dense representation with the proposed PointsPool module. Some weakly supervised two-stage methods [23], [24] generate foreground point segmentation to produce cylinder-shaped 3D object proposals and then refine the cylinder proposals in the second stage. After summarizing these two-stage methods, we found that semantic segmentation plays a vital role and directly affects the final performance. However, the long inference time makes it hard to apply this approach in a real-world autonomous driving system, so we had to abandon the two-stage methods. The question remains of how to keep the attractive effect of the semantic segmentation network. Therefore, our work finds a novel way to solve this problem.

The one-stage point cloud-based methods are different from the two-stage methods and are known for their efficiency. Specifically, PointGNN [25] proposes a network that utilizes a GNN [26] to extract point-wise features. This method makes full use of the GNN to perceive spatial structure and achieves outstanding performance on the KITTI dataset [27]. A GNN can perceive the spatial structure information of the point cloud very well, but it requires that the point cloud not be down-sampled at will. Therefore, the network has to run the GNN on the whole point cloud, and compared with the other methods it requires a substantial amount of computational effort. Some efficient one-stage methods like 3DVID [28] can be used for online 3D video object detection. 3DVID [28] proposes a novel Pillar Message Passing Network to encode each frame and an attentive spatiotemporal transformer GRU to aggregate the spatiotemporal information. SA-SSD [29] utilizes structure awareness and an auxiliary network to obtain precise localization. Its auxiliary network inspired our work. 3DSSD [30] proposes a novel fusion sampling method in the down-sampling stage that allows it to choose more interesting and important points. This method then estimates proposals based on these points.

Cui et al. [31] introduce recent deep-learning-based data fusion approaches that leverage both images and point clouds, as well as some existing challenges between current academic research and real-world applications. Combining the RGB information of the image and the spatial structure information of the point cloud can make up for the lack of color and distant-object information in light detection and ranging (LiDAR). Specifically, AVOD [32] and MV3D [33] present multiple-view fusion 3D detection networks. AVOD [32] uses the LiDAR bird's eye view, the LiDAR front view, and the image as the input, while MV3D [33] uses the LiDAR bird's eye view, a 3D anchor grid, and the image. A final region-based fusion network then estimates the 3D bounding boxes based on the three views. MVF-PointNet [34] and Frustum ConvNet [35] generate a series of 3D frustum regions in the point cloud based on two-dimensional (2D) region proposals from the image to predict the 3D bounding boxes. IPOD [36] utilizes a 2D segmentation network to predict foreground pixels and projects them onto the point cloud to remove most of the background points. It subsequently generates proposals on the predicted foreground points. A method called PointIoU was designed to reduce the redundancy and ambiguity of the proposal boxes. PointPainting [37] projects the semantic segmentation results from the image onto the corresponding point cloud and subsequently estimates the 3D bounding boxes according to the semantic information and spatial structure information. PV-EncoNet [38] eliminates a large number of invalid points and then adds texture information (fused from the camera image) through point cloud coloring to enhance features. It can efficiently encode both the spatial and texture features of each colored point to detect objects. However, image and point cloud combination-based object detection methods have to consider how to match the point cloud to the image and how to fuse the two different data formats. This is additional and difficult work that greatly affects the final experimental results.

Motivation: Our evaluation revealed that a network can directly utilize the point cloud as input to easily estimate 3D bounding boxes with scale information. We found that 3DSSD [30] and other two-stage methods based on PointNet++ [19] all utilize the farthest point sampling (FPS) module in feature extraction. However, these methods never consider the quality of the sampling points. If outlier points are sampled, the next FPS round has a high probability of including these points again. This phenomenon directly affects the final performance. In this paper, we propose a selection module after the sampling module introduced in 3DSSD [30].


Fig. 1. Illustration of SSA3D. SSA3D is mainly composed of a backbone, a box generation layer, and an auxiliary module. (a) Backbone network. The
backbone network is composed of three nuanced attention sampling modules and is dedicated to extracting point features. (b) Box Generation Layer. The box
generation layer generates a prediction head based on the central point obtained through the voting module. (c) Auxiliary network. The auxiliary network is
dedicated to helping the backbone extract features better.

The selection module considers the neighbouring information of each sampled point; its purpose is to abandon bad points and retain better-quality ones.

The inconsistent importance of neighboring points for the corresponding sampling point was apparent from the typical feature aggregation backbone [19] used in many methods. Therefore, a module, the Attention Group, is proposed to add extra weight to different neighbor points in order to extract features better. In addition, we utilized an auxiliary network as a refinement to approximately compress the two-stage pipeline into a one-stage method, which allowed us to obtain the effect of the semantic segmentation network (Figure 1). Our method therefore has the advantage of making use of both the one-stage and two-stage approaches.

A. Our Contributions

In summary, our contributions are the following.
1) We proposed a highly efficient and precise one-stage 3D object detection network with the help of a semantic segmentation network.
2) We proposed a selection module after sampling points to capture more useful points and reduce the impact of outliers and bad points.
3) We proposed the use of a neighbor attention module after attaining high-quality points. This helps to assign different weights to different neighbor points when aggregating neighbor information.

II. THE METHOD

The ITS covers many research fields, including traffic prediction, traffic flow detection, automatic driving, and so on. In our work, we focused on the task of object detection, which is a key component of automatic driving systems. The proposed object detection network, SSA3D, is dedicated to aggregating more precise location, size, and orientation of objects in front of the vehicle. This information can be used in the ITS to promote closer cooperation among vehicles and to improve the efficiency of transportation. SSA3D mainly consists of three modules (Figure 1). Section II-A explains how our backbone extracts the sampling point features and subsequently introduces the box generation layer. The details of the auxiliary module are introduced in Section II-B, and Section II-C lists the various loss functions.

A. Backbone and Box Generation Layer

SSA3D employs the commonly used feature extraction backbone network that takes the raw points P0 ∈ R^{N×4} as the input, as described in earlier studies [21], [30] (Figure 1). Specifically, the backbone consists of a D-FPS attention sampling module, a Fusion attention sampling module, and a Separate attention sampling module, which are based on the set abstraction layer in PointNet++ [19]. We describe the three attention sampling modules in more detail below.

The Attention Sampling Module: The attention sampling module consists of the points sampling module, the selection module, and the attention group module. The points sampling module is a free combination of farthest point sampling based on Euclidean distance (D-FPS) and farthest point sampling based on feature squared distance (F-FPS). As shown in Figure 2, the D-FPS attention sampling module only uses D-FPS on the raw point cloud, and its output is the feature F1^{D-FPS} ∈ R^{N1×C1}, where N1 is the number of sampled points and C1 the number of feature channels. The Fusion attention sampling module uses both D-FPS and F-FPS on the output of the D-FPS attention sampling module and produces the features F2^{D-FPS} ∈ R^{N2×C2} and F2^{F-FPS} ∈ R^{N2×C2}, respectively. It is inevitable that redundant points exist between the obtained F2^{D-FPS} and F2^{F-FPS}, but such redundancy is allowed, because D-FPS is designed to retain as much spatial structure information of the point cloud as possible, while F-FPS aims to obtain as many foreground points as possible so that these points can be selected as the center points in the next module.
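To make the sampling step concrete, the following is a minimal NumPy sketch of the two strategies described above: D-FPS greedily picks the point farthest (in Euclidean space) from the already selected set, and F-FPS applies the same greedy rule to distances in feature space. The function name, array shapes, and random inputs are illustrative assumptions, not the released implementation.

import numpy as np

def farthest_point_sampling(data, n_samples):
    """Greedy farthest point sampling over an (N, D) array.

    With D = 3 this is D-FPS (Euclidean distance on coordinates);
    passing point-wise features instead yields F-FPS.
    """
    n_points = data.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # Distance from every point to the currently selected set.
    min_dist = np.full(n_points, np.inf)
    # Start from an arbitrary point (index 0 here).
    selected[0] = 0
    for i in range(1, n_samples):
        diff = data - data[selected[i - 1]]
        dist = np.sum(diff * diff, axis=1)       # squared distance to the last pick
        min_dist = np.minimum(min_dist, dist)    # distance to the nearest selected point
        selected[i] = np.argmax(min_dist)        # farthest from the selected set
    return selected

# Usage: xyz is (N, 3) coordinates, feat is (N, C) point features.
xyz = np.random.rand(1024, 3).astype(np.float32)
feat = np.random.rand(1024, 64).astype(np.float32)
d_fps_idx = farthest_point_sampling(xyz, 256)    # D-FPS indices
f_fps_idx = farthest_point_sampling(feat, 256)   # F-FPS indices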


Fig. 2. Illustration of the attention sampling module. These modules consist of points sampling by a free combination of D-FPS and F-FPS, a selection module, and an attention group. In particular, the Fusion attention sampling module is shown in detail.

Then, the Separate attention sampling module only uses D-FPS on the points with feature F2^{D-FPS} and only uses F-FPS on the points with feature F2^{F-FPS}. It outputs F3^{D-FPS} ∈ R^{N3×C3} and F3^{F-FPS} ∈ R^{N3×C3}, respectively.

Points Sampling and Selection Module: The following describes the points sampling and selection modules in detail; the Fusion attention sampling module is shown in Figure 2. Let P1^{D-FPS} = {(x_i, y_i, z_i) | i = 1, ..., N1} ∈ R^{N1×3} be the coordinates of the point cloud. We obtain the sampling point set after the F-FPS or D-FPS method. Taking D-FPS as an example, the sampling point set is denoted as P̄2^{D-FPS} = {(x_i, y_i, z_i) | i = 1, ..., N11} ∈ R^{N11×3}. However, these sampling methods only consider the properties of the candidate sampling point itself and do not consider the properties of its surrounding neighboring points. After the points sampling module, the sampling points must extract the surrounding spatial structure through the Attention Group, MLP, and Max-pool. However, when the local spatial structure information surrounding the sampling point is bad or the sampling point has no neighbour points, the extraction process will theoretically not perform well, which influences the final performance. If we can improve the quality of the sampling points, the extraction process will be improved. Based on this reasoning, we added a selection module after the points sampling module that selects sampling points with rich neighboring information. The selection module considers the neighbouring information of every sampling point to select the better-quality points.

Fig. 3. Illustration without the selection module. The red points represent the sampling points, and the remaining points are black.

The reason we added the selection module after the sampling is shown in Figure 3. In particular, we artificially mark the sample points as red points, denoted as P1 and P2, while the remaining points are black. After the attention group module, one can see that there are few black points in the neighboring region of point P1. In other words, the spatial structure around point P1 is weak. The main reason is that point P1 is a noise point, or the local spatial structure around point P1 was lost during previous data preprocessing and sampling. P1's feature information will be weaker than that of P2 because there are no neighboring points to help point P1 perceive its spatial information. At the same time, it can be seen that P1 is a relatively isolated contour point. The neighboring structure around point P2 is stronger than that around point P1, and we should preserve more points like P2. We therefore add a selection module.

The selection module consists of a group module and a selecting module. P̄2^{D-FPS} ∈ R^{N11×3} and F1^{D-FPS} ∈ R^{N1×C1}, the features of the previous points, are used as the input of this module. The group is a ball neighbor search that captures the neighbouring points and features within the radius r for each sampling point. After the group, we get f̄ = {(d_i, n_i) | i = 1, ..., N11}, where n_i denotes the number of neighbouring points and

d_i = (1/n_i) Σ_{j=1}^{n_i} (x_i − x̄_j),  i = 1, ..., N11,

where x_i is the coordinate of the sampling point and x̄_j the coordinates of its neighbor points. Next, we input the feature vector f̄ into the selecting module, which mainly consists of an MLP and a softmax function, and output a score for each of the N11 points. Immediately after this step, we take the top N2 points according to the softmax score to obtain the point set P2^{D-FPS} ∈ R^{N2×3} and the feature F̄2^{D-FPS} ∈ R^{N2×C1}.
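The selection step can be illustrated with the following sketch, which follows the description above: a ball query of radius r gathers neighbors around each candidate, the mean offset d_i and the neighbor count n_i form a small descriptor, and a scoring head with a softmax keeps only the top N2 candidates. The scoring head here is a deliberately simple placeholder for the MLP, and the radius and shapes are assumed values.

import numpy as np

def selection_module(points, candidates, radius=0.8, keep=128):
    """Score candidate sampling points by their local neighborhood quality
    and keep the top-scoring ones (a sketch of the selection module).

    points:     (N, 3)  full point cloud used for the ball query
    candidates: (M, 3)  candidate sampling points from D-FPS / F-FPS
    """
    descriptors = []
    for c in candidates:
        diff = points - c                                  # offsets to every point
        mask = np.sum(diff * diff, axis=1) < radius ** 2   # ball query of radius r
        n_i = int(mask.sum())
        # Mean offset d_i of the neighbors; zero when the ball is empty.
        d_i = diff[mask].mean(axis=0) if n_i > 0 else np.zeros(3)
        descriptors.append(np.concatenate([d_i, [n_i]]))
    desc = np.stack(descriptors)                           # (M, 4) = [d_i, n_i]

    # Placeholder scoring head standing in for the MLP: the score simply grows
    # with the neighbor count, so isolated or outlier candidates rank low.
    logits = desc[:, 3]
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                                 # softmax over candidates
    order = np.argsort(-scores)
    return candidates[order[:keep]], scores[order[:keep]]

# Usage with random data standing in for a LiDAR scene.
pts = np.random.rand(4096, 3).astype(np.float32) * 10.0
cand = pts[np.random.choice(len(pts), 512, replace=False)]
kept, kept_scores = selection_module(pts, cand)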


Algorithm 1 The Algorithm of the Attention Group

Input: Points (N1, 3), Features (N1, C1), Sampling_points (N2, 3), Radius (3), Neighbor_number, N2 < N1
Output: New_feature (N2, C2)
1: procedure ATTENTION_GROUP(Points, Features, Sampling_points, Radius, Neighbor_number)
2:   New_feature = []                                        ▷ Initializing variable
3:   for r in Radius do
       # Neighbor_index (N2, Neighbor_number)
4:     Neighbor_index = Group(Points, Sampling_points, Neighbor_number, r)
       # Neighbor_points (N2, Neighbor_number, 3), Neighbor_feature (N2, Neighbor_number, C1)
5:     Neighbor_points, Neighbor_feature = get_value(Points, Features, Neighbor_index)
       # Relative point position encoding
6:     Relative_position = Concat(Sampling_points, Sampling_points - Neighbor_points, ||Sampling_points - Neighbor_points||)
       # Relative_feature (N2, Neighbor_number, C1)
7:     Relative_feature = Conv(Relative_position)
       # Neighbor_feature (N2, Neighbor_number, 2*C1)
8:     Neighbor_feature = Concat(Relative_feature, Neighbor_feature)
9:     Neighbor_feature = Neighbor_feature * softmax(Neighbor_feature)
       # Temp_new_feature (N2, 2*C1)
10:    Temp_new_feature = MaxPool(MLP(Neighbor_feature))
11:    New_feature = Concat(New_feature, Temp_new_feature)
12:  end for
13:  New_feature = Conv(New_feature)                         ▷ New_feature (N2, C2)
14:  return New_feature
15: end procedure
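For readers who prefer runnable code, the following is a minimal PyTorch sketch of a single radius of Algorithm 1, assuming the neighbor indices have already been gathered by a ball query; the layer sizes, the softmax axis, and the module names are illustrative choices rather than the released implementation.

import torch
import torch.nn as nn

class AttentionGroup(nn.Module):
    """Sketch of one radius of Algorithm 1: relative point position encoding,
    softmax attention over neighbor features, then a shared MLP and max-pool."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # 7 = sampling-point xyz (3) + relative offset (3) + Euclidean distance (1)
        self.pos_enc = nn.Linear(7, c_in)          # line 7: Conv on the encoding
        self.mlp = nn.Sequential(                  # line 10: shared MLP
            nn.Linear(2 * c_in, c_out), nn.ReLU(),
            nn.Linear(c_out, c_out),
        )

    def forward(self, centers, neighbor_xyz, neighbor_feat):
        # centers: (M, 3); neighbor_xyz: (M, K, 3); neighbor_feat: (M, K, C_in)
        offset = centers[:, None, :] - neighbor_xyz               # (M, K, 3)
        dist = offset.norm(dim=-1, keepdim=True)                  # (M, K, 1)
        rel = torch.cat([centers[:, None, :].expand_as(neighbor_xyz),
                         offset, dist], dim=-1)                   # (M, K, 7)
        rel_feat = self.pos_enc(rel)                              # (M, K, C_in)
        feat = torch.cat([rel_feat, neighbor_feat], dim=-1)       # (M, K, 2*C_in)
        feat = feat * torch.softmax(feat, dim=1)                  # attention over the K neighbors
        return self.mlp(feat).max(dim=1).values                   # (M, C_out)

# Usage with dummy tensors: 256 sampling points, 16 neighbors each, 64 channels.
layer = AttentionGroup(c_in=64, c_out=128)
out = layer(torch.rand(256, 3), torch.rand(256, 16, 3), torch.rand(256, 16, 64))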

The Attention Group: Previous methods like PointNet++ [19], when fusing the features of the surrounding points into the corresponding sample points, assume that the information of the surrounding points is equally important for the sample points. However, we found that surrounding points differ in importance to the corresponding sample points. This difference in importance depends on the distance between the sample point and the neighbouring point and on the location of the neighbouring point. We therefore used an attention mechanism like that of RandLA-Net [39] to solve this problem. The modified method shows superior performance in assigning different weights to different neighboring points.

We generated a feature vector Relative_position, which includes the location of the sampling point, the local coordinates of the neighbouring points, and the distance from the neighbouring points to the corresponding sampling point (Algorithm 1). We assumed that the importance of each neighbouring point to the corresponding sample point may be related to these factors. This module made a significant contribution to our network. The obtained Relative_position was taken as the value for the next step. We concatenated the Relative_position with the original features and multiplied by the importance weights of the neighboring points computed with the softmax function. This entire process is illustrated in Figure 2. The group module is used to obtain the neighbouring points and their features, as shown in the figure. Next, the network uses the relationship between the sample points and the neighbouring points to obtain the extra features. The extra features are concatenated with the original features, and the concatenated features are subsequently multiplied by the value of their softmax.

Box Generation Layer: We used the box generation layer of 3DSSD [30] directly and did not change it much. The sampling points obtained from F-FPS are directly used to predict the center points of the corresponding objects through the voted center point module, and other modules are used to extract features of the center points.

In the regression box head, given each predicted center point, the box head predicts the distance (dx, dy, dz) from the predicted center point to the corresponding object as well as the size (dl, dw, dh) and the orientation. A more precise predicted orientation can be used to aggregate the travel direction of vehicles, which helps to better direct and regulate transportation in the ITS. We applied the classification and regression ideas described previously [40] for orientation regression, because the predicted bounding box has no prior orientation. Specifically, we predefined a hyperparameter Na, which was set to 12 in our work; that is, we define Na orientation angle bins. A point is classified into a bin according to its orientation value, and the predicted residual with respect to the corresponding bin value is used as the input of the regression function.

We also utilize the 3D center-ness assignment strategy. The ground truth label of each predicted center point is generated in two steps. The first step determines whether the predicted center point is inside the object, represented by lmask, a binary value. The second step utilizes a formulation to calculate the score of the predicted center point location.
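As a concrete reading of the orientation head described above, the snippet below encodes a yaw angle into one of Na = 12 bins plus a residual and decodes it back. It is a hedged sketch of the bin-plus-residual idea from [40]; the bin convention (residuals measured from bin centers) is an assumption, not necessarily the paper's exact choice.

import math

NUM_BINS = 12                       # the hyperparameter Na from the text
BIN_SIZE = 2 * math.pi / NUM_BINS   # each bin covers 30 degrees

def encode_angle(yaw: float):
    """Map a yaw angle in [0, 2*pi) to (bin class, residual within the bin)."""
    yaw = yaw % (2 * math.pi)
    bin_id = int(yaw // BIN_SIZE)
    # Residual is measured from the bin center, so it stays in [-BIN_SIZE/2, BIN_SIZE/2].
    residual = yaw - (bin_id * BIN_SIZE + BIN_SIZE / 2)
    return bin_id, residual

def decode_angle(bin_id: int, residual: float) -> float:
    """Invert encode_angle: classification picks the bin, regression refines it."""
    return bin_id * BIN_SIZE + BIN_SIZE / 2 + residual

bin_id, res = encode_angle(1.9)
assert abs(decode_angle(bin_id, res) - 1.9) < 1e-9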


The output center score lctrness is computed using the distances from the predicted center point to the six surfaces of the corresponding ground truth box, as in Equation 1:

lctrness = [ (min(f, b)/max(f, b)) × (min(l, r)/max(l, r)) × (min(t, d)/max(t, d)) ]^(1/3),   (1)

where f and b are the distances to the front and back surfaces, respectively, l and r are the distances to the left and right surfaces, respectively, and t and d are the distances to the top and bottom surfaces, respectively. Finally, the product of lmask and lctrness is the classification score.

B. Exclusive Use of the Auxiliary Module for Training

We have observed many point cloud semantic segmentation networks, including PointNet++ [19], PointSIFT [41], PointCNN [42], and PointConv [43]. We believe that the feature propagation layers (FP layers) play a significant role in these networks. The FP layer is not only used to restore the sampled points to the original point cloud size; it also helps the feature extraction layers extract features better through gradient updates of the classification loss. Specifically, the output point size of FP layer 1 corresponds to the output point size of the Fusion attention sampling module (Figure 1). We only use the FP layers in training and do not use them in testing. In this way, we achieve semantic-segmentation-aided one-stage object detection.

In the FP layer, we up-sample the point cloud using three-point linear interpolation. Next, the skip link concatenation operation concatenates the up-sampled features and the corresponding attention sampling module output features. For example, the output feature of the Fusion attention sampling module and the up-sampled features of FP layer 1 are concatenated to obtain features. Finally, the concatenated features are used as the input of an MLP to obtain the final features of the up-sampled points. After three FP layers, the network outputs point-wise features, which are fed into the classifier to output the point-wise labels.

C. Loss Function

The total loss function comprises the auxiliary classification loss, the center point classification loss, the regression loss, and the center shifting loss. The total loss L is defined as Equation 2:

L = λ1 La + (1/Nc) Σ_i Lc(s_i, u_i) + λ2 (1/Np) Σ_i [u_i > 0] Lr + λ3 (1/Np*) Ls,   (2)

where La, Lc, Lr, and Ls stand for the auxiliary classification loss, center point classification loss, regression loss, and center shifting loss, respectively. Nc and Np are the numbers of predicted center points and of positive center points located in foreground instances, respectively.

The auxiliary classification loss La is the focal loss described earlier [44], used to handle the class imbalance problem, as in Equation 3:

La = −αt (1 − pt)^γ log(pt),  where pt = p for foreground points and pt = 1 − p otherwise,   (3)

where αt = 0.25 and γ = 2, keeping the default settings, during training of the auxiliary module.

The center point classification loss Lc is the cross-entropy loss. Its inputs s_i and u_i represent the predicted classification score and the classification label calculated by Equation 1, respectively.

The regression loss Lr for the supervised bounding boxes is defined as Equation 4:

Lr = Ldist + Lsize + Langle + Lcorner,   (4)

where Ldist, Lsize, Langle, and Lcorner stand for the distance loss of the center points, the size loss of the bounding box, the angle loss of the bounding box, and the corner distance loss. Specifically, Ldist and Lsize are smooth-L1 losses. Ldist supervises the offsets of the distance from the predicted center points to the corresponding ground truth center points, while Lsize supervises the offsets of the size from the predicted bounding box to the corresponding ground truth bounding box. The Langle loss includes the commonly used angle class loss and angle residual loss, as described earlier [40]. The Lcorner loss calculates the distance from the 8 corners of the predicted bounding box to those of the corresponding ground truth bounding box. Ls supervises the predicted shift in the voted center point module; like Ldist and Lsize, it is a smooth-L1 loss. The residuals between the predicted and real shifts from the sampling point to its corresponding object center serve as the input. Np* is the number of positive sampling points.

III. EXPERIMENTS

We evaluated our proposed semantic segmentation assisted one-stage three-dimensional vehicle object detection (SSA3D) method on the KITTI 3D object detection benchmark [45].

A. KITTI Dataset

The KITTI dataset consists of 7,481 training samples and 7,518 testing samples. Our experiments have been designed to follow the standard protocol: the training data are divided into a 3,712-sample training set and a 3,769-sample validation set. Our method is only tested on the most commonly used car class due to the large amount of data.

B. Data Augmentation

The dataset is augmented by various methods to obtain better performance, including flipping, rotating, and scaling the point cloud. Specifically, we followed the standard of PointPillars [14], where the probability of flipping is set to 0.5 and the rotation angle is subject to a uniform distribution over (−π/2, π/2). Each point cloud is randomly flipped along the x-axis, and we rotate each bounding box and add a random translation (Δx, Δy, Δz). In addition, we also randomly scaled the whole point cloud around the z-axis.


TABLE I
THE PERFORMANCE OF SSA3D ON THE KITTI DATASET ON THE CLASS CAR, DRAWN FROM THE OFFICIAL BENCHMARK [45]. "L" AND "R" REPRESENT USING LIDAR AND RGB IMAGES AS THE INPUT, RESPECTIVELY

Fig. 4. Illustration of Intersection over Union. The blue bounding box is the ground truth, the yellow bounding box is the predicted box, and the green bounding box is the overlap area.

C. Evaluation Metrics

The average precision (AP) with an Intersection over Union (IoU) threshold of 0.7 is used as the evaluation metric, and SSA3D is assessed on three levels of difficulty. As shown in Figure 4, the blue box represents the ground truth box while the yellow box represents the predicted box. The intersection of the two boxes is called the overlapping area, which is shown in green. Therefore, the IoU is calculated as follows:

IoU = V_overlapping area / (V_ground truth box + V_predicted box − V_overlapping area),   (5)

where IoU is the output and V_x stands for the volume of x.

We only count as correct the boxes with IoU greater than or equal to 0.7. Specifically, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the evaluation quantities used in machine learning, are adopted to evaluate our results. TP is the number of objects that are cars and are detected as cars, TN is the number of objects that are not cars and are not detected as cars, FP is the number of objects that are not cars but are detected as cars, and FN is the number of objects that are cars but are not detected as cars. The precision is defined as follows:

Precision = TP / (TP + FP).   (6)

The average precision is the average of the precision over all batches.

According to the official KITTI documents, an object with a minimum bounding box height of 25 px, a maximum occlusion level of "difficult to see," and a maximum truncation of 50% belongs to the hard level; an object with a minimum bounding box height of 25 px, a maximum occlusion level of "partly occluded," and a maximum truncation of 30% belongs to the moderate level; and an object with a minimum bounding box height of 40 px, a maximum occlusion level of "fully visible," and a maximum truncation of 15% belongs to the easy level.

D. Implementation Details

We ran our experiments on two NVIDIA RTX 2080Ti cards. We first cropped the point cloud to the region X ∈ (0 m, 70.4 m), Y ∈ (−40 m, 40 m), Z ∈ (−3 m, 1 m), following the commonly used settings. Points invisible in the front image were dropped. Finally, we selected the points of interest as input: we randomly choose 8K points from the entire point cloud scene because of GPU memory limitations. The batch size was set to 8 and was equally distributed over the two GPU cards. The learning rate was initialized to 0.002.

E. Experimental Results

Our experimental results are summarized in Table I. While our method does not give the best performance, it already surpasses most of the other methods. Specifically, our method outperforms all methods based on voxelization and performs better than PointPillars [14] by (4.45%, 3.73%, 4.03%) in 3D detection. In addition, we also achieve similar improvements over the two-stage methods, which means that our semantic segmentation module works. Moreover, our results also show the superiority of our approach when compared to multi-sensor methods like F-ConvNet [35]. It is worth noting that our model is built on top of the architecture of 3DSSD [30] and performs better than it by (0.63%, 1.93%, and 1.98%). We show some of the visualized results in Section III-G.

We tested some state-of-the-art methods on the KITTI val dataset for the car class under the same computing environment in Table II. All methods were run under the same conditions. The performance of our model was better than that of 3DSSD [30] because we strengthened the ability to extract features.

As shown in the above tables, the performance of the proposed method shows a significant improvement on the testing set (Table I) rather than on the validation set (Table II), especially for hard-level cars. KITTI only provides the training set and the testing set. Therefore, most methods split the training set into a training set and a validation set.


Fig. 5. The result of the easy car. (a) is the front image of the car, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score.

TABLE II
RESULTS FOR THE KITTI VAL DATASET IN THE CAR CLASS BASED ON POINT CLOUD

The two sets do not overlap, but they come from the same scenes. When the model is trained on the training set, it can learn about the validation set too. The similar accuracy of 3DSSD [30] and our method on the validation set only shows that both perform well in known scenes. The testing set, with unknown scenes, can evaluate the model better. According to Table I, our method shows a significant improvement in detection evaluation compared to other mainstream methods.

TABLE III
NETWORK STRUCTURE AMONG DIFFERENT METHODS

F. Model Efficiency

Normally, mainstream methods use the inference runtime to evaluate the efficiency of models, because the inference runtime is directly related to the detection speed. However, the inference process of a model can be influenced by many factors, such as the platform, the network structure, the network parameters, etc. As shown in Table III, the efficiency of the proposed method has been compared with state-of-the-art methods. All the methods are tested on an Intel(R) i9-9900X CPU and an RTX 2080Ti GPU. FLOPs represents the number of floating point operations, which indicates the complexity of the model. Thop [50] calculates the FLOPs of the PyTorch code, while profile [51] is used to calculate the FLOPs of the TensorFlow code. According to Table III, the efficiency of the proposed method is still comparable.
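For reference, counting FLOPs of a PyTorch model with thop follows the pattern below; the tiny placeholder network and the input shape are assumptions standing in for a real detector, and only the profile call itself is the standard thop API.

import torch
import torch.nn as nn
from thop import profile  # pip install thop

# Placeholder network standing in for a detector backbone.
model = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 7),            # e.g., box parameters per point
)

# A dummy batch of 8,192 points with 4 channels (x, y, z, intensity).
dummy = torch.randn(8192, 4)

macs, params = profile(model, inputs=(dummy,), verbose=False)
print(f"MACs: {macs / 1e9:.2f} G, parameters: {params / 1e6:.2f} M")
# thop reports multiply-accumulate operations; FLOPs are often quoted as 2 * MACs.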


Fig. 6. The result of the moderate car. (a) is the front image of the car, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

G. Visualization Results

We visualized some point cloud scenes to evaluate the performance of our method. These visualization results directly show the ability of our method to extract features. Following KITTI [27], the visualization results are divided into the easy, moderate, and hard car classes. Specifically, the easy car class has abundant points and its shape is relatively complete; the moderate car class has a medium number of points and its partial shape is relatively complete; the hard car class has only a few points and its shape is seriously damaged.

TABLE IV
THE EFFECTIVENESS OF THE AUXILIARY NETWORK

The Easy and Moderate Car Results: We show the visualization results of the three methods in Figure 5. As shown in these results, when seen from a suitable visual angle, the appearance structure of the easy cars is very complete. This also makes detecting easy cars easier, so most of the existing methods achieve excellent performance on them. All the methods we visualized successfully detected all the easy cars. Our results are only slightly different from theirs, which makes our method slightly better in accuracy; for example, the shape and location of the prediction box are closer to the ground truth box. However, there are some differences for moderate cars. In Figure 6, we show the moderate car results. From these results, we found that all three methods can detect the car with a clear structure, but not all methods can detect the car with a fuzzy structure. As shown in Figure 6, we have magnified the key result. The car is made up of 11 points, and its structure is seriously lost, so detecting it becomes a challenging task. Neither PointPillars nor 3DSSD can predict this moderate car. Fortunately, our method successfully detected its location, which means that our method has a certain ability to detect cars with lost structure.

The Hard Car Results: The points that represent distant objects may be very few, since LiDAR loses part of the point cloud when scanning. In fact, most hard cars contain only a few points. It is difficult to predict the semantic label and regress the box from the spatial structure information of so few points, so most methods do not perform well on hard cars, but we found some interesting things by visualizing our results. As shown in Figure 7, the hard car contains just a few points, which means it is almost impossible for the network to regress the box, regardless of whether the information is extracted through structural information alone or through semantic segmentation. Our method can only barely detect it and does not work well there. In Figure 8, two hard cars also contain only a few points. PointPillars and 3DSSD can only detect one of them, which is already an excellent performance in fact. Fortunately, our method successfully detected both hard cars. We think this performance must be attributed to the combined effect of the point feature extraction and the auxiliary network. The predicted semantic label and the regression box have low confidence because of the weak spatial structure, but when the network combines these two approaches in training, it becomes possible to regress such a box in testing.

We made a simple ablation experiment to intuitively prove the effectiveness of our auxiliary network in Table IV. By adding the auxiliary module, all tasks are improved. This means that our method successfully migrated the advantages of the semantic segmentation network in the two-stage networks to the one-stage network.

Most of the existing methods focus on how to regress boxes at the locations of cars, but not on false detections.


Fig. 7. The result of the hard car. (a) is the front image of the car, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

Fig. 8. The result of the hard car. (a) is the front image of the car, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

We designed the auxiliary network to help regress boxes through the semantic labels of the point cloud. It prevents the network from regressing boxes at locations whose point cloud structure is merely similar to that of a car. In the real world, wrong boxes will cause the car to make wrong judgments in automatic driving, which is not what we want. Fortunately, we found this advantage in our visualization results.

Distinguishing Similar Spatial Structures: The scene shown in Figure 9 is a relatively simple scene that includes a car, a truck, and a bicycle. The spatial shape of the fence is very similar to the spatial structure of the rear of the car.


Fig. 9. False detection that takes the fence as a car. (a) is the front image of the scene, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

Fig. 10. False detection that takes the surface of a building as a car. (a) is the front image of the scene, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

It is therefore difficult to distinguish them, although there are still some differences in detail because the fence is not a car. Theoretically, the network can solve this problem by better extracting the local spatial structure information. Fortunately, comparing the three methods shows that our method successfully solved this problem: our proposed sampling and attention group modules can better extract the local features. The situation shown in Figure 10 is also a simple scene, including some cars and some buildings. The methods used for the comparison take the surface of a building as a car because they are similar on one side.


Fig. 11. False detection of the van at the front. (a) is the front image of the scene, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

Fig. 12. False detection of the van at the side. (a) is the front image of the scene, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

This also proves the effectiveness of our proposed method. Note that although the compared methods gave some false detections with low confidence scores, our method did not produce these false detections at all.

In real scenes, vans and cars are very different in shape and size, and we can easily distinguish them in an RGB image. However, we can only obtain part of the appearance of an object due to the impenetrability of LiDAR. In addition, LiDAR inevitably causes some points to be lost when acquiring data because of environmental influences.


Fig. 13. The failure cases. (a) is the front image of the scene, (b) is the result of PointPillars, (c) is the result of 3DSSD, and (d) is the result of ours. The green, blue, and yellow boxes are the ground truth of the easy, moderate, and hard cars, respectively. The red boxes represent the predicted boxes. The red number is the confidence score. We magnified the differences among the three methods.

Therefore, the point cloud shapes of the van and the car are almost the same in the point cloud scenes, which makes classification and detection tasks difficult. We found that only using spatial structure features to classify objects did not solve this problem well. We therefore used the point cloud semantic segmentation network to assist the object detection network to achieve better performance. The scene in Figure 11 includes a truck, two cars, and a van. It is difficult for us to distinguish these vehicles with the human eye from this enlarged point cloud image; this type of problem is therefore a huge challenge for humans and AI alike, and it is impossible to distinguish them through traditional structural information alone. Note that, with the help of the semantic segmentation module, our network successfully solves the problem. In Figure 12, the same problem exists. We can see that the point cloud shape of the van is the same as the point cloud shape of the car, and the compared methods detect the van as a car. In contrast, SSA3D can distinguish the vehicles from both the side point cloud and the front point cloud.

Failure Cases: As shown in Figure 13, the hard car contains two points, and all three methods fail on it. We found that when the spatial structure information of the object is too weak, our method cannot detect it. The main reason for the failure, we think, is that the sampling operations during pre-processing greatly reduce the density of the point cloud, so that only a few points represent the hard car, which causes the loss of the spatial structure of the car. However, increasing the point cloud density significantly increases the inference time. Our future work is therefore to propose fast detection networks that can deal with high-density point clouds.

IV. SUMMARY

In this paper, we utilize a point cloud semantic segmentation network to aid 3D object detection. The difference between our method and the traditional two-stage methods is that we take advantage of the semantic segmentation in the two-stage networks without significantly increasing the computational load. Meanwhile, to better extract the spatial structure information of points and to select better points, we propose the selection module and the attention group to solve these issues. Experiments on the KITTI detection benchmark show better performance with high efficiency. In future work, we will build a network based on the point cloud that can detect multi-class objects simultaneously.

REFERENCES

[1] K. Muhammad, A. Ullah, J. Lloret, J. D. Ser, and V. H. C. de Albuquerque, "Deep learning for safe autonomous driving: Current challenges and future directions," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 7, pp. 4316–4336, Jul. 2021.
[2] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3D object detection methods for autonomous driving applications," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 10, pp. 3782–3795, Oct. 2019.
[3] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, "3D bounding box estimation using deep learning and geometry," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7074–7082.
[4] M. Zhu et al., "Single image 3D object detection and pose estimation for grasping," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2014, pp. 3936–3943.
[5] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, "GS3D: An efficient 3D object detection framework for autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 1019–1028.
[6] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, "Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2040–2049.


[4] M. Zhu et al., “Single image 3D object detection and pose estimation for grasping,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2014, pp. 3936–3943.
[5] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, “GS3D: An efficient 3D object detection framework for autonomous driving,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 1019–1028.
[6] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, “Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2040–2049.
[7] R. Mottaghi, Y. Xiang, and S. Savarese, “A coarse-to-fine model for 3D pose estimation and sub-category recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 418–426.
[8] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, “Monocular 3D object detection for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2147–2156.
[9] X. Chen et al., “3D object proposals for accurate object class detection,” in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 424–432.
[10] S. Song and J. Xiao, “Deep sliding shapes for amodal 3D object detection in RGB-D images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 808–816.
[11] Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud based 3D object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4490–4499.
[12] D. Z. Wang and I. Posner, “Voting for voting in online point cloud object detection,” in Proc. Robot., Sci. Syst., Rome, Italy, 2015, vol. 1, no. 3, pp. 10–15.
[13] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2015, pp. 922–928.
[14] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: Fast encoders for object detection from point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12697–12705.
[15] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 652–660.
[16] Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[17] B. Graham, M. Engelcke, and L. V. D. Maaten, “3D semantic segmentation with submanifold sparse convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 9224–9232.
[18] H. Li, S. Zhao, W. Zhao, L. Zhang, and J. Shen, “One-stage anchor-free 3D vehicle detection from LiDAR sensors,” Sensors, vol. 21, no. 8, p. 2651, 2021.
[19] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5099–5108.
[20] H. Luo et al., “Patch-based semantic labeling of road scene using colorized mobile LiDAR point clouds,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 5, pp. 1286–1297, May 2015.
[21] S. Shi, X. Wang, and H. P. Li, “3D object proposal generation and detection from point cloud,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, 2019, pp. 15–20.
[22] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “STD: Sparse-to-dense 3D object detector for point cloud,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1951–1960.
[23] Q. Meng, W. Wang, T. Zhou, J. Shen, L. Van Gool, and D. Dai, “Weakly supervised 3D object detection from lidar point cloud,” in Proc. Eur. Conf. Comput. Vis. Glasgow, U.K.: Springer, 2020, pp. 515–531.
[24] Q. Meng, W. Wang, T. Zhou, J. Shen, Y. Jia, and L. Van Gool, “Towards a weakly supervised framework for 3D point cloud object detection and annotation,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Mar. 3, 2021, doi: 10.1109/TPAMI.2021.3063611.
[25] W. Shi and R. Rajkumar, “Point-GNN: Graph neural network for 3D object detection in a point cloud,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1711–1719.
[26] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Dec. 2009.
[27] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[28] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang, “LiDAR-based online 3D video object detection with graph-based message passing and spatiotemporal transformer attention,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11495–11504.
[29] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3D object detection from point cloud,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11873–11882.
[30] Z. Yang, Y. Sun, S. Liu, and J. Jia, “3DSSD: Point-based 3D single stage object detector,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11040–11048.
[31] Y. Cui et al., “Deep learning for image and point cloud fusion in autonomous driving: A review,” IEEE Trans. Intell. Transp. Syst., early access, Mar. 17, 2021, doi: 10.1109/TITS.2020.3023541.
[32] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1–8.
[33] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1907–1915.
[34] P. Cao, H. Chen, Y. Zhang, and G. Wang, “Multi-view frustum pointnet for object detection in autonomous driving,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 3896–3899.
[35] Z. Wang and K. Jia, “Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection,” 2019, arXiv:1903.01864.
[36] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “IPOD: Intensive point-based object detector for point cloud,” 2018, arXiv:1812.05276.
[37] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “PointPainting: Sequential fusion for 3D object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4604–4612.
[38] Z. Ouyang, X. Dong, J. Cui, J. Niu, and M. Guizani, “PV-EncoNet: Fast object detection based on colored point cloud,” IEEE Trans. Intell. Transp. Syst., early access, Sep. 27, 2021, doi: 10.1109/TITS.2021.3114062.
[39] Q. Hu et al., “RandLA-Net: Efficient semantic segmentation of large-scale point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11108–11117.
[40] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 918–927.
[41] M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu, “PointSIFT: A SIFT-like network module for 3D point cloud semantic segmentation,” 2018, arXiv:1807.00652.
[42] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Convolution on X-transformed points,” in Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 820–830.
[43] W. Wu, Z. Qi, and L. Fuxin, “PointConv: Deep convolutional networks on 3D point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9621–9630.
[44] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[45] (2019). KITTI 3D Object Detection Benchmark. [Online]. Available: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d
[46] Y. Chen, S. Liu, X. Shen, and J. Jia, “Fast point R-CNN,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9775–9784.
[47] J. Lehner, A. Mitterecker, T. Adler, M. Hofmarcher, B. Nessler, and S. Hochreiter, “Patch refinement–localized 3D object detection,” 2019, arXiv:1910.04093.
[48] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 8, pp. 2647–2664, Aug. 2021.
[49] M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion for multi-sensor 3D object detection,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 641–656.
[50] L. Zhu. (2019). PyTorch-OpCounter. [Online]. Available: https://github.com/Lyken17/pytorch-OpCounter
[51] (2017). TensorFlow. [Online]. Available: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/model_analyzer.py

Shangfeng Huang is currently pursuing the bachelor’s degree with the Computer Engineering College, Jimei University, Xiamen, Fujian, China. His research interests include machine learning, 3D object detection, semantic segmentation of point clouds, deep learning theory and its application, virtual reality, and augmented reality.

Guorong Cai received the Ph.D. degree from Xiamen University, Xiamen, China. He is currently a Professor in computer science with Jimei University, Xiamen. His research interests include 3D object detection, 3D reconstruction, machine learning, image-based object detection/recognition, and image/video retrieval.

Zongyue Wang received the Ph.D. degree from Wuhan University, Wuhan, Hubei, China. He is currently a Professor in computer engineering with Jimei University, Xiamen, Fujian, China. His main research interests are image processing, machine learning, computer vision, and remote sensing data processing.

Qiming Xia is currently pursuing the master’s degree with the Computer Engineering College, Jimei University, Xiamen, Fujian, China. His research interests include 3D object detection of point clouds, machine learning, deep learning theory and its application, knowledge graph, and image processing.

Ruisheng Wang (Senior Member, IEEE) received the B.Eng. degree in photogrammetry and remote sensing from Wuhan University, the M.Sc.E. degree in geomatics engineering from the University of New Brunswick, and the Ph.D. degree in electrical and computer engineering from McGill University. He is currently a Professor with the Department of Geomatics Engineering, University of Calgary, which he joined in 2012. Prior to that, he worked as an Industrial Researcher at HERE (formerly NAVTEQ), Chicago, USA, in 2008. His primary research focus there was on mobile LiDAR data processing for next generation map making and navigation. His research interests include geomatics and computer vision, especially point cloud processing.
