
Page 1

Page 2

Hi everyone, today I will present the research update for theme 4.

Page 3

Today’s topic is a literature review of a recently published paper, V2V4Real, a real-world,
large-scale dataset for vehicle-to-vehicle cooperative perception.

Page 4

Perception is critical in autonomous driving. Single-vehicle perception systems still suffer from many
real-world challenges, such as occlusion and limited perception range. These
challenges stem mainly from the limited field-of-view of the individual vehicle, leading to an
incomplete understanding of the surrounding traffic.
To address these challenges, multi-sensor perception systems that fuse data from various
sensors such as cameras, lidars, and radars are used. Multi-sensor perception systems can
provide a more complete and accurate understanding of the surrounding traffic and improve the
safety of autonomous driving. However, designing and integrating such systems pose additional
challenges such as data fusion, sensor calibration, and real-time processing.

Page 5

Multiple connected and automated vehicles (CAVs) can communicate and share captured
sensor information simultaneously. This can enable V2V (vehicle-to-vehicle) perception, which
can improve the safety and efficiency of autonomous driving.
However, it remains challenging to validate V2V perception in real-world scenarios due to the
lack of public benchmarks. Most of the existing V2V datasets rely on open-source simulators.
However, there exists a clear domain gap between synthetic data and real-world data, as the
traffic behavior and sensor rendering in simulators are often not realistic enough.
To address this challenge, researchers are exploring various approaches to generate more
realistic V2V datasets, such as using real-world sensor data, developing advanced simulation
tools, and leveraging crowd-sourcing techniques.

Page 6

Validating V2V perception algorithms is critical for enabling safe and efficient autonomous
driving. However, existing V2V datasets are limited in their scope and realism.
Compared to DAIR-V2X, the only existing real-world cooperative dataset, the authors' proposed
V2V4Real dataset shows several strengths.
First, it fills the gap by focusing on the important V2V cooperation, which is essential for safe
and efficient autonomous driving.
Second, it includes four diverse road types, ranging from highways to urban streets, to provide a
comprehensive evaluation of V2V perception algorithms.
Third, it provides high-definition (HD) maps, which are essential for accurate and precise
localization of vehicles in the environment.
Fourth, it constructs several benchmarks, including 3D object detection, object tracking, and
Sim2Real domain adaptation, to comprehensively evaluate the performance of V2V perception
algorithms.
Finally, it provides eight state-of-the-art cooperative perception algorithms for benchmarking,
which can serve as a baseline for future research and development.

Page 7

In summary, the authors of V2V4Real made the following key contributions:


First, they built the V2V4Real dataset from diverse real-world scenarios. All frames contain
multi-modal sensor readings, providing a realistic and comprehensive basis for evaluating
V2V perception algorithms.
Second, they provided more than 240K annotated 3D bounding boxes for 5 vehicle classes, as
well as corresponding HDMaps along the driving routes. This enables accurate and precise
localization of objects in the environment.
Third, they introduced three cooperative perception tasks, including 3D object detection, object
tracking, and Sim2Real domain adaptation, providing comprehensive benchmarks for evaluating
the performance of V2V perception algorithms.

Page 8

In this section, I will introduce related research on Autonomous Driving Datasets,
3D Detection, and V2V/V2X Cooperative Perception.

Page 9

Autonomous driving datasets are a crucial resource for developing and evaluating perception
algorithms. Over the years, these datasets have evolved significantly, from early datasets like
Cityscapes that mainly focus on 2D annotations for RGB camera images to more advanced
datasets like KITTI that provide multimodal sensor readings.
KITTI is a pioneering dataset that provides front-facing stereo camera and LiDAR data, which
enables accurate and comprehensive annotations of the environment. It includes annotations
for various tasks, such as 3D object detection, tracking, and road segmentation, making it a
valuable resource for autonomous driving research.
However, KITTI also has limitations, such as a relatively small dataset size and limited diversity
of scenarios. To address these limitations, newer datasets have been proposed, such as
Waymo Open Dataset and nuScenes, which provide more extensive and diverse annotations,
including HD maps and 3D object tracking.

Page 10
In autonomous driving, 3D object detection is a crucial task for accurately and efficiently
perceiving the environment. There are several techniques for 3D detection, including camera-
based, LiDAR-based, and camera-LiDAR fusion.
Camera-based 3D detection involves detecting 3D objects from a single or multiple RGB
images. This technique is challenging due to the lack of depth information in 2D images, but
recent advances in deep learning have significantly improved the accuracy and efficiency of this
approach.
LiDAR-based 3D detection converts LiDAR points into voxels or pillars, which can be processed
by 3D convolutional neural networks (CNNs) to detect 3D objects. LiDAR-based techniques
have been shown to be effective in various scenarios, such as low-light conditions and occluded
objects.
Camera-LiDAR fusion is a recent trend in 3D detection that combines the advantages of both
camera and LiDAR-based techniques. This approach fuses the 2D image features and 3D point
cloud features to improve the accuracy and robustness of 3D object detection.
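As a rough sketch of the pillar idea described above (my own illustration in NumPy, not the PointPillars implementation; the grid ranges, pillar size, and point cap are assumed values), the following groups raw LiDAR points into a 2D grid of pillars, which a network would then encode into bird's-eye-view features:

```python
import numpy as np

def pointcloud_to_pillars(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                          pillar_size=0.16, max_points_per_pillar=32):
    """Group LiDAR points (N, 4: x, y, z, intensity) into vertical pillars.

    Returns a dict mapping (ix, iy) grid indices to arrays of points, which a
    PointPillars-style network would subsequently encode into 2D feature maps.
    """
    # Keep only points inside the chosen detection range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[mask]

    # Compute the 2D pillar index of every point (z is ignored: pillars span the full height).
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(np.int32)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(np.int32)

    pillars = {}
    for p, i, j in zip(points, ix, iy):
        bucket = pillars.setdefault((i, j), [])
        if len(bucket) < max_points_per_pillar:  # cap the number of points per pillar
            bucket.append(p)
    return {k: np.stack(v) for k, v in pillars.items()}

# Example: 10,000 random points spread over a 100 m x 80 m area.
pts = np.random.rand(10000, 4) * np.array([100.0, 80.0, 3.0, 1.0]) - np.array([0.0, 40.0, 1.5, 0.0])
print(f"{len(pointcloud_to_pillars(pts))} non-empty pillars")
```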

Page 11

Cooperative perception is an essential aspect of autonomous driving, as it allows multiple
vehicles to share sensor information and improve the accuracy and robustness of perception
systems. With cooperative perception, it is possible to tackle the limitations of single-vehicle
perception and achieve better detection and tracking of objects.
V2V/V2X cooperative perception can be roughly divided into three categories: early fusion, late
fusion, and intermediate fusion.
Early fusion is a technique that fuses the raw sensor data from multiple vehicles into a single
representation, such as a point cloud or an image. This approach can improve the accuracy of
perception systems by leveraging the diverse viewpoints and sensor modalities of multiple
vehicles.
Late fusion involves combining the results of individual perception systems from multiple
vehicles to form a consensus estimate. This approach is less computationally demanding than
early fusion and can be more robust to sensor failures.
Intermediate fusion is a hybrid approach that combines the advantages of early and late fusion.
It involves fusing the sensor data at a higher level of abstraction, such as feature maps or object
proposals.
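To make the three categories concrete, here is a schematic sketch of where the fusion point differs. This is my own simplification, not code from the paper; `encode_features`, `detect`, and `non_max_suppression` are placeholder names, and the sketch assumes the cooperating vehicle's points are already projected into the ego frame.

```python
import numpy as np

# Schematic placeholders; a real system would use a trained 3D detector and feature encoder.
def encode_features(points):            # point cloud -> BEV feature map
    return np.zeros((64, 200, 200))

def detect(representation):             # features or points -> list of boxes
    return []

def non_max_suppression(boxes):         # placeholder: a real pipeline would suppress duplicates
    return boxes

def early_fusion(ego_points, cav_points_in_ego_frame):
    # Fuse raw data: concatenate the two point clouds, then run a single detector.
    merged = np.concatenate([ego_points, cav_points_in_ego_frame], axis=0)
    return detect(merged)

def late_fusion(ego_points, cav_points_in_ego_frame):
    # Each vehicle detects independently; only the final boxes are shared and merged.
    boxes = detect(ego_points) + detect(cav_points_in_ego_frame)
    return non_max_suppression(boxes)

def intermediate_fusion(ego_points, cav_points_in_ego_frame):
    # Each vehicle shares compressed feature maps; fusion happens in feature space
    # (here a plain sum stands in for a learned fusion module).
    fused = encode_features(ego_points) + encode_features(cav_points_in_ego_frame)
    return detect(fused)
```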

Page 12

In this section, I will explain the data collection setup, the data annotation approach, and the
data statistics.

Page 13
The authors collected the V2V4Real dataset using two experimental connected automated
vehicles - a Tesla and a Ford Fusion. These vehicles were retrofitted by Transportation
Research Center (TRC) and AutonomouStuff (AStuff) companies, and equipped with a
Velodyne VLP-32 LiDAR sensor, two mono cameras, and GPS/IMU integration systems. The
authors kept the distance between the two vehicles within 150 meters to ensure overlap
between their views while driving simultaneously in Columbus, Ohio. They varied the relative
poses of the two vehicles across different scenarios to enrich the diversity of sensor-view
combinations. The authors collected driving logs over three days, covering 347 km of highway
and 63 km of city roads. In total, 19 hours of driving produced 310K frames, from which the 67
most representative scenarios were selected. For each scenario, the authors ensured that the
asynchrony between the two vehicles' sensor systems was less than 50 ms. All scenarios were
aligned with maps containing drivable regions, road boundaries, and dashed lane lines.

Page 14

The authors use SUSTechPOINTS to annotate 3D bounding boxes for the collected LiDAR data in
four different coordinate systems: the LiDAR coordinate systems of the Tesla and the Ford Fusion,
the HDMap coordinate system, and the Earth-centered, Earth-fixed (ECEF) coordinate system. The dataset includes five
object classes, and each object's driving state is recorded with a 7-degree-of-freedom 3D
bounding box. Additionally, the authors provide a global point cloud map and vector map
annotation. To generate the HDMaps, the authors preprocess each LiDAR frame by removing
dynamic objects and apply a scan matching algorithm to compute the relative transformation
between two consecutive LiDAR frames.
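The paper does not specify which scan-matching implementation was used; a minimal sketch of the idea using Open3D's point-to-point ICP (the voxel size and correspondence distance are assumed values) could look like this:

```python
import numpy as np
import open3d as o3d

def relative_transform(prev_points, curr_points, voxel=0.5, max_dist=1.0):
    """Estimate the rigid transform aligning the current LiDAR frame to the previous one.

    prev_points, curr_points: (N, 3) NumPy arrays of LiDAR points with dynamic objects removed.
    Returns a 4x4 homogeneous transformation matrix; chaining these over a drive
    yields the poses needed to accumulate a global point cloud map.
    """
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(curr_points))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(prev_points))
    # Downsample to speed up matching and reduce sensor noise.
    src = src.voxel_down_sample(voxel)
    tgt = tgt.voxel_down_sample(voxel)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation
```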

Page 15

The authors present an analysis of the LiDAR point density distribution and bounding box size
distribution for objects in their V2V4Real dataset. The left figure shows the distribution of LiDAR
points within objects' bounding boxes; it indicates that when only one vehicle (the Tesla) is used to scan the
environment, the number of LiDAR points within the bounding boxes decreases dramatically as
the distance increases radially. However, with the shared visual information from the other
vehicle (Ford Fusion), the LiDAR point density of each object increases significantly and still
remains high even at a distance of 100 meters, highlighting the benefits of cooperative
perception. The right figure shows that the annotated objects have diverse bounding box sizes,
ranging in length from 2.5 to 23 meters, in width from 1.5 to 4.5 meters, and in height from 1 to
4.5 meters, demonstrating the diversity of the V2V4Real dataset.
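As a side note, counting the LiDAR points that fall inside a 7-degree-of-freedom box, as done for the left figure, can be sketched as follows (my own illustration, not the authors' code):

```python
import numpy as np

def points_in_box(points, box):
    """Count LiDAR points (N, >=3) inside one 7-DoF box (x, y, z, l, w, h, yaw).

    Points are expressed in the box's local frame, then compared against its extents.
    """
    x, y, z, l, w, h, yaw = box
    local = points[:, :3] - np.array([x, y, z])
    # Rotate by -yaw to undo the box's heading.
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])
    local[:, :2] = local[:, :2] @ rot.T
    inside = (np.abs(local[:, 0]) <= l / 2) & \
             (np.abs(local[:, 1]) <= w / 2) & \
             (np.abs(local[:, 2]) <= h / 2)
    return int(inside.sum())
```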

Page 16

In this section, I will explain the tasks investigated on the V2V4Real dataset, including
cooperative detection, tracking, and Sim2Real transfer learning.

Page 17

Cooperative 3D object detection poses several domain-specific challenges, such as GPS error,
asynchronicity, and bandwidth limitation. To overcome these challenges and design efficient
cooperative detection methods, the ground truth is defined in a unified coordinate system, namely
the ego vehicle's frame. Evaluation uses Average Precision (AP) at Intersection-over-Union (IoU)
thresholds of 0.5 and 0.7 as the metric. To assess the transmission cost, Average MegaByte (AM) is employed. The
evaluation is done under two settings: 1) Sync setting, where the sensor systems of both
vehicles are synchronized, and 2) Async setting, where there is an asynchronicity of up to 50
ms between the sensor systems of the two vehicles. Benchmarking methods include no fusion,
early fusion, late fusion, and intermediate fusion. The study aims to demonstrate the benefits of
cooperative perception and explore the most effective fusion methods for multi-vehicle
cooperative 3D object detection.
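Because ground truth and predictions live in the ego coordinate system, the cooperating vehicle's data must first be projected into that frame. A minimal sketch of this projection, assuming 4x4 pose matrices in a shared reference frame (not the dataset's exact API), is shown below:

```python
import numpy as np

def transform_to_ego(points_cav, T_world_cav, T_world_ego):
    """Project a cooperating vehicle's LiDAR points into the ego coordinate system.

    points_cav:  (N, 3) points in the CAV's LiDAR frame.
    T_world_cav: 4x4 pose of the CAV LiDAR in a shared reference frame.
    T_world_ego: 4x4 pose of the ego LiDAR in the same shared frame.
    """
    T_ego_cav = np.linalg.inv(T_world_ego) @ T_world_cav   # CAV frame -> ego frame
    homo = np.hstack([points_cav, np.ones((points_cav.shape[0], 1))])
    return (T_ego_cav @ homo.T).T[:, :3]
```

GPS error directly perturbs these pose matrices, which is one reason localization noise is called out as a domain-specific challenge.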

Page 18

Object tracking is an important task in cooperative perception systems. There are two major
approaches to tracking: joint detection-and-tracking, and tracking-by-detection. The
authors of this paper focus on the latter. They use the AB3DMOT tracker as their
baseline and evaluate it using several metrics. The evaluation metrics used
are Multi Object Tracking Accuracy (MOTA), Mostly Tracked Trajectories (MT), Mostly Lost
Trajectories (ML), Average Multiobject Tracking Accuracy (AMOTA), Average Multiobject
Tracking Precision (AMOTP), and scaled Average Multiobject Tracking Accuracy (sAMOTA).
These metrics provide a comprehensive assessment of the tracking performance.
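For reference, MOTA follows the standard CLEAR-MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, aggregated over all frames. A minimal sketch (not code from the paper):

```python
def mota(num_misses, num_false_positives, num_id_switches, num_ground_truth):
    """Multi-Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (num_misses + num_false_positives + num_id_switches) / num_ground_truth

# Example with made-up counts: 100 misses, 90 false positives, 10 ID switches, 2000 GT boxes.
print(f"MOTA = {mota(100, 90, 10, 2000):.3f}")  # prints "MOTA = 0.900"
```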

Page 19

The authors investigate how to reduce the domain discrepancy between the target domain
(V2V4Real) and the source domain (OPV2V) in cooperative 3D detection. They propose to use
domain adaptation methods to achieve this goal. The evaluation is conducted on the test set of
the V2V4Real dataset under the Sync setting. The baseline is to train the detection models on
OPV2V and test them directly on V2V4Real without any domain adaptation. The authors aim to
improve cooperative 3D detection performance by reducing the gap between the two domains.
They train the models on the source domain and adapt them to the target domain using domain
adaptation methods such as adversarial training and self-training. The performance of the
methods is evaluated
using metrics such as Average Precision (AP) at Intersection-over-Union (IoU) of 0.5 and 0.7.
The proposed approach can potentially enable better real-world deployment of cooperative
perception systems by addressing the challenges of domain discrepancy.
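As an illustration of the simplest adaptation variant, plain fine-tuning on labeled target data, here is a hedged sketch in PyTorch (placeholder names and hyperparameters; adversarial training or self-training would add further components):

```python
import torch

def finetune_on_target(model, target_loader, epochs=10, lr=1e-4):
    """Adapt a detector pretrained on the simulated source domain (e.g. OPV2V)
    by fine-tuning on the real-world target domain (e.g. the V2V4Real train split).

    Assumes `model` is a torch.nn.Module whose forward pass returns a scalar
    detection loss when called on a training batch.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in target_loader:
            loss = model(batch)        # the detector computes its own loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```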

Page 20

In this section, I will cover implementation details and benchmark results on 3D LiDAR Object
Detection, 3D Object Tracking, and Sim2Real Domain Adaptation.

Page 21

The dataset is split into three sets: train, validation, and test, with 14,210, 2,000, and 3,986
frames, respectively, for all three tasks. PointPillars is used as the backbone for all detection
models to extract 2D features from the point cloud data. Standard point cloud augmentations,
such as scaling, rotation, and flipping, are applied in all experiments. For the tracking task, the
previous three frames, along with the current frame, are used as inputs.
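A minimal sketch of such global augmentations (my own illustration; the exact parameter ranges are not stated in the paper and are assumed here):

```python
import numpy as np

def augment_pointcloud(points, boxes):
    """Apply global augmentations in place to LiDAR points (N, 4: x, y, z, intensity)
    and 7-DoF boxes (M, 7: x, y, z, l, w, h, yaw)."""
    # Random global scaling.
    scale = np.random.uniform(0.95, 1.05)
    points[:, :3] *= scale
    boxes[:, :6] *= scale

    # Random rotation about the vertical (z) axis.
    theta = np.random.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += theta

    # Random flip across the x-axis.
    if np.random.rand() < 0.5:
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1
    return points, boxes
```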

Page 22

In the cooperative 3D object detection benchmark, it was found that cooperative perception
methods significantly improve accuracy across all evaluation ranges. The
intermediate fusion methods were found to achieve the best trade-off between accuracy and
transmission cost. However, it was observed that the AP of all methods except for No Fusion
dropped significantly when the communication delay was introduced. These findings highlight
the importance of cooperative perception in the development of autonomous vehicles and the
need for efficient methods of data fusion. Overall, the results suggest that intermediate fusion
methods may offer the most promising approach for achieving accurate and efficient
cooperative 3D object detection.

Page 23

In the 3D object tracking benchmark, the authors found that combining AB3DMOT with
cooperative detection significantly improved performance compared to a single-vehicle
tracking method. Additionally, they found that CoBEVT achieved the best performance in most
of the evaluation metrics. These results suggest that cooperative perception methods can
greatly benefit 3D object tracking in autonomous driving systems. It is important to note that
different tracking algorithms may have varying performance depending on the specific scenario
and dataset, so careful evaluation and selection of the appropriate algorithm is crucial.

Page 24

In the domain adaptation benchmark, the authors found that there was a significant domain gap
between the simulated dataset (OPV2V) and the real-world dataset (V2V4Real). The detection
models trained only on the simulated data showed a clear drop in accuracy when tested on the
real-world data. However, after employing domain adaptation methods, the CoBEVT model
achieved a performance of 40.2%, higher than the No Fusion baseline. This
indicates that domain adaptation techniques can effectively reduce domain discrepancies and
improve the accuracy of cooperative 3D object detection models.
Page 25

In conclusion, the V2V4Real dataset is a significant contribution to the field of V2V cooperative
perception research, with extensive LiDAR frames, RGB images, and annotated bounding
boxes. The benchmarks for 3D object detection, object tracking, and domain adaptation
demonstrate the benefits of cooperative perception and highlight the importance of domain
adaptation for real-world scenarios. However, it is important to note that there may still be
challenging scenarios that are not present in the training set, and out-of-distribution detection
remains an unexplored topic. Future work may focus on HDMap learning tasks and camera
images to further advance the field of cooperative perception.

Page 26

Thank you for listening. If you have any questions, please feel free to ask.
