Abstract— A fundamental challenge in autonomous vehicles is adjusting the steering angle under different road conditions. Recent state-of-the-art solutions addressing this challenge include deep learning techniques, as they provide an end-to-end solution that predicts steering angles directly from the raw input images with high accuracy. Most of these works ignore the temporal dependencies between the image frames. In this paper, we tackle the problem of utilizing multiple sets of images shared between two autonomous vehicles to improve the accuracy of controlling the steering angle by considering the temporal dependencies between the image frames. This problem has not been widely studied in the literature. We present and study a new deep architecture that predicts the steering angle automatically, using Long Short-Term Memory (LSTM). Our deep architecture is an end-to-end network that utilizes CNN, LSTM and fully connected (FC) layers, and it uses both present and future images (shared by a vehicle ahead via Vehicle-to-Vehicle (V2V) communication) as input to control the steering angle. Our model demonstrates the lowest error when compared to the other existing approaches in the literature.

Fig. 1. The overview of our proposed vehicle-assisted end-to-end system. Vehicle 2 (V2) sends its information to Vehicle 1 (V1) over V2V communication. V1 combines that information with its own information to control the steering angle. The prediction is made through our CNN+LSTM+FC network (see Fig. 2 for the details of our network).
variants of convolutional architecture for end-to-end models in [16], [17], [18]. Another widely used approach for controlling vehicle steering angle in autonomous systems is via sensor fusion, where combining image data with other sensor data such as LiDAR, RADAR and GPS improves the accuracy in autonomous operation [19], [20]. As an instance, in [17], the authors designed a fusion network using both image features and LiDAR features based on VGGNet.

Fig. 2. CNN + LSTM + FC image-sharing model. Our model uses 5 convolutional layers, followed by 3 LSTM layers, followed by 4 FC layers. See Table I for further details of our proposed architecture.
All the above-listed works focus on utilizing the image data obtained from on-board sensors, and they do not consider assisted data that comes from another car. In this paper, we demonstrate that using additional data that comes from the vehicle ahead helps us obtain better accuracy in controlling the steering angle. In our approach, we utilize the information that is available to a vehicle ahead of our car to control the steering angle. The rest of the paper is organized as follows: In Section III the proposed approach is explained, and Section IV provides details about the performed experiments. Finally, in Section V we conclude the paper and discuss possible directions for future work.

TABLE I
DETAILS OF PROPOSED ARCHITECTURE

Layer  Type    Size              Stride  Activation
0      Input   640*480*3*2X      -       -
1      Conv2D  5*5, 24 Filters   (5,4)   ReLU
2      Conv2D  5*5, 32 Filters   (3,2)   ReLU
3      Conv2D  5*5, 48 Filters   (5,4)   ReLU
4      Conv2D  5*5, 64 Filters   (1,1)   ReLU
5      Conv2D  5*5, 128 Filters  (1,2)   ReLU
6      LSTM    64 Units          -       Tanh
7      LSTM    64 Units          -       Tanh
8      LSTM    64 Units          -       Tanh
9      FC      100               -       ReLU
10     FC      50                -       ReLU
11     FC      10                -       ReLU
12     FC      1                 -       Linear
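To make the layer stack in Table I concrete, the following is a minimal Keras sketch of such a CNN+LSTM+FC network. It is only an illustration under our own assumptions, not the authors' released code: the per-frame (time-distributed) application of the convolutions, the "same" padding, and the example sequence length of 2x frames with x = 8 are not specified at this level of detail in the table.

```python
from tensorflow.keras import layers, models

# Sketch of the Table I stack: 5 Conv2D layers applied to every frame of the
# input sequence, 3 stacked LSTM layers, and 4 FC layers ending in a single
# linear steering-angle output. padding="same" is our assumption so that the
# strides listed in Table I yield valid feature-map sizes.
def build_model(x=8, height=480, width=640, channels=3):
    inputs = layers.Input(shape=(2 * x, height, width, channels))
    conv_cfg = [(24, (5, 4)), (32, (3, 2)), (48, (5, 4)), (64, (1, 1)), (128, (1, 2))]
    z = inputs
    for filters, strides in conv_cfg:
        z = layers.TimeDistributed(
            layers.Conv2D(filters, (5, 5), strides=strides,
                          padding="same", activation="relu"))(z)
    z = layers.TimeDistributed(layers.Flatten())(z)
    z = layers.LSTM(64, return_sequences=True)(z)   # LSTM layers use tanh by default
    z = layers.LSTM(64, return_sequences=True)(z)
    z = layers.LSTM(64)(z)
    for units in (100, 50, 10):
        z = layers.Dense(units, activation="relu")(z)
    output = layers.Dense(1, activation="linear")(z)  # regression output (radians)
    return models.Model(inputs, output)

model = build_model()
model.compile(optimizer="adam", loss="mse")  # MSE loss, as described in Section III
```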
III. PROPOSED APPROACH

Controlling the steering angle directly from input images is a regression problem. For that purpose, we can either use a single image or a sequence of (multiple) images. Considering multiple frames in a sequence can benefit us in situations where the present image alone is affected by noise or contains little useful information, for example when the current image is largely washed out by direct sunlight or when the vehicle reaches a dead-end. In such situations, the correlation between the current frame and the past frames can be useful to decide the next steering value. To utilize multiple images as a sequence, we use LSTM. LSTM has a recursive structure acting as a memory, through which the network can keep some past information and solve for a regression value based on the dependency of the consecutive frames [21], [22].

Our proposed idea in this paper relies on the fact that the condition of the road ahead has already been seen by another vehicle recently, and we can utilize that information to control the steering angle of our car as discussed above. Fig. 1 illustrates our approach. In the figure, Vehicle 1 receives a set of images from Vehicle 2 over V2V communication and keeps the data on board. It combines the received data with the data obtained from the onboard camera and processes those two sets of images on board to control the steering angle via an end-to-end deep architecture. This method enables the vehicle to look ahead of its current position at any given time.

Our deep architecture is presented in Fig. 2. The network takes the set of images from both vehicles as input and, at the last layer, it predicts the steering angle as the regression output. The details of our deep architecture are given in Table I. Since we construct this problem as a regression problem with a single unit at the end, we use the Mean Squared Error (MSE) loss function in our network during training.

IV. EXPERIMENT SETUP

In this section we elaborate further on the dataset as well as the data preprocessing and evaluation metrics. We conclude the section with details of our implementation.

A. Dataset

In order to compare our results to existing work in the literature, we used the self-driving car dataset by Udacity. The dataset has a wide variation of 100K images from simultaneous Center, Left and Right cameras on a vehicle, collected in sunny and overcast weather; 33K of the images belong to the center camera. The dataset contains the data of 5 different trips with a total drive time of 1694 seconds. The test vehicle has 3 cameras mounted as in [1], collecting images at a rate of near 20 Hz. Steering wheel angle, acceleration, brake and GPS data were also recorded. The distribution of the steering wheel angles over the entire dataset is shown in Fig. 3; it includes a wide range of steering angles. The image size is 480*640*3 pixels and the total dataset is 3.63 GB. Since there is currently no dataset available with V2V communication images, we simulate the environment by creating a virtual vehicle that is moving ahead of the autonomous vehicle and sharing camera images, using the Udacity dataset.
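As a rough illustration of this simulation, a sketch like the following could pair each ego frame with frames recorded ∆t steps later in the same sequence to stand in for the images shared by the vehicle ahead. The helper name, the exact window of "shared" frames, and the parameter values are hypothetical assumptions, not the authors' preprocessing code.

```python
# Hypothetical helper: pair x "own" frames ending at time t with x frames taken
# delta_t steps later in the same recording, emulating images shared by a
# vehicle driving ahead. `frames` is the time-ordered list of center-camera
# image paths and `angles` the matching steering labels.
def make_v2v_samples(frames, angles, x=8, delta_t=30):
    samples = []
    for t in range(x - 1, len(frames) - delta_t - x):
        own = frames[t - x + 1 : t + 1]                  # what V1 sees up to time t
        shared = frames[t + delta_t : t + delta_t + x]   # what the virtual V2 shares
        samples.append((own + shared, angles[t]))        # 2x frames, label at time t
    return samples
```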
Fig. 3. The angle distribution within the entire Udacity dataset (angle in radians vs. total number of frames); only angles between -1 and 1 radians are shown.

The Udacity dataset has been used widely in the recent relevant literature [15], [23], and we also use it in this paper to compare our results to the existing techniques in the literature. Along with the steering angle, the dataset contains spatial (latitude, longitude, altitude) and dynamic (angle, torque, speed) information labelled with each image. The data format for each image is: index, timestamp, width, height, frame id, filename, angle, torque, speed, latitude, longitude, altitude. For our purpose, we use only the sequence of center-camera images.

B. Data Preprocessing

The images in the dataset are recorded at a rate of around 20 frames per second. Therefore, there is usually a large overlap between consecutive frames. To avoid overfitting, we used image augmentation to get more variance in our image dataset. Our image augmentation technique randomly adds brightness and contrast to change pixel values. We also tested image cropping to exclude possibly redundant information that is not relevant in our application; however, in our tests the models perform better without cropping.

For the sequential model implementation, we preprocessed the data in a different way. Since we want to keep the visual sequential relevance in the series of frames while avoiding overfitting, we shuffle the dataset while keeping track of the sequential information. We then train our model with 80% of the images from the same sequence of the subsets and validate on the remaining 20%.

C. Vehicle-assisted Image Sharing

Modern wireless technology allows us to share data between vehicles.

D. Evaluation Metrics

The steering angle is a continuous variable predicted for each time step over the sequential data, and mean absolute error (MAE) and root mean squared error (RMSE) are two of the most commonly used metrics in the literature to measure the effectiveness of controlling systems. For example, RMSE is used in [15], [23] and MAE in [26]. Both MAE and RMSE express average model prediction error, and their values can range from 0 to ∞. They are both indifferent to the sign of the error. Lower values are better for both metrics.

E. Baseline Networks

As baselines, we include multiple deep architectures that have been proposed in the literature to compare against our proposed algorithm. Those models, from [1], [15] and [23], are, to the best of our knowledge, the best reported camera-only approaches in the literature. In total, we chose 5 baseline end-to-end algorithms to compare our results. We name these five models as models A, B, C, D and E in the rest of this paper. Model A is our implementation of the model presented in [1]. Models B and C are the proposals of [15]. Models D and E are reproduced as in [23]. The overview of these models is given in Fig. 4. Model A uses a CNN-based network, while Model B combines LSTM with 3D-CNN and uses 25 time-steps as input. Model C is based on the ResNet [27] model, and Model D uses the difference image of two given time-steps as input to a CNN-based network. Finally, Model E uses the concatenation of two images coming from different time-steps as input to a CNN-based network.

F. Implementation and Hyperparameter Tuning

Our implementations use Keras with a TensorFlow backend. Final training is done on two NVIDIA Tesla V100 16GB GPUs. When implemented on our system, the training took 4 hours for the model in [1] and between 9 and 12 hours for the deeper networks used in [15], [23] and our proposed network.

We used the Adam optimizer [28] in all our experiments (learning rate of 10^-3, β1 = 0.900, β2 = 0.999, ε = 10^-8). For the learning rate, we tested values from 10^-1 to 10^-6 and found the best-performing learning rate to be 10^-3. We also studied the minibatch size to see its effect on our network. Minibatch sizes of 128, 64 and 32 were tested, and the value 64 yielded the best results for us; therefore we used a minibatch size of 64.
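As a rough, non-authoritative sketch of how the preprocessing in Section IV-B and the training configuration in Section IV-F fit together, the snippet below illustrates random brightness/contrast augmentation, the reported Adam settings, and a sequence-preserving 80/20 split. The augmentation ranges, the placeholder sequences, and the epoch count are our assumptions rather than values taken from the paper.

```python
import numpy as np
import tensorflow as tf

# Illustrative brightness/contrast augmentation as described in Section IV-B;
# the exact ranges below are assumptions.
def augment(image):
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image

# Adam settings and MSE loss as reported in Section IV-F.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9,
                                     beta_2=0.999, epsilon=1e-8)

# Sequence-preserving 80/20 split: shuffle whole sub-sequences (e.g., per-trip
# frame-index arrays) rather than individual frames, so temporal order inside
# each sequence is kept. `sequences` here is only a placeholder.
sequences = [np.arange(s, s + 200) for s in range(0, 2000, 200)]
rng = np.random.default_rng(0)
rng.shuffle(sequences)
split = int(0.8 * len(sequences))
train_seqs, val_seqs = sequences[:split], sequences[split:]

# The network itself (see the sketch after Table I) would then be compiled and
# trained with a minibatch size of 64, e.g.:
# model.compile(optimizer=optimizer, loss="mse")
# model.fit(train_data, validation_data=val_data, batch_size=64, epochs=15)
```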
Fig. 5 demonstrates how the value of the loss function changes as the number of epochs increases for both the training and validation data sets. The MSE loss decreases rapidly after the first few epochs and then remains stable.

Fig. 5. Training and validation steps for our best model with x = 8.

[Figure: RMSE vs. the number of images (x), for x between 0 and 20.]

A second local minimum (0.0444), which is almost the same as the value obtained at the training ∆t value, is also obtained at ∆t = 20. Because of those two local minima, we noticed that the change in error remains small inside the red area shown in the figure. However, the error does not increase evenly on both sides of the training value (∆t = 30), as most of the RMSE values within the red area remain on the left side of the training value.
TABLE II
COMPARISON TO RELATED WORK IN TERMS OF RMSE (a: x = 8, b: x = 10)

Model:       A [1]   B [15]  C [15]  D [23]  E [23]  F (ours)^a  G (ours)^b
Training     0.099   0.113   0.077   0.061   0.177   0.034       0.044
Validation   0.098   0.112   0.077   0.083   0.149   0.042       0.044

TABLE III
COMPARISON TO RELATED WORK IN TERMS OF MAE (a: x = 8, b: x = 10)

Model:       A [1]   D [23]  E [23]  F (ours)^a  G (ours)^b
Training     0.067   0.038   0.046   0.022       0.031
Validation   0.062   0.041   0.039   0.033       0.036
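For reference, the RMSE and MAE values reported in Tables II and III follow their standard definitions; a minimal way to compute them from per-frame steering predictions (illustrative only) is:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error over all predicted steering angles (radians)
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error; like RMSE, indifferent to the sign of the error
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```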
The individual prediction errors made by each model are plotted over each frame of the entire Udacity dataset in Fig. 9. There are a total of 33808 images in the dataset. The ground-truth for each frame is shown in Fig. 8.

Fig. 7. RMSE value vs. the number of frames ahead (∆t) over the validation data. The model is trained at ∆t = 30 and x = 10. Between the ∆t values 13 and 37 (the red area) the change in RMSE value remains small, and the algorithm almost yields the same minimum value at ∆t = 20.

Fig. 9. Individual error values (shown in radians) made at each time frame, plotted for four models: Model A, Model D, Model E and Model F. The dataset is the Udacity dataset. Ground-truth is shown in Fig. 8. The upper and lower red lines highlight the maximum and minimum errors made by each algorithm. The error for each frame (the y axis) for Models A, D and F is plotted in the range [-1.5, +1.2], and the error for Model E is plotted in the range [-4.3, +4.3].

REFERENCES

[2] Zhilu Chen and Xinming Huang. End-to-end learning for lane keeping of self-driving cars. In IEEE Intelligent Vehicles Symposium, Proceedings, 2017.
[3] Hesham M. Eraqi, Mohamed N. Moustafa, and Jens Honer. End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies. Oct 2017.
[4] Hossein Nourkhiz Mahjoub, Behrad Toghi, S M Osman Gani, and Yaser P. Fallah. V2X System Architecture Utilizing Hybrid Gaussian Process-based Model Structures. arXiv e-prints, page arXiv:1903.01576, Mar 2019.
[5] H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A stochastic hybrid framework for driver behavior modeling based on hierarchical dirichlet process. In 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), pages 1–5, Aug 2018.
[6] H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A driver behavior modeling structure based on non-parametric bayesian stochastic hybrid architecture. In 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), pages 1–5, Aug 2018.
[7] Mohamed Aly. Real time detection of lane markers in urban streets. In IEEE Intelligent Vehicles Symposium, Proceedings, 2008.
[8] Jose M. Alvarez, Theo Gevers, Yann LeCun, and Antonio M. Lopez. Road scene segmentation from a single image. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012.
[9] Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end learning of driving models from large-scale video datasets. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017.
[10] Wilko Schwarting, Javier Alonso-Mora, and Daniela Rus. Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems, 1:187–210, 2018.
[11] Nakul Agarwal, Abhishek Sharma, and Jieh Ren Chang. Real-time traffic light signal recognition system for a self-driving car. In Advances in Intelligent Systems and Computing, 2018.
[12] Bok Suk Shin, Xiaozheng Mou, Wei Mou, and Han Wang. Vision-based navigation of an unmanned surface vehicle with object detection and tracking abilities. Machine Vision and Applications, 2018.
[13] Alexander Amini, Guy Rosman, Sertac Karaman, and Daniela Rus. Variational end-to-end navigation and localization. arXiv preprint arXiv:1811.10119, 2018.
[14] Dean A. Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network. Advances in Neural Information Processing Systems, 1989.
[15] Shuyang Du, Haoli Guo, and Andrew Simpson. Self-Driving Car Steering Angle Prediction Based on Image Recognition. Technical report, 2017.
[16] Alexandru Gurghian, Tejaswi Koduri, Smita V. Bailur, Kyle J. Carey, and Vidya N. Murali. DeepLanes: End-To-End Lane Position Estimation Using Deep Neural Networks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2016.
[17] Johann Dirdal. End-to-end learning and sensor fusion with deep convolutional networks for steering an off-road unmanned ground vehicle. PhD thesis, 2018.
[18] Hao Yu, Shu Yang, Weihao Gu, and Shaoyu Zhang. Baidu driving dataset and end-to-end reactive control model. In IEEE Intelligent Vehicles Symposium, Proceedings, 2017.
[19] Hyunggi Cho, Young Woo Seo, B. V. K. Vijaya Kumar, and Ragunathan Raj Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In Proceedings - IEEE International Conference on Robotics and Automation, 2014.
[20] Daniel Gohring, Miao Wang, Michael Schnurmacher, and Tinosch Ganjineh. Radar/Lidar sensor fusion for car-following on highways. In ICARA 2011 - Proceedings of the 5th International Conference on Automation, Robotics and Applications, 2011.
[21] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.
[22] Klaus Greff, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2017.
[23] Dhruv Choudhary and Gaurav Bansal. Convolutional Architectures for Self-Driving Cars. Technical report, 2017.
[24] B. Toghi, M. Saifuddin, H. N. Mahjoub, M. O. Mughal, Y. P. Fallah, J. Rao, and S. Das. Multiple access in cellular v2x: Performance analysis in highly congested vehicular networks. In 2018 IEEE Vehicular Networking Conference (VNC), pages 1–8, Dec 2018.
[25] Behrad Toghi, Md Saifuddin, Yaser P. Fallah, and M. O. Mughal. Analysis of Distributed Congestion Control in Cellular Vehicle-to-everything Networks. arXiv e-prints, page arXiv:1904.00071, Mar 2019.
[26] Mhafuzul Islam, Mahsrur Chowdhury, Hongda Li, and Hongxin Hu. Vision-based Navigation of Autonomous Vehicle in Roadway Environments with Unexpected Hazards. arXiv preprint arXiv:1810.03967, 2018.
[27] Songtao Wu, Shenghua Zhong, and Yan Liu. ResNet. CVPR, 2015.
[28] Lauren A. Hannah. Stochastic Optimization. In International Encyclopedia of the Social & Behavioral Sciences: Second Edition. 2015.