IT Zabljak 2023

27th International Conference on Information Technology (IT)
Žabljak, 15 – 18 February, 2023
Vehicle Speed Estimation From Audio Signals Using 1D Convolutional Neural Networks
Ivana Čavor and Slobodan Djukanović
Ivana Čavor is with the Faculty of Maritime Studies Kotor, University of Montenegro, Kotor, Montenegro (e-mail: ivana.ca@ucg.ac.me).
Slobodan Djukanović is with the Faculty of Electrical Engineering, University of Montenegro, Podgorica, Montenegro (e-mail: slobdj@ucg.ac.me).
979-8-3503-9751-2/23/$31.00 ©2023 IEEE

Abstract— This paper presents an approach to acoustic vehicle speed estimation using audio data obtained from single-sensor measurements. A one-dimensional convolutional neural network (1D CNN) is used to estimate the vehicle’s speed directly from the raw audio signal. The proposed approach does not require manual feature extraction and can be trained directly on unprocessed time-domain signals. The VS13 dataset, which contains 400 audio-video recordings of 13 different vehicles, is used for training and testing of the proposed model. Two training procedures have been evaluated, one based on determining the optimal number of training epochs and the other based on recording the model state with minimal validation loss. The experimental results show that the average estimation error on VS13 is 9.65 km/h and 8.88 km/h, respectively.

I. INTRODUCTION

Intelligent transport systems have become a crucial part of the concept of smart cities, and the relevance of automatic traffic monitoring (TM) systems has therefore increased considerably [1]. Different types of sensors enable the collection of various data (text, sound, image, video) about traffic conditions and their real-time processing. Information regarding the number of vehicles on the roads, vehicle types, and the estimated speed and acceleration of vehicles is particularly useful for the enforcement of speed limits and for traffic safety improvement. For example, in the vicinity of speed cameras, the number of speeding vehicles and collisions was reduced by 35% and 25%, respectively [2].

Modern TM systems are mostly based on computer vision and enable the detection, tracking and classification of vehicles based on digital images [3]–[6]. However, despite their good performance, systems based on specialized cameras are more expensive, complex to install and maintain, and memory-intensive in terms of data processing and storage. Acoustic sensors are becoming a viable alternative to visual surveillance systems since they require significantly fewer resources. In this paper, we estimate the vehicle’s speed using the sound it makes as it passes by an acoustic sensor (microphone).

In recent years, various approaches for TM have been introduced. The spectral and spatial characteristics of audio signals are used in [7] for vehicle speed estimation, and the authors claim that, in single-microphone systems, signal frequency information based on the Doppler effect is not useful for estimating the vehicle’s speed. In [8], the speed and position of a vehicle are determined by analyzing the characteristic changes in the spectrogram. The authors of [9] consider several representations of audio signals and use 85 discriminative features to identify speed variations using different machine learning techniques. To estimate the vehicle’s speed in [10], time-frequency characteristics of the sound signal produced by the vehicle engine are used as input to a neural network.

A new speed-dependent analytical function, referred to as modified attenuation (MA), is proposed in [11]. A deep neural network is used to estimate the MA feature based on the log-mel spectrogram of the vehicle sound, and support vector regression is used to predict the vehicle’s speed. The MA feature is used in [12] to correct the exact moment when the vehicle passed by the sound sensor, resulting in a drop of the mean error of the estimated vehicle speed from 7.39 km/h to 6.92 km/h. Mel-frequency representations (mel-spectrogram, log-mel spectrogram, log-mel cepstral spectrogram) of the sound signal produced by the vehicle and a fully connected neural network are combined in [13]. It should be emphasized that each of the mentioned approaches employs some audio signal processing technique before speed calculation.

In this paper, we propose a vehicle speed estimation model based on 1D CNNs, which can automatically extract important features from raw audio signals. A brief introduction to 1D CNNs is provided in Section II, whereas the proposed model is described in Section III. Section III also briefly describes the VS13 dataset of audio-video recordings used for training and testing of the proposed model. Experimental results are presented in Section IV. Finally, Section V gives the conclusions and directions for future research.

II. ONE-DIMENSIONAL CONVOLUTIONAL NEURAL NETWORKS

A. 1D CNN overview

Deep learning models are developed based on the structure and information processing system of the human brain. While two-dimensional CNNs have found widespread use in computer vision tasks, 1D CNNs are primarily intended for processing one-dimensional sequences [14].

A CNN is a specific type of deep neural network that uses convolution layers, which are made up of a collection of convolutional filters or kernels. Typically, convolutional
filters are applied over a number of layers, each of which extracts characteristics of the considered signal at a particular abstraction level. CNNs enable reliable and effective signal analysis and classification through the learning of various feature space forms [15].

Compared with the traditional fully-connected neural networks (FCNNs), CNNs greatly reduce the network structure parameters, accelerate the training process of the model and reduce the risk of overfitting to a certain extent [16]. The fundamental advantage of CNNs is that they can be trained directly using raw temporal signals without the need for handcrafted feature engineering. In addition, as opposed to FCNNs, the patterns that CNNs learn are translation-invariant and they can learn spatial hierarchies of patterns, which allows them to efficiently learn increasingly complex and abstract concepts.

B. CNN architecture

The architecture of a CNN consists of four basic elements: a convolutional layer (CL), a pooling layer (PL), a flatten layer (FL) and a fully connected (FC) layer. The depth of the model depends on the number of connected layers, and typically the FC layers are the last in that series. Figure 1 depicts the basic blocks of the 1D CNN architecture.

Fig. 1. Illustration of the typical structure of a 1D CNN model.

The fundamental operation of a CNN is convolution, a linear operation that involves the multiplication of an input with convolution filters (a set of weights). A repeated application of the same filter to the input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input [17]. The convolutional layer involves the definition of a variety of hyperparameters, including stride, padding, type of the activation function, number and size of filters, etc. The activation function is very significant since it performs a non-linear transformation of the input, enabling the network to improve its representation ability. One of the most popular activation functions in deep learning is the Rectified Linear Unit (ReLU). The function returns 0 for negative inputs, but if the input is positive, it returns that positive value.

The pooling layer takes the feature maps produced by a convolution layer and extracts the dominant features. There are two common pooling strategies: max pooling and average pooling. Max pooling extracts windows from the input feature maps and returns the highest value in each window, whereas average pooling outputs the average value. In this manner, the number of learning parameters and the computation burden are reduced [18].

The flatten layer converts the input data into a one-dimensional feature vector that the FC layer can use. FC layers, also referred to as linear or dense layers, are frequently used in neural networks. They link every input neuron to every output neuron.

III. PROPOSED SPEED ESTIMATION METHOD

A. Dataset

The proposed 1D CNN model is evaluated on the VS13 dataset. The dataset contains thirteen vehicles, selected to be as diverse as possible in terms of manufacturer, production year, engine type, power and transmission, resulting in a total of 400 annotated audio-video recordings [19]. Recorded speeds in VS13 range from 30 km/h to 110 km/h, as listed in Table I in [19].

From the original audio-video recordings, 10-second-long audio sequences in WAV format with a sampling rate of 44 100 Hz were extracted. The speed and camera pass-by instant of the considered vehicles are recorded in the annotation data. We consider audio segments taken 3 seconds before and 3 seconds after the camera pass-by instant. Therefore, the input to the proposed 1D CNN model is a one-dimensional array of length 6 × 44 100 = 264 600, as shown in Table I.

TABLE I
THE PROPOSED 1D CNN ARCHITECTURE

Layer         # Filters   Kernel Size   Pool Size   Stride   Output Shape
Input         –           –             –           –        264600
Conv1D        32          32            –           16       32 × 16536
Conv1D        32          8             –           4        32 × 4133
AveragePool   –           –             2           2        32 × 2066
Conv1D        64          2             –           2        64 × 1033
Conv1D        64          2             –           2        64 × 516
AveragePool   –           –             2           2        64 × 258
Conv1D        96          2             –           2        96 × 129
Conv1D        96          2             –           2        96 × 64
AveragePool   –           –             2           2        96 × 32
Conv1D        128         2             –           2        128 × 16
Conv1D        128         2             –           2        128 × 8
Flatten       –           –             –           –        1024
Dense         –           –             –           –        100
Dense         –           –             –           –        10
Output        –           –             –           –        1

B. 1D CNN architecture

The raw sound signal produced by the vehicle is used as input to the proposed 1D CNN model. The architecture of the model, presented in Table I, is made up of 8 convolutional
layers. The first convolutional layer uses 32 kernels of size 32 in order to extract dominant features and suppress high-frequency noise [20]. Afterwards, the kernel size and the stride are decreased down the layers. In order to gain better feature representation, successive small kernels of size 2 are used. Average pooling is applied right after two consecutive convolutional layers to reduce the size of the feature space and learn location-invariant signal characteristics. After flattening the resulting convolutional output, we stack two fully connected layers, one with 100 and the other with 10 neurons. The output of the network is one neuron, which gives the speed estimate in kilometers per hour (km/h). All convolutional layers in the model use the ReLU activation function, whereas the fully connected layers have no activation applied.

C. Training procedures

In this paper, we compare two different training approaches for the proposed 1D CNN model. We use the Adam optimizer with a learning rate of 0.001. Since the dataset is relatively small, we use iterated K-fold cross-validation (CV) with shuffling, in order to evaluate our model as precisely as possible and thus ensure better generalization of the model. We repeat K-fold CV M times, shuffling the dataset before splitting it into K folds. More precisely, we use K = 4 and M = 10, i.e., we repeat 4-fold cross-validation 10 times, making a total of K × M = 40 separate model trainings. Due to shuffling, the validation dataset in each training differs from those in all other trainings. In both training strategies, in one CV round, one fold (vehicle) is held out for testing, whereas the remaining twelve are used for training and validating the model. The model is implemented in Keras.

In the first approach, we record the training and validation losses after each training-validation split, in each repetition, in order to average the training and validation losses over all repetitions. Figure 2 shows the training and validation losses for Kia Sportage. A similar trend of losses is obtained for the other vehicles in VS13. Due to large errors at the beginning of the training, we limit the y-axis to values below 200, so that we can clearly see the trend of the losses. It can be noted that the training error decreases over the epochs, as expected. On the other hand, the validation error decreases until it reaches its minimum (optimal value). From that point on, the model starts to overfit. In Fig. 2, the dashed green line denotes the optimal epoch. We use this epoch to stop the training process before overfitting and achieve better generalization on the test data.

Fig. 2. Training and validation losses for Kia Sportage.

The second training approach focuses on finding a model with optimal parameters in terms of minimal validation loss. That is, during each model training (out of 40 trainings), the validation loss is monitored so that the model weights producing the minimal validation loss are saved (implemented by the ModelCheckpoint callback in Keras). The model obtained in this way is compared with the currently best model in terms of validation loss. If it outperforms the currently best model, it becomes the currently best model. Otherwise, we discard it. This procedure yields the model with minimal validation loss over 40 trainings with different combinations of files used for validation.

IV. EXPERIMENTAL RESULTS

Vehicle speed estimation is considered a regression problem. In regression tasks, the root mean squared error (RMSE) is typically used to measure how well the model fits a dataset. The RMSE can be calculated as

$$\mathrm{RMSE}(y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2}, \qquad (1)$$

where $N$ represents the number of speed measurements, $y_i$ is the vehicle’s actual speed in the $i$-th measurement and $f(x_i)$ is its corresponding prediction.

The speed estimation performance of the considered training approaches, given in Section III-C, is compared, and the RMSEs of the speed estimates are presented in Table II, per vehicle and on average (bottom row). The average speed estimation errors are 9.65 km/h and 8.88 km/h for the first approach (optimal epoch) and the second approach (model with minimal validation loss), respectively.

It can be seen that the speed estimation for Renault Scenic is significantly worse than for the other vehicles in both approaches. This is expected since the Renault Scenic recording session took place on a windy day, and the strong wind corrupted the sound of the vehicle [19]. On the other hand, Peugeot 208 has the lowest speed estimation error in both cases.

Compared with [11], which yields RMSE = 7.39 km/h, obtained on a subset of 10 out of 13 vehicles from VS13, our second approach gives a notably higher RMSE. On the other hand, the advantage of our approach is that it does not require any human-crafted feature, such as the MA feature introduced in [11]. The 1D CNN model proposed herein can be trained end-to-end, which is a considerable advantage with respect to [11]. The obtained results are promising and set a baseline for future research.

V. CONCLUSIONS

We proposed an end-to-end 1D CNN model for vehicle speed estimation which uses the raw sound signal and gives an average estimation error of 8.88 km/h on the VS13
dataset. The main limitation of the proposed method for sound-based speed estimation is the size of the training dataset. In the context of deep learning, a set of 400 samples is too small to achieve reliable results. Cross-validation techniques can help to a certain extent to overcome this limitation; however, a sufficient increase of the dataset is still necessary to make substantial progress. In this sense, one of the future research directions will be the extension of the dataset with data augmentation techniques. We will also consider enlarging the dataset by adding samples from another dataset, used in earlier research on vehicle counting [21], [22]. Those samples could be annotated using vision-based speed estimation methods.

TABLE II
RMSEs OF SPEED ESTIMATION

Vehicle              RMSE1 [km/h]   RMSE2 [km/h]
Citroen C4 Picasso   7.83           7.15
Kia Sportage         11.38          7.93
Mazda 3              9.22           8.48
Mercedes AMG550      10.44          8.50
Mercedes GLA         11.32          8.24
Nissan Qashqai       10.17          5.53
Opel Insignia        8.00           9.75
Peugeot 208          5.81           4.71
Peugeot 3008         10.70          10.25
Peugeot 307          9.49           8.09
Renault Captur       7.94           8.23
Renault Scenic       15.38          19.96
VW Passat            7.85           8.62
Average              9.65           8.88

REFERENCES

[1] G. Radivojević, G. Šormaz, and B. Lazić, “Trends in development and integration of ITS,” Journal of Road and Traffic Engineering, vol. 66, no. 3, pp. 33–40, Oct. 2020. [Online]. Available: https://www.putisaobracaj.rs/index.php/PiS/article/view/151
[2] C. Wilson, C. Willis, J. K. Hendrikz, R. Le Brocque, and N. Bellamy, “Speed cameras for the prevention of road traffic injuries and deaths,” Cochrane database of systematic reviews, no. 11, 2010.
[3] J. Kurniawan, S. G. Syahra, C. K. Dewa et al., “Traffic congestion detection: learning from CCTV monitoring images using convolutional neural network,” Procedia computer science, vol. 144, pp. 291–297, 2018.
[4] H. Dong, M. Wen, and Z. Yang, “Vehicle speed estimation based on 3d convnets and non-local blocks,” Future Internet, vol. 11, no. 6, p. 123, 2019.
[5] A. Yabo, S. I. Arroyo, F. G. Safar, and D. Oliva, “Vehicle classification and speed estimation using computer vision techniques,” in XXV Congreso Argentino de Control Automático (AADECA 2016)(Buenos Aires, 2016), 2016.
[6] M. Yan, M. Li, H. He, and J. Peng, “Deep learning for vehicle speed prediction,” Energy Procedia, vol. 152, pp. 618–623, 2018.
[7] H. V. Koops and F. Franchetti, “An ensemble technique for estimating vehicle speed and gear position from acoustic data,” in 2015 IEEE International Conference on Digital Signal Processing (DSP). IEEE, 2015, pp. 422–426.
[8] S. Barnwal, R. Barnwal, R. Hegde, R. Singh, and B. Raj, “Doppler based speed estimation of vehicles using passive sensor,” in 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE, 2013, pp. 1–4.
[9] E. Kubera, A. Wieczorkowska, A. Kuranc, and T. Słowik, “Discovering speed changes of vehicles from audio data,” Sensors, vol. 19, no. 14, p. 3067, 2019.
[10] J. Giraldo-Guzmán, A. G. Marrugo, and S. H. Contreras-Ortiz, “Vehicle speed estimation using audio features and neural networks,” in 2016 IEEE ANDESCON. IEEE, 2016, pp. 1–4.
[11] S. Djukanović, J. Matas, and T. Virtanen, “Acoustic vehicle speed estimation from single sensor measurements,” IEEE Sensors Journal, vol. 21, no. 20, pp. 23 317–23 324, 2021.
[12] N. Bulatović and S. Djukanović, “An approach to improving sound-based vehicle speed estimation,” in 2022 Zooming innovation in consumer technologies conference (ZINC 2022). IEEE, 2022.
[13] N. Bulatović and S. Djukanović, “Mel-spectrogram features for acoustic vehicle detection and speed estimation,” in 2022 26th International Conference on Information Technology (IT). IEEE, 2022, pp. 1–4.
[14] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, “1d convolutional neural networks and applications: A survey,” Mechanical systems and signal processing, vol. 151, p. 107398, 2021.
[15] LJ. Stanković and D. Mandić, “Convolutional neural networks demystified: A matched filtering perspective-based tutorial,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
[16] C. Du, R. Zhong, Y. Zhuo, X. Zhang, F. Yu, F. Li, Y. Rong, and Y. Gong, “Research on fault diagnosis of automobile engines based on the deep learning 1d-CNN method,” Engineering Research Express, vol. 4, no. 1, p. 015003, 2022.
[17] J. Brownlee, “How do convolutional layers work in deep learning neural networks,” Machine Learning Mastery, vol. 17, 2020.
[18] S. Sharma and R. Mehra, “Implications of pooling strategies in convolutional neural networks: A deep insight,” Foundations of Computing and Decision Sciences, vol. 44, no. 3, pp. 303–330, 2019.
[19] S. Djukanović, N. Bulatović, and I. Čavor, “A dataset for audio-video based vehicle speed estimation,” in 2022 30th Telecommunications Forum (TELFOR). IEEE, 2022, pp. 1–4.
[20] W. Zhang, G. Peng, C. Li, Y. Chen, and Z. Zhang, “A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals,” Sensors, vol. 17, no. 2, p. 425, 2017.
[21] S. Djukanović, J. Matas, and T. Virtanen, “Robust audio-based vehicle counting in low-to-moderate traffic flow,” in 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1608–1614.
[22] S. Djukanović, Y. Patel, J. Matas, and T. Virtanen, “Neural network-based acoustic vehicle counting,” in 29th European Signal Processing Conference (EUSIPCO 2021), 2021.
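
As a check on the layer dimensions in Table I, the output lengths can be recomputed with the usual formula for an unpadded 1D convolution or pooling layer, floor((L − k)/s) + 1. This is a sketch and not the authors' code: the paper does not state the padding mode, so the no-padding assumption and the helper names `out_len` and `table_shapes` are ours, chosen because they reproduce every row of the table.

```python
# Sketch: reproduce the "Output Shape" column of Table I.
# Assumption (not stated in the paper): no padding is used in any layer.

FS = 44_100                  # sampling rate in Hz (Section III-A)
SEGMENT_S = 6                # 3 s before + 3 s after the pass-by instant
INPUT_LEN = FS * SEGMENT_S   # 264600 samples

# (layer type, number of filters or None, window size, stride), as in Table I
LAYERS = [
    ("Conv1D", 32, 32, 16),
    ("Conv1D", 32, 8, 4),
    ("AveragePool", None, 2, 2),
    ("Conv1D", 64, 2, 2),
    ("Conv1D", 64, 2, 2),
    ("AveragePool", None, 2, 2),
    ("Conv1D", 96, 2, 2),
    ("Conv1D", 96, 2, 2),
    ("AveragePool", None, 2, 2),
    ("Conv1D", 128, 2, 2),
    ("Conv1D", 128, 2, 2),
]


def out_len(length, window, stride):
    """Output length of an unpadded 1D convolution or pooling layer."""
    return (length - window) // stride + 1


def table_shapes(input_len=INPUT_LEN):
    """Return a list of (channels, length) pairs, one per layer."""
    shapes = []
    channels, length = 1, input_len
    for _kind, filters, window, stride in LAYERS:
        length = out_len(length, window, stride)
        if filters is not None:   # pooling keeps the channel count
            channels = filters
        shapes.append((channels, length))
    return shapes


shapes = table_shapes()
for (kind, *_), (c, n) in zip(LAYERS, shapes):
    print(f"{kind:12s} {c} x {n}")
print("Flatten", shapes[-1][0] * shapes[-1][1])
```

The last convolutional output, 128 × 8, flattens to the 1024 features listed in the table.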

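The iterated K-fold scheme of Section III-C (K = 4, M = 10, 40 trainings in total) can be sketched as follows. The function name, the fixed seed and the use of integer file indices are illustrative assumptions; the paper only states that the data are shuffled before each split into K folds.

```python
import random


def iterated_kfold(n_items, k=4, m=10, seed=0):
    """Yield (train_idx, val_idx) for M shuffled repetitions of K-fold CV."""
    rng = random.Random(seed)      # the seed is an arbitrary choice
    indices = list(range(n_items))
    for _ in range(m):
        rng.shuffle(indices)       # reshuffle before every repetition
        folds = [indices[i::k] for i in range(k)]
        for j in range(k):
            val = sorted(folds[j])
            train = sorted(
                x for i, fold in enumerate(folds) if i != j for x in fold
            )
            yield train, val


# Hypothetical example: 100 files available for training and validation
splits = list(iterated_kfold(n_items=100))
print(len(splits))  # 40
```

In each repetition every file appears in exactly one validation fold, so the 40 trainings differ only in how the files are divided between training and validation.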

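Both model selection rules of Section III-C reduce to simple operations on recorded validation losses. The sketch below assumes that the per-epoch validation losses of each training are available as plain lists; the function names and the toy numbers are hypothetical and are not results from the paper.

```python
def optimal_epoch(val_curves):
    """First approach: average the validation loss over all trainings
    and return the (1-based) epoch where the average is smallest."""
    n_epochs = len(val_curves[0])
    mean_curve = [
        sum(curve[e] for curve in val_curves) / len(val_curves)
        for e in range(n_epochs)
    ]
    best = min(range(n_epochs), key=mean_curve.__getitem__)
    return best + 1


def best_training(val_curves):
    """Second approach: each training keeps its minimum-validation-loss
    state (what a checkpoint callback would save); return the index of
    the training whose saved state has the lowest loss, and that loss."""
    minima = [min(curve) for curve in val_curves]
    idx = min(range(len(minima)), key=minima.__getitem__)
    return idx, minima[idx]


# Toy validation-loss curves for three trainings (made-up values)
curves = [
    [90.0, 60.0, 50.0, 55.0],
    [80.0, 70.0, 40.0, 65.0],
    [95.0, 65.0, 60.0, 58.0],
]
print(optimal_epoch(curves))   # 3
print(best_training(curves))   # (1, 40.0)
```

The first rule fixes a single stopping epoch for the final training, whereas the second keeps the weights of one specific training run.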

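Equation (1) and the bottom row of Table II can be checked with a few lines of code. The per-vehicle values below are copied from Table II; the `rmse` helper and the two-sample example are illustrative only.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error, as in Eq. (1)."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)


# Per-vehicle RMSEs in km/h, copied from Table II (same row order)
RMSE1 = [7.83, 11.38, 9.22, 10.44, 11.32, 10.17, 8.00,
         5.81, 10.70, 9.49, 7.94, 15.38, 7.85]
RMSE2 = [7.15, 7.93, 8.48, 8.50, 8.24, 5.53, 9.75,
         4.71, 10.25, 8.09, 8.23, 19.96, 8.62]

mean1 = sum(RMSE1) / len(RMSE1)
mean2 = sum(RMSE2) / len(RMSE2)
print(f"{mean1:.3f} {mean2:.3f}")

# Made-up example: predictions of 52 and 47 km/h for true speeds of 50 km/h
print(rmse([52.0, 47.0], [50.0, 50.0]))
```

The computed column means agree with the reported averages of 9.65 km/h and 8.88 km/h to within 0.01 km/h.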