Abstract— This paper presents an approach to acoustic vehicle speed estimation using audio data obtained from single-sensor measurements. A one-dimensional convolutional neural network (1D CNN) is used to estimate the vehicle’s speed directly from the raw audio signal. The proposed approach does not require manual feature extraction and can be trained directly on unprocessed time-domain signals. The VS13 dataset, which contains 400 audio-video recordings of 13 different vehicles, is used for training and testing of the proposed model. Two training procedures have been evaluated and tested, one based on determining the optimal number of training epochs and the other based on recording the model state with the minimal validation loss. The experimental results show that the average estimation error on VS13 is 9.50 km/h and 8.88 km/h, respectively.

Ivana Čavor is with the Faculty of Maritime Studies Kotor, University of Montenegro, Kotor, Montenegro (e-mail: ivana.ca@ucg.ac.me)
Slobodan Djukanović is with the Faculty of Electrical Engineering, University of Montenegro, Podgorica, Montenegro (e-mail: slobdj@ucg.ac.me)
979-8-3503-9751-2/23/$31.00 ©2023 IEEE

I. INTRODUCTION

Intelligent transport systems have become a crucial part of the concept of smart cities; therefore, the relevance of automatic traffic monitoring (TM) systems has increased considerably [1]. Different types of sensors enable the collection of various data (text, sound, image, video) about traffic conditions and their real-time processing. Information regarding the number of vehicles on the roads, vehicle types, and the estimated speed and acceleration of vehicles is particularly useful for the enforcement of legal regulations for speed limits and for traffic safety improvement. For example, in the vicinity of speed cameras, the numbers of speeding vehicles and collisions were reduced by 35% and 25%, respectively [2].

Modern TM systems are mostly based on computer vision and enable the detection, tracking and classification of vehicles based on digital images [3]–[6]. However, despite their good performance, systems based on specialized cameras are more expensive, complex to install and maintain, and memory-intensive in terms of data processing and storage. Acoustic sensors are becoming a viable alternative to visual surveillance systems since they require significantly fewer resources. In this paper, we estimate the vehicle’s speed using the sound it makes as it passes by an acoustic sensor (microphone).

In recent years, various approaches for TM have been introduced. The spectral and spatial characteristics of audio signals are used in [7] for vehicle speed estimation, and the authors claim that in single-microphone systems, signal frequency information based on the Doppler effect is not useful for estimating the vehicle’s speed. In [8], the speed and position of the vehicle are determined by analyzing the characteristic changes in the spectrogram. The authors of [9] consider several representations of audio signals and use 85 discriminative features to identify speed variations using different machine learning techniques. In [10], time-frequency characteristics of the sound signal produced by the vehicle engine are used as input to a neural network to estimate the vehicle’s speed.

A new speed-dependent analytical function, referred to as modified attenuation (MA), is proposed in [11]. A deep neural network is used to estimate the MA feature based on the log-mel spectrogram of the vehicle sound, and support vector regression is used to predict the vehicle’s speed. The MA feature is used in [12] to correct the exact moment when the vehicle passed by the sound sensor, resulting in a drop of the mean error of the estimated vehicle speed from 7.39 km/h to 6.92 km/h. Mel-frequency representations (mel-spectrogram, log-mel spectrogram, log-mel cepstral spectrogram) of the sound signal produced by the vehicle and a fully connected neural network are combined in [13]. It should be emphasized that each of the mentioned approaches employs some audio signal processing technique before speed calculation.

In this paper, we propose a vehicle speed estimation model based on 1D CNNs, which can automatically extract important features from raw audio signals. A brief introduction to 1D CNNs is provided in Section II, whereas the proposed model is described in Section III. Section III also briefly describes the VS13 dataset of audio-video recordings used for training and testing of the proposed model. Experimental results are presented in Section IV. Finally, Section V gives the conclusions and directions for future research.

II. ONE-DIMENSIONAL CONVOLUTIONAL NEURAL NETWORKS

A. 1D CNN overview

Deep learning models are developed based on the structure and information processing system of the human brain. While two-dimensional CNNs have found widespread use in computer vision tasks, 1D CNNs are primarily intended for processing one-dimensional sequences [14].

A CNN is a specific type of deep neural network that uses convolution layers, which are made up of a collection of convolutional filters or kernels. Typically, convolutional
filters are applied over a number of layers, each of which extracts characteristics of the considered signal at a particular abstraction level. CNNs enable reliable and effective signal analysis and classification through the learning of various feature space forms [15].

Fig. 1. Illustration of the typical structure of a 1D CNN model.

Compared with the traditional fully-connected neural networks (FCNNs), CNNs greatly reduce the number of network structure parameters, accelerate the training process of the model and reduce the risk of overfitting to a certain extent [16]. The fundamental advantage of CNNs is that they can be trained directly using raw temporal signals without the need for handcrafted feature engineering. In addition, as opposed to FCNNs, the patterns that CNNs learn are translation-invariant and they can learn spatial hierarchies of patterns, which allows them to efficiently learn increasingly complex and abstract concepts.

... deep learning is the Rectified Linear Unit (ReLU). The function returns 0 for negative inputs, but if the input is positive, it returns that positive value.

The pooling layer takes the feature maps produced by a convolution layer and extracts the dominant features. There are two common pooling strategies: max pooling and average pooling. Max pooling extracts windows from the input feature maps and returns the highest value in each window, whereas average pooling outputs the average value. In this manner, the number of learning parameters and the computation burden are reduced [18].

The flatten layer converts the input data into a one-dimensional feature vector that the FC layer can use. In neural networks, FC layers, also referred to as linear or dense layers, are frequently used. They link every input neuron to every output neuron.

TABLE I
THE PROPOSED 1D CNN ARCHITECTURE

III. PROPOSED SPEED ESTIMATION METHOD

A. Dataset