Computers and Electrical Engineering

JID: CAEE
ARTICLE IN PRESS [m3Gsc;January 4, 2018;4:31]
Computers and Electrical Engineering 000 (2018) 1–16
Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/compeleceng
Human behavior characterization for driving style recognition in vehicle system☆
Fabio Martinelli a, Francesco Mercaldo a, Albina Orlando b, Vittoria Nardone c, Antonella Santone d, Arun Kumar Sangaiah e,∗
a Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
b Istituto per le Applicazioni del Calcolo “M. Picone”, Consiglio Nazionale delle Ricerche, Napoli, Italy
c Department of Engineering, University of Sannio, Benevento, Italy
d Department of Bioscience and Territory, University of Molise, Pesche (IS), Italy
e School of Computing Science and Engineering, VIT University, Vellore 632014, India
Article info
Article history: Received 12 September 2017; Revised 29 December 2017; Accepted 29 December 2017; Available online xxx
Keywords: CAN; OBD; Authentication; Machine learning; Supervised learning; Automotive

Abstract
Despite the development of new technologies to prevent the stealing of cars, the number of car thefts is sharply increasing. With the advent of electronics, new ways to steal cars were found. In order to avoid auto-theft attacks, in this paper we propose a machine learning based method to silently and continuously profile the driver by analyzing built-in vehicle sensors. We consider a dataset composed of 51 different features extracted from 10 different drivers, evaluating the efficiency of the proposed method in driver identification. We also find the most relevant features able to discriminate the car owner from an impostor. We obtain a precision and a recall equal to 99% evaluating a dataset containing data extracted from a real vehicle.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Car theft is increasing in every area of the globe, and the phenomenon shows no sign of stopping.
As a matter of fact, car theft appeared to be growing in the United States in 2016. While burglary and larceny theft were down by 10% and 3% respectively, car theft was up by 1%. Despite declines in property crime overall, car theft was on the rise early last year, according to new national crime figures from the FBI. The number of stolen car cases rose 1% in the first half of 2015 compared to the same period of the year before¹, the latest FBI Uniform Crime Report says². In the United States, a car is stolen every 45 s. California has consistently led the United States in motor vehicle thefts, both in total vehicles stolen and in thefts per capita.
In recent years, cars have been equipped with many on-board computers, exposing them to new types of attacks [1–3]. As a matter of fact, the operating systems running on cars are exposed to bugs and vulnerabilities [4].
☆ Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. S. H. Ahmed.
∗ Corresponding author.
E-mail addresses: fabio.martinelli@iit.cnr.it (F. Martinelli), francesco.mercaldo@iit.cnr.it (F. Mercaldo), a.orlando@iac.cnr.it (A. Orlando), vnardone@unisannio.it (V. Nardone), antonella.santone@unimol.it (A. Santone), sarunkumar@vit.ac.in (A.K. Sangaiah).
¹ http://www.usatoday.com/story/money/cars/2016/02/16/car-theft-rate-starts-rise/80445860/.
² https://ucr.fbi.gov/.
https://doi.org/10.1016/j.compeleceng.2017.12.050
0045-7906/© 2017 Elsevier Ltd. All rights reserved.
Please cite this article as: F. Martinelli et al., Human behavior characterization for driving style recognition in vehicle
system, Computers and Electrical Engineering (2018), https://doi.org/10.1016/j.compeleceng.2017.12.050
Software connected to the Internet opens the door to real-time traffic navigation, intelligent fleet management, car-sharing and autonomous driving, but this scenario also creates a plethora of new theft possibilities [5]. Thus, an increasing number of top-of-the-range vehicles are being stolen every day by thieves who simply drive off after bypassing security devices by hacking on-board computers³. For instance, car thieves using “keyless” techniques are estimated to have stolen more than 6000 vehicles in London during 2014, almost half of all cars and vans stolen, while top-of-the-range BMWs and Range Rovers, as well as Ford Fiestas, Transit and Mercedes Sprinter vans, make up 70% of all vehicles stolen in this way.
The most common technique involves breaking into the vehicle and plugging a laptop into the hidden diagnostic socket used by garages to detect and solve faults: once connected, the thieves can access the vehicle’s electronic information, allowing them to drive it away.
Since cars have evolved to include on-board computers, other techniques consist in tricking owners into installing malicious software on the smartphone they use as a door key, in order to make the door open.
Recently, BMW patched the ConnectedDrive system because researchers showed that it was possible to obtain wireless access to the air conditioning and door locks of cars⁴: as cars become more and more intelligent, the cyber-attacks targeting them also become more intelligent in order to exploit the new possibilities opened by the technological trend.
Starting from these considerations, in this paper we propose a method to detect car theft using machine learning techniques.
Our method makes it possible to silently and continuously verify the identity of the car owner. The main idea is to explore whether different driving styles can be discriminated by considering a feature set related to the vehicle. To address this issue, we adopt machine learning with the aim of inferring, from a set of vehicle-related features, behavioral characteristics that discriminate between different drivers. Silent and continuous driver identification authenticates the owner while he/she is driving, overcoming a limitation of existing anti-theft systems: once the attacker is able to unlock the door and start the engine, the attacker has full access to the vehicle. Other actors that can benefit from continuous and silent driver identification are insurance companies: new insurance paradigms, such as “usage-based insurance”, are emerging. Basically, usage-based insurance, also known as pay-as-you-drive, pay-how-you-drive and mile-based auto insurance, is a type of vehicle insurance whereby the costs depend on the type of vehicle used, measured against time, distance, behavior and place [6].
We define the driver profile by merging together information about his/her behavior. More precisely, using well-known machine learning algorithms, we classify the set of features obtained from a real car driven in a real environment, in order to test the effectiveness of the extracted features.
The paper poses the following research question:
• is it possible to characterize driver behavior through a set of features generated by the driver while he/she is driving?
The main advantages of our method are:
• the features can be captured using the car built-in sensors without additional hardware;
• the features can be gathered with a good degree of precision and are not influenced by external factors (for instance, noise or air impurity);
• the features can be collected while the user is driving the car: the driver is not required to provide any image or voice sample (this is the reason why the method is called silent);
• the performance obtained is significantly better than that reported in the literature, using a lower number of features.
The paper proceeds as follows: Section 2 discusses related work; Section 3 introduces preliminaries on the Controller Area Network (CAN) protocol and on the machine learning techniques; Section 4 describes and motivates the detection method in detail; Section 5 illustrates the results of the experiments; finally, conclusions are drawn in Section 6.
2. Related work
In the following we review the current literature related to driving style recognition.
In the past, the retrieval of real-world automotive data was limited by the difficulty of equipping cars with sensors. With the introduction of CAN, this limit has been overcome.
Wakita et al. [7] propose a driver identification method based on the driving behavior signals observed while the driver is following another vehicle. The analyzed signals, such as accelerator pedal, brake pedal, vehicle velocity, and distance from the vehicle in front, were measured using a driving simulator. The identification rates were 81% for twelve drivers using a driving simulator and 73% for thirty drivers.
Data from the accelerator and the steering wheel were analyzed by researchers in [8]. Observing the considered features, they employ a Hidden Markov model (HMM) to model the driver characteristics. They build two models for each driver, one trained from accelerator data and one learned from steering wheel angle data. The models can be used to identify different drivers with an accuracy equal to 85%.
³ http://www.dailymail.co.uk/news/article-2938793/Car-hackers-driving-motors-Increasing-numbers-stolen-thieves-simply-bypass-security-devices.html.
⁴ http://www.reuters.com/article/bmw-cybersecurity-idUSL6N0V92VD20150130.
Researchers in [9] classify a set of features extracted from the power-train signals of the vehicle, showing that their classifier is able to recognize the human driving style based on the power demands placed on the vehicle power-train with an overall accuracy equal to 77%.
Van Ly et al. [10] explore the possibility of using the inertial sensors of the vehicle, read from the CAN bus, to build a profile of the driver, observing how well braking and turning events characterize an individual compared to acceleration events.
Researchers in [11] model gas and brake pedal operation patterns with Gaussian mixture model (GMM). They achieve an
identification rate equal to 89.6% for a driving simulator and 76.8% for a field test with 276 drivers, resulting in 61% and 55%
error reduction, respectively, over a driver model based on raw pedal operation signals without spectral analysis.
Driver behavior is described and modeled in [12] using data from steering wheel angle, brake status, acceleration status,
and vehicle speed through Hidden Markov models (HMMs) and GMMs employed to capture the sequence of driving char-
acteristics acquired from the CAN bus information. They obtain 69% accuracy for action classification, and 25% accuracy for
driver identification.
In [13] the features extracted from the accelerator and brake pedal pressure are used as inputs to a Fuzzy Neural Network (FNN) system to ascertain the identity of the driver. Two fuzzy neural networks, namely the Evolving Fuzzy Neural Network (EFuNN) and the Adaptive Network-based Fuzzy Inference System (ANFIS), are used to demonstrate the viability of the two proposed feature extraction techniques.
Kwak et al. [14] propose a method based on the driving pattern of the car. They consider mechanical features from the vehicle CAN bus, evaluating them with four different classification algorithms and obtaining an accuracy equal to 0.939 with Decision Tree, 0.844 with k-Nearest Neighbor (KNN), 0.961 with RandomForest and 0.747 with the Multilayer Perceptron (MLP) algorithm.
3. Background
In this section we provide preliminaries about the Controller Area Network (CAN) and the On Board Diagnostic System
(OBD-II) protocols and the machine learning algorithms employed in the paper.
3.1. The CAN and the OBD-II protocols
Modern automobiles contain a plethora of electronic devices (for instance, the centralized locking system, the air conditioning control, the traction control and the anti-lock braking system). Letting these devices communicate through dedicated connections would, given their growing complexity, lead to an unsustainable increase in the number of connections.
Each electronic device inside the vehicle is able to communicate with neighboring components, generating large amounts of real-time data [1]. As a matter of fact, modern automobiles contain upwards of 50 electronic control units (the so-called ECUs) networked together [5].
These are the reasons why the Controller Area Network (CAN) protocol⁵ was introduced in the early 1980s by Robert Bosch GmbH⁶, in order to allow communication between ECUs at a speed of up to 1 Mbit/s.
Basically, the CAN protocol defines a generic communication standard over a bus, and it is placed at levels 1 and 2 (physical and data link) of the ISO/OSI stack: the ECUs communicate with one another by sending CAN packets. These packets are broadcast to all the components on the bus, and each component decides whether a packet is intended for it, although segmented CAN networks do exist.
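The broadcast-and-filter behavior described above can be illustrated with a minimal sketch. It is not taken from the study: the class names, the ECU names and the identifiers 0x100 and 0x200 are hypothetical, and the only CAN properties assumed are the classic 11-bit identifier and the payload of at most 8 bytes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class CanFrame:
    """A simplified classic CAN frame: an 11-bit identifier and up to 8 data bytes."""
    can_id: int
    data: bytes

    def __post_init__(self):
        if not 0 <= self.can_id <= 0x7FF:
            raise ValueError("standard CAN identifiers are 11 bits wide")
        if len(self.data) > 8:
            raise ValueError("a classic CAN frame carries at most 8 data bytes")


@dataclass
class Ecu:
    """A node that keeps only the frames whose identifier it is interested in."""
    name: str
    accepted_ids: Set[int]
    inbox: List[CanFrame] = field(default_factory=list)

    def on_frame(self, frame: CanFrame) -> None:
        # Every node sees every frame; the filter decides whether to use it.
        if frame.can_id in self.accepted_ids:
            self.inbox.append(frame)


class CanBus:
    """A shared bus: a frame sent by one node is delivered to all the others."""

    def __init__(self) -> None:
        self.nodes: List[Ecu] = []

    def attach(self, ecu: Ecu) -> None:
        self.nodes.append(ecu)

    def send(self, sender: Ecu, frame: CanFrame) -> None:
        for node in self.nodes:
            if node is not sender:
                node.on_frame(frame)
```

With three hypothetical nodes attached, a frame sent with identifier 0x100 reaches every node, but only the node whose filter contains 0x100 stores it.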
In practice, using the CAN protocol the ECU “A” is able to send data to the ECU “B”, but this is not enough to realize the communication: ECU “B” must also be able to recognize and use the data received from ECU “A”, so something is needed that makes the two electronic control units “speak the same language”. For this reason the OBD-II (On Board Diagnostics) standard [15] was introduced, in order to define a common language that makes the various ECUs able to communicate.
The OBD-II protocol is more specific than CAN and was created specifically as a standard for vehicles; it has been mandatory in America for vehicles manufactured since 1996. The European version of this standard is called European On Board Diagnostics (EOBD); it is substantially identical to OBD-II and has been mandatory since 2001⁷.
The CAN and the OBD-II protocols cooperate in the following way: first, the OBD-II message is generated; then this message becomes the content of a CAN message that is sent on the bus. The ECU that receives the CAN message performs the reverse process: it unpacks the CAN message, extracts the original OBD-II message and processes it.
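The encapsulation just described can be sketched as follows. The details are assumptions that do not come from this paper: they follow the common convention for OBD-II over CAN, in which a request is sent with the functional identifier 0x7DF, the first payload byte counts the bytes that follow, service 0x01 asks for current data, and a positive response echoes the service plus 0x40. The two decoded parameters (engine speed, PID 0x0C, and vehicle speed, PID 0x0D) and the zero padding byte are illustrative choices.

```python
from typing import Tuple

OBD_REQUEST_ID = 0x7DF       # functional request identifier (assumed convention)
SHOW_CURRENT_DATA = 0x01     # OBD-II service 01: current data
PID_ENGINE_RPM = 0x0C
PID_VEHICLE_SPEED = 0x0D


def build_request(pid: int) -> Tuple[int, bytes]:
    """Wrap an OBD-II request into the identifier and 8-byte payload of a CAN frame."""
    obd_message = bytes([SHOW_CURRENT_DATA, pid])
    payload = bytes([len(obd_message)]) + obd_message
    return OBD_REQUEST_ID, payload.ljust(8, b"\x00")   # pad to 8 bytes


def parse_response(payload: bytes) -> Tuple[int, bytes]:
    """Reverse process: recover the PID and its raw value bytes from a CAN payload."""
    length = payload[0]                      # number of meaningful bytes that follow
    if payload[1] != SHOW_CURRENT_DATA + 0x40:
        raise ValueError("not a positive response to service 01")
    return payload[2], payload[3:1 + length]


def decode(pid: int, value: bytes) -> float:
    """Convert raw value bytes into physical units for two well-known PIDs."""
    if pid == PID_ENGINE_RPM:
        return (256 * value[0] + value[1]) / 4    # revolutions per minute
    if pid == PID_VEHICLE_SPEED:
        return float(value[0])                    # km/h
    raise ValueError("PID not handled in this sketch")
```

For instance, a response payload starting with the bytes 04 41 0C 1A F8 carries the engine speed (256 × 0x1A + 0xF8) / 4 = 1726 RPM.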
Furthermore, OBD-II is not only a communication standard: it also defines the connector that must be present in the passenger compartment of the vehicle for connecting OBD-II compatible instruments⁸. This connector is essentially a port for the connection to the CAN bus, which makes it possible to extract data from the CAN bus.
While the OBD-II standard has been mandatory for all cars and light trucks sold in the United States since 1996, the EOBD (i.e., the European On Board Diagnostics) standard has been mandatory for all petrol vehicles sold in the European Union since 2001 and for all diesel vehicles since 2004. Both the OBD-II and the EOBD standards are available on cars and trucks.
⁵ www.can.bosch.com.
⁶ https://www.bosch.com/.
⁷ https://www.sae.org/events/pdf/obd-eu/2016_obd-eu_guide.pdf.
⁸ http://www.obdii.com/connector.html.
As for motorcycles, these vehicles essentially conform to the CAN protocol but do not exhibit an OBD-II connector. They present their own proprietary connectors, and converters exist for the various manufacturers that do support the protocol standard, so that an OBD-II scan tool can be used to retrieve information. The main problem from this point of view is that, since no regulation has mandated standardization for motorcycles, the various connection types have become proprietary as manufacturers try to maintain their closed end-to-end systems.
3.2. Machine learning
Machine learning is a type of artificial intelligence that gives computers the ability to learn without being explicitly programmed [16].
Machine learning tasks are typically classified into two categories, depending on the nature of the learning signal available to the learning system:
• Supervised learning: the computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs. Classification falls in this category: it is the process of building a model of classes from a set of records that contain class labels.
• Unsupervised learning: no labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
The algorithms considered are supervised and decision tree-based, i.e., they use a decision tree as a predictive model that maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). These algorithms (i.e., J48, J48graft, J48consolidated, RandomTree, RepTree) are among the most widespread for solving data mining problems [16], for instance from malware detection [17–20] to pathology classification [21].
To strengthen the conclusion validity, we consider in this work five different machine learning algorithms in order to demonstrate the effectiveness of our method in discriminating between car owner and impostors.
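To make the idea of a decision tree as a predictive model concrete, the following is a minimal sketch of a tree learner. It is not the J48 implementation used in this work: for brevity it chooses splits with the Gini impurity (as CART does) rather than the criterion used by J48, it has no pruning, and it assumes numeric features and non-tuple class labels such as strings.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are class labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # majority class
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

Trained on a few invented two-feature vectors labeled with driver identifiers, `build_tree` returns a nested tuple whose branches test one feature against a threshold and whose leaves hold the predicted driver.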
4. The method
In this section we describe our method to identify driver behavior using data retrieved from the CAN bus. As stated in the introduction, real data [14], processed from in-vehicle CAN data, are considered. To collect the data, the On Board Diagnostics 2 (OBD-II) interface and CarbigsP, used as the OBD-II scanner, are employed. Recent vehicles have many measurement and control sensors, which are managed by the ECUs. An ECU is a device that controls parts of the vehicle such as the engine, the automatic transmission and the Antilock Braking System (ABS). OBD refers to the self-diagnostic and reporting capability obtained by monitoring the vehicle system in terms of ECU measurements and vehicle failures. The data are recorded every second during driving.
We considered a real environment and not a simulated one in order to capture all possible real-world variables, for instance slowdowns, traffic lights and all the other variables that cannot be considered in a simulated environment.
The total number of collected features is 51. Table 1 describes the features involved in the study. The data are related to a recent model from KIA Motors Corporation in South Korea.
We designed an experiment to evaluate the effectiveness of the feature vector we propose, expressed through the research question stated in the introduction.
More specifically, the experiment is aimed at verifying whether the feature set is able to discriminate the car owner from impostors.
The classification is carried out by using well-known classifiers built with the feature set, as explained in Section 3.
The evaluation consists of three stages: (i) a comparison of descriptive statistics of the populations of drivers; (ii) hypotheses testing, in order to verify whether the features present different distributions for the populations of drivers; and (iii) a classification analysis aimed at assessing whether the features are able to correctly distinguish between the car owner and impostors.
As for the descriptive statistics, we report box plots of the feature distributions for the drivers involved in the study in order to show that the distributions are different.
With regard to the hypotheses testing, the null hypothesis to be tested is:
H0 : “the drivers have similar values of the considered features”.
The null hypothesis was tested with the Wald–Wolfowitz, Mann–Whitney and Kolmogorov–Smirnov tests, each with the p-level fixed to .05. We chose to run three different tests in order to strengthen the conclusion validity.
The p-level fixes the level of significance, i.e., the risk (the probability) that erroneous conclusions are drawn: in our case, we set the significance level equal to .05, which means that we accept making mistakes 5 times out of 100.
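As an illustration of how one of these tests operates, the sketch below computes the two-sample Kolmogorov–Smirnov statistic, i.e., the largest distance between the empirical distribution functions of two samples, and compares it with the usual large-sample critical value at the chosen significance level. It is a simplified stand-in for a statistical package, not the procedure actually run in this study; the function names are hypothetical and the critical value is an approximation that is reasonable only for fairly large samples.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_rejects_h0(sample_a, sample_b, alpha=0.05):
    """True when H0 (same distribution) is rejected at significance level alpha."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))      # about 1.358 for alpha = .05
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical
```

Applied to the values of one feature for two drivers, a `True` result means that, at the .05 level, the two drivers do not share the same distribution for that feature.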
Table 1. Features involved in the study.
# Feature Description
1 Fuel_consumption The instant value related to fuel consumption.
2 Accelerator_Pedal_value This sensor registers the movement of the accelerator pedal: accelerator pedal opening angle percentage as determined by the accelerator position sensor.
3 Throttle_position_signal The relative throttle position sensor is used to monitor the throttle position of a vehicle.
4 Short_Term_Fuel_Trim_Bank1 Fuel trims are the percentage of change in fuel over time in the short term.
5 Intake_air_pressure This data is used to calculate air density and determine the engine’s air mass flow rate.
6 Filtered_Accelerator_Pedal_value ECU’s filtered accelerator pedal opening angle percentage as determined by the accelerator position sensor.
7 Absolute_throttle_position Actual position of the throttle.
8 Engine_soacking_time Duration of time a vehicle’s engine is at rest prior to being started.
9 Inhibition_of_engine_fuel_cut_off The fuel cut-off control system is responsive to a brake switch signal and an engine speed signal having a value above a fuel recovery threshold to decrease the value of a fuel cut-off threshold to again perform the fuel cut-off even in the normal fuel recovery range. This value represents the inhibition of engine fuel cut off.
10 Engine_in_fuel_cut_off This value represents the inhibition of engine fuel cut off, i.e., fuel cut-off threshold.
11 Fuel_Pressure Effective pressure is the actual applied pressure for the injector, and is the pressure differential across the injector.
12 Long_Term_Fuel_Trim_Bank1 Fuel trims are the percentage of change in fuel over time in the long term.
13 Engine_speed It is also called the engine’s RPM, i.e., Revolutions Per Minute. In other words, it is the number of revolutions the crankshaft makes per minute.
14 Engine_torque_after_correction The value after correcting the torque to which an engine is adjusted before a gear disengagement.
15 Torque_of_friction Friction torque is the torque caused by the frictional force that occurs when two objects in contact move.
16 Flywheel_torque_interventions The flywheel stores energy when torque is applied by the energy source, and it releases stored energy when the energy source is not applying torque to it. The value represents the flywheel torque after torque interventions.
17 Current_spark_timing The time to set the angle relative to piston position and crankshaft angular velocity that a spark will occur in the combustion chamber near the end of the compression stroke.
18 Engine_coolant_temperature The temperature of the engine coolant of the internal combustion engine.
19 Engine_Idle_Target_Speed The desired idle RPM in relation to coolant temperature.
20 Engine_torque Engine torque is also related to the gearing. The lower the gear, the greater the pulling ability of an engine and hence the greater the torque that this value represents.
21 Calculated_LOAD_value This value indicates a percentage of peak available torque.
22 Min_indicated_engine_torque Minimum Engine_torque value.
23 Max_indicated_engine_torque Maximum Engine_torque value.
24 Flywheel_torque The value represents the flywheel torque.
25 Torque_scaling_factor This value is described as how flexible or how much force can be expressed in a given gear when the driver scales the gear.
26 Standard_Torque_Ratio This value is described as how flexible or how much force can be expressed in a given gear.
27 Requested_spark_retard_angle The transmission control unit (TCU) controls modern electronic automatic transmissions. This value computes the requested spark retard angle from the TCU.
28 Requests_engine_torque_limit This parameter monitors the requests for engine torque limits (ETL) by the TCU.
29 Requested_engine_RPM_increase This parameter monitors the TCU requests related to increasing the engine RPM.
30 Target_engine_speed_used_in_lock-up_module It monitors the lock-up valve, used to shut off the signal pressure line of pneumatic actuators.
31 Glow_plug_control_request It monitors the request to check the glow plug.
32 Activation_of_Air_compressor The value of the air compressor’s working.
33 Torque_converter_speed A particular kind of fluid coupling that is used to transfer rotating power from a prime mover.
34 Current_Gear The engaged gear.
35 Transmission_oil_temperature The value of the temperature of the fluid inside the transmission.
36 Wheel_velocity_front_left-hand The speed of the front left-hand wheel.
37 Wheel_velocity_rear_right-hand The speed of the rear right-hand wheel.
38 Wheel_velocity_front_right-hand The speed of the front right-hand wheel.
39 Wheel_velocity_rear_left-hand The speed of the rear left-hand wheel.
40 Torque_converter_turbine_speed_-_Unfiltered A torque converter is a type of fluid coupling that is used to transfer rotating power from a prime mover, such as an internal turbine in this case.
41 Clutch_operation_acknowledge It signals when a clutch operation happens.
42 Converter_clutch It is responsible for activating the torque converter clutch to prevent slipping at highway speeds.
43 Gear_Selection It represents the gear selected by the sensor.
44 Vehicle_speed It represents the current speed of the vehicle.
45 Acceleration_speed_-_Longitudinal It represents the value of the longitudinal acceleration.
46 Indication_of_brake_switch_ON/OFF It indicates whether the brake indication is on or off.
47 Master_cylinder_pressure The pressure of the master cylinder, a control device that converts non-hydraulic pressure into hydraulic pressure.
48 Calculated_road_gradient This value computes the slope of the currently traveled road.
49 Acceleration_speed_-_Lateral It is the acceleration value that a car manifests while curving.
50 Steering_wheel_speed This value represents the wheel speed when steering.
51 Steering_wheel_angle This value represents the wheel angle when steering.
Fig. 1. The flow diagram of the proposed approach.
The classification analysis was aimed at assessing whether the feature set is able to correctly classify car owner and impostors.
We adopt the supervised learning approach, since the driver data evaluated in this work carry the driver labels.
The supervised learning approach is composed of two different steps:
1. Learning step: starting from the labeled dataset (i.e., where each feature vector is related to a class; in our case, the class is the driver), we filter the data in order to obtain the feature vectors. The feature vectors belonging to all the drivers involved in the experiment, with the associated labels, represent the input for the machine learning algorithm, which builds a model from the analyzed data. The output of this step is the model obtained from the labeled dataset.
2. Prediction step: the output of this step is the classification of a feature vector as belonging to the car owner or to an impostor. We feed the model built in the previous step with a feature vector without the label: the classifier outputs its label prediction (i.e., car owner or impostor).
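The two steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the driver labels are collapsed into the two classes “owner” and “impostor”, and a nearest-centroid rule is used as a deliberately simple stand-in for the decision tree classifiers adopted in this work. All function names are hypothetical; precision and recall are computed for the owner class.

```python
import math
from collections import defaultdict


def relabel(driver_labels, owner):
    """Collapse driver identifiers into the two classes used for authentication."""
    return ["owner" if d == owner else "impostor" for d in driver_labels]


def learn(vectors, labels):
    """Learning step: the model is one centroid (mean vector) per class."""
    groups = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[label].append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


def predict(model, vector):
    """Prediction step: return the class whose centroid is closest to the vector."""
    return min(model, key=lambda label: math.dist(model[label], vector))


def precision_recall(true_labels, predicted, positive="owner"):
    """Precision and recall of the positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In use, `learn` is called once on the labeled feature vectors, and `predict` is then called on each unlabeled vector coming from the vehicle.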
In addition, we perform a principal component analysis (PCA) [22,23] in order to identify, among the 51 features involved, the features that best discriminate the driver behavior. We employ two different algorithms: BestFirst [22] and GreedyStepwise [24]. If the PCA is able to return the best features, we classify the new feature set with the classification algorithms in order to compare the outcome with the previous results obtained using the full feature set.
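A greedy stepwise search over feature subsets can be sketched as below. This is a generic forward-selection loop written for illustration, not the Weka implementation: the `merit` function, which scores a candidate subset, is a hypothetical placeholder that in practice would be supplied by a subset evaluator.

```python
def greedy_forward_selection(n_features, merit):
    """Add one feature at a time, keeping the addition that improves merit most.

    `merit` is a caller-supplied function (an assumption of this sketch) that
    maps a list of feature indices to a score; the search stops as soon as no
    single addition improves the current score.
    """
    selected = []
    best_score = merit(selected)
    remaining = set(range(n_features))
    while remaining:
        candidate, candidate_score = None, best_score
        for f in sorted(remaining):
            score = merit(selected + [f])
            if score > candidate_score:
                candidate, candidate_score = f, score
        if candidate is None:          # no feature improves the subset
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

The returned subset can then be used to rebuild the feature vectors before repeating the classification, which mirrors the comparison against the full feature set described above.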
Fig. 1 represents the flow diagram of the proposed approach.
The most important step is represented by the selection block in Fig. 1: in this block the feature set is acquired from the
OBD and it is evaluated against the learned model in order to decide whether it belongs to the car owner.
The classification analysis was accomplished with Weka⁹, a suite of machine learning software largely employed in data mining for scientific research.
In the following, we present two possible scenarios related to the adoption of the proposed method in two different fields. The first one (i.e., the anti-theft instrument) is depicted in Fig. 2, while the second one (the car insurance company tool) is depicted in Fig. 3.
The scenarios shown in Fig. 2 are related to two cases. In the first one (i.e., Scenario 1) the car owner is driving his/her own car: in this case the proposed method recognizes the driver as the car owner and does not send an alert. In the second case (i.e., Scenario 2) the car owner receives an alert on his/her mobile device, because the proposed method does not recognize the driving style as belonging to the car owner.
The second area of applicability of the proposed solution is related to car insurance; the corresponding scenario is represented in Fig. 3.
While in the anti-theft scenario the checks on the driving style are performed on the car owner’s mobile device (which also receives the feature set gathered from the car), in this scenario the feature sets of different vehicles are sent to the insurance company servers in order to be analyzed. As a matter of fact, several car insurance companies are proposing driver-oriented policies: the driver accident policy is an accessory guarantee that allows people to receive financial compensation for bodily injury suffered by the driver, only in the case of an at-fault accident. In this case the main problem, from the car insurance company’s point of view, is to establish who was driving the car at the time of the accident. Furthermore, since the car insurance company is also able to determine whether the car is driven by the insured or by another person, this can act as a deterrent against car owners lending their car to other people.
⁹ http://www.cs.waikato.ac.nz/ml/weka/.
Fig. 2. An example of use of the proposed solution to identify car theft. The upper box (i.e., Scenario 1) shows the proposed method when the car owner is driving his/her own car, while the lower box (i.e., Scenario 2) shows the proposed method when a car thief is driving the owner’s car: in this case the proposed method sends an alert to the car owner’s mobile device.
Fig. 3. An example of use of the proposed method for car insurance companies proposing a driver accident policy. In this scenario the vehicles under analysis periodically send their data to the company servers, where the data are stored and analyzed with the aim of identifying whether the driver is the car owner.
5. The evaluation
In this section we describe the real-world dataset used in this paper and the results of the experiment.
For the sake of clarity, the results of the evaluation are discussed following the three phases of the data analysis introduced in the previous section: descriptive statistics, hypotheses testing and classification.
5.1. The dataset
Ten different drivers participated in the experiment by driving, with the same car, 4 different round-trip paths in Seoul (i.e., between Korea University and SANGAM World Cup Stadium) for about 23 h of total driving time. Fig. 4 shows the path considered by the different participant drivers.
The driving paths are of three types (city way, motor way and parking space), with a total length of about 46 km. The experiment was performed starting from July 28, 2015. The experiments were performed in a similar time slot, from 8 p.m. to 11 p.m. on weekdays. The ten drivers, labeled from “A” to “J”, completed two round trips for reliable classification, while data were collected under totally different road conditions. The city way has signal lamps and crosswalks, but the motor way has none. The parking space requires driving slowly and cautiously.
The data we used comprise a total of 94,401 items recorded every second, with a total size of 16.7 Mb, and are freely available for research purposes¹⁰.
10
https://sites.google.com/a/hksecurity.net/ocslab/Datasets/driving-dataset.
Fig. 4. Path from Korea University to Seoul World Cup Stadium.
Fig. 5. Box plots related to the Fuel_consumption feature (i.e., feature #1 in Table 1) for the ten different drivers considered in the experiment.
Fig. 6. Box plots related to the Intake_air_pressure feature (i.e., feature #5 in Table 1) for the ten different drivers considered in the experiment.
5.2. Descriptive statistics
Figs. 5–8 show the box plots related to Fuel_consumption (i.e., feature #1 in Table 1), Intake_air_pressure (i.e., feature #5 in Table 1), Long_Term_Fuel_Trim_Bank1 (i.e., feature #12 in Table 1) and Transmission_oil_temperature (i.e., feature #35 in Table 1). Given the large number of features, we do not show the box plots for the full set of 51 features, but similar considerations apply to all the features involved in the study.
The box plots in Fig. 5 present the distributions of the Fuel_consumption feature for the 10 drivers. This feature ranges between 0 and 10,000 and is measured in cubic millimeters (mcc).
Fig. 7. Box plots related to the Long_Term_Fuel_Trim_Bank1 feature (i.e., feature #12 in Table 1) for the ten different drivers considered in the experiment.
Fig. 8. Box plots related to the Transmission_oil_temperature feature (i.e., feature #35 in Table 1) for the ten different drivers considered in the experiment.
The 10 drivers present similar distributions, although driver A shows a slightly larger box than the other drivers. Note that the Fuel_consumption feature can be correlated with the Accelerator_pedal_value: more pressure on the accelerator pedal increases the speed of the vehicle, which in turn requires more fuel.
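The comparison between drivers relies on the quantities a box plot draws. As a minimal sketch, assuming each driver's samples of one feature are available as a plain list, the summary can be computed as follows; the function name `box_plot_summary` and the sample values are hypothetical and are not taken from the dataset.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the five-number summary (plus the box height) for one driver."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values), "iqr": q3 - q1}

# Hypothetical per-driver samples of a single feature (placeholders only).
samples = {
    "A": [10, 20, 30, 40, 50],
    "B": [20, 25, 30, 35, 40],
}
summaries = {driver: box_plot_summary(v) for driver, v in samples.items()}
```

A driver with a "larger box" is one whose interquartile range (`iqr`) is larger, as is the case for the hypothetical driver A here.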
The box plots in Fig. 6 show the per-driver distribution of the Intake_air_pressure feature (i.e., the pressure of the air drawn into the engine). It ranges between 0 and 255 and is measured in kilopascal (kPa).
From the analysis of Intake_air_pressure, it seems that the engine, when driven by drivers A, E and J, takes in air at a similar pressure (ranging between 45 and 60 kPa), while for the remaining drivers (B, C, D, F, G, H and I) the air pressure ranges between 0 and 50 kPa. Considering that the air intake pressure is influenced by the degree of opening of the throttle plate, which draws the fuel into the combustion chamber of the engine [25], a higher air pressure is reflected in increased fuel consumption.
The Long_Term_Fuel_Trim_Bank1 box plots are shown in Fig. 7. This feature represents, as a percentage, the correction value used by the fuel control system in loop modes of operation. The correction works as follows: fuel trims are defined as the percentage of change in fuel over time. For an engine to work correctly, the ratio between air and fuel must stay within a small interval. Several conditions can make this interval too wide, for instance cold start-up, cruising down the highway and idling in heavy traffic.
The engine computer tries to maintain the proper ratio between air and fuel by fine-tuning the amount of fuel going into the engine: it adds or takes away fuel, while the oxygen sensors monitor how much oxygen is in the exhaust and report it to the engine computer.
This change in the fuel being added or taken away is called fuel trim. The oxygen sensors drive the fuel trim readings: changes in O2 sensor voltages cause a direct change in fuel. The short term fuel trim (STFT) refers to immediate changes in fuel occurring several times per second. The long term fuel trim (LTFT) is driven by the short term fuel trim: it refers to changes in STFT averaged over a longer period of time. A negative fuel trim percentage indicates that fuel is being taken away, while a positive percentage indicates that fuel is being added.
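The relationship between the two trims can be illustrated with a running average. This is only a sketch of the idea that LTFT is STFT averaged over a longer period: the window length, the function name `long_term_trim` and the readings are assumptions, and a real engine computer uses its own adaptation logic.

```python
from collections import deque

def long_term_trim(stft_readings, window=5):
    """Yield the running mean of the last `window` STFT percentages."""
    recent = deque(maxlen=window)
    for value in stft_readings:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical STFT percentages: negative removes fuel, positive adds fuel.
stft = [4.0, -2.0, 6.0, 0.0, 2.0]
ltft = list(long_term_trim(stft, window=5))
```

The short term values swing between −2% and 6%, while the averaged value settles on a much narrower range.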
Table 2
Results of the null hypothesis H0 test.
# Feature   Wald–Wolfowitz   Mann–Whitney   Kolmogorov–Smirnov   Test Result
[1–2]       0.000            0.000          p < .001             passed
3           0.000            0.000047       p < .001             passed
4           0.000            0.227366       p < .001             not passed
5           0.000            0.000          p < .001             passed
6           0.000            1.000          p > .10              not passed
7           0.000            0.000121       p < .001             passed
8           0.000            0.000          p < .001             passed
9           0.000            1.000          p > .10              not passed
10          0.000            0.488959       p > .10              not passed
11          0.000            1.000          p > .10              not passed
[12–21]     0.000            0.000          p < .001             passed
22          0.000            0.495956       p < .001             not passed
23          0.000            0.004297       p < .001             passed
24          0.000            0.000          p < .001             passed
[25–27]     0.000            1.000          p > .10              not passed
28          0.000            0.09627        p > .10              not passed
29          0.000            0.000          p > .10              not passed
[30–32]     0.000            1.000          p > .10              not passed
33          0.000            0.000          p < .001             passed
34          0.000            0.951169       p < .001             not passed
35          0.000            0.000          p < .001             passed
36          0.000            0.000002       p < .001             passed
37          0.000            0.000005       p < .001             passed
38          0.000            0.000006       p < .001             passed
39          0.000            0.000002       p < .001             passed
[40–42]     0.000            0.000          p < .001             passed
43          0.000            0.338733       p > .10              not passed
44          0.000            0.000005       p < .001             passed
[45–49]     0.000            0.000          p < .001             passed
50          0.000            0.33013        p < .01              not passed
51          0.000            0.727628       p < .001             not passed
The Long_Term_Fuel_Trim_Bank1 distributions show different trends among the 10 drivers. Driver A exhibits the lowest distribution, while driver E exhibits the highest one. The remaining drivers (i.e., B, C, F, G, H, I and J) present box plots ranging between 2% and 6%, and driver D exhibits the narrowest box plot. Considering that this value increases under conditions such as cold start-up, cruising down the highway and idling in heavy traffic, we think that this feature can be very useful to discriminate the car owner from impostors, and the box plots support this hypothesis.
The box plots related to Transmission_oil_temperature are shown in Fig. 8. This feature represents the oil temperature inside the transmission and ranges between −40 and 215 °C.
Driver A presents the highest value for this feature: 100 °C. For the remaining drivers the value ranges between 85 and 95 °C, which is considered a normal value. The only exception is driver E, with the lowest temperature values. Considering that an engine in good condition is able to reach the same oil temperature, this feature does not seem indicative for driver discrimination, and the box plots seem to confirm this hypothesis.
5.3. Hypothesis testing
Hypothesis testing aims at evaluating, with statistical evidence, whether the features present different distributions for the populations of the 10 drivers involved in the experiment.
We consider the results valid when the null hypothesis is rejected by all three tests performed.
Table 2 shows the results of the null hypothesis H0 test.
All the features successfully pass the Wald–Wolfowitz test, while the Mann–Whitney test is not passed by features #4, #6, #10, #22, #25, #26, #27, #34, #43, #51 and the Kolmogorov–Smirnov test is not passed by features #6, #9, #10, #11, #25, #26, #27, #28, #29, #30, #31, #43. We highlight that features #6, #10, #25, #26, #27 and #43 passed neither the Mann–Whitney test nor the Kolmogorov–Smirnov one. This indicates that not all the features in the dataset contribute to discriminating between different drivers, especially the features that did not pass two tests out of three.
To summarize, considering that we regard the results as valid when the null hypothesis is rejected by all three tests performed, the features that have not passed the null hypothesis H0 test are: #4, #6, #9, #10, #11, #22, #25, #26, #27, #28, #29, #30, #31, #34, #43, #51, i.e., 17 features out of the 51 considered in the study. The classification analysis with the feature selection will confirm whether these features are unable to discriminate between the different drivers involved in the study.
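The acceptance rule used above (a feature passes only when all three tests reject the null hypothesis) can be sketched as follows. The significance level of 0.05, the function name `passes_h0_test` and the p-values are assumptions made for illustration; they are not values stated in this section or copied from Table 2.

```python
def passes_h0_test(p_values, alpha=0.05):
    """A feature passes only if every test rejects the null hypothesis."""
    return all(p < alpha for p in p_values.values())

# Hypothetical p-values for two placeholder features.
features = {
    "feature_x": {"wald_wolfowitz": 0.000, "mann_whitney": 0.004,
                  "kolmogorov_smirnov": 0.0005},
    "feature_y": {"wald_wolfowitz": 0.000, "mann_whitney": 0.49,
                  "kolmogorov_smirnov": 0.20},
}
results = {name: passes_h0_test(p) for name, p in features.items()}
discarded = [name for name, passed in results.items() if not passed]
```

A single test that fails to reject H0 is enough to discard a feature, which is why a feature can fail even though it passes the Wald–Wolfowitz test.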
Table 3
Parameters employed for the classification algorithms learning.
Parameter J48 J48Graft J48Consolidated RandomTree RepTree
batchSize 100 100 100

confidenceFactor 0.25 0.25 0.25
minNumObj 2 2 2
numDecimalPlaces 2 2 2 2 2
maxDepth 0 −1
KValue 0
minVarianceProp 0.001
5.4. Classification analysis
We classified the features extracted using five classification algorithms: J48, J48graft, J48consolidated, RandomTree and
RepTree.
Five metrics were used to evaluate the classification results: False Positive (FP) rate, Precision, Recall, F-Measure and ROC
Area.
The FP rate is calculated as the ratio between the number of impostor traces wrongly categorized as belonging to the owner (i.e., the false positives) and the total number of actual impostor traces (i.e., the false positives plus the true negatives): $\text{FP rate} = \frac{fp}{fp + tn}$, where fp indicates the number of false positives and tn the number of true negatives.
The Precision has been computed as the proportion of the examples that truly belong to class X among all those which were assigned to the class. It is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved: $\text{Precision} = \frac{tp}{tp + fp}$, where tp indicates the number of true positives and fp indicates the number of false positives.
The Recall has been computed as the proportion of examples that were assigned to class X among all the examples that truly belong to the class, i.e., how much of the class was captured. It is the ratio of the number of relevant records retrieved to the total number of relevant records: $\text{Recall} = \frac{tp}{tp + fn}$, where tp indicates the number of true positives and fn indicates the number of false negatives.
The F-Measure is a measure of a test’s accuracy. This score is the harmonic mean of the precision and recall: $\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
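The four count-based metrics defined above can be computed directly from the entries of a binary confusion matrix. The following is a minimal sketch; the function name `classification_metrics` and the counts are hypothetical and do not come from the experiment.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute FP rate, Precision, Recall and F-Measure from raw counts."""
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"fp_rate": fp_rate, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical counts: owner traces are the positive class.
metrics = classification_metrics(tp=90, fp=10, tn=890, fn=10)
```

Because impostor traces are far more numerous than owner traces in the binary setting, the FP rate can be very small even when precision is noticeably below 1.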
The Roc Area is defined as the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one.
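This probabilistic definition can be evaluated directly by comparing every positive score with every negative score. The sketch below assumes the classifier outputs a numeric score per trace and counts ties as half; the function name `roc_area` and the scores are hypothetical.

```python
def roc_area(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical classifier scores for owner and impostor traces.
owner_scores = [0.9, 0.8, 0.4]
impostor_scores = [0.7, 0.3, 0.2]
area = roc_area(owner_scores, impostor_scores)
```

Here eight of the nine pairs are ordered correctly; only the owner trace scored 0.4 falls below the impostor trace scored 0.7.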
The classification analysis consisted of building classifiers in order to evaluate the accuracy of the features in distinguishing the car owner from an impostor.
We consider two different approaches to build the model starting from the features.
In the first one, the multi driver classification, we defined T as a set of labeled behavioral traces (BT, l), where each BT is associated with a label l ∈ {A, B, C, D, E, F, G, H, I, J}.
For training the second classifier, i.e., the binary one, we defined T as a set of labeled behavioral traces (BT, l), where each BT is associated with a label l ∈ {impostor, owner}. For each BT we built a feature vector $F \in \mathbb{R}^{y}$, where y is the number of features used in the training phase (y = 51).
For the learning phase, we use k-fold cross-validation [16,26]: the dataset is randomly partitioned into k subsets. A single subset is retained as the validation dataset for testing the model, while the remaining k−1 subsets of the original dataset are used as training data. We repeated the process k = 10 times, so that each one of the k subsets was used exactly once as the validation dataset. To obtain a single estimate, we computed the average of the k results from the folds.
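The partitioning described above can be sketched in a few lines. This is an illustration rather than the implementation used in the experiment: the function names, the fixed random seed and the caller-supplied `evaluate` callback are all assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists; each fold validates exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_validate(n_samples, evaluate, k=10):
    """Average the per-fold scores returned by the hypothetical `evaluate`."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(n_samples=25, k=10))
```

In practice `evaluate` would train a classifier on the `train` indices and return one metric, such as the F-Measure, computed on the `validation` indices.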
We evaluated the effectiveness of the classification method with the following procedure:
1. build a training set T ⊂ D;
2. build a testing set T′ = D \ T;
3. run the training phase on T;
4. apply the learned classifier to each element of T′.
Each classification was performed using 20% of the dataset as training dataset and 80% as testing dataset, employing the full feature set.
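The first two steps of the procedure amount to a disjoint split of the dataset D. A minimal sketch, assuming a random split with a fixed seed (the source does not state how the 20% is chosen) and using integers as stand-ins for behavioral traces:

```python
import random

def hold_out_split(dataset, train_fraction=0.2, seed=0):
    """Split D into a training set T and a testing set containing the rest."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

traces = list(range(100))  # hypothetical stand-ins for behavioral traces
training_set, testing_set = hold_out_split(traces)
```

Every trace ends up in exactly one of the two sets, so the classifier is never tested on a trace it was trained on.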
We defined C_u as the set of classifications we performed, where u identifies the driver (1 ≤ u ≤ 10).
For the sake of clarity, we explain with an example the method adopted in the binary classification: when we perform the C_2 classification, we label the traces related to driver #2 as owner traces and the traces of the other drivers as impostor traces, while in the multi driver classification we consider the ten different driver labels.
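The relabeling used for a binary classification can be sketched as a one-vs-rest mapping. The function name `binary_labels` is hypothetical, and the sketch assumes that driver #2 corresponds to the label "B" in the A to J labeling.

```python
def binary_labels(driver_labels, owner):
    """Relabel multi driver labels for the binary classification of one owner."""
    return ["owner" if label == owner else "impostor" for label in driver_labels]

# Hypothetical trace labels; driver #2 is assumed to be labeled "B".
labels = ["A", "B", "C", "B", "J"]
c2_labels = binary_labels(labels, owner="B")
```

Repeating the mapping for each of the ten drivers yields the ten binary datasets, one per classification in C_u.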
Table 3 shows the parameters considered for the learning task of the five algorithms involved in the evaluation.
The results obtained with this procedure are shown in Table 4. In particular, the table shows the FP Rate, Precision, Recall, F-Measure and Roc Area obtained in classifying the full drivers dataset (multi driver classification) and each single driver with the five different algorithms. The Time column reports the time (in seconds) needed to learn the classifier. In the all drivers classification the learning time ranges between 1.07 s (with the RandomTree algorithm) and 16.18 s (with the J48consolidated algorithm).
Table 4
Classification results.
Family        Algorithm         FP Rate   Precision   Recall   F-Measure   Roc Area   Time (s)
All drivers   J48               0.001     0.992       0.992    0.992       0.998      15.95
              J48graft          0.001     0.992       0.992    0.992       0.998      16.04
              J48consolidated   0.001     0.991       0.991    0.991       0.998      16.18
              RandomTree        0.014     0.880       0.880    0.880       0.933      1.07
              RepTree           0.002     0.987       0.987    0.987       0.998      4.40
Driver A      J48               0.000     0.998       0.997    0.998       0.999      14.04
              J48graft          0.000     0.998       0.996    0.997       0.999      16.02
              J48consolidated   0.000     0.997       0.995    0.996       0.999      15.98
              RandomTree        0.004     0.956       0.944    0.950       0.970      0.99
              RepTree           0.000     0.997       0.996    0.996       1.000      4.10
Driver B      J48               0.001     0.991       0.994    0.992       0.998      7.96
              J48graft          0.002     0.990       0.994    0.992       0.998      7.99
              J48consolidated   0.002     0.990       0.990    0.990       0.998      6.98
              RandomTree        0.015     0.902       0.898    0.900       0.941      1.48
              RepTree           0.002     0.987       0.986    0.986       0.998      3.67
Driver C      J48               0.001     0.991       0.992    0.991       0.997      8.95
              J48graft          0.001     0.990       0.991    0.991       0.997      8.99
              J48consolidated   0.001     0.985       0.992    0.989       0.997      9.06
              RandomTree        0.015     0.823       0.826    0.824       0.905      1.19
              RepTree           0.002     0.977       0.979    0.978       0.998      4.15
Driver D      J48               0.002     0.991       0.988    0.989       0.996      12.04
              J48graft          0.001     0.992       0.987    0.990       0.996      12.01
              J48consolidated   0.002     0.988       0.981    0.984       0.997      11.58
              RandomTree        0.022     0.862       0.863    0.863       0.920      1.49
              RepTree           0.003     0.979       0.979    0.979       0.997      5.59
Driver E      J48               0.000     0.997       0.998    0.997       1.000      9.30
              J48graft          0.000     0.996       0.998    0.997       0.999      9.12
              J48consolidated   0.001     0.995       0.997    0.996       0.999      9.10
              RandomTree        0.005     0.949       0.954    0.952       0.974      0.50
              RepTree           0.000     0.996       0.997    0.996       0.999      2.71
Driver F      J48               0.001     0.994       0.996    0.995       0.999      4.47
              J48graft          0.001     0.993       0.996    0.994       0.999      4.54
              J48consolidated   0.001     0.992       0.994    0.993       0.999      4.22
              RandomTree        0.012     0.913       0.917    0.915       0.953      1.09
              RepTree           0.001     0.992       0.992    0.992       0.999      3.14
Driver G      J48               0.001     0.992       0.991    0.992       0.997      13.38
              J48graft          0.001     0.992       0.991    0.992       0.997      13.47
              J48consolidated   0.001     0.991       0.990    0.990       0.997      13.29
              RandomTree        0.010     0.881       0.885    0.883       0.937      1.55
              RepTree           0.001     0.984       0.984    0.984       0.998      5.12
Driver H      J48               0.001     0.993       0.993    0.993       0.998      12.66
              J48graft          0.001     0.993       0.993    0.993       0.998      12.71
              J48consolidated   0.001     0.990       0.990    0.990       0.998      12.67
              RandomTree        0.018     0.844       0.852    0.848       0.917      1.80
              RepTree           0.002     0.986       0.987    0.986       0.998      5.40
Driver I      J48               0.001     0.988       0.988    0.988       0.997      8.64
              J48graft          0.001     0.990       0.988    0.989       0.997      8.58
              J48consolidated   0.001     0.991       0.994    0.993       0.998      8.41
              RandomTree        0.018     0.808       0.816    0.812       0.899      1.11
              RepTree           0.001     0.988       0.984    0.986       0.998      4.36
Driver J      J48               0.001     0.990       0.988    0.989       0.997      9.03
              J48graft          0.001     0.990       0.990    0.990       0.997      9.07
              J48consolidated   0.001     0.990       0.990    0.990       0.998      9.17
              RandomTree        0.015     0.856       0.838    0.846       0.911      1.07
              RepTree           0.002     0.984       0.986    0.985       0.998      3.78
In the multi driver classification (All drivers family), the best results for each of the metrics considered are the following:
• FP rate equal to 0.001 with the J48, J48graft and J48consolidated algorithms;
• Precision, Recall and F-Measure equal to 0.992 using the J48 and J48graft classification algorithms;
• Roc Area equal to 0.998 using the J48, J48graft, J48consolidated and RepTree classification algorithms.
Table 5
Feature selection results.
# Feature
5 Intake_air_pressure
8 Engine_soacking_time
12 Long_Term_Fuel_Trim_Bank1
15 Torque_of_friction
35 Transmission_oil_temperature
50 Steering_wheel_speed
Fig. 9. FP Rate values for the 10 drivers involved in the experiment obtained classifying the best features using the J48 algorithm.
The single driver results are also reported in Table 4, while Table 5 shows the feature selection results. The two feature selection algorithms employed, BestFirst and GreedyStepwise, confirm that 6 features out of the 51 in the full feature set are the most discriminatory for driver identification, i.e., #5 (Intake_air_pressure), #8 (Engine_soacking_time), #12 (Long_Term_Fuel_Trim_Bank1), #15 (Torque_of_friction), #35 (Transmission_oil_temperature) and #50 (Steering_wheel_speed).
As expected, all six features resulting from the feature selection step successfully passed the null hypothesis H0 test, as shown in Table 2.
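Once the six features are known, reducing a record to the selected subset is a simple projection. The sketch below assumes each record is a dictionary keyed by feature name; the function name `project`, the record values and the extra `Fuel_consumption` entry are placeholders for illustration.

```python
# Indices and names of the selected features, as listed in Table 5.
SELECTED_FEATURES = {
    5: "Intake_air_pressure",
    8: "Engine_soacking_time",
    12: "Long_Term_Fuel_Trim_Bank1",
    15: "Torque_of_friction",
    35: "Transmission_oil_temperature",
    50: "Steering_wheel_speed",
}

def project(record, selected=SELECTED_FEATURES):
    """Keep only the selected features of one record (a name -> value dict)."""
    return {name: record[name] for name in selected.values()}

# Hypothetical record: the values are placeholders, not dataset measurements.
record = {name: 0.0 for name in SELECTED_FEATURES.values()}
record["Fuel_consumption"] = 1.0
reduced = project(record)
```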
Table 6 shows the classification analysis considering the features retrieved from the feature selection step: it reports the FP Rate, Precision, Recall, F-Measure and Roc Area obtained in classifying the full drivers dataset and each single driver with the five different algorithms. The Time column reports the time (in seconds) needed to learn the classifier.
In the all drivers classification the learning time ranges between 0.41 s (with the RandomTree algorithm) and 1.71 s (with the J48consolidated algorithm). Without PCA analysis the J48consolidated algorithm takes 16.18 s to learn the classifier. The results obtained in terms of the analyzed metrics are close to the previous ones, confirming that the excluded features were not useful in the classification task. Indeed, in the second classification we considered only 6 of the 51 features used in the previous one, which makes the proposed method more applicable in the real world: a lower number of features requires less storage space and a shorter computing time to identify the impostor driver.
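The saving can be quantified with a short calculation on the two J48consolidated learning times reported in Tables 4 and 6. The variable names are illustrative; the two timings and the feature counts are the ones given in the text.

```python
full_time = 16.18     # seconds, J48consolidated with the 51 features (Table 4)
reduced_time = 1.71   # seconds, J48consolidated with the 6 features (Table 6)

speed_up = full_time / reduced_time   # how many times faster learning becomes
feature_ratio = 6 / 51                # share of the original features retained

print(f"learning is about {speed_up:.1f}x faster "
      f"using {feature_ratio:.0%} of the features")
```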
In the multi driver classification using the best features (All drivers family), the best results for each of the metrics considered are the following:
• FP rate equal to 0.001 with the J48 and J48graft algorithms;
• Precision, Recall and F-Measure equal to 0.990 using the J48graft classification algorithm (0.989 using J48);
• Roc Area equal to 0.998 using the J48, J48consolidated and RepTree classification algorithms.
To give a full view of the single driver identification, we plot as a histogram the FP rate obtained by the classification algorithm that reaches the best performance in the best feature classification, i.e., the J48 algorithm.
Fig. 9 shows the FP Rate obtained by classifying the six best features with the J48 algorithm for the 10 drivers involved in the experiment. The FP Rate ranges between 0 and 0.002. These results demonstrate that the FP rate of our method is very low, i.e., it is equal to 0.002 in the worst case (related to drivers C, D, H and I). Drivers A, E and G exhibit the best case, with an FP rate equal to 0, while drivers B, F and J obtain an FP rate equal to 0.001.
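The grouping of drivers by FP rate can be reproduced from the J48 rows of Table 6. The snippet below is a small sketch; the variable names are illustrative, while the rates are the J48 values reported for each single driver.

```python
from collections import defaultdict

# J48 FP rates of the single driver classifications, read from Table 6.
j48_fp_rate = {"A": 0.000, "B": 0.001, "C": 0.002, "D": 0.002, "E": 0.000,
               "F": 0.001, "G": 0.000, "H": 0.002, "I": 0.002, "J": 0.001}

drivers_by_rate = defaultdict(list)
for driver, rate in j48_fp_rate.items():
    drivers_by_rate[rate].append(driver)

worst_rate = max(j48_fp_rate.values())
worst_drivers = drivers_by_rate[worst_rate]
```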
Table 6
Best features classification results.
Family        Algorithm         FP Rate   Precision   Recall   F-Measure   Roc Area   Time (s)
All drivers   J48               0.001     0.989       0.989    0.989       0.998      1.69
              J48graft          0.001     0.990       0.990    0.990       0.997      1.58
              J48consolidated   0.002     0.986       0.986    0.986       0.998      1.71
              RandomTree        0.002     0.984       0.984    0.984       0.991      0.41
              RepTree           0.002     0.984       0.984    0.984       0.998      0.54
Driver A      J48               0.000     0.998       0.997    0.998       0.999      1.33
              J48graft          0.000     0.997       0.997    0.997       0.999      1.29
              J48consolidated   0.000     0.997       0.997    0.997       0.999      1.36
              RandomTree        0.000     0.997       0.996    0.997       0.998      0.25
              RepTree           0.000     0.997       0.996    0.997       1.000      0.49
Driver B      J48               0.001     0.991       0.991    0.991       0.998      3.26
              J48graft          0.002     0.990       0.992    0.991       0.998      3.18
              J48consolidated   0.002     0.988       0.985    0.987       0.998      3.23
              RandomTree        0.002     0.986       0.987    0.986       0.992      0.35
              RepTree           0.002     0.986       0.987    0.986       0.998      0.47
Driver C      J48               0.002     0.980       0.979    0.980       0.997      2.36
              J48graft          0.002     0.982       0.979    0.980       0.996      2.25
              J48consolidated   0.002     0.972       0.983    0.977       0.997      2.45
              RandomTree        0.002     0.974       0.971    0.972       0.984      0.39
              RepTree           0.002     0.973       0.972    0.972       0.997      0.42
Driver D      J48               0.002     0.987       0.984    0.985       0.997      4.71
              J48graft          0.002     0.987       0.986    0.986       0.996      4.79
              J48consolidated   0.002     0.985       0.973    0.979       0.997      4.69
              RandomTree        0.003     0.980       0.980    0.980       0.988      0.46
              RepTree           0.003     0.980       0.974    0.977       0.997      0.79
Driver E      J48               0.000     0.997       0.997    0.997       0.999      1.64
              J48graft          0.000     0.998       0.997    0.998       0.999      1.58
              J48consolidated   0.000     0.996       0.996    0.996       0.999      1.62
              RandomTree        0.001     0.995       0.996    0.995       0.998      0.27
              RepTree           0.000     0.996       0.995    0.995       0.999      0.49
Driver F      J48               0.001     0.990       0.991    0.990       0.998      1.23
              J48graft          0.001     0.990       0.991    0.991       0.998      1.27
              J48consolidated   0.002     0.984       0.987    0.986       0.998      1.25
              RandomTree        0.002     0.986       0.988    0.987       0.993      0.32
              RepTree           0.002     0.984       0.986    0.985       0.999      0.31
Driver G      J48               0.000     0.995       0.995    0.995       0.999      2.28
              J48graft          0.001     0.994       0.995    0.994       0.999      2.34
              J48consolidated   0.001     0.991       0.994    0.992       0.998      2.31
              RandomTree        0.001     0.989       0.989    0.989       0.994      0.38
              RepTree           0.001     0.989       0.990    0.989       0.999      0.51
Driver H      J48               0.002     0.986       0.988    0.987       0.998      4.11
              J48graft          0.001     0.988       0.987    0.988       0.997      4.07
              J48consolidated   0.002     0.982       0.986    0.984       0.998      4.05
              RandomTree        0.002     0.980       0.981    0.980       0.989      0.53
              RepTree           0.003     0.978       0.982    0.980       0.998      0.85
Driver I      J48               0.002     0.982       0.985    0.983       0.996      2.79
              J48graft          0.001     0.984       0.985    0.985       0.997      2.67
              J48consolidated   0.002     0.976       0.984    0.980       0.996      2.87
              RandomTree        0.002     0.979       0.974    0.977       0.986      0.28
              RepTree           0.002     0.976       0.982    0.979       0.998      0.43
Driver J      J48               0.001     0.988       0.986    0.987       0.997      2.51
              J48graft          0.001     0.987       0.987    0.987       0.997      2.43
              J48consolidated   0.001     0.986       0.983    0.984       0.997      2.56
              RandomTree        0.003     0.976       0.978    0.977       0.988      0.33
              RepTree           0.002     0.981       0.976    0.979       0.997      0.51
6. Conclusions and future work
Modern vehicles, unlike older ones, integrate many sophisticated electronic devices. These technologies have opened new ways to steal cars, for instance by exploiting the vulnerabilities of the operating system embedded in today’s cars. This scenario calls for new methodologies to stem the phenomenon resulting from the introduction of computers in the car and the consequent vulnerability of the software used. In this paper, we propose a method able to discriminate an impostor from the car owner using a set of characteristics made available by the sensors embedded in the car.
Using machine learning techniques, we built several classifiers to evaluate the effectiveness of our method: we obtain, on average, a precision and a recall equal to 0.99 in car owner discrimination. As future work, we plan to take the type of road into account in our model, in order to design a system able to advise the user about the driving style to adopt.
A weakness of the proposed method is the time window required to collect the features used to train the classification algorithms; this problem is common to all machine learning based solutions. As additional future work, we will evaluate the minimum time window able to guarantee performance above a certain threshold.
Moreover, we will extend the evaluation by considering features extracted from trucks and motorcycles, with the aim of identifying thefts that are not only car-related.
Author contributions
Fabio Martinelli, Francesco Mercaldo, Albina Orlando, Vittoria Nardone, Antonella Santone and Arun Kumar Sangaiah are
all responsible for the concept of the paper, the results presented and the writing. All the authors have read and approved
the final published manuscript.
Acknowledgments
This work has been partially supported by H2020 EU-funded projects NeCS and C3ISP and EIT-Digital Project HII and
PRIN “Governing Adaptive and Unplanned Systems of Systems” and the EU project CyberSure 734815.
Supplementary material
Supplementary material associated with this article can be found, in the online version, at 10.1016/j.compeleceng.2017.12.050.
References
[1] Martinelli F, Mercaldo F, Nardone V, Santone A. Car hacking identification through fuzzy logic algorithms. In: Fuzzy Systems (FUZZ-IEEE), 2017 IEEE
International Conference on. IEEE; 2017. p. 1–7.
[2] Alheeti KMA, Gruebler A, McDonald-Maier KD. An intrusion detection system against malicious attacks on the communication network of driverless
cars. In: 2015 12th Annual IEEE Consumer Communications and Networking Conference (CCNC). IEEE; 2015. p. 916–21.
[3] Lyamin N, Vinel AV, Jonsson M, Loo J. Real-time detection of denial-of-service attacks in IEEE 802.11p vehicular networks. IEEE Commun Lett 2014;18(1):110–13.
[4] Taylor A, Leblanc S, Japkowicz N. Anomaly detection in automobile control network data with long short-term memory networks. In: Data Science
and Advanced Analytics (DSAA), 2016 IEEE International Conference on. IEEE; 2016. p. 130–9.
[5] Massaro E, Ahn C, Ratti C, Santi P, Stahlmann R, Lamprecht A, et al. The car as an ambient sensing platform. Proc IEEE 2017;105(1):3–7.
[6] Marotta A, Martinelli F, Nanni S, Orlando A, Yautsiukhin A. Cyber-insurance survey. Computer Science Review 2017;24(Supplement C):35–61. doi:10.1016/j.cosrev.2017.01.001.
[7] Wakita T, Ozawa K, Miyajima C, Igarashi K, Katunobu I, Takeda K, et al. Driver identification using driving behavior signals. IEICE Trans Inf Syst
2006;89(3):1188–94.
[8] Zhang X, Zhao X, Rong J. A study of individual characteristics of driving behavior based on hidden markov model. Sensors Transd 2014;167(3):194.
[9] Kedar-Dongarkar G, Das M. Driver classification for optimization of energy usage in a vehicle. Procedia Comput Sci 2012;8:388–93.
[10] Van Ly M, Martin S, Trivedi MM. Driver classification and driving style recognition using inertial sensors. In: Intelligent Vehicles Symposium (IV), 2013
IEEE. IEEE; 2013. p. 1040–5.
[11] Miyajima C, Nishiwaki Y, Ozawa K, Wakita T, Itou K, Takeda K, et al. Driver modeling based on driving behavior and its evaluation in driver identification. Proc IEEE 2007;95(2):427–37.
[12] Choi S, Kim J, Kwak D, Angkititrakul P, Hansen JH. Analysis and classification of driver behavior using in-vehicle can-bus information. In: Biennial
Workshop on DSP for In-Vehicle and Mobile Systems; 2007. p. 17–19.
[13] Meng X, Lee KK, Xu Y. Human driving behavior recognition based on hidden markov models. In: Robotics and Biomimetics, 2006. ROBIO’06. IEEE
[14] Kwak BI, Woo J, Kim HK. Know your master: Driver profiling-based anti-theft method. In: PST 2016; 2016. p. 211–18.
[15] Birnbaum R, Truglia J. Getting to know OBD II. R. Birnbaum; 2001.
[16] Mitchell TM. Machine learning and data mining. Commun ACM 1999;42(11):30–6.
[17] Battista P, Mercaldo F, Nardone V, Santone A, Visaggio CA. Identification of android malware families with model checking. In: Proceedings of the 2nd
International Conference on Information Systems Security and Privacy, ICISSP 2016, Rome, Italy, February 19–21, 2016. SciTePress; 2016. p. 542–7.
[18] Mercaldo F, Visaggio CA, Canfora G, Cimitile A. Mobile malware detection in the real world. In: Software Engineering Companion (ICSE-C), IEEE/ACM
[19] Canfora G, Mercaldo F, Visaggio CA, Di Notte P. Metamorphic malware detection using code metrics. Inf Secur J 2014;23(3):57–67.
[20] Martinelli F, Marulli F, Mercaldo F. Evaluating convolutional neural network for effective mobile malware detection. Procedia Comput Sci
2017;112:2372–81.
[21] Mercaldo F, Nardone V, Santone A. Diabetes mellitus affected patients classification and diagnosis through machine learning techniques. Procedia
Comput Sci 2017;112(C):2519–28.
[22] Wold S, Esbensen K, Geladi P. Principal component analysis. Chemomet Intell Lab Syst 1987;2(1–3):37–52.
[23] Jolliffe IT. Principal component analysis and factor analysis. In: Principal component analysis. Springer; 1986. p. 115–28.
[24] Sadeghi R, Zarkami R, Sabetraftar K, Van Damme P. Application of genetic algorithm and greedy stepwise to select input variables in classification tree
models for the prediction of habitat requirements of azolla filiculoides (lam.) in anzali wetland, iran. Ecol Modell 2013;251:44–53.
[25] Abdullah NR, Shahruddin NS, Mamat AMI, Kasolang S, Zulkifli A, Mamat R. Effects of air intake pressure to the fuel economy and exhaust emissions
on a small si engine. Procedia Eng 2013;68:278–84.
[26] Refaeilzadeh P, Tang L, Liu H. Cross-validation. In: Encyclopedia of database systems. Springer; 2009. p. 532–8.
Fabio Martinelli (M.Sc. 1994, Ph.D. 1999) is a research director at IIT-CNR, where he leads the cyber security project. He is co-author of more than three hundred scientific papers. His main research interests involve security and privacy in distributed and mobile systems and foundations of security and trust.
Francesco Mercaldo obtained his Ph.D. in 2015 with a dissertation on malware analysis using machine learning techniques. The core of his research is
finding methods and methodologies to detect new threats applying the empirical methods of software engineering. Currently he works as post-doctoral
researcher at Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy.
Albina Orlando received the M.Sc. degree in Economics in 1996 and the Ph.D. degree in Financial Mathematics and Actuarial Science in 2000 from the University “Federico II”, Napoli, Italy. She is a Researcher with IAC “M. Picone” of the Consiglio Nazionale delle Ricerche, Napoli, Italy. Her research interests lie in mathematical models for insurance sciences, risk management in life insurance and stochastic mortality models.
Vittoria Nardone is a Ph.D. student at the University of Sannio under the supervision of Antonella Santone. Vittoria obtained the M.Sc. in Computer Engineering at the same University. Her current research is focused on the application of formal verification methods, in particular the model checking technique, which she applies to Android malware analysis.
Antonella Santone is an Associate Professor at the University of Molise, Italy. She received both the Laurea degree in Computer Science and the Ph.D. degree
in Computer Systems Engineering at the University of Pisa, Italy. Her research interests include formal description languages, temporal logic, concurrent
and distributed systems modeling, heuristic search, formal methods for systems biology and for software security.
Arun Kumar Sangaiah has received his Ph.D. in Computer Science and Engineering from VIT University, Vellore, India. He is currently an Associate Professor
in VIT University. He has authored more than 100 publications in different journals and conference of national and international repute. His current research
work includes global software development, wireless ad hoc and sensor networks, machine learning.


6. Conclusions and future work

F. Martinelli et al. / Computers and Electrical Engineering 000 (2018) 1–16 15

16 F. Martinelli et al. / Computers and Electrical Engineering 000 (2018) 1–16

You might also like