
Assessing Safety on Dutch Freeways with Data from Infrastructure-Based Intelligent Transportation Systems

Mohamed Abdel-Aty, Anurag Pande, Abhishek Das, and Willem Jan Knibbe

Most freeway traffic surveillance technologies deployed around the world remain infrastructure based, with underground loop detectors being the most common among them. A proactive application of traffic surveillance data recently explored for some freeways in the United States is the estimation of real-time crash risk. The application involves establishing relationships between historical crashes and archived traffic data collected before those crashes. In these studies, crash occurrence on freeway sections has been related to temporal–spatial variation in speed and to high lane occupancy. Critical modeling questions that remain unanswered relate to the transferability of such an approach. This study attempts to address the issues of such transfer through analysis of crash data and corresponding loop detector data from five freeways in the Utrecht region of the Netherlands. Traffic surveillance systems for these freeways include more detectors per kilometer than most U.S. freeways, and their real-time data are already being used for advanced intelligent transportation system applications. The analysis procedure proposed here accounts for these distinctions. In addition to addressing these transferability issues, the study introduces a new data-mining methodology, Random Forests, for identifying variables significantly associated with the binary target variable (crash versus noncrash). It was found that the averages and standard deviations of speed and volume are related to real-time crash likelihood. Subjecting these significantly related variables to multilayer perceptron and normalized radial basis function neural networks resulted in classifiers that achieved classification accuracy of approximately 61% for crashes and 79% for noncrashes. This promising classification accuracy indicates that such models can be used for reliable assessment of real-time crash risk on Dutch freeways as well.

M. Abdel-Aty, A. Pande, and A. Das, Department of Civil and Environmental Engineering, University of Central Florida, 4000 Central Florida Boulevard, Orlando, FL 32816-2450. W. J. Knibbe, Rijkswaterstaat Transport Research Center, Rotterdam NL-3000, Netherlands. Corresponding author: A. Pande, anurag@mail.ucf.edu.

Transportation Research Record: Journal of the Transportation Research Board, No. 2083, Transportation Research Board of the National Academies, Washington, D.C., 2008, pp. 153–161. DOI: 10.3141/2083-18

Underground loop detectors are one of the most common traffic surveillance apparatuses on freeways. Data from these detectors are used for various intelligent transportation system (ITS) applications, such as travel time estimation and incident detection. Recently, however, the focus has been shifting toward proactive applications of these data, including the development of real-time crash-risk assessment models. These models are developed from analysis of traffic surveillance data observed before historical crashes. For example, Golob and Recker (1) and Pande and Abdel-Aty (2) related crash patterns to traffic data from single-loop detector stations in California and dual-loop detector stations in Florida, respectively. These models can differentiate crash-prone traffic conditions from noncrash (or normal) traffic conditions in real time. They can potentially be used to issue warnings or implement traffic management strategies ahead of time to reduce measures of crash risk and avoid an imminent crash.

Despite the extensive analysis presented in these studies, a critical issue that remains is the transferability of such an approach, especially to locations with different traffic surveillance systems. Archived freeway traffic data may vary in the traffic parameters collected, the level of aggregation, and the spacing of loop detectors. It may be possible to transfer only the approach to real-time crash-risk estimation and not the final models developed by Golob and Recker (1) or Pande and Abdel-Aty (2). The current study explores a sample of crash as well as noncrash data from five freeways in the Utrecht region of the Netherlands with the objective of developing a real-time crash-risk measure. The traffic surveillance–management apparatus on these freeways is advanced in its ability to collect information about congestion, to apply variable speed limits (VSLs) through dynamic message signs (DMSs), and to archive traffic speed and volume data. The changes in speed limits are implemented either automatically (primarily in response to congestion triggers) or manually by operators. Understanding what impact, if any, this implementation has on real-time safety would aid future system improvements. On the Dutch freeways, the density of loop detectors is higher (i.e., the detectors are more closely spaced than in comparable traffic surveillance systems in the United States), while the distance between them is more variable. For this work, the data collected from the detectors were available in the form of 1-min aggregates rather than the 30-s data used in the authors' previous studies.

With the traffic data corresponding to crash as well as noncrash cases, the problem was set up as one of classification, as in a previous study by two of the current authors (2). However, in the current study, a relatively recent data exploration technique, namely Random Forests, was used for selection of variables instead of classification trees. Single decision trees tend to be unstable, whereas a Random Forest, which is a combination of multiple tree classifiers, tends to be more robust. It works efficiently on large data sets and is increasingly being used in the process of selecting variables. A detailed discussion of this methodology for selection of variables appears later in this paper.

With the variables selected by means of the Random Forests method, advanced neural network classifiers were developed. The results indicated promise on the issue of transferability, even as they showed subtle differences due to the nature of the data used


for the present study. The next section provides details of the data used for this study. The analysis, results, and discussion, together with avenues for future research, follow in subsequent sections of the paper.

DATA COLLECTION AND PREPARATION

Study Area and Available Data

As noted earlier, data from loop detectors on five freeway sections in the Utrecht region of the Netherlands were used in this study. The freeways and the mileposts (in kilometers) corresponding to the extremities of the sections are as follows: A1 (28.82 to 46.96 km), A12 (39.8 to 91.6 km), A2 (36.9 to 94.58 km), A27 (53.35 to 100.35 km), and A28 (0.25 to 32.375 km). Loop detectors measuring the traffic parameters were present in both directions of travel, even though they were not perfectly aligned at certain locations. Furthermore, the spacing between consecutive loop detectors in the same direction of travel was not constant, but it was always less than 0.8 km (0.5 mi). Hence, the arrangement was significantly different from the series of loop detectors present on Interstate 4, the freeway used in a previous study by two of the current authors (2). As noted later, this difference was critical in devising the approach to data analysis.

The following traffic parameters were available for every minute: speed, volume (normalized as hourly volume), a flag for traffic congestion, and the message displayed on the DMSs. The possible DMS displays included the implemented VSLs or suggested traffic maneuvers (e.g., merge left). The data set did not contain information on lane occupancy. These data were available for the month of September 2006, along with the incident reports for the same period and freeways. The incident reports included 288 crashes, and the loop data corresponding to those crashes were extracted and used in this study.

Data Preparation

First, the location and time of occurrence for each of the 288 crashes were identified. Then, for every crash, six loop detector stations (three stations each in the upstream and downstream directions) were identified. Because some of the loop detectors were very closely spaced, it was decided to require a minimum of 250 m between consecutive loop detector stations. The threshold of 250 m was established because of the arrangement of the loop detectors: a lower threshold would have meant that the data from two consecutive loop detectors would provide scant independent information. A higher threshold [say, 1/2 mi (0.8 km), comparable to the average distance between consecutive loop detectors on Interstate 4 in Orlando, Florida] would have meant that the six detectors used for data collection were spread over 5 to 6 km (2.5 to 3 km in each direction). That in turn would have required a 2.5- to 3-km instrumented section (a section with loop detectors installed) of freeway on both sides (upstream and downstream) of a crash location. A larger threshold such as 0.8 km would therefore have meant leaving out more crashes located close to the two ends of the five freeway corridors. Because the research was dealing with five roadways rather than one long freeway section, such a larger threshold would have resulted in a significant reduction in the already limited sample size.

The next step was to extract loop data corresponding to the crashes. The first upstream and downstream loop detector stations relative to the crash were named US1 and DS1, respectively. The subsequent loop detector stations in either direction were named US2 and DS2 and US3 and DS3, respectively. Figure 1 shows the positions of the upstream and downstream loop detector stations relative to a crash and the minimum spacing between them. In Figure 1, Position 1 represents the set of two loop detectors nearest to a crash location in the upstream and downstream directions. Positions 2 and 3 comprise the sets of two subsequent detectors in the upstream and downstream directions. The significance of defining these sets of detectors will be made clear later in this section. The loop data were then extracted in the following format: if a crash, for example, had occurred on September 4, 2006 (a Monday), at 5:00 p.m. on Freeway A2, then the corresponding loop detector stations of interest were US1, US2, and US3 in the upstream direction and DS1, DS2, and DS3 in the downstream direction. This crash case would have a loop database consisting of the 1-min averages of speed and volume and the congestion flag, along with the message displayed on the DMS for each minute, for all lanes at the six stations from 4:50 to 5:05 p.m. (a 15-min window) on September 4, 2006. A variable Y was created with a value of 1 for all the crashes; it would be used as the binary target variable for developing classification models, with a value of 0 for the noncrash cases.

The modeling procedure required noncrash data corresponding to each crash to be available. For the crash considered in the previous paragraph (assumed to have occurred at 5:00 p.m. on Monday, September 4, 2006), the corresponding noncrash loop data were collected for the same time window as the crash data on all Mondays in September 2006. Furthermore, these data were collected

FIGURE 1 Arrangement of the loop detector stations relative to a crash location. In the direction of travel, stations US3, US2, and US1 lie upstream of the crash location and DS1, DS2, and DS3 downstream, with a minimum spacing of 250 m between consecutive stations; Position 1 comprises US1 and DS1, Position 2 comprises US2 and DS2, and Position 3 comprises US3 and DS3.
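The station-selection rule behind Figure 1 lends itself to a short illustration. The following is a hypothetical Python sketch (the paper publishes no code): it picks three stations on each side of a crash location while enforcing the 250-m minimum spacing from the text. The function name and the mileposts in the usage note are invented for illustration.

```python
# Illustrative sketch (not the authors' code): select three upstream and three
# downstream detector stations around a crash location, enforcing the paper's
# minimum spacing of 250 m between consecutive selected stations.

MIN_SPACING_KM = 0.25  # 250-m threshold from the paper


def select_stations(detector_kms, crash_km, n_each_side=3):
    """Return (upstream, downstream) milepost lists, i.e., US1..US3 and DS1..DS3.

    detector_kms: mileposts (km) of detectors in one direction of travel.
    A candidate detector closer than 250 m to the previously selected one
    is skipped, mirroring how closely spaced loops were excluded.
    """
    upstream = [km for km in detector_kms if km < crash_km]
    downstream = [km for km in detector_kms if km >= crash_km]

    def pick(candidates):
        chosen = []
        for km in candidates:
            if not chosen or abs(km - chosen[-1]) >= MIN_SPACING_KM:
                chosen.append(km)
            if len(chosen) == n_each_side:
                break
        return chosen

    # Upstream stations are picked moving away from the crash (US1 first),
    # downstream stations likewise (DS1 first).
    return pick(sorted(upstream, reverse=True)), pick(sorted(downstream))
```

For example, for a (hypothetical) crash at km 10.65 with detectors at 10.0, 10.1, 10.3, 10.6, 10.9, 11.0, 11.2, and 11.5 km, the detectors at 10.1 and 11.0 km are skipped because they fall within 250 m of an already selected station.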



from the same upstream and downstream loop detector stations from which the data corresponding to the crash case were extracted. This sampling scheme controlled for other critical factors affecting crash occurrence, such as driver population, location on the freeway (geometric features, etc.), time of day, and day of week. The variable Y is 0 for all noncrash data.

The next step was loop data aggregation. The raw 1-min data were observed to contain random noise and were difficult to work with in a modeling framework. Therefore, the raw data were aggregated to the 5-min level to obtain averages and standard deviations. Figure 2 demonstrates the noise reduction in the speed data following the 5-min aggregation. The 5-min aggregate data also provided more allowance, in terms of time, to analyze the data and to estimate and possibly reduce the likelihood of crashes. The decision to use a 5-min level of aggregation rather than a 3-min level has been discussed in detail in a previous study (3). The 15-min period for which data were collected was divided into three time slices, numbered 0 through 2. The interval between the time of a crash and 5 min after the crash was named Time Slice 0; the interval between the time of the crash and 5 min before it was named Time Slice 1; and the interval between 5 and 10 min before the crash was named Time Slice 2. The traffic parameters were further aggregated across lanes, and the averages (and standard deviations) for speed and volume at the 5-min level were calculated along with the logarithm of the coefficient of variation (standard deviation/average). The aggregation across all lanes was necessary because there were instances in which a particular lane's loop detector was not reporting data; aggregating over all lanes led to fewer missing observations in the data set.

The nomenclature for these averages and standard deviations is of the form XYZα_β. X takes the value A or S for average and standard deviation, respectively, while Y takes the value S or V for speed and volume, respectively. Zα takes the value U1, U2, U3, D1, D2, or D3, depending on the station to which a traffic parameter belongs (the nearest upstream or downstream station relative to the crash location being U1 or D1 and subsequent detectors being U2 or D2 and U3 or D3, respectively). β takes the value 0, 1, or 2, referring to the three time slices. Hence, ASD1_2 and AVU1_2 represent the average speed at Station DS1 over Time Slice 2 and the average volume at Station US1 over Time Slice 2, respectively. The corresponding names for the coefficient-of-variation variables can be deduced by dropping the first letter (A or S) and replacing it with the term CV.

As noted earlier, in addition to speed and volume information, the traffic surveillance data from these freeways included a congestion indicator for the location of each loop detector. The flag for congestion was contingent on the speed data being observed, with an average speed below 50 km/h indicating congestion. The speed data used to flag locations for congestion were smoothed by using an exponential smoothing algorithm. A smoothing algorithm ensures that individual observations with relatively high or low values (sometimes referred to as spikes) do not cause the congestion indicator to fluctuate unrealistically. Theoretical details and advantages of the exponential smoothing algorithm may be found in Hunter (4). To convert the flag into a numerical variable, a variable CON was created, valued 1 for congested conditions and 0 for noncongested conditions. The averages and standard deviations for this variable were then created with the same procedure, and the nomenclature for the averages and standard deviations of CON was identical. For example, the variable ACOND1_2 represents the average congestion at loop detector Station D1 during (5-min) Time Slice 2.

Dutch freeways have a VSL system in place, and the VSL information disseminated on the DMS that coincided with the pavement location containing the loop detector was also part of the traffic surveillance database. For the research problem at hand, it was important to determine, first, whether VSL was in effect at a particular station–time slice and, second, whether there had been a change in the values of the VSL applied at those locations within a particular time slice (i.e., a change within a change). Hence, two binary variables, V and CW, were created for each of the six stations and three time slices. For example, V_D1_2 = 1 implies that VSL had been implemented at the location of DS1 during Time Slice 2, and CW_U1_1 = 0 implies that there was no change in the implementation strategy at the location of US1 during Time Slice 1, and so on. Thus, after the aggregation process there is one row of data for each crash–noncrash case, with 180 (10 parameters × six stations × three time slices) potential input variables. The final data set had 288 crash and 968 noncrash cases.

The procedures for selection of variables and modeling used in this study required that all corresponding variables in the data set be nonmissing. As expected, there were a significant number of observations for which at least one of the six data-collection stations was not reporting data, which resulted in missing values for some variables in those observations. Hence, there was a risk of drastically reducing the number of good observations if data from all six stations were used in the same model. A similar problem had been encountered in a previous study (5). To illustrate the problem caused by loop detector failures, it can be assumed that the probability of failure for each of the six stations is a and that their failures are independent of each other. The expected proportion of complete cases (i.e., cases with no missing data from any station)

FIGURE 2 Noise reduction at the 5-min aggregation level: speed (km/h) plotted against time (min) for the 1-min and 5-min series.
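The 1-min-to-5-min aggregation and the XYZα_β naming scheme described above can be sketched as follows. This is an illustrative Python fragment, not the authors' processing code; the paper does not state whether a population or sample standard deviation was used, so the population form here is an assumption, and the helper names are invented.

```python
import math
from statistics import mean, pstdev


def aggregate_5min(one_min_values):
    """Collapse five 1-min observations into the paper's 5-min statistics:
    average, standard deviation, and the log of the coefficient of variation
    (standard deviation / average). Population std (pstdev) is an assumption;
    the paper does not specify the estimator."""
    avg = mean(one_min_values)
    sd = pstdev(one_min_values)
    log_cv = math.log(sd / avg) if sd > 0 else float("-inf")
    return avg, sd, log_cv


def variable_name(stat, param, station, time_slice):
    """Build names of the form XYZa_b, e.g. ASD1_2 = average (A) speed (S)
    at station D1 during time slice 2."""
    assert stat in ("A", "S") and param in ("S", "V")
    return f"{stat}{param}{station}_{time_slice}"
```

For instance, five 1-min speeds of 100, 102, 98, 101, and 99 km/h aggregate to an average of 100 km/h, and `variable_name("A", "S", "D1", 2)` yields the paper's `ASD1_2`.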



will be (1 − a)^k, where k is the number of stations whose parameters are included in the same model. For a = 0.15 (a 15% probability of failure) and k = 6, only about 38% of the observations would be complete. But if only one station is considered (k = 1), then on average 85% of the observations would be complete. This illustration does not provide an actual estimate of good observations but only exemplifies the problem of missing observations caused by the use of data from more stations.

On the basis of the above discussion, it was decided that data from no more than two stations would be used in the same model. Observing traffic parameters from one station at a time would have led to even more complete records. However, with data from only one loop detector station, there would be no way to examine the interplay between traffic data observed upstream and downstream of the crash location and its effect on crash risk. With these constraints in perspective, the following strategy was adopted for the analysis: the procedure of selection of variables and modeling would be accomplished in three independent phases. These three phases differed from each other in the sets of two loop detectors providing the data used in each phase. The first phase used data corresponding to nonmissing observations from the set of loop detectors referred to as Position 1 in Figure 1 (i.e., stations US1 and DS1). The second phase included data from the set of two stations referred to as Position 2 in Figure 1 (i.e., stations US2 and DS2). Similarly, data corresponding to nonmissing observations from the loop detector locations US3 and DS3 were used in the third phase. The individual phases of this approach allowed investigation of the impact of traffic patterns observed at loop detector stations located upstream and downstream of the crash location, while the differences between the results of the three phases helped in inferring how the effect of the traffic parameters on crash risk changes with respect to time and space. The data sets prepared for the three phases had 144, 162, and 143 complete crash records for Positions 1, 2, and 3, respectively. The numbers of corresponding complete noncrash records were 506, 560, and 505, respectively.

This three-phase analysis was conducted for all three time slices. However, the focus of discussion in this paper is on variables calculated for Time Slice 2 (i.e., 5 to 10 min before a crash). This period is close enough to the crash time that, from experience, it should provide insight into crash-prone traffic conditions. Time Slice 1 (i.e., 0 to 5 min before a crash) would be too close to the crash time to work with in real time. Time Slice 0 is the period 0 to 5 min after the crash and hence would be of interest only for incident detection, which is not the focus of this research.

The variables of importance were identified by using the Random Forests methodology first proposed by Breiman (6). The selected variables were then used as inputs for building the neural network models. The two data-mining processes, Random Forests and neural networks, are discussed in the sections that follow.

MODELING METHODOLOGIES

Data-mining processes are used to find unsuspected relationships in large observational data sets (7). These processes typically involve analysis in which the objectives of the data analysis have no bearing on the data-collection strategy (i.e., there is no experimental design). Establishing relationships between loop detector data and crash data that are collected independently of each other is therefore an ideal problem for data-mining analysis. The authors previously used data-mining processes such as classification trees for selection of variables and neural network–based modeling procedures with parameters identified through the preceding classification trees as inputs (2, 8).

In this study, Random Forests, a collection of multiple tree classifiers, were used for selection of variables. A decision tree, for all its simplicity and its handling of missing values, can be very unstable; that is, small changes in the input variables might result in large changes in the output. In this situation, Random Forests are a more robust tool for selection of variables. Another advantage of using Random Forests instead of classification trees is that, because of an internal test mechanism, one need not divide the input data set into separate training and validation samples. Selection of variables through the Random Forest methodology was implemented with the randomForest function in R (9).

Random Forests and Variable Importance Scores

As noted earlier, a Random Forest is a collection of tree classifiers. A random subset of variables independently sampled from the input variables is used to grow each constituent tree to the full extent. The resulting classification from each tree is then treated as a vote for the corresponding class. In this study, the binary target is the variable Y, which takes a value of 1 for crash records and 0 for noncrash records. For any input vector, the forest chooses the class with the maximum number of votes. The process of growing each constituent tree may be divided into the following steps (10):

1. For N cases in the training set, a bootstrap sample of size N is drawn for growing the tree.
2. From M input variables, a constant number m (m << M) of variables are selected at random at each node. The best split among these m variables is used to split the node.
3. Trees are grown to the full extent without pruning.

The forest error rate is directly proportional to the correlation between any two trees and inversely proportional to the strength of the individual trees. In other words, Random Forests with strong individual trees providing independent information lead to better classification performance. Random Forests run efficiently on large data sets because they can handle a large number of variables without overfitting the data. Because Random Forests were used here as a data-exploration tool, the feature of interest in this study was their ability to identify the variables most significantly associated with the binary target (10).

When a particular tree is grown from a bootstrap sample, one-third of the training cases are left out and not used in the growth of the tree. These left-out cases are called out-of-bag (OOB) data. The OOB cases, effectively an internal test data set, are used to obtain an unbiased error estimate as well as estimates of variable importance. The process for assessing the importance of a variable p in the context of binary classification is as follows:

• First, the OOB cases are put through every constituent tree grown to the full extent, and the votes for the correct class are counted.
• Then, the values of the variable p are permuted randomly, and the permuted cases are put through the tree again, with the votes again counted.
• The raw importance score of variable p is the average difference in the votes between the permuted OOB data and the untouched OOB data across all trees in the forest (10).

Another variable importance measure is based on the Gini importance. In this measure, whenever a split of a node is made on the

variable p, the Gini impurity for the two descendant nodes is less than that of the parent node. The reductions in Gini impurity are summed for each variable over all the trees in the forest. The Gini importance provides a score that is generally consistent with the permutation importance score. In this study, this second variable-importance score, based on the Gini impurity criterion, was used to assess the importance of variables.

Multilayer Perceptron Neural Network Architecture

A neural network may be defined as a massively parallel distributed processor made up of simple processing units with a natural propensity for storing experiential knowledge and making it available for use (11). The computing power of neural networks comes from their ability to learn and generalize. Neural network models are usually specified by three entities: the model of the processing elements themselves, the model of the interconnections and structures, and the learning rules. A multilayer perceptron (MLP) network with one hidden layer of neurons is one of the two neural network architectures used in this study. The connections are feedforward. The combination function, which is the net input to the hidden-layer neurons, is determined through the inner product between the vectors of connection weights and inputs. The nonlinear nature of the activation function is critical, as it allows the network to learn any underlying relationships of interest between inputs and outputs. A detailed description of the MLP neural network architecture and the Levenberg–Marquardt (LM) training procedure for a related application may be found elsewhere (8).

Normalized Radial Basis Function Neural Network

The other neural network architecture used was the radial basis function (RBF) network, which is also feedforward with a single hidden layer. The combination function is more complex for these networks and is based on a distance function (referred to as the width) between the input and the weight vectors. Ordinary RBF (ORBF) neural networks use an exponential activation function. RBF networks that instead use a softmax activation function are called normalized RBF (NRBF) networks. In NRBF networks, the height of the Gaussian curve over the horizontal axis is termed the altitude. A detailed discussion of the advantages of NRBF networks over ORBF networks is provided by Tao (12). In this study, an unconstrained NRBF (NRBFUN) network, which makes no assumptions about the form of the combination function, was used. The same LM procedure used to train the MLP network may be used for training these networks. More detailed information on this neural network architecture and the training algorithm may be found elsewhere (8).

ANALYSIS AND RESULTS

Selection of Variables

As described earlier in the paper, the modeling procedure was performed in three independent phases. These phases differ in the sets of loop detectors from which the traffic data were collected. These sets are referred to as Position 1, Position 2, and Position 3 (see Figure 1) on the basis of the relative position of the constituent stations with respect to the crash location. Parameters of importance measured during Time Slice 2 (5 to 10 min before a crash) were used as inputs for developing the final classification model for each of the three phases. A series of neural network models was developed and evaluated to determine the best overall model. The data set included averages, standard deviations, and the logarithms of the coefficients of variation of speed and volume. The coefficient of variation essentially represents both the average and the standard deviation. Therefore, it was decided that two separate runs of the procedure for selection of variables would be performed: one run included averages and standard deviations as inputs, and the other included coefficients of variation but not the constituent averages and standard deviations. This gave six sets of importance variables (three positional phases × two runs). The data sets were read in R, and Random Forests were grown by using the randomForest function (9). Figures 3a and 3b show the

FIGURE 3 Variable-importance plots (mean decrease in Gini) for the first phase: (a) run with averages and standard deviations of the traffic parameters; (b) run with logarithms of the coefficients of variation.
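The mean-decrease-Gini scores plotted in Figure 3 rest on the impurity reduction achieved at each split, which can be sketched in a few lines. This is a minimal stdlib Python illustration of the criterion only, not the study's implementation; the study itself used the randomForest package in R.

```python
def gini(labels):
    """Gini impurity of a set of binary labels (1 = crash, 0 = noncrash)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def gini_decrease(parent, left, right):
    """Weighted reduction in Gini impurity from splitting `parent` into
    `left` and `right`. Summing these decreases over every split made on a
    variable, across all trees in the forest, yields that variable's
    mean-decrease-Gini importance score."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))
```

A split that perfectly separates two crashes from two noncrashes, for example, reduces the impurity from 0.5 to 0, so that split contributes 0.5 to the splitting variable's importance.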



variable-importance plots for the two runs of the first phase, which included data from Position 1 (i.e., nonmissing observations from stations US1 and DS1). The variable importance was based on the mean decrease in Gini; because this was part of the exploratory analysis, the Gini measure was preferable to the actual classification accuracy. The Gini plots in Figure 3 demonstrate a clear distinction, on the basis of a significant drop in the importance measure, between the variables that may be important and those that might not be.

Figure 3a shows the results for the run that included the averages and standard deviations of speeds and volumes along with the variables representing VSL implementation (e.g., V_U1_2). It is clear that the first eight variables have distinctly higher variable-importance scores than the remaining variables. Hence, it may be inferred that the actual averages and standard deviations of the traffic parameters measured during Time Slice 2 were related to crash risk more significantly than the variables representing the congestion flag and VSL implementation. Figure 3b illustrates the results for the run that included the logarithms of the coefficients of variation of speed and volume; these were found to be significant for both upstream and downstream loop detector locations. Similar results were found for the other two positional phases (with data from Positions 2 and 3) for Time Slice 2. Table 1 shows the variables of importance (in order of their importance) selected for all three positional phases for each of the three time slices. The highlighted cells in the table represent the position and time slice of the importance variables discussed in the next section as inputs to the neural network models.

A closer examination of the individual trees constituting the forests revealed that higher variation in speed at both the upstream and downstream stations, as well as lower average speeds upstream, in Time Slice 2 increased the likelihood of crashes. Results for Time Slice 1 (the 5-min interval before a crash) were similar to those for Time Slice 2, except for the significance of the variable representing average congestion at the upstream station (ACONU2_1) for Phase 2. This measure of congestion upstream was significant in all three phases if Time Slice 0 was considered. Time Slice 0 represents the 5-min

variables did indicate that rear-end crashes likely dominated the crash data on these freeways as well.

Neural Network–Based Classification Models

In the variable selection procedure described earlier, six sets (three positional phases × two runs) of importance variables were made available. These sets were used as inputs to develop a series of neural network models and to identify the models with the best classification performance. The performance of a neural network model may be measured by determining how well the model captures the target event (crashes, in this case) across the deciles (one decile represents 10 percentiles) of posterior probability, the term used for the output of the model, which lies between 0 and 1. The closer the output is to unity, the more likely that observation is to be a crash. Because crashes are rare events, one should be parsimonious in issuing warnings; it would therefore be unreasonable to classify more than 20% to 30% of observations as crashes. In light of these observations, it was decided to evaluate the neural network models at the validation stage on the basis of the cumulative percentage of crashes identified within the first three deciles of posterior probability. On the basis of this criterion, it was noticed that the models with averages and standard deviations of the traffic parameters consistently outperformed (i.e., identified a higher percentage of crashes than) the models with coefficients of variation. A possible reason for this performance is that models based on data mining operate better with the primary traffic parameters (average and standard deviation) without being constrained by a prespecified relationship between the two (coefficient of variation = standard deviation/average).

Proofs in the literature have shown that an MLP network with one hidden layer and nonlinear activation functions for the hidden nodes can learn to approximate virtually any continuous function (13). Therefore, the critical modeling issue was to estimate the number of neurons in the hidden layer. The methodology adopted for selection of the appropriate number of nodes in the hidden layer of
duration after a crash. It is not surprising that the congestion flag at the MLP–NRBFUN models was to evaluate the performance of multiple
upstream stations can be associated with a crash occurrence. How- models with hidden nodes varying from three through 10.
ever, this information was useful only for the application of incident In Figure 4a for Position 1 (including traffic parameters from
detection, which of course was not the focus of this research. stations US1 and DS1), individual MLP and NRBFUN models that
Another interesting aspect of the variables found significant was identified the highest percentage of crashes within each of the first
that similar sets of variables were found to be significantly associated three deciles (10, 20, and 30 percentiles) were identified. The best
with rear-end crashes (2). Rear-end crashes are the most frequent MLP models found in the first step of evaluation had seven neurons
type of crash on freeways in the United States. In the present study, (10th percentile), three neurons (20th percentile), and 10 neurons
the type of crash was not known. However, similarity of significant (30th percentile). Correspondingly, the best NRBFUN included four

TABLE 1 Variables of Importance for Separating Crash and Noncrash Cases on Basis of Random Forest

Phase Time Slice Important Variables

Position1 0 ASU1_0, ASD1_0, SSU1_0, AVD1_0, AVU1_0, SSD1_0, SVD1_0, SVU1_0, ACONU1_0
Position1 1 ASU1_1, ASD1_1, AVD1_1, AVU1_1, SSU1_1, SVU1_1, SSD1_1, SVD1_1
Position1 2 ASU1_2, ASD1_2, SSD1_2, SSU1_2, SVU1_2, SVD1_2, AVD1_2, AVU1_2
Position2 0 ASU2_0, ASD2_0, SVU2_0, AVD2_0, SVD2_0, AVU2_0, SSU2_0, SSD2_0, ACONU2_0
Position2 1 ASU2_1, AVD2_1, ASD2_1, ACONU2_1, SSU2_1, SSD2_1, AVU2_1, SVD2_1, SVU2_1
Position2 2 ASU2_2, ASD2_2, AVD2_2, SSU2_2, SVU2_2, AVU2_2, SVD2_2, SSD2_2
Position3 0 ASD3_0, ASU3_0, SVD3_0, SSU3_0, AVD3_0, AVU3_0, SSD3_0, SVU3_0, ACONU3_0
Position3 1 ASU3_1, ASD3_1, SSU3_1, AVD3_1, SSD3_1, SVD3_1, SVU3_1, AVU3_1
Position3 2 ASU3_2, AVD3_2, SSU3_2, ASD3_2, AVU3_2, SVU3_2, SSD3_2, SVD3_2
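The Random Forest ranking that produced Table 1 orders candidate inputs by their mean decrease in Gini impurity (6, 9, 10). The study used the R randomForest package; as a rough illustration of the same mechanism (not the study's data or code), a scikit-learn sketch on synthetic data, with variable names patterned after the table, might look like this:

```python
# Sketch of Gini-importance ranking with a Random Forest.
# The data here are synthetic stand-ins, not the Dutch loop detector data;
# variable names mimic the paper's convention (e.g., ASU1_2 = average
# speed, upstream station, Time Slice 2).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["ASU1_2", "ASD1_2", "SSU1_2", "SSD1_2", "AVU1_2", "AVD1_2"]
n = 400

# Synthetic crash indicator driven mainly by the speed variables,
# so those columns should rank near the top of the importance list.
X = rng.normal(size=(n, len(names)))
y = ((X[:, 2] + X[:, 3] - X[:, 0]) + rng.normal(scale=0.5, size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A pronounced drop in the sorted importance scores, as in Figure 3, then suggests where to cut the variable list.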
Abdel-Aty, Pande, Das, and Knibbe 159

[Figure 4: three lift plots of percentage of crashes identified (y-axis, 0–100) versus percentile of posterior probability (x-axis, 0–100). Panel (a): hybrid of MLP-10 and NRBFUN-8; panel (b): hybrid of MLP-4 and MLP-10; panel (c): MLP-10. Each panel also shows the random baseline model and the best model possible.]
FIGURE 4 Classification performance of the neural network-based models: comparison of the best
model in each position with baseline and best possible model for (a) Position 1, (b) Position 2,
and (c) Position 3.
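The lift curves in Figure 4 plot the cumulative percentage of crashes captured against the percentile of posterior probability. A minimal sketch of that computation, using made-up posterior probabilities and labels rather than the study's model outputs, could be:

```python
# Sketch of a lift (gains) computation: cumulative percentage of crashes
# captured within each decile of posterior probability. Probabilities and
# labels are illustrative, not from the paper's validation data set.
import numpy as np

def lift_by_decile(posterior, is_crash):
    """Cumulative % of crashes captured within each decile of posterior probability."""
    order = np.argsort(posterior)[::-1]          # highest predicted risk first
    sorted_crash = np.asarray(is_crash)[order]
    n = len(sorted_crash)
    total = sorted_crash.sum()
    return [100.0 * sorted_crash[: int(round(n * d / 10))].sum() / total
            for d in range(1, 11)]

rng = np.random.default_rng(1)
y = (rng.random(200) < 0.2).astype(int)          # rare target event (~20% crashes)
# Informative but noisy scores: crashes tend to receive higher posteriors
p = np.clip(0.5 * y + rng.normal(0.3, 0.15, size=200), 0.0, 1.0)

curve = lift_by_decile(p, y)
print([round(c, 1) for c in curve])
```

A random baseline would trace 10, 20, 30, ... across the deciles; the separation of the computed curve from that diagonal is the quantity read off the plots above.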

hidden neurons (10th percentile) and eight hidden neurons (20th and
30th percentiles). The maximum percentage of crashes identified
within the third decile was 59%. To improve the classification performance
of the models further, some hybrid models were explored in
the next step. All possible two-at-a-time combinations of the models
identified above (i.e., MLP-3, MLP-7, MLP-10, NRBFUN-4, and
NRBFUN-8) were used to estimate these hybrid models. The models
were hybridized by averaging the posterior probability output from
the individual models.

Maximum improvement in the percentage of crashes identified
within the 30th percentile was achieved by hybridizing the MLP
network with 10 hidden neurons (MLP-10) and the NRBFUN network
with eight hidden neurons (NRBFUN-8). Figure 4a depicts
the lift plot of the hybrid model. It shows the percentage of the crashes
in the validation data set captured within various deciles of posterior
probability. It also shows the performance of a random baseline model,
which represents the expected percentage of crashes identified in the
validation sample if one randomly assigns a validation data set observation
as crash or noncrash. The plot also shows the performance of
a hypothetical optimum model for which the posterior probability output
for every crash is higher than that for every single noncrash case
in the validation data set. This hypothetical model of course would
capture 100% of the crashes within the minimum possible percentile
values. The separation of the model lift curve from the random
baseline curve and its proximity to the hypothetical optimum model
can be used to assess the classification performance of the model.
Figures 4b and 4c show similar plots for the best models in the phases
including data from station sets Position 2 and Position 3. The best
160 Transportation Research Record 2083

model in Position 2 was a hybrid of two MLP networks with four
and 10 hidden neurons, respectively, while the best in Position 3 was
an MLP network with 10 hidden neurons.

The model developed for Position 1 had the maximum separation
from the random baseline model at the 30th percentile, while the
model for Position 2 had the greatest separation at the 10th percentile.
Hence, the model developed in the first phase (i.e., Position 1)
was the best at the 3rd decile, while the one developed in the second
phase (i.e., Position 2) was the best at the 1st decile. These two
phases included data from Stations US1 and DS1 and Stations US2 and
DS2, respectively. These two curves coincide with their respective
random baseline curves as they approach the 8th and 9th deciles. It
essentially means that certain crashes are just impossible to identify
by using these models. It has interesting implications for the future
scope of this research. First of all, a one-size-fits-all approach might
not work for all crashes, and separate models may be required for
different types of crashes. If one observes the 8th and 9th deciles for
the Position 3 model (Figure 4c), the model curve never converges
with the random baseline. This observation indicates that traffic data
from stations located at some spatial separation (e.g., Position 3) might
be significant only for crash types that do not constitute a major
sample of the crash data used in this study. It further underscores the
need to look at crash data by type. The inputs used in these models
included averages and standard deviations of speed and volume and
not the coefficients of variation. A complete list of inputs may be
found in Table 1.

The performance was measured in terms of the percentages of
crashes identified at the first three deciles (10th, 20th, and 30th percentiles).
The percentage of crashes identified by the baseline model
was equal to the corresponding percentile values. For the actual classification
models, as the percentage of observations declared as crashes
increases, the crash identification would improve, but the percentage
of noncrash cases correctly identified would decrease, thus increasing
false alarms. Table 2 shows the performance of the three classification
models depicted in Figure 4 over the validation data set. In
Table 2, the percentage of crashes identified by each model within
the first three deciles is depicted. The table also shows the differences
between the percentages identified by the corresponding model and the
random baseline model (+X) along with differences between the percentages
identified by the corresponding model and the hypothetical
perfect model (−Y).

The procedure for estimating the classification accuracies of the
models for crash and noncrash cases is explained below, with the
example of the Position 1 model (highlighted cell in Table 2). If
the 30th percentile of posterior probability is used as the threshold
to separate crashes from noncrashes, then 30% of the 196 (= 152 + 44)
validation data set observations—59 observations—will be classified
as crashes. Therefore, the hybrid model would identify more than 61%
of the crashes (i.e., 27 of 44) by assigning 59 patterns as crashes.
From the remaining 137 (= 196 − 59) observations, 17 will be missed
crashes and 120 noncrash cases will be correctly identified. This
essentially means that 78.95% (120 of 152) of noncrashes have been
correctly identified. The model thus achieves around 79% classification
accuracy for noncrash cases and 61% accuracy for crash cases.
In contrast, the models developed in Position 2, in Time Slice 2, at
the 30th percentile of posterior probability, had crash and noncrash
classification accuracy of 56% and 81%, respectively; for Position 3,
the figures stood at 43% and 74%, respectively. The promising performance
of the Position 1 model (with data from 5 to 10 min before
the crash) leads to the inference that it is indeed possible to assess
real-time crash risk on the basis of traffic surveillance data collected
on these Dutch freeways.

CONCLUSIONS AND FUTURE SCOPE

Loop detectors are an essential part of the traffic surveillance infrastructure
around the world. On freeways, loop detectors provide such
traffic parameters as speed, volume, and some measure of density at
time intervals as small as 20 s. The authors, in some previous studies,
have developed crash-risk assessment models for Interstate 4 in
Florida (e.g., 2, 3, and 8) that use these real-time loop detector data
as inputs. These models may be applied to formulate strategies that
can reduce the crash risk in real time. These proactive applications
go beyond the traditional traffic management research that largely
focuses on incident detection.

The objective of this research was to assess whether the approach
used for real-time crash-risk assessment on Interstate 4 can be transferred
to European freeways equipped with more extensive traffic
surveillance–management systems. One month of crash data from
five freeway sections in the Netherlands, along with corresponding
crash and noncrash loop detector data, were used in this study to
explore the transferability of such an approach. Besides examining
the transferability-related issues, this research also demonstrated
application of a new data-mining methodology for selection of variables.
It is one of the first applications of the Random Forest–based
procedure for selection of variables in the transportation engineering
literature. The advantage of using Random Forests instead of more
traditional classification trees, especially when sample size is limited,
is that there is no need for a separate cross-validation–test data set
to obtain unbiased error estimates.

The classification performance of various neural networks (from
the inputs found significant by Random Forest) was satisfactory,
and the best model provided 61% accuracy for crashes and 79% accuracy
for noncrash cases. The results also indicated that some crashes
may not be identifiable in real time. The results seemed to be consistent
with the results reported in a previous study by the authors (14)
but might need further validation with a larger sample. A closer

TABLE 2 Performance Table of Classification Accuracies for Three Best Position Models: Percentages of Crashes Identified in the Validation Data Set

Percentile of
Posterior      Baseline   Position 1       Position 2       Position 3
Probability    Model      Model (%)        Model (%)        Model (%)

10             10         25 (+15) (−20)   32 (+22) (−10)   13 (+3) (−32)
20             20         39 (+19) (−50)   37 (+17) (−50)   25 (+5) (−64)
30             30         61 (+31) (−39)   56 (+26) (−44)   43 (+13) (−57)
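The arithmetic behind the highlighted Position 1 cell in Table 2 (30th percentile of posterior probability) can be checked directly; the counts below are taken from the worked example in the text:

```python
# Verifying the crash/noncrash accuracy arithmetic for the Position 1
# model at the 30th-percentile threshold (figures taken from the text:
# 196 validation observations = 152 noncrash + 44 crash).
n_noncrash, n_crash = 152, 44
n_total = n_noncrash + n_crash               # 196 validation observations

flagged = round(0.30 * n_total)              # observations declared crashes: 59
crashes_caught = 27                          # true crashes among the flagged 59

missed_crashes = n_crash - crashes_caught            # crashes left unflagged
false_alarms = flagged - crashes_caught              # noncrashes flagged as crashes
correct_noncrash = n_noncrash - false_alarms         # noncrashes correctly passed

crash_accuracy = 100.0 * crashes_caught / n_crash
noncrash_accuracy = 100.0 * correct_noncrash / n_noncrash
print(flagged, missed_crashes, correct_noncrash)         # 59 17 120
print(round(crash_accuracy, 1), round(noncrash_accuracy, 2))  # 61.4 78.95
```

This reproduces the roughly 61% crash and 79% noncrash classification accuracy quoted for the Position 1 model.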

examination of the correlations between the outputs from different
models permitted the inference that the models using data from different
sets of two stations (i.e., models for Position 1, 2, or 3) were
correctly identifying different groups of crashes. These groups of
crashes might have had some relation with crash types (rear-end, etc.).
The results were quite promising in that they revealed the similarity
of the crash-prone traffic conditions between the Dutch and the U.S.
freeways.

The relationship being sought by these models between crash
occurrence and traffic data is also worth discussing. The traffic conditions
observed 5 to 10 min before the crash related to the crash
occurrence in that they led to potential conflicts that could in turn
cause a crash. For example, a high CV in speed was found to be a
significant factor associated with crash occurrence. It indicated that
speeds were varying significantly with time and that the conditions
on the freeway were unstable 5 to 10 min before the crashes. These
speed data indicated that drivers were slowing down and speeding
up quite often. Under these conditions, drivers on the freeway were
more likely to make an error leading to a crash (even though it
would not involve the same vehicles that traversed the location 5
to 10 min earlier).

One of the most important differences in the data collected on
these freeways (compared with most freeways in Florida) was the
information on VSLs. The results of this study show that the variables
representing averages and standard deviations of speed and volume
were more significantly associated with real-time crash risk than the
variables representing VSL application (e.g., V_U1_2 and CW_D1_2;
see Figure 3). It does not necessarily mean that VSL implementation
has little or no impact on crash risk. It does indicate that to identify
clearly the impact of VSL implementation, one needs to relate it
with specific types of crashes (e.g., rear-end). The list of significant
variables (averages and standard deviations of speed and volume
observed 5 to 10 min before the crash at stations upstream and downstream
of the crash site) does suggest that crash-type distribution on
these freeways might be similar to that on Interstate 4 in Florida. In
other words, the crashes more prevalent on these freeways might be
rear-end and lane-change related. It is also possible that the effect of
VSL implementation might be reflected in the speed and volume data
being reported by the detectors. The issue, however, needs to be investigated
through further analysis by relating specific crash types and
traffic conditions. ITS countermeasures such as VSLs may be used to
alleviate crash-prone conditions. With the infrastructure for implementing
VSL already in place on the freeways under consideration,
VSL strategies specifically tailored to reducing the measure of crash
risk obtained in this study may be more readily evaluated for these
freeways. In a broad sense, these strategies would attempt to reduce
the temporal speed variance as well as the differential between speeds
measured upstream and downstream of crash-prone locations.

In spite of the promise of transferability, there remains significant
scope for improvement. The present study included data for only
a single month on five freeway sections. The results need to be validated
with a larger sample, which would also allow for combining
information from more stations into the same model. Because models
from three positional phases (Positions 1, 2, and 3) seem to be identifying
different crashes, there is a possibility for improving classification
performance by combining information from multiple stations.
Furthermore, if the information on crash type is available, then specific
models by types of crash can be developed. These models can then
be used to assess the risk of a particular type of crash on the freeway,
and more specific warnings may be issued to the drivers through
DMSs. In the Netherlands, the en route information delivery system
is quite advanced compared with that on the freeways in Florida.
Therefore, the implementation of proactive traffic management
strategies in the Netherlands may be easier to achieve.

ACKNOWLEDGMENTS

The authors thank Ryan Cunningham for his help in the data preparation
process. The authors also thank the anonymous reviewers
for their insightful comments, which resulted in a significantly
improved paper.

REFERENCES

 1. Golob, T., and W. Recker. A Method for Relating Type of Crash to Traffic
    Flow Characteristics on Urban Freeways. Transportation Research Part A,
    Vol. 38, No. 1, 2004, pp. 53–80.
 2. Pande, A., and M. Abdel-Aty. Comprehensive Analysis of Relationship
    Between Real-Time Traffic Surveillance Data and Rear-End Crashes on
    Freeways. In Transportation Research Record: Journal of the Transportation
    Research Board, No. 1953, Transportation Research Board of
    the National Academies, Washington, D.C., 2006, pp. 31–40.
 3. Pande, A., M. Abdel-Aty, and L. Hsia. Spatiotemporal Variation of Risk
    Preceding Crashes on Freeways. In Transportation Research Record:
    Journal of the Transportation Research Board, No. 1908, Transportation
    Research Board of the National Academies, Washington, D.C., 2005,
    pp. 26–36.
 4. Hunter, J. S. The Exponentially Weighted Moving Average. Journal of
    Quality Technology, Vol. 18, 1986, pp. 203–210.
 5. Abdel-Aty, M., N. Uddin, and A. Pande. Split Models for Predicting
    Multivehicle Crashes During High-Speed and Low-Speed Operating
    Conditions on Freeways. In Transportation Research Record: Journal of
    the Transportation Research Board, No. 1908, Transportation Research
    Board of the National Academies, Washington, D.C., 2005, pp. 51–58.
 6. Breiman, L. Random Forests. Machine Learning, Vol. 45, No. 1, 2001,
    pp. 5–32.
 7. Hand, D., H. Mannila, and P. Smyth. Principles of Data Mining. MIT
    Press, Cambridge, Mass., 2001.
 8. Pande, A., and M. Abdel-Aty. Assessment of Freeway Traffic Parameters
    Leading to Lane-Change Related Collisions. Accident Analysis
    and Prevention, Vol. 38, No. 5, 2006, pp. 936–948.
 9. Liaw, A., and M. Wiener. Classification and Regression by randomForest.
    R News: Newsletter of the R Project, Vol. 2, No. 3, 2002, pp. 18–22.
    http://cran.r-project.org/doc/RNews/RNews_2002-3.pdf.
10. Breiman, L., and A. Cutler. Random Forests. www.stat.berkeley.edu/
    ∼breiman/RandomForests/. Accessed March 18, 2007.
11. Christodoulou, C., and M. Georgiopoulos. Applications of Neural
    Networks in Electromagnetics. Artech House, Boston, Mass., 2001.
12. Tao, K. M. A Closer Look at the Radial Basis Function (RBF) Networks.
    Proc., 27th Asilomar Conference on Signals, Systems and Computers,
    Vol. 1 (A. Singh, ed.), IEEE Computer Society Press, Los Alamitos,
    Calif., 1993.
13. Cybenko, G. Approximation by Superpositions of a Sigmoidal Function.
    Mathematics of Control, Signals and Systems, Vol. 2, 1989, pp. 303–314.
14. Pande, A., and M. Abdel-Aty. Multiple-Model Framework for Assessment
    of Real-Time Crash Risk. In Transportation Research Record: Journal of
    the Transportation Research Board, No. 2019, Transportation Research
    Board of the National Academies, Washington, D.C., 2007, pp. 99–107.

The Safety Data, Analysis, and Evaluation Committee sponsored publication of
this paper.
