You are on page 1of 5

DEVELOPMENT OF CRASH PREDICTION MODEL USING

MULTIPLE REGRESSION ANALYSIS


Harshit Gupta1, Dr. Siddhartha Rokade2
1
PG Student, 2Assistant Professor,
Department of Civil Engineering, Maulana Azad National Institute of Technology, Bhopal

Abstract predictors. SPSS software is used for regression


The purpose of the study is to develop a model analysis.
for prediction of crashes in urban medium 2. DESCRIPTION OF DATA
size cities. In this paper Crash prediction To develop a crash prediction model
model (CPM) is developed using multiple independent and dependent variables are
regression analysis. A model is a simplified needed. The selection of appropriate variables
representation of a real world process. It for model development is an important step as
should be representative in the sense that it these variables significantly influence the
should contain the salient features of the accuracy and validity of model. The collected
phenomena under study. In general, one of data was divided into two parts that is
the objectives in modeling is to have a simple dependent data and independent data.
model to explain a complex phenomenon. a) Dependent data - Road accident data
Keywords: Crash prediction model (CPM), b) Independent data - Factors affecting the
Multiple regression Analysis, Medium size crashes like segment length, carriageway
city width, median width etc. These variables
1. INTRODUCTION are selected on the basis former literatures
Road accidents are increasing every year. The and data of these variables are collected
total number of road accidents increased from the selected segments.
marginally from 4,86,476 in 2013 to 4,89,400 in Traffic volume and road geometric design
2014 (MoRT&H,2015). This is probably an characteristics are related to accidents. In many
underestimate. The actual numbers of injuries research results show that changing in
are much higher than official statics, as not all geometric design characteristics and traffic
injuries are reported to the police stations. Road volume accident rate change.
crashes are an outcome of various factors, some Traffic volume is also an important parameter.
of which are the length of road network, vehicle Traffic volume on selected segment is counted
population, human population and enforcement manual at peak hour. Peak hour of each
of road safety regulations etc. segment is decided on the basis of expert
Bhopal city is selected for development of CPM. opinion.
About 3500 traffic crashes occurred every year 2.1. Collection of Accident Data
in Bhopal city. Crash Prediction Models (CPMs) A 6 year (2011-2015 & 2016) Accident data
is developed to know the expected number of was collected in form the police department of
crashes on a road in a certain time frame. Crash the Bhopal city. The data is digitized in three
frequency depends on Segment length, the traffic categories; accident details, vehicle details, and
flow and several risk factors. No. of accidents is victim details (include Gender and age of
called the dependent (variable). The other victim).
components like traffic volume, carriage way Different type of land uses are selected for the
width are incorporated in the CPM as study like Shops, Block of flats, Industrial,
independent model variables, also called Residential, neighborhood, scattered housing,
open spaces, Market areas (CBDs) etc.

ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-4, ISSUE-6, 2017


82
INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR)

2.2. Road characteristics data between independent and dependent variables


The road characteristics data are collected in the was nonlinear, in which situation the model
year 2016 The road characteristics data segment should be rebuilt (Feng et al., 2016). The
length, road width, traffic volume, median multiple linear regression analysis makes several
width, no. access road and pedestrian volume to key assumptions such as linear relationship,
be incorporated in the model analysis. Due to multicollinearity, normality, independence and
limitations, the road design elements like homoscedasticity between the model variables.
vertical alignment, horizontal alignment, Finally these assumptions should also be
lighting, road markings, and haptic feedbacks examined for developed model.
attributes could not be included in the model It assumes that, there is a linear relationship
analyses. Environmental factors, vehicular between the dependent variable and each of
factor, human factor are also not included in independent variables and the dependent
this analysis. variable and independent variable collectively.
A total of 76 segments were selected for 3.1. Correlation Analysis
development of accident prediction model. The Bivariate correlation analysis is done by SPSS.
sections that were considered included the Pearson correlation coefficient is selected for
following analysis. Correlation is a unit free measure of
1. The length of the access roads varied from relationship between two variables and take
100 m to 1400 m with a mean of 413 m and value in [-1, +1], where r is close to +1(-1), there
total length of segments is 32,210 meters is strong and positive (negative) relationship and
2. The Peak hour traffic volume varied from a correlation coefficient of 0 indicates that there
1810 to 4560 PCU with mean of 3144. is no linear relationship between the two
3. The number of crashes over 6 years varied variables. It measures only linear relationship.
from 10 to 39 crashes in a section with mean In table 1 seems that the correlation test value of
of 20. Total number of crashes was 1558. independent variable with dependent variable.
4. The no. of access road varied from 1 to 8 with Sign (**) indicates a Correlation is significant at
the mean of 2.75 the 0.01 level where the sign (*) Indicate a
5. The pedestrian volume varied from 13 to 345 moderate Correlation is significant at the 0.05
with the mean of 145 level.
3. MODEL DEVELOPMENT Table 1 shows parameters having positive signs
Multiple regression is an extension of simple with traffic accident. It is presented that variable
linear regression. If the independent variables has positive correlation which means that
are two or more in numbers, then the analysis is variables are perfectly related in a positive linear
known as multiple linear regression.The main sense if the correlation value is positive and
idea of multiple linear regression method is to negative linear sense when correlation value is
build correlation analysis between dependent negative.
and independent variables. The function will be Table 1: Correlation Value of Independent
the following form: Variable (X) With Dependent Variable(Y)
Correlati-
Sr. Independent
Y=B0+B1X1+B2X2+B3X1+…+BmXm Symbol on Test
No. Variable
Where, value
Y= Dependent variable Segment
1. SL 0.295**
B0= regression constant, length
B1, B2, B3...Bm = Regression coefficients of the Carriageway
2. CW 0.425**
respective m independent variables. Width
X1, X2, X3, ………..Xm = Independent variables. Pedestrian
A statistic test of the model was necessary, 3. PV 0.249*
Volume
including determination coefficient test (R2 test), Traffic
significance test of regression coefficient (t-test), 4. TV 0.440**
Volume
and significance test of regression equation No. of
(F test). If the significant test of regression 5. AP 0.232*
Access Road
equation failed, it was possible that important
factors were missed during the selection of
independent variables, or the relationship

ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-4, ISSUE-6, 2017


83
INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR)

3.2. Multicollinearity analyses each other. It seems that the multicollinearity is


In statistics, multicollinearity or collinearity is a absent.
phenomenon in which two or more predictor In the above section it is presented that there is
variables (Independent variables) in a MLRM no coloration between selected variable. If
are highly correlated. This means, one can be collinearity present than we remove variables
linearly predicted from the others with a with multicollinearity. All variables are selected
substantial degree of accuracy. Multicollinearity for model formulation.
poses a problem only for multiple regression 3.3. Model Formulation by Multiple linear
analyses. Collinearity is between -1 to +1. regression analysis using SPSS
Perfect collinearity means, two predictors that Parameters which are used in this study for
are perfectly correlated have a correlation development of model are identified using
coefficient is 1). If there is perfect collinearity literature reviews. The parameter include as an
between predictors it becomes impossible to input for model development are:
obtain unique regression coefficients because  Segment Length i.e. Length of the road
there are an infinite number of combinations of segment were an accident occurred
coefficients that would work equally well. Put  Traffic Volume i.e. Peak hour traffic volume
simply, if we have two predictors that are (PCU/hour) on carriageway
perfectly correlated, then the values of b for each  Carriageway width i.e. Width of road on
variable are interchangeable. (Field 2005) which a vehicles are not restricted by any
One way of identifying multicollinearity is to physical barriers to move laterally.
scan a correlation matrix of all of the predictor  Access Road i.e. No. of access road at road
variables and see if any correlate very highly (by segment.
very highly mean correlations of above .80 or  Pedestrian volume i.e. Total number of
.90). SPSS produces various collinearity pedestrian on the road
diagnostics, one of which is the variance  Vehicle Accident i.e. Accident of vehicle on
inflation factor (VIF). The VIF indicates whether the selected segments.
a predictor has a strong linear relationship with
the other predictor(s). I. Model Summary: Table 3 shows the model
Table 2: Correlation Matrix of Selected summary.
Independent Variable The "R" column represents the value of R, the
Predicto multiple correlation coefficients.
TV AL AP PV CW
r The "R Square" column represents the R2 value
TV 1 .052 .111 .089 .483* (also called the coefficient of determination).
* Table 3: Model Summary of MLR
SL - Std.
.481*
.052 1 * .416* -.178 R Adjusted Error of
* Model R
Square R Square the
AP - Estimate
.481*
.111 * 1 .338* -.043 1 .686a .471 .433 4.789
*
It is the proportion of variance in the dependent
PV - - variable that can be explained by the independent
.089 .416* .338* 1 .203 variable.
* *
II Selection of Best Model
CW .483* The most sensitive approach to selecting a subset
* -.178 -.043 .203 1
of important variables in a complex linear model
is to compare all possible subsets. This procedure
In this research correlation matrix is used to simply fits all the possible regression models
detect the problem of multicollinearity. The (i.e., all possible combinations of predictors) and
values of correlation between the independent chooses the best one (or more than one) based on
variables are larger than |0.5| indicate the the criteria described below.
multicollinearity otherwise there is no 1. High R, R2 and Adjusted R2 value
multicollinearity. Table 2 shows the correlation 2. Low significance F value and Minimum
values of selected independent variables with standard error

ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-4, ISSUE-6, 2017


84
INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR)

3. Low p value for the coefficients of Table 4: ANOVA, Multiple Linear


independent variables and y intercept Regression Analysis in SPSS
III. Formulation of Model Equation
Segment length, traffic volume and pedestrian Sum of Mean
volume have high value in comparison of other Model Squares df Square F Sig.
parameters. Regression coefficients of respective Regression 1427.68 5 285.538 12.451 .000b
variables SL, TV, PV are in three decimals. To
obtain a proper regression coefficient SL, TV, Residual 1605.31 70 22.933
PV are divided by respectively 10, 100 and 10. Total 3033.00 75
The Coefficients part of the output gives us the
values that we need in order to write the 2. Examining Normality
regression equation. The accident Prediction The normal probability plot shows the theoretical
model for selected road segments using the percentiles of a normal distribution versus the
multiple regression analysis is presented in actual percentiles of the standard residuals with
equation 5.1 the same variance and mean. If the collected data
TC = -2.179 + 1.164(CW) + 0.0114(SL) + are normally distributed, then the observed
0.525(AP) + 0.0312(PV) + 0.00122(TV) values (the dots on the chart) should fall closely
Where, along the straight line (meaning that the observed
-2.179 = regression constant or intercept values are the same as you would expect to get
1.164, 0.0114, 0.525, 0.0312 and 0.00122 = from a normally distributed data set). In broad,
regression coefficients of their respective the more closely the points are clustered about
variables. the 45-degree line, the stronger the indication
CW = Width of road on which a vehicles are not supporting the normality assumption. Any
restricted by any physical barriers to move substantial curvature in the plot is evidence that
laterally the residuals have not come from a normal
SL = Segment Length distribution
AP = No. of access road
PV = Pedestrian Volume
TV = Peak hour Traffic Volume in PCU/hour
3.4. Validation and Assumptions of MLR
Model
ANOVA, F-test, T-test and normality check in
multiple linear regression.
1. ANOVA:
Analysis of Variance (ANOVA) is a statistical
method used to test differences between two or
more means. Analysis of Variance (ANOVA)
consists of calculations that provide information
about levels of variability within a regression
Figure 5.6: Normal P-P Plot of Regression
model and form a basis for tests of significance.
Residual
The ANOVA calculations for multiple
In multiple regression analysis the accurate
regression are nearly identical to the calculations
relationship between variable can be estimate if
for simple linear regression, except that the
the relationships between variable are linear. If
degrees of freedom are adjusted to reflect the
the relationship between dependent and
number of explanatory variables included in the
independent variable is not a linear relationship,
model.
the results of the regression analysis will
Table 4 shows the output of the ANOVA
underestimate the true relationship. A preferable
analysis. Result of ANOVA test is f (5, 70) =
method of detection linearity is examination of
12.451 which is at the significance level 0.000
residual plots between dependent and
(P-value = .000), P value is less than 0.005. It
independent. The first step of assessment of
means that the regression equation is significant
assumption begins with initial descriptive plots
at 95 % confidence limit in levels of variability
of dependent Variable versus each of the
within a regression model.
explanatory variables, separately.

ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-4, ISSUE-6, 2017


85
INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR)

4. CONCLUSION Their impact on different types of bus drivers,


This research developed a crash prediction Accident Analysis &
model by multiple linear regression analysis. Prevention, 86, pp.29-39.
Predictor variable carriage way widths, segment [2] MORT&H, (2015), Road Accident Statistics,
length, traffic volume, pedestrian volume, no. of Ministry of Road Transport and Highways,
access road are selected for the study. The risk is Government of India, New Delhi
increasing as the traffic volume, pedestrian [3] Field A. (2005), Discovering statistics using
volume, carriage-way width, segment length and SPSS, SAGE Publications Ltd ISBN 978-1-
no. of access road increases. Previous study 84787-906-6
shows that presence of median decrease the risk [4] IBM Software Group (2015), “IBM SPSS
of accident. Advanced Statistics 22” Available from:
High R2 value is needed for the model selection. http://library.uvm.edu/services/statistics/SPSS2
Low R2 value shows either some predictor 2Manuals/IBM%20SPSS%20Advanced
variables are missing or the relation between %20Statistics.pdf
dependent and independent variable is not linear [5] Ranjitkar P and Chengye P. (2013),
or not follow normal distribution. Modelling motorway accidents using negative
The accident prediction models can be used to binomial regression, Eastern Asia Society for
decrease the risk on the road and safety Transportation Studies Vol. 9.
performance. In terms of future work, it is [6] Sawalha Z. and Sayed T. (2003),
suggested to taking other predictor variables Statistical issues in traffic accident modeling,
such as horizontal and vertical alignment, Canadian Journal of Civil Engineering, 33(9),
median width, sidewalk width, environmental P1115-1124
factors etc. CPM can be developed by regression [7] Prajapati, P., Tiwari, G., Evaluating Safety of
analysis such as poisson regression, negative Urban Arterial Roads of Medium Sized Indian
binomial regression, zero inflated regression City. Proceedings of the Eastern Asia Society for
model etc. Transportation Studies, Vol.9, 2013
REFERENCES [8] WHO (2015), Global status report on road
[1] Feng S., Li Z., Ci Y. and Zhang G. (2016), safety 2015
Risk factors affecting fatal bus accident severity:

ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-4, ISSUE-6, 2017


86

You might also like