Harshit Gupta¹, Dr. Siddhartha Rokade²
¹PG Student, ²Assistant Professor, Department of Civil Engineering, Maulana Azad National Institute of Technology, Bhopal
Abstract
The purpose of this study is to develop a model for predicting crashes in medium-sized urban cities. A crash prediction model (CPM) is developed using multiple regression analysis. A model is a simplified representation of a real-world process. It should be representative in the sense that it contains the salient features of the phenomenon under study. In general, one objective of modeling is to explain a complex phenomenon with a simple model.

Keywords: Crash prediction model (CPM), Multiple regression analysis, Medium size city

1. INTRODUCTION
Road accidents are increasing every year. The total number of road accidents increased marginally from 4,86,476 in 2013 to 4,89,400 in 2014 (MoRT&H, 2015). This is probably an underestimate: the actual number of injuries is much higher than official statistics indicate, as not all injuries are reported to police stations. Road crashes are an outcome of various factors, among them the length of the road network, vehicle population, human population and enforcement of road safety regulations.

Bhopal city is selected for development of the CPM. About 3500 traffic crashes occur every year in Bhopal city. Crash prediction models (CPMs) are developed to estimate the expected number of crashes on a road in a given time frame. Crash frequency depends on segment length, traffic flow and several risk factors. The number of accidents is the dependent variable. The other components, such as traffic volume and carriageway width, are incorporated in the CPM as independent model variables, also called predictors. SPSS software is used for the regression analysis.

2. DESCRIPTION OF DATA
Developing a crash prediction model requires both independent and dependent variables. The selection of appropriate variables is an important step, as these variables significantly influence the accuracy and validity of the model. The collected data was divided into two parts, dependent data and independent data.
a) Dependent data - Road accident data
b) Independent data - Factors affecting crashes, such as segment length, carriageway width, median width etc. These variables were selected on the basis of earlier literature, and data for them were collected from the selected segments.

Traffic volume and road geometric design characteristics are related to accidents. Many research results show that accident rates change as geometric design characteristics and traffic volume change.

Traffic volume is also an important parameter. Traffic volume on each selected segment was counted manually at peak hour. The peak hour of each segment was decided on the basis of expert opinion.

2.1. Collection of Accident Data
Six years of accident data (2011-2015 and 2016) were collected from the police department of Bhopal city. The data were digitized in three categories: accident details, vehicle details, and victim details (including gender and age of the victim).

Different types of land use were selected for the study, such as shops, blocks of flats, industrial, residential, neighborhood, scattered housing, open spaces, market areas (CBDs) etc.
82 INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR)
2.2. Road characteristics data
The road characteristics data were collected in the year 2016. The road characteristics incorporated in the model analysis are segment length, road width, traffic volume, median width, number of access roads and pedestrian volume. Due to limitations, road design elements such as vertical alignment, horizontal alignment, lighting, road markings, and haptic feedback attributes could not be included in the model analysis. Environmental, vehicular and human factors are also not included in this analysis.

A total of 76 segments were selected for development of the accident prediction model. The selected sections had the following characteristics:
1. The segment length varied from 100 m to 1400 m with a mean of 413 m, and the total length of the segments is 32,210 m.
2. The peak hour traffic volume varied from 1810 to 4560 PCU with a mean of 3144.
3. The number of crashes over 6 years varied from 10 to 39 crashes per section with a mean of 20. The total number of crashes was 1558.
4. The number of access roads varied from 1 to 8 with a mean of 2.75.
5. The pedestrian volume varied from 13 to 345 with a mean of 145.

3. MODEL DEVELOPMENT
Multiple regression is an extension of simple linear regression. If there are two or more independent variables, the analysis is known as multiple linear regression. The main idea of the multiple linear regression method is to model the relationship between the dependent variable and the independent variables. The function takes the following form:

$Y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \dots + B_m X_m$

Where,
Y = Dependent variable
B0 = Regression constant
B1, B2, B3, ..., Bm = Regression coefficients of the respective m independent variables
X1, X2, X3, ..., Xm = Independent variables

Statistical tests of the model are necessary, including the coefficient of determination test (R² test), the significance test of the regression coefficients (t-test), and the significance test of the regression equation (F-test). If the significance test of the regression equation fails, it is possible that important factors were missed during the selection of independent variables, or that the relationship between the independent and dependent variables is nonlinear, in which case the model should be rebuilt (Feng et al., 2016). Multiple linear regression analysis makes several key assumptions about the model variables: a linear relationship, absence of multicollinearity, normality, independence and homoscedasticity. These assumptions should also be examined for the developed model.

The method assumes that there is a linear relationship between the dependent variable and each of the independent variables, and between the dependent variable and the independent variables collectively.

3.1. Correlation Analysis
Bivariate correlation analysis is done in SPSS. The Pearson correlation coefficient is selected for the analysis. Correlation is a unit-free measure of the relationship between two variables and takes values in [-1, +1]. When r is close to +1 (-1), there is a strong positive (negative) relationship, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. It measures only linear relationships.

Table 1 gives the correlation test value of each independent variable with the dependent variable. The sign (**) indicates that the correlation is significant at the 0.01 level, whereas the sign (*) indicates a moderate correlation that is significant at the 0.05 level.

Table 1 shows that all parameters are positively correlated with traffic accidents. A positive correlation value means that the variables are related in a positive linear sense, and a negative value means that they are related in a negative linear sense.

Table 1: Correlation Value of Independent Variable (X) With Dependent Variable (Y)

| Sr. No. | Independent Variable | Symbol | Correlation Test Value |
|---|---|---|---|
| 1. | Segment length | SL | 0.295** |
| 2. | Carriageway Width | CW | 0.425** |
| 3. | Pedestrian Volume | PV | 0.249* |
| 4. | Traffic Volume | TV | 0.440** |
| 5. | No. of Access Road | AP | 0.232* |
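The significance markers attached to the correlation values in Table 1 can be illustrated with the standard t-test for a Pearson correlation coefficient, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. The sketch below is only an illustration, not the SPSS procedure itself: it assumes n = 76 (the number of segments), and the two critical values are approximate two-tailed t-table values for 74 degrees of freedom that are not taken from the paper.

```python
import math

# Approximate two-tailed critical t values for 74 degrees of freedom,
# read from a standard t table (assumption: these are not from the paper).
T_CRIT_05 = 1.993  # 0.05 significance level
T_CRIT_01 = 2.644  # 0.01 significance level


def correlation_t_stat(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0, given a sample Pearson r and n pairs."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def significance_flag(r: float, n: int = 76) -> str:
    """Return '**', '*' or '' using the critical values above (only valid for n = 76)."""
    t = abs(correlation_t_stat(r, n))
    if t > T_CRIT_01:
        return "**"
    if t > T_CRIT_05:
        return "*"
    return ""


# Correlation values of each predictor with the crash count, as listed in Table 1.
table_1 = {"SL": 0.295, "CW": 0.425, "PV": 0.249, "TV": 0.440, "AP": 0.232}

for symbol, r in table_1.items():
    print(symbol, round(correlation_t_stat(r, 76), 3), significance_flag(r))
```

Under these assumptions the computed flags agree with the markers in Table 1: SL, CW and TV exceed the 0.01 threshold, while PV and AP exceed only the 0.05 threshold.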
3.2. Multicollinearity analyses
In statistics, multicollinearity (or collinearity) is a phenomenon in which two or more predictor (independent) variables in a multiple linear regression model (MLRM) are highly correlated. This means that one can be linearly predicted from the others with a substantial degree of accuracy. Multicollinearity poses a problem only for multiple regression analyses. The correlation between two predictors lies between -1 and +1. Perfect collinearity exists when two predictors are perfectly correlated (a correlation coefficient of 1). If there is perfect collinearity between predictors, it becomes impossible to obtain unique regression coefficients because there are an infinite number of combinations of coefficients that would work equally well. Put simply, if we have two predictors that are perfectly correlated, then the values of b for each variable are interchangeable (Field, 2005).

One way of identifying multicollinearity is to scan a correlation matrix of all of the predictor variables and see if any correlate very highly (by very highly we mean correlations above .80 or .90). SPSS produces various collinearity diagnostics, one of which is the variance inflation factor (VIF). The VIF indicates whether a predictor has a strong linear relationship with the other predictor(s).

Table 2: Correlation Matrix of Selected Independent Variables

| Predictor | TV | SL | AP | PV | CW |
|---|---|---|---|---|---|
| TV | 1 | .052 | .111 | .089 | .483** |
| SL | .052 | 1 | -.481** | -.416** | -.178 |
| AP | .111 | -.481** | 1 | -.338** | -.043 |
| PV | .089 | -.416** | -.338** | 1 | .203 |
| CW | .483** | -.178 | -.043 | .203 | 1 |

In this research the correlation matrix is used to detect the problem of multicollinearity. Correlation values between independent variables larger than 0.5 in absolute value indicate multicollinearity; otherwise there is no multicollinearity. Table 2 shows the correlation values of the selected independent variables with each other. None exceeds this threshold, so multicollinearity appears to be absent.

The analysis above shows that there is no problematic correlation among the selected variables. If collinearity were present, the variables exhibiting multicollinearity would be removed. Here, all variables are retained for model formulation.

3.3. Model Formulation by Multiple linear regression analysis using SPSS
The parameters used in this study for development of the model were identified from literature reviews. The parameters included as inputs for model development are:
- Segment Length, i.e. length of the road segment where an accident occurred
- Traffic Volume, i.e. peak hour traffic volume (PCU/hour) on the carriageway
- Carriageway Width, i.e. width of road on which vehicles are not restricted by any physical barriers from moving laterally
- Access Road, i.e. number of access roads on the road segment
- Pedestrian Volume, i.e. total number of pedestrians on the road
- Vehicle Accident, i.e. vehicle accidents on the selected segments

I. Model Summary
Table 3 shows the model summary.
The "R" column represents the value of R, the multiple correlation coefficient.
The "R Square" column represents the R² value (also called the coefficient of determination). It is the proportion of variance in the dependent variable that can be explained by the independent variables.

Table 3: Model Summary of MLR

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .686a | .471 | .433 | 4.789 |

II. Selection of Best Model
The most sensible approach to selecting a subset of important variables in a complex linear model is to compare all possible subsets. This procedure simply fits all the possible regression models (i.e., all possible combinations of predictors) and chooses the best one (or more than one) based on the criteria described below.
1. High R, R² and adjusted R² values
2. Low significance F value and minimum standard error
3. Low p-value for the coefficients of the independent variables and the y-intercept

III. Formulation of Model Equation
Segment length, traffic volume and pedestrian volume take large values in comparison with the other parameters, so the regression coefficients of SL, TV and PV appear only in the third decimal place. To obtain coefficients of a convenient magnitude, SL, TV and PV were divided by 10, 100 and 10, respectively. The Coefficients part of the SPSS output gives the values needed to write the regression equation. The accident prediction model for the selected road segments obtained from the multiple regression analysis is presented in equation 5.1:

$TC = -2.179 + 1.164\,(CW) + 0.0114\,(SL) + 0.525\,(AP) + 0.0312\,(PV) + 0.00122\,(TV)$   (5.1)

Where,
-2.179 = regression constant or intercept
1.164, 0.0114, 0.525, 0.0312 and 0.00122 = regression coefficients of their respective variables
CW = Width of road on which vehicles are not restricted by any physical barriers from moving laterally
SL = Segment Length
AP = No. of access roads
PV = Pedestrian Volume
TV = Peak hour Traffic Volume in PCU/hour

3.4. Validation and Assumptions of MLR Model
The model is checked with ANOVA, the F-test, the t-test and a normality check for multiple linear regression.

1. ANOVA
Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. It consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. The ANOVA calculations for multiple regression are nearly identical to the calculations for simple linear regression, except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model.

Table 4 shows the output of the ANOVA analysis. The result of the ANOVA test is F(5, 70) = 12.451 with a significance level of 0.000 (p-value = .000). The p-value is less than 0.05, which means that the regression equation is significant at the 95% confidence level.

Table 4: ANOVA, Multiple Linear Regression Analysis in SPSS

| Model | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 1427.68 | 5 | 285.538 | 12.451 | .000b |
| Residual | 1605.31 | 70 | 22.933 | | |
| Total | 3033.00 | 75 | | | |

2. Examining Normality
The normal probability plot shows the theoretical percentiles of a normal distribution against the actual percentiles of the standardized residuals, using the same mean and variance. If the collected data are normally distributed, then the observed values (the dots on the chart) should fall closely along the straight line, meaning that the observed values are the same as those expected from a normally distributed data set. Broadly, the more closely the points are clustered about the 45-degree line, the stronger the indication supporting the normality assumption. Any substantial curvature in the plot is evidence that the residuals have not come from a normal distribution.

Figure 5.6: Normal P-P Plot of Regression Residual

In multiple regression analysis, the relationship between variables can be estimated accurately only if the relationships between the variables are linear. If the relationship between the dependent and independent variables is not linear, the results of the regression analysis will underestimate the true relationship. A preferred method of checking linearity is to examine residual plots of the dependent variable against the independent variables. The first step in assessing the assumption is to draw initial descriptive plots of the dependent variable versus each of the explanatory variables, separately.
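The fit statistics reported in the model summary and ANOVA tables, and the model equation of Section 3.3, can be cross-checked with a few lines of arithmetic. The sketch below is illustrative rather than a reproduction of the SPSS run. It assumes n = 76 segments and p = 5 predictors, assumes that the published coefficients apply to inputs in their original units, and uses a hypothetical carriageway width of 7.0 together with the mean SL, AP, PV and TV values from Section 2.2. The function names are chosen here for illustration only.

```python
import math

# Coefficients and intercept as reported in the model equation (Section 3.3).
INTERCEPT = -2.179
COEFFS = {"CW": 1.164, "SL": 0.0114, "AP": 0.525, "PV": 0.0312, "TV": 0.00122}


def predict_crashes(cw: float, sl: float, ap: float, pv: float, tv: float) -> float:
    """Evaluate the fitted equation; inputs are assumed to be in original units."""
    values = {"CW": cw, "SL": sl, "AP": ap, "PV": pv, "TV": tv}
    return INTERCEPT + sum(COEFFS[name] * values[name] for name in COEFFS)


def fit_statistics(ss_reg: float, ss_res: float, n: int, p: int) -> dict:
    """Recompute F, R^2, adjusted R^2 and the standard error from ANOVA sums of squares."""
    ms_reg = ss_reg / p              # regression mean square
    ms_res = ss_res / (n - p - 1)    # residual mean square
    r2 = ss_reg / (ss_reg + ss_res)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return {
        "F": ms_reg / ms_res,
        "R2": r2,
        "adj_R2": adj_r2,
        "std_error": math.sqrt(ms_res),
    }


# Sums of squares taken from the ANOVA table.
stats = fit_statistics(ss_reg=1427.68, ss_res=1605.31, n=76, p=5)
print({name: round(value, 3) for name, value in stats.items()})

# Hypothetical segment: CW = 7.0 is an assumed value, the rest are the reported means.
example = predict_crashes(cw=7.0, sl=413, ap=2.75, pv=145, tv=3144)
print(round(example, 2))
```

With these inputs the recomputed F, R², adjusted R² and standard error round to the reported values (12.451, .471, .433 and 4.789), and the example segment yields a prediction close to the observed mean of about 20 crashes per section.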
4. CONCLUSION
This research developed a crash prediction model by multiple linear regression analysis. The predictor variables selected for the study are carriageway width, segment length, traffic volume, pedestrian volume and number of access roads. The risk increases as traffic volume, pedestrian volume, carriageway width, segment length and number of access roads increase. Previous studies show that the presence of a median decreases the risk of accidents.

A high R² value is needed for model selection. A low R² value shows either that some predictor variables are missing, or that the relation between the dependent and independent variables is not linear or does not follow a normal distribution.

Accident prediction models can be used to reduce risk on the road and improve safety performance. In terms of future work, it is suggested to include other predictor variables such as horizontal and vertical alignment, median width, sidewalk width, environmental factors etc. CPMs can also be developed with other regression techniques such as Poisson regression, negative binomial regression, zero-inflated regression models etc.

REFERENCES
[1] Feng S., Li Z., Ci Y. and Zhang G. (2016), Risk factors affecting fatal bus accident severity: Their impact on different types of bus drivers, Accident Analysis & Prevention, 86, pp. 29-39.
[2] MoRT&H (2015), Road Accident Statistics, Ministry of Road Transport and Highways, Government of India, New Delhi.
[3] Field A. (2005), Discovering statistics using SPSS, SAGE Publications Ltd, ISBN 978-1-84787-906-6.
[4] IBM Software Group (2015), "IBM SPSS Advanced Statistics 22". Available from: http://library.uvm.edu/services/statistics/SPSS22Manuals/IBM%20SPSS%20Advanced%20Statistics.pdf
[5] Ranjitkar P. and Chengye P. (2013), Modelling motorway accidents using negative binomial regression, Eastern Asia Society for Transportation Studies, Vol. 9.
[6] Sawalha Z. and Sayed T. (2003), Statistical issues in traffic accident modeling, Canadian Journal of Civil Engineering, 33(9), pp. 1115-1124.
[7] Prajapati P. and Tiwari G. (2013), Evaluating Safety of Urban Arterial Roads of Medium Sized Indian City, Proceedings of the Eastern Asia Society for Transportation Studies, Vol. 9.
[8] WHO (2015), Global status report on road safety 2015.