

Analysis of highway crash data by Negative Binomial and Poisson regression


models

Conference Paper · June 2011


DOI: 10.13140/2.1.4113.0567



Analysis of highway crash data by Negative Binomial
and Poisson regression models
Darçın Akın
Gebze Institute of Technology, Department of City and Regional Planning
Istanbul Cad. No:101, PO Box 141, 41410 Gebze, Kocaeli, Turkey
dakin@gyte.edu.tr or darcinakin@gmail.com

Abstract. This study evaluates the influence of roadway, weather and accident
conditions, and type of traffic control on accident severity (number of persons
killed) using Negative Binomial and Poisson regression models. Information on
accident severity and roadway and weather conditions was obtained from the
Michigan Department of Transportation Accident Database. Negative Binomial (NB)
and Poisson regression models were used to measure the association between
accident severity and roadway, weather and accident conditions. The NB regression
results showed that monthly, daily, hourly and weekday variations do not have a
statistically significant effect on accident severity (number of persons killed).
However, the Poisson regression results were the reverse with respect to these
variables. Type of traffic control was also found to be not statistically
significant. Number of vehicles involved, crash type (overturn, rear-end,
side-swipe, head-on, hit object, and so on), injury types (A, B, C), number of
uninjured, number of occupants and weather conditions are statistically
significant at the 0.05 level. Light and surface conditions were also statistically
significant at the 0.10 level. The findings of the Poisson regression are very
similar to those of the NB regression, but the parameter estimates differ slightly.
The results agree with professional judgment regarding the factors affecting
accident severity in highway crashes.
Keywords: crash data, Negative Binomial regression, Poisson regression,
crash properties, road and weather factors

1 Introduction

Accident prediction models are important tools for estimating road safety with regard
to roadway, weather and accident conditions. Various empirical equations have been
developed for accident prediction. More recently, newer regression techniques have
found application in this area. The model development, and subsequently the model
results, are strongly affected by the choice of regression technique. This study
evaluates the influence of roadway, weather and accident conditions, and type of
traffic control on accident severity (number of persons killed) using regression
models. Information on accident severity and roadway and weather conditions was
obtained from the Michigan Department of Transportation Accident Database. Negative
Binomial (NB) and Poisson regression models were used to measure the association
between accident severity and roadway, weather and accident conditions.
1.1 Literature Review

Statistical models are used to examine the relationships between accidents and the
features of accidents and accident sites. Many past studies have documented the
numerous problems with linear regression models (Joshua and Garber, 1990; Miaou and
Lum, 1993), which has led to the adoption of more appropriate regression models:
Poisson regression, used to model data that are Poisson distributed, and the negative
binomial (NB) model, used to model data whose Poisson means are gamma distributed
across crash sites, allowing for additional dispersion (variance) in the crash data.
Although the Poisson and NB regression models possess desirable distributional
properties for describing motor vehicle accidents, they are not without limitations.
One problem that often arises with crash data is that of 'excess' zeroes, which often
leads to dispersion above that described by even the negative binomial model. 'Excess'
does not mean 'too many' in the absolute sense; it is a relative comparison that
merely suggests that the Poisson and/or negative binomial distributions predict fewer
zeroes than are present in the data. As discussed in Lord et al. (2004), a
preponderance of zero crashes results from low exposure (i.e. train frequency and/or
traffic volumes), high heterogeneity in crashes, observation periods that are
relatively short, and/or under-reporting of crashes, and not necessarily from the
'dual state' process that underlies the 'zero-inflated' model. Thus, the motivation to
fit zero-inflated probability models accounting for excess zeroes often arises from
the need to find better-fitting models, which is justified from a statistical
standpoint; unfortunately, the zero-inflated model also comes with "excess theoretical
baggage" that lacks theoretical appeal (see Lord et al., 2004). Another problem, not
often observed with crash data, is underdispersion, where the variance of the data is
less than the variance expected under an assumed probability model (e.g. the Poisson).
One manifestation might be "too few zeroes", but this is not a formal description.
Underdispersion has been less convenient to model directly than overdispersion, mainly
because it is less commonly observed. Winkelmann's gamma probability count model
offers an approach for modeling underdispersed (or overdispersed) count data
(Winkelmann and Zimmermann, 1995), and may therefore offer an alternative to the
zero-inflated family of models for modeling overdispersed data as well as provide a
tool for modeling underdispersion.
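
The Poisson-gamma formulation described above can be written out explicitly. The
sketch below uses the common textbook (NB2) parameterization, which is an assumption
here and is not stated in the paper; the symbols \mu_i (mean at site i), \theta_i
(gamma-distributed multiplier) and \alpha (dispersion parameter) are introduced only
for illustration.

```latex
y_i \mid \theta_i \sim \mathrm{Poisson}(\mu_i \theta_i), \qquad
\theta_i \sim \mathrm{Gamma}(\text{shape} = 1/\alpha,\ \text{scale} = \alpha), \qquad
\mathrm{E}[\theta_i] = 1, \quad \mathrm{Var}[\theta_i] = \alpha

\mathrm{E}[y_i] = \mu_i \, \mathrm{E}[\theta_i] = \mu_i

\mathrm{Var}[y_i]
  = \mathrm{E}\big[\mathrm{Var}(y_i \mid \theta_i)\big]
  + \mathrm{Var}\big(\mathrm{E}[y_i \mid \theta_i]\big)
  = \mu_i + \alpha \mu_i^{2}
```

Under this parameterization the variance exceeds the mean whenever \alpha > 0, and
the Poisson case (variance equal to the mean) is recovered as \alpha approaches 0.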

2 Data and Methodology

In this section, the data used in this paper and the methodology are described. Crash
data were obtained from the Michigan Department of Transportation
(www.michigan.gov/MDOT) for the years 2000 and 2004. The data files include all
crashes reported by the police in all counties, towns and townships. The data are
analyzed to determine the influence of roadway, weather and accident conditions, and
type of traffic control on accident severity (number of persons killed) using Negative
Binomial and Poisson regression models. The following section presents descriptive
statistics of the crash data.
2.1 Descriptive Statistics of Crash Data

Some important descriptive statistics of the crash data used in the analyses are given
in the following tables. Table 1 shows accident occurrences in MDOT regions. Metro, the
most urbanized region, has the highest share of crashes in the State of Michigan
(39.8% in 2000 and 37.9% in 2004). The two upper regions (Superior and North) are the
least urbanized and have the lowest accident occurrences (4.3% and 7.3% in 2000, and
4% and 7% in 2004). The other four regions (Grand, Bay, Southwest and University) have
comparable shares of crashes, and the Metro region's share is more than double that of
any of these four. The Metro region includes the city of Detroit (the motor capital of
the world) and the most urbanized counties, such as Macomb, Wayne and Oakland (see
Figure 1).

Table 1. Accident occurrences in MDOT Regions


                          2004                        2000
MDOT Regions    Frequency  Valid Percent    Frequency  Valid Percent
Superior            15066            4.0        18169            4.3
North               26370            7.0        30937            7.3
Grand               50442           13.5        55396           13.0
Bay                 46370           12.4        51068           12.0
Southwest           37888           10.1        40927            9.6
University          56740           15.1        59905           14.1
Metro              142108           37.9       169576           39.8
Total              374984          100.0       425978          100.0

Fig. 1. MDOT Regions


Table 2 presents the accident occurrences by area type: interchange, intersection,
midblock and non-traffic area. As expected, intersections and midblock areas are more
accident prone than interchange areas, since interchanges have fewer conflict points
than the other two.

Table 2. Accident occurrences by area type


                               2004                        2000
Area Type            Frequency  Valid Percent    Frequency  Valid Percent
Interchange Area         39427           10.5        35529            8.3
Intersection Area       175398           46.8       212698           49.8
Midblock Area           158737           42.3       176706           41.4
Non-Traffic Area          1422            0.4         1909            0.4
Total                   374984          100.0       426842          100.0

Table 3 presents the accident occurrences by more detailed area type. The most
interesting number relates to "straight road" sections, which account for over 40% of
all crashes. It is also interesting to note that railroad crossing crashes are only
0.2 to 0.3% of all crashes.

Table 3. Accident occurrences by more detailed area type


                                                      2004                        2000
Area Type                                   Frequency  Valid Percent    Frequency  Valid Percent
Entrance or Exit Ramp                           12255            3.3        13221            3.1
Other Freeway Area                              34795            9.5        35960            8.6
Within Intersection                             56553           15.4        73341           17.5
Driveway Within 150 ft of Intersection          19184            5.2        24551            5.8
Other Intersection Related                      37623           10.3        42968           10.2
Straight Road                                  161754           44.2       175574           41.8
Curved Road                                     12909            3.5        14269            3.4
Driveway Related                                14311            3.9        17982            4.3
Parking Related                                  8016            2.2        10589            2.5
Railroad Crossing                                 952            0.3          697            0.2
Other Non-Intersection Area                      7606            2.1        10575            2.5
Unknown                                           160            0.0          317            0.1
Total                                          366118          100.0       420044          100.0
Missing                                          8866                        6798
Total                                          374984                      426842
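
The "Valid Percent" columns in these tables are computed over cases with a recorded
value, i.e. excluding the "Missing" row. As a small arithmetic check, the sketch below
recomputes the 2004 "Straight Road" share from the frequencies in Table 3; the
variable names are illustrative only.

```python
# 2004 "Straight Road" row of Table 3, plus the 2004 totals
frequency, total, missing = 161754, 374984, 8866

valid_total = total - missing                  # cases with a recorded area type
valid_percent = 100 * frequency / valid_total  # share among valid cases only

print(valid_total)               # 366118
print(round(valid_percent, 1))   # 44.2
```
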
Table 4 presents the relationship of accident occurrences to road sections. As
expected, about 85% of all crashes occur on the road. However, a considerable number
of crashes (about 8%) also occur outside of the shoulder/curb. These crashes
presumably occurred in urban as well as rural areas, at nighttime and involving
intoxicated drivers.

Table 4. Relationship of accident occurrences to road sections


                                        2004                        2000
Relation to Roadway           Frequency  Valid Percent    Frequency  Valid Percent
On the Road                      305722           84.5       364102           85.6
In the Median                      2924            0.8         3212            0.8
On the Shoulder                   15786            4.4        17503            4.1
Outside of Shoulder / Curb        30507            8.4        32096            7.6
In the Gore                         800            0.2          884            0.2
Unknown                            6022            1.7         7310            1.7
Total                            361761          100.0       425107          100.0
Missing                           13223                        1735
Total                            374984                      426842

Table 5 presents the relationship of accident occurrences to road types. The majority
of crashes occurred on county roads and city streets (about 60%). M-routes also
account for a considerable share of crashes (about 19%). Interstate and US routes
account for relatively fewer crashes (about 9% and 7%, respectively).

Table 5. Relationship of accident occurrences to road types

                                           2004                        2000
Route Class                      Frequency  Valid Percent    Frequency  Valid Percent
Not Located                           4777            1.3        39402            9.2
Interstate Route                     34820            9.3        31638            7.4
US Route                             27737            7.4        27725            6.5
M Route                              72152           19.2        78107           18.3
I-State Business Loop or Spur         5731            1.5         6842            1.6
US Business Route                     4098            1.1         4746            1.1
M Business Route                       105            0.0          120            0.0
Connector                              721            0.2          742            0.2
County Road or City Street          224843           60.0       237520           55.6
Total                               374984          100.0       426842          100.0

Table 6 shows the relationship of accident occurrences to speed limits. It should be
noted that this table does not indicate the crash risk associated with each speed
limit, since roads with speed limits higher than 55 mph show fewer accident
occurrences. City streets with speed limits equal to or less than 25 mph account for
about 20% of all accidents. M, US and Interstate routes with speed limits higher than
50 mph have the highest share (over 30%).

Table 6. Relationship of accident occurrences to speed limits


                                       2004                        2000
Speed Categories             Frequency  Valid Percent    Frequency  Valid Percent
Equal or less than 25 mph        74688           19.9        79867           20.7
26-30 mph                        19608            5.2        25070            6.5
31-35 mph                        47028           12.5        58910           15.2
36-40 mph                        26004            6.9        30115            7.8
41-45 mph                        53995           14.4        58121           15.0
46-50 mph                        12909            3.4        13006            3.4
51-55 mph                       107214           28.6       113883           29.4
56-60 mph                          971            0.3          306            0.1
61-65 mph                         6807            1.8         7421            1.9
66-70 mph                        25612            6.8           17            0.0
Greater than 70 mph                148            0.0          ---            ---
Missing                              0                       40126
Total                           374984          100.0       426842          100.0

2.2 Modeling of Crash Data

Poisson and Negative Binomial (NB) regression models are used to model the influence
of roadway, weather and accident conditions, and type of traffic control on accident
severity (number of persons killed).
The negative binomial (NB) regression model is a member of the exponential family of
discrete probability distributions. The nature of the distribution is itself well
understood, but its contribution to regression modeling, in particular as a
generalized linear model (GLM), was not appreciated until recently. The mathematical
properties of the negative binomial have been derived and GLM algorithms developed for
both the canonical and log forms. The log form may be effectively used to model types
of Poisson-overdispersed count data (Hilbe, 1993). It is not recommended that negative
binomial models be applied to small samples, although what constitutes a small sample
does not seem to be clearly defined in the literature (UCLA, 2011). Poisson
regression, also a member of the class of generalized linear models (GLM), is the
standard method used to analyze count data. However, many real data situations violate
the assumptions upon which the Poisson model is based. For instance, the Poisson model
assumes that the mean and variance of the response are identical. This means that
events occur within a period of observation at a constant rate; an event is equally
likely at any point within the period. When there is heterogeneity in the data, it is
likely that the Poisson model is overdispersed. Such overdispersion is indicated if the
variance of the response is greater than its mean. One may also check for model
overdispersion by fitting a Poisson model to the data and observing the chi-square-based
or deviance-based dispersion statistic. The model is Poisson-overdispersed if the
dispersion value is greater than unity. Log negative binomial regression can be used
rather effectively to model count data in which the response variance is greater than
the mean (Hilbe, 1993).
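
The dispersion check described above can be illustrated with a minimal sketch. It
computes the Pearson chi-square statistic divided by its degrees of freedom for the
simplest possible case, an intercept-only Poisson model whose fitted mean is the
sample mean. The function name and the toy counts are hypothetical and are not taken
from the MDOT data; a full analysis would use the fitted means of the actual
regression model.

```python
def pearson_dispersion(counts):
    """Pearson chi-square / df for an intercept-only Poisson model."""
    n = len(counts)
    mu = sum(counts) / n                            # fitted mean for every case
    chi2 = sum((y - mu) ** 2 / mu for y in counts)  # Pearson chi-square
    return chi2 / (n - 1)                           # df = n - 1 estimated parameter

# Hypothetical toy counts with the same mean (1.0) but different spread
low_spread = [0, 1, 2, 1, 0, 2, 1, 1]
high_spread = [0, 0, 0, 0, 0, 0, 0, 8]

print(pearson_dispersion(low_spread))    # below 1: no sign of overdispersion
print(pearson_dispersion(high_spread))   # above 1: Poisson-overdispersed
```

A value well above unity points toward the NB model; a value at or below unity gives
no evidence of overdispersion.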

Model Results. In order to apply the Poisson regression model, we first need to check
whether there is heterogeneity in the data. Table 7 presents the descriptive
statistics; it can be seen that the variance of the response variable (number of
persons killed in crash) is slightly higher than the mean. Thus, the results of the
Poisson regression must be approached cautiously.

Table 7. Descriptive statistics of the response variable


Response variable: Number of Persons Killed in Crash

Year         n  Minimum  Maximum   Mean  Std. Deviation  Variance
2004    374984        0        4  0.003           0.062     0.004
2000    426842        0        5  0.003           0.064     0.004
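
As a rough illustration of the comparison drawn from Table 7, the sketch below
computes the variance-to-mean ratio for 2004 and the share of zero counts a Poisson
distribution with that mean would imply. The inputs are the three-decimal rounded
values from the table, so the ratio is only indicative.

```python
import math

# Rounded values reported in Table 7 for 2004
mean, std_dev, variance = 0.003, 0.062, 0.004

ratio = variance / mean            # a ratio above 1 suggests overdispersion
variance_from_sd = std_dev ** 2    # should roughly agree with the reported variance
p_zero_poisson = math.exp(-mean)   # Poisson P(Y = 0) = exp(-mean)

print(f"variance/mean ratio: {ratio:.2f}")
print(f"std. deviation squared: {variance_from_sd:.4f}")
print(f"Poisson P(Y = 0): {p_zero_poisson:.4f}")
```

With a mean this small, nearly all crashes have zero fatalities under either model,
which is why the rounded mean and variance are so close.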

Table 8 presents the case processing summary. As can be seen, about 90% of the cases
were included in the regression modeling. Regarding the measure of goodness of fit,
the value of deviance/df is 0.012, which is lower than 1. This means that the model is
not Poisson-overdispersed.

Table 8. Case processing summary for regression models

                  2004                 2000
Cases            n  Percent           n  Percent
Included    349268    93.1%      382367    89.6%
Excluded     25716     6.9%       44475    10.4%
Total       374984   100.0%      426842   100.0%

Table 9 presents the parameter estimates using the NB regression model. The
significance of the parameters is similar for 2004 and 2000, except for the variables
day and surface. Day is significant at the 0.05 level with the year 2000 data, and
surface is significant at the 0.10 level with the year 2004 data.

Table 9. Parameter estimates using NB regression

Hypothesis test columns: Wald Chi2, Sig., and whether significant at the 0.10 level.

                                 2004                                            2000
Parameter          B Std. Error  Wald Chi2  Sig. Sig.@0.10         B Std. Error  Wald Chi2  Sig. Sig.@0.10
(Intercept)   -4.518      .2127    451.365  .000       yes    -5.552      .2317    574.247  .000       yes
Month           .000      .0098       .001  .976        no      .000      .0100       .000  .996        no
Day             .003      .0038       .460  .498        no      .012      .0040      8.848  .003       yes
Hour           -.008      .0051      2.614  .106        no      .004      .0051       .765  .382        no
Weekday        -.011      .0169       .426  .514        no      .003      .0173       .025  .875        no
NumVeh          .818      .0362    510.613  .000       yes      .734      .0417    309.797  .000       yes
CrshType       -.112      .0065    294.156  .000       yes     -.079      .0063    154.388  .000       yes
NumInj        -3.687      .0591    3890.27  .000       yes    -3.385      .0579    3422.32  .000       yes
NumUnInj      -4.312      .0494    7626.77  .000       yes    -4.304      .0531    6565.80  .000       yes
NumOcc         3.673      .0412    7935.97  .000       yes     3.425      .0412    6912.91  .000       yes
Weather        -.092      .0273     11.320  .001       yes     -.067      .0313      4.574  .032       yes
Light           .041      .0197      4.347  .037       yes      .066      .0204     10.458  .001       yes
Surface         .057      .0310      3.359  .067       yes      .022      .0341       .433  .510        no
TControl       -.018      .0317       .334  .563        no      .054      .0341      2.558  .110        no
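
The Wald chi-square and significance columns in Table 9 can be approximately
reproduced from B and the standard error. The sketch below assumes the usual
one-degree-of-freedom Wald test, W = (B / SE)^2, with the p-value taken from the
chi-square distribution; the function name is illustrative, and small differences
from the table are expected because the published B and SE values are rounded.

```python
import math

def wald_test(b, se):
    """Return (Wald chi-square, p-value) for H0: coefficient = 0, with 1 df."""
    w = (b / se) ** 2
    # Survival function of chi-square with 1 df: P(|Z| > sqrt(w))
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p

# NumVeh, 2004 column of Table 9: B = 0.818, Std. Error = 0.0362
w_numveh, p_numveh = wald_test(0.818, 0.0362)
print(round(w_numveh, 1), p_numveh < 0.05)

# Surface, 2004 column of Table 9: B = 0.057, Std. Error = 0.0310
w_surface, p_surface = wald_test(0.057, 0.0310)
print(round(w_surface, 2), 0.05 < p_surface < 0.10)
```

The Surface example shows a parameter that passes the 0.10 threshold but not the 0.05
threshold, matching the distinction drawn in the text.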

Table 10 presents the parameter estimates using the Poisson regression model. The
significance of the parameters is similar for 2004 and 2000, except for the variables
month, day, weather and surface. These variables are significant at the 0.05 level
with the year 2004 data but not with the year 2000 data.

Table 10. Parameter estimates using Poisson regression

Hypothesis test columns: Wald Chi2, Sig., and whether significant at the 0.10 level.

                                 2004                                            2000
Parameter          B Std. Error  Wald Chi2  Sig. Sig.@0.10         B Std. Error  Wald Chi2  Sig. Sig.@0.10
(Intercept)   -4.410      .1979    496.542  .000       yes    -4.691      .2231    442.253  .000       yes
Month          -.044      .0093     22.089  .000       yes      .014      .0098      2.064  .151        no
Day             .025      .0036     50.400  .000       yes     -.001      .0038       .030  .863        no
Hour           -.034      .0050     46.191  .000       yes     -.019      .0048     15.263  .000       yes
Weekday         .005      .0157       .099  .753        no     -.016      .0168       .933  .334        no
NumVeh          .867      .0284    934.627  .000       yes      .841      .0310    737.370  .000       yes
CrshType       -.102      .0059    301.378  .000       yes     -.079      .0059    178.777  .000       yes
NumInj        -2.731      .0388   4949.289  .000       yes    -2.483      .0482   2652.439  .000       yes
NumUnInj      -3.585      .0400   8038.972  .000       yes    -3.465      .0456   5782.059  .000       yes
NumOcc         2.898      .0280   10728.93  .000       yes     2.513      .0243   10733.83  .000       yes
Weather        -.083      .0259     10.140  .001       yes     -.037      .0303      1.464  .226        no
Light           .036      .0188      3.633  .057       yes      .098      .0203     23.039  .000       yes
Surface         .074      .0294      6.244  .012       yes     -.058      .0361      2.556  .110        no
TControl       -.045      .0300      2.255  .133        no     -.022      .0303       .515  .473        no
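
One way to read the B columns in Tables 9 and 10 is through the rate ratio exp(B).
This assumes both models use a log link, which is the standard form for Poisson
regression and the "log form" of the NB model discussed earlier; the paper does not
state the link function explicitly, so treat this as an illustrative sketch. Under
that assumption, exp(B) is the multiplicative change in the expected number of
persons killed per one-unit increase in the predictor, other variables held fixed.

```python
import math

# B for NumVeh in the 2004 columns of Table 9 (NB) and Table 10 (Poisson)
coefficients = {"NB": 0.818, "Poisson": 0.867}

rate_ratios = {}
for model, b in coefficients.items():
    rate_ratios[model] = math.exp(b)   # multiplicative change per unit increase
    print(f"{model}: exp({b}) = {rate_ratios[model]:.2f}")
```

The two models give rate ratios of similar size for this variable, consistent with the
observation that the Poisson and NB parameter estimates differ only slightly.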
3 Findings and Conclusion

The NB regression results showed that monthly, daily, hourly and weekday variations do
not have a statistically significant effect on accident severity (number of persons
killed). However, with the year 2000 data, the NB regression shows that daily
variations are significant at the 0.05 level. Similarly, surface conditions are
statistically significant with the year 2004 data but not with the 2000 data. Type of
traffic control was also found to be not statistically significant. Number of vehicles
involved, crash type (overturn, rear-end, side-swipe, head-on, hit object, and so on),
injury types (A, B, C), number of uninjured, number of occupants and weather
conditions are statistically significant at the 0.05 level. Light and surface
conditions were also statistically significant at the 0.10 level. The findings of the
Poisson regression are very similar to those of the NB regression with respect to the
number of vehicles, number of injured and uninjured persons, number of occupants,
weather, light and surface conditions, but the parameter estimates differ slightly
from those determined by NB regression. Another important difference between the NB
and Poisson regression results is that, in the Poisson model, monthly, daily and
hourly variations are statistically significant with the year 2004 data, whereas with
the year 2000 data only the hour of accident occurrence is significant; weekday is not
significant in either year (Table 10). The results agree with professional judgment
regarding the factors affecting accident severity in highway crashes.

Acknowledgement

The author acknowledges Dr. Dale Lighthizer of the Michigan Department of
Transportation and Prof. Dr. Richard Lyles of Michigan State University for providing
the crash data for the sole purpose of doing academic research and writing papers.

References

1. El-Basyouny, K. and Sayed, T. (2006). Comparison of two Negative Binomial
   Regression Techniques in Developing Accident Prediction Models. Transportation
   Research Record: Journal of the Transportation Research Board, Volume 1950,
   pp. 9-16.
2. Joshua, S.C. and Garber, N.J. (1990). Estimating truck accident rate and
   involvements using linear and Poisson regression models. Transport. Planning
   Technol. 15 (1), pp. 41-58.
3. Hilbe, J.M. (1993). Log-Negative Binomial Regression as a Generalized Linear Model.
   Technical Report COS 93/94-5-26. Department of Sociology and Graduate College
   Committee on Statistics.
4. Lord, D. (2000). The prediction of accidents on digital networks: characteristics
   and issues related to the application of accident prediction models. Ph.D.
   Dissertation, Department of Civil Engineering, University of Toronto, Toronto.
5. Lord, D.; Washington, S.; and Ivan, J. (2004). Poisson, Poisson-gamma, and
   zero-inflated regression models of motor vehicle crashes: balancing statistical fit
   and theory. Accident Analysis and Prevention, Pergamon Press/Elsevier Science.
6. Lord, D. (2006). Modeling motor vehicle crashes using Poisson-gamma models:
   examining the effects of low sample mean values and small sample size on the
   estimation of the fixed dispersion parameter. Accident Analysis and Prevention,
   38 (4), pp. 751-766.
7. Miaou, S.P. and Lum, H. (1993). Modeling vehicle accidents and highway geometric
   design relationships. Accident Analysis and Prevention, 25, pp. 689-709.
8. Oh, J.; Washington, S.P.; and Nam, D. (2006). Accident prediction model for
   railway-highway interfaces. Accident Analysis and Prevention, Volume 38, Issue 2,
   pp. 346-356.
9. UCLA: Academic Technology Services, Statistical Consulting Group. SPSS Data
   Analysis Examples: Negative Binomial Regression.
   http://www.ats.ucla.edu/stat/Spss/dae/spss_neg_binom_DAE.htm (accessed in April
   2011)
10. Washington, S.; Karlaftis, M.; and Mannering, F. (2003). Statistical and
    Econometric Methods for Transportation Data Analysis, Chapman Hall/CRC, Boca
    Raton, FL.
11. Winkelmann, R. and Zimmermann, K. (1995). Recent developments in count data
    modeling: theory and applications. J. Econ. Surveys 9, pp. 1-24.

Biography

Dr. Darçın Akın was born in Germany in 1966. He received his BS degree in civil
engineering from Dokuz Eylul University, Izmir, Turkey, in 1987 as valedictorian. He
received his MS and Ph.D. degrees in transportation engineering from Dokuz Eylul
University and Michigan State University (E. Lansing, MI, USA) in 1992 and 2000,
respectively.
During his Ph.D. study, he was involved in traffic engineering research projects
including a work zone speed study, a study of pedestrian behavior at signalized and
unsignalized crossings, and simulation of a campus network. He also worked part-time
at the TriCounty Regional Planning Commission and the Michigan Department of
Transportation (MDOT), where he worked on bicycle network planning and the statewide
long-range plan. Lastly, before he graduated, he worked as a full-time transport
planner at MDOT, where he maintained highway networks for cities' long-range plans. He
completed his Ph.D. in May 2000 and returned home to accept a faculty position at
Gebze Institute of Technology (GIT) in the Department of City and Regional Planning.
At GIT he has been teaching urban transport policies, urban transport systems, urban
planning and travel modeling. Outside of GIT, he has taught at several well-known
institutions, including Bogazici and Istanbul Technical Universities, in graduate as
well as undergraduate programs. Dr. Akın received his tenure and the title of
associate professor in planning in Feb 2010. Dr. Akın has many papers and publications
at several national and international conferences and symposiums. He has been a
reviewer for several periodicals and international conferences, including TRB of the
USA.
Dr. Akın has been a member of the Turkish Road Association since 2005 and the Chambers
of Civil Engineers since 1987. He was a member of the Institute of Transportation
Engineers (ITE) of the USA and the American Society of Civil Engineers (ASCE) between
1996 and 2000, and of the World Conference on Transport Research Society (WCTRS) from
2004-6 to 2010-13, and he has been a member of the Board of Trustees of the EMIT
Research Platform since 2011.
