ACD — Capstone Project — Car
accident severity
ACD_Coursera
Sep 22 · 3 min read
Business Problem: Introduction
What are the most salient factors that determine the severity of a car
accident? In this project we analyze a dataset of car accidents and the
conditions at the time of each event (road condition, visibility, weather).
By training different machine learning models, we'll be able to predict how
these conditions modulate the severity of the collision. Furthermore, we'll
construct a rating scale in order to assess the probability of an accident
(1: extremely low probability up to 5: highest probability).
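The post does not show how the 1 to 5 rating scale is built. One possible sketch, assuming equal-width probability bins, is given here; the bin edges and the helper name `probability_to_rating` are illustrative assumptions, not part of the project code.

```python
import numpy as np

def probability_to_rating(prob):
    """Map probabilities in [0, 1] to an integer rating from 1 to 5."""
    prob = np.asarray(prob, dtype=float)
    # Assumed equal-width bins: [0, 0.2) -> 1, [0.2, 0.4) -> 2, ..., [0.8, 1.0] -> 5
    edges = np.array([0.2, 0.4, 0.6, 0.8])
    return np.digitize(prob, edges) + 1

ratings = probability_to_rating([0.05, 0.25, 0.5, 0.79, 1.0])
print(ratings)
```

Any other binning (for example quantile-based edges) would plug into the same helper by changing `edges`.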
Data Understanding and Preparation
The data used for this project will be extracted from the Collisions set. In
order to prepare the data to be fed into the models, we'll drop the columns
that contain irrelevant information. This will allow us to greatly decrease
the computational costs associated with managing large amounts of
information. In particular, we'll use the columns WEATHER (x1),
ROADCOND (x2) and LIGHTCOND (x3) to predict the SEVERITYCODE (y).
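import pandas as pd

df = pd.read_csv("Data-Collisions.csv")

# Drop the columns that are not needed for the models.
# Several column names in this list are cut off in the original and are omitted here.
dasta = df.drop(columns=['OBJECTID', 'SEVERITYCODE.1', 'REPORTNO',
                         'X', 'Y', 'STATUS', 'ADDRTYPE',
                         'LOCATION', 'EXCEPTRSNCODE',
                         'SEVERITYDESC',
                         'PEDCYLCOUNT', 'PERSONCOUNT',
                         'SPEEDING'])

# Convert the three predictors to categorical type
dasta["WEATHER"] = dasta["WEATHER"].astype("category")
dasta["ROADCOND"] = dasta["ROADCOND"].astype("category")
dasta["LIGHTCOND"] = dasta["LIGHTCOND"].astype("category")

# Encode each category as an integer code
dasta["WEATHER_CAT"] = dasta["WEATHER"].cat.codes
dasta["ROADCOND_CAT"] = dasta["ROADCOND"].cat.codes
dasta["LIGHTCOND_CAT"] = dasta["LIGHTCOND"].cat.codes
dasta.head(20)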
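To see what the integer codes stand for, here is a small sketch on a toy column. The values are made up for illustration and are not taken from the collisions file; the point is how `cat.codes` numbers the categories and what happens to missing entries.

```python
import pandas as pd

# Toy stand-in for the collisions table (values are illustrative only)
toy = pd.DataFrame({"WEATHER": ["Clear", "Raining", "Clear", None, "Overcast"]})

toy["WEATHER"] = toy["WEATHER"].astype("category")
toy["WEATHER_CAT"] = toy["WEATHER"].cat.codes

# Categories are stored in sorted order, so the codes follow that order;
# missing values receive the code -1.
mapping = dict(enumerate(toy["WEATHER"].cat.categories))
print(mapping)
print(toy)
```

Keeping a mapping like this around makes it possible to translate model inputs back to readable labels later on.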
Next, we balance the target variable (SEVERITYCODE) by downsampling the
larger class prior to the modeling stage.
from sklearn.utils import resample

# The class labels, n_samples and random_state are cut off in the original;
# "..." marks each gap.
dasta_more = dasta[dasta.SEVERITYCODE == ...]
dasta_less = dasta[dasta.SEVERITYCODE == ...]
dasta_more_equal = resample(dasta_more,
                            replace=False,
                            n_samples=...,      # begins with 581 in the original
                            random_state=...)
dasta_bal = pd.concat([dasta_more_equal, dasta_less])
dasta_bal.SEVERITYCODE.value_counts()
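The balancing step can be tried end to end on a toy frame. This is a sketch under assumptions: the data is invented, and the majority class is sampled down to `len(minority)` rather than to a hard-coded count.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced target: 8 rows of class 1, 3 rows of class 2 (illustrative)
toy = pd.DataFrame({"SEVERITYCODE": [1] * 8 + [2] * 3, "feature": range(11)})

counts = toy["SEVERITYCODE"].value_counts()
majority_label, minority_label = counts.idxmax(), counts.idxmin()
majority = toy[toy["SEVERITYCODE"] == majority_label]
minority = toy[toy["SEVERITYCODE"] == minority_label]

# Sample the majority class without replacement down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)

balanced = pd.concat([majority_down, minority])
print(balanced["SEVERITYCODE"].value_counts())
```

Deriving the sample size from the data avoids having to update a literal number whenever the input file changes.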
Modeling
Once we have appropriately transformed our data, it's time to test
different models in order to determine the one that has the best accuracy. In
the usual order, we are going to train: a logistic regression model, a KNN
model and a decision tree.
import numpy as np

# Feature matrix and target vector
X = np.asarray(dasta_bal[['WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']])
X[0:5]
y = np.asarray(dasta_bal['SEVERITYCODE'])
y[0:5]

# Standardize the features
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
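As a quick check of what the standardization step does, the following sketch applies `StandardScaler` to a small made-up matrix of integer codes and confirms that every column ends up with mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy integer-coded features (three columns, like the *_CAT codes)
X_toy = np.array([[0, 1, 2],
                  [1, 0, 2],
                  [2, 1, 0],
                  [1, 2, 0]], dtype=float)

X_scaled = StandardScaler().fit(X_toy).transform(X_toy)

col_means = X_scaled.mean(axis=0)  # approximately 0 for every column
col_stds = X_scaled.std(axis=0)    # approximately 1 for every column
print(col_means, col_stds)
```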
# As we did in previous labs, I have decided to use an 80% (train) and 20% (test) split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=3)
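The effect of an 80/20 split is easy to verify on toy arrays. The data below is invented; only the `test_size` and `random_state` settings mirror the ones described in the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 3 features, two balanced classes (illustrative)
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.array([1, 2] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3)

print(X_tr.shape, X_te.shape)  # (8, 3) (2, 3)
```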
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# The value of C is cut off in the original; "..." marks the gap
logReg = LogisticRegression(C=..., solver='liblinear').fit(X_train, y_train)
logPred = logReg.predict(X_test)
LogPredodd = logReg.predict_proba(X_test)
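For readers who want to see what `predict` and `predict_proba` return, here is a sketch on a tiny separable data set. The data is made up, and `C=1.0` is an assumed value rather than the one used in the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data with labels 1 and 2 (illustrative)
X_toy = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

# C=1.0 is an assumed value, not the one used in the post
clf = LogisticRegression(C=1.0, solver="liblinear").fit(X_toy, y_toy)

X_new = np.array([[-1.8], [1.8]])
proba = clf.predict_proba(X_new)
labels = clf.predict(X_new)
print(clf.classes_)        # column order of predict_proba
print(proba.sum(axis=1))   # each row sums to 1
print(labels)
```

The columns of `predict_proba` follow `clf.classes_`, which matters when the probabilities are later passed to a probability-based metric.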
# KNN
from sklearn.neighbors import KNeighborsClassifier
ks = 15
hood = KNeighborsClassifier(n_neighbors = ks).fit(X_train, y_train)
hood
hoodPred = hood.predict(X_test)
hoodPred[0:5]
array([2, 2, 2, 2, 1])
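The post fixes the number of neighbors at 15 without showing how it was chosen. One common way to compare candidates is sketched here on synthetic data; the candidate values and the data set are assumptions for illustration, not the project's procedure.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the collisions set)
X_syn, y_syn = make_classification(n_samples=300, n_features=3,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.2, random_state=3)

# Held-out accuracy for a few candidate neighborhood sizes
scores = {}
for k in (1, 5, 15, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, knn.predict(X_te))

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In practice the comparison would usually be run on a validation split or with cross-validation, so that the final test set stays untouched.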
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

# The value of max_depth is cut off in the original; "..." marks the gap
treedat = DecisionTreeClassifier(criterion="entropy", max_depth = ...)
treedat
treedat.fit(X_train, y_train)
treePred = treedat.predict(X_test)
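A fitted tree can be inspected as text, which helps to see which condition drives the split. This sketch uses invented data and an assumed `max_depth=2`; the feature names are borrowed from the post only to label the output.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy integer-coded features; the label depends only on the first column
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([1, 1, 2, 2])

# max_depth=2 is an assumed value; the post's setting is cut off
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                              random_state=0).fit(X_toy, y_toy)

print(export_text(tree, feature_names=["ROADCOND_CAT", "LIGHTCOND_CAT"]))
tree_pred = tree.predict(np.array([[0, 1], [1, 0]]))
print(tree_pred)
```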
Evaluation
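Now that we have successfully trained 3 different models, we need to
evaluate them in order to determine which is the most accurate.
from sklearn.metrics import f1_score
from sklearn.metrics import ...
from sklearn.metrics import ...

# The names of the remaining metric functions are cut off in the original
..._score(y_test, logPred)
...(y_test, LogPredodd)

# Output values as they survive in the original (leading digits partly cut off)
0.529386492524
0.683972651397
...542896479876548
...5571833648393195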
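To make the evaluation step concrete, here is a sketch of three scikit-learn metrics on hand-made predictions. Only `f1_score` survives intact in the post's imports; `accuracy_score` and `log_loss` are assumptions, chosen because the conclusion mentions accuracy and because the logistic model's probabilities are scored. The numbers below are toy values, not the project's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Toy ground truth and predictions with labels 1 and 2 (illustrative)
y_true = np.array([1, 1, 2, 2])
y_pred = np.array([1, 2, 2, 2])
# Predicted probabilities, columns ordered as classes [1, 2]
y_prob = np.array([[0.8, 0.2],
                   [0.4, 0.6],
                   [0.3, 0.7],
                   [0.1, 0.9]])

acc = accuracy_score(y_true, y_pred)               # 3 of 4 correct
f1 = f1_score(y_true, y_pred, average="weighted")  # support-weighted F1
ll = log_loss(y_true, y_prob, labels=[1, 2])       # lower is better
print(acc, f1, ll)
```

Accuracy and F1 work on hard labels, while log loss needs the probability matrix, which is why the logistic regression section keeps both `predict` and `predict_proba` outputs.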
Conclusion
In this project, we tackled the problem of predicting the severity of a
collision (y) via a multifactorial model that takes road condition,
visibility and weather as input variables. After selecting the dataset, we
curated it to facilitate the construction of 3 different models: logistic
regression, KNN and a decision tree. After evaluating the models via
accuracy tests, we concluded that logistic regression is the best of the
three approaches for this task. The results presented here are subject to
further refinement by exploring higher-order ML models.