
Analysis of Train Accidents in the U.S. During 2001-2012

Imran A. Khan

Summary
Many factors can affect the severity of rail accidents. Based on my findings, season is an important
factor for fatalities: more fatalities occur during the summer. The estimated increase in fatalities
during summer, relative to spring, is about 0.22, with a 95% confidence interval between 0.04 and 0.4.
It is therefore important for the FRA to add extra safety measures for trains running during the
summer. Type of accident and cause of accident significantly affect the damage cost at the 5% level. A
train accident at an RR grade crossing is more likely to cause a high damage cost, so greater safety
measures at RR grade crossings can reduce the damage cost. The FRA should also train its personnel well
on safety in order to minimize human error.
Honor pledge: On my honor, I pledge that I am the sole author of this paper and I have accurately cited all
help and references used in its completion.

Imran A. Khan

1. Problem Description
1.1. Situation
According to Federal Railroad Administration (FRA) data from 2001-2012 [3], about 2,500 train
accidents occur annually in the U.S. These incidents cause casualties, ranging from moderately severe
injuries to death, as well as damage costs. Many of the incidents have a small damage cost and do not
lead to any deaths. On average, a train accident in the last 12 years caused no injuries, as shown in
Figure 1. I observe that more than 13% of the accidents lead to a high damage cost and about 1% lead
to fatalities. These are the extreme values above the upper whisker of the boxplots, and I consider
them severe accidents.
Figure 1. Boxplot of severity metrics
[Two panels: "Boxplot of Total Damage" (axis: Total Damage, x$1,000,000) and "Boxplot of Fatalities" (axis: Fatalities).]

For a given accident, many factors come into play. Table 1 shows that human error is the most common
cause, accounting for 34% of train accidents, followed by track, roadbed and structures failures, which
account for 32%. Train derailment is the most common type of accident and accounts for the largest
share of damage cost (Table 2). However, this type of accident causes very few fatalities. Only 7% of
train accidents occur at highway-rail crossings, but they account for most of the fatalities (85%).
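The percentage columns in Table 1 are each category's share of the column total. As a minimal sketch, the shares for the accident counts can be recomputed as follows; the counts are copied from Table 1, while the function name `percent_shares` is only an illustrative choice.

```python
# Accident counts by cause, copied from Table 1.
counts = {
    "E": 3617,   # Mechanical and Electrical Failures
    "H": 10655,  # Human Factors
    "M": 6327,   # Miscellaneous Causes
    "S": 589,    # Signal and Communication
    "T": 9785,   # Track, Roadbed and Structures
}


def percent_shares(table):
    """Return each category's share of the total, in percent."""
    total = sum(table.values())
    return {key: 100.0 * value / total for key, value in table.items()}


shares = percent_shares(counts)
for cause, share in shares.items():
    # Rounded to whole percent, as in Table 1.
    print(f"{cause}: {share:.0f}%")
```

The counts sum to 30973, the number of data points used in the study, and the rounded shares for H and T come out at 34% and 32%.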

Table 1. Frequency table for severity metrics by type of cause

                                                    Count (#)                     Share of total (%)
Type of cause                            Number  Fatalities     Damage ($)    Number  Fatalities  Damage
Mechanical and Electrical Failures (E)     3617                561,518,909       12%          0%     17%
Train operation - Human Factors (H)       10655          35    756,877,988       34%          7%     23%
Miscellaneous Causes (M)                   6327         452    625,508,542       20%         90%     19%
Signal and Communication (S)                589                 24,838,419        2%          0%      1%
Track, Roadbed and Structures (T)          9785          16  1,377,074,145       32%          3%     41%

Table 2. Frequency table for severity metrics by type of accident

                                        Count (#)                     Share of total (%)
Type of accident             Number  Fatalities     Damage ($)    Number  Fatalities  Damage
Derailment (1)                20694          22  2,552,192,579       67%          4%     76%
Head on collision (2)            99                 83,868,583        0%          2%      3%
Rear-end collision (3)          218                 66,591,167        1%          1%      2%
Side collision (4)             1221                121,312,370        4%          1%      4%
Raking collision (5)            500                 26,922,342        2%          0%      1%
Broken train collision (6)       62                  4,455,089        0%          0%      0%
Highway-rail crossing (7)      2309         428    160,160,623        7%         85%      5%
RR Grade Crossing (8)                                7,324,022        0%          0%      0%
Obstruction (9)                 728          20     59,304,105        2%          4%      2%
Explosive (10)                   12                 18,268,351        0%          0%      1%
Fire (11)                       231                 37,921,122        1%          0%      1%
Other impacts (12)             3428                134,453,596       11%          1%      4%
Others (13)                    1469          11     73,044,054        5%          2%      2%

Figure 2. Boxplot of severity metrics vs. season
[Two panels: "Total Damage vs. Season" (y-axis: Total Damage, x$1,000,000) and "Fatalities vs. Season" (y-axis: Fatalities); x-axis: Season (Spring, Summer, Autumn, Winter).]

Figure 3. Boxplot of severity metrics vs. type of accident
[Two panels: "Total Damage vs. Type of Accident" (y-axis: Total Damage, x$1,000,000) and "Fatalities vs. Type of Accident" (y-axis: Fatalities); x-axis: Type.]

Figure 4. Boxplot of severity metrics vs. cause of accident
[Two panels: "Total Damage vs. Cause of Accident" (y-axis: Total Damage, x$1,000,000) and "Fatalities vs. Cause of Accident" (y-axis: Fatalities); x-axis: Cause.]

In this study, the predominant focus is on the severity of train accidents, i.e. incidents that lead to
more fatalities and a higher damage cost, and on how they can be minimized. Summer appears to lead to
more fatalities than the other seasons, but the total damage is similar across the four seasons
(Figure 2). Different types of accident lead to different severity. Among severe accidents, explosive
detonation appears to be the type of accident associated with the highest damage cost (Figure 3). Cause
of accident is another factor that affects the damage cost (Figure 4).
Figures 5 and 6 show that there is a relationship between train speed and the severity metrics: higher
speeds tend to be associated with more fatalities and a higher damage cost. The plots also suggest that
the more people are evacuated, the more the severity of rail accidents can be reduced. Gross tonnage of
a train (TONS) and the number of head end locomotives (HEADEND1) are other important factors related to
the severity metrics.

Figure 5. Scatterplot matrix between fatalities (TOTKLD) and the other quantitative variables
[Scatterplot matrix of TOTKLD, TRNSPD, EVACUATE, HEADEND1, and TONS.]

Figure 6. Scatterplot matrix between total damage (ACCDMG) and the other quantitative variables
[Scatterplot matrix of ACCDMG, TRNSPD, EVACUATE, HEADEND1, and TONS.]

The biplots displayed in Figure 7 show that many factors are related to fatalities and damage cost,
since many vectors point in the same direction as TOTKLD and ACCDMG. This confirms my findings from the
scatterplot matrices in Figures 5 and 6: the variables behind those vectors are potential factors in
the severity metrics.
Figure 7. Biplot for human and cost damage
[Two biplots, "Fatalities" and "Total Damage", each on axes Comp.1 and Comp.2. Variable vectors shown: TOTKLD (fatalities panel) or ACCDMG (total damage panel), CARS, CARSDMG, CARSHZD, EVACUATE, HEADEND1, TRNSPD, TONS, TEMP, Latitude, and Longitud.]

1.2. Goal
The purpose of this study is to provide recommendations so that the FRA can take action to reduce the
severity of railroad accidents in terms of fatalities and total damage.
1.3. Metrics
I use multiple linear regression models to model the severity metrics. I consider several potential
factors, such as type of accident, cause of accident, and season, to predict the severity of rail
accidents. I use a significance level of 5% throughout this study. If the p-value is less than 0.05,
the null hypothesis is rejected in favor of the alternative. If the p-value is greater than 0.05, the
null hypothesis is not rejected. I also use the adjusted R², AIC, and BIC criteria to compare
models.
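The comparison criteria named above can be written out explicitly. The sketch below uses the standard textbook definitions for a least-squares fit with normal errors; it is not taken from the analysis itself, and the function names and the parameter-count convention (slopes, intercept, and error variance) are assumptions made for illustration.

```python
import math


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n cases and p estimated slopes (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def gaussian_aic_bic(rss, n, p):
    """AIC and BIC of a least-squares fit, assuming normal errors.

    The parameter count k covers p slopes, the intercept, and the
    error variance (an assumed convention).
    """
    k = p + 2
    log_lik = -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)
    aic = -2.0 * log_lik + 2.0 * k
    bic = -2.0 * log_lik + math.log(n) * k
    return aic, bic


def reject_null(p_value, alpha=0.05):
    """Decision rule at significance level alpha."""
    return p_value < alpha


# R^2 = 29.6% on 391 cases with 15 slopes (values from Table B2).
adj = adjusted_r2(0.296, 391, 15)
print(round(adj, 4))
```

With R² = 29.6%, 391 cases, and 15 slopes, the adjusted R² comes out at about 0.268, in line with the 26.79% reported in Table B2. Lower AIC or BIC and higher adjusted R² indicate the preferred model; BIC penalizes extra parameters more heavily than AIC whenever log(n) exceeds 2.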

1.4. Hypothesis
Based on my observations in Figures 2, 3, and 4, I have three hypotheses regarding the severity of rail
accidents:
Hypothesis 1:
H0: Season does not affect the number of fatalities.
H1: Season affects the number of fatalities.
Hypothesis 2:
H0: Type of accident does not affect damage cost.
H1: Type of accident affects damage cost.
Hypothesis 3:
H0: Cause of accident does not affect damage cost.
H1: Cause of accident affects damage cost.

2. Approach
2.1. Data
The data set is obtained from the FRA railroad accident reports for the period 2001-2012 [3]. In total,
there are 42033 accidents over the 12 years with 140 relevant variables. I find that 20% of the data
points are duplicates. I also find one data point with an extreme value for evacuation in 2002. The
reported number is 50000 people evacuated in a single accident, which is very unlikely. I treat this
extreme value as a typo because it is very large compared to the other cases (Figure B1, Appendix B).
Thus, I do not include the duplicates or this data point in the analysis. I also exclude the data
points from September 11, 2001, because the chance of such an event happening again is almost zero and
its damage cost is significantly higher than that of the other cases. This leaves 30973 data points
for this study.
Since my interest is in severe accidents, only extreme cases above the upper whisker of the boxplots of
the severity metrics are taken into account: accidents with at least one fatality (for the fatalities
model) and accidents with a damage cost of at least $143,861 (for the total damage model). I use all
potential predictor variables, including confounding variables, in the initial models.
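The "above the upper whisker" rule can be illustrated with the usual boxplot fence, Q3 + 1.5 × IQR. The sketch below is only an approximation of what a plotting package does, since quartile conventions differ between tools; the damage values are hypothetical and the function names are illustrative.

```python
import statistics


def upper_whisker_limit(values):
    """Upper boxplot fence: Q3 + 1.5 * IQR (inclusive quartile method)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def extreme_cases(values):
    """Values strictly above the upper fence."""
    limit = upper_whisker_limit(values)
    return [v for v in values if v > limit]


# Hypothetical damage costs in dollars, for illustration only.
damage = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 250000]
limit = upper_whisker_limit(damage)
severe = extreme_cases(damage)
print(limit, severe)
```

For this toy sample the quartiles are 3000 and 7000, so the fence is 13000 and only the 250000 case counts as severe.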

In total, there are 21 predictors: 10 continuous variables and 11 categorical variables, including
SEASON, which I created as a new variable (Table A1, Appendix A). I remove all cases with missing
values, since the methodology requires complete observations. In total, I use 391 cases to model
fatalities (TOTKLD) and 2954 cases to model total damage (ACCDMG).
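SEASON is a derived variable. A minimal sketch of such a derivation, assuming the accident month is available as an integer from 1 to 12 and using the month ranges listed in Appendix A, could look like this (the names `SEASON_BY_MONTH` and `season_of` are illustrative):

```python
# Month-to-season mapping, following the ranges in Appendix A:
# spring = Mar-May, summer = Jun-Aug, autumn = Sep-Nov, winter = Dec-Feb.
SEASON_BY_MONTH = {
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
    12: "Winter", 1: "Winter", 2: "Winter",
}


def season_of(month):
    """Return the season label for a calendar month (1-12)."""
    if month not in SEASON_BY_MONTH:
        raise ValueError(f"month must be 1-12, got {month!r}")
    return SEASON_BY_MONTH[month]


print([season_of(m) for m in (1, 4, 7, 10)])
```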

2.2. Analysis
In modeling the severity metrics, I perform multiple linear regression analysis in R with the general
model
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon $$
The stages of data analysis are as follows:
1. I convert all categorical predictor variables into dummy variables. For example, TYPE has 13 levels
and R automatically encodes these 13 levels into 12 dummy variables with derailment as the base
case. See Table A1 in Appendix A for the base case selected for each categorical variable.
2. I fit a simple linear regression for each of the 21 potential predictor variables.
Continuous predictors with p-value > 0.25 are not considered in the initial model (Full Model).
3. I reduce the full models by dropping all the non-significant predictors (Reduced Model). I use a
partial F-test to examine whether the smaller set of predictors can be retained.
4. I also perform an alternative model selection, i.e. a stepwise selection procedure, to select the
important predictors in the full model (Step Model).
5. I then compare the reduced model and the step model by the adjusted R² and AIC criteria to select
the best model. I cannot use cross validation for model comparison, because the regression model may
be fitted on a fold in which certain levels of a factor variable are not present.
6. I introduce second order and interaction terms for the selected model.
7. I use graphical diagnostic plots to examine how well the regression assumptions are satisfied.
8. I transform the response variable if the regression assumptions are violated.
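Step 1 above relies on treatment (reference) coding: each non-base level of a factor gets its own 0/1 column, and the base case is the row of all zeros. The following sketch shows the idea in plain Python; it is an illustration of the coding scheme, not the code used in the analysis, and `dummy_code` is a hypothetical helper.

```python
def dummy_code(values, base):
    """Treatment coding: one 0/1 column per level other than the base case."""
    levels = sorted(set(values) - {base})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Accident type codes as in Appendix A; "01" (derailment) is the base case.
types = ["01", "07", "02", "01"]
levels, rows = dummy_code(types, base="01")
print(levels)  # the non-base levels, one column each
print(rows)    # derailments are rows of zeros

# A factor with 13 levels yields 12 dummy columns.
all_types = [f"{i:02d}" for i in range(1, 14)]
all_levels, _ = dummy_code(all_types, base="01")
print(len(all_levels))
```

Each dummy coefficient is then read as the difference from the base case, which is how the TYPE, TYPEQ, and SEASON rows in the coefficient tables are interpreted.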
For the fatalities model, there are 10 predictor variables in the full model (Table B1, Appendix B).
This model can be reduced by dropping 7 variables, i.e. TRNSPD, TONS, HEADEND1, TYPE, TYPTRK, TRKCLAS,
and CAUSE (partial F-statistic 0.913, p-value 0.5743). The BIC favors the step model, which has the
smaller BIC value (881.76), but the AIC and the adjusted R² both favor the reduced model (Table B2,
Appendix B). Next, a second order model including interaction terms is considered for the reduced
model. I find that the second order model does not fit better than the first order model: a partial
F-test gives F = 1.39 with p-value 0.24. This means that the interaction terms and the second order
term of EVACUATE are not important in the model, and the first order model is preferable. Diagnostic
plots show that the fitted model moderately violates the regression assumptions. The residuals are
generally scattered randomly throughout the range of fitted values, and the points generally fall
around the line in the Q-Q plot (Figure B2, Appendix B). Transforming the response variable with the
Box-Cox method does not improve the fit (Figure B3, Appendix B), so the fitted model without
interaction and second order terms is chosen for ease of interpretation. Table 3 summarizes the
estimated coefficients (standard errors) and the corresponding p-values for the first and second order
models.
Table 3. Comparison of the first and second order models for fatalities

                                        First order model                 Second order model
Term                                    Estimate (Std. Error)  P-value    Estimate (Std. Error)  P-value
(Intercept)                             1.07 (0.08)            <0.0001    1.07 (0.08)            <0.0001
EVACUATE                                0.001 (0)              <0.0001    0 (0.002)              0.87
Type of consist (base case = TYPEQ 1)
  TYPEQ 2                               0.46 (0.09)            <0.0001    0.46 (0.09)            <0.0001
  TYPEQ 3                               0.1 (0.14)             0.46       0.09 (0.14)            0.51
  TYPEQ 4                               -0.3 (0.7)             0.67       -0.3 (0.7)             0.67
  TYPEQ 6                               -0.07 (0.7)            0.92       -0.07 (0.7)            0.92
  TYPEQ 7                               0.19 (0.35)            0.58       0.18 (0.35)            0.60
  TYPEQ 8                               0.02 (0.2)             0.93       0.01 (0.2)             0.95
  TYPEQ 9                               -0.3 (0.5)             0.55       -0.3 (0.5)             0.55
  TYPEQ A                               -0.3 (0.7)             0.67       -0.3 (0.7)             0.67
  TYPEQ C                               0.08 (0.35)            0.82       0.1 (0.35)             0.78
  TYPEQ D                               0.43 (0.41)            0.29       0.44 (0.41)            0.28
  TYPEQ E                               -0.07 (0.7)            0.92       -0.07 (0.7)            0.92
Season (base case: Spring)
  Summer                                0.23 (0.1)             0.02       0.22 (0.1)             0.03
  Autumn                                -0.07 (0.11)           0.51       -0.06 (0.11)           0.61
  Winter                                0.03 (0.1)             0.74       0.03 (0.1)             0.80
EVACUATE x Summer                                                         0.003 (0.004)          0.43
EVACUATE x Autumn                                                         0.0001 (0.003)         0.97
EVACUATE x Winter                                                         -0.001 (0.009)         0.95
EVACUATE²                                                                 4.7E-7 (1.8E-6)        0.80

For the total damage model, 16 predictor variables are considered in the initial model (Table B1,
Appendix B). Seven of them are not significant in the full model, so only 9 predictors are kept in the
reduced model, i.e. CARSHZD, EVACUATE, TRNSPD, TONS, TYPE, TRNDIR, REGION, TYPTRK, and CAUSE. A partial
F-test does not reject the reduced model at the 5% level (F-statistic 1.74, p-value 0.08). The BIC
favors the reduced model, which has the smaller BIC value (87957.48). However, the model from the
stepwise selection procedure is selected as the best model, since its AIC is smaller and its adjusted
R² is larger than those of the reduced model (Table B3, Appendix B). As shown in Table 4, the stepwise
model extended with second order and interaction terms (Model 2) fits significantly better than the
first order stepwise model (p-value < 0.0001). Model 2 is then reduced by a further stepwise selection
procedure. I find that some interaction and second order terms can be dropped from the model, so
Model 3 can be retained (p-value = 0.99).
Table 4. Partial F-test

Model                                                  Res. Df   RSS        Df   Sum of Square   F-test   P-value
Model 1: Step model (first order model)                2906      1.32E+15
Model 2: Step model including second order and         2883      1.24E+15   23   7.80E+13        7.89     <0.0001
  interaction terms (second order model)
Model 3: Model 2 after performing stepwise             2886      1.24E+15        4.35E+10        0.03     0.99
  selection procedure
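The F-statistic in Table 4 follows from the usual partial F-test formula: the extra sum of squares explained by the larger model, divided by the number of added parameters, over the residual mean square of the larger model. The sketch below recomputes the Model 1 vs. Model 2 comparison from the rounded table entries; the function name is illustrative, and the p-value is omitted because the Python standard library has no F distribution.

```python
def partial_f(extra_sum_sq, df_diff, rss_full, df_full):
    """Partial F-statistic for nested least-squares models.

    extra_sum_sq : reduction in RSS when moving to the larger model
    df_diff      : number of parameters added by the larger model
    rss_full     : residual sum of squares of the larger model
    df_full      : residual degrees of freedom of the larger model
    """
    return (extra_sum_sq / df_diff) / (rss_full / df_full)


# Model 1 vs. Model 2, using the rounded entries of Table 4.
f_stat = partial_f(extra_sum_sq=7.80e13, df_diff=23,
                   rss_full=1.24e15, df_full=2883)
print(round(f_stat, 2))
```

The result is about 7.88, which agrees with the tabulated 7.89 up to the rounding of the inputs.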

The diagnostic plots show that the selected model (Model 3) moderately violates the regression
assumptions (Figure B4, Appendix B). As with the fatalities model, transforming the response variable
with the Box-Cox method does not improve the fit (Figure B5, Appendix B). Therefore, the fitted model
with second order terms and without any transformation of the response variable is chosen for ease of
interpretation. Table 5 summarizes the estimated coefficients, standard errors, and p-values.
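For reference, the Box-Cox transformation of a positive response y with parameter lambda is (y^lambda - 1) / lambda, with the natural logarithm as the limiting case at lambda = 0. A minimal sketch of the transform itself (not of the lambda search) is given below; the captions of Figures B3 and B5 report lambda = -2 for fatalities and lambda = -0.5 for total damage.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# Fatality counts in the severe subset are at least 1, so the transform is defined.
print(box_cox(1.0, -2.0))  # one fatality maps to 0 for any lambda
print(box_cox(2.0, -2.0))  # (2**-2 - 1) / -2
```

The transform is monotone increasing for any lambda, so it changes the spacing of the response values but not their order.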

Table 5. The selected second order model for total damage

Term                                          Estimate (Std. Error)   P-value
Intercept                                     746000 (393000)         0.058
CARSDMG                                       8660 (6180)             0.161
CARSHZD                                       110000 (31500)          <0.0001
EVACUATE                                      283 (102)               0.006
TRNSPD                                        -24100 (4930)           <0.0001
TONS                                          26.9 (4.12)             <0.0001
Type of accident (base case: derailment)
  Head on collision                           1470000 (134000)        <0.0001
  Rearend collision                           465000 (106000)         <0.0001
  Side collision                              251000 (73000)          0.001
  Raking collision                            111000 (140000)         0.429
  Broken train collision                      -199000 (221000)        0.369
  Hwy-rail crossing                           -302000 (82000)         <0.0001
  RR Grade Crossing                           6760000 (658000)        <0.0001
  Obstruction                                 242000 (123000)         0.049
  Explosive detonation                        266000 (658000)         0.686
  Fire / violent rupture                      -56300 (113000)         0.620
  Other impacts                               104000 (73000)          0.155
  Others                                      -47600 (111000)         0.667
Train direction (base case: north)
  South                                       -70600 (65200)          0.279
  East                                        -116000 (61000)         0.058
  West                                        -158000 (63900)         0.013
FRA designated region (base case: Region 1)
  Region 2                                    -71000 (107000)         0.509
  Region 3                                    -260000 (107000)        0.015
  Region 4                                    -133000 (103000)        0.196
  Region 5                                    -207000 (96700)         0.033
  Region 6                                    -196000 (98600)         0.046
  Region 7                                    81800 (107000)          0.443
  Region 8                                    -108000 (105000)        0.306
Type of consist (base case: TYPEQ -NA)
  TYPEQ 1                                     -225000 (383000)        0.556
  TYPEQ 2                                     -18500 (392000)         0.962
  TYPEQ 3                                     562000 (417000)         0.177
  TYPEQ 4                                     -411000 (409000)        0.315
  TYPEQ 5                                     -426000 (440000)        0.333
  TYPEQ 6                                     83500 (398000)          0.834
  TYPEQ 7                                     -85300 (384000)         0.824
  TYPEQ 8                                     -54500 (401000)         0.892
  TYPEQ 9                                     -117000 (426000)        0.783
  TYPEQ A                                     -39800 (418000)         0.924
  TYPEQ B                                     560000 (610000)         0.359
  TYPEQ D                                     -502000 (611000)        0.411
Type of track (base case: Main)
  Yard                                        -83900 (64900)          0.196
  Siding                                      377000 (132000)         <0.0001
  Industry                                    -27800 (110000)         0.801
Cause of accident (base case: E)
  H                                           -340000 (76400)         <0.0001
  M                                           -12400 (76100)          0.871
  S                                           -214000 (195000)        0.271
  T                                           -253000 (68400)         <0.0001
CARSHZD²                                      -2820 (1230)            0.022
TRNSPD²                                       232 (42.3)              <0.0001
TONS²                                         0 (0)                   0.032
Longitud²                                     12.3 (3.92)             <0.0001
TRNSPD x South                                96.2 (2410)             0.968
TRNSPD x East                                 7690 (2290)             <0.0001
TRNSPD x West                                 7900 (2400)             <0.0001
TRNSPD x Region 2                             2290 (3780)             0.545
TRNSPD x Region 3                             19500 (3770)            <0.0001
TRNSPD x Region 4                             12600 (3550)            <0.0001
TRNSPD x Region 5                             15700 (3370)            0.000
TRNSPD x Region 6                             17300 (3400)            <0.0001
TRNSPD x Region 7                             9200 (3690)             0.013
TRNSPD x Region 8                             12100 (3690)            <0.0001
TRNSPD x Yard                                 -7370 (5980)            0.218
TRNSPD x Siding                               -25500 (7220)           <0.0001
TRNSPD x Industry                             -12200 (10200)          0.231
TRNSPD x H                                    16400 (2740)            <0.0001
TRNSPD x M                                    4740 (2490)             0.056
TRNSPD x S                                    4680 (10200)            0.646
TRNSPD x T                                    14800 (2190)            <0.0001

3. Evidence
I find that season is an important factor for fatalities. The partial F-test shows that season cannot
be eliminated from the model (F-statistic: 3.49, p-value: 0.016). The p-value for summer is 0.03, so I
have evidence to reject the null hypothesis at the 5% level. The resulting coefficient indicates that
the number of fatalities is higher during summer: the estimated increase relative to spring is about
0.22, with a 95% confidence interval between 0.04 and 0.4. Based on the final model for fatalities, I
observe that type of consist (TYPEQ) is another important factor for fatalities.
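A 95% confidence interval of this kind is the estimate plus or minus a critical value times the standard error. The sketch below uses the normal approximation (critical value about 1.96) instead of the exact t quantile, which is an assumption that matters little with several hundred cases; the inputs are the rounded summer estimate and standard error from Table 3.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, level=0.95):
    """Two-sided confidence interval using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# Summer coefficient of the second order model in Table 3: 0.22 (0.1).
low, high = normal_ci(0.22, 0.1)
print(round(low, 3), round(high, 3))
```

With these rounded inputs the interval is roughly 0.02 to 0.42. The small difference from the reported 0.04 to 0.4 is consistent with rounding in the tabulated estimate and standard error.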
For total damage, cause and type of accident are important factors. Different causes and different
types of accident lead to different damage costs, and the differences are statistically significant.
At the 5% significance level, these effects cannot be dropped from the model (F-statistic 5.82,
p-value < 0.0001). Therefore, I reject the null hypotheses that cause and type of accident do not
affect total damage. It should be noted that train speed and cause of accident have an interaction
effect on damage cost (Figure B6, Appendix B). This means that the relationship between total damage
and cause of accident depends on the train speed. I observe that at high train speed, human error
leads to a higher damage cost. Furthermore, with the other factors held fixed, the expected total
damage for a severe accident is highest at an RR grade crossing, i.e. $7,506,000, and the effect is
highly significant at the 5% level.
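The $7,506,000 figure can be traced back to Table 5 as the intercept plus the RR Grade Crossing coefficient. The sketch below shows that arithmetic under the simplifying assumption that all continuous predictors are zero and every other categorical predictor is at its base case; the helper name is illustrative.

```python
# Coefficients copied from Table 5 (in dollars).
INTERCEPT = 746_000
RR_GRADE_CROSSING = 6_760_000


def expected_damage(intercept, *effects):
    """Fitted value when only the listed dummy effects are switched on."""
    return intercept + sum(effects)


baseline = expected_damage(INTERCEPT)                     # derailment, base case
crossing = expected_damage(INTERCEPT, RR_GRADE_CROSSING)  # RR grade crossing
print(baseline, crossing)
```

The difference between the two fitted values is the RR Grade Crossing coefficient itself, i.e. the estimated extra damage relative to a derailment with everything else held fixed.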

4. Recommendation
The evidence shows that several factors can lead to severe train accidents, including season, type of
accident, and cause of accident. The best models for my hypotheses explain about 25% of the variation
in the severity of rail accidents, based on the adjusted R² (Tables B2 and B3, Appendix B). At the 5%
significance level, the effect of season on fatalities is statistically significant. The estimated
increase in fatalities during summer, relative to spring, is about 0.22, with a 95% confidence
interval between 0.04 and 0.4. The effects of type of accident and cause of accident on total damage
are also significant. At the 5% level, these factors cannot be eliminated from the model, so they are
important to the severity of train accidents. This confirms my findings based on the plots shown in
Figures 2, 3, and 4. The results suggest that the FRA should add extra safety requirements for trains
running during the summer. Human errors are often unavoidable, and the damage cost model reflects
this: I find that human error is one of the most important factors leading to a higher damage cost.
The FRA should train its personnel well on safety so that human error can be minimized. In addition,
it is important to put greater safety measures in place for trains at RR grade crossings.

5. References
[1] D. E. Brown and L. Barnes, "Laboratory 1: Train accidents," August 2013, assignment in class SYS
4021.
[2] D. E. Brown and L. Barnes, "Laboratory 1: Train accidents template," August 2013, assignment in
class SYS 4021.
[3] Federal Railroad Administration, "Federal Railroad Administration Office of Safety Analysis,"
August 2012. [Online]. Available: http://safetydata.fra.dot.gov/officesafety/

Appendix A
Table A1. Accident Description

No  Field Name  Description                                                            Type
1   TOTKLD      Fatalities - total killed for railroads                                Response variable
2   ACCDMG      Total reportable damage on all reports in $                            Response variable
3   CARS        # of cars carrying hazmat                                              Continuous variable
4   CARSDMG     # of hazmat cars damaged or derailed                                   Continuous variable
5   CARSHZD     # of cars that released hazmat                                         Continuous variable
6   EVACUATE    # of persons evacuated                                                 Continuous variable
7   TEMP        Temperature in degrees Fahrenheit                                      Continuous variable
8   TRNSPD      Speed of train in miles per hour                                       Continuous variable
9   TONS        Gross tonnage, excluding power units                                   Continuous variable
10  HEADEND1    # of head end locomotives                                              Continuous variable
11  Latitude    Latitude in decimal degrees, explicit decimal, explicit +/- (WGS84)    Continuous variable
12  Longitud    Longitude in decimal degrees, explicit decimal, explicit +/- (WGS84)   Continuous variable
13  TYPE        Type of accident:                                                      Categorical variable
                01=derailment (base case), 02=head on collision, 03=rearend collision,
                04=side collision, 05=raking collision, 06=broken train collision,
                07=hwy-rail crossing, 08=RR Grade Crossing, 09=obstruction,
                10=explosive detonation, 11=fire / violent rupture, 12=other impacts,
                13=other (described in narrative)
14  VISIBLTY    Daylight period:                                                       Categorical variable
                1=dawn (base case), 2=day, 3=dusk, 4=dark
15  WEATHER     Weather conditions:                                                    Categorical variable
                1=clear (base case), 2=cloudy, 3=rain, 4=fog, 5=sleet, 6=snow
16  TRNDIR      Train direction:                                                       Categorical variable
                1=north (base case), 2=south, 3=east, 4=west
17  REGION      FRA designated region (1 = base case)                                  Categorical variable
18  TYPEQ       Type of consist:                                                       Categorical variable
                1=freight train (base case), 2=passenger train, 3=commuter train,
                4=work train, 5=single car, 6=cut of cars, 7=yard / switching,
                8=light loco(s), 9=maint / inspect car, A= spec. MoW q
19  TYPTRK      Type of track:                                                         Categorical variable
                1=main (base case), 2=yard, 3=siding, 4=industry
20  TRKCLAS     FRA track class: 1-9, X (1 = base case)                                Categorical variable
21  RCL         Remote control locomotive = 0, 1, 2, or 3:                             Categorical variable
                0=not a remotely controlled operation (base case),
                1=remote control portable transmitter, 2=remote control tower operation,
                3=remote control portable transmitter (more than one remote control)
22  CAUSE       Primary cause of incident:                                             Categorical variable
                E=Mechanical and Electrical Failures (base case), H=Human Factors,
                M=Miscellaneous Causes, S=Signal and Communication,
                T=Track, Roadbed and Structures
23  SEASON      Season of the accident:                                                Categorical variable
                1=spring (Mar-May) (base case), 2=summer (Jun-Aug), 3=autumn (Sep-Nov),
                4=winter (Dec-Feb)

Appendix B

Table B1. P-value of the overall F-statistic in simple regression model

Predictor          Fatalities   Total Damage
CARS               0.96         0.89
CARSDMG            0.40         0.00
CARSHZD            0.86         0.00
EVACUATE           0.00         0.00
TEMP               0.27         0.74
TRNSPD             0.01         0.00
TONS               0.13         0.00
HEADEND1           0.10         0.81
Latitude           0.82         0.00
Longitud           0.71         0.00
factor(TYPE)       0.00         0.00
factor(VISIBLTY)   0.99         0.24
factor(WEATHER)    0.34         0.69
factor(TRNDIR)     0.83         0.00
factor(REGION)     0.28         0.00
factor(TYPEQ)      0.05         0.00
factor(TYPTRK)     0.00         0.00
factor(TRKCLAS)    0.17         0.00
factor(RCL)        NA           0.00
factor(CAUSE)      0.01         0.05
factor(SEASON)     0.05         0.46

*NA: cannot be estimated since only one level is available under the RCL variable for the fatalities model

Table B2. Model comparison for fatalities

Response variable: TOTKLD

Predictor variables
  Full Model:      EVACUATE, TRNSPD, TONS, HEADEND1, TYPE, TYPEQ, TYPTRK, TRKCLAS, CAUSE, SEASON
  Reduced Model:   EVACUATE, TYPEQ, SEASON
  Stepwise Model:  EVACUATE, TRNSPD, CAUSE, SEASON

               Full Model   Reduced Model   Stepwise Model
R²             33.22%       29.6%           26.66%
adjusted R²    26.43%       26.79%          25.32%
AIC            867.419      846.043         846.044
BIC            1018.23      913.51          881.76

Overall significance
  Full Model:      F-statistic: 4.892 on 36 and 354 DF, p-value: 8.785e-16
  Reduced Model:   F-statistic: 10.51 on 15 and 375 DF, p-value: < 2.2e-16
  Stepwise Model:  F-statistic: 19.89 on 7 and 383 DF, p-value: < 2.2e-16

Partial F-test, Full vs. Reduced Model: F: 0.913, p-value: 0.5743

Table B3. Model comparison for total damage

Response variable: ACCDMG

Predictor variables
  Full Model:      CARSDMG, CARSHZD, EVACUATE, TRNSPD, TONS, Latitude, Longitud, TYPE, VISIBLTY,
                   TRNDIR, REGION, TYPEQ, TYPTRK, TRKCLAS, RCL, CAUSE
  Reduced Model:   CARSHZD, EVACUATE, TRNSPD, TONS, TYPE, TRNDIR, REGION, TYPTRK, CAUSE
  Stepwise Model:  CARSDMG, CARSHZD, EVACUATE, TRNSPD, TONS, Longitud, TYPE, TRNDIR, REGION,
                   TYPEQ, TYPTRK, CAUSE

               Full Model   Reduced Model   Stepwise Model
R²             27.2%        25.14%          26.67%
adjusted R²    25.61%       24.29%          25.49%
AIC            87725.23     87747.80        87714.55
BIC            88114.63     87957.48        88008.11

Overall significance
  Full Model:      F-statistic: 17.14 on 63 and 2890 DF, p-value: < 2.2e-16
  Reduced Model:   F-statistic: 29.71 on 33 and 2920 DF, p-value: < 2.2e-16
  Stepwise Model:  F-statistic: 22.49 on 47 and 2906 DF, p-value: < 2.2e-16

Partial F-test, Full vs. Reduced Model: F-test: 1.74, p-value: 0.08

Figure B1. Boxplot for fatalities and number of people evacuated for each year to identify potential
outliers
[Panel: "Number of people evacuated in each year"; x-axis: Year (2001 to 2012); y-axis: Number of People Evacuated (0 to 50000).]

Figure B2. Diagnostic plot for the selected fatalities model before transformation
[Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Labeled observations include 1924, 27762, 38066, 18969, and 41250.]

Figure B3. Diagnostic plot for fatalities model after transformation with Box-Cox method (lambda=-2)
[Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Labeled observations include 1924, 27762, 38066, 18969, and 4535.]

Figure B4. Diagnostic plot for the selected total damage model before transformation
[Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Labeled observations include 18324, 41076, and 20237.]

Figure B5. Diagnostic plot for the selected total damage model after transformation with Box-Cox
method (lambda=-0.5)
[Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Labeled observations include 18324, 38066, 41076, and 36985.]

Figure B6. Interaction plot of train speed and cause with damage cost of accident
[x-axis: TRNSPD; y-axis: ACCDMG; one line per CAUSE level (E, H, M, S, T).]