THE TITANIC SHIPWRECK: WHO WAS
MOST LIKELY TO SURVIVE?

A STATISTICAL ANALYSIS
This paper examines the probability of surviving the Titanic shipwreck using limited
dependent variable regression analysis. This applied analysis will explain the impact
of sex, age, and passenger class on the likelihood of survival. The probability of
survival is examined using a linear probability model, a probit model, and a logit
model.
1. INTRODUCTION
On April 14, 1912, the unthinkable happened when the “unsinkable” RMS Titanic
crashed into an iceberg and sank into the Atlantic Ocean. The 20 lifeboats aboard the
ship, a number actually larger than that required by the British Board of Trade at the time,
were not enough to save a majority of the passengers, leaving over 1500 passengers
aboard the sinking vessel. A total of 705 passengers escaped onto lifeboats and to safety.
But not all passengers had an equal chance of getting onto a lifeboat and surviving the
disaster. It is the purpose of this paper to explain, using regression analysis, the impact of
sex, passenger class, and age on a person’s likelihood of surviving the shipwreck.
2. HYPOTHESES
The first hypothesis is that upper class women have the best chance of survival,
followed by middle class women and then lower class women. Second, men’s survival
rate will also decrease as class lowers. Finally, it is predicted that age will be negatively
related to the probability of survival.
3. THE DATA SET
The data used in this paper consist of 1046 observations of individual passengers
aboard the Titanic. It is important to note that not all passengers aboard the ship are
accounted for in this analysis because some characteristics of these passengers were
missing. The information provided for each passenger includes age, sex, passenger class
(1st, 2nd, or 3rd), and whether the passenger survived or died in the shipwreck. In this
sample, 41 percent of the passengers survived, and 59 percent of the passengers died. The
sample mean of age is 30 years old. Sex is a dummy variable for male passengers on the
ship. In this sample, 63 percent of the passengers were male, and 37 percent of the
passengers were female. A dummy variable was also created for 1st class passengers and
2nd class passengers. For instance, the dummy variable for first class passengers (pclass1)
is equal to one if a passenger was in the upper class and equal to zero if they were in any
other class (middle-class or lower-class). In this sample, 27 percent of the passengers
were upper class, 25 percent were middle class, and 48 percent were lower class.
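The dummy-variable coding described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `encode_passenger` and the example passengers are hypothetical and are not taken from the data set.

```python
def encode_passenger(sex, pclass):
    """Return the dummy variables described above for one passenger.

    sex is "male" or "female"; pclass is 1, 2, or 3.
    Third class is the omitted (reference) category.
    """
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    if pclass not in (1, 2, 3):
        raise ValueError("pclass must be 1, 2, or 3")
    return {
        "SEX": 1 if sex == "male" else 0,       # dummy for male passengers
        "PCLASS1": 1 if pclass == 1 else 0,     # dummy for 1st class
        "PCLASS2": 1 if pclass == 2 else 0,     # dummy for 2nd class
    }


# Hypothetical passengers, for illustration only (not rows from the data set).
for sex, pclass in [("female", 1), ("male", 2), ("male", 3)]:
    print(sex, pclass, encode_passenger(sex, pclass))
```

Because third class has no dummy of its own, a third-class female is the baseline group whose survival probability is captured by the constant term.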
4. THE MODEL
To test the probability of survival a binary choice model is used. The binary
choice model may be characterized as a classical regression model subject to qualitative
observation of the dependent variable. In this case, the dependent variable will be
whether or not the passenger survived the shipwreck. Specifically, assume that
$$Y_i = X_i\beta - \sigma\varepsilon_i$$
satisfies all classical assumptions except that $Y_i$ is not observed. Instead, a binary
variable, $J_i$, is observed such that
$$J_i = \begin{cases} 1 & \text{if } Y_i > 0 \\ 0 & \text{if } Y_i \le 0 \end{cases}$$
The model specifies the relationship between the latent variable $Y_i$ and the regressors. In
order to estimate anything, the relationship between the observed variable $J_i$ and the
regressors is needed. The relationship follows from the definition of $J_i$. That is,
$$J_i = 1 \iff Y_i > 0 \iff X_i\beta - \sigma\varepsilon_i > 0 \iff \varepsilon_i < \frac{X_i\beta}{\sigma}$$
which implies that
$$P(J_i = 1) = P\left(\varepsilon_i < \frac{X_i\beta}{\sigma}\right) = F\left(\frac{X_i\beta}{\sigma}\right)$$
where $F(\cdot)$ is the distribution function of the $\varepsilon_i$. Since the space of $J_i$ is $\{0, 1\}$, it follows
that
$$P(J_i = 0) = 1 - F\left(\frac{X_i\beta}{\sigma}\right)$$
Since $J_i$ is discrete, the density function for $J_i$ is
$$f(J_i) = F\left(\frac{X_i\beta}{\sigma}\right)^{J_i}\left[1 - F\left(\frac{X_i\beta}{\sigma}\right)\right]^{1 - J_i}$$
Then, letting $\delta = \sigma^{-1}\beta$, it follows that the log-likelihood function is
$$\ln L(\delta) = \sum_{i=1}^{n} \left\{ J_i \ln F(X_i\delta) + (1 - J_i) \ln\left[1 - F(X_i\delta)\right] \right\}$$
Clearly, the binary choice model encompasses a wide class of models indexed by the
choice of the distribution function, $F(X_i\delta)$. In order to examine the probability of
survival, the linear probability model, the probit model, and the logit model will be used.
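To make the role of the distribution function concrete, the following Python sketch evaluates the log-likelihood for a given parameter vector under two choices of F, the standard normal (probit) and the logistic (logit). It is an illustration under stated assumptions rather than the estimation routine used in this paper: the function names, the toy data `X` and `J`, and the trial vector `delta` are all hypothetical.

```python
import math


def probit_cdf(z):
    # Standard normal distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logit_cdf(z):
    # Logistic distribution function.
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(delta, X, J, F):
    """Sum over i of J_i ln F(X_i delta) + (1 - J_i) ln[1 - F(X_i delta)]."""
    total = 0.0
    for x_i, j_i in zip(X, J):
        index = sum(d * x for d, x in zip(delta, x_i))  # X_i delta
        p = F(index)
        # For J_i in {0, 1} only one of the two terms is non-zero.
        total += math.log(p) if j_i == 1 else math.log(1.0 - p)
    return total


# Hypothetical toy data: a constant and one regressor, four observations.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
J = [1, 1, 0, 0]
delta = (1.0, -0.8)  # arbitrary trial value, not an estimate

print("probit:", log_likelihood(delta, X, J, probit_cdf))
print("logit: ", log_likelihood(delta, X, J, logit_cdf))
```

Maximum likelihood estimation then amounts to searching for the `delta` that makes this sum as large as possible for the chosen F.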
4.1 LINEAR PROBABILITY MODEL
The simplest binary choice model is the linear probability model where the error
term is assumed to be independently and identically distributed uniformly on the interval
[0, 1]. That is, $\varepsilon_i \sim \text{iid } U(0,1)$. Therefore, $F(X_i\delta) = X_i\delta$ for $0 < X_i\delta < 1$. So, the linear
probability model is a linear regression model with heteroskedastic errors. Specifically,
$$\mathrm{SURV}_i = \alpha + \beta_{age}\,\mathrm{AGE} + \beta_{sex}\,\mathrm{SEX} + \beta_{pclass1}\,\mathrm{PCLASS1} + \beta_{pclass2}\,\mathrm{PCLASS2} + \mu_i$$
Therefore, this linear probability model will be estimated using OLS, but it is important
to note that the resulting estimates will be unbiased but inefficient. This is one major
drawback of the linear probability model. Table 1 summarizes the OLS estimation.
Table 1: Linear Probability Model Estimates using OLS
Regressor Coefficient Std. Error t-stat Prob > t
Con 0.73457 0.03233 22.71940 0.00000
AGE -0.00527 0.00093 -5.65627 0.00000
SEX -0.49141 0.02555 -19.23186 0.00000
PCLASS1 0.37039 0.03250 11.39518 0.00000
PCLASS2 0.15901 0.03033 5.24332 0.00000
One can obtain the following conditional probabilities of surviving from the
model:
Probability of survival given female & 3rd class:
$$P(S \mid F, 3) = \hat{\alpha} + \hat{\beta}_{AGE}\,\mathrm{Age} = 0.73457 - 0.00527\,\mathrm{Age}$$
Probability of survival given female & 2nd class:
$$P(S \mid F, 2) = (\hat{\alpha} + \hat{\beta}_{pclass2}) + \hat{\beta}_{AGE}\,\mathrm{Age} = 0.89358 - 0.00527\,\mathrm{Age}$$
Probability of survival given female & 1st class:
$$P(S \mid F, 1) = (\hat{\alpha} + \hat{\beta}_{pclass1}) + \hat{\beta}_{AGE}\,\mathrm{Age} = 1.10496 - 0.00527\,\mathrm{Age}$$
Probability of survival given male & 3rd class:
$$P(S \mid M, 3) = (\hat{\alpha} + \hat{\beta}_{sex}) + \hat{\beta}_{AGE}\,\mathrm{Age} = 0.24316 - 0.00527\,\mathrm{Age}$$
Probability of survival given male & 2nd class:
$$P(S \mid M, 2) = (\hat{\alpha} + \hat{\beta}_{sex} + \hat{\beta}_{pclass2}) + \hat{\beta}_{AGE}\,\mathrm{Age} = 0.40217 - 0.00527\,\mathrm{Age}$$
Probability of survival given male & 1st class:
$$P(S \mid M, 1) = (\hat{\alpha} + \hat{\beta}_{sex} + \hat{\beta}_{pclass1}) + \hat{\beta}_{AGE}\,\mathrm{Age} = 0.61355 - 0.00527\,\mathrm{Age}$$
Therefore, it is clear that these regressions differ only in their intercepts and not in the
slope coefficient for Age, $\hat{\beta}_{AGE}$. This can be seen in the following graph.
Graph 1: Probability of Survival
[Figure: probability of survival (%) plotted against age (years, 0 to 85), with one line for each group: F,1; F,2; F,3; M,1; M,2; M,3. The vertical axis runs from -30% to 120%.]
By looking at the graph, another major drawback of the linear probability model is clear.
The probabilities do not lie within the unit interval. That is why there are probabilities
greater than 100% and less than 0%. This is a fundamental limitation.
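The out-of-range fitted values can be reproduced directly from the Table 1 estimates. The short Python sketch below does so; the function name `lpm_probability` and the particular ages shown are illustrative choices made here, not part of the original analysis.

```python
# Table 1 OLS estimates of the linear probability model.
ALPHA = 0.73457
B_AGE = -0.00527
B_SEX = -0.49141
B_PCLASS1 = 0.37039
B_PCLASS2 = 0.15901


def lpm_probability(age, male, pclass):
    """Fitted survival probability from the linear probability model."""
    p = ALPHA + B_AGE * age + B_SEX * male
    if pclass == 1:
        p += B_PCLASS1
    elif pclass == 2:
        p += B_PCLASS2
    return p


# Illustrative ages chosen for this sketch, not taken from the paper.
cases = [("F,1", 0, 1, 10), ("F,1", 0, 1, 30), ("M,3", 1, 3, 30), ("M,3", 1, 3, 60)]
for label, male, pclass, age in cases:
    p = lpm_probability(age, male, pclass)
    flag = "" if 0.0 <= p <= 1.0 else "  <- outside [0, 1]"
    print(f"{label} age {age}: {p:.5f}{flag}")
```

Under these estimates, first class females younger than about 20 receive fitted values above 1, and third class males older than about 46 receive negative ones.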
From the graph, it is also easy to understand the interpretations of the coefficients.
Ceteris paribus, $\hat{\beta}_{AGE}$ is the effect on the probability of survival of a one-year
increase in age. Therefore, a one-year increase in age decreases the probability of
survival by about 0.5 percentage points. $\hat{\beta}_{sex}$ is interpreted as the difference in the
probability of survival between a male and a female, holding age and passenger class
constant. $\hat{\beta}_{pclass1}$ is interpreted as the difference in the probability of survival between
a first class passenger and a third class passenger, holding sex and age constant.
$\hat{\beta}_{pclass2}$ is interpreted as the difference in the probability of survival between a second
class passenger and a third class passenger, holding sex and age constant.
In order to test the significance of each regressor, the following hypothesis test
was conducted:
$H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $i$ = AGE, SEX, PCLASS1, PCLASS2
A two-tailed test is used; at α = 0.05 (the 5% level), the critical values are ±1.96.
Because every t-statistic in Table 1 exceeds 1.96 in absolute value, we reject the null
hypothesis for each regressor and conclude that each regression coefficient is
individually statistically significant, that is, different from zero.
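As a quick check of this test, the t-statistics can be recomputed as coefficient divided by standard error, using the rounded values printed in Table 1. The Python sketch below is illustrative only; the names `TABLE_1` and `t_statistic` are introduced here, and because the inputs are rounded the recomputed statistics differ slightly from the t-stat column of the table.

```python
# (coefficient, standard error) pairs copied from Table 1.
TABLE_1 = {
    "AGE": (-0.00527, 0.00093),
    "SEX": (-0.49141, 0.02555),
    "PCLASS1": (0.37039, 0.03250),
    "PCLASS2": (0.15901, 0.03033),
}

CRITICAL_VALUE = 1.96  # two-tailed test at the 5% level


def t_statistic(coefficient, std_error):
    """t-statistic for H0: beta = 0."""
    return coefficient / std_error


for name, (coef, se) in TABLE_1.items():
    t = t_statistic(coef, se)
    verdict = "reject H0" if abs(t) > CRITICAL_VALUE else "fail to reject H0"
    print(f"{name:8s} t = {t:9.3f}  {verdict}")
```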
4.2 PROBIT MODEL
The probit model is the specification of the binary choice model that results when
the error term is independent and identically distributed normally with mean 0 and
variance 1. That is, $\varepsilon_i \sim \text{iid } N(0,1)$. In this case, $F(X_i\delta) = \Phi(X_i\delta)$, where $\Phi(\cdot)$ is the
standard normal distribution function. Specifically,
$$P(\mathrm{SURV}_i = 1) = \Phi(\alpha + \beta_{age}\,\mathrm{AGE} + \beta_{sex}\,\mathrm{SEX} + \beta_{pclass1}\,\mathrm{PCLASS1} + \beta_{pclass2}\,\mathrm{PCLASS2})$$
A probit model was estimated and the results are summarized in Table 2.
Table 2: Probit Model Estimation
Regressor Coefficient Std. Error t-stat Prob > t
Con 0.75194 0.11389 6.60261 0.00000
AGE -0.01943 0.00350 -5.54898 0.00000
SEX -1.48564 0.09746 -15.24314 0.00000
PCLASS1 1.30316 0.12500 10.42550 0.00000
PCLASS2 0.54299 0.12526 4.33495 0.00002
In order to test the significance of each regressor, the following hypothesis test
was conducted:
$H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $i$ = AGE, SEX, PCLASS1, PCLASS2
It is also important to test whether the regressors are jointly
significant. The null hypothesis is as follows:
$$H_0: \beta_{AGE} = \beta_{sex} = \beta_{pclass1} = \beta_{pclass2} = 0$$
That is, all of the explanatory variables explain zero percent of the variation in the
dependent variable, SURV. Therefore, a likelihood ratio test can be performed as
follows:
$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{K-1}$$
The value of the unconstrained log-likelihood function is -492.2764. The value of the
constrained log-likelihood function is -707.3102. Therefore, the value of the test statistic
is 430.0676. The critical value for $\chi^2_4$ at α = 0.01, the 1% level of significance, is 13.28.
Therefore, the regressors are jointly significant at the 1% level.
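The arithmetic of the likelihood ratio test can be checked with the Python sketch below. It uses only the two log-likelihood values reported above. The helper names are introduced here for illustration, and the chi-square tail probability relies on the closed form available for four degrees of freedom, so the sketch does not generalize to other degrees of freedom as written.

```python
import math


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with 4 degrees of freedom.

    For 4 degrees of freedom the closed form is exp(-x/2) * (1 + x/2).
    """
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_critical_df4(alpha, lo=0.0, hi=200.0):
    """Critical value c with P(chi2_4 > c) = alpha, found by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_df4(mid) > alpha:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


def lr_statistic(loglik_constrained, loglik_unconstrained):
    """Likelihood ratio statistic: -2 [ln L(constrained) - ln L(unconstrained)]."""
    return -2.0 * (loglik_constrained - loglik_unconstrained)


lr_probit = lr_statistic(-707.3102, -492.2764)
print("LR statistic:       ", round(lr_probit, 4))
print("1% critical value:  ", round(chi2_critical_df4(0.01), 2))
print("0.1% critical value:", round(chi2_critical_df4(0.001), 2))
```

Computed this way, the 1% critical value for four degrees of freedom is about 13.28 and the 0.1% critical value is about 18.47; the test statistic far exceeds both, so the conclusion of joint significance does not depend on which is used.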
The marginal effects were also estimated for the probit model and are summarized
in Table 3 as follows:
Table 3: Probit Model Marginal Effects
Regressor Marginal Effects Std. Error t-stat Prob > t
AGE -0.00746 0.00134 -5.57849 0.00000
SEX -0.57088 0.03878 -14.72199 0.00000
PCLASS1 0.50076 0.04913 10.19181 0.00000
PCLASS2 0.20865 0.04876 4.27881 0.00002
Unlike in the linear probability model, the marginal effect of a one-unit increase in an
independent variable is not the estimated coefficient in the probit model. Theoretically,
the marginal effect of $X_i'$ for the probit model is defined as follows:
$$\frac{\partial \Phi(X_i\delta)}{\partial X_i'} = \phi(X_i\delta)\cdot\delta$$
where $\phi(\cdot)$ is the standard normal density function.
Therefore, the marginal effect of one regressor depends on the entire vector of regressors.
As can be seen from Table 3 above, the marginal effect of each regressor is statistically
significant at the 1% level.
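The probit marginal-effect formula can be evaluated numerically from the Table 2 coefficients. The paper does not state the point at which the marginal effects in Table 3 were evaluated; the Python sketch below assumes they were evaluated at the sample means, using the rounded means reported in Section 3 (age 30, 63 percent male, 27 percent first class, 25 percent second class). Under that assumption the results come out close to Table 3, though not identical, since the means are rounded. The names `PROBIT`, `MEANS`, and `probit_marginal_effects` are introduced here.

```python
import math

# Probit coefficients copied from Table 2.
PROBIT = {"Con": 0.75194, "AGE": -0.01943, "SEX": -1.48564,
          "PCLASS1": 1.30316, "PCLASS2": 0.54299}

# Assumed evaluation point: rounded sample means from Section 3.
MEANS = {"Con": 1.0, "AGE": 30.0, "SEX": 0.63, "PCLASS1": 0.27, "PCLASS2": 0.25}


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def probit_marginal_effects(coefs, point):
    """phi(X delta) * delta for every regressor except the constant."""
    index = sum(coefs[k] * point[k] for k in coefs)  # X delta at the chosen point
    scale = normal_pdf(index)
    return {k: scale * b for k, b in coefs.items() if k != "Con"}


effects = probit_marginal_effects(PROBIT, MEANS)
for name, value in effects.items():
    print(f"{name:8s} {value: .5f}")
```

Because the scale factor depends on the whole index, the same coefficient implies a different marginal effect at a different evaluation point. For the dummy regressors the derivative is only an approximation to the discrete change from 0 to 1.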
4.3 LOGIT MODEL
The logit model is the binary choice model that results when the error term is
independent with a logistic distribution. In this case,
$$F(X_i\delta) = \Psi(X_i\delta) = \frac{e^{X_i\delta}}{1 + e^{X_i\delta}}.$$
Specifically,
$$P(\mathrm{SURV}_i = 1) = \Psi(\alpha + \beta_{age}\,\mathrm{AGE} + \beta_{sex}\,\mathrm{SEX} + \beta_{pclass1}\,\mathrm{PCLASS1} + \beta_{pclass2}\,\mathrm{PCLASS2})$$
A logit model was estimated and the results are summarized in Table 4.
Table 4: Logit Model Estimation
Regressor Coefficient Std. Error t-stat Prob > t
Con 1.23239 0.19999 6.16221 0.00000
AGE -0.03439 0.00622 -5.53179 0.00000
SEX -2.49783 0.17194 -14.52697 0.00000
PCLASS1 2.28961 0.22526 10.16423 0.00000
PCLASS2 1.00935 0.21781 4.63406 0.00000
In order to test the significance of each regressor, the following hypothesis test was
conducted:
$H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $i$ = AGE, SEX, PCLASS1, PCLASS2
It is also important to test whether the regressors are jointly
significant. The null hypothesis is as follows:
$$H_0: \beta_{AGE} = \beta_{sex} = \beta_{pclass1} = \beta_{pclass2} = 0$$
That is, all of the explanatory variables explain zero percent of the variation in the
dependent variable, SURV. Therefore, a likelihood ratio test can be performed as
follows:
$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{K-1}$$
The value of the unconstrained log-likelihood function is -491.2266. The value of the
constrained log-likelihood function is -707.3102. Therefore, the value of the test statistic
is 432.1672. The critical value for $\chi^2_4$ at α = 0.01, the 1% level of significance, is 13.28.
Therefore, the regressors are jointly significant at the 1% level.
The marginal effects were also estimated for the logit model and are summarized
in Table 5 below. Again, unlike in the linear probability model, the marginal effect of a
one-unit increase in an independent variable is not the estimated coefficient in the logit
model. Theoretically, the marginal effect of $X_i'$ for the logit model is defined as follows:
$$\frac{\partial \Psi(X_i\delta)}{\partial X_i'} = \Psi(X_i\delta)\cdot\left(1 - \Psi(X_i\delta)\right)\cdot\delta$$
Therefore, the marginal effect of one regressor depends on the entire vector of regressors.
As can be seen from Table 5, the marginal effect of each regressor is statistically
significant at the 1% level.
Table 5: Logit Model Marginal Effects
Regressor Marginal Effects Std. Error t-stat Prob > t
AGE -0.00810 0.00146 -5.54556 0.00000
SEX -0.58799 0.04774 -12.31609 0.00000
PCLASS1 0.53897 0.05827 9.24915 0.00000
PCLASS2 0.23760 0.05384 4.41332 0.00001
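The logit marginal-effect formula can be evaluated in the same way from the Table 4 coefficients. As with the probit model, the evaluation point is not stated in the paper; the Python sketch below assumes the rounded sample means from Section 3, which yields values close to, but not exactly matching, Table 5. The names `LOGIT`, `MEANS`, and `logit_marginal_effects` are introduced here for illustration.

```python
import math

# Logit coefficients copied from Table 4.
LOGIT = {"Con": 1.23239, "AGE": -0.03439, "SEX": -2.49783,
         "PCLASS1": 2.28961, "PCLASS2": 1.00935}

# Assumed evaluation point: rounded sample means from Section 3.
MEANS = {"Con": 1.0, "AGE": 30.0, "SEX": 0.63, "PCLASS1": 0.27, "PCLASS2": 0.25}


def logistic_cdf(z):
    """Logistic distribution function, e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_marginal_effects(coefs, point):
    """Psi * (1 - Psi) * delta for every regressor except the constant."""
    index = sum(coefs[k] * point[k] for k in coefs)  # X delta at the chosen point
    psi = logistic_cdf(index)
    scale = psi * (1.0 - psi)
    return {k: scale * b for k, b in coefs.items() if k != "Con"}


effects = logit_marginal_effects(LOGIT, MEANS)
for name, value in effects.items():
    print(f"{name:8s} {value: .5f}")
```

The scale factor Psi(1 - Psi) is largest, at 0.25, when the fitted probability is one half, so the marginal effects shrink toward zero for passengers whose survival is predicted to be either very likely or very unlikely.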
5. CONCLUSION
This paper explained, using regression analysis, the impact of age, sex, and
passenger class on a person’s likelihood of surviving the Titanic shipwreck. Three binary
choice models were used to regress both quantitative and qualitative explanatory
variables on the qualitative dependent variable of survival as follows:
$$\mathrm{SURV}_i = f(\alpha + \beta_{age}\,\mathrm{AGE} + \beta_{sex}\,\mathrm{SEX} + \beta_{pclass1}\,\mathrm{PCLASS1} + \beta_{pclass2}\,\mathrm{PCLASS2})$$
All three models (linear probability, probit, and logit) lead to the same conclusions. Upper
class women did indeed have the highest probability of surviving, followed by middle
class women and then lower class women. Among men, upper class men had the greatest
probability of survival, which was not much below that of lower class women.
Additionally, the coefficient of age was negative, but was a very small value.
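The agreement among the three models can be illustrated by computing the predicted survival probability of each sex and class group from the coefficients in Tables 1, 2, and 4. The Python sketch below does this at age 30, the sample mean reported in Section 3; the choice of age and the names `COEFS`, `LINKS`, `GROUPS`, `survival_probability`, and `ranking` are assumptions made for this illustration.

```python
import math

# Coefficients copied from Tables 1, 2, and 4, ordered as
# (constant, AGE, SEX, PCLASS1, PCLASS2).
COEFS = {
    "linear": (0.73457, -0.00527, -0.49141, 0.37039, 0.15901),
    "probit": (0.75194, -0.01943, -1.48564, 1.30316, 0.54299),
    "logit": (1.23239, -0.03439, -2.49783, 2.28961, 1.00935),
}

# Distribution function F for each model.
LINKS = {
    "linear": lambda z: z,
    "probit": lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))),
    "logit": lambda z: 1.0 / (1.0 + math.exp(-z)),
}

# Group label -> (SEX, PCLASS1, PCLASS2); SEX = 1 for males.
GROUPS = {
    "F,1": (0, 1, 0), "F,2": (0, 0, 1), "F,3": (0, 0, 0),
    "M,1": (1, 1, 0), "M,2": (1, 0, 1), "M,3": (1, 0, 0),
}


def survival_probability(model, group, age=30.0):
    """Predicted survival probability for a group at a given age."""
    const, b_age, b_sex, b_p1, b_p2 = COEFS[model]
    sex, p1, p2 = GROUPS[group]
    index = const + b_age * age + b_sex * sex + b_p1 * p1 + b_p2 * p2
    return LINKS[model](index)


def ranking(model, age=30.0):
    """Groups ordered from highest to lowest predicted survival probability."""
    return sorted(GROUPS, key=lambda g: survival_probability(model, g, age),
                  reverse=True)


for model in COEFS:
    probs = ", ".join(f"{g}={survival_probability(model, g):.3f}" for g in ranking(model))
    print(f"{model:7s} {probs}")
```

Since age enters each index with the same coefficient for every group and each F is increasing, the ordering of the groups within a model does not depend on the age chosen.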
This analysis could be extended in the following ways. Another possible
explanatory variable is whether a passenger got on a lifeboat. This seems to
be a significant determinant of survival. In addition, one could control for the cabin of the
passenger. It would also be interesting to investigate why upper class men had a lower
probability of survival than lower class women. It might also be interesting to have a
separate category for children. However, from the regression that was performed, a
person was lucky to survive and had the best chance of survival if they were a young,
upper class woman.