This paper examines the probability of surviving the Titanic shipwreck using limited
dependent variable regression analysis. This applied analysis will explain the impact
of sex, age, and passenger class on the likelihood of survival. The probability of
survival is examined using a linear probability model, a probit model, and a logit
model.
1. INTRODUCTION
On April 14, 1912, the unthinkable happened when the “unsinkable” RMS Titanic
struck an iceberg and sank into the Atlantic Ocean. The 20 lifeboats aboard the
ship, a number actually larger than that required by the British Board of Trade at the time,
were not enough to save a majority of the passengers, leaving over 1500 passengers
aboard the sinking vessel. A total of 705 passengers escaped onto lifeboats and to safety.
But not all passengers had an equal chance of getting onto a lifeboat and surviving the
disaster. It is the purpose of this paper to explain, using regression analysis, the impact of
sex, passenger class, and age on a person’s likelihood of surviving the shipwreck.
2. HYPOTHESES
The first hypothesis is that upper class women have the best chance of survival,
followed by middle class women and then lower class women. Second, men’s survival
rate will also decrease as class lowers. Finally, it is predicted that age will be negatively
related to the probability of survival.
3. DATA
The data used in this paper consist of 1046 observations of single passengers
aboard the Titanic. It is important to note that not all passengers aboard the ship are
accounted for in this analysis because some characteristics of these passengers were
missing. The information provided for each passenger includes age, sex, passenger class
(1st, 2nd, or 3rd), and whether or not the passenger survived the shipwreck. In this
sample, 41 percent of the passengers survived, and 59 percent of the passengers died. The
sample mean of age is 30 years old. Sex is a dummy variable for male passengers on the
ship. In this sample, 63 percent of the passengers were male, and 37 percent of the
passengers were female. Dummy variables were also created for 1st class passengers and
2nd class passengers. For instance, the dummy variable for first class passengers (pclass1)
is equal to one if a passenger was in the upper class and equal to zero if they were in any
were upper class, and 25 percent of the passengers were middle class and 48 percent of
4. THE MODEL
To estimate the probability of survival, a binary choice model is used. The binary
observation of the dependent variable. In this case, the dependent variable will be
whether or not the passenger survived the shipwreck. Specifically, assume that
$$Y_i = X_i\beta - \sigma\varepsilon_i$$
satisfies all classical assumptions except that $Y_i$ is not observed. Instead, a binary
variable $J_i$ is observed, where
$$J_i = 1 \text{ if } Y_i > 0$$
$$J_i = 0 \text{ if } Y_i \le 0$$
The model specifies the relationship between the latent variable $Y_i$ and the regressors. In
order to estimate anything, the relationship between the observed variable $J_i$ and the
regressors is needed. The relationship follows from the definition of $J_i$. That is,
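$$J_i = 1 \iff Y_i > 0 \iff X_i\beta - \sigma\varepsilon_i > 0 \iff \varepsilon_i < \frac{X_i\beta}{\sigma}$$
$$P(J_i = 1) = P\left(\varepsilon_i < \frac{X_i\beta}{\sigma}\right) = F\left(\frac{X_i\beta}{\sigma}\right)$$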
where $F(\cdot)$ is the distribution function of the $\varepsilon_i$. Since the space of $J_i$ is $\{0, 1\}$, it follows
that
$$P(J_i = 0) = 1 - F\left(\frac{X_i\beta}{\sigma}\right)$$
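Since $J_i$ is discrete, the density function for $J_i$ is
$$f(J_i) = F\left(\frac{X_i\beta}{\sigma}\right)^{J_i}\left[1 - F\left(\frac{X_i\beta}{\sigma}\right)\right]^{1 - J_i}$$
Writing $\delta = \beta/\sigma$, the log-likelihood function for a sample of $n$ independent observations is
$$\ln L(\delta) = \sum_{i=1}^{n}\left\{J_i \ln F(X_i\delta) + (1 - J_i)\ln\left[1 - F(X_i\delta)\right]\right\}$$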
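As a rough illustration of the log-likelihood above, the following Python sketch evaluates it for an arbitrary distribution function F. The function names, the toy data, and the choice of δ are assumptions made purely for illustration; they are neither the Titanic sample nor the estimation routine used in this paper.

```python
import math
from statistics import NormalDist


def log_likelihood(delta, X, J, F):
    """Sum over i of J_i*ln F(X_i d) + (1 - J_i)*ln[1 - F(X_i d)].

    Assumes 0 < F(X_i d) < 1 for every observation; otherwise math.log fails.
    """
    total = 0.0
    for x_i, j_i in zip(X, J):
        index = sum(d * x for d, x in zip(delta, x_i))  # the linear index X_i * delta
        p = F(index)                                    # P(J_i = 1)
        total += j_i * math.log(p) + (1 - j_i) * math.log(1.0 - p)
    return total


def logistic_cdf(z):
    """Logistic distribution function, as used by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))


probit_cdf = NormalDist().cdf  # standard normal distribution function

# Hypothetical toy data: each row is (constant, age). NOT the Titanic sample.
X = [(1.0, 22.0), (1.0, 38.0), (1.0, 4.0), (1.0, 54.0)]
J = [0, 1, 1, 0]
delta = (0.0, 0.0)  # with delta = 0, F(0) = 0.5 for both logit and probit

ll_logit = log_likelihood(delta, X, J, logistic_cdf)
ll_probit = log_likelihood(delta, X, J, probit_cdf)
```

With δ = 0 both specifications assign probability 0.5 to every observation, so the two log-likelihoods coincide at 4·ln(0.5). Maximum likelihood estimation would search over δ for the value that makes this function as large as possible.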
Clearly, the binary choice model encompasses a wide class of models indexed by the
survival the linear probability model, the probit model, and the logit model will be used.
The simplest binary choice model is the linear probability model where the error
term is independent and uniformly distributed on $[0, 1]$. That is, $\varepsilon_i \sim \text{iid } U(0,1)$. Therefore, $F(X_i\delta) = X_i\delta$ for $0 < X_i\delta < 1$. So, the linear
Therefore, this linear probability model will be estimated using OLS, but it is important
to note that the resulting estimates will be unbiased but inefficient. This is one major
drawback of the linear probability model. Table 1 summarizes the OLS estimation.
Table 1: Linear Probability Model Estimates using OLS
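One can obtain the following conditional probabilities of surviving from the
model:
Probability of survival given female & 3rd class:
$$P(S \mid F, 3) = \hat{\alpha} + \hat{\beta}_{AGE}\,Age = 0.73457 - 0.00527\,Age$$
Probability of survival given female & 2nd class:
$$P(S \mid F, 2) = (\hat{\alpha} + \hat{\beta}_{pclass2}) + \hat{\beta}_{AGE}\,Age = 0.89358 - 0.00527\,Age$$
Probability of survival given female & 1st class:
$$P(S \mid F, 1) = (\hat{\alpha} + \hat{\beta}_{pclass1}) + \hat{\beta}_{AGE}\,Age = 1.10496 - 0.00527\,Age$$
Probability of survival given male & 3rd class:
$$P(S \mid M, 3) = (\hat{\alpha} + \hat{\beta}_{sex}) + \hat{\beta}_{AGE}\,Age = 0.24316 - 0.00527\,Age$$
Probability of survival given male & 2nd class:
$$P(S \mid M, 2) = (\hat{\alpha} + \hat{\beta}_{sex} + \hat{\beta}_{pclass2}) + \hat{\beta}_{AGE}\,Age = 0.40217 - 0.00527\,Age$$
Probability of survival given male & 1st class:
$$P(S \mid M, 1) = (\hat{\alpha} + \hat{\beta}_{sex} + \hat{\beta}_{pclass1}) + \hat{\beta}_{AGE}\,Age = 0.61355 - 0.00527\,Age$$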
Therefore, it is clear that these regressions differ only in the intercepts and not in the
slope coefficient for Age, $\hat{\beta}_{AGE}$. This can be seen in the following graph.
[Figure: Probability of Survival (%) plotted against Age (years), with one line for each sex and class group: F,1; F,2; F,3; M,1; M,2; M,3.]
By looking at the graph, another major drawback of the linear probability model is clear.
The probabilities do not lie within the unit interval. That is why there are probabilities
greater than 100% and less than 0%. This is a fundamental limitation.
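This limitation can be checked directly from the six fitted lines reported above. The Python sketch below only re-evaluates those lines; the helper names and the age range of 0 to 85 (taken from the horizontal axis of the graph) are illustrative assumptions.

```python
# Intercepts of the six fitted lines reported in the text, keyed by (sex, class),
# and the common age slope.
INTERCEPTS = {
    ("F", 3): 0.73457, ("F", 2): 0.89358, ("F", 1): 1.10496,
    ("M", 3): 0.24316, ("M", 2): 0.40217, ("M", 1): 0.61355,
}
AGE_SLOPE = -0.00527


def lpm_probability(sex, pclass, age):
    """Fitted linear probability model value for a given group and age."""
    return INTERCEPTS[(sex, pclass)] + AGE_SLOPE * age


def groups_outside_unit_interval(max_age=85):
    """Return the groups whose fitted value leaves [0, 1] for some whole age."""
    offenders = set()
    for sex, pclass in INTERCEPTS:
        for age in range(max_age + 1):
            p = lpm_probability(sex, pclass, age)
            if p < 0.0 or p > 1.0:
                offenders.add((sex, pclass))
    return offenders


violations = groups_outside_unit_interval()
```

For example, a ten-year-old first class female receives a fitted value of 1.10496 - 0.0527 = 1.05226, and a sixty-year-old third class male receives 0.24316 - 0.3162 = -0.07304, neither of which is a valid probability.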
From the graph, it is also easy to understand the interpretations of the coefficients.
Ceteris paribus, $\hat{\beta}_{AGE}$ is the effect on the probability of survival due to a one-year
increase in age. Therefore, a one-year increase in age will decrease the probability of
survival by about 0.5 percentage points. $\hat{\beta}_{sex}$ is interpreted as the difference in the probability of survival
between a male and a female holding age and passenger class constant. $\hat{\beta}_{pclass1}$ is
interpreted as the difference in the probability of survival between a first class passenger
and a third class passenger holding sex and age constant. $\hat{\beta}_{pclass2}$ is interpreted as the
difference in the probability of survival between a second class passenger and a third
class passenger holding sex and age constant.
In order to test the significance of each regressor, the following hypothesis test
was conducted:
A two-tailed test is used; at the α = 0.05 (5%) significance level, the critical value is ±1.96.
Based on this test, we reject the null hypothesis and conclude that individually all
regression coefficients are statistically significant, that is, different from zero.
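The ±1.96 critical value and the decision rule can be sketched in Python as follows. The function names are illustrative, and the rule assumes the test statistic is approximately standard normal, which is reasonable for a sample of this size; any estimate and standard error passed to `rejects_null` would have to come from the regression output, which is not reproduced here.

```python
from statistics import NormalDist


def two_tailed_critical_value(alpha=0.05):
    """Critical value c such that P(|Z| > c) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def rejects_null(estimate, std_error, alpha=0.05):
    """True if H0: coefficient = 0 is rejected in a two-tailed test."""
    test_statistic = estimate / std_error
    return abs(test_statistic) > two_tailed_critical_value(alpha)


critical = two_tailed_critical_value(0.05)  # approximately 1.96
```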
The probit model is the specification of the binary choice model that results when
the error term is independent and identically distributed normally with mean 0 and
variance 1. That is, $\varepsilon_i \sim \text{iid } N(0,1)$. In this case, $F(X_i\delta) = \Phi(X_i\delta)$, where $\Phi(\cdot)$ is the
standard normal distribution function. Specifically,
$$P(SURV_i = 1) = \Phi(\alpha + \beta_{age}AGE + \beta_{sex}SEX + \beta_{pclass1}PCLASS1 + \beta_{pclass2}PCLASS2)$$
A probit model was estimated and the results are summarized in Table 2.
In order to test the significance of each regressor, the following hypothesis test
was conducted:
A two-tailed test is used; at the α = 0.05 (5%) significance level, the critical value is ±1.96.
Based on this test, we reject the null hypothesis and conclude that individually all
regression coefficients are statistically significant, that is, different from zero.
It is also important to test the null hypothesis that the regressors are jointly
insignificant:
$$H_0: \hat{\beta}_{AGE} = \hat{\beta}_{sex} = \hat{\beta}_{pclass1} = \hat{\beta}_{pclass2} = 0$$
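That is, all of the explanatory variables explain zero percent of the variation in the
dependent variable. The likelihood ratio test statistic is as follows:
$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{K-1}$$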
The value of the unconstrained log-likelihood function is -492.2764. The value of the
constrained log-likelihood function is -707.3102. Therefore, the value of the test statistic
is $-2[-707.3102 - (-492.2764)] = 430.0676$.
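The likelihood ratio statistic follows from the two log-likelihood values just reported. The Python sketch below carries out the arithmetic; it assumes K = 5 estimated parameters (the intercept plus four slope coefficients), so that the chi-square distribution has K - 1 = 4 degrees of freedom, and the function names are illustrative.

```python
import math


def lr_statistic(ll_constrained, ll_unconstrained):
    """Likelihood ratio statistic: -2 * [ln L(constrained) - ln L(unconstrained)]."""
    return -2.0 * (ll_constrained - ll_unconstrained)


def chi2_upper_tail_df4(x):
    """P(X > x) for a chi-square variable with 4 degrees of freedom.

    For 4 degrees of freedom the upper tail has the closed form
    exp(-x/2) * (1 + x/2).
    """
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


# Log-likelihood values reported in the text for the probit model.
lr_probit = lr_statistic(-707.3102, -492.2764)
p_value = chi2_upper_tail_df4(lr_probit)
```

The 5% critical value of a chi-square distribution with 4 degrees of freedom is about 9.49, so a statistic of this size lies far in the rejection region.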
The marginal effects were also estimated for the probit model and are summarized
in Table 3 as follows:
Unlike the linear probability model, the marginal effect of a one-unit increase in an
independent variable is not simply the estimated coefficient in the probit model. Theoretically,
the marginal effect of $X_i'$ for the probit model is defined as follows:
$$\frac{\partial \Phi(X_i\delta)}{\partial X_i'} = \phi(X_i\delta)\cdot\delta$$
where $\phi(\cdot)$ is the standard normal density function.
Therefore, the marginal effect of one regressor depends on the entire vector of regressors.
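A short Python sketch makes this dependence concrete. The coefficient values below are HYPOTHETICAL placeholders chosen for illustration, since the estimates of Table 2 are not reproduced here, and the function name is likewise an assumption.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1


def probit_marginal_effects(delta, x):
    """Return phi(x * delta) * delta, one marginal effect per coefficient."""
    index = sum(d * v for d, v in zip(delta, x))
    density = STD_NORMAL.pdf(index)
    return [density * d for d in delta]


# HYPOTHETICAL coefficients ordered as (constant, age, sex); not the paper's estimates.
delta = (1.0, -0.02, -1.5)

young_woman = (1.0, 20.0, 0.0)  # constant, age 20, sex = 0 (female)
older_man = (1.0, 60.0, 1.0)    # constant, age 60, sex = 1 (male)

me_young_woman = probit_marginal_effects(delta, young_woman)
me_older_man = probit_marginal_effects(delta, older_man)
```

The age coefficient is the same in both calls, yet the marginal effect of age differs between the two passengers because the density is evaluated at a different point. For a dummy variable such as sex, the derivative is only an approximation; the difference between the predicted probabilities at 1 and at 0 is a common alternative.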
As can be seen from Table 3 above, the marginal effect of each regressor is marginally
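The logit model is the binary choice model that results when the error term is
independent and identically distributed with a logistic distribution. In this case,
$$F(X_i\delta) = \Psi(X_i\delta) = \frac{e^{X_i\delta}}{1 + e^{X_i\delta}}.$$
Specifically,
$$P(SURV_i = 1) = \Psi(\alpha + \beta_{age}AGE + \beta_{sex}SEX + \beta_{pclass1}PCLASS1 + \beta_{pclass2}PCLASS2)$$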
A logit model was estimated and the results are summarized in Table 4.
In order to test the significance of each regressor, the following hypothesis test was
conducted:
A two-tailed test is used; at the α = 0.05 (5%) significance level, the critical value is ±1.96.
Based on this test, we reject the null hypothesis and conclude that individually all
regression coefficients are statistically significant, that is, different from zero.
It is also important to test the null hypothesis that the regressors are jointly
insignificant:
$$H_0: \hat{\beta}_{AGE} = \hat{\beta}_{sex} = \hat{\beta}_{pclass1} = \hat{\beta}_{pclass2} = 0$$
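That is, all of the explanatory variables explain zero percent of the variation in the
dependent variable. The likelihood ratio test statistic is as follows:
$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{K-1}$$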
The value of the unconstrained log-likelihood function is -491.2266. The value of the
constrained log-likelihood function is -707.3102. Therefore, the value of the test statistic
is $-2[-707.3102 - (-491.2266)] = 432.1672$.
The marginal effects were also estimated for the logit model and are summarized
in Table 5 below. Again, unlike the linear probability model, the marginal effect of a
one-unit increase in an independent variable is not simply the estimated coefficient in the logit
model. Theoretically, the marginal effect of $X_i'$ for the logit model is defined as follows:
$$\frac{\partial \Psi(X_i\delta)}{\partial X_i'} = \Psi(X_i\delta)\cdot(1 - \Psi(X_i\delta))\cdot\delta$$
Therefore, the marginal effect of one regressor depends on the entire vector of regressors.
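The logit formula can be sketched in the same way. As before, the coefficients are HYPOTHETICAL placeholders rather than the Table 4 estimates, and the function names are illustrative.

```python
import math


def logistic_cdf(z):
    """Psi(z) = e^z / (1 + e^z), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_marginal_effects(delta, x):
    """Return Psi(x * delta) * (1 - Psi(x * delta)) * delta for each coefficient."""
    index = sum(d * v for d, v in zip(delta, x))
    p = logistic_cdf(index)
    return [p * (1.0 - p) * d for d in delta]


# HYPOTHETICAL coefficients ordered as (constant, age, sex); not the paper's estimates.
delta = (2.0, -0.04, -2.5)

at_half = logit_marginal_effects(delta, (1.0, 50.0, 0.0))  # index = 0, so Psi = 0.5
in_tail = logit_marginal_effects(delta, (1.0, 50.0, 1.0))  # index = -2.5
```

Because Ψ(1 - Ψ) is largest when Ψ = 0.5, the marginal effect of any regressor is at most one quarter of its coefficient in absolute value, and it shrinks as the predicted probability moves toward 0 or 1.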
As can be seen from Table 5, the marginal effect of each regressor is marginally
5. CONCLUSION
This paper explained, using regression analysis, the impact of age, sex, and
passenger class on a person’s likelihood of surviving the Titanic shipwreck. Three binary
choice models were used to regress both quantitative and qualitative explanatory
All three models (linear probability, probit, and logit) lead to the same conclusions. Upper
class women did indeed have the highest probability of surviving, followed by middle
class women and then lower class women. Among men, upper class men had the greatest
probability of survival, which was not much below that of lower class women.
Additionally, the coefficient of age was negative, but very small in magnitude.
explanatory variable is whether or not a passenger got on a lifeboat. This seems to
be a significant determinant of survival. In addition, one could control for the cabin of the
passenger. It would also be interesting to investigate why upper class men had a lower
probability of survival than lower class women. It might also be interesting to have a
separate category for children. However, from the regression that was performed, a
person was lucky to survive and had the best chance of survival if they were a young,
upper class woman.