
THE TITANIC SHIPWRECK: WHO WAS MOST LIKELY TO SURVIVE?


A STATISTICAL ANALYSIS

This paper examines the probability of surviving the Titanic shipwreck using limited
dependent variable regression analysis. This applied analysis will explain the impact
of sex, age, and passenger class on the likelihood of survival. The probability of
survival is examined using a Linear probability model, a Probit model, and a Logit
model.

1. INTRODUCTION

On April 14, 1912, the unthinkable happened when the “unsinkable” RMS Titanic

crashed into an iceberg and sank into the Atlantic Ocean. The 20 lifeboats aboard the

ship, a number actually larger than that required by the British Board of Trade at the time,

were not enough to save a majority of the passengers, leaving over 1500 passengers

aboard the sinking vessel. A total of 705 passengers escaped onto lifeboats and to safety.

But not all passengers had an equal chance of getting onto a lifeboat and surviving the
disaster. It is the purpose of this paper to explain, using regression analysis, the impact of

sex, passenger class, and age on a person’s likelihood of surviving the shipwreck.

2. HYPOTHESES

The first hypothesis is that upper class women have the best chance of survival,

followed by middle class women and then lower class women. Second, men’s survival

rate will also decrease as class lowers. Finally, it is predicted that age will be negatively

related to the probability of survival.

3. THE DATA SET

The data set used in this paper consists of 1,046 observations of individual passengers

aboard the Titanic. It is important to note that not all passengers aboard the ship are

accounted for in this analysis because some characteristics of these passengers were

missing. The information provided for each passenger includes age, sex, passenger class

(1st, 2nd, or 3rd), and whether the passenger survived the shipwreck. In this

sample, 41 percent of the passengers survived, and 59 percent of the passengers died. The

sample mean of age is 30 years old. Sex is a dummy variable for male passengers on the

ship. In this sample, 63 percent of the passengers were male, and 37 percent of the

passengers were female. A dummy variable was also created for 1st class passengers and

2nd class passengers. For instance, the dummy variable for first class passengers (pclass1)

is equal to one if a passenger was in the upper class and equal to zero if they were in any

other class (middle-class or lower-class). In this sample, 27 percent of the passengers

were upper class, 25 percent of the passengers were middle class, and 48 percent of

the passengers were lower class.

4. THE MODEL

To model the probability of survival, a binary choice model is used. The binary

choice model may be characterized as a classical regression model subject to qualitative

observation of the dependent variable. In this case, the dependent variable will be

whether or not the passenger survived the shipwreck. Specifically, assume that

$$Y_i = X_i\beta - \sigma\varepsilon_i$$

satisfies all classical assumptions except that $Y_i$ is not observed. Instead, a binary variable, $J_i$, is observed such that

$$J_i = 1 \text{ if } Y_i > 0, \qquad J_i = 0 \text{ if } Y_i \le 0$$

The model specifies the relationship between the latent variable $Y_i$ and the regressors. In order to estimate anything, the relationship between the observed variable $J_i$ and the regressors is needed. The relationship follows from the definition of $J_i$. That is,

$$J_i = 1 \iff Y_i > 0 \iff X_i\beta - \sigma\varepsilon_i > 0 \iff \varepsilon_i < \frac{X_i\beta}{\sigma}$$

which implies that

$$P(J_i = 1) = P\left(\varepsilon_i < \frac{X_i\beta}{\sigma}\right) = F\left(\frac{X_i\beta}{\sigma}\right)$$

where $F(\cdot)$ is the distribution function of the $\varepsilon_i$. Since the space of $J_i$ is $\{0, 1\}$, it follows that

$$P(J_i = 0) = 1 - F\left(\frac{X_i\beta}{\sigma}\right)$$

Since $J_i$ is discrete, the density function for $J_i$ is

$$f(J_i) = \left[F\left(\frac{X_i\beta}{\sigma}\right)\right]^{J_i} \left[1 - F\left(\frac{X_i\beta}{\sigma}\right)\right]^{1 - J_i}$$

Then, letting $\delta = \sigma^{-1}\beta$, it follows that the log-likelihood function is

$$\ln L(\delta) = \sum_{i=1}^{n} \left\{ J_i \ln F(X_i\delta) + (1 - J_i) \ln\left[1 - F(X_i\delta)\right] \right\}$$

Clearly, the binary choice model encompasses a wide class of models indexed by the choice of the distribution function, $F(X_i\delta)$. In order to examine the probability of survival, the linear probability model, the probit model, and the logit model will be used.
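The log-likelihood function of the binary choice model can be sketched in a few lines of Python. This is only an illustration: the function names and the toy data are hypothetical and are not part of the Titanic sample or of the estimation software used in this paper.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function (the F of the probit model)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Logistic distribution function (the F of the logit model)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(delta, X, J, F):
    """ln L(delta) = sum_i { J_i ln F(X_i delta) + (1 - J_i) ln[1 - F(X_i delta)] }."""
    total = 0.0
    for x_i, j_i in zip(X, J):
        index = sum(x * d for x, d in zip(x_i, delta))  # X_i * delta
        p = F(index)                                     # P(J_i = 1)
        total += j_i * math.log(p) + (1 - j_i) * math.log(1.0 - p)
    return total

# Hypothetical toy data (NOT the Titanic sample): a constant and one regressor.
X_toy = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
J_toy = [1, 0, 1]
delta_zero = [0.0, 0.0]

ll_probit = log_likelihood(delta_zero, X_toy, J_toy, normal_cdf)
ll_logit = log_likelihood(delta_zero, X_toy, J_toy, logistic_cdf)
```

With all coefficients set to zero, both distribution functions return 0.5 for every observation, so the two log-likelihoods coincide. Maximum likelihood estimation would search over delta for the largest value of this function.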

4.1 LINEAR PROBABILITY MODEL

The simplest binary choice model is the linear probability model, where the error term is assumed to be independently and identically distributed uniformly on the interval [0, 1]. That is, $\varepsilon_i \sim \text{iid } U(0,1)$. Therefore, $F(X_i\delta) = X_i\delta$ for $0 < X_i\delta < 1$. So, the linear probability model is a linear regression model with heteroskedastic errors. Specifically,

$$SURV_i = \alpha + \beta_{age} AGE + \beta_{sex} SEX + \beta_{pclass1} PCLASS1 + \beta_{pclass2} PCLASS2 + \mu_i$$

Therefore, this linear probability model will be estimated using OLS. It is important to note that, because of the heteroskedastic errors, the resulting estimates will be unbiased but inefficient. This is one major drawback of the linear probability model. Table 1 summarizes the OLS estimation.
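The source of the heteroskedasticity can be made explicit with a short derivation. As a sketch, treat $SURV_i$ as a Bernoulli variable with success probability $X_i\delta$; the shorthand $p_i$ is introduced here for convenience and does not appear elsewhere in the paper. The error $\mu_i$ then equals $1 - p_i$ with probability $p_i$ and $-p_i$ with probability $1 - p_i$:

```latex
\begin{aligned}
p_i &= P(SURV_i = 1 \mid X_i) = X_i\delta, \\
E(\mu_i \mid X_i) &= p_i(1 - p_i) + (1 - p_i)(-p_i) = 0, \\
\operatorname{Var}(\mu_i \mid X_i) &= p_i(1 - p_i)^2 + (1 - p_i)p_i^2 = p_i(1 - p_i).
\end{aligned}
```

The error variance therefore changes with the regressors through $p_i$, which is why OLS is inefficient here.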

Table 1: Linear Probability Model Estimates using OLS

Regressor   Coefficient   Std. Error      t-stat   Prob > |t|
Con             0.73457      0.03233    22.71940      0.00000
AGE            -0.00527      0.00093    -5.65627      0.00000
SEX            -0.49141      0.02555   -19.23186      0.00000
PCLASS1         0.37039      0.03250    11.39518      0.00000
PCLASS2         0.15901      0.03033     5.24332      0.00000

One can obtain the following conditional probabilities of surviving from the

model:

Probability of survival given female & 3rd class:

$$P(S \mid F, 3) = \hat{\alpha} + \hat{\beta}_{AGE}\,Age = 0.73457 - 0.00527\,Age$$

Probability of survival given female & 2nd class:

$$P(S \mid F, 2) = (\hat{\alpha} + \hat{\beta}_{pclass2}) + \hat{\beta}_{AGE}\,Age = 0.89358 - 0.00527\,Age$$

Probability of survival given female & 1st class:

$$P(S \mid F, 1) = (\hat{\alpha} + \hat{\beta}_{pclass1}) + \hat{\beta}_{AGE}\,Age = 1.10496 - 0.00527\,Age$$

Probability of survival given male & 3rd class:

$$P(S \mid M, 3) = (\hat{\alpha} + \hat{\beta}_{sex}) + \hat{\beta}_{AGE}\,Age = 0.24316 - 0.00527\,Age$$

Probability of survival given male & 2nd class:

$$P(S \mid M, 2) = (\hat{\alpha} + \hat{\beta}_{sex} + \hat{\beta}_{pclass2}) + \hat{\beta}_{AGE}\,Age = 0.40217 - 0.00527\,Age$$

Probability of survival given male & 1st class:

$$P(S \mid M, 1) = (\hat{\alpha} + \hat{\beta}_{sex} + \hat{\beta}_{pclass1}) + \hat{\beta}_{AGE}\,Age = 0.61355 - 0.00527\,Age$$

Therefore, it is clear that these regressions differ only in their intercepts and not in the slope coefficient for Age, $\hat{\beta}_{AGE}$. This can be seen in the following graph.

Graph 1: Probability of Survival

[Graph: probability of survival (%) plotted against age (years) for the six sex and class groups F,1; F,2; F,3; M,1; M,2; and M,3.]

By looking at the graph, another major drawback of the linear probability model is clear. The fitted probabilities are not constrained to lie within the unit interval, which is why the graph shows probabilities greater than 100% and less than 0%. This is a fundamental limitation.
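The problem can be checked directly from the Table 1 estimates. The following Python sketch recomputes the fitted lines; the function names and the two example passengers are illustrative choices, not part of the original analysis.

```python
# Table 1 OLS estimates of the linear probability model.
ALPHA = 0.73457
B_AGE = -0.00527
B_SEX = -0.49141                              # SEX = 1 for male passengers
B_PCLASS = {1: 0.37039, 2: 0.15901, 3: 0.0}   # 3rd class is the omitted category

def lpm_fitted(male, pclass, age):
    """Fitted 'probability' of survival from the linear probability model."""
    return ALPHA + B_AGE * age + B_SEX * male + B_PCLASS[pclass]

def age_where_fit_equals(target, male, pclass):
    """Age at which the fitted line for a group reaches the target value."""
    intercept = ALPHA + B_SEX * male + B_PCLASS[pclass]
    return (target - intercept) / B_AGE

# Two illustrative passengers chosen to show the out-of-range fitted values.
above_one = lpm_fitted(male=0, pclass=1, age=10)    # first class female, age 10
below_zero = lpm_fitted(male=1, pclass=3, age=50)   # third class male, age 50

f1_crosses_one = age_where_fit_equals(1.0, male=0, pclass=1)
m3_crosses_zero = age_where_fit_equals(0.0, male=1, pclass=3)
```

Under these estimates, the fitted value for first class females exceeds 1 below roughly age 20, and the fitted value for third class males falls below 0 above roughly age 46.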

From the graph, it is also easy to understand the interpretations of the coefficients. Ceteris paribus, $\hat{\beta}_{AGE}$ is the effect on the probability of survival of a one-year increase in age. Therefore, a one-year increase in age decreases the probability of survival by about 0.5 percentage points. $\hat{\beta}_{sex}$ is interpreted as the difference in the probability of survival between a male and a female, holding age and passenger class constant. $\hat{\beta}_{pclass1}$ is interpreted as the difference in the probability of survival between a first class passenger and a third class passenger, holding sex and age constant. $\hat{\beta}_{pclass2}$ is interpreted as the difference in the probability of survival between a second class passenger and a third class passenger, holding sex and age constant.

In order to test the significance of each regressor, the following hypothesis test

was conducted:

$H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $i$ = AGE, SEX, PCLASS1, PCLASS2

A two-tailed test is used; at the α = 0.05 (5%) significance level, the critical value is ±1.96. Every t-statistic in Table 1 exceeds this value in absolute terms, so we reject the null hypothesis for each regressor and conclude that individually all regression coefficients are statistically significant, that is, different from zero.
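The individual significance tests amount to dividing each coefficient by its standard error and comparing the ratio with the critical value. A minimal Python sketch using the Table 1 figures follows; the variable names are illustrative, and the recomputed ratios differ slightly from the published t-statistics because the table inputs are rounded.

```python
# (coefficient, standard error) pairs copied from Table 1.
table1 = {
    "Con": (0.73457, 0.03233),
    "AGE": (-0.00527, 0.00093),
    "SEX": (-0.49141, 0.02555),
    "PCLASS1": (0.37039, 0.03250),
    "PCLASS2": (0.15901, 0.03033),
}
CRITICAL_VALUE = 1.96  # two-tailed test at the 5% level

def t_statistic(coefficient, std_error):
    """t-statistic for H0: coefficient = 0."""
    return coefficient / std_error

t_stats = {name: t_statistic(b, se) for name, (b, se) in table1.items()}
rejected = {name: abs(t) > CRITICAL_VALUE for name, t in t_stats.items()}
```

Every entry of `rejected` is true, which matches the conclusion stated in the text.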

4.2 PROBIT MODEL

The probit model is the specification of the binary choice model that results when the error term is independent and identically distributed normally with mean 0 and variance 1. That is, $\varepsilon_i \sim \text{iid } N(0,1)$. In this case, $F(X_i\delta) = \Phi(X_i\delta)$, where $\Phi(\cdot)$ is the standard normal distribution function. Specifically,

$$P(SURV_i = 1) = \Phi(\alpha + \beta_{age} AGE + \beta_{sex} SEX + \beta_{pclass1} PCLASS1 + \beta_{pclass2} PCLASS2)$$

A probit model was estimated and the results are summarized in Table 2.

Table 2: Probit Model Estimation

Regressor   Coefficient   Std. Error      t-stat   Prob > |t|
Con             0.75194      0.11389     6.60261      0.00000
AGE            -0.01943      0.00350    -5.54898      0.00000
SEX            -1.48564      0.09746   -15.24314      0.00000
PCLASS1         1.30316      0.12500    10.42550      0.00000
PCLASS2         0.54299      0.12526     4.33495      0.00002
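Because the probit index is passed through the standard normal distribution function, predicted probabilities from Table 2 always lie strictly between 0 and 1. The Python sketch below illustrates this for two passengers chosen for illustration only; the function names are hypothetical.

```python
import math

# Table 2 probit estimates.
PROBIT = {"const": 0.75194, "age": -0.01943, "sex": -1.48564,
          "pclass1": 1.30316, "pclass2": 0.54299}

def normal_cdf(z):
    """Standard normal distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_probability(male, pclass, age):
    """Predicted P(SURV = 1) for a passenger with the given characteristics."""
    index = (PROBIT["const"] + PROBIT["age"] * age + PROBIT["sex"] * male
             + PROBIT["pclass1"] * (pclass == 1)
             + PROBIT["pclass2"] * (pclass == 2))
    return normal_cdf(index)

p_female_first = probit_probability(male=0, pclass=1, age=10)  # about 0.97
p_male_third = probit_probability(male=1, pclass=3, age=50)    # about 0.04
```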

In order to test the significance of each regressor, the following hypothesis test

was conducted:

$H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $i$ = AGE, SEX, PCLASS1, PCLASS2

A two-tailed test is used; at the α = 0.05 (5%) significance level, the critical value is ±1.96. Every t-statistic in Table 2 exceeds this value in absolute terms, so we reject the null hypothesis for each regressor and conclude that individually all regression coefficients are statistically significant, that is, different from zero.

It is also important to test whether the regressors are jointly significant. The null hypothesis is that all slope coefficients are equal to zero:

$$H_0: \beta_{AGE} = \beta_{sex} = \beta_{pclass1} = \beta_{pclass2} = 0$$

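That is, under the null hypothesis the explanatory variables explain none of the variation in the dependent variable, SURV. Therefore, a likelihood ratio test can be performed as follows:

$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{K-1}$$

where $\tilde{\delta}$ is the constrained estimate, $\hat{\delta}$ is the unconstrained estimate, and $K$ is the number of estimated parameters (here $K = 5$).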

The value of the unconstrained log-likelihood function is -492.2764. The value of the constrained log-likelihood function is -707.3102. Therefore, the value of the test statistic is 430.0676. The critical value for $\chi^2_4$ at the α = 0.01 (1%) level of significance is 13.28. Since the test statistic far exceeds this value, the null hypothesis is rejected and the regressors are jointly significant at the 1% level.
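The likelihood ratio statistic can be reproduced from the two reported log-likelihood values. The Python sketch below also computes a p-value using the closed-form upper tail of the chi-square distribution with 4 degrees of freedom, which avoids any external library; the function names are illustrative.

```python
import math

def lr_statistic(loglik_constrained, loglik_unconstrained):
    """Likelihood ratio statistic: -2 [ln L(constrained) - ln L(unconstrained)]."""
    return -2.0 * (loglik_constrained - loglik_unconstrained)

def chi2_sf_df4(x):
    """P(chi-square with 4 degrees of freedom > x), exact closed form for df = 4."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Log-likelihood values reported for the probit model.
lr_probit = lr_statistic(-707.3102, -492.2764)
p_value = chi2_sf_df4(lr_probit)
reject_at_1_percent = p_value < 0.01
```

The recomputed statistic is 430.0676, and the associated p-value is far below 0.01, so the joint null hypothesis is rejected.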

The marginal effects were also estimated for the probit model and are summarized

in Table 3 as follows:

Table 3: Probit Model Marginal Effects

Regressor   Marginal Effects   Std. Error      t-stat   Prob > |t|
AGE                 -0.00746      0.00134    -5.57849      0.00000
SEX                 -0.57088      0.03878   -14.72199      0.00000
PCLASS1              0.50076      0.04913    10.19181      0.00000
PCLASS2              0.20865      0.04876     4.27881      0.00002

Unlike in the linear probability model, the marginal effect of a one-unit increase in an independent variable is not the estimated coefficient in the probit model. Theoretically, the marginal effect of $X_i$ for the probit model is defined as follows:

$$\frac{\partial \Phi(X_i\delta)}{\partial X_i'} = \phi(X_i\delta) \cdot \delta$$

Therefore, the marginal effect of one regressor depends on the entire vector of regressors. As can be seen from Table 3 above, the marginal effect of each regressor is statistically significant at the 1% level.
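The formula can be illustrated numerically. The paper does not state at which point the marginal effects were evaluated, so the Python sketch below assumes evaluation at the rounded sample means reported in Section 3 and treats the dummy variables as if they were continuous. Both are assumptions made for illustration.

```python
import math

# Table 2 probit coefficients.
coef = {"const": 0.75194, "AGE": -0.01943, "SEX": -1.48564,
        "PCLASS1": 1.30316, "PCLASS2": 0.54299}
# Rounded sample means from Section 3 (assumed evaluation point).
means = {"const": 1.0, "AGE": 30.0, "SEX": 0.63, "PCLASS1": 0.27, "PCLASS2": 0.25}

def normal_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

index_at_means = sum(coef[k] * means[k] for k in coef)   # X * delta at the means
scale = normal_pdf(index_at_means)                       # phi(X * delta)
marginal_effects = {k: scale * coef[k] for k in coef if k != "const"}
```

Under these assumptions the results are close to Table 3 (for example, about -0.0075 for AGE and about -0.57 for SEX). The small remaining differences are consistent with the rounding of the sample means.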

4.3 LOGIT MODEL

The logit model is the binary choice model that results when the error term is independent with logistic distribution. In this case,

$$F(X_i\delta) = \Psi(X_i\delta) = \frac{e^{X_i\delta}}{1 + e^{X_i\delta}}$$

Specifically,

$$P(SURV_i = 1) = \Psi(\alpha + \beta_{age} AGE + \beta_{sex} SEX + \beta_{pclass1} PCLASS1 + \beta_{pclass2} PCLASS2)$$

A logit model was estimated and the results are summarized in Table 4.

Table 4: Logit Model Estimation

Regressor   Coefficient   Std. Error      t-stat   Prob > |t|
Con             1.23239      0.19999     6.16221      0.00000
AGE            -0.03439      0.00622    -5.53179      0.00000
SEX            -2.49783      0.17194   -14.52697      0.00000
PCLASS1         2.28961      0.22526    10.16423      0.00000
PCLASS2         1.00935      0.21781     4.63406      0.00000
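One convenient way to read the logit coefficients, not used elsewhere in this paper, is through odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of survival, P/(1 - P), for a one-unit increase in that regressor with the others held fixed. A brief Python sketch using the Table 4 estimates:

```python
import math

# Table 4 logit slope coefficients.
logit_coef = {"AGE": -0.03439, "SEX": -2.49783,
              "PCLASS1": 2.28961, "PCLASS2": 1.00935}

# exp(coefficient) is the factor by which the odds P/(1 - P) change
# for a one-unit increase in the regressor, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in logit_coef.items()}
```

Under these estimates, the odds of survival for a male are roughly 0.08 times those of a female of the same age and class, and the odds for a first class passenger are nearly ten times those of a third class passenger of the same age and sex.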

In order to test the significance of each regressor, the following hypothesis test was

conducted:

$H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $i$ = AGE, SEX, PCLASS1, PCLASS2

A two-tailed test is used; at the α = 0.05 (5%) significance level, the critical value is ±1.96. Every t-statistic in Table 4 exceeds this value in absolute terms, so we reject the null hypothesis for each regressor and conclude that individually all regression coefficients are statistically significant, that is, different from zero.

It is also important to test whether the regressors are jointly significant. The null hypothesis is that all slope coefficients are equal to zero:

$$H_0: \beta_{AGE} = \beta_{sex} = \beta_{pclass1} = \beta_{pclass2} = 0$$

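That is, under the null hypothesis the explanatory variables explain none of the variation in the dependent variable, SURV. Therefore, a likelihood ratio test can be performed as follows:

$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{K-1}$$

where $\tilde{\delta}$ is the constrained estimate, $\hat{\delta}$ is the unconstrained estimate, and $K$ is the number of estimated parameters (here $K = 5$).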

The value of the unconstrained log-likelihood function is -491.2266. The value of the constrained log-likelihood function is -707.3102. Therefore, the value of the test statistic is 432.1672. The critical value for $\chi^2_4$ at the α = 0.01 (1%) level of significance is 13.28. Since the test statistic far exceeds this value, the null hypothesis is rejected and the regressors are jointly significant at the 1% level.

The marginal effects were also estimated for the logit model and are summarized in Table 5 below. Again, unlike in the linear probability model, the marginal effect of a one-unit increase in an independent variable is not the estimated coefficient in the logit model. Theoretically, the marginal effect of $X_i$ for the logit model is defined as follows:

$$\frac{\partial \Psi(X_i\delta)}{\partial X_i'} = \Psi(X_i\delta) \cdot \left(1 - \Psi(X_i\delta)\right) \cdot \delta$$

Therefore, the marginal effect of one regressor depends on the entire vector of regressors. As can be seen from Table 5, the marginal effect of each regressor is statistically significant at the 1% level.

Table 5: Logit Model Marginal Effects

Regressor   Marginal Effects   Std. Error      t-stat   Prob > |t|
AGE                 -0.00810      0.00146    -5.54556      0.00000
SEX                 -0.58799      0.04774   -12.31609      0.00000
PCLASS1              0.53897      0.05827     9.24915      0.00000
PCLASS2              0.23760      0.05384     4.41332      0.00001
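The logit marginal effect formula can be evaluated in the same spirit. The evaluation point is not stated in the paper, so this Python sketch assumes the rounded sample means from Section 3 and treats the dummy variables as continuous; both choices are illustrative assumptions.

```python
import math

# Table 4 logit coefficients.
coef = {"const": 1.23239, "AGE": -0.03439, "SEX": -2.49783,
        "PCLASS1": 2.28961, "PCLASS2": 1.00935}
# Rounded sample means from Section 3 (assumed evaluation point).
means = {"const": 1.0, "AGE": 30.0, "SEX": 0.63, "PCLASS1": 0.27, "PCLASS2": 0.25}

def logistic_cdf(z):
    """Logistic distribution function, Psi(z)."""
    return 1.0 / (1.0 + math.exp(-z))

index_at_means = sum(coef[k] * means[k] for k in coef)   # X * delta at the means
psi = logistic_cdf(index_at_means)
scale = psi * (1.0 - psi)                                # Psi * (1 - Psi)
marginal_effects = {k: scale * coef[k] for k in coef if k != "const"}
```

Under these assumptions the computed effects are close to the Table 5 values (about -0.0081 for AGE and about -0.59 for SEX).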

5. CONCLUSION

This paper explained, using regression analysis, the impact of age, sex, and

passenger class on a person’s likelihood of surviving the Titanic shipwreck. Three binary

choice models were used to regress both quantitative and qualitative explanatory

variables on the qualitative dependent variable of survival as follows:

$$SURV_i = f(\alpha + \beta_{age} AGE + \beta_{sex} SEX + \beta_{pclass1} PCLASS1 + \beta_{pclass2} PCLASS2)$$

All three models (linear probability, probit, and logit) lead to the same conclusions. Upper

class women did indeed have the highest probability of surviving, followed by middle

class women and then lower class women. Among men, upper class men had the greatest

probability of survival, which was not much below that of lower class women.

Additionally, the coefficient of age was negative, but was a very small value.
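The agreement among the three models can be checked by ranking the six sex and class groups by predicted survival probability. The Python sketch below does this at age 30, the sample mean age; the choice of evaluation age and all function names are assumptions made for illustration.

```python
import math

# Estimates in the order (constant, AGE, SEX, PCLASS1, PCLASS2), Tables 1, 2 and 4.
ESTIMATES = {
    "lpm":    (0.73457, -0.00527, -0.49141, 0.37039, 0.15901),
    "probit": (0.75194, -0.01943, -1.48564, 1.30316, 0.54299),
    "logit":  (1.23239, -0.03439, -2.49783, 2.28961, 1.00935),
}
# Distribution function F applied to the index in each model.
LINKS = {
    "lpm": lambda z: z,
    "probit": lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))),
    "logit": lambda z: 1.0 / (1.0 + math.exp(-z)),
}

def survival_probability(model, male, pclass, age):
    """Predicted survival probability for one passenger under the given model."""
    const, b_age, b_sex, b_p1, b_p2 = ESTIMATES[model]
    index = (const + b_age * age + b_sex * male
             + b_p1 * (pclass == 1) + b_p2 * (pclass == 2))
    return LINKS[model](index)

AGE = 30  # sample mean age reported in Section 3
groups = [(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3)]  # (male, pclass)
ranking = {
    model: sorted(groups,
                  key=lambda g: survival_probability(model, g[0], g[1], AGE),
                  reverse=True)
    for model in ESTIMATES
}
```

At this age all three models produce the same ordering: first, second, and third class women, followed by first, second, and third class men. In each model the predicted probability for first class men is below that for third class women.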

This analysis could be extended in the following ways. Another possible

explanatory variable is whether or not a passenger got on a lifeboat. This seems to

be a significant determinant of survival. In addition, one could control for the cabin of the

passenger. It would also be interesting to investigate why upper class men had a lower

probability of survival than lower class women. It might also be interesting to have a

separate category for children. However, from the regression that was performed, a

person was lucky to survive and had the best chance of survival if they were a young,

upper class woman.
