An Analysis of the Probability of Survival and Age Distribution of passengers aboard the Titanic and their dependence on various factors

Introduction
A short statistical analysis is done on the data about the passengers to uncover hidden dependencies among different variables. This paves the way to try and find reasons for the observed patterns.

I. Is there a significant difference in Age distribution between those who survived and those who did not?
Fig. 1 Histogram for survivors and non survivors

From the above histograms, we can comment that the age distribution seems different for the two samples. It seems that the infants (less than 5 years of age) survived more. The Box-plots of the same are presented:

Fig. 2 Box plot for non survivors and survivors

To answer the first question, we start with the following assumptions:

1. The random variable of interest is age. Let X1 denote the age of survivors and X2 the age of non-survivors.
2. X1 and X2 are continuous.
3. The samples of X1 and X2 are independent and identically distributed.

Let us check the assumption of normality of the populations of survivors and non-survivors.

Fig. 2 Q-Q norm for survivors and non survivors

The p-values for all the tests for the two samples are tabulated as follows:

Columns: Test, p value (Survived), p-value (Non Survived)
Tests: Shapiro-Wilk normality test, Anderson-Darling normality test, Cramer-von Mises normality test, Lilliefors (Kolmogorov-Smirnov) normality test, Shapiro-Francia normality test
Recoverable p-values (row and column assignment not recoverable): 0.0020591830, 1.824687e-10, 2.461442e-09, 5.014279e-16, 4.646866e-16, [...]719147e-08

The p-values for all the tests for the two samples suggest that the populations of survivors and non-survivors are not normal (at assumed α-value of 0.05).

Further to the above assumptions, we assume that the CDFs of X1 and X2 have the same shape. This allows us to apply Wilcoxon’s rank sum test. From the calculated p-value for the Wilcoxon rank sum test (0.19), there is not enough evidence against Ho (Ho: X1 is stochastically equal to X2).

Fig 3: ECDF for non survivors and survivors

But from the histograms and ECDF of survivors and non-survivors, it appears that there is a significant difference in survival probability for people in the age group of 0-5 years. So, we do a Kolmogorov-Smirnov two sample test and get a p-value of 0.03428, which suggests that there is evidently a significant difference in age distributions (at α-value of 0.05).

Now, based on observed data and our intuition from histograms, we categorise survivors and non-survivors in different age categories viz. 0 to 5, 5+ to 15, 15+ to 30, 30+ to 45, 45+ to 60, and 60+. We do a chi-square test to see if survivors and non-survivors have a homogeneous distribution across these age categories. We get a p-value of 5.477×10^-6, which supports our belief that there is a difference in age distributions of survivors and non-survivors.

Now, since the sample size of survivors is 313, and that of non-survivors is 443, we can do a z-test on the problem of proportion for each age category separately, the null hypotheses being:

Π0-5, Survivors = Π0-5, Non-Survivors
Π5-15, Survivors = Π5-15, Non-Survivors
Π15-30, Survivors = Π15-30, Non-Survivors
Π30-45, Survivors = Π30-45, Non-Survivors
Π45-60, Survivors = Π45-60, Non-Survivors
Π60+, Survivors = Π60+, Non-Survivors

We began with two tailed tests and single tailed tests were done wherever the null was refuted. On performing Z tests, we get the following p values, and thus the adjoining conclusions:

Age Category   P value   Conclusion
0 to 5         2.8e-6    Π0-5, Survivors > Π0-5, Non-Survivors
5+ to 15       [...]     [...]
15+ to 30      [...]     [...]
30+ to 45      [...]     [...]
45+ to 60      [...]     Π45-60, Survivors = Π45-60, Non-Survivors
60+            [...]     Π60+, Survivors < Π60+, Non-Survivors

The table also reports the p-values 0.1578 and 0.0355; the rows they belong to are not recoverable here.

The above analysis suggests that there is a significant difference in age distribution between those who survived and those who did not.

II. (a) Is there a significant difference in Age distribution between male survivors and male non survivors?

Histograms and box-plots [...] survived more and old males died more.

Fig 4 Box plot for male non survivors and survivors

Fig 5 Histogram for male non survivors and survivors

Normality tests on data

The p-values for all the tests for the two samples of survivors and non-survivors are tabulated as follows:

Columns: Test, p value (Survived), p-value (Non Survived)
Recoverable test names: Anderson-Darling normality test, Cramer-von Mises normality test
Recoverable p-values (row and column assignment not recoverable): 6.366845e-10, 0.045051620, 2.227363e-16, 5.182246e-10, 0.008390771

The p-values suggest that the populations of survivors and non-survivors are not normal (at assumed α-value of 0.05).

We do a Kolmogorov-Smirnov two sample test to find out that the two samples come from different distributions (p value = 0.002), implying there is a significant difference in age distributions of male survivors and dead.

We use the same approach of dividing the population into age categories to find out if there is a dependence of survival probability on age category as done in part (1), the only difference being that here the two samples come from males. A chi-square p value of 1.47e-11 implies that the population of male survivors and non-survivors is not homogeneous with respect to age categories. Thus, we go ahead with 6 separate Z tests, one for each age category. The null hypotheses are as follows:

Π0-5, Male_Survivors = Π0-5, Male_Non-Survivors
Π5-15, Male_Survivors = Π5-15, Male_Non-Survivors
Π15-30, Male_Survivors = Π15-30, Male_Non-Survivors
Π30-45, Male_Survivors = Π30-45, Male_Non-Survivors
Π45-60, Male_Survivors = Π45-60, Male_Non-Survivors
Π60+, Male_Survivors = Π60+, Male_Non-Survivors

We began with two tailed tests and single tailed tests were done wherever the null was refuted. On performing Z tests, we get the following p values, and thus the adjoining conclusions:

Age Category: 0 to 5, 5+ to 15, 15+ to 30 [...]

The above analysis suggests that there is a significant difference in age distribution between male survivors and male non-survivors.

II. (b) Is there a significant difference in Age distribution between females who survived and those who did not?

Histograms and box-plots for female dead and survived are compared. This clearly suggests that the distributions are not the same.

Fig 6 Boxplot for female non survivors and survivors

Fig 7 Histograms for female survivors and non survivors

The distributions do not seem to be normal, as supported by the normality tests.

Normality tests data

The p-values for all the tests for the two samples of female survivors and female non survivors are tabulated below:

Columns: Test, p value (Survived), p-value (Non Survived)
Recoverable test names: Lilliefors (Kolmogorov-Smirnov) normality test, Shapiro-Francia normality test
Recoverable p-values (row and column assignment not recoverable): 0.0077707718, 0.08483381, 0.11744076, 0.0001661670, 0.11530637, 0.11296655, 0.12109238

The p-values for all the tests for the two samples suggest that the samples of survivors are not normal, whereas that of non survivors follow a normal distribution (at assumed α-value of 0.05).

However, to reinforce the difference seen in the plots, we do a Kolmogorov-Smirnov two sample test. This also suggests that the two samples come from different distributions (p value = 0.01326), implying there is a significant difference in age distributions of female survivors and dead.

We use the same approach of dividing the population into age categories to find out if there is a dependence of survival probability on age category. The null hypotheses that can be recovered are:

Π30-45, Female_Survivors = Π30-45, Female_Non-Survivors
Π45-60, Female_Survivors = Π45-60, Female_Non-Survivors
Π60+, Female_Survivors = Π60+, Female_Non-Survivors

We began with two tailed tests and single tailed tests were done wherever the null was refuted. On performing Z tests, we get the following p values, and thus the adjoining conclusions:

Age Category   P value   Conclusion
0 to 5         [...]     [...]
5+ to 15       [...]     [...]
15+ to 30      [...]     [...]
30+ to 45      [...]     [...]
45+ to 60      [...]     [...]
60+            0.666     Π60+, Female_Survivors = Π60+, Female_Non-Survivors

The above analysis suggests that there is a significant difference in age distribution between female survivors and female non-survivors.

III. Remark on how Age affected the Survival Probability of a passenger on board the Titanic, based on consolidations of your findings in 1 and 2 above.

The findings in 1 and 2 above suggest that females had higher survival probability than their counterparts. Given that the boarders are males, infants and teenagers had higher survival probability; however, the age group of 15 to 30 and those above 60 years had less survival probability. Given that the boarders are females, the age group of 45 to 60 had higher survival probability. Possible reasons could have been that females and kids were given preference in going on life boats, and the old could have thought of sacrificing their lives for the young.

IV. Is there a significant difference in Survival Probability between the two genders?

Ho: No difference in the survival probability of the two genders viz. male and female
Ha: Significant difference in the survival probability of the two genders viz. male and female (Two-sided)

Data: The below table displays the problem’s data:

           Survivor   Non-Survivor   Total
Males      142        709            851
Females    308        154            462
Total      450        863            1313

Test adopted for testing the hypothesis: Since it’s a problem of proportion and we would like to compare the survival probabilities of male and female, we can use the following tests:

1. Fisher’s exact test
2. Z-test

Fisher’s exact test is a more powerful test in this case, but we can also do a Z-test as the sample size is large.

Conclusion: On the basis of the Z-test we conclude that there is a significant difference in the survival probability of the two genders.
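The Z-test on the gender table can be reproduced along the following lines. This is a minimal sketch, assuming a pooled two-proportion z-test with a normal approximation; the function name `two_proportion_z_test` is illustrative and not taken from the original analysis, and the report does not state which software or exact variant was used.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: p1 == p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts from the table above: 142 of 851 males and 308 of 462 females survived.
z, p_value = two_proportion_z_test(142, 851, 308, 462)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3g}")
```

A large negative z here corresponds to a lower survival proportion among males than among females, which is the direction reported in the conclusion.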

We have the following data:

                      Survivors   Non-Survivors
Passenger Class I     193         129
Passenger Class II    119         161
Passenger Class III   138         573

The p-value of 2.2×10^-16 suggests that there is enough evidence to reject the null hypothesis (at α-value of 0.05). It can be said that there is a significant difference between population distributions across passenger classes.

We further break the data to compare different classes. We did a single-tailed Fisher’s test by taking sets of two classes at a time. This helped us find which passenger class had a better chance of survival. It was observed that the survival probability is highest for Class I followed by Class II, with Class III having the lowest probability of survival. The above conclusion agrees with the common knowledge that passengers in first class had the first option to mount the lifeboats. Passengers in third class were the last to mount the lifeboats.

VI. Is there a significant difference in Survival Probability between the two genders even after taking the effect of Passenger Class into Account?
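For the passenger-class comparison in the preceding part, the report does not name the test behind the overall p-value. One plausible reading is a Pearson chi-square test of homogeneity on the 3×2 table of survivors by class; the sketch below assumes that, and the helper name `chi_square_homogeneity` is illustrative. With 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: Class I, II, III; columns: survivors, non-survivors (from the report).
counts = [[193, 129], [119, 161], [138, 573]]
stat, dof = chi_square_homogeneity(counts)

# For exactly 2 degrees of freedom, P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"chi-square = {stat:.2f}, df = {dof}, p-value = {p_value:.3g}")
```

Under this assumption the p-value falls well below the 2.2×10^-16 threshold quoted in the report, so the sketch agrees with the stated conclusion.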

. i. So.e. So. i. there is a significant difference in Survival Probability between the two genders for Class II. i.We make three 2×2 contingency tables corresponding to each class.2e-16. there is a significant difference in Survival Probability between the two genders for class1.. So. and do Fisher’s test as follows:Class I Male Female Survivors 59 134 Non Survivors 120 9 We did a two sided Fisher’s test which yielded a p value of less than 2.2e-16. we did a one-sided Class III Survivors Non Survivors 441 Female 80 132 We did a two sided Fisher’s test which yielded a p value of less than 2.e.e.. we did a one-sided fisher’s test We did a two sided Fisher’s test which yielded a p-value of less than 2. there is a significant difference in Survival Probability between the two genders for class2.2e-16. we did a one-sided fisher’s test with alternate hypothesis being that males’ survival probability is less than that of .