IN
HEALTHCARE
(LIFE EXPECTANCY – WHO DATASET)
SUBMITTED TO:
SUBMITTED BY GROUP 1
BHADRINATH T.S. [18PGDM013]
NO TITLE
1 INTRODUCTION
2 OBJECTIVES
3 DESCRIPTION OF THE VARIABLES
4 METHODOLOGY
5 PROBLEM DEFINITION
6 VISUALIZATION AND EXPLORATION
7 REGRESSION MODEL FITTING
8 CONCLUSION
9 LIMITATIONS
10 RECOMMENDATIONS
11 REFERENCES
1. INTRODUCTION
Life expectancy is a statistical measure of the average time an organism is expected to live, based on
the year of its birth, its current age and other demographic factors including gender. Life expectancy at
birth reflects the overall mortality level of a population. It summarizes the mortality pattern that
prevails across all age groups in a given year – children and adolescents, adults and the elderly. Global
life expectancy at birth in 2016 was 72.0 years (74.2 years for females and 69.8 years for males),
ranging from 61.2 years in the WHO African Region to 77.5 years in the WHO European Region,
giving a ratio of 1.3 between the two regions. Women live longer than men all around the world. The
gap in life expectancy between the sexes was 4.3 years in 2000 and had remained almost the same by
2016 (4.4).
Global average life expectancy increased by 5.5 years between 2000 and 2016, the fastest increase
since the 1960s. Those gains reverse declines during the 1990s, when life expectancy fell in Africa
because of the AIDS epidemic, and in Eastern Europe following the collapse of the Soviet Union. The
2000-2016 increase was greatest in the WHO African Region, where life expectancy increased by 10.3
years to 61.2 years, driven mainly by improvements in child survival, and expanded access to
antiretrovirals for treatment of HIV.
2. OBJECTIVES
1. To apply the relevant data analysis concepts taught during the coursework.
2. To identify the dependent and independent variables and to classify each variable as categorical or
continuous.
3. To visualize and explore the variables through histograms, bar plots, skew plots and descriptive
measures.
4. To identify patterns in the observations using pivot tables and the ‘group by’ function.
5. To identify which countries are doing better across variables such as life expectancy, schooling and
percentage expenditure.
6. To identify the significant independent variables affecting the dependent variables using appropriate
modelling techniques.
7. To verify whether the conditions of normality have been satisfied by the regression model.
8. To suggest appropriate methods, and apply them, if the conditions of normality have not been
satisfied.
3. DESCRIPTION OF THE VARIABLES
4. METHODOLOGY
R software has been used in our project report for the purpose of analysis. R is a language for statistical
computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation
and graphical display. It also provides a wide variety of statistical (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and
is highly extensible.
Group-by and summary functions have been used to identify patterns across the years, to identify
which countries have been performing well, and to identify whether developing or developed countries are
performing better. Descriptive statistics have been used to learn more about the variables and their
distribution in the data set.
Microsoft Excel has also been used to build pivot tables to identify patterns in the data set.
Regression is a technique used to model and analyze the relationships between variables and how they
contribute to producing a particular outcome. A linear regression is a regression model in which the
dependent variable is a linear function of the regression coefficients. We have used Simple and Multiple
Regression to check the significance of predictor variables on a dependent variable.
Stepwise regression consists of iteratively adding and removing predictors in order to find the subset
of variables in the data set that results in the best-performing model, that is, a model that lowers the
prediction error. Forward stepwise selection has been used in the project to select the significant
variables, as it helps us fit our model effectively.
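As a rough illustration of the forward selection described above, the sketch below adds one predictor at a time for as long as an AIC-style score keeps improving. It is written in Python so that it is self-contained, whereas the report's own analysis was done in R. The helper names, the toy data and the use of AIC as the selection criterion are all assumptions made for this example; the report does not state which criterion was used.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def aic(columns, y):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)


def forward_select(predictors, y):
    """Greedily add the predictor that lowers the AIC most; stop when none does."""
    selected = []
    best = aic([], y)
    while True:
        candidates = [name for name in predictors if name not in selected]
        if not candidates:
            break
        scores = {
            name: aic([predictors[s] for s in selected + [name]], y)
            for name in candidates
        }
        winner = min(scores, key=scores.get)
        if scores[winner] >= best:
            break
        selected.append(winner)
        best = scores[winner]
    return selected


# Toy data (hypothetical, not from the WHO dataset): y tracks "schooling" closely.
schooling = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [3, 1, 4, 1, 5, 9, 2, 6]
life = [52.5, 53.5, 56.3, 57.7, 60.2, 61.8, 64.4, 65.6]
selected = forward_select({"schooling": schooling, "noise": noise}, life)
print(selected)
```

On real data the same loop would run over all candidate predictors; a dedicated stepwise routine from a statistics package would normally be preferred over a hand-rolled solver like this one.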
We have also used the ANCOVA (Analysis of Covariance) regression model, since in our model
continuous variables (Adult Mortality, Total expenditure, GDP, etc.) coexist with qualitative variables
(Country, Year, Status).
5. PROBLEM DEFINITION
We have defined some base questions for our project. They are as follows:
2. What visual patterns/ trends are captured from the exploration of the data?
3. What are the predicting variables actually affecting the life expectancy?
4. How do Infant, Adult and Under-five mortality rates affect life expectancy?
6. Does Life Expectancy have a positive or negative relationship with drinking alcohol?
9. What are the variables affecting adult mortality? Do the same variables that affect life
expectancy also affect adult mortality?
The pivot tables, histograms and bar plots below provide the visual patterns/trends from the
exploration of the data. They also give us the answer to Question 2 of the Problem Definition.
TREND COUNTRIES
Uzbekistan, Venezuela (Bolivarian Republic Of), Viet Nam.
Inference:
6.1.2 STATUS AND TOTAL EXPENDITURE
Lowest General government expenditure on health as a percentage of
total government expenditure by Cook Islands = 3.58% as shown
below:
Lowest General government expenditure on health as a percentage of
total government expenditure by Singapore = 55.32% as shown
below:
Inference:
General government expenditure on health as a percentage of total government expenditure is higher for
Developed Economies than for Developing Economies.
6.1.3 STATUS AND BMI
NO STATUS BMI
1 DEVELOPED
The countries closest to normal BMI are Japan and Singapore, with values of
25.6 and 25.9 respectively, as shown below:
The country with the highest overweight BMI is Malta, with a value of
66.18, as shown in the figure above.
2 DEVELOPING
The country with the lowest underweight BMI is Saint Kitts and Nevis, with
a value of 5.2, as shown below:
Gambia = 20.3
Ghana = 21.725
Guinea-Bissau = 19.431
Indonesia = 19.956
Liberia = 19.987
Maldives = 19.293
Mauritania = 22.475
Nigeria = 19.750
Philippines = 19.187
Republic of Korea = 23.24375
Sao Tome and Principe = 20.85
Somalia = 18.6875
Thailand = 21.59375
The country with the highest overweight BMI is Nauru, with a value of 87.3,
as shown below:
Inference:
Developed Economies have overweight BMI values, which indicate cases of obesity. No developed
country has an underweight BMI, and only two developed economies have BMI values close to the
standard of 25.
A few Developing Economies have BMI values below 18.5 (underweight), which may reflect poverty.
A normal BMI indicates that the population has a healthy lifestyle. A few developing countries also
have overweight BMI values, which might be due to obesity.
6.1.4 IMMUNIZATION & LIFE EXPECTANCY
For Afghanistan, Hepatitis B immunization coverage remained nearly constant and Polio and Diphtheria
immunization coverage increased over the years, but Life Expectancy did not increase
accordingly.
Figure showing Life Expectancy of Bahamas
Inference:
Thus, it can be assumed that the effect of Immunization coverage of Hepatitis B, Polio & Diphtheria on
Life Expectancy is negligible.
6.1.5 MORTALITY RATES
The Mortality rate of each country is calculated with respect to the population at a particular year.
Adult Mortality (%) = (Adult Mortality / Population) × 100
Infant Mortality (%) = (Infant deaths / Population) × 100
Under-five Mortality (%) = (Under-five Deaths / Population) × 100
The remaining death percentage accounts for adolescent deaths (8-15 years) and deaths above 60 years.
For example, Afghanistan's Adult, Infant and Under-five Mortality have reduced over the years. In 2002,
there was no Adult Mortality.
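The percentage calculation above can be sketched in a few lines of Python (the report's own work was done in R and Excel). The function name and the figures are hypothetical and are not taken from the WHO dataset.

```python
def mortality_percent(deaths, population):
    """Deaths expressed as a percentage of the population in the same year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100


# Hypothetical figures for one country-year, used only to show the arithmetic.
example = mortality_percent(deaths=250, population=10_000)
print(f"{example:.2f}%")
```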
6.2 USING R, HISTOGRAM AND BAR PLOTS
Skewness is a measure of the asymmetry of the probability distribution about its mean. There are 2 types
of Skewness.
Negatively skewed: The left tail is longer, the mass of the distribution is concentrated on the right of the
figure.
Positively skewed: The right tail is longer, the mass of the distribution is concentrated on the left of the
figure.
Of all the variables, Population has the highest positive skewness, so its mean lies furthest to the
right of the peak compared with the other variables.
Polio has the highest negative skewness, so its mean lies furthest to the left of the peak compared
with the other variables.
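To make the sign convention concrete, here is a minimal Python sketch of a moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The report does not say which skewness formula its R function used, so this particular formula, the function name and the toy data are assumptions.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population (divide-by-n) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Toy samples (hypothetical): one with a long right tail, one with a long left tail.
right_tailed = [1, 1, 2, 2, 3, 20]
left_tailed = [1, 18, 19, 19, 20, 20]
print(skewness(right_tailed) > 0, skewness(left_tailed) < 0)
```

A long right tail pulls the mean above the bulk of the data and gives a positive value; a long left tail does the opposite.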
The Life expectancy histogram has its maximum frequency in the 71-75 age range, with 796 observations.
The second-highest peak is in the 76-80 range, with 571 observations.
Infant deaths have their highest frequency in the 0-200 range across years and countries, with a
frequency of 2004. An infant death count of 0 has a frequency of 848.
Adult mortality has its highest frequency in the range up to 100, with a frequency of 1068. No
observation has an Adult mortality value of zero.
6.2.4 BARPLOTS
There are 512 observations for developed countries and 2426 observations for developing countries in the dataset.
There are 183 data points for all years except for the year 2013 which has 193 data points.
Here we group the quantitative variable (Life expectancy) by the qualitative variable (Status) and
find the mean life expectancy of Developed and Developing countries. The developed
countries have a higher mean Life expectancy (79.19785) than the developing countries (67.11147).
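The grouping step can be sketched as follows, in Python rather than the R used for the report. The function name and the four rows are hypothetical stand-ins; only the column names "Status" and "Life expectancy" come from the report.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical rows standing in for country-year observations.
rows = [
    {"Status": "Developed", "Life expectancy": 80.0},
    {"Status": "Developed", "Life expectancy": 78.0},
    {"Status": "Developing", "Life expectancy": 66.0},
    {"Status": "Developing", "Life expectancy": 68.0},
]
means = mean_by_group(rows, "Status", "Life expectancy")
print(means)
```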
The Life Expectancy data range of 70-75 has the highest frequency (close to 800) i.e., there are close to
800 data points of Life expectancy in the data set for which the values of Life expectancy lie in the range
70-75 years.
As far as skewness is concerned, this is a left (negatively) skewed histogram i.e. peak of the histogram
veers to the right.
For Developing countries, the Life Expectancy range of 72-74 has the highest frequency (close to
350), i.e. there are close to 350 data points in the data set for which Life expectancy lies in the
range 72-74 years. This is a left (negatively) skewed histogram, i.e. the peak of the histogram
veers to the right.
For Developed countries, the Life Expectancy range of 81-82 has the highest frequency (close to 80),
i.e. there are close to 80 data points in the data set for which Life expectancy lies in the
range 81-82 years.
So, compared with Developing countries, the Developed countries have their highest frequency in a
higher range of Life expectancy.
Thus, observations in the higher range of Life expectancy (the average period for which a person may
expect to live) are more frequent for Developed countries, i.e. residents of Developed countries live
for a greater number of years.
This is a left (negatively) skewed histogram i.e. peak of the histogram veers to the right.
The Descriptive Statistics and Group By provided us with all the insights obtained from the variables.
This also gives us the Answer to Question 1 of Problem Definition.
Grouping by Year:
1. Average Life expectancy for most part of the dataset has been on the rise.
2. Average Adult Mortality and Average Infant Mortality have no definite pattern. For some part of
the period they decrease, while towards the end they increase again. This has very little effect on
life expectancy.
3. Alcohol consumption has been decreasing year on year.
4. General Govt Expenditure on health as a part of GDP has been increasing year on year across
countries.
5. Death due to HIV in children is decreasing.
6. Avg thinness is increasing.
7. Number of years of schooling is increasing.
Grouping by countries:
1. France has the highest average life expectancy. Lesotho has the lowest average life expectancy.
2. Tunisia has the lowest average adult mortality. Lesotho has highest adult mortality.
3. Austria, Belize, Bosnia and Herzegovina, Cabo Verde, Croatia, Cyprus, Estonia and Fiji have zero
infant deaths.
4. Bangladesh and Equatorial Guinea have the lowest alcohol consumption. Belarus has the highest
alcohol consumption.
5. Eritrea has the lowest average percentage expenditure. Australia has the highest average
percentage expenditure.
6. Fiji has the highest average number of hepatitis b cases. Equatorial Guinea has the lowest average
number of hepatitis b cases.
Grouping by countries:
1. Greece has the highest average BMI. Rwanda has the lowest average BMI.
2. Brazil has the highest polio immunization coverage. Equatorial Guinea has the lowest polio
immunization coverage.
3. Ireland has the lowest average thinness among 10-19 years. India has the highest average thinness
among countries.
4. Tonga has the lowest average thinness among 5-9 years. India has the highest average thinness
among 5-9 years.
5. Australia has the highest schooling. Eritrea has the lowest schooling.
Grouping by Status:
1. As expected, developed countries have higher average life expectancy, lower average adult
mortality, lower average infant mortality, higher average percentage expenditure, lower average
measles cases, higher average BMI, lower average under-5 deaths, higher average polio coverage,
lower average prevalence of thinness among 5-9 and 10-19 years, and higher average schooling.
Descriptive Statistics:
Summary Stats:
Quartiles measure the spread of values above and below the median by dividing the distribution into
four groups. Three cut points (the lower quartile, the median and the upper quartile) divide the
data set into four groups.
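A small Python sketch of the three quartile cut points, using the standard library's `statistics.quantiles`. The seven values are hypothetical. Statistical packages use different interpolation conventions, so the quartiles reported by R's summary output for the same data may differ slightly from these.

```python
from statistics import median, quantiles

# Hypothetical sample; the three cut points split it into four groups.
values = [1, 2, 3, 4, 5, 6, 7]
q1, q2, q3 = quantiles(values, n=4)
print(q1, q2, q3, median(values))
```

The middle cut point coincides with the median, which is why quartiles describe spread around the median rather than the mean.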
7. REGRESSION MODEL FITTING
Identifying the significant variables by stepwise regression (Answer to Question 3 of Problem
Definition)
OBJECTIVE:
● To find out the relationship (positive or negative) between Life expectancy and X variables.
Life Expectancy regressed on all the above significant variables (Overall/Original Multiple
regression Model)
Regression Equation:
There will be 192 dummy variables for Country, and so 192 regression coefficients for Country (since
there are 193 countries in the dataset). Similarly, since there are 16 years in the dataset, there will
be 15 dummy variables for Year and 15 regression coefficients for Year.
Writing out all the dummies would be cumbersome. For Country, the terms in the regression equation are
β1C1 + β2C2 + β3C3 + … + β192C192 (here C1, C2, …, C192 are the dummy variables for the countries), so we
have written them as B1(Country).
For Year, the terms are β193Y193 + β194Y194 + β195Y195 + … + β207Y207 (here all the Y’s are dummy
variables for the years), so we have written them as B2(Year).
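The dummy-variable count described above (one fewer dummy than categories, with the omitted category acting as the benchmark) can be sketched in Python as follows; the report itself relied on R to create the dummies. The function name is hypothetical, and taking the first level in sorted order as the benchmark is an assumption that happens to match the report's benchmarks (Afghanistan and 2000).

```python
def dummy_columns(values):
    """Return (benchmark level, {level: 0/1 column}) with one dummy per non-benchmark level."""
    levels = sorted(set(values))
    benchmark, others = levels[0], levels[1:]
    columns = {
        level: [1 if v == level else 0 for v in values]
        for level in others
    }
    return benchmark, columns


# The 16 years 2000..2015 give 15 dummy columns, with 2000 as the benchmark.
years = list(range(2000, 2016))
benchmark, columns = dummy_columns(years)
print(benchmark, len(columns))
```

Each dummy coefficient is then read as a difference from the benchmark category, which is how the Year and Country coefficients are interpreted in the rest of this section.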
Value of R^2 (96.37%) is high in this regression model. So, the model is a good fit.
H0: β1 = β2 = ... = βk = 0, i.e. none of the variables belongs to the model and we do not have a good
model for prediction.
Ha: At least one β is not 0, i.e. at least one variable belongs to the model and we have a good model
for prediction.
Since the p value of the overall model (<2.2e-16) is less than 0.05, we reject the null hypothesis that
we do not have a good model for prediction. So, the overall model is a significant predictor of Life expectancy.
Null Hypothesis: H0: β1 = 0, i.e. the variable does not belong to the model and we do not have a good
model for prediction.
Alternative Hypothesis: Ha: β1 ≠ 0, i.e. the variable belongs to the model and we have a good model for
prediction.
Since the p value of every predictor variable except Income composition of resources is <0.05, we reject
the null hypothesis for all of them. So, all predictors except Income composition of resources are
significant predictors of Life expectancy.
Among the quantitative variables, the coefficient of Schooling is the highest (1.5e-01), i.e. with a 1 unit
increase in Schooling, Life expectancy increases by 1.5e-01 units. The coefficient
of HIV/AIDS is the lowest (-3.077e-01), i.e. with a 1 unit increase in HIV/AIDS, Life expectancy decreases
by 3.077e-01 units.
For Year:
Only for the years 2001, 2002 and 2003 is the p value>0.05, so there is no significant difference between the
average Life expectancies of 2000 & 2001, 2000 & 2002 and 2000 & 2003. For each of the remaining years,
there is a significant difference between its average Life expectancy and that of 2000.
All the coefficients for years are positive, i.e. the average life expectancy in every year is higher than
the average life expectancy of 2000. The coefficient for 2015 is the highest (6.241), i.e. the average life
expectancy of 2015 is 6.241 units more than the average life expectancy of 2000. Similarly, the
coefficient for 2000 is lowest (2.501e-01).
For Country:
For the countries, Benin, Burkina Faso, Burundi, Cameroon, Equatorial Guinea, Guinea, Guinea-Bissau,
Liberia, Mali, Mozambique, Togo, Zambia and Zimbabwe, the p value>0.05. So there is no significant
difference between the average Life expectancies between Afghanistan and each of these pairwise. For
the rest of the countries where p value<0.05, there is a significant difference between the average Life
expectancies between Afghanistan and each of them pairwise.
The coefficient for Solomon Islands is the highest (9.751), i.e. the average life expectancy of Solomon Islands
is 9.751 units more than the average life expectancy of Afghanistan. Similarly, the coefficient for
Angola is the lowest (-6.513), i.e. the average life expectancy of Afghanistan is 6.513 units more than the
average life expectancy of Angola.
Checking the plots for Linearity of the model:
3/n = 3/2938 ≈ 0.001 (n = number of data points in the dataset)
Since lots of data points are > 0.001 in the above plot, no outliers are present.
Checking for Multicollinearity:
Auxiliary regression model of infant deaths regressed on the rest of the X variables
The R^2=99.44% >96.37% (R^2 of original model). So infant deaths is a source of multicollinearity. So,
we can remove it from the final regression model.
Auxiliary regression model of under-five deaths regressed on the rest of the X variables
The R^2=99.45% >96.37% (R^2 of original model). So, under-five deaths is a source of multicollinearity.
So, we can remove it from the final regression model.
VIFs for infant deaths and under-five deaths are >10. So, they are sources of multicollinearity. So, we can
remove them from the final regression model.
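The link between the auxiliary R^2 values and the VIFs can be checked with the standard identity VIF = 1 / (1 - R^2), where R^2 is the auxiliary regression's R^2. The Python sketch below (the report's own computations were done in R) plugs in the two auxiliary R^2 values quoted above; the function name is hypothetical.

```python
def vif_from_r2(r_squared):
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1 / (1 - r_squared)


# Auxiliary R^2 values quoted in the report for the two mortality counts.
for name, r2 in [("infant deaths", 0.9944), ("under-five deaths", 0.9945)]:
    print(name, round(vif_from_r2(r2), 1))
```

An auxiliary R^2 of 0.90 already corresponds to a VIF of 10, so values above 0.99 sit far beyond the usual rule-of-thumb threshold.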
Regression Equation:
R^2=96.27% slightly less than original regression model (96.37%), but the model is still a good fit. The
overall model after removing Multicollinearity is significant as p value of this model <2.2e-16 which is
< 0.05. All variables except Income Composition of resources and Schooling are significant predictor
variables of Life expectancy (In the original regression model, all variables except Income Composition
of resources were significant predictor variables of Life expectancy).
Among the quantitative variables, the coefficient of Schooling is the highest (1.726e-01), i.e. with a 1 unit
increase in Schooling, Life expectancy increases by 1.726e-01 units. The
coefficient of HIV/AIDS is the lowest (-3.15e-01), i.e. with a 1 unit increase in HIV/AIDS, Life
expectancy decreases by 3.15e-01 units. This is similar to the overall multiple regression model, where
the coefficient of Schooling is also the highest and the coefficient of HIV/AIDS is the lowest.
Regression using the log transformation of the model after removing Multicollinearity
Regression Equation:
The value of R^2 is 96.06%. It has reduced compared with the model before transformation. Still, the
model is a good fit.
All the variables except Income composition of resources are significant predictor variables of Life
Expectancy: for every predictor variable other than Income composition of resources, the p value is <0.05.
This is the same as in the original regression model.
Since there is no significant improvement in R^2 and no improvement in normality after the log
transformation, the log transformation is ruled out.
Regression Equation:
The intercept value of 58.1937 is the average Life expectancy of benchmark category Country
Afghanistan.
For the countries, Benin, Congo, Ethiopia, Gambia, Haiti, Kenya, Liberia, Niger, Rwanda, South
Africa and Togo, the p value>0.05. So, there is no significant difference between the average Life
expectancies between Afghanistan and each of these pairwise. For the rest of the countries where p
value<0.05, there is a significant difference between the average Life expectancies between Afghanistan
and each of them pairwise.
This result differs from that of the overall regression model: only 3 of the non-significant countries
are common to the individual regression on Country and the overall regression model, namely
Benin, Liberia and Togo.
The coefficient for Japan is the highest (24.3437), i.e. the average life expectancy of Japan is 24.3437
units more than the average life expectancy of Afghanistan. Similarly, the coefficient for Sierra Leone is
the lowest (-12.0812), i.e. the average life expectancy of Afghanistan is 12.0812 units more than the
average life expectancy of Sierra Leone.
The result is different as compared to the result in the overall regression model. In the overall regression
model, the coefficient for Solomon Islands is highest and the coefficient for Angola is lowest.
First important insight seen is most of the African countries have life expectancies lower than that
of benchmark country (Afghanistan) i.e. negative coefficients. So, it can be concluded that most of
the African countries are worse off as far as Life expectancies are concerned as compared to other
continents of the world.
Second important insight seen is most of the European countries have life expectancies higher
(more than 15 units) than that of benchmark country (Afghanistan) i.e. positive (>15) coefficients.
So, it can be concluded that most of the European countries are better off as far as Life
expectancies are concerned as compared to other continents of the world.
7.1.2 LIFE EXPECTANCY VS YEAR
Regression Equation:
The intercept value of 66.7503 is the average Life expectancy in the benchmark category Year 2000.
This result differs from that of the overall regression model: only 3 of the non-significant years are
common to the individual regression on Year and the overall regression model, namely
2001, 2002 and 2003.
All the coefficients for years are positive, i.e. the average life expectancy in every year is higher than
the average life expectancy of 2000. The coefficient for 2015 is the highest (4.8667), i.e. the average life
expectancy of 2015 is 4.8667 units more than the average life expectancy of 2000. Similarly, the
coefficient for 2000 is lowest (0.3787).
For the overall regression model also, the coefficient for 2015 is the highest and for 2000 is the lowest
(same as the individual regression on Year).
Another important insight we can get is that the life expectancies increase from the year 2001 to
2015. So, it can be concluded that life expectancies are increasing by time (year on year).
Regression Equation:
The average value of Life expectancy when there is no Alcohol consumption is 64.763 units (intercept).
R^2 is 16.39% (low). P value<0.05, so Alcohol is a significant predictor variable of Life Expectancy,
as in the overall multiple regression model.
With a one unit increase in alcohol consumption, Life expectancy increases by 0.95 units. Hence Life
expectancy has a positive relationship with drinking alcohol. In the overall multiple regression model,
however, Alcohol has a negative coefficient, i.e. a negative relationship with Life Expectancy.
7.1.4.2 Correlation of Drinking Alcohol with Life Expectancy
The correlation is positive i.e. 0.404 (slightly more than 0.4). The correlation is moderate as it falls
between 0.4 and 0.7.
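In a simple regression with a single predictor, R^2 equals the square of the Pearson correlation, so the correlation of about 0.404 and the R^2 of 16.39% reported for Alcohol describe the same relationship. The Python sketch below (the report used R) defines a hypothetical `pearson` helper, exercises it on made-up data and then checks the reported figures against each other.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data with a perfect linear relationship, just to exercise the function.
perfect = pearson([1, 2, 3, 4], [2, 4, 6, 8])

# Figures quoted in the report: r of about 0.404 and a simple-regression R^2 of 16.39%.
reported_r = 0.404
implied_r_squared = reported_r ** 2
print(perfect, implied_r_squared)
```

The small gap between the implied value and 16.39% is consistent with the correlation having been rounded to three decimals.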
Regression Equation:
R^2 is 2.4% (very low). P value<0.05, so Measles is a significant predictor variable of Life Expectancy,
as in the overall multiple regression model.
7.1.6 LIFE EXPECTANCY VS TOTAL EXPENDITURE
Regression Equation:
R^2 is 4.7% (very low). P value<0.05, so Total expenditure is a significant predictor variable of Life
Expectancy, as in the overall multiple regression model.
Regression Equation:
P value<0.05, so HIV/AIDS is a significant predictor variable of Life Expectancy, as in the overall
multiple regression model. With a one unit increase in HIV/AIDS, Life expectancy decreases by 1.04
units (the regression coefficient of HIV/AIDS is -1.04). Hence Life expectancy has a negative
relationship with HIV/AIDS.
In the overall multiple regression model as well, HIV/AIDS has a negative regression coefficient i.e.
negative relationship with Life Expectancy.
Regression Equation:
R^2 is 22.77% (low). P value<0.05, so Thinness 1-19 years is a significant predictor variable of Life
Expectancy, as in the overall multiple regression model.
Regression Equation:
R^2 is 52.53% (moderate). P value<0.05, so Income composition of resources, which was insignificant in
the overall multiple regression model, is a significant predictor variable of Life Expectancy
in this model.
7.1.11 LIFE EXPECTANCY VS SCHOOLING (Answer to Question 5 of Problem Definition)
Regression Equation:
With one unit increase in Schooling, the Life expectancy increases by 2.103 units. Hence Life expectancy
has a positive relationship with Schooling.
In the overall multiple regression model as well, Schooling has a positive coefficient i.e. positive
relationship with Life Expectancy.
There are immunization coverages for 3 diseases as per the description of the dataset: Hepatitis B, Polio
and Diphtheria.
Regression Equation:
The average value of Life expectancy when Hepatitis B, Polio and Diphtheria immunization
coverages are all zero is 54.794 units (intercept).
The p values of Polio and Diphtheria are <0.05, so Polio and Diphtheria immunization coverages are
significant predictor variables of Life Expectancy.
But the p value of Hepatitis B is >0.05, i.e. Hepatitis B immunization coverage is not a significant
predictor of Life Expectancy.
With a one unit increase in Polio immunization coverage, Life expectancy increases by 0.08 units; with
a one unit increase in Diphtheria immunization coverage, Life expectancy increases by 0.09 units
(highest). The corresponding figure for Hepatitis B is 0.003 units (lowest).
So, we can conclude that an increase in Diphtheria immunization coverage has the greatest positive impact
on Life expectancy among these 3 immunization coverages. Life expectancy has a positive relationship with
all of them: Hepatitis B, Polio and Diphtheria immunization coverages.
If we check their individual effect on Life Expectancy:
Hepatitis B:
Regression Equation:
Polio:
Regression Equation:
P value<0.05. So, Polio immunization coverage is a significant predictor variable of Life Expectancy
when considered as the only predictor variable in the regression of Life expectancy, as in the regression
model where Hepatitis B, Polio and Diphtheria immunization coverages were all considered together.
Diphtheria:
Regression Equation:
Regression Equation:
P value >0.05. So, Population is not a significant predictor variable of Life Expectancy. With a one unit
increase in Population, Life expectancy decreases by 3.468e-09 units. Hence Life expectancy has a
negative relationship with Population, but since this relationship is not statistically significant, we
cannot conclude that a higher Population lowers a country's Life expectancy.
Regression Equation:
P value <0.05. So, GDP is a significant predictor variable of Life Expectancy. With one unit increase in
GDP, the Life expectancy increases by (3.117e-04) units. Hence Life expectancy has a positive
relationship with GDP. So, we can conclude, higher the GDP of a country, higher is its Life expectancy.
Regression Equation:
Life Expectancy= 77.883 – 0.049 (Adult Mortality) + 0.185 (Infant deaths) - 0.145 (under-five deaths)
P value of Infant, Adult and under-five mortality rates all are <0.05.
So, Infant, Adult and under-five mortality rates all are significant predictor variables of Life Expectancy.
With a one unit increase in Adult mortality rate, Life expectancy decreases by 0.049 units; with a one
unit increase in infant mortality rate, Life expectancy increases by 0.185 units (highest); with a one
unit increase in under-five mortality rate, Life expectancy decreases by 0.145 units (lowest).
So, we can conclude that an increase in infant mortality rate has the greatest positive impact on Life
expectancy among these 3 mortality rates.
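The fitted equation above can be evaluated directly. The Python sketch below (the model itself was fitted in R) uses the coefficients from the report's regression equation; the function name and the input values are hypothetical and serve only to show how each term shifts the prediction.

```python
def predict_life_expectancy(adult_mortality, infant_deaths, under_five_deaths):
    """Fitted value from the report's three-predictor mortality equation."""
    return (
        77.883
        - 0.049 * adult_mortality
        + 0.185 * infant_deaths
        - 0.145 * under_five_deaths
    )


# With all predictors at zero the prediction equals the intercept.
baseline = predict_life_expectancy(0, 0, 0)

# Hypothetical inputs: each term moves the prediction by coefficient times value.
example = predict_life_expectancy(adult_mortality=100, infant_deaths=10, under_five_deaths=12)
print(baseline, round(example, 3))
```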
Adult Mortality:
Regression Equation:
P value<0.05. So Adult mortality is a significant predictor variable of Life Expectancy when considered as
the only predictor variable in the regression of Life expectancy, as in the regression model where
Infant, Adult and under-five mortality rates were all considered together.
Infant deaths:
Regression Equation:
P value<0.05. So, Infant deaths is a significant predictor variable of Life Expectancy when considered as
the only predictor variable in the regression of Life expectancy, as in the regression model where
Infant, Adult and under-five mortality rates were all considered together.
Under-five deaths:
Regression Equation:
P value<0.05. So, under-five deaths is a significant predictor variable of Life Expectancy when considered
as the only predictor variable in the regression of Life expectancy, as in the regression model where
Infant, Adult and under-five mortality rates were all considered together.
INFERENCE FROM THIS MODEL
Thus, we can conclude that taking Life Expectancy as Y and regressing it on the rest of the
significant X variables (obtained after stepwise regression) that significantly affect Life
Expectancy after removing multicollinearity resulted in a good-fit regression model as it has very
high R^2 value. We are also able to answer the last 6 questions of the problem definition from
this model.
OBJECTIVE: To check whether the issue of Adult Mortality can be addressed by changes in the other X
variables that significantly influence Life Expectancy
Residuals:
Regression Equation:
With a model R square of 0.60, it can be concluded that the model is not a very good fit. The reason might
be a non-linear relationship between Y and one or more of the X's. To check that, the following plots were obtained.
From the above plots, it is clear that more than one condition of linearity is violated.
                                        GVIF  Df GVIF^(1/(2*Df))
Country                         5.645015e+06 170        1.046786
Income.composition.of.resources 6.271597e+00   1        2.504316
Schooling                       1.587135e+01   1        3.983886
Year                            2.459652e+00  15        1.030455
HIV.AIDS                        5.407505e+00   1        2.325404
Diphtheria                      2.033893e+00   1        1.426146
Measles                         2.150863e+00   1        1.466582
Alcohol                         1.108622e+01   1        3.329598
under.five.deaths               1.499765e+03   1       38.726797
infant.deaths                   1.604659e+03   1       40.058195
Total.expenditure               2.348768e+00   1        1.532569
thinness..1.19.years            7.208859e+00   1        2.684932
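The last column of the table is derived from the first two: it is GVIF raised to the power 1/(2*Df), which makes terms with different degrees of freedom (such as the 170 Country dummies) comparable with single-coefficient terms. The Python sketch below reproduces three rows of the table; the function name is hypothetical, and the GVIF and Df values are copied from the output above.

```python
def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)), the degrees-of-freedom-adjusted value shown in the last column."""
    return gvif ** (1 / (2 * df))


# (GVIF, Df, reported adjusted value) taken from the table above.
rows = {
    "Country": (5.645015e06, 170, 1.046786),
    "Schooling": (1.587135e01, 1, 3.983886),
    "under.five.deaths": (1.499765e03, 1, 38.726797),
}
for name, (gvif, df, reported) in rows.items():
    print(name, round(adjusted_gvif(gvif, df), 6), reported)
```

For a term with Df = 1 the adjusted value is simply the square root of the VIF, so the familiar VIF threshold of 10 corresponds to an adjusted value of about 3.16.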
Thus we can conclude that infant deaths and under-five deaths are by far the strongest sources of
multicollinearity (GVIF of about 1,600 and 1,500 respectively, well above the threshold of 10).
Fitting the same model without these two variables gives the following output.
Residuals:
Regression Equation:
Here in case of country, Afghanistan has been taken as the benchmark category and in case of year, 2000
has been taken as the benchmark category.
The coefficients in the above model can be interpreted as the change in Adult Mortality that results from a
unit change in the corresponding X variable, holding the other variables constant.
Even after removing the multicollinear variables, R square did not improve. A log transformation of
the above model was tried next.
Residuals:
Regression Equation:
After the log transformation, R square fell further to 0.34. So this multiple regression model
is definitely not a good fit.
7.2.2 ADULT MORTALITY VS COUNTRY
Residuals:
Regression Equation:
Here again Afghanistan has been taken as the benchmark category. The intercept is the estimated Adult
Mortality of Afghanistan, and each country coefficient is the differential of that country's Adult Mortality
with respect to that of Afghanistan.
Tunisia has the largest coefficient in absolute value (-250) and Benin has the smallest (0.3).
Countries that were insignificant in the multiple regression model:
Angola, Belarus, Benin, Bhutan, Burkina Faso, Burundi, Cameroon, Central African Republic, Comoros,
Congo, Djibouti, Equatorial Guinea, Eritrea, Ethiopia, Gambia, Guinea, Kazakhstan, Kenya, Liberia,
Madagascar, Mongolia, Namibia, Niger, Nigeria, Papua New Guinea, Philippines, Russian Federation,
Rwanda, South Africa, Togo, Turkmenistan, Uganda, Ukraine, Yemen, Zambia
Countries that were insignificant in the individual model (Adult Mortality versus Country):
Belarus, Benin, Bhutan, Burkina Faso, Burundi, Cameroon, Comoros, Congo, Djibouti, Equatorial
Guinea, Eritrea, Ethiopia, Gambia, Guinea, Kazakhstan, Liberia, Madagascar, Mongolia, Namibia, Niger,
Nigeria, Papua New Guinea, Philippines, Russian Federation, Rwanda, Togo, Turkmenistan, Uganda,
Chad, Gabon, Mozambique, Somalia, Sudan, United Republic of Tanzania.
With an R square value of 0.59, it can be concluded that the model is not a good fit.
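The benchmark-category reading used in this model can be illustrated with a small example. When a regression has a single categorical predictor coded with dummy variables, the least-squares intercept equals the mean of the benchmark group and each coefficient equals the difference between a group's mean and the benchmark mean. The sketch below is in Python, and the three groups and their values are hypothetical numbers chosen only for illustration; they are not taken from the WHO data.

```python
from statistics import mean

# Hypothetical adult-mortality values for three made-up groups.
# Group "A" plays the role of the benchmark (as Afghanistan does in the report).
data = {
    "A": [260.0, 280.0, 270.0],
    "B": [100.0, 120.0, 110.0],
    "C": [300.0, 310.0, 320.0],
}
benchmark = "A"

# With one dummy-coded categorical predictor, OLS reproduces the group means:
# the intercept is the benchmark mean ...
intercept = mean(data[benchmark])

# ... and each coefficient is that group's mean minus the benchmark mean.
coefficients = {
    group: mean(values) - intercept
    for group, values in data.items()
    if group != benchmark
}

print("Intercept (benchmark mean):", intercept)
for group, diff in coefficients.items():
    print(f"Coefficient for {group}: {diff:+.1f} relative to {benchmark}")
```

A negative coefficient therefore means lower Adult Mortality than the benchmark, and a coefficient close to zero means a level close to that of the benchmark.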
7.2.3 ADULT MORTALITY VS INCOME COMPOSITION OF RESOURCES
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330.372 6.522 50.66 <2e-16 ***
Income.composition.of.resources -266.696 9.853 -27.07 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 109.3 on 2766 degrees of freedom
(170 observations deleted due to missingness)
Here we can see that Income composition of resources, which was insignificant in the multiple regression
model, has now become significant.
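The t values in the coefficient table are the estimates divided by their standard errors. The sketch below, in Python, recomputes them from the printed figures and adds an approximate 95% interval for the slope. The interval uses the normal multiplier 1.96, which is an assumption that is reasonable here only because the residual degrees of freedom (2766) are large; it is not part of the original output.

```python
# (estimate, standard error) pairs copied from the coefficient table above.
coefficient_table = {
    "(Intercept)": (330.372, 6.522),
    "Income.composition.of.resources": (-266.696, 9.853),
}

Z_95 = 1.96  # assumption: normal approximation to the t distribution

results = {}
for name, (estimate, std_error) in coefficient_table.items():
    t_value = estimate / std_error
    lower = estimate - Z_95 * std_error
    upper = estimate + Z_95 * std_error
    results[name] = (t_value, lower, upper)
    print(f"{name:33s} t = {t_value:7.2f}  approx. 95% CI = ({lower:.1f}, {upper:.1f})")
```

The interval for the slope lies entirely below zero, which agrees with the very small p-value reported in the table.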
7.2.4 ADULT MORTALITY VS SCHOOLING
Residuals:
Coefficients:
Regression Equation: Adult Mortality = 363.4750 -16.7033 (Schooling)
Schooling, which was insignificant in the multiple regression model, has now become significant in this
model.
The R square of the model is very low (0.2), showing that the model is not a good fit.
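The regression equation for Schooling can be used to read off fitted values. The sketch below, in Python, evaluates the equation for a few schooling levels; the chosen levels (5, 10 and 15 years) are hypothetical inputs used only for illustration, and with an R square of about 0.2 the fitted values should be treated as rough averages rather than predictions for any one country.

```python
# Coefficients copied from the regression equation above.
INTERCEPT = 363.4750
SLOPE = -16.7033


def fitted_adult_mortality(schooling_years):
    """Fitted Adult Mortality for a given number of years of schooling."""
    return INTERCEPT + SLOPE * schooling_years


# Hypothetical schooling levels, used only to illustrate the equation.
for years in (5, 10, 15):
    print(f"Schooling = {years:2d} years -> fitted Adult Mortality = "
          f"{fitted_adult_mortality(years):.2f}")

# Each additional year of schooling changes the fitted value by the slope.
one_year_effect = fitted_adult_mortality(11) - fitted_adult_mortality(10)
print(f"Effect of one extra year: {one_year_effect:.4f}")
```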
7.2.5 ADULT MORTALITY VS YEAR
Residuals:
Regression Equation:
Here 2000 has been taken as the benchmark year. The intercept is the estimated Adult Mortality for 2000, and
each year coefficient is the differential of that year's Adult Mortality with respect to 2000.
The coefficient of 2014 is the largest in absolute value (-32.78) and that of 2004 is the smallest (4.78).
In the previous multiple regression model, only the year 2012 was significant, whereas in this individual
model, the years 2012, 2013, 2014, and 2015 are significant.
R square is extremely low (0.0083), suggesting that this model is a complete misfit.
7.2.6 ADULT MORTALITY VS HIV/AIDS
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 142.4217 2.0693 68.82 <2e-16 ***
HIV.AIDS 12.8023 0.3849 33.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 105.9 on 2926 degrees of freedom
HIV/AIDS was significant in both the multiple and the individual regression models.
The R square of this individual model is very low (0.27), suggesting it is not a good fit.
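In a regression with a single predictor, R square can be recovered from the slope's t value and the residual degrees of freedom through the identity R^2 = t^2 / (t^2 + df). The sketch below, in Python, applies it to the HIV/AIDS output above as a consistency check on the reported value of 0.27. The function name is illustrative only.

```python
def r_squared_from_t(t_value, residual_df):
    """R square of a one-predictor regression from the slope's t value.

    Uses the identity F = t^2 = (R^2 / (1 - R^2)) * df, solved for R^2.
    """
    t_sq = t_value ** 2
    return t_sq / (t_sq + residual_df)


# Values copied from the HIV.AIDS output above.
hiv_t_value = 33.26
hiv_residual_df = 2926

hiv_r_squared = r_squared_from_t(hiv_t_value, hiv_residual_df)
print(f"R square implied by t = {hiv_t_value} on {hiv_residual_df} df: "
      f"{hiv_r_squared:.3f}")
```

The same check also shows why a predictor can be highly significant and still explain little: with thousands of observations, a large t value is compatible with a small R square.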
7.2.7 ADULT MORTALITY VS DIPHTHERIA
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 282.56485 7.99022 35.36 <2e-16 ***
Diphtheria -1.43915 0.09327 -15.43 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Diphtheria was previously insignificant in the multiple model, but now has become significant in the
individual model.
R square value of this individual model is extremely low (0.075), suggesting that this model is a
complete misfit.
7.2.8 ADULT MORTALITY VS MEASLES
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.640e+02 2.347e+00 69.866 <2e-16 ***
Measles 3.374e-04 2.000e-04 1.687 0.0917 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Measles was significant in the multiple regression model, but is insignificant in the individual model.
The R square value of this individual model is negligible (close to zero), suggesting this model is a
complete misfit.
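The significance decision for Measles follows directly from the legend printed under each coefficient table. The sketch below, in Python, maps a p-value to the corresponding code and applies the 5% level used throughout this report. The function name is illustrative, and the handling of p-values that fall exactly on a cut-off is an assumption.

```python
def significance_code(p_value):
    """Map a p-value to the code in the legend:
    0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    """
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    if p_value <= 0.1:
        return "."
    return " "


ALPHA = 0.05  # significance level used in the report

# p-value copied from the Measles output above.
measles_p = 0.0917
measles_code = significance_code(measles_p)
measles_significant = measles_p < ALPHA

print(f"Measles: p = {measles_p}, code = '{measles_code}', "
      f"significant at 5%: {measles_significant}")
```

Measles therefore carries the '.' code: it would pass at the 10% level but not at the 5% level.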
7.2.9 ADULT MORTALITY VS ALCOHOL
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 192.4121 3.5622 54.02 <2e-16 ***
Alcohol -6.0573 0.5802 -10.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 122.9 on 2733 degrees of freedom
Regression Equation: Adult Mortality = 192.4121 -6.0573 (Alcohol)
Alcohol was insignificant in the multiple regression model, but has become significant in the individual
model.
R square value of this individual model is extremely low (0.038), suggesting that this model is a complete
misfit.
7.2.10 ADULT MORTALITY VS TOTAL EXPENDITURE
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 198.5707 6.2086 31.98 < 2e-16 ***
Total.expenditure -5.8237 0.9657 -6.03 1.86e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 124.6 on 2700 degrees of freedom
Total expenditure was insignificant in the multiple regression model, but has become significant in the
individual model.
R square value of this individual model is extremely low (0.013), suggesting that this model is a complete
misfit.
7.2.11 ADULT MORTALITY VS THINNESS-1-19 YEARS
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.2023 3.2579 37.51 <2e-16 ***
thinness..1.19.years 8.4884 0.4964 17.10 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 118.1 on 2894 degrees of freedom
R square value of this individual model is extremely low (0.092), suggesting that this model is a complete
misfit.
Thus we can conclude that regressing Adult Mortality (Y) on the X variables that significantly affect
Life Expectancy was a wrong decision, as none of the variables could adequately explain Adult Mortality
when taken individually.
However, when taken together, the overall model could somewhat explain Adult Mortality (R square of about 0.60).
8. CONCLUSION
The Life Expectancy data set has been studied with the help of R and Microsoft Excel. Dependent and
independent variables have been identified. ‘Life Expectancy’ depends on 13 significant independent
variables, which form the basis for improving the ‘Life Expectancy’ of a country. Similarly, Adult Mortality
depends on 10 independent variables that explain it.
1. Developed economies have BMI values in the overweight range, which points to cases of obesity. For most
developing economies, BMI is in the normal range, which suggests that the population has a healthy lifestyle.
2. Population has the highest positive skewness and Polio has highest negative skewness.
3. The developed countries have a higher mean Life Expectancy (79.19785) than the developing countries
(67.11147).
4. The Life Expectancy data range of 70-75 has the highest frequency (close to 800).
5. Average Adult Mortality and Average Infant Mortality have no definite pattern.
6. General Govt Expenditure on health as a part of GDP has been increasing year on year across
countries.
7. As expected, developed countries have higher average life expectancy, lower average adult mortality,
lower average infant mortality, higher average percentage expenditure, lower average measles cases,
higher average BMI, lower average under-five deaths, higher average polio coverage, lower average
prevalence of thinness among 5-9 and 10-19 year-olds, and higher average schooling.
8. Adult mortality, infant deaths, alcohol, percentage expenditure, measles, under-five deaths, HIV/AIDS,
GDP, thinness 10-19 years and thinness 5-9 years show a large deviation from the mean.
9. Most African countries are worse off in terms of life expectancy than countries on
other continents.
10. Most European countries are better off in terms of life expectancy than countries on
other continents.
9. LIMITATIONS
There are about 1000 missing observations, which might lead to wrong model fitting.
10. RECOMMENDATIONS
To improve life expectancy, the government of each country should concentrate on reducing
adult mortality, infant deaths and under-five deaths, on raising awareness of HIV/AIDS, and on care during
pregnancy for HIV/AIDS patients. Governments should increase total expenditure on health. They should also
increase immunization coverage for Diphtheria and control the availability of alcohol to
citizens.
11. REFERENCES
1. https://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends_text/en/
2. https://towardsdatascience.com/5-types-of-regression-and-their-properties-c5e1fa12d55e
3. http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/
4. https://rdrr.io/cran/goeveg/man/cv.html