Group 1 Project Report DA

REPORT ON DATA ANALYTICS
IN
HEALTHCARE
(LIFE EXPECTANCY – WHO DATASET)
SUBMITTED TO:
DR. POOJA SENGUPTA
SUBMITTED BY GROUP 1
BHADRINATH T.S. [18PGDM013]
SAYANI MANDAL [18PGDM042]
SOHAM SARKAR [18PGDM103]
AISWARYA NAIR [18PGDM118]
PREETY PAUL CHOUDHURY [18PGDM141]

CONTENTS
NO TITLE
1 INTRODUCTION
2 OBJECTIVES
3 DESCRIPTION OF THE VARIABLES
4 METHODOLOGY
5 PROBLEM DEFINITION
6 VISUALIZATION AND EXPLORATION
7 REGRESSION MODEL FITTING
8 CONCLUSION
9 LIMITATIONS
10 RECOMMENDATIONS
11 REFERENCES
1. INTRODUCTION
Life expectancy is a statistical measure of the average time an organism is expected to live, based on
the year of its birth, its current age and other demographic factors including gender. Life expectancy at
birth reflects the overall mortality level of a population. It summarizes the mortality pattern that
prevails across all age groups in a given year – children and adolescents, adults and the elderly. Global
life expectancy at birth in 2016 was 72.0 years (74.2 years for females and 69.8 years for males),
ranging from 61.2 years in the WHO African Region to 77.5 years in the WHO European Region,
giving a ratio of 1.3 between the two regions. Women live longer than men all around the world. The
gap in life expectancy between the sexes was 4.3 years in 2000 and had remained almost the same by
2016 (4.4).
Global average life expectancy increased by 5.5 years between 2000 and 2016, the fastest increase
since the 1960s. Those gains reverse declines during the 1990s, when life expectancy fell in Africa
because of the AIDS epidemic, and in Eastern Europe following the collapse of the Soviet Union. The
2000-2016 increase was greatest in the WHO African Region, where life expectancy increased by 10.3
years to 61.2 years, driven mainly by improvements in child survival, and expanded access to
antiretrovirals for treatment of HIV.
Source: WHO Global Health Observatory Data
2. OBJECTIVES
The objectives of the project are as follows:
1. To apply the relevant data analysis concepts taught during the coursework.
2. To identify the dependent and independent variables and to identify whether each variable is
categorical or continuous.
3. To visualize and explore the variables through histograms, bar plots, skew plots and descriptive
measures.
4. To identify patterns in the observations using pivot tables and the 'group by' function.
5. To identify which countries are doing better across different variables like life expectancy, schooling
and percentage expenditure.
6. To identify significant independent variables affecting the dependent variables using appropriate
modelling techniques.
7. To verify whether the conditions of normality have been satisfied by the regression model.
8. To suggest appropriate methods and apply those methods if the conditions of normality have not been
satisfied.
3. DESCRIPTION OF THE VARIABLES
• Country: Country, factor variable with 193 levels.
• Year: Year, int data type; we have converted this into a factor variable (16 levels).
• Status: Developed or Developing status, factor variable with 2 levels.
• Life expectancy: Life expectancy in years, num data type.
• Adult Mortality: Adult mortality rates of both sexes (probability of dying between 15 and 60
years per 1000 population), int data type.
• Infant deaths: Number of infant deaths per 1000 population, int data type.
• Alcohol: Alcohol, recorded per capita (15+) consumption (in liters of pure alcohol), num data
type.
• Percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per
capita (%), num data type.
• Hepatitis B: Hepatitis B (Hep B) immunization coverage among 1-year-olds (%), int data type.
• Measles: Measles, number of reported cases per 1000 population, int data type.
• BMI: Average Body Mass Index of the entire population, num data type.
• Under-five deaths: Number of under-five deaths per 1000 population, int data type.
• Polio: Polio (Pol3) immunization coverage among 1-year-olds (%), int data type.
• Total expenditure: General government expenditure on health as a percentage of total government
expenditure (%), num data type.
• Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among
1-year-olds (%), int data type.
• HIV/AIDS: Deaths per 1,000 live births due to HIV/AIDS (0-4 years), num data type.
• GDP: Gross Domestic Product per capita (in USD), num data type.
• Population: Population of the country, num data type.
• thinness 1-19 years: Prevalence of thinness among children and adolescents aged 10 to 19 (%), num
data type.
• thinness 5-9 years: Prevalence of thinness among children aged 5 to 9 (%), num data type.
• Income composition of resources: Human Development Index in terms of income composition
of resources (index ranging from 0 to 1), num data type.
• Schooling: Number of years of schooling (years), num data type.
4. METHODOLOGY
R software has been used in our project report for the purpose of analysis. R is a language for statistical
computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation
and graphical display. It also provides a wide variety of statistical (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and
is highly extensible.
Group-by and summary functions have been used to identify patterns across the years, to identify
which countries have been performing well, and to identify whether developing or developed countries
are performing better. Descriptive statistics have been used to learn more about the variables and their
distribution in the data set.
Microsoft Excel has also been used to build pivot tables to identify patterns in the data set.
Regression is a technique used to model and analyze the relationships between variables and how they
contribute to produce a particular outcome. A linear regression is a regression model in which the
dependent variable is modelled as a linear combination of the predictor variables. We have used Simple
and Multiple Regression to check the significance of predictor variables on a dependent variable.
Stepwise regression consists of iteratively adding and removing predictors in order to find the subset
of variables in the data set resulting in the best performing model, that is, a model that lowers prediction
error. Stepwise forward selection has been used in the project for selection of significant variables as it
helps us to fit our model in an effective manner.
There are three strategies of stepwise regression:

1. Forward selection, which starts with no predictors in the model, iteratively adds the most
contributive predictors, and stops when the improvement is no longer statistically
significant.
2. Backward selection (or backward elimination), which starts with all predictors in the
model (full model), iteratively removes the least contributive predictors, and stops when you
have a model where all predictors are statistically significant.
3. Stepwise selection (or sequential replacement), which is a combination of forward and
backward selections. It starts with no predictors, then sequentially adds the most
contributive predictors (like forward selection). After adding each new variable, it removes
any variable which no longer provides an improvement in the model fit (like backward
selection).
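The forward selection strategy described above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for the report: the predictor names x1, x2, x3, the toy data and the `min_gain` threshold are all hypothetical, and a relative drop in the residual sum of squares (RSS) stands in for the statistical significance test.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[a] * row[a] for a in range(p)) for row in design]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n))


def forward_select(predictors, y, min_gain=0.05):
    """Add the predictor that lowers RSS the most until the relative gain is below min_gain."""
    selected = []
    current = rss([], y)  # intercept-only model
    while len(selected) < len(predictors):
        candidates = [name for name in predictors if name not in selected]
        scored = [(rss([predictors[s] for s in selected + [name]], y), name)
                  for name in candidates]
        best_rss, best_name = min(scored)
        if current == 0 or (current - best_rss) / current < min_gain:
            break
        selected.append(best_name)
        current = best_rss
    return selected


# Hypothetical toy data: y depends on x1 and x2 but not on x3.
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [a * b * c for a, b, c in zip(x1, x2, x3)]
y = [50 + 5 * a + 2 * b + 0.5 * e for a, b, e in zip(x1, x2, noise)]

chosen = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
print(chosen)
```

Backward elimination would run the same loop in reverse, starting from the full model and dropping the predictor whose removal raises the RSS the least.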
We have also used the ANCOVA (Analysis of Covariance) regression model since, in our model,
continuous variables (Adult Mortality, Total expenditure, GDP, etc.) coexist with qualitative variables
(Country, Year, Status).
5. PROBLEM DEFINITION
We have defined some base questions for our project. They are as follows:
1. What insights can be obtained from the variables?
2. What visual patterns/ trends are captured from the exploration of the data?
3. What are the predicting variables actually affecting the life expectancy?
4. How do Infant, Adult and Under-five mortality rates affect life expectancy?
5. What is the impact of schooling on the lifespan of humans?
6. Does Life Expectancy have positive or negative relationships with drinking alcohol?
7. Do populated countries tend to have lower life expectancy?
8. What is the impact of Immunization coverage on life Expectancy?
9. What are the variables affecting adult mortality? Do the same variables which affect life
expectancy also affect adult mortality?
6. VISUALISATION AND EXPLORATION
The pivot tables, histograms and bar plots below show the visual patterns and trends found while
exploring the data. This also gives us the Answer to Question 2 of Problem Definition.
6.1 USING PIVOT TABLES
6.1.1 LIFE EXPECTANCY AND COUNTRY
NO. TREND COUNTRIES
1 SLIGHT INCREASE OR CONSTANT Albania, Algeria, Antigua & Barbuda, Argentina, Armenia, Bahamas,
Bahrain, Barbados, Belize, Benin, Bhutan, Bosnia & Herzegovina, Brunei Darussalam, Bulgaria,
Cameroon, China, Colombia, Costa Rica, Croatia, Cuba, Czechia, DR Congo, Ecuador, Equatorial
Guinea, Fiji, Georgia, Guinea, Guinea-Bissau, Guyana, Honduras, Hungary, India, Jamaica, Japan,
Jordan, Kazakhstan, Kiribati, Kuwait, Lebanon, Malaysia, Mauritius, Mexico, Mongolia, Montenegro,
Nepal, Oman, Pakistan, Panama, Peru, Philippines, Poland, Qatar, Saint Lucia, Sao Tome and
Principe, Saudi Arabia, Serbia, Seychelles, Slovakia, Solomon Islands, Tajikistan, Thailand,
Macedonia, Togo, Tonga, Tunisia, Turkmenistan, United Arab Emirates, United States of America,
Uzbekistan, Venezuela (Bolivarian Republic of), Viet Nam.
2 VARIATION Angola, Australia, Austria, Azerbaijan, Bangladesh, Belarus, Botswana, Brazil, Burkina
Faso, Burundi, Côte d'Ivoire, Cabo Verde, Cambodia, Canada, Chad, Comoros, Congo, Djibouti,
Dominican Republic, Egypt, El Salvador, Eritrea, Estonia, Ethiopia, Finland, France, Gabon, Ghana,
Grenada, Guatemala, Haiti, Iceland, Indonesia, Iran, Iraq, Ireland, Israel, Italy, Kyrgyzstan, Lao,
Latvia, Lesotho, Liberia, Libya, Lithuania, Luxembourg, Madagascar, Maldives, Mali, Malta,
Mauritania, Micronesia, Morocco, Mozambique, Myanmar, Namibia, Netherlands, New Zealand,
Nicaragua, Niger, Nigeria, Norway, Papua New Guinea, Paraguay, Republic of Korea, Republic of
Moldova, Romania, Rwanda, Saint Vincent and the Grenadines, Samoa, Senegal, Sierra Leone,
Singapore, Somalia, South Africa, South Sudan, Spain, Sri Lanka, Sudan, Suriname, Swaziland,
Sweden, Switzerland, Syrian Arab Republic, Timor-Leste, Trinidad and Tobago, Turkey, Uganda,
Ukraine, United Kingdom of Great Britain and Northern Ireland, United Republic of Tanzania,
Vanuatu, Yemen, Zambia.
3 INCREASING Afghanistan, Bolivia, Chile, Cyprus, DPR Korea, Denmark, Malawi, Russian Federation,
Slovenia, Zimbabwe.
4 DECREASING Belgium, Central African Republic, Gambia, Germany, Greece, Portugal.
5 NO DATA Cook Islands, Dominica, Marshall Islands, Monaco, Nauru, Niue, Palau, Saint Kitts and
Nevis, San Marino, Tuvalu.
Inference:
• Constant Life Expectancy = This trend might be due to age-specific mortality.
• Variation in Life Expectancy = This trend might be due to contributions from age-specific and
disease-specific mortality.
• Increasing Life Expectancy = This trend might be due to improvements in public health, nutrition
and medicine, and to a greater inclination towards fitness and physical activity among the
population of a country.
• Decreasing Life Expectancy = This trend might be due to mortality caused by smoking, drinking and
obesity.
6.1.2 STATUS AND TOTAL EXPENDITURE
NO. STATUS COUNTRIES

1 DEVELOPING Afghanistan, Albania, Algeria, Angola, Antigua and Barbuda,
Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh,
Barbados, Belarus, Belize, Benin, Bhutan, Bolivia (Plurinational State
of), Bosnia and Herzegovina, Botswana, Brazil, Brunei Darussalam,
Burkina Faso, Burundi, Côte d'Ivoire, Cabo Verde, Cambodia,
Cameroon, Canada,Central African Republic, Chad, Chile, China,
Colombia, Comoros, Congo, Cook Islands, Costa Rica, Cuba,
Democratic People's Republic of Korea, Democratic Republic of the
Congo, Djibouti, Dominica, Dominican Republic, Ecuador, Egypt, El
Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland,
France, Gabon, Gambia, Georgia, Ghana, Greece, Grenada,
Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Honduras, India,
Indonesia, Iran (Islamic Republic of), Iraq, Israel, Jamaica, Jordan,
Kazakhstan, Kenya, Kiribati, Kuwait, Kyrgyzstan, Lao People's
Democratic Republic, Lebanon, Lesotho, Liberia, Libya, Madagascar,
Malawi, Malaysia, Maldives, Mali, Marshall Islands, Mauritania,
Mauritius, Mexico, Micronesia (Federated States of), Monaco,
Mongolia, Montenegro, Morocco, Mozambique, Myanmar, Namibia,
Nauru, Nepal, Nicaragua, Niger, Nigeria, Niue, Oman, Pakistan,
Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines,
Qatar, Republic of Korea, Republic of Moldova, Russian Federation,
Rwanda, Saint Kitts and Nevis, Saint Lucia, Saint Vincent and the
Grenadines, Samoa, San Marino, Sao Tome and Principe, Saudi
Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Solomon Islands,
Somalia, South Africa, South Sudan, Sri Lanka, Sudan, Suriname,
Swaziland, Syrian Arab Republic, Tajikistan, Thailand, The former
Yugoslav republic of Macedonia, Timor-Leste, Togo, Tonga, Trinidad
and Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda,
Ukraine, United Arab Emirates, United Republic of Tanzania,
Uruguay, Uzbekistan, Vanuatu, Venezuela (Bolivarian Republic of),
Viet Nam, Yemen, Zambia, Zimbabwe.
Highest General government expenditure on health as a percentage of total government expenditure:
Micronesia = 165.84%, as shown below:
Lowest General government expenditure on health as a percentage of total government expenditure:
Cook Islands = 3.58%, as shown below:
2 DEVELOPED Australia, Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia,
Denmark, Germany, Hungary, Iceland, Ireland, Italy, Japan, Latvia,
Lithuania, Luxembourg, Malta, Netherlands, New Zealand, Norway,
Poland, Portugal, Romania, Singapore, Slovakia, Slovenia, Spain,
Sweden, Switzerland, United Kingdom of Great Britain and Northern
Ireland, United States of America.
Highest General government expenditure on health as a percentage of total government expenditure:
United States of America = 237.95%, as shown below:
Lowest General government expenditure on health as a percentage of total government expenditure:
Singapore = 55.32%, as shown below:
Inference:
General government expenditure on health as a percentage of total government expenditure is higher for
Developed Economies than for Developing Economies.
6.1.3 STATUS AND BMI
NO. STATUS BMI
1 DEVELOPED The countries closest to normal BMI are Japan and Singapore, with values of 25.6 and
25.9 respectively, as shown below:
The country with the highest overweight BMI is Malta, with a value of 66.18, as shown in the
figure above.
2 DEVELOPING The country with the lowest (underweight) BMI is Saint Kitts and Nevis, with a value
of 5.2, as shown below:
The countries that have normal BMI are:
• Benin = 19.612
• Côte d'Ivoire = 21.325
• Cabo Verde = 24.375
• Cameroon = 23.618
• China = 21.806
• Congo = 20.925
• Gambia = 20.3
• Ghana = 21.725
• Guinea-Bissau = 19.431
• Indonesia = 19.956
• Liberia = 19.987
• Maldives = 19.293
• Mauritania = 22.475
• Nigeria = 19.750
• Philippines = 19.187
• Republic of Korea = 23.24375
• Sao Tome and Principe = 20.85
• Somalia = 18.6875
• Thailand = 21.59375
The country with the highest overweight BMI is Nauru, with a value of 87.3, as shown below:
Inference:
• Developed Economies have overweight BMI values, which indicate cases of obesity. No developed
country has an underweight BMI, and only two economies have BMI values close to the
standard of 25.
• A few Developing Economies face the problem of poverty, where the BMI value is less than 18.5.
A normal BMI suggests that the population has a healthy lifestyle. There are also a few countries
with overweight BMI values, which might be due to obesity.
6.1.4 IMMUNIZATION & LIFE EXPECTANCY
• For Afghanistan, Hepatitis B immunization coverage has remained nearly constant, while Polio and
Diphtheria immunization coverage have increased over the years, but Life Expectancy has not
increased accordingly.
Figure showing Life Expectancy of Afghanistan
Chart showing Percentage of Immunization Coverage of Hepatitis B, Polio & Diphtheria respectively
of Afghanistan
• Taking another country as an example, in the Bahamas no Hepatitis B immunization was recorded in
2000, although coverage for the other vaccines was present. Still, Life expectancy remains nearly
constant throughout the years, without much effect of immunization.
Figure showing Life Expectancy of Bahamas
Chart showing Percentage of Immunization Coverage of Hepatitis B, Polio & Diphtheria respectively
of Bahamas
Inference:
Thus, based on these two examples, it can be assumed that the effect of Immunization coverage of
Hepatitis B, Polio & Diphtheria on Life Expectancy is negligible.
6.1.5 MORTALITY RATES
The mortality rate of each country is calculated with respect to its population in a particular year.
Adult Mortality (%) = (Adult Mortality / Population) × 100
Infant Mortality (%) = (Infant deaths / Population) × 100
Under-five Mortality (%) = (Under-five Deaths / Population) × 100
The remaining share of deaths is attributed to adolescent deaths (8-15 years) and deaths above 60 years.
For example, Afghanistan's Adult, Infant and Under-five Mortality have reduced over the years. In 2002,
there was no Adult Mortality.
Chart showing Percentage of Adult, Infant and Under-five Mortality
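The three percentage formulas of this section share one form, deaths divided by population times 100. A small sketch in Python (for illustration only; the report's calculations were done in Excel and R), using hypothetical figures that are not taken from the WHO dataset:

```python
def mortality_share(deaths, population):
    """Return deaths as a percentage of the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100


# Hypothetical figures for one country-year, not taken from the dataset.
population = 1_000_000
adult_pct = mortality_share(250, population)
infant_pct = mortality_share(60, population)
under_five_pct = mortality_share(85, population)
print(adult_pct, infant_pct, under_five_pct)
```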
6.2 USING R, HISTOGRAM AND BAR PLOTS
Skewness is a measure of the asymmetry of the probability distribution about its mean. There are 2 types
of Skewness.
Negatively skewed: The left tail is longer, the mass of the distribution is concentrated on the right of the
figure.
Positively skewed: The right tail is longer, the mass of the distribution is concentrated on the left of the
figure.
(Source: Skewness Wikipedia)
Out of all variables, Population has the highest positive skewness, so its mean lies furthest to the
right of the peak compared to the other variables.
Polio has the highest negative skewness, so its mean lies furthest to the left of the peak compared to
the other variables.
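The sign convention above can be checked with a short calculation. The sketch below (Python, for illustration; not the R code used for the report) computes the moment-based skewness, the third central moment divided by the 1.5th power of the second. The sample values are hypothetical, and statistical packages may apply a small-sample correction that gives slightly different numbers.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with no small-sample correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples, chosen only to show the sign of the measure.
right_tailed = [1, 1, 1, 10]   # long right tail: positive skewness
left_tailed = [1, 10, 10, 10]  # long left tail: negative skewness
symmetric = [1, 2, 3, 4, 5]    # no skew

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```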
6.2.1 LIFE EXPECTANCY HISTOGRAM
Sorted by bin                          Sorted by frequency
Bin Range  Frequency  Cumulative %     Bin Range  Frequency  Cumulative %
35         0          0.00%            75         796        27.19%
40         2          0.07%            80         571        46.69%
45         17         0.65%            70         444        61.85%
50         108        4.34%            60         279        71.38%
55         188        10.76%           65         270        80.60%
60         279        20.29%           85         208        87.70%
65         270        29.51%           55         188        94.13%
70         444        44.67%           50         108        97.81%
75         796        71.86%           90         45         99.35%
80         571        91.36%           45         17         99.93%
85         208        98.46%           40         2          100.00%
90         45         100.00%          35         0          100.00%
More       0          100.00%          More       0          100.00%
The Life expectancy histogram has its maximum frequency in the 70-75 bin, which contains 796
observations. The second highest peak is in the 75-80 bin, with 571 observations.
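The cumulative percentages in the Life expectancy table can be reproduced from the bin frequencies alone. The sketch below (Python, for illustration; the original table was produced in Excel) uses the frequencies listed above, first in bin order and then sorted by descending frequency. The function name is a placeholder of our own.

```python
# Upper bin limits and frequencies as listed in the Life expectancy table.
bins = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
freq = [0, 2, 17, 108, 188, 279, 270, 444, 796, 571, 208, 45]


def cumulative_percent(frequencies):
    """Running total of the frequencies as a percentage of the grand total."""
    total = sum(frequencies)
    running = 0
    result = []
    for f in frequencies:
        running += f
        result.append(round(100 * running / total, 2))
    return result


# Cumulative % in bin order (left half of the table).
by_bin = dict(zip(bins, cumulative_percent(freq)))

# Cumulative % with bins ranked by descending frequency (right half of the table).
ranked = sorted(zip(bins, freq), key=lambda pair: pair[1], reverse=True)
by_freq = cumulative_percent([f for _, f in ranked])

print(by_bin[75], ranked[0], by_freq[:2])
```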
6.2.2 INFANT DEATHS

Sorted by bin                          Sorted by frequency
Bin Range  Frequency  Cumulative %     Bin Range  Frequency  Cumulative %
0          848        28.86%           200        2004       68.21%
200        2004       97.07%           0          848        97.07%
400        51         98.81%           400        51         98.81%
600        19         99.46%           600        19         99.46%
800        0          99.46%           1800       4          99.59%
1000       3          99.56%           1000       3          99.69%
1200       3          99.66%           1200       3          99.80%
1400       3          99.76%           1400       3          99.90%
1600       3          99.86%           1600       3          100.00%
1800       4          100.00%          800        0          100.00%
More       0          100.00%          More       0          100.00%
Infant deaths have the highest frequency in the 1-200 range across years and countries, with 2004
observations. A value of 0 infant deaths occurs in 848 observations.
6.2.3 ADULT MORTALITY

Sorted by bin                          Sorted by frequency
Bin Range  Frequency  Cumulative %     Bin Range  Frequency  Cumulative %
0          0          0.00%            100        1068       36.48%
100        1068       36.48%           200        980        69.95%
200        980        69.95%           300        520        87.70%
300        520        87.70%           400        214        95.01%
400        214        95.01%           500        92         98.16%
500        92         98.16%           600        32         99.25%
600        32         99.25%           700        19         99.90%
700        19         99.90%           800        3          100.00%
800        3          100.00%          0          0          100.00%
More       0          100.00%          More       0          100.00%
Adult mortality has the highest frequency in the range up to 100, with 1068 observations. No
observation has an adult mortality of zero.
6.2.4 BARPLOTS
There are 512 observations for developed countries and 2426 observations for developing countries in
the dataset.
There are 183 data points for all years except for the year 2013 which has 193 data points.
Here we are grouping a quantitative variable (Life expectancy) by a qualitative variable (Status) and
finding the mean life expectancy of Developed and Developing countries. The developed
countries have a higher mean Life expectancy (79.19785) than the developing countries (67.11147).
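Grouping a quantitative variable by a qualitative one amounts to collecting the values per category and averaging each group. A minimal sketch in Python (for illustration; the report used R's group-by function), with hypothetical rows rather than the actual dataset:

```python
from collections import defaultdict
from statistics import mean


def group_mean(records):
    """Average the numeric value of (category, value) pairs within each category."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {category: mean(values) for category, values in groups.items()}


# Hypothetical (Status, Life expectancy) rows, not taken from the dataset.
rows = [
    ("Developed", 80.1),
    ("Developed", 78.3),
    ("Developing", 66.0),
    ("Developing", 68.4),
    ("Developing", 70.2),
]

means = group_mean(rows)
print(means)
```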
The Life Expectancy data range of 70-75 has the highest frequency (close to 800) i.e., there are close to
800 data points of Life expectancy in the data set for which the values of Life expectancy lie in the range
70-75 years.
As far as skewness is concerned, this is a left (negatively) skewed histogram i.e. peak of the histogram
veers to the right.
For Developing countries, the Life Expectancy data range of 72-74 has the highest frequency (close to
350), i.e. there are close to 350 data points in the data set for which Life expectancy lies in the
range 72-74 years for Developing countries. This is a left (negatively) skewed
histogram, i.e. the peak of the histogram veers to the right.
For Developed countries, the Life Expectancy data range of 81-82 has the highest frequency (close to 80),
i.e. there are close to 80 data points in the data set for which Life expectancy lies in the
range 81-82 years for Developed countries.
So, we can see that, compared to Developing countries, the Developed countries have their highest
frequency in a higher range of Life expectancy.
Thus, higher values of Life expectancy (the average period for which a person may expect to live) are
more frequent for Developed countries, i.e. residents of Developed countries stay alive for a greater
number of years.
This is a left (negatively) skewed histogram i.e. peak of the histogram veers to the right.
The Descriptive Statistics and Group By provided us with all the insights obtained from the variables.
This also gives us the Answer to Question 1 of Problem Definition.
Grouping by Year:
Pic 1: grouping by year
From the output, we can observe the following.
1. Average Life expectancy for most part of the dataset has been on the rise.
2. Average Adult Mortality and Average Infant Mortality have no definite pattern. For part of the
period they are reducing, while towards the end they increase again. This has very little effect on
life expectancy.
3. Alcohol consumption has been decreasing year on year.
4. General Govt Expenditure on health as a part of GDP has been increasing year on year across
countries.
5. Death due to HIV in children is decreasing.
6. Avg thinness is increasing.
7. Number of years of schooling is increasing.
Grouping by countries:
Pic 2 : Grouping by Countries.
1. France has the highest average life expectancy. Lesotho has the lowest average life expectancy.
2. Tunisia has the lowest average adult mortality. Lesotho has highest adult mortality.
3. Austria, Belize, Bosnia and Herzegovina, Cabo Verde, Croatia, Cyprus, Estonia and Fiji have zero
infant deaths.
4. Bangladesh and Equatorial Guinea have the lowest alcohol consumption. Belarus has the highest
alcohol consumption.
5. Eritrea has the lowest average percentage expenditure. Australia has the highest average
percentage expenditure.
6. Fiji has the highest average Hepatitis B immunization coverage. Equatorial Guinea has the lowest
average Hepatitis B immunization coverage.
Grouping by countries:
Pic 3: Grouping by countries
1. Greece has the highest average BMI. Rwanda has the lowest average BMI.
2. Brazil has the highest polio immunization coverage. Equatorial Guinea has the lowest polio
immunization coverage.
3. Ireland has the lowest average thinness among 10-19 years. India has the highest average thinness
among 10-19 years.
4. Tonga has the lowest average thinness among 5-9 years. India has the highest average thinness
among 5-9 years.
5. Australia has the highest average schooling; Eritrea has the lowest.
Grouping by Status:
Pic 4: Grouping by status.
1. As expected, developed countries have higher average life expectancy, lower average adult
mortality, lower average infant mortality, higher average percentage expenditure, lower average
measles cases, higher average BMI, lower average under-five deaths, higher average polio coverage,
lower average prevalence of thinness among 5-9 and 10-19 years, and higher average schooling.
Descriptive Statistics:
Pic 5: Descriptive Stats.

1. Adult mortality, infant deaths, alcohol, percentage expenditure, measles, under-five deaths,
HIV/AIDS, GDP, thinness 10-19 years and thinness 5-9 years show large deviations from the mean.
2. The magnitude of the standard error gives an index of the precision of the estimate of the
parameter. The standard error of the mean (SEM) describes how precise the sample mean is as an
estimate of the true population mean. Percentage expenditure, measles, GDP and population have
the higher standard errors.
3. Median: The middle value in a data series is called the median. The median() function is
used in R to calculate this value. Here, Population has the highest median and infant deaths has
the lowest median.
4. Standard deviation: a measure of the variation of a set of numbers. Here, population has the
highest standard deviation and "income composition of resources" has the lowest standard deviation.
5. Coefficient of variation: It is a statistical measure of the dispersion of data points in a data series
around the mean. It is defined as the ratio of the standard deviation to the mean and is expressed
as a percentage. Adult mortality has the lowest coefficient of variation and population has the
highest coefficient of variation. According to Dormann 2013, CV values below 0.05 (5%) indicate
very high precision of the data and values above 0.2 (20%) low precision. Since all the variables
have a coefficient of variation below 20%, they have high precision.
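The coefficient of variation and the precision thresholds quoted above can be sketched as follows (Python, for illustration; not the R code used for the report). The sample values are hypothetical, the sample standard deviation is assumed, and the label "intermediate" for values between the two thresholds is our own placeholder.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return stdev(values) / mean(values)


def precision_label(cv):
    """Apply the 5% and 20% thresholds quoted in the text."""
    if cv < 0.05:
        return "very high precision"
    if cv > 0.2:
        return "low precision"
    return "intermediate"  # placeholder label, not from the source


# Hypothetical sample, not taken from the dataset.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(sample)
print(round(cv * 100, 1), precision_label(cv))
```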
Summary Stats:
Pic 6: Summary Stats
Quartiles describe the spread of values above and below the median by dividing the distribution into
four groups. Three cut points (the lower quartile, the median and the upper quartile) divide the
data set into four groups.
7. REGRESSION MODEL FITTING
Identifying the significant variables by stepwise regression (Answer to Question 3 of Problem
Definition)
7.1 LIFE EXPECTANCY AS A DEPENDENT VARIABLE
OBJECTIVE:
● To identify the significant predictor variables (X variables) of Life expectancy in the original
multiple regression model and the individual regression models.
● To find out the relationship (positive or negative) between Life expectancy and the X variables.
Life Expectancy regressed on all the above significant variables (Overall/Original Multiple
regression Model)
Regression Equation:
Life expectancy = 5.51e+01 + B1(Country) + B2(Year) - 1.716e-03 (Adult Mortality)
+ 8.626e-02 (Infant deaths) - 1.024e-01 (Alcohol) - 1.287e-05 (Measles) - 6.434e-02 (Under-five deaths)
- 4.975e-02 (Total expenditure) + 6.836e-03 (Diphtheria) - 3.077e-01 (HIV/AIDS)
+ 4.481e-02 (Thinness 1-19 years) - 1.659e-01 (Income Composition of Resources) + 1.5e-01 (Schooling)
There will be 192 dummy variables for Country, and hence 192 regression coefficients for Country
(since there are 193 countries in the dataset). Similarly, since there are 16 years in the dataset,
there will be 15 dummy variables for Year and 15 regression coefficients for Year.
Writing out all the dummies would be cumbersome. For Country, the regression equation contains the terms
β1C1 + β2C2 + β3C3 + … + β192C192 (here C1, C2, …, C192 are dummy variables for the countries). So, we
have written them as B1(Country).
For Year, the regression equation contains the terms
β193Y193 + β194Y194 + β195Y195 + … + β207Y207 (here all the Y's are dummy variables for the years). So, we
have written them as B2(Year).
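The rule that a factor with k levels yields k - 1 dummy variables, with the benchmark category absorbed into the intercept, can be sketched as follows (Python, for illustration; R builds these columns automatically for factor variables). The function name and the short year vector are hypothetical.

```python
def dummy_code(values, benchmark):
    """Return the non-benchmark levels and one 0/1 row per observation."""
    levels = sorted(set(values))
    levels.remove(benchmark)  # the benchmark category gets no column
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical observations of the Year factor.
levels, rows = dummy_code(["2000", "2001", "2002", "2001"], benchmark="2000")
print(levels)
print(rows)

# 16 years (2000 to 2015) with 2000 as the benchmark give 15 dummy variables.
year_levels, _ = dummy_code(list(range(2000, 2016)), benchmark=2000)
print(len(year_levels))
```

An observation from the benchmark category has zeros in every dummy column, which is why each Country or Year coefficient is read as a difference from the benchmark.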
Value of R^2 (96.37%) is high in this regression model. So, the model is a good fit.
Null Hypothesis for overall Regression Model:
H0: β1 = β2 = ... = βk = 0. Thus, none of the variables belong to the model and we do not have a good
model for prediction.
Alternative Hypothesis for overall Regression Model:
Ha: At least one β is not 0. Thus, at least one variable belongs to the model and we have a good model for
prediction.
Since the p value of the overall model is <2.2e-16, which is < 0.05, we reject the null hypothesis that
we do not have a good model for prediction. So, the overall model is a significant predictor of Life
expectancy.
For individual variables:
Null Hypothesis: H0: βj = 0 for the j-th variable. Thus, the variable does not belong to the model and
we do not have a good model for prediction.
Alternative Hypothesis: Ha: βj ≠ 0. Thus, the variable belongs to the model and we have a good model
for prediction.
Since the p value of every predictor variable except Income composition of resources is < 0.05, we
reject the null hypothesis for all of them. So, all variables except Income composition of resources
are significant predictors of Life expectancy.
Among the quantitative variables, the coefficient of Schooling is the highest (1.5e-01), i.e. with a
1 unit increase in Schooling, Life expectancy increases by 1.5e-01 units. The coefficient
of HIV/AIDS is the lowest (-3.077e-01), i.e. with a 1 unit increase in HIV/AIDS, Life expectancy
decreases by 3.077e-01 units.
For Year:
Benchmark category: Year 2000
For the years 2001, 2002 and 2003 only, the p value > 0.05. So there is no significant difference
between the average Life expectancy of 2000 and that of 2001, 2002 or 2003. For each of the remaining
years, there is a significant difference from the average Life expectancy of 2000.
All the coefficients for years are positive, i.e. the average life expectancy in every year is higher
than that of 2000. The coefficient for 2015 is the highest (6.241), i.e. the average life
expectancy of 2015 is 6.241 units more than that of 2000. Similarly, the lowest year coefficient is
2.501e-01; the benchmark year 2000 itself has no coefficient.
For Country:
Benchmark category: Afghanistan
For the countries, Benin, Burkina Faso, Burundi, Cameroon, Equatorial Guinea, Guinea, Guinea-Bissau,
Liberia, Mali, Mozambique, Togo, Zambia and Zimbabwe, the p value>0.05. So there is no significant
difference between the average Life expectancies between Afghanistan and each of these pairwise. For
the rest of the countries where p value<0.05, there is a significant difference between the average Life
expectancies between Afghanistan and each of them pairwise.
The coefficient for Solomon Islands is the highest (9.751), i.e. the average life expectancy of
Solomon Islands is 9.751 units more than that of Afghanistan. Similarly, the coefficient for
Angola is the lowest (-6.513), i.e. the average life expectancy of Afghanistan is 6.513 units more than
that of Angola.
Checking the plots for Linearity of the model:
The residual plot is slightly funnel shaped, so there is potential heteroscedasticity.
The normal probability plot does not follow a 45-degree straight line, so the residuals are non-normal.
3/n = 3/2938 ≈ 0.001 (n = number of data points in the dataset)
Many data points lie above 0.001 in the above plot, so no individual point stands out as an outlier.
Since the model fails two of the diagnostic checks (constant variance and normality of residuals), the
assumptions of the linear model are not satisfied.
Checking for Multicollinearity:
Auxiliary regression model of infant deaths regressed on the rest of the X variables
The R^2 = 99.44% > 96.37% (R^2 of the original model). So, infant deaths is a source of
multicollinearity, and we can remove it from the final regression model.
Auxiliary regression model of under-five deaths regressed on the rest of the X variables
The R^2=99.45% >96.37% (R^2 of original model). So, under-five deaths is a source of multicollinearity.
So, we can remove it from the final regression model.
Checking for VIFs
VIFs for infant deaths and under-five deaths are >10. So, they are sources of multicollinearity. So, we can
remove them from the final regression model.
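The auxiliary R^2 values and the VIF check are two views of the same quantity, since VIF = 1 / (1 - R^2) for the auxiliary regression of a predictor on the other predictors. The short Python sketch below applies that identity to the two rounded R^2 values quoted above. The function name is illustrative, and the printed figures are implied by the rounded R^2 values rather than copied from the report's VIF output.

```python
def vif_from_aux_r2(r_squared: float) -> float:
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Auxiliary R^2 values quoted in the text (as fractions, not percentages).
aux_r2 = {"infant deaths": 0.9944, "under-five deaths": 0.9945}

vifs = {name: vif_from_aux_r2(r2) for name, r2 in aux_r2.items()}
for name, vif in vifs.items():
    print(f"{name}: implied VIF = {vif:.1f}, above 10: {vif > 10}")
```

Both implied values are far above the threshold of 10, which agrees with the conclusion drawn from the VIF check.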
Regression model by removing Multicollinearity:
Life expectancy = 5.49e+01 + B1(Country) + B2(Year) - 1.758e-03 (Adult Mortality) - 1.046e-01 (Alcohol)
- 1.873e-05 (Measles) - 5.363e-02 (Total expenditure) + 8.435e-03 (Diphtheria) - 3.15e-01 (HIV/AIDS)
+ 2.906e-02 (Thinness 1-19 years) - 1.239e-01 (Income Composition of Resources) + 1.726e-01 (Schooling)
R^2 = 96.27%, slightly less than that of the original regression model (96.37%), but the model is still a
good fit. The overall model after removing multicollinearity is significant, as its p value is < 2.2e-16,
which is < 0.05. All variables except Income Composition of Resources and Schooling are significant
predictor variables of Life expectancy (in the original regression model, all variables except Income
Composition of Resources were significant predictor variables of Life expectancy).
Among the quantitative variables, the coefficient of Schooling is the highest (1.726e-01), i.e. with 1 unit
increase in Schooling, the Life expectancy increases by 1.726e-01 units. The coefficient of HIV/AIDS is
the lowest (-3.15e-01), i.e. with 1 unit increase in HIV/AIDS, the Life expectancy decreases by 3.15e-01
units. This is similar to the overall multiple regression model, where the coefficient of Schooling is
also the highest and the coefficient of HIV/AIDS is also the lowest.
Regression using the log transformation of the model after removing Multicollinearity
Ln(Life expectancy) = 4.004 + B1(Country) + B2(Year) - 3.108e-05 (Adult Mortality) - 1.572e-03 (Alcohol)
- 3.931e-07 (Measles) - 9.353e-04 (Total expenditure) + 1.519e-04 (Diphtheria) - 6.436e-03 (HIV/AIDS)
+ 7.392e-04 (Thinness 1-19 years) + 1.001e-03 (Income Composition of Resources) + 3.34e-03 (Schooling)
The value of R^2 is 96.06%. It is lower than that of the model before transformation, but the model is
still a good fit.
All the variables except Income composition of resources are significant predictor variables of Life
Expectancy (p value < 0.05). This is the same as in the original regression model.
Since the log transformation brings no improvement in R^2 and no improvement in normality, the log
transformation is ruled out.
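Coefficients of the log-transformed model are on the log scale, so they are not directly comparable with those of the untransformed model. The Python sketch below converts a few of them back to the original scale, assuming the transformation was the natural logarithm (the report does not state the base). The helper name is illustrative.

```python
import math

# Coefficients copied from the log-transformed equation above.
intercept = 4.004
log_coefs = {
    "Adult Mortality": -3.108e-05,
    "Alcohol": -1.572e-03,
    "HIV/AIDS": -6.436e-03,
    "Schooling": 3.34e-03,
}


def pct_change_per_unit(coef: float) -> float:
    """Percentage change in Y for a one-unit rise in X when ln(Y) is modelled."""
    return (math.exp(coef) - 1.0) * 100.0


# Intercept on the original scale (assumes natural log).
baseline = math.exp(intercept)
effects = {name: pct_change_per_unit(c) for name, c in log_coefs.items()}

print(f"exp(intercept) = {baseline:.2f}")
for name, pct in effects.items():
    print(f"{name}: {pct:+.4f}% per unit")
```

Under that assumption, exp(4.004) is close to the intercept of the untransformed model (5.49e+01), which suggests that the two fits describe the same baseline.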
Checking Regression of Life expectancy on Individual predictor variables:
7.1.1 LIFE EXPECTANCY VS COUNTRY
Life Expectancy= 58.1937 + B1(Country)
The R^2 value is 92.56%. Benchmark category: Afghanistan
The intercept value of 58.1937 is the average Life expectancy of the benchmark country, Afghanistan.
For the countries Benin, Congo, Ethiopia, Gambia, Haiti, Kenya, Liberia, Niger, Rwanda, South
Africa and Togo, the p value > 0.05. So, there is no significant difference between the average Life
expectancy of Afghanistan and that of each of these countries. For the rest of the countries, where the
p value < 0.05, the average Life expectancy differs significantly from that of Afghanistan.
This result differs from the overall regression model: only 3 countries, Benin, Liberia and Togo, are
insignificant in both the individual regression on Country and the overall regression model.
The coefficient for Japan is the highest (24.3437), i.e. the average life expectancy of Japan is 24.3437
units more than that of Afghanistan. Similarly, the coefficient for Sierra Leone is the lowest (-12.0812),
i.e. the average life expectancy of Afghanistan is 12.0812 units more than that of Sierra Leone.
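Because Country is the only predictor in this model, each fitted value is simply a group mean: the intercept is the benchmark mean and intercept plus coefficient is the mean of the corresponding country. A small Python sketch of that arithmetic, using only the figures quoted above (the variable names are illustrative):

```python
# Intercept = mean life expectancy of the benchmark country (Afghanistan).
intercept = 58.1937

# Dummy-variable coefficients quoted in the text.
country_coefs = {"Japan": 24.3437, "Sierra Leone": -12.0812}

# With a single categorical predictor, group mean = intercept + coefficient.
country_means = {name: intercept + coef for name, coef in country_coefs.items()}

for name, mean in country_means.items():
    print(f"{name}: implied average life expectancy = {mean:.4f}")
```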
This also differs from the overall regression model, where the coefficient for Solomon Islands is the
highest and the coefficient for Angola is the lowest.
The first important insight is that most of the African countries have life expectancies lower than that
of the benchmark country (Afghanistan), i.e. negative coefficients. So, it can be concluded that most of
the African countries are worse off as far as Life expectancy is concerned, compared to the other
continents of the world.
The second important insight is that most of the European countries have life expectancies higher
(by more than 15 units) than that of the benchmark country (Afghanistan), i.e. positive (> 15) coefficients.
So, it can be concluded that most of the European countries are better off as far as Life expectancy is
concerned, compared to the other continents of the world.
7.1.2 LIFE EXPECTANCY VS YEAR
Life Expectancy= 66.7503 + B2(Year)
The R^2 value is 2.9%. Benchmark category: Year 2000
The intercept value of 66.7503 is the average Life expectancy in the benchmark year 2000.
Only for the years 2001, 2002, 2003, 2004, 2005 and 2006 is the p value > 0.05. So, there is no
significant difference between the average Life expectancy of 2000 and that of each of these years. For
the rest of the years, the average Life expectancy differs significantly from that of 2000.
This result differs from the overall regression model: only 3 years, 2001, 2002 and 2003, are
insignificant in both the individual regression on Year and the overall regression model.
All the coefficients for years are positive, i.e. the average life expectancy in every year is more than
the average life expectancy of 2000. The coefficient for 2015 is the highest (4.8667), i.e. the average life
expectancy of 2015 is 4.8667 units more than the average life expectancy of 2000. Similarly, the lowest
coefficient is 0.3787; since 2000 is the benchmark year and has no coefficient of its own, this value
presumably belongs to 2001.
The overall regression model shows the same pattern, with the highest coefficient for 2015 and the
lowest for the earliest year.
Another important insight is that the life expectancies increase from the year 2001 to
2015. So, it can be concluded that life expectancies are increasing over time (year on year).
7.1.4 LIFE EXPECTANCY VS DRINKING ALCOHOL (Impact of drinking Alcohol on Life
Expectancy) (Answer to Question 6 of Problem Definition)
7.1.4.1 Life Expectancy regressed on Alcohol:
Life Expectancy= 64.76334 + 0.95464 (Alcohol)
The average value of Life expectancy when there is no Alcohol consumption is 64.763 units (intercept).
R^2 is 16.39% (low). P value < 0.05. So, Alcohol is a significant predictor variable of Life Expectancy,
as in the overall multiple regression model.
With one unit increase in drinking alcohol, the Life expectancy increases by 0.95 units. Hence Life
expectancy has a positive relationship with drinking alcohol. But in the overall multiple regression model,
Alcohol has a negative coefficient, i.e. a negative relationship with Life Expectancy once the other
predictors are held constant.
7.1.4.2 Correlation of Drinking Alcohol with Life Expectancy
The correlation is positive, i.e. 0.404 (slightly more than 0.4). The correlation is moderate, as it falls
between 0.4 and 0.7.
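In a simple regression with one predictor, R^2 equals the square of the Pearson correlation, so the two figures reported for Alcohol can be cross-checked against each other. A minimal Python sketch of that check, using only the numbers quoted above:

```python
import math

r_squared = 0.1639  # R^2 of Life Expectancy regressed on Alcohol
slope = 0.95464     # slope of the same regression; its sign gives the sign of r

# In simple linear regression, r = sign(slope) * sqrt(R^2).
r = math.copysign(math.sqrt(r_squared), slope)

print(f"correlation implied by R^2: {r:.3f}")
```

The implied value agrees with the reported correlation of about 0.404.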
7.1.5 LIFE EXPECTANCY VS MEASLES
Life Expectancy= 6.954e+01 – 1.307e-04 (Measles)
R^2 is 2.4% (very low). P value < 0.05. So, Measles is a significant predictor variable of Life Expectancy,
as in the overall multiple regression model.
7.1.6 LIFE EXPECTANCY VS TOTAL EXPENDITURE
Life Expectancy= 64.26213 + 0.83693 (Total expenditure)
R^2 is 4.7% (very low). P value < 0.05. So, Total expenditure is a significant predictor variable of Life
Expectancy, as in the overall multiple regression model.
7.1.8 LIFE EXPECTANCY VS CHRONIC DISEASE HIV/AIDS:
Life Expectancy= 71.04654 – 1.04228 (HIV/AIDS)
P value < 0.05. So, HIV/AIDS is a significant predictor variable of Life Expectancy, as in the overall
multiple regression model. With one unit increase in HIV/AIDS, the Life expectancy decreases by 1.04
units (the regression coefficient of HIV/AIDS is -1.04). Hence Life expectancy has a negative
relationship with HIV/AIDS.
In the overall multiple regression model as well, HIV/AIDS has a negative regression coefficient, i.e. a
negative relationship with Life Expectancy.
7.1.9 LIFE EXPECTANCY VS THINNESS 1-19 YEARS
Life Expectancy= 74.31828 - 1.02413 (Thinness 1-19 years)
R^2 is 22.77% (low). P value < 0.05. So, Thinness 1-19 years is a significant predictor variable of Life
Expectancy.
7.1.10 LIFE EXPECTANCY VS INCOME COMPOSITION OF RESOURCES
Life Expectancy= 49.1735 + 32.1572 (Income composition of resources)
R^2 is 52.53% (moderate). P value < 0.05. So, Income composition of resources, which was insignificant in
the overall multiple regression model, is a significant predictor variable of Life Expectancy in this
model.
7.1.11 LIFE EXPECTANCY VS SCHOOLING (Answer to Question 5 of Problem Definition)
Life Expectancy= 44.10889 + 2.10345 (Schooling)
R^2 is 56.55% (moderate). P value < 0.05. So, Schooling is a significant predictor variable of Life
Expectancy.
With one unit increase in Schooling, the Life expectancy increases by 2.103 units. Hence Life expectancy
has a positive relationship with Schooling.
In the overall multiple regression model as well, Schooling has a positive coefficient, i.e. a positive
relationship with Life Expectancy.
Solutions to other Problem Definition Questions:
7.1.14 LIFE EXPECTANCY VS IMMUNIZATION COVERAGE (Answer to Question 8 of
Problem Definition)
As per the description of the dataset, there are immunization coverages for 3 diseases: Hepatitis B, Polio
and Diphtheria.
Life Expectancy= 54.794 + 0.003 (Hepatitis B) + 0.084 (Polio) + 0.09 (Diphtheria)
The average value of Life expectancy when Hepatitis B, Polio and Diphtheria immunization coverages
are all zero is 54.794 units (intercept).
The p values of Polio and Diphtheria are < 0.05. So, Polio and Diphtheria immunization coverages are
significant predictor variables of Life Expectancy.
But the p value of Hepatitis B is > 0.05, i.e. Hepatitis B immunization coverage is not a significant
predictor of Life Expectancy in this model.
With one unit increase in Polio immunization coverage, the Life expectancy increases by 0.08 units; with
one unit increase in Diphtheria immunization coverage, the Life expectancy increases by 0.09 units
(highest). The corresponding figure for Hepatitis B is 0.003 units (lowest).
So, we can conclude that an increase in Diphtheria immunization coverage has the greatest positive impact
on Life expectancy among these 3 immunization coverages. All three coefficients are positive, although
the Hepatitis B coefficient is not significantly different from zero.
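To make the fitted equation above concrete, the Python sketch below evaluates it at two coverage profiles. The coefficients are those quoted in the text; the 90% coverage level is a hypothetical input chosen only for illustration, and the function name is an assumption.

```python
INTERCEPT = 54.794
COEFS = {"hepatitis_b": 0.003, "polio": 0.084, "diphtheria": 0.09}


def predict_life_expectancy(hepatitis_b: float, polio: float, diphtheria: float) -> float:
    """Fitted life expectancy for given immunization coverages (in percent)."""
    return (
        INTERCEPT
        + COEFS["hepatitis_b"] * hepatitis_b
        + COEFS["polio"] * polio
        + COEFS["diphtheria"] * diphtheria
    )


no_coverage = predict_life_expectancy(0, 0, 0)       # equals the intercept
high_coverage = predict_life_expectancy(90, 90, 90)  # hypothetical 90% coverage

print(f"no coverage:  {no_coverage:.3f}")
print(f"90% coverage: {high_coverage:.3f}")
```

Almost all of the difference between the two profiles comes from the Polio and Diphtheria terms, in line with their larger coefficients.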
If we check their individual effect on Life Expectancy:
Hepatitis B:
Life Expectancy= 62.947 + 0.086 (Hepatitis B)
P value < 0.05. So, Hepatitis B immunization coverage is a significant predictor variable of Life
Expectancy when considered as the only predictor variable in the regression. But in the regression
model where Hepatitis B, Polio and Diphtheria immunization coverages were considered together,
Hepatitis B immunization coverage was an insignificant predictor of Life Expectancy.
Polio:
Life Expectancy= 53.704 + 0.188 (Polio)
P value < 0.05. So, Polio immunization coverage is a significant predictor variable of Life Expectancy
when considered as the only predictor variable in the regression, as in the regression model where
Hepatitis B, Polio and Diphtheria immunization coverages were considered together.
Diphtheria:
Life Expectancy= 53.477 + 0.192 (Diphtheria)
P value < 0.05. So, Diphtheria immunization coverage is a significant predictor variable of Life
Expectancy when considered as the only predictor variable in the regression, as in the regression model
where Hepatitis B, Polio and Diphtheria immunization coverages were considered together.
7.1.15 LIFE EXPECTANCY VS POPULATION (Answer to Question 7 of Problem Definition)
Life Expectancy= 6.873e+01 – 3.468e-09 (Population)
P value > 0.05. So, Population is not a significant predictor variable of Life Expectancy. With one unit
increase in Population, the Life expectancy decreases by 3.468e-09 units, so the estimated relationship
between Population and Life expectancy is negative. However, since the coefficient is not significant, we
cannot conclude that more populated countries tend to have lower Life expectancy.
7.1.16 LIFE EXPECTANCY VS GDP
Life Expectancy= 6.704e+01 + 3.117e-04 (GDP)
P value < 0.05. So, GDP is a significant predictor variable of Life Expectancy. With one unit increase in
GDP, the Life expectancy increases by 3.117e-04 units. Hence Life expectancy has a positive
relationship with GDP. So, we can conclude that the higher the GDP of a country, the higher its Life
expectancy.
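The Population and GDP coefficients look tiny only because they are expressed per single unit of the predictor. The Python sketch below rescales them to larger, more readable steps; the step sizes (one million people, one thousand GDP units) are illustrative choices and not part of the original analysis.

```python
# Per-unit coefficients quoted in the two equations above.
population_coef = -3.468e-09
gdp_coef = 3.117e-04

# Rescale to more interpretable steps (illustrative step sizes).
per_million_people = population_coef * 1_000_000
per_thousand_gdp_units = gdp_coef * 1_000

print(f"per extra million people: {per_million_people:+.6f} units of life expectancy")
print(f"per extra 1000 GDP units: {per_thousand_gdp_units:+.4f} units of life expectancy")
```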
IMPACT OF INFANT, ADULT AND UNDER-FIVE MORTALITY RATES ON LIFE
EXPECTANCY (Answer to Question 4 of Problem Definition)
Life Expectancy= 77.883 – 0.049 (Adult Mortality) + 0.185 (Infant deaths) - 0.145 (under-five deaths)
The p values of Infant, Adult and under-five mortality rates are all < 0.05.
So, Infant, Adult and under-five mortality rates are all significant predictor variables of Life Expectancy.
With one unit increase in Adult mortality rate, the Life expectancy decreases by 0.049 units; with one unit
increase in infant mortality rate, the Life expectancy increases by 0.185 units (highest); and with one unit
increase in under-five mortality rate, the Life expectancy decreases by 0.145 units (lowest).
So, in this model, infant mortality rate has the largest positive coefficient among these 3 mortality rates.
This positive sign should be read with caution: infant deaths and under-five deaths were identified above
as sources of multicollinearity, and when infant deaths is taken as the only predictor its coefficient is
negative (-0.015).
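A coefficient can change sign between a joint model and a single-predictor model when two predictors are strongly correlated, as infant deaths and under-five deaths are. The Python sketch below illustrates this on purely synthetic data (not the WHO dataset): y is built exactly from two nearly collinear predictors using the joint coefficients quoted above, and the slope of y on the first predictor alone is then compared with its joint coefficient. All names and data are assumptions made for illustration.

```python
def centered(values):
    """Subtract the mean from every element."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_slope(x, y):
    """OLS slope of y on a single predictor x."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 via the 2x2 normal equations."""
    a, b, c = centered(x1), centered(x2), centered(y)
    s11, s12, s22 = dot(a, a), dot(a, b), dot(b, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Synthetic, nearly collinear predictors (x2 is roughly 1.4 * x1).
x1 = [float(i) for i in range(10)]
x2 = [1.4 * v + (1.0 if i % 2 == 0 else -1.0) for i, v in enumerate(x1)]
# y is generated exactly from the joint coefficients quoted in the text.
y = [77.883 + 0.185 * a - 0.145 * b for a, b in zip(x1, x2)]

joint_b1, joint_b2 = two_predictor_slopes(x1, x2, y)
alone_b1 = simple_slope(x1, y)

print(f"x1 in joint model: {joint_b1:+.3f}")
print(f"x2 in joint model: {joint_b2:+.3f}")
print(f"x1 on its own:     {alone_b1:+.4f}")
```

In this toy setting the first predictor has a positive joint coefficient but a negative slope on its own, mirroring the pattern seen for infant deaths.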
If we check their individual effect on Life Expectancy:
Adult Mortality:
Life Expectancy= 78.018 - 0.053 (Adult Mortality)
P value < 0.05. So, Adult mortality is a significant predictor variable of Life Expectancy when considered
as the only predictor variable in the regression, as in the regression model where Infant, Adult and
under-five mortality rates were considered together.
Infant deaths:
Life Expectancy= 69.706 - 0.015 (Infant deaths)
P value < 0.05. So, Infant deaths is a significant predictor variable of Life Expectancy when considered
as the only predictor variable in the regression, as in the regression model where Infant, Adult and
under-five mortality rates were considered together.
Under-five deaths:
Life Expectancy= 69.781 - 0.013 (under-five deaths)
P value < 0.05. So, under-five deaths is a significant predictor variable of Life Expectancy when
considered as the only predictor variable in the regression, as in the regression model where Infant,
Adult and under-five mortality rates were considered together.
INFERENCE FROM THIS MODEL
Thus, we can conclude that taking Life Expectancy as Y and regressing it on the X variables found
significant by stepwise regression, after removing multicollinearity, results in a good-fit
regression model with a very high R^2 value. We are also able to answer the last 6 questions of
the problem definition from this model.
7.2 ADULT MORTALITY AS DEPENDENT VARIABLE (Answer to Question 9 of
Problem Definition)
OBJECTIVE: To check whether the issue of Adult Mortality can be addressed by changes in the other X
variables that significantly influence Life Expectancy
7.2.1 ADULT MORTALITY VS ALL OTHER SIGNIFICANT VARIABLES EXCEPT LIFE
EXPECTANCY
Residuals:
Min 1Q Median 3Q Max
-475.35 -4.97 12.33 32.36 510.49
Adult Mortality = 2.828e+02 + B1(Country) - 1.131 (Income Composition of Resources) - 0.999 (Schooling)
+ B2(Year) + 6.911 (HIV/AIDS) + 1.019 (Diphtheria) - 4.139 (Measles) + 1.039 (Alcohol)
- 0.408 (Under-five deaths) + 0.580 (Infant deaths) - 0.413 (Total expenditure) + 0.739 (Thinness 1-19 years)
B1= coefficient of the categorical variable country.
B2= coefficient of the categorical variable year.
Residual standard error: 81.43 on 2360 degrees of freedom
(382 observations deleted due to missingness)
Multiple R-squared: 0.6031, Adjusted R-squared: 0.5704
F-statistic: 18.39 on 195 and 2360 DF, p-value: < 2.2e-16
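The summary figures above are linked by standard identities: the adjusted R-squared and the F-statistic can both be recomputed from the multiple R-squared and the two degrees of freedom. The Python sketch below performs that cross-check with the reported values; the function names are illustrative, and small differences are expected because the reported R-squared is rounded.

```python
def adjusted_r2(r2: float, df_model: int, df_resid: int) -> float:
    """Adjusted R^2 from R^2, model df and residual df (n - 1 = df_model + df_resid)."""
    return 1.0 - (1.0 - r2) * (df_model + df_resid) / df_resid


def f_statistic(r2: float, df_model: int, df_resid: int) -> float:
    """Overall F-statistic implied by R^2 and the degrees of freedom."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


# Values reported in the regression summary above.
r2, df_model, df_resid = 0.6031, 195, 2360

adj = adjusted_r2(r2, df_model, df_resid)
f_value = f_statistic(r2, df_model, df_resid)

print(f"adjusted R^2 = {adj:.4f}")
print(f"F-statistic  = {f_value:.2f}")
```

Both recomputed values agree with the reported 0.5704 and 18.39 to within rounding.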
With a model R square of 0.60, it can be concluded that the model is not a very good fit. The reason might
be a non-linear relationship between Y and one or more of the X variables. To check that, the following
plots were obtained.
From the above plots, it is clear that more than one condition of linearity is violated.
Checking for Multicollinearity (through VIF):
                                        GVIF   Df  GVIF^(1/(2*Df))
Country                         5.645015e+06  170         1.046786
Income.composition.of.resources 6.271597e+00    1         2.504316
Schooling                       1.587135e+01    1         3.983886
Year                            2.459652e+00   15         1.030455
HIV.AIDS                        5.407505e+00    1         2.325404
Diphtheria                      2.033893e+00    1         1.426146
Measles                         2.150863e+00    1         1.466582
Alcohol                         1.108622e+01    1         3.329598
under.five.deaths               1.499765e+03    1        38.726797
infant.deaths                   1.604659e+03    1        40.058195
Total.expenditure               2.348768e+00    1         1.532569
thinness..1.19.years            7.208859e+00    1         2.684932
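The third column of the table is the GVIF raised to the power 1/(2*Df), which puts predictors with different degrees of freedom (such as the 170-level Country factor) on a comparable scale. The Python sketch below recomputes that column for a few rows from the first two columns; the function name is illustrative and the threshold of 10 is the one applied in this report.

```python
def adjusted_gvif(gvif: float, df: int) -> float:
    """GVIF^(1/(2*Df)), the df-adjusted generalized variance inflation factor."""
    return gvif ** (1.0 / (2 * df))


# (GVIF, Df) pairs copied from the table above.
rows = {
    "Country": (5.645015e06, 170),
    "Year": (2.459652, 15),
    "Schooling": (15.87135, 1),
    "under.five.deaths": (1.499765e03, 1),
    "infant.deaths": (1.604659e03, 1),
}

adjusted = {name: adjusted_gvif(g, df) for name, (g, df) in rows.items()}
flagged = sorted(name for name, value in adjusted.items() if value > 10)

for name, value in adjusted.items():
    print(f"{name}: {value:.6f}")
print("adjusted value above 10:", flagged)
```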
Thus, we can conclude that infant deaths and under-five deaths are sources of multicollinearity, as they
are the only variables with GVIF^(1/(2*Df)) > 10.
Fitting the same model without these two variables gives the following output.
Residuals:
Min 1Q Median 3Q Max
-475.71 -5.02 12.30 32.33 507.53
Adult Mortality = 2.919e+02 + B1(Country) - 2.071e+01 (Income Composition of Resources) - 2.149e+00 (Schooling)
+ B2(Year) + 4.794e+00 (HIV/AIDS) + 9.910e-02 (Diphtheria) - 8.853e-04 (Measles)
+ 1.236e+00 (Alcohol) - 4.315e-01 (Total expenditure) + 6.808e-01 (Thinness 1-19 years)
Here in case of country, Afghanistan has been taken as the benchmark category and in case of year, 2000
has been taken as the benchmark category.
The coefficients in the above model can be interpreted as the change in Adult Mortality as a result of unit
change in the corresponding X variables.

Even after removing the multicollinear variables, R square did not improve. A log transformation of
the above model was therefore tried.
Residuals:
Min 1Q Median 3Q Max
-4.4799 0.0048 0.2438 0.4515 2.0174
Ln(Adult Mortality) = 5.681e+00 + B1(Country) - 2.609e-01 (Income Composition of Resources)
- 2.684e-02 (Schooling) + B2(Year) + 4.580e-03 (HIV/AIDS) + 6.661e-04 (Diphtheria) - 7.977e-06 (Measles)
+ 9.702e-03 (Alcohol) - 6.173e-03 (Total expenditure) + 2.272e-03 (Thinness 1-19 years)
After log transformation, the R square reduced further to 0.34. So, this multiple regression model
is definitely not a good fit.
7.2.2 ADULT MORTALITY VS COUNTRY
Residuals:
Min 1Q Median 3Q Max
-498.06 -6.37 11.63 33.00 513.69
Adult Mortality = 269.0625 + B1(Country)
Here again Afghanistan has been taken as the benchmark category. The intercept (269.0625) is the average
Adult Mortality of Afghanistan, and each country coefficient is the differential of that country's Adult
Mortality with respect to Afghanistan.
In absolute terms, Tunisia has the largest coefficient (-250) and Benin has the smallest (0.3).
Insignificant Countries in the overall Model:
Angola, Belarus, Benin, Bhutan, Burkina Faso, Burundi, Cameroon, Central African Republic, Comoros,
Congo, Djibouti, Equatorial Guinea, Eritrea, Ethiopia, Gambia, Guinea, Kazakhstan, Kenya, Liberia,
Madagascar, Mongolia, Namibia, Niger, Nigeria, Papua New Guinea, Philippines, Russian Federation,
Rwanda, South Africa, Togo, Turkmenistan, Uganda, Ukraine, Yemen, Zambia
All the remaining countries were significant.
But in the individual model (Adult Mortality versus Country), the insignificant countries are:
Belarus, Benin, Bhutan, Burkina Faso, Burundi, Cameroon, Comoros, Congo, Djibouti, Equatorial
Guinea, Eritrea, Ethiopia, Gambia, Guinea, Kazakhstan, Liberia, Madagascar, Mongolia, Namibia, Niger,
Nigeria, Papua New Guinea, Philippines, Russian Federation, Rwanda, Togo, Turkmenistan, Uganda,
Chad, Gabon, Mozambique, Somalia, Sudan, United Republic of Tanzania.
All the remaining countries were significant.
With an R square value of 0.59, it can be concluded that the model is not a good fit.
7.2.3 ADULT MORTALITY VS INCOME COMPOSITION OF RESOURCES
Residuals:
Min 1Q Median 3Q Max
-327.37 -39.65 -5.28 46.21 517.44
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330.372 6.522 50.66 <2e-16 ***
Income.composition.of.resources -266.696 9.853 -27.07 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Regression Equation: Adult Mortality = 330.372 - 266.696 (Income composition of Resources)
Here we can see that Income composition of resources, which was insignificant in the multiple
regression model, has now become significant.
The R square value is very low, so the model is not a good fit.
7.2.4 ADULT MORTALITY VS SCHOOLING
Residuals:
Min 1Q Median 3Q Max
-325.47 -49.96 -3.55 42.96 534.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 363.4750 7.7515 46.89 <2e-16 ***
Schooling -16.7033 0.6222 -26.84 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Regression Equation: Adult Mortality = 363.4750 - 16.7033 (Schooling)
Schooling, which was insignificant in the multiple regression model, has now become significant in this
model.
The R square of the model is very low (0.2), showing that the model is not a good fit.
7.2.5 ADULT MORTALITY VERSUS YEAR
Residuals:
Min 1Q Median 3Q Max
-179.48 -90.77 -20.22 63.57 549.37
Adult Mortality = 181.475 + B1(Year)
Here 2000 has been taken as the benchmark year. The intercept (181.475) is the average Adult Mortality
in 2000, and each year coefficient is the differential of that year's Adult Mortality with respect to 2000.
In absolute terms, the coefficient of 2014 is the largest (-32.78) and that of 2004 is the smallest (4.78).
F-statistic: 1.634 on 15 and 2912 DF, p-value: 0.05769
The overall p value (0.05769) is slightly above 0.05, so the model as a whole is not significant at the
5% level.
In the previous multiple regression model, only the year 2012 was significant, whereas in this individual
model, the years 2012, 2013, 2014 and 2015 are significant.
R square is extremely low (0.0083), suggesting that this model is a complete misfit.
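For any of these models, R square can be recovered from the overall F-statistic and its degrees of freedom through R^2 = F*df1 / (F*df1 + df2). The Python sketch below applies the identity to the F-statistics printed in section 7.2 (the Year model above and the single-predictor models that report an F-statistic further on). The function name is illustrative.

```python
def r2_from_f(f_stat: float, df_model: int, df_resid: int) -> float:
    """R^2 implied by an overall F-statistic and its degrees of freedom."""
    return f_stat * df_model / (f_stat * df_model + df_resid)


# (F, df_model, df_resid) as printed in the regression summaries of section 7.2.
reported = {
    "Year": (1.634, 15, 2912),
    "HIV/AIDS": (1106.0, 1, 2926),
    "Measles": (2.847, 1, 2926),
    "Alcohol": (109.0, 1, 2733),
    "Total expenditure": (36.37, 1, 2700),
}

implied_r2 = {name: r2_from_f(*args) for name, args in reported.items()}
for name, value in implied_r2.items():
    print(f"{name}: implied R^2 = {value:.4f}")
```

The implied values match the R square figures quoted in the text where one is given (0.0083 for Year, 0.27 for HIV/AIDS, 0.038 for Alcohol).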
7.2.6 ADULT MORTALITY VS HIV AIDS
Residuals:
Min 1Q Median 3Q Max
-784.2 -70.7 -1.7 74.3 515.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 142.4217 2.0693 68.82 <2e-16 ***
HIV.AIDS 12.8023 0.3849 33.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 1106 on 1 and 2926 DF, p-value: < 2.2e-16
Regression Equation: Adult Mortality = 142.4217 + 12.8023 (HIV.AIDS)
HIV/AIDS was significant in both the multiple and the individual regression models.
The R square of this individual model is low (0.27), suggesting it is not a good fit.
7.2.7 ADULT MORTALITY VS DIPHTHERIA
Residuals:
Min 1Q Median 3Q Max
-267.61 -77.53 -12.31 61.49 556.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 282.56485 7.99022 35.36 <2e-16 ***
Diphtheria -1.43915 0.09327 -15.43 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Regression Equation: Adult Mortality = 282.56485 - 1.43915 (Diphtheria)
Diphtheria was insignificant in the multiple model, but has become significant in the
individual model.
The R square value of this individual model is extremely low (0.075), suggesting that this model is a
complete misfit.
7.2.8 ADULT MORTALITY VS MEASLES
Residuals:
Min 1Q Median 3Q Max
-197.40 -90.98 -20.84 63.61 559.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.640e+02 2.347e+00 69.866 <2e-16 ***
Measles 3.374e-04 2.000e-04 1.687 0.0917 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 2.847 on 1 and 2926 DF, p-value: 0.09167
Regression Equation: Adult Mortality = 1.640e+02 + 3.374e-04 (Measles)
Measles was significant in the multiple regression model, but is insignificant in the individual model
(p value 0.0917 > 0.05).
The R square value of this individual model is negligible (close to zero), suggesting this model is a
complete misfit.
7.2.9 ADULT MORTALITY VS ALCOHOL
Residuals:
Min 1Q Median 3Q Max
-190.93 -81.15 -21.55 58.21 557.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 192.4121 3.5622 54.02 <2e-16 ***
Alcohol -6.0573 0.5802 -10.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 109 on 1 and 2733 DF, p-value: < 2.2e-16
Regression Equation: Adult Mortality = 192.4121 - 6.0573 (Alcohol)
Alcohol was insignificant in the multiple regression model, but has become significant in the individual
model.
The R square value of this individual model is extremely low (0.038), suggesting that this model is a
complete misfit.
7.2.10 ADULT MORTALITY VS TOTAL EXPENDITURE
Residuals:
Min 1Q Median 3Q Max
-189.52 -84.38 -22.08 62.23 565.95
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 198.5707 6.2086 31.98 < 2e-16 ***
Total.expenditure -5.8237 0.9657 -6.03 1.86e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 36.37 on 1 and 2700 DF, p-value: 1.859e-09
Regression Equation: Adult Mortality = 198.5707 - 5.8237 (Total Expenditure)
Total expenditure was insignificant in the multiple regression model, but has become significant in the
individual model.
The R square value of this individual model is extremely low (about 0.013, as implied by the F-statistic),
suggesting that this model is a complete misfit.
7.2.11 ADULT MORTALITY VS THINNESS-1-19 YEARS
Residuals:
Min 1Q Median 3Q Max
-349.39 -71.00 -13.56 60.42 554.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.2023 3.2579 37.51 <2e-16 ***
thinness..1.19.years 8.4884 0.4964 17.10 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Regression Equation: Adult Mortality = 122.2023 + 8.4884 (Thinness-1-19 years)
The R square value of this individual model is low, suggesting that this model is a misfit.
INFERENCE FROM THIS MODEL
Thus, we can conclude that taking Adult Mortality as Y and regressing it on the X variables that
significantly affect Life Expectancy was not a sound choice, as none of the variables could adequately
explain Adult Mortality when taken individually.
However, when taken together, the overall model could explain about 60% of the variation in Adult
Mortality.
8. CONCLUSION
The Life Expectancy data set has been studied with the help of R and Microsoft Excel. Dependent and
independent variables have been identified. ‘Life Expectancy’ depends on 13 significant independent
variables, which form the basis for improving the ‘Life Expectancy’ of a country. Similarly, Adult
Mortality depends on 10 independent variables, which explain it.
The following insights are obtained from the dataset:
1. Developed Economies have Overweight BMI values which indicate the cases of Obesity. For most of
the Developing Economies, normal BMI shows the population has a healthy lifestyle.
2. Population has the highest positive skewness and Polio has highest negative skewness.
3. The developed countries have a higher mean Life expectancy (79.19785) than the developing countries
(67.11147).
4. The Life Expectancy data range of 70-75 has the highest frequency (close to 800).
5. Average Adult Mortality and Average Infant Mortality have no definite pattern.
6. General Govt Expenditure on health as a part of GDP has been increasing year on year across
countries.
7. As expected, developed countries have higher average life expectancy, lower average adult mortality,
lower average infant mortality, higher average percentage expenditure, lower average measles cases,
higher average BMI, lower average under-five deaths, higher average polio coverage, lower average
prevalence of thinness among ages 5-9 and 10-19, and higher average schooling.
8. Adult mortality, infant deaths, alcohol, percentage expenditure, measles, under-five deaths, HIV/AIDS,
GDP, thinness 10-19 years and thinness 5-9 years show large deviations from their means.
9. Most of the African countries are worse off as far as Life expectancies are concerned as compared to
other continents of the world.
10. Most of the European countries are better off as far as Life expectancies are concerned as compared to
other continents of the world.
11. Average Life expectancies are increasing over time (year on year).
12. The Adult Mortality regression model is not a good-fit model.
9. LIMITATIONS
There are about 1000 missing observations, which might lead to wrong model fitting.
10. RECOMMENDATIONS
To improve life expectancy, the Government of each country should concentrate on controlling
adult mortality, infant deaths and under-five deaths, on awareness about HIV/AIDS, and on care during
pregnancy for HIV/AIDS patients. The Government should increase the total expenditure on health and
the immunization coverage of Diphtheria. It should also control the availability of alcohol to
its citizens to improve life expectancy.
11. REFERENCES
1. https://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends_text/en/
2. https://towardsdatascience.com/5-types-of-regression-and-their-properties-c5e1fa12d55e
3. http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/
4. https://rdrr.io/cran/goeveg/man/cv.html

