
Fouzia Salahuddin Ahmed

Prof. Thusita Kumara


SDP 202B L3
Assignment 6
20/4/24

1. Create a do file and save it as AS#06_YOURNAME and enter the preliminary operations and
import the dataset to your working directory.

The do-file is attached.

2. How many variables are there in the data set and for how many countries?

Number of variables = 13
Number of countries = 165
These counts were found using the Data Editor (Browse) function.

3. How many of them are string variables? How many are numeric (continuous/discrete)
variables? How many are categorical (nominal/ordinal)?

Using the codebook command, I found the type of each variable.

String variables: 2 (countryme, sanitaccessfactor)

Numeric variables: 11 (infmort, lifeexpect, sanitaccess, adolfert, eduexpend, adultlit,
primedufem, birthrate, loggdppercapita, sanitaccessnum, gdppercapita)

4. Create a histogram for the variable gdppercapita. What can you say about its
distribution? Is it how you expected it to be?

The density is very high at low values of GDP per capita and much lower in the higher bins, so
the distribution is skewed to the right (positively skewed), with a long tail toward high
incomes. This distribution is as expected, since there is a huge global population but not as
much global wealth/GDP. The right skew indicates that many countries in the dataset are likely
to be classified as developing economies. These economies often face challenges such as limited
access to resources, infrastructure, and opportunities for economic growth.

5. Create a 90% confidence interval for gdppercapita and interpret it.

With 90% confidence, we can say that the true mean value of gdppercapita in the population lies
within the range of 11222.47 to 16539.1.
This means that if the sampling process were repeated many times and a 90% confidence interval
were calculated each time, about 90% of those intervals would contain the true population mean.
The standard error is relatively small compared to the mean value, indicating a relatively
precise estimate of the mean. The sample mean lies at the centre of the interval by
construction, so it serves as the point estimate of the population mean.
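As a quick arithmetic check on the interval above: a t-based confidence interval for a mean is symmetric, so its midpoint is the sample mean and half its width is the margin of error. The short Python sketch below is separate from the Stata do-file, and its variable names are illustrative only. It recovers both quantities from the two reported bounds.

```python
# Reported 90% confidence bounds for gdppercapita (taken from the answer above)
lower, upper = 11222.47, 16539.10

# A t-based interval for a mean is symmetric around the sample mean
sample_mean = (lower + upper) / 2
margin_of_error = (upper - lower) / 2

print(f"sample mean     = {sample_mean:.3f}")
print(f"margin of error = {margin_of_error:.3f}")
```

Dividing the margin of error by the appropriate t critical value would give the standard error, but that needs the number of non-missing observations, which is not stated here.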

6. Regress gdppercapita on infant mortality and interpret the slope estimate. Interpret R-
squared. Does the model clear the F-test?
For the regression the dependent variable is gdppercapita and the independent variable is
infant mortality.

The slope coefficient for the independent variable "infmort" (infant mortality) is -424.8095. The
negative sign of the coefficient indicates an inverse relationship between infant mortality and
GDP per capita: each one-unit increase in infant mortality is associated with a predicted
decrease of about 424.81 in GDP per capita.

The R-squared value is 0.2741, which indicates that approximately 27.41% of the variation in the
dependent variable (gdppercapita) can be explained by the variation in the independent variable
included in the regression model (in this case, infant mortality). This suggests that while there is
a relationship between infant mortality and GDP per capita, other factors not included in the
model are also influencing the variation in gdppercapita.
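In a regression with a single predictor, R-squared equals the squared correlation between the two variables, and the correlation takes the sign of the slope. The Python sketch below (illustrative names, not part of the do-file) uses that fact to recover the implied correlation from the two reported figures.

```python
import math

r_squared = 0.2741   # reported R-squared
slope = -424.8095    # reported slope on infmort

# With one predictor, R-squared is the squared correlation coefficient;
# copysign gives the square root the same sign as the slope.
correlation = math.copysign(math.sqrt(r_squared), slope)

print(f"implied correlation = {correlation:.3f}")
```

A correlation of roughly -0.52 is a moderate negative association, which matches the reading that infant mortality explains only part of the variation in GDP per capita.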

In this case, the F-test result indicates that the overall regression model is statistically
significant. The p-value is reported as 0.0000, meaning it is smaller than 0.0001: if the slope
were truly zero, an F-statistic this large would be extremely unlikely. Therefore, we can reject
the null hypothesis that the slope is zero. This means that there is evidence to support the
claim that there is some relationship between infant mortality and GDP per capita.

7. Create a box plot for variable infmort. What is the median of infmort? Are there any
outliers?

The median is 20
There is 1 outlier

8. Test whether the mean infant mortality is same for countries with high and low sanitation
access. Interpret the result in context of the data.

The two-sample t-test comparing infant mortality rates between countries with high and low
sanitation access reveals a substantial difference. Countries with low sanitation access
(Group 0) have an average infant mortality rate of 58.46, while countries with high sanitation
access (Group 1) have an average rate of 15.37. The t-value of 15.94 indicates a highly
significant difference between the two groups. We therefore reject the null hypothesis of equal
means and conclude that countries with low sanitation access experience significantly higher
infant mortality rates than those with high sanitation access, consistent with the negative
relationship found in question 9.
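The t statistic is the difference between the two group means divided by the standard error of that difference. As a rough check, the Python sketch below (illustrative names; it assumes the reported figures are rounded) backs out the difference in means and the standard error it implies.

```python
mean_group0 = 58.46   # reported mean infant mortality, Group 0
mean_group1 = 15.37   # reported mean infant mortality, Group 1
t_value = 15.94       # reported t statistic

# Difference in group means
difference = mean_group0 - mean_group1

# t = difference / standard error, so the implied standard error is:
implied_se = difference / t_value

print(f"difference in means = {difference:.2f}")
print(f"implied std. error  = {implied_se:.2f}")
```

The gap between the groups is many times larger than its standard error, which is why the test rejects equality of means so decisively.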

9. Is there a linear relationship between sanitation access and infant mortality? Explain.
A linear regression of infant mortality on sanitation access reveals a significant negative
linear relationship: a higher level of sanitation access is associated with lower infant
mortality rates. The coefficient estimate of -0.7087415 indicates that each one-unit increase in
sanitation access is associated with a decrease of approximately 0.71 in the infant mortality
rate. The result is statistically significant (p < 0.001), indicating that the relationship is
unlikely to be due to chance. The R-squared value of 0.7243 suggests that about 72.43% of the
variation in infant mortality can be explained by the linear relationship with sanitation
access. Overall, these findings are consistent with sanitation access playing an important role
in reducing infant mortality, although a regression alone does not establish causation.
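To make the slope concrete, the Python sketch below (the function name is hypothetical, not from the do-file) multiplies the reported coefficient by a chosen change in sanitation access to get the predicted change in infant mortality.

```python
SLOPE = -0.7087415  # reported coefficient on sanitation access


def predicted_change(delta_access: float) -> float:
    """Predicted change in infant mortality for a given change in sanitation access."""
    return SLOPE * delta_access


# Example: a 10-unit increase in sanitation access
print(f"{predicted_change(10):.2f}")
```

Under this model, a 10-unit rise in sanitation access corresponds to a predicted fall of about 7.09 in the infant mortality rate.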

10. Regress infmort on lifeexpect, adolfert, adultlit, primedufem and gdppercapita and
discuss the STATA output.

Running this model in Stata produces the following results.


The overall model is statistically significant, as indicated by the F-statistic of 45.07 (p < 0.001).
This suggests that the independent variables are jointly significant in explaining infant
mortality.

The R-squared value of 0.8965 indicates that approximately 89.65% of the variance in infant
mortality can be explained by the independent variables in the model.

"lifeexpect" (life expectancy) has a negative coefficient of -1.289621, suggesting that for each
additional year of life expectancy, holding the other variables constant, there is a decrease of
approximately 1.29 in the infant mortality rate. It is statistically significant (p = 0.007).

"adolfert" (adolescent fertility rate) has a coefficient of 0.027525, indicating a positive
relationship with infant mortality, but it is not statistically significant (p = 0.565).

"adultlit" (adult literacy rate) has a negative coefficient of -0.5835216, implying that higher adult
literacy rates are associated with lower infant mortality rates. It is statistically significant (p <
0.001).

"primedufem" (primary education female-male ratio) has a coefficient of -0.0794956, suggesting
a negative relationship with infant mortality, but it is not statistically significant (p = 0.615).

"gdppercapita" (GDP per capita) has a coefficient of -0.0001073, indicating a negative
association with infant mortality, but it is not statistically significant (p = 0.648).

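The intercept term, represented by "_cons", has a coefficient of 173.1568. This represents the
estimated infant mortality when all independent variables in the model are equal to zero. No
country has zero life expectancy or literacy, so the intercept is an extrapolation rather than a
meaningful prediction.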
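The reported coefficients define a fitted equation that can be evaluated for any set of predictor values. The Python sketch below is an illustration only: the function name and the example country's values are hypothetical, not taken from the dataset.

```python
# Coefficients as reported in the Stata output discussed above
INTERCEPT = 173.1568
COEFFICIENTS = {
    "lifeexpect": -1.289621,
    "adolfert": 0.027525,
    "adultlit": -0.5835216,
    "primedufem": -0.0794956,
    "gdppercapita": -0.0001073,
}


def predict_infmort(values: dict) -> float:
    """Fitted infant mortality: intercept plus coefficient times value for each predictor."""
    return INTERCEPT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


# Hypothetical country (made-up values, for illustration only)
example = {
    "lifeexpect": 70.0,
    "adolfert": 50.0,
    "adultlit": 90.0,
    "primedufem": 95.0,
    "gdppercapita": 10000.0,
}
one_more_year = dict(example, lifeexpect=71.0)

print(f"predicted infmort: {predict_infmort(example):.2f}")
print(f"effect of one more year: {predict_infmort(one_more_year) - predict_infmort(example):.3f}")
```

Setting every predictor to zero returns the intercept, and raising life expectancy by one year while leaving the rest unchanged shifts the prediction by exactly the lifeexpect coefficient.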

The adjusted R-squared value of 0.8767 accounts for the number of independent variables and
provides a more conservative estimate of the model's explanatory power.

In summary, the regression analysis suggests that life expectancy and adult literacy rate are
significant factors associated with infant mortality, while adolescent fertility rate, primary
education female-male ratio, and GDP per capita do not appear to have statistically significant
relationships with infant mortality in this model.
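The adjusted R-squared and F-statistic reported for this model both follow from R-squared, the number of predictors k, and the number of observations n. The Python sketch below shows the standard formulas. The sample size is not stated in this write-up, so n = 32 is an assumption, chosen only because it approximately reproduces the reported values. The small remaining gaps are consistent with rounding of R-squared.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with k predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F-statistic testing that all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


r2 = 0.8965   # reported R-squared
k = 5         # predictors: lifeexpect, adolfert, adultlit, primedufem, gdppercapita
n = 32        # ASSUMPTION: not given in the source; inferred to match the reported output

adj_r2 = adjusted_r_squared(r2, n, k)
f_value = f_statistic(r2, n, k)

print(f"adjusted R-squared = {adj_r2:.4f}")
print(f"F-statistic        = {f_value:.2f}")
```

If the assumption holds, only a small subset of the countries has complete data on all six variables, which is worth keeping in mind when generalising from this model.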
