In the name of the Father, and the Son, and the Holy Spirit

Department of Economics

Degree of B.Sc., Second Year

LI Econometrics 08 30401

Formative Coursework

Page 1 TURN OVER


Use Galton’s height data from the Harvard Dataverse (download using this link) to answer the
following questions. You will need to use STATA to answer the questions. Provide a detailed
answer to each question, including a screenshot of the results and your comments. As an
appendix include a copy of your .do file.

1. Describe the variables in the dataset. How many observations are in this dataset? Which
variables are strings and which variables are numerical?

Answer
STATA code: describe
Sample size: 898
String: family and gender
Numerical: father, mother, height, kids, male, female
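As a complement to describe, the following minimal sketch shows one way to read off the observation count and the split between string and numeric variables directly. It assumes the Galton data are already loaded in memory under the variable names used in this coursework.

```stata
* Assumes the Galton dataset is already in memory
describe                  // storage type and label of every variable
count                     // number of observations
ds, has(type string)      // lists the string variables only
ds, has(type numeric)     // lists the numeric variables only
```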

2. Provide a table of descriptive statistics for the variables ‘height’, ‘father’ and ‘mother’.
What is the mean and standard deviation of the children heights? What is the height of
the shortest mother and the tallest father in the dataset?

Answer
STATA code: sum height father mother
Mean height: 66.76 St.dev: 3.58
Shortest mother: 58 Tallest father: 78.5

3. Which families have children with height greater than 75 inches?

Answer
STATA code: list family if height > 75
Families no. 7, 35, 40 and 72.
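One point worth keeping in mind with this kind of filter: Stata treats numeric missing values as larger than any number, so a condition such as height > 75 would also pick up observations with a missing height. A minimal sketch of a safer variant, assuming the Galton data are in memory:

```stata
* Assumes the Galton dataset is in memory
* Exclude missing heights explicitly, since missing counts as "very large"
list family height if height > 75 & !missing(height)
```

If the variable has no missing values, both versions return the same list.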

4. Tabulate the variable ‘gender’. How many male children are in the dataset? What is the
percentage of the female children?

Answer
STATA code: tab gender
Number of males: 465 Percentage of females: 48.22%
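The percentage of females can be cross-checked from the figures already reported (898 observations, 465 males). A minimal sketch of the arithmetic using Stata's display command:

```stata
* Share of female children implied by 898 observations and 465 males
display %5.2f 100*(898 - 465)/898
```

This prints 48.22, which agrees with the percentage shown by tab gender.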

5. Graph the histogram of the height of children using 15 bins and percentage as the y-axis
with your choice of line and bar colours. Add a normal curve to your graph. Are the
heights normally distributed?

Answer
STATA code: hist height, bin(15) normal percent lcolor(navy) fcolor(ltblue)
The distribution of heights looks symmetric but has a flatter peak than the normal curve; it is approximately normal.

6. Graph a scatter plot of the children's heights against the heights of their fathers. What
kind of relationship can you detect from the graph?

Answer
STATA code: scatter height father
There is a moderate positive linear relationship between the heights of children and the
heights of their fathers.

7. Construct a correlation matrix between the variables: ‘height’, ‘father’, ‘mother’, ‘kids’ and
‘male’. Can you spot any potential multicollinearity problems amongst the variables?

Answer
STATA code: pwcorr height father mother kids male
Most correlation coefficients are low to moderate, so there are no apparent multicollinearity
problems.

8. Estimate the following multiple regression model:

$\text{height}_i = \beta_0 + \beta_1 \text{father}_i + \beta_2 \text{mother}_i + \beta_3 \text{kids}_i + \beta_4 \text{male}_i + \varepsilon_i$

(a) Is the model statistically significant overall?

Answer
STATA code: reg height father mother kids male
The model is overall significant at the 5% level: the F-statistic is large (F* = 398), with a
p-value of almost zero.

(b) How much of the variation in the children heights can be explained by the model?

Answer
Adjusted R² = 0.6391, so about 64% of the variation in children's heights is
explained by the model.
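The overall F-test and the goodness-of-fit figures can also be pulled from the stored results after regress, which avoids reading them off a screenshot. A minimal sketch, assuming the Galton data are in memory:

```stata
* Assumes the Galton dataset is in memory
regress height father mother kids male
display "F statistic        = " e(F)
display "p-value of F test  = " Ftail(e(df_m), e(df_r), e(F))
display "R-squared          = " e(r2)
display "Adjusted R-squared = " e(r2_a)
```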

(c) What is the range that includes the true parameter coefficient of the children gender
at 1% significance level?

Answer
STATA code: reg height father mother kids male, level(99)
The 99% confidence interval for β4 is (4.84, 5.58)
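The interval reported by level(99) can be reproduced by hand from the coefficient, its standard error and the t critical value. The sketch below assumes the Galton data are in memory; the scalar name tcrit is only illustrative.

```stata
* Assumes the Galton dataset is in memory
regress height father mother kids male
scalar tcrit = invttail(e(df_r), 0.005)   // two-sided critical value at the 1% level
display "Lower bound: " _b[male] - tcrit*_se[male]
display "Upper bound: " _b[male] + tcrit*_se[male]
```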

(d) Are all variables statistically significant at 5% level of significance? Which variables
are not (if any)?

Answer
All variables are statistically significant at 5% level (i.e. have p-values < 0.05),
except ‘kids’.

9. Perform the following hypothesis tests at 5% level of significance. State the null and
alternative hypothesis for each test and comment on your results.
(a) The estimated coefficients of the heights of fathers and mothers are quite close. Test
whether they are equal in the population?

Answer
STATA code: test father = mother
$H_0: \beta_1 - \beta_2 = 0$ vs. $H_1: \beta_1 - \beta_2 \neq 0$
Since the p-value > 0.05, we cannot reject the null hypothesis: at the 5% level of
significance there is no evidence that the coefficients on these two variables differ in
the population.

(b) Test whether the sum of β1 and β2 is equal to one.

Answer
STATA code: test father + mother = 1
$H_0: \beta_1 + \beta_2 = 1$ vs. $H_1: \beta_1 + \beta_2 \neq 1$
Since the p-value < 0.05, we reject the null hypothesis, and conclude that the sum
of coefficients is not equal to unity at 5% level of significance.

(c) Test the joint significance of ‘father’ and ‘mother’, and the joint significance of ‘kids’
and ‘male’.

Answer
STATA code: test father mother
$H_0: \beta_1 = \beta_2 = 0$ vs. $H_1$: at least one of $\beta_1, \beta_2 \neq 0$
Since the p-value < 0.05, we reject the null hypothesis, and conclude that the
heights of fathers and mothers are jointly significant at the 5% level.

STATA code: test kids male


$H_0: \beta_3 = \beta_4 = 0$ vs. $H_1$: at least one of $\beta_3, \beta_4 \neq 0$
Since p-value < 0.05, we reject the null hypothesis, and conclude that the number
of children and the gender of the kids are jointly significant at the 5% level.
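For the single-restriction tests in parts (a) and (b), lincom offers a complementary view: it reports the estimated linear combination with its standard error and confidence interval, so the size of the difference or sum is visible as well as the test result. A minimal sketch, assuming the Galton data are in memory:

```stata
* Assumes the Galton dataset is in memory
regress height father mother kids male
lincom father - mother    // estimate of beta1 - beta2; H0 is that it equals 0
lincom father + mother    // estimate of beta1 + beta2; check whether 1 lies in the 95% CI
```

For the second line, the null hypothesis that the sum equals one is rejected at the 5% level when the value 1 falls outside the reported 95% confidence interval.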

10. Conduct a Farrar-Glauber multicollinearity test. What is your conclusion about the
existence, location and pattern of multicollinearity in this model?

Answer
STATA code: fgtest height father mother kids male
H0 : X’s are orthogonal vs. H1 : X’s are not orthogonal

Since the p-value of the χ2 test is < 0.05, we reject the null hypothesis, and conclude
that multicollinearity exists.

Since the p-value of the F test from the regression of ‘father’ on the other X’s is < 0.05, we
reject the null hypothesis, and conclude that the variable ‘father’ is a location of
multicollinearity.

Since the t-statistic of the partial correlation coefficient between ‘father’ and ‘kids’ is
greater than the critical value, we conclude that the variable ‘father’ is multicollinear
with the variable ‘kids’.
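Note that fgtest does not appear to be part of official Stata; the sketch below assumes it is a user-written command available from SSC. As a cross-check that uses only official commands, variance inflation factors can be requested after the regression.

```stata
* Assumption: fgtest is a user-written command hosted on SSC
ssc install fgtest, replace

* Assumes the Galton dataset is in memory
regress height father mother kids male
estat vif                 // variance inflation factors (official Stata)
```

Low variance inflation factors alongside a significant Farrar-Glauber χ² statistic would suggest that the detected dependence among the regressors is statistically detectable but mild.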

11. Generate a variable called ‘midparent’, which is the average of the mother’s and father’s
heights. Re-estimate the children's height using only ‘midparent’ and ‘male’ as your
explanatory variables. Check the significance of the model and its individual parameters.
Does the new model suffer from multicollinearity?

Answer
STATA code: generate midparent = (father+mother)/2
STATA code: label var midparent "Average of father and mother height"
STATA code: reg height midparent male
STATA code: fgtest height midparent male

The new model is overall significant, with a similar adjusted R² = 0.6374. All variables
are individually significant at the 5% level.

H0 : X’s are orthogonal vs. H1 : X’s are not orthogonal


Since the p-value of the χ² test is > 0.05, we cannot reject the null hypothesis, so there
is no evidence of multicollinearity in the new model.

12. Predict the fitted values of children heights, as well as the residual values from the new
model you estimated in the previous question.
(a) Use the residual values to graph the QQ plot. What do you conclude? Formally
test for error normality using the Jarque-Bera test. If the errors are found to be
non-normal, suggest possible solutions.

Answer
STATA code: predict height_hat, xb
STATA code: predict resid, residuals
STATA code: qnorm resid
STATA code: sktest resid

The QQ plot shows that the residuals closely follow the normal distribution, except
for the existence of three possible outliers, which may skew the distribution.

$H_0: \varepsilon_i \sim N(0, \sigma^2)$ vs. $H_1: \varepsilon_i \nsim N(0, \sigma^2)$


Since the p-value < 0.05, we reject the null hypothesis, and conclude that the
errors are not normally distributed.

Since the problem is likely caused by the existence of outliers, we should remove
these from the dataset, and re-estimate the model without them.
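The sktest command reports Stata's skewness and kurtosis test for normality, which is closely related to, but not identical to, the Jarque-Bera statistic named in the question. A minimal sketch of computing the Jarque-Bera statistic by hand is given below; it assumes the variable resid has already been created with predict resid, residuals, and the scalar name JB is only illustrative.

```stata
* Assumes resid exists (created with: predict resid, residuals)
quietly summarize resid, detail
* JB = N * ( S^2/6 + (K - 3)^2/24 ), asymptotically chi-squared with 2 df
scalar JB = r(N) * (r(skewness)^2/6 + (r(kurtosis) - 3)^2/24)
display "Jarque-Bera statistic = " JB
display "p-value (chi2, 2 df)  = " chi2tail(2, JB)
```

A p-value below 0.05 here would support the same conclusion as sktest, namely that normality of the errors is rejected.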

(b) Use the Breusch-Pagan test to detect the existence of heteroskedasticity in the model.
If the problem exists, use White’s robust variance estimator to re-estimate the model.

Answer
STATA code: estat hettest
STATA code: reg height midparent male, vce(robust)

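Since the p-value of the Breusch-Pagan test is < 0.05, we reject the null hypothesis of
constant error variance and conclude that the model suffers from heteroskedasticity.
The model is therefore re-estimated with White's robust standard errors.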
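The order of the commands matters: estat hettest uses the most recent estimation results and is meant to follow a regression with conventional standard errors, so it is safest to re-run the regression immediately before the test. A minimal sketch of the sequence, assuming the Galton data and the midparent variable are in memory:

```stata
* Assumes the Galton dataset and the midparent variable are in memory
regress height midparent male                // conventional standard errors
estat hettest                                // Breusch-Pagan / Cook-Weisberg test
regress height midparent male, vce(robust)   // same coefficients, robust standard errors
```

The robust option leaves the coefficient estimates unchanged; only the standard errors, t-statistics and confidence intervals differ.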

Page 7 END OF PAPER
