
Economics 326

Winter 2021
Midterm 1 Exam Answers

Question I

1. Error term in a regression: The error term in a regression is the component that
captures the variation in the dependent variable arising from forces other than
variation in the right hand side variables. It can contain omitted variables that
affect the dependent variable, errors arising from an incorrect specification of the
functional form for the right hand side variables, and/or measurement error in the
dependent and/or independent variables. It is the difference between the dependent
variable and the population regression function (e.g., β0 + β1 xi ).

2. R² of a regression: The R² provides information on how well the regression model
is fitting the data. It is a number between 0 and 1, and is equal to ESS/TSS, where ESS
is the explained sum of squares, and is equal to $\sum_i (\hat{Y}_i - \bar{Y})^2$, and TSS is the total
sum of squares, and is equal to $\sum_i (Y_i - \bar{Y})^2$.
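
As a rough illustration of the ESS/TSS definition, the sketch below rebuilds R² from the sums of squares that Stata stores after a regression. It uses Stata's built-in auto dataset and the variables price and mpg purely as stand-ins (an assumption; they are not part of the exam), and it relies on TSS = ESS + residual sum of squares, which holds when the regression includes an intercept.

```stata
* Illustrative only: built-in auto dataset, not the exam data
sysuse auto, clear
regress price mpg

* e(mss) is the explained (model) sum of squares, e(rss) the residual sum of squares
scalar ess = e(mss)
scalar tss = e(mss) + e(rss)

* the ratio should match the R-squared that regress reports and stores in e(r2)
display "ESS/TSS = " ess/tss
display "e(r2)   = " e(r2)
```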

3. Homoskedasticity: If we are thinking of a regression of Y on X, we say there is
homoskedasticity if the variance of the error term, u, does not vary with the values
of X. Mathematically, $\mathrm{Var}(u_i \mid X_i) = \mathrm{Var}(u_i) = a$, where a is some constant.

Question II
To get full marks, the answer should clearly state the null and the alternative
hypotheses and specify the significance level for the test. It should then set up the test
statistic, stating the correct distribution for it. In the first case, we can use a standard
normal because under the null we know the variance of the sample proportion (and
we will assume that the sample sizes are big enough for the CLT to hold). The
answer should state the conclusion in the right language (either "do not reject the null
hypothesis at the α*100% level" or the opposite). If it says anything about "proving" or
"disproving" the null, a mark is taken off.

For part i), the answer is

$$ z = \frac{0.1}{\sqrt{\dfrac{0.5 \cdot 0.5}{100}}} = 2 \qquad (1) $$
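
The arithmetic in equation (1) can be checked with a few display commands. This is a minimal sketch: the two-sided p-value and critical value assume a two-sided alternative, which is an assumption here because the question text is not reproduced in this answer key.

```stata
* test statistic from equation (1): difference of 0.1, null proportion 0.5, n = 100
scalar z = 0.1/sqrt(0.5*0.5/100)
display "z = " z

* two-sided p-value from the standard normal CDF (assumes a two-sided alternative)
display "p-value = " 2*(1 - normal(abs(z)))

* two-sided 5% critical value of the standard normal
display "critical value = " invnormal(0.975)
```

The statistic equals 2, and the two-sided p-value is a little under 0.05.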

ii) The correct distribution is the standard Normal. This is because we form the test
statistic under the null hypothesis, i.e., as if the null hypothesis holds. If it did, then
we would know the variance of the sample proportion, $p(1-p)/n$. And if we know the
variance, then the z statistic is distributed as N(0,1).
For part iii), under the null, the proportions are the same for both populations. This
means we can pool the two samples to get a better estimate of the proportion. But even
under the null, we don’t know the value of the proportion and so the statistic has a t
instead of z distribution.

As in part i), a complete answer includes stating the null and alternative hypotheses in
terms of the population proportions:
H0 : po = pno
Ha : po ≠ pno
Then choosing a level of significance, e.g., α = 0.05.

Next, we obtain the sample proportions we require (the proportions for each separate
sample, p̂o , p̂no and for the two pooled, p̂p ).

Then form the test statistic:

$$ t = \frac{\hat{p}_o - \hat{p}_{no}}{\sqrt{\dfrac{2 \cdot \hat{p}_p \cdot (1 - \hat{p}_p)}{100}}} = \frac{0.1}{\sqrt{2 \cdot 0.0023}} = 1.49 \qquad (2) $$

State the distribution of the test statistic (Student's t with 198 degrees of freedom,
though this is just an approximation; if you worked with it as a standard normal, that
also got full marks). Next, obtain the critical value for the 5% level of significance, which
is approximately 1.96 from the table. Since the t-statistic value is less than 1.96, we
do not reject the null hypothesis at the 5% level of significance.
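
The comparison for part iii) can be sketched in Stata as follows. The inputs are the rounded numbers from equation (2), so the result can differ slightly from the reported 1.49 because of rounding in the intermediate values; the scalar name t_iii is only illustrative.

```stata
* 0.0023 is the rounded pooled term p_p*(1 - p_p)/100 from equation (2)
scalar t_iii = 0.1/sqrt(2*0.0023)
display "t = " t_iii

* two-sided 5% critical values: Student's t with 198 df, and the standard normal
display "t critical (198 df) = " invttail(198, 0.025)
display "z critical          = " invnormal(0.975)
```

Both critical values are close to 2 and the statistic is well below either, so the conclusion does not depend on whether the t or the normal table is used.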

The answer for part iv) is very similar to the answer for part iii). The null hypothesis
is that the difference between the two proportions equals 0.05. In forming the test
statistic, this means that we subtract 0.05 from the numerator used in the test statistic
in part iii). The denominator remains the same.
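
A minimal sketch of the part iv) statistic, reusing the rounded values from part iii) (the scalar name t_iv is only illustrative):

```stata
* numerator shifted by the hypothesized difference of 0.05; same denominator as in part iii)
scalar t_iv = (0.1 - 0.05)/sqrt(2*0.0023)
display "t = " t_iv
```

This gives a value of about 0.74.
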
Question III
1. The four assumptions required for unbiased estimates, using OLS, are: 1) The
model is linear in parameters and you have specified the correct model, 2) The data
is a random sample, 3) None of the independent variables are linear combinations
of each other, and each has a non-zero variance, 4) Zero conditional mean: E[u|X] =
0.

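2. We can write:

$$
\begin{aligned}
b &= \frac{Y_j - Y_k}{X_j - X_k} \\
  &= \frac{(\beta_0 + \beta_1 X_j + u_j) - (\beta_0 + \beta_1 X_k + u_k)}{X_j - X_k} \\
  &= \beta_1 + \frac{u_j - u_k}{X_j - X_k}
\end{aligned}
\qquad (3)
$$

Then, treating X as fixed,

$$
\begin{aligned}
E[b] &= \beta_1 + \frac{E[u_j - u_k]}{X_j - X_k} \\
     &= \beta_1
\end{aligned}
\qquad (4)
$$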

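3. Again, treat X as fixed. We assume $\mathrm{Var}(u_j) = \mathrm{Var}(u_k) = \sigma^2$. Then,

$$
\begin{aligned}
\mathrm{Var}(b) &= E\left[(b - \beta_1)^2\right] \\
  &= E\left[\left(\frac{u_j - u_k}{X_j - X_k}\right)^2\right] \\
  &= \frac{1}{(X_j - X_k)^2} \cdot \mathrm{Var}(u_j - u_k) \\
  &= \frac{2\sigma^2}{(X_j - X_k)^2}
\end{aligned}
\qquad (5)
$$

where in the last line we use the fact that uj , uk are uncorrelated.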
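
The results in (4) and (5) can be checked with a small simulation. This is a sketch under assumed values that do not come from the exam: β0 = 1, β1 = 2, Xj = 3, Xk = 1, and independent standard normal errors (so σ² = 1). Each row of the simulated dataset is one draw of the pair (uj, uk).

```stata
clear all
set seed 326
set obs 10000

* hypothetical parameter values: beta0 = 1, beta1 = 2, Xj = 3, Xk = 1, sigma = 1
gen uj = rnormal()
gen uk = rnormal()
gen yj = 1 + 2*3 + uj
gen yk = 1 + 2*1 + uk

* the two-point slope estimator b = (Yj - Yk)/(Xj - Xk)
gen b = (yj - yk)/(3 - 1)

* the mean should be close to beta1 = 2,
* the variance close to 2*sigma^2/(Xj - Xk)^2 = 2/4 = 0.5
summarize b
display "mean of b     = " r(mean)
display "variance of b = " r(Var)
```
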
Question IV
grades:
i) 10 marks
ii) 5 marks
iii) 5 marks
iv) 5 marks
v) 5 marks
The code is as follows:

* read in the data, assuming that the data file is in the current directory
infix age 18-20 price 37-44 inc 109-116 sex 151 using "sfs.txt"

* work only with people age 20 to 79
* note that the age variable is continuous within this range
keep if age >= 20 & age <= 79

* note from reading the manual that the housing price has missing values that are
* coded as 99999996
* drop the observations with no housing price data
* alternatively, could include the conditional statement 'if price < 99999996' in all
* Stata commands
keep if price < 99999996

* get means, variances, minimum and maximum values for each variable
sum age price inc sex, detail
* because sex is a binary variable, could tabulate it instead
tab sex

* plot price against each variable
* I also put in a best fit line to help in seeing the relationship
twoway (scatter price age) (lfit price age)
twoway (scatter price inc) (lfit price inc)

* check correlations among the exogenous variables (or plot them against each other)
corr age inc

* ttest of difference by sex
ttest price, by(sex)

* to run the regression, first generate log values of price and inc
gen lprice = ln(price)
gen linc = ln(inc)

* run the regression
reg lprice linc age

* get the estimate of α1 in two stages
* first stage: regress linc on age and capture the residuals
reg linc age
predict resid, residuals
* second stage: regress lprice on residuals from first stage regression
reg lprice resid

To get full marks, all the commands have to be correct (i.e., if implemented in Stata,
they would work).
Places where marks could be lost: not specifying the location of the text file; not
narrowing the age range of the sample; not catching that price has missing values, implying
that observations need to be dropped (this could be done with drop or keep commands
or as 'if' components of other commands); not summarizing all variables and checking
out the relationships among the right hand side variables; not generating log versions
of the price and income variables.
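
The two-stage step at the end of the Question IV code relies on the fact that the coefficient from the second-stage regression equals the coefficient on the same variable in the full regression. The sketch below checks this on Stata's built-in auto dataset, with lweight and mpg standing in for linc and age (an assumption made only so the example runs without the exam data file).

```stata
* Illustrative only: built-in auto dataset, with weight and mpg as stand-in regressors
sysuse auto, clear
gen lprice = ln(price)
gen lweight = ln(weight)

* full regression: keep the coefficient on lweight
regress lprice lweight mpg
scalar b_full = _b[lweight]

* first stage: partial mpg out of lweight and keep the residuals
regress lweight mpg
predict double resid_s1, residuals

* second stage: the slope on the residuals should match b_full
regress lprice resid_s1
display "difference = " b_full - _b[resid_s1]
```

The difference should be zero up to floating-point error. The standard errors from the second stage are not the same as those from the full regression, so only the point estimate carries over.
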
Question V
marks:
i) 5 marks
ii) 3 marks
iii) 4 marks
iv) 3 marks

i) They should calculate the elasticity using the exact formula: exp(0.3) - 1 = 0.35.
Thus, a 1% increase in market income is associated with a 0.35% increase in the price
of the house purchased.
ii) The age coefficient implies that each extra year of age is associated with a 0.1%
increase in the price of the house.
iii) The intercept takes a value of -1. This means that if age = 0 and linc = 0 (i.e., inc
equals $1) then the log price equals -1. Whether the friend is right to be concerned is a
little nuanced. They are wrong about this being a negative price because the dependent
variable is in logs. The implied price for the family with a head age 0 and $1 in income
is exp(-1) = 0.37, or 37 cents. Even if it's not negative, this is still an unrealistic value.
But you should not be concerned because the family type for which this is true (one
with a newborn as the head of the family and $1 in market income) does not exist and,
in fact, is well away from the sample. This coefficient value is just part of getting the
regression line at the right level and does not have economic meaning in itself.
iv) The R² tells us the proportion of the variation in the house price explained by the
right hand side variables in the regression. In this case, 11.9% of the variation in house
prices is explained by income and age.
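
The two exponentials quoted in parts i) and iii) can be reproduced directly in Stata as a quick arithmetic check:

```stata
* part i): exp(0.3) - 1, approximately 0.35
display exp(0.3) - 1

* part iii): implied price level when the log price equals -1, approximately 0.37
display exp(-1)
```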
