
DATA ANALYSIS FOR ECONOMICS: ADDITIONAL PROBLEM SET UNITS I AND II.

Jorge Pena
October 2021, IE University

1. X and Y are two independent random variables with E(X)=1, V(X)=1, E(Y)=2, V(Y)=1.
a) Calculate the expectation and the variance of W=2X+Y.
b) Calculate the expectation and the variance of W=3X+4Y.
c) Calculate the expectation and the variance of W=2X-4Y.

2. Suppose x1 and x2 are two independent observations of the random variable X using a random
sample of the population. We know that E(X)=1 and V(X)=1. Consider the following 3 estimators of
E(X):
$$W = \frac{1}{2}x_1 + \frac{1}{2}x_2$$
$$Z = \frac{1}{4}x_1 + \frac{3}{4}x_2$$
$$Y = \frac{1}{3}x_1 + \frac{2}{3}x_2$$

a) Show that the 3 estimators are unbiased.


b) Which of the estimators is the most efficient? Which estimator would you choose?

3. The following data are heights (inches) and weights (pounds) of swimmers.

Height Weight
68 132
64 108
62 102
65 115
66 128

a) Draw the scatter plot of these data using height as the independent variable.
b) What does the scatter diagram in part (a) indicate about the relationship between the two
variables?
c) Obtain the estimated regression equation by calculating the estimators of b0 and b1. Interpret
the results.
d) If a swimmer's height is 63 inches, what will her estimated weight be?

4. Consumer Reports publishes tests and evaluations on HDTVs. For each model, a general
evaluation based mainly on the quality of the image was made. A higher rating indicates better
performance. General evaluation and pricing of 42-inch plasma televisions are given in the following
data.

Brand       Price (x)   Evaluation score (y)
Dell        2800        62
Hisense     2800        53
Hitachi     2700        44
JVC         3500        50
LG          3300        54
Maxent      2000        39
Panasonic   4000        66
Phillips    3000        55
Proview     2500        34
Samsung     3000        39

a) Use these data to obtain an estimated regression equation that can be used to estimate the
overall evaluation score of a 42-inch television given the price. Interpret the results.
b) Calculate the R2 of the regression. Did the estimated regression equation provide a good fit?
c) Estimate the overall evaluation score for a TV priced at $3,200.

5. Consider the following table with data on GPA (x) and monthly wage (y) for a random sample of 6
students.

GPA (x) Wage (y)


2.6 3300
3.4 3600
3.6 4000
3.2 3500
3.5 3900
2.9 3600

a) Estimate a linear regression model that explains the average wage (y) as a function of the
GPA (x). Interpret the results.
b) What other factors can be contained in u? Comment on the properties of the OLS estimators
based on the SLR assumptions.
c) Calculate SST, SSE and SSR.
d) Calculate the R2 of the regression and comment on the goodness of fit.
e) Calculate the standard deviation of the OLS estimator of b1.
f) Repeat point a) but assuming a log-log model. Interpret the results.

Solution to problem 5:

a) The population equation is

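$$\text{wage} = \beta_0 + \beta_1 \, \text{GPA} + u$$

We compute the OLS estimators:

$$\hat{\beta}_1 = \frac{Cov(x, y)}{V(x)} = \frac{86}{0.148} = 581.1$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 3650 - 581.1 \cdot 3.2 = 1790.5$$

The estimated equation is:

$$\widehat{\text{wage}} = 1790.5 + 581.1 \, \text{GPA}$$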

If the GPA=0, then the estimated average wage is 1790.5. If the GPA increases by 1 point,
then the estimated average wage increases by 581.1 Euros in the sample.
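
The calculation in part a) can be reproduced with a short script. This is a minimal sketch in plain Python; the helper name `ols_fit` is an illustrative choice, not something defined in the problem set. It uses the sample covariance and variance with the n - 1 divisor, as in the solution.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    n = len(x)
    # Sample covariance and variance (n - 1 divisor); the divisor cancels in the ratio.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Data from problem 5.
gpa = [2.6, 3.4, 3.6, 3.2, 3.5, 2.9]
wage = [3300, 3600, 4000, 3500, 3900, 3600]

b0, b1 = ols_fit(gpa, wage)
print(round(b0, 1), round(b1, 1))
```

The rounded output should agree with the hand calculation (intercept close to 1790.5, slope close to 581.1).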
b) The error term (u) contains the intelligence, ability, experience and other skills of the students.
It is likely that assumption 4 (zero conditional mean of u given the GPA) does not hold in this
application, because experience, ability and intelligence probably affect the wage and are also
correlated with the GPA. Hence, the OLS estimate has no causal interpretation: it captures not
only the effect of the GPA but the joint effect of all these factors. The positive relation
observed is just a correlation, not a causal relation.
c) We can use the information of the following table:

Obs.   GPA (x)   Wage (y)   $(y_i-\bar{y})^2$   Fitted ($\hat{y}_i$)   $(\hat{y}_i-\bar{y})^2$   Residual ($\hat{u}_i$)   Residual sq. ($\hat{u}_i^2$)
1      2.6       3300       122500              3301.4                 121549.8                  -1.4                     1.8
2      3.4       3600       2500                3766.2                 13511.7                   -166.2                   27635.7
3      3.6       4000       122500              3882.5                 54037.7                   117.5                    13815.7
4      3.2       3500       22500               3650.0                 0.0                       -150.0                   22506.0
5      3.5       3900       62500               3824.4                 30397.9                   75.6                     5722.9
6      2.9       3600       2500                3475.7                 30384.0                   124.3                    15453.0
Sum    19.2      21900      335000              21900.1                249881.1                  -0.12                    85135.1

Therefore

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 335000$$
$$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 249881.1$$
$$SSR = \sum_{i=1}^{n} \hat{u}_i^2 = 85135.1$$

d) The $R^2$ is given by:

$$R^2 = \frac{SSE}{SST} = \frac{249881}{335000} = 0.75$$

The regression explains 75% of the variance of the wage. This is a high $R^2$; however, it does
not imply causality. The remaining 25% is explained by other factors, which are likely to be
correlated with the GPA.
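
The sums of squares and the R2 can be checked numerically. The sketch below recomputes them from the raw data with unrounded coefficients, so SSE comes out slightly below the tabulated 249881.1 (which is built from the rounded 1790.5 and 581.1) while SST and SSR agree. The variable names are illustrative assumptions.

```python
# Data from problem 5.
gpa = [2.6, 3.4, 3.6, 3.2, 3.5, 2.9]
wage = [3300, 3600, 4000, 3500, 3900, 3600]

n = len(gpa)
x_bar = sum(gpa) / n
y_bar = sum(wage) / n

# OLS coefficients (unrounded).
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(gpa, wage))
      / sum((x - x_bar) ** 2 for x in gpa))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in gpa]

sst = sum((y - y_bar) ** 2 for y in wage)               # total sum of squares
sse = sum((f - y_bar) ** 2 for f in fitted)             # explained sum of squares
ssr = sum((y - f) ** 2 for y, f in zip(wage, fitted))   # residual sum of squares
r2 = sse / sst

print(round(sst, 1), round(sse, 1), round(ssr, 1), round(r2, 2))
```

With unrounded coefficients the identity SST = SSE + SSR holds up to floating-point error, which is a useful sanity check on a hand-built table.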
e) The estimator of the standard error of the OLS estimator of $\beta_1$ is defined as

$$s.e.(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{SST_x}}$$

where, because two parameters ($\beta_0$ and $\beta_1$) are estimated, the residual variance
uses $n-2$ degrees of freedom:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n-2} = \frac{85135.1}{4} = 21283.8$$

and

$$SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (2.6-3.2)^2 + (3.4-3.2)^2 + (3.6-3.2)^2 + (3.2-3.2)^2 + (3.5-3.2)^2 + (2.9-3.2)^2 = 0.74$$

Therefore

$$s.e.(\hat{\beta}_1) = \sqrt{\frac{21283.8}{0.74}} = \sqrt{28761.9} = 169.59$$

The estimated standard error of the OLS estimator of $\beta_1$ is 169.59, relatively high
compared with the value of $\hat{\beta}_1$, 581.1. This shows that the slope is estimated
imprecisely. The imprecision can be explained by the high variance of the residual, i.e. by all the
factors other than the GPA that explain the wage, and also by the low variability of GPA. To
increase the precision of the estimator we can increase the sample size so that the sample contains
more heterogeneous students, thereby increasing $SST_x$. Likewise, we can add more variables to
the regression, reducing the variance of the residual ($\hat{\sigma}^2$).
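
As a numerical cross-check of part e), the sketch below computes the standard error of the slope from the raw data. It assumes the usual simple-regression convention of dividing SSR by n - 2 (observations minus the two estimated coefficients) and uses the unrounded SSTx = 0.74. The function name and the `params` argument are illustrative assumptions.

```python
from math import sqrt

# Data from problem 5.
gpa = [2.6, 3.4, 3.6, 3.2, 3.5, 2.9]
wage = [3300, 3600, 4000, 3500, 3900, 3600]

def slope_standard_error(x, y, params=2):
    """Estimated standard error of the OLS slope in a simple regression.

    `params` is the number of estimated coefficients (intercept and slope),
    so the residual variance divides SSR by n - params.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ssr / (n - params)
    return sqrt(sigma2_hat / sst_x)

se_b1 = slope_standard_error(gpa, wage)
print(round(se_b1, 1))
```

Under these assumptions the result is roughly 169.6. Passing `params=1` instead would divide SSR by n - 1 and give a smaller value.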
f) The new log-log model is

ln (𝑤𝑎𝑔𝑒) = 𝛽# + 𝛽! ln (𝐺𝑃𝐴) + 𝑢

The new dataset is

Obs.   GPA (x)   Wage (y)   ln(GPA), ln(x)   ln(wage), ln(y)
1      2.6       3300       0.96             8.10
2      3.4       3600       1.22             8.19
3      3.6       4000       1.28             8.29
4      3.2       3500       1.16             8.16
5      3.5       3900       1.25             8.27
6      2.9       3600       1.06             8.19
Sum    19.2      21900      6.94             49.20

These values can be used to obtain the OLS estimators:

$$\hat{\beta}_1 = \frac{Cov(\ln(x), \ln(y))}{V(\ln(x))} = \frac{0.0076}{0.0156} = 0.49$$
$$\hat{\beta}_0 = \overline{\ln(y)} - \hat{\beta}_1 \overline{\ln(x)} = 8.2 - 0.49 \cdot 1.16 = 7.63$$

The estimated OLS regression line is

$$\widehat{\ln(\text{wage})} = 7.63 + 0.49 \ln(\text{GPA})$$

The intercept does not have a useful interpretation: it is the value of $\widehat{\ln(\text{wage})}$ when
ln(GPA)=0, that is, when GPA=1. As for the slope, if GPA increases by 1%, the estimated average
wage increases by 0.49%. Recall that this interpretation does not imply causality: the
estimator in the log-log model has the same problems as the estimator in the initial model in
levels.
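
The log-log fit can be reproduced by taking natural logs of both variables and running the same OLS formulas. This is a sketch with illustrative variable names. Because it works with unrounded logs, the intercept comes out near 7.64 rather than the 7.63 obtained by hand from rounded inputs.

```python
from math import log

# Data from problem 5.
gpa = [2.6, 3.4, 3.6, 3.2, 3.5, 2.9]
wage = [3300, 3600, 4000, 3500, 3900, 3600]

ln_x = [log(v) for v in gpa]
ln_y = [log(v) for v in wage]

n = len(ln_x)
x_bar = sum(ln_x) / n
y_bar = sum(ln_y) / n

# In a log-log model the slope is an elasticity.
elasticity = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ln_x, ln_y))
              / sum((x - x_bar) ** 2 for x in ln_x))
intercept = y_bar - elasticity * x_bar

print(round(intercept, 2), round(elasticity, 2))
```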

SOLUTIONS

1. We use the following properties of random variables. If W=a∙X+b∙Y, where a and b
are any real numbers:

$$E(W) = a \cdot E(X) + b \cdot E(Y)$$
$$V(W) = a^2 \cdot V(X) + b^2 \cdot V(Y) + 2 \cdot a \cdot b \cdot Cov(X, Y)$$

If we have W=a∙X-b∙Y:

$$E(W) = a \cdot E(X) - b \cdot E(Y)$$
$$V(W) = a^2 \cdot V(X) + b^2 \cdot V(Y) - 2 \cdot a \cdot b \cdot Cov(X, Y)$$

Note that in this exercise X and Y are independent random variables, therefore Cov(X,Y)=0.

a) We apply the properties to the specific case of the example (a=2, b=1) to get

$$E(W) = 2 \cdot E(X) + E(Y) = 2 \cdot 1 + 2 = 4$$
$$V(W) = 2^2 \cdot V(X) + 1^2 \cdot V(Y) + 2 \cdot 2 \cdot 1 \cdot 0 = 2^2 \cdot 1 + 1^2 \cdot 1 = 5$$

b) We apply the properties to the specific case of the example (a=3, b=4) to get

$$E(W) = 3 \cdot E(X) + 4 \cdot E(Y) = 3 \cdot 1 + 4 \cdot 2 = 11$$
$$V(W) = 3^2 \cdot V(X) + 4^2 \cdot V(Y) + 2 \cdot 3 \cdot 4 \cdot 0 = 3^2 \cdot 1 + 4^2 \cdot 1 = 25$$

c) We apply the properties for the difference W=a∙X-b∙Y (a=2, b=4) to get

$$E(W) = 2 \cdot E(X) - 4 \cdot E(Y) = 2 \cdot 1 - 4 \cdot 2 = -6$$
$$V(W) = 2^2 \cdot V(X) + 4^2 \cdot V(Y) - 2 \cdot 2 \cdot 4 \cdot 0 = 2^2 \cdot 1 + 4^2 \cdot 1 = 20$$
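
The three cases follow the same pattern, so they can be checked with one small function. This is a sketch: the function name and signature are illustrative, and a subtraction W=a∙X-b∙Y is handled by passing a negative b.

```python
def linear_combo_moments(a, b, mean_x, var_x, mean_y, var_y, cov_xy=0):
    """Expectation and variance of W = a*X + b*Y.

    cov_xy defaults to 0, which is the case for independent X and Y.
    """
    expectation = a * mean_x + b * mean_y
    variance = a ** 2 * var_x + b ** 2 * var_y + 2 * a * b * cov_xy
    return expectation, variance

# E(X)=1, V(X)=1, E(Y)=2, V(Y)=1 as in problem 1.
for a, b in [(2, 1), (3, 4), (2, -4)]:
    print(a, b, linear_combo_moments(a, b, 1, 1, 2, 1))
```

Note that the sign of b only matters for the expectation and for the covariance term; with zero covariance the variance depends on b only through its square.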

2. Note that x1 and x2 are also random variables with E(x1)=E(x2)=E(X)=1 and V(x1)=V(x2)=V(X)=1. In
this exercise we use the properties of random variables used in problem 1.
a) Recall that an estimator is said to be unbiased if its expectation equals the parameter we
want to estimate; in this case the parameter of interest is E(X), which is 1.

$$E(W) = \frac{1}{2}E(x_1) + \frac{1}{2}E(x_2) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1 = 1$$
$$E(Z) = \frac{1}{4}E(x_1) + \frac{3}{4}E(x_2) = \frac{1}{4} \cdot 1 + \frac{3}{4} \cdot 1 = 1$$
$$E(Y) = \frac{1}{3}E(x_1) + \frac{2}{3}E(x_2) = \frac{1}{3} \cdot 1 + \frac{2}{3} \cdot 1 = 1$$

Hence, the 3 estimators are unbiased, i.e. E(W)=E(X)=1, E(Z)=E(X)=1 and E(Y)=E(X)=1.
b) To check the efficiency of the estimators we need to calculate their variance. Note that since
x1 and x2 are independent random variables their covariance is 0. Therefore,

$$V(W) = \left(\frac{1}{2}\right)^2 V(x_1) + \left(\frac{1}{2}\right)^2 V(x_2) = \frac{1}{4} \cdot 1 + \frac{1}{4} \cdot 1 = 0.5$$
$$V(Z) = \left(\frac{1}{4}\right)^2 V(x_1) + \left(\frac{3}{4}\right)^2 V(x_2) = \frac{1}{16} \cdot 1 + \frac{9}{16} \cdot 1 = 0.625$$
$$V(Y) = \left(\frac{1}{3}\right)^2 V(x_1) + \left(\frac{2}{3}\right)^2 V(x_2) = \frac{1}{9} \cdot 1 + \frac{4}{9} \cdot 1 = 0.556$$

We have that V(W)<V(Y)<V(Z). Hence, since the 3 estimators are unbiased, on efficiency grounds
we prefer W, then Y, and last Z.
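
The comparison can be verified with exact fractions, which avoids rounding slips in sums such as 1/16 + 9/16 = 10/16 = 0.625. This is a sketch: the function name and the dictionary of weights are illustrative assumptions.

```python
from fractions import Fraction

def estimator_moments(w1, w2, mean_x=1, var_x=1):
    """Expectation and variance of w1*x1 + w2*x2 for independent draws x1, x2."""
    expectation = (w1 + w2) * mean_x
    variance = (w1 ** 2 + w2 ** 2) * var_x
    return expectation, variance

weights = {
    "W": (Fraction(1, 2), Fraction(1, 2)),
    "Z": (Fraction(1, 4), Fraction(3, 4)),
    "Y": (Fraction(1, 3), Fraction(2, 3)),
}

for name, (w1, w2) in weights.items():
    expectation, variance = estimator_moments(w1, w2)
    print(name, expectation, variance, round(float(variance), 3))
```

All three expectations equal 1 because the weights sum to one. The variance is smallest for equal weights, which is why W is preferred.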

3.

a) The scatter plot is shown below:

[Figure: scatter plot of weight (pounds) against height (inches).]
b) There is a positive relation between the two variables: the taller the swimmer, the higher the
weight.
c) Given the information of the following table:

Observation   Height (x)   Weight (y)   $(x_i-\bar{x})^2$   $(x_i-\bar{x})(y_i-\bar{y})$
1             68           132          9                   45
2             64           108          1                   9
3             62           102          9                   45
4             65           115          0                   0
5             66           128          1                   11
Sum           325          585          20                  110

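The averages can be computed as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{325}{5} = 65, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{585}{5} = 117$$

The variance of height (x) is:

$$V(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{20}{4} = 5$$

The covariance between height (x) and weight (y) is

$$Cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{110}{4} = 27.5$$

We can compute the OLS estimators:

$$\hat{\beta}_1 = \frac{Cov(x, y)}{V(x)} = \frac{27.5}{5} = 5.5$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 117 - 5.5 \cdot 65 = -240.5$$

The estimated regression is

$$\widehat{\text{weight}} = -240.5 + 5.5 \, \text{height}$$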

The constant gives the average value of weight when height=0; in this case it does not have
any meaningful interpretation, as height is never 0. The value of the slope ($\hat{\beta}_1$) gives the
average change in the dependent variable when the explanatory variable increases by 1, so if
height increases by 1 inch, then weight increases by 5.5 pounds on average. Note that this
interpretation applies to the sample but not to the population unless the SLR assumptions
hold.
d) We just have to replace height=63 in the estimated regression equation:

$$\widehat{\text{weight}} = -240.5 + 5.5 \, \text{height} = -240.5 + 5.5 \cdot 63 = 106$$

The estimated average weight of swimmers with height equal to 63 inches is 106 pounds.
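
Parts c) and d) can be checked together. The sketch below fits the line from the raw data and evaluates it at 63 inches. Variable names are illustrative assumptions.

```python
# Data from problem 3: heights in inches, weights in pounds.
height = [68, 64, 62, 65, 66]
weight = [132, 108, 102, 115, 128]

n = len(height)
x_bar = sum(height) / n
y_bar = sum(weight) / n

# The n - 1 divisors of the covariance and variance cancel in the ratio.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))
         / sum((x - x_bar) ** 2 for x in height))
intercept = y_bar - slope * x_bar

predicted_weight = intercept + slope * 63
print(intercept, slope, predicted_weight)
```

Since 5.5 ∙ 63 = 346.5, the prediction is -240.5 + 346.5 = 106 pounds.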

4.

a) The population equation is

$$\text{score} = \beta_0 + \beta_1 \, \text{price} + u$$

We can compute the OLS estimators:

$$\hat{\beta}_1 = \frac{Cov(x, y)}{V(x)} = \frac{3871.1}{304888.9} = 0.012697 \approx 0.013$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 49.6 - 0.012697 \cdot 2960 = 12.02$$

The estimated OLS regression is:

$$\widehat{\text{score}} = 12.02 + 0.013 \, \text{price}$$

If the price is 0 the estimated average score is 12.02 (no meaningful interpretation). If the
price increases by $1, the estimated average score increases by 0.013.

b) The $R^2$ is defined as

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

where

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
$$SSR = \sum_{i=1}^{n} \hat{u}_i^2$$

Using the information of the following table

Brand       Price (x)   Evaluation score (y)   $(y_i-\bar{y})^2$   Fitted ($\hat{y}_i$)   Residual ($\hat{u}_i$)   Residual sq. ($\hat{u}_i^2$)
Dell        2800        62                     153.76              48.42                  13.58                    184.4
Hisense     2800        53                     11.56               48.42                  4.58                     21.0
Hitachi     2700        44                     31.36               47.12                  -3.12                    9.7
JVC         3500        50                     0.16                57.52                  -7.52                    56.6
LG          3300        54                     19.36               54.92                  -0.92                    0.8
Maxent      2000        39                     112.36              38.02                  0.98                     1.0
Panasonic   4000        66                     268.96              64.02                  1.98                     3.9
Phillips    3000        55                     29.16               51.02                  3.98                     15.8
Proview     2500        34                     243.36              44.52                  -10.52                   110.7
Samsung     3000        39                     112.36              51.02                  -12.02                   144.5
Sum                                            982.4                                                               548.4

We have

$$R^2 = 1 - \frac{548.4}{982.4} = 0.44$$

The regression explains 44% of the variance of the score; the remaining 56% is explained by other
factors. Although the $R^2$ is reasonable, the regression does not provide a very good fit.

c) We can use

$$\widehat{\text{score}} = 12.02 + 0.013 \, \text{price}$$

to get

$$\widehat{\text{score}} = 12.02 + 0.013 \cdot 3200 = 53.62$$
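
The whole of problem 4 can be cross-checked from the raw data. The sketch below uses unrounded coefficients, so its results differ slightly from the hand calculation, which rounds the slope to 0.013: the R2 comes out near 0.45 and the prediction at $3,200 near 52.6. Variable names are illustrative assumptions.

```python
# Data from problem 4: price (x) and evaluation score (y).
price = [2800, 2800, 2700, 3500, 3300, 2000, 4000, 3000, 2500, 3000]
score = [62, 53, 44, 50, 54, 39, 66, 55, 34, 39]

n = len(price)
x_bar = sum(price) / n
y_bar = sum(score) / n

slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(price, score))
         / sum((x - x_bar) ** 2 for x in price))
intercept = y_bar - slope * x_bar

# Goodness of fit: R^2 = 1 - SSR / SST.
sst = sum((y - y_bar) ** 2 for y in score)
ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(price, score))
r2 = 1 - ssr / sst

prediction = intercept + slope * 3200
print(round(intercept, 2), round(slope, 4), round(r2, 2), round(prediction, 1))
```

The gap between 52.6 and 53.62 shows how much a slope rounded to two significant digits matters when it multiplies prices in the thousands.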
