
BES Tutorial Sample Solutions, S1/13

WEEK 12 TUTORIAL EXERCISES (To be discussed in the week starting May 27)

1. Recall the Anzac Garage data (ANZACG.XLS) used in Weeks 3, 8 and 10.
In Week 3 we considered the simple linear regression model given by:

\[ price = \beta_0 + \beta_1 \, age + u \]

where price = used car price in dollars and age = age of the car in years.
The EXCEL results obtained using Ordinary Least Squares are
presented below:

Regression Statistics
R2                  0.077
Standard Error      42069
Observations        117

             Coefficients   Standard Error   t Stat   p-value
Intercept    47469          6748             7.035    0.000
Age          -2658          856              -3.106   0.002

(a) Interpret the t Stat and the p-values in the EXCEL output.
What do you need to assume?
The t Stat and p-values in the EXCEL output come from two-tailed tests of the
null hypothesis that the associated population parameter equals 0. Hence,
larger absolute t statistics and lower p-values provide stronger evidence that
the associated population parameter is nonzero. Here, the p-values for both the
intercept and the Age coefficient are below 1%, so both estimates are
statistically significant: we reject the null hypothesis that either
population parameter is zero at the 1% level.

We need to assume the disturbances are normally distributed or, because the
sample size is large, invoke the CLT.
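
The t Stat and p-value columns can be reproduced from the coefficient and standard error columns. The following is a minimal sketch, assuming the large-sample normal approximation mentioned above (EXCEL itself uses the t distribution) and taking the Age coefficient as -2658, the sign implied by the prediction in part (e). The function names are illustrative only.

```python
import math

def normal_cdf(z):
    # Standard normal CDF written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_test(coef, se):
    # t ratio for H0: parameter = 0, with a two-tailed normal p-value.
    t = coef / se
    p = 2.0 * (1.0 - normal_cdf(abs(t)))
    return t, p

# Figures from the EXCEL output; the minus sign on Age is an assumption
# taken from the prediction arithmetic later in the question.
t_int, p_int = two_tailed_test(47469, 6748)
t_age, p_age = two_tailed_test(-2658, 856)

print(f"Intercept: t = {t_int:.3f}, p = {p_int:.4f}")
print(f"Age:       t = {t_age:.3f}, p = {p_age:.4f}")
```

Because the reported coefficients are rounded, the recomputed ratios may differ from the printed t Stats in the last decimal place.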

(b) Calculate a 95% confidence interval for the coefficient on age.

The standard normal critical value is 1.96, hence the 95% confidence interval is:

-2658 ± 1.96 × 856 = -2658 ± 1678 = (-4336, -980)
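
The interval arithmetic can be checked with a short sketch. The helper name `conf_interval` is hypothetical, and the Age coefficient is entered as -2658.

```python
def conf_interval(coef, se, crit=1.96):
    # Symmetric interval: estimate plus or minus critical value times standard error.
    margin = crit * se
    return coef - margin, coef + margin

lower, upper = conf_interval(-2658, 856)
print(round(lower), round(upper))
```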



(c) Interpret the R2 value.

The regression model including age explains 7.7% of the variation in used car
prices.

(d) Test whether the estimated coefficient of Age is significantly less
than zero at the 5% level of significance.

Unlike in (a), this is a one-tailed test:

H0: β1 = 0; H1: β1 < 0
Decision rule: Reject H0 if b1/se(b1) < -1.645
Test statistic: b1/se(b1) = -3.106 < -1.645 and hence reject H0
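
As a rough check on the one-tailed decision, the sketch below compares the reported t Stat (taken as -3.106) with the lower-tail critical value and also computes a lower-tail p-value under the normal approximation. Variable names are illustrative.

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

t_stat = -3.106   # reported t Stat for Age, with the negative sign assumed
crit = -1.645     # 5% lower-tail standard normal critical value

reject_h0 = t_stat < crit
p_lower_tail = normal_cdf(t_stat)  # P(Z <= t_stat) under H0

print(reject_h0, round(p_lower_tail, 4))
```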

(e) Estimate a 95% confidence interval for the mean price for a
secondhand passenger car that is 10 years old and interpret the
result? Note: the sample mean of age is 6.44 years.

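A 10-year-old car is expected to be valued at $47469 - 10 × 2658 = $20889.
The bounds of the confidence interval for this mean prediction are given by:

\[ \hat{y} \pm t_{\alpha/2,\,n-2} \; s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} \]

where x_g = 10 is the given age, s = 42069, se(b1) = 856 and hence

\[ \sum_i (x_i - \bar{x})^2 = \left( \frac{42069}{856} \right)^2 \approx 2415 \]

Hence:

\[ 20889 \pm 1.98 \times 42069 \times \sqrt{\frac{1}{117} + \frac{(10 - 6.44)^2}{2415}} = 20889 \pm 9783 \]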

We are 95% confident that the mean price of 10-year-old cars lies between
$11,106 and $30,672. While the impact of age on price is precisely
estimated, the CI is quite wide because of the large amount of
unexplained variation indicated by the very low R2 value reported.
(Note: use of normal critical values here would be acceptable given the
large sample size and would make little practical difference, as the
critical value would be 1.96 rather than 1.98.)
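
The interval for the mean price can be recomputed from the summary figures alone. This is a sketch under the assumption that the sum of squared deviations of age is recovered as (s / se(b1))^2, which follows from se(b1) = s / sqrt(sum of squared deviations). Variable names are illustrative.

```python
import math

b0, b1 = 47469, -2658            # OLS estimates (Age sign assumed negative)
s, se_b1 = 42069, 856            # regression standard error and se(b1)
n, x_bar, x_g = 117, 6.44, 10    # sample size, mean age, age of interest
t_crit = 1.98                    # t critical value used in the solution

y_hat = b0 + b1 * x_g
sxx = (s / se_b1) ** 2           # sum of squared deviations of age, about 2415

# Standard error of the estimated mean price at age x_g.
se_mean = s * math.sqrt(1 / n + (x_g - x_bar) ** 2 / sxx)
margin = t_crit * se_mean

lower, upper = y_hat - margin, y_hat + margin
print(y_hat, round(margin), round(lower), round(upper))
```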

Anzac Garage's pricing scheme based on the age of the car is not
working out very well. When its second-hand cars are compared with
cars of the same age from other dealers, prices often diverge. One of
their consultants noted that the value of a second-hand car should
depend on both the Odometer reading and the Age of the vehicle.
This consultant wanted to estimate the following two simple linear
regression models separately:

\[ price = \beta_0 + \beta_1 \, age + u \]
\[ price = \gamma_0 + \gamma_1 \, odometer + v \]

where Odometer = distance the car has travelled since leaving the factory
in kilometers. A senior consultant advised use of a multiple linear
regression model instead:

\[ price = \beta_0 + \beta_1 \, odometer + \beta_2 \, age + u \]
(f) Discuss why the simple linear regression methods may not be
preferable to the multiple regression method, in general, and in
the context of this problem. The resultant OLS estimates for the
multiple regression model are given below.

The predictive performance of the model will improve as relevant variables are
added to a simple regression model.

Also the assumption that the disturbance is uncorrelated with the explanatory
variables is critical for the unbiased estimation of coefficients of included
variables. In the simple price on age regression it will be violated if variables
affecting price and correlated with age have been omitted from the model.
This is likely to be the case here with distance the car has traveled.
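
The omitted-variable argument can be illustrated with a small simulation. This is a sketch only: the data-generating process below is entirely hypothetical (it is not the Anzac Garage data), and it assumes price depends on odometer alone while odometer rises with age. The simple regression of price on age then attributes the odometer effect to age, whereas the two-regressor OLS estimate of the age coefficient is close to zero.

```python
import random

random.seed(0)
n = 5000

# Hypothetical data-generating process (all numbers invented for illustration).
age = [random.uniform(0, 15) for _ in range(n)]
odometer = [15000 * a + random.gauss(0, 20000) for a in age]           # km rises with age
price = [50000 - 0.2 * km + random.gauss(0, 5000) for km in odometer]  # true age effect is 0

def mean(v):
    return sum(v) / len(v)

def sxy(x, y):
    # Sum of cross-deviations about the means.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

# Simple regression of price on age: the slope absorbs the omitted odometer effect.
slope_simple = sxy(age, price) / sxy(age, age)

# Two-regressor OLS slopes from the normal equations.
s_aa, s_oo, s_ao = sxy(age, age), sxy(odometer, odometer), sxy(age, odometer)
s_ap, s_op = sxy(age, price), sxy(odometer, price)
det = s_aa * s_oo - s_ao ** 2
b_age = (s_oo * s_ap - s_ao * s_op) / det
b_odo = (s_aa * s_op - s_ao * s_ap) / det

print(f"simple slope on age:  {slope_simple:.1f}")
print(f"multiple: age {b_age:.1f}, odometer {b_odo:.3f}")
```

In this setup the simple slope is expected to be near -0.2 × 15000 = -3000 dollars per year even though age has no direct effect on price.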


We see the R2 has improved (approximately doubled) with the addition of
odometer, and the coefficient on age is now much smaller in magnitude and
statistically insignificant.

SUMMARY OUTPUT

Regression Statistics
R Square            0.150
Standard Error      40568
Observations        117

                Coefficients   Standard Error   t Stat   P-value
Intercept       53867          6825             7.893    0.000
Odometer (km)   0.270          0.087            3.110    0.002
Age             360            1108             0.325    0.746


2. Computing Exercise #4
Refer to the computing program and Discussion Question 4.3 on
multiple regression.

After estimating three import equations, the first two being simple
linear regressions and the third a multiple regression containing GNE
and relative prices as explanatory variables, you were asked the
following discussion question:

Are the coefficients β1 and β2 statistically different from zero at the 5%
level? Of the three regression equations you estimated, which one
provides a better explanation of the level of imports?

The p-values for β1 and β2 are both <0.0005 and hence at all conventional
significance levels one would reject the null hypotheses that these coefficients
are individually equal to zero.

We could interpret "better" in a number of ways. In terms of fit, the third
regression is best by adjusted R2: 0.9713 compared to 0.9457 and
0.3167 in the two simple regression models. (Notice that the multiple regression
model will always do at least as well as the two simple regression models in
terms of R2 but may not in terms of adjusted R2.)
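
The adjusted R2 penalises extra regressors, which is why it can fall when a variable is added even though R2 cannot. A minimal sketch of the standard formula, 1 - (1 - R2)(n - 1)/(n - k - 1), applied to the multiple regression figures reported in the summary output (R Square 0.9736, 26 observations, 2 explanatory variables):

```python
def adjusted_r2(r2, n, k):
    # n = number of observations, k = number of explanatory variables
    # (excluding the intercept).
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.9736, n=26, k=2)
print(round(adj, 4))
```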


In addition though, you could argue that the multiple regression model is
better because it guards against the omitted variable bias that is likely in the
two simple linear regression models.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9867
R Square            0.9736
Adjusted R Square   0.9713
Standard Error      3140.3680
Observations        26

            Coefficients   Standard Error   t Stat   P-value
Intercept   16101.329      10822.442        1.488    0.150
GNE         0.249          0.011            23.406   0.000
Price       -38978.894     8255.354         -4.722   0.000



3. SIA: Sydney housing prices.
Recall the housing price data for Sydney suburbs used in Question 6 in
Week 3. Your statistically naive friend has been doing some analysis of
Sydney housing prices using these data and has asked you for help. In
addition to the price data there are a number of characteristics
associated with the suburb that have been collected and are likely to
explain some of the large variation in housing prices across suburbs
that are observed in the data. Your friend was very interested in the
impact on housing prices of being located under the flight path. The
regression of housing price on the flightpath variable (Model 1)
provided a result that he did not expect. On your advice he ran a
second regression (Model 2) that included several extra explanatory
variables. Results for Model 1 and Model 2 are presented in the table,
together with a full description of variables used in the analysis.

Housing price is the mean of the median price of houses sold in each
suburb for two quarters (September and December 2002) measured
in thousands of dollars;
Distance to CBD is distance measured in kilometers of the suburb from
Sydney's CBD;
Distance to Airport is distance measured in kilometers of the suburb
from Sydney Airport;
Distance to beach is distance of the suburb measured in kilometers
from the nearest beach;
Flightpath is a dummy variable that equals 1 if the suburb is under the
flight path and equal to 0 otherwise.

(a) How would you interpret the regression estimates for the
parameters in Model 1 and explain why your friend found the
result to be unexpected?
Because the estimate of β1 is positive, houses under the flightpath
on average sell for more ($216,200 more) than houses not under the
flightpath. This is surprising because you would expect the aircraft noise
associated with being under the flightpath to be unattractive and hence
lead to lower, not higher, prices.

(b) Explain why the results in Model 1 are unreliable as a basis for
determining the impact on housing prices of being located under
the flight path. Which of the assumptions associated with simple
linear regression has clearly been violated in Model 1?

You would like to make the statement about the impact of being under the
flightpath holding other factors constant. This is not possible with Model 1
as it is a simple linear regression and hence there is potential for omitted
(confounding) variables that lead to biased estimates of the impact of being
situated under the flightpath.

For example, proximity to the beach is likely to affect housing prices and
be correlated with being under the flightpath. In Model 1, the variable
Distance to beach is in the disturbance term and hence leads to a violation of
the assumption that E(u|X) = 0.

(c) Write a brief description of the results for Flightpath in Model 2 in
terms of the parameter estimate, its interpretation and its
statistical significance.


The estimated parameter indicates a $51,500 premium (much smaller than
for Model 1) for suburbs under the flightpath relative to those not under it,
holding other factors constant.

For statistical significance:
H0: βi = 0 versus H1: βi ≠ 0 where βi is the ith regression coefficient
Because we have a large sample size we can invoke the CLT and use standard
normal critical values when evaluating the test statistics given by bi/se(bi).

If we choose α = 0.05 then the decision rule will be to reject H0 if
|bi/se(bi)| > 1.96.

The test statistic for flightpath (51.5/50.2 = 1.03) indicates that this
parameter is not statistically different from zero.
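
The same decision rule can be applied to every coefficient in Model 2. The sketch below uses the estimates and standard errors from the results table. The negative signs on Distance to CBD and Distance to beach are an assumption inferred from the prediction arithmetic in part (e), and the dictionary layout is illustrative.

```python
# (estimate, standard error) pairs for Model 2; signs on the CBD and beach
# coefficients are assumed from the prediction in part (e).
model2 = {
    "Intercept": (853.5, 35.5),
    "Flightpath": (51.5, 50.2),
    "Distance to CBD": (-21.5, 3.4),
    "Distance to Airport": (21.0, 2.9),
    "Distance to beach": (-13.9, 2.3),
}

CRIT = 1.96  # two-tailed 5% standard normal critical value

results = {}
for name, (est, se) in model2.items():
    t = est / se
    results[name] = (t, abs(t) > CRIT)  # (t ratio, reject H0?)

for name, (t, reject) in results.items():
    print(f"{name:20s} t = {t:7.2f}  reject H0: {reject}")
```

On these figures Flightpath is the only coefficient whose t ratio falls inside the non-rejection region.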

(d) Interpret the overall fit of Model 2.

Model 2 produces an R2 of 0.372: 37.2% of the variation in Sydney housing
prices is explained by the explanatory variables in the regression.

(e) Use Model 2 to predict the average housing price for the suburb of
Randwick which is 5.21 kms from the CBD, 1.78 kms from the
beach, 6.62 kms from the airport and is not deemed to be under
the flight path.

Prediction = 853.5 + 0 - 21.5 × 5.21 + 21 × 6.62 - 13.9 × 1.78
= 855.763

The predicted average house price for Randwick is $855,763.
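
The prediction is a sum of coefficient times characteristic. A minimal sketch, with hypothetical key names and with the signs of the CBD and beach coefficients taken as negative so that the arithmetic reproduces the stated result:

```python
# Model 2 estimates, in thousands of dollars (key names are illustrative).
coefs = {
    "intercept": 853.5,
    "flightpath": 51.5,
    "dist_cbd": -21.5,
    "dist_airport": 21.0,
    "dist_beach": -13.9,
}

# Randwick's characteristics as given in the question.
randwick = {
    "flightpath": 0,
    "dist_cbd": 5.21,
    "dist_airport": 6.62,
    "dist_beach": 1.78,
}

prediction = coefs["intercept"] + sum(coefs[k] * v for k, v in randwick.items())
print(round(prediction, 3))  # in thousands of dollars
```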



Multiple regression results for Sydney housing prices*

Dependent variable: Housing price

Explanatory variables    Model 1    Model 2
Intercept                569.9      853.5
                         (20.6)     (35.5)
Flightpath               216.2      51.5
                         (56.0)     (50.2)
Distance to CBD                     -21.5
                                    (3.4)
Distance to Airport                 21.0
                                    (2.9)
Distance to beach                   -13.9
                                    (2.3)
Observations             503        503
R squared                0.029      0.372
* Numbers in brackets below coefficient estimates are standard errors.
