
# University of Hong Kong

## Introductory Econometrics (ECON0701), Fall 2010

17 September 2010

## Multiple Regression Analysis: Estimation

• Remember from last time, the four basic assumptions of the multiple regression
model.
• The model in the population must be linear in parameters:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$

• The data set must be a random sample from the general population.

• The error term must have zero conditional mean:

E u | x1 , x2 ,..., xk   0

## Multiple Regression Analysis: Estimation

• There cannot be any perfect collinearity among the independent variables.
• When these are all true, the estimates of $\beta_0, \beta_1, \ldots, \beta_k$ are unbiased.
• We also discussed the implications of omitting an important variable from the multiple
regression analysis.
• An “important variable” is one that is (a) correlated with the other x variables in the
model and (b) related to the y variable, after the other x variables are taken into
account.
## Multiple Regression Analysis: Estimation
• We will now explore what happens when we mis-specify the regression model.
• The misspecification is leaving out a variable that should be there.
• The result will be that the estimator for the coefficient on the remaining variable will
be biased.

## Multiple Regression Analysis: Estimation

• For example, suppose that the true relationship is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$$

• But instead of including x2 as a variable in our regression model, as we should, we
instead estimate

$$y = \beta_0 + \beta_1 x_1 + u$$

## Multiple Regression Analysis: Estimation

• What happens to the estimator of $\beta_1$?

$$\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)\, y_i}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2} = \frac{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i\right)}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2}$$

$$\tilde{\beta}_1 = \frac{\beta_0 \sum_{i=1}^{n} (x_{i1} - \bar{x}_1) + \beta_1 \sum_{i=1}^{n} x_{i1}(x_{i1} - \bar{x}_1) + \beta_2 \sum_{i=1}^{n} x_{i2}(x_{i1} - \bar{x}_1) + \sum_{i=1}^{n} u_i (x_{i1} - \bar{x}_1)}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2}$$
## Multiple Regression Analysis: Estimation

• Taking the expectation, this simplifies to

$$E(\tilde{\beta}_1) = \beta_1 + \beta_2 \frac{\widehat{\mathrm{Cov}}(x_1, x_2)}{\widehat{\mathrm{Var}}(x_1)}$$

• This means that if you omit x2, and x2 is correlated with x1, your estimate of $\beta_1$ will be
biased.
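• The bias formula above can be checked with a quick simulation. This is a sketch, not part of the original lecture; the parameter values ($\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 3$, Cov(x₁, x₂) = 0.5) are made up for illustration.

```python
import random

random.seed(1)
n = 100_000
b0, b1, b2 = 1.0, 2.0, 3.0

# x2 is built to be positively correlated with x1: Cov(x1, x2) = 0.5, Var(x1) = 1.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
u = [random.gauss(0, 1) for _ in range(n)]
y = [b0 + b1 * a + b2 * b + e for a, b, e in zip(x1, x2, u)]

def slope(x, z):
    """OLS slope from a simple regression of z on x."""
    mx = sum(x) / len(x)
    mz = sum(z) / len(z)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxz / sxx

b1_short = slope(x1, y)   # "short" regression: y on x1 only, omitting x2
delta = slope(x1, x2)     # sample Cov(x1, x2) / Var(x1), about 0.5 here
print(b1_short)           # close to b1 + b2 * delta = 2 + 3 * 0.5 = 3.5, not to b1 = 2
```

The short regression recovers $\beta_1 + \beta_2 \cdot \widehat{\mathrm{Cov}}(x_1,x_2)/\widehat{\mathrm{Var}}(x_1)$, exactly the biased expectation derived above.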

## Multiple Regression Analysis: Estimation

| Bias in $\tilde{\beta}_1$ from omitting x₂ | Corr(x₁, x₂) > 0 | Corr(x₁, x₂) < 0 |
|---|---|---|
| $\beta_2 > 0$ | positive bias | negative bias |
| $\beta_2 < 0$ | negative bias | positive bias |

## Multiple Regression Analysis: Estimation

• For example, it is probably the case that one’s wages depend both on one’s education
and one’s ability:

$$wages = \beta_0 + \beta_1\, education + \beta_2\, ability + u$$

• But often we do not have data on someone’s “ability.”

## Multiple Regression Analysis: Estimation

• Instead, we might estimate the model

$$wages = \beta_0 + \beta_1\, education + u$$
• using the data we have available.
• Probably, since ability and education are positively correlated, and ability has a positive
effect on wages, the estimate of $\beta_1$ will be positively biased.

## Multiple Regression Analysis: Estimation
• As an example, we will look at the relationship between a child’s educational
attainment, his or her IQ, and his or her mother’s educational attainment.

$$education = \beta_0 + \beta_1\, Meducation + \beta_2\, IQ + u$$
• Question: what happens to the estimate of $\beta_1$ if the second variable (IQ) is omitted
from the equation?

## Multiple Regression Analysis: Estimation

• Fact 1: the correlation between IQ and educational attainment is probably positive.
Therefore $\beta_2$ is probably positive.
• Fact 2: The correlation between mother’s education and child’s IQ is also probably
positive.
• Therefore, omitting child’s IQ means that we expect the estimate of $\beta_1$ to be positively
biased.

```
. d

Contains data from wage2.dta
  obs:           935
 vars:            17                          14 Apr 1999 13:41
 size:        24,310 (97.2% of memory free)
---------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
IQ              int    %9.0g                  IQ score
educ            byte   %9.0g                  years of education
meduc           byte   %9.0g                  mother's education
---------------------------------------------------------------------------
```
## Multiple Regression Analysis: Estimation

```
. summarize IQ, detail

                          IQ score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           64             50
 5%           74             54
10%           82             55       Obs                 935
25%           92             59       Sum of Wgt.         935

50%          102                      Mean           101.2824
                      Largest        Std. Dev.      15.05264
75%          112            134
90%          120            134       Variance       226.5819
95%          125            137       Skewness      -.3404246
99%          132            145       Kurtosis       2.977035
```

## Multiple Regression Analysis: Estimation

| IQ Range | Description |
|---|---|
| 36–50 | Moderately Retarded |
| 51–70 | Mildly Retarded |
| 70–90 | Slow Learner |
| 90–110 | Average |
| 110–120 | Superior |
| 120–140 | Very Superior |
| 140–180 | Gifted |

## Multiple Regression Analysis: Estimation

• Now let’s look at the estimated multiple regression model:

```
      Source |       SS       df       MS              Number of obs =     857
-------------+------------------------------           F(  2,   854) =  191.07
       Model |  1277.79546     2  638.897729           Prob > F      =  0.0000
    Residual |  2855.60011   854  3.34379404           R-squared     =  0.3091
-------------+------------------------------           Adj R-squared =  0.3075
       Total |  4133.39557   856  4.82873314           Root MSE      =  1.8286

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meduc |   .1669298   .0232489     7.18   0.000      .121298    .2125615
          IQ |   .0651911   .0044139    14.77   0.000     .0565278    .0738544
       _cons |   5.155378   .4398148    11.72   0.000     4.292133    6.018622
------------------------------------------------------------------------------
```
## Multiple Regression Analysis: Estimation

• Note that $\hat{\beta}_2$ (the coefficient on IQ) is positive, as we hypothesized in Fact 1.
• The next step is to see if mother’s education is positively correlated with children’s IQ (for
Fact 2).

```
. correlate IQ meduc
(obs=857)

             |       IQ    meduc
-------------+------------------
          IQ |   1.0000
       meduc |   0.3318   1.0000
```

## Multiple Regression Analysis: Estimation

```
. regress educ meduc

      Source |       SS       df       MS              Number of obs =     857
-------------+------------------------------           F(  1,   855) =  130.78
       Model |  548.378173     1  548.378173           Prob > F      =  0.0000
    Residual |  3585.01739   855   4.1930028           R-squared     =  0.1327
-------------+------------------------------           Adj R-squared =  0.1317
       Total |  4133.39557   856  4.82873314           Root MSE      =  2.0477

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meduc |   .2808636   .0245594    11.44   0.000     .2326598    .3290674
       _cons |   10.57491    .271523    38.95   0.000     10.04198    11.10783
------------------------------------------------------------------------------
```

• Note that without IQ, the estimate of the coefficient on mother’s education is upwardly
biased – 0.28 instead of 0.17.

## Multiple Regression Analysis: Estimation

• Now we will add a fifth assumption to the four assumptions we have discussed so far
(linear in parameters, random sampling, zero conditional mean, no multicollinearity).
• When assuming conditional homoskedasticity, the variance of the estimators of $\beta_0, \beta_1, \ldots, \beta_k$ can be calculated simply.
## Multiple Regression Analysis: Estimation

• Remember the linear regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$

• The conditional homoskedasticity assumption states that

$$\mathrm{Var}(u \mid x_1, x_2, \ldots, x_k) = \sigma^2$$

## Multiple Regression Analysis: Estimation

• When these five assumptions (the four basic assumptions plus conditional
homoskedasticity) are true, the variance of the estimator of any parameter $\beta_j$ is

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\left( \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \right) \left(1 - R_j^2\right)}$$

## Multiple Regression Analysis: Estimation

• As in the simple regression case, $\sigma^2$ cannot be measured directly. It must be
estimated:

$$\hat{\sigma}^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} \hat{u}_i^2$$

• The standard error of $\hat{\beta}_j$ is defined as

$$\mathrm{Se}(\hat{\beta}_j) = \sqrt{\frac{\hat{\sigma}^2}{\left( \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \right) \left(1 - R_j^2\right)}}$$
## Multiple Regression Analysis: Estimation

• The $R_j^2$ term refers to the R-squared value of a regression of $x_j$ on the other x variables.
• You can see why perfect multicollinearity is a problem: when $x_j$ is perfectly correlated with
the other x variables, this R-squared is one, and dividing by $(1 - R_j^2)$ means dividing by
zero.
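• The variance formula can also be illustrated numerically. With a single regressor, $R_j^2 = 0$ and the formula reduces to $\sigma^2 / \sum_i (x_i - \bar{x})^2$. The sketch below (hypothetical numbers, not the lecture data) compares that value with the Monte Carlo variance of the OLS slope across many simulated samples:

```python
import random
import statistics

random.seed(2)
sigma = 2.0
n = 50
x = [random.gauss(0, 1) for _ in range(n)]   # design held fixed across replications
mx = sum(x) / n
sxx = sum((a - mx) ** 2 for a in x)

def ols_slope(y):
    """OLS slope of y on the fixed regressor x."""
    my = sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

# Draw fresh errors each replication and re-estimate the slope.
slopes = []
for _ in range(20_000):
    y = [1.0 + 0.5 * a + random.gauss(0, sigma) for a in x]
    slopes.append(ols_slope(y))

theory_var = sigma ** 2 / sxx           # the variance formula with R_j^2 = 0
mc_var = statistics.variance(slopes)    # the simulated sampling variance
print(theory_var, mc_var)               # the two should be close
```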

## Multiple Regression Analysis: Estimation

• The second point is that adding variables to the model tends to increase the variance of
the estimator.
• This is because additional variables can only increase (never decrease) $R_j^2$, which
increases the variance.

## Multiple Regression Analysis: Estimation

• As an example we will explore the determinants of extra-marital affairs.
• The research questions are:
– Are older people more likely to have extra-marital affairs?
– Are people who have been married longer more likely to have extra-marital affairs?

```
. d

Contains data from D:\Econometrics\Statafiles\affairs.dta
  obs:           601
 vars:            19                          22 May 2002 11:49
 size:        18,030 (97.8% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
age             float  %9.0g                  in years
yrsmarr         float  %9.0g                  years married
naffairs        byte   %9.0g                  number of affairs within last
                                                year
-------------------------------------------------------------------------------
Sorted by:  id
```
## Multiple Regression Analysis: Estimation

```
. summarize age, detail

                          in years
-------------------------------------------------------------
      Percentiles      Smallest
 1%           22           17.5
 5%           22           17.5
10%           22           17.5       Obs                 601
25%           27           17.5       Sum of Wgt.         601

50%           32                      Mean           32.48752
                      Largest        Std. Dev.      9.288762
75%           37             57
90%           47             57       Variance       86.28109
95%           52             57       Skewness       .8869999
99%           57             57       Kurtosis       3.220077
```

## Multiple Regression Analysis: Estimation

```
. summarize yrsmar, detail

                       years married
-------------------------------------------------------------
      Percentiles      Smallest
 1%         .125           .125
 5%          .75           .125
10%          1.5           .125       Obs                 601
25%            4           .125       Sum of Wgt.         601

50%            7                      Mean           8.177696
                      Largest        Std. Dev.      5.571303
75%           15             15
90%           15             15       Variance       31.03942
95%           15             15       Skewness       .0779935
99%           15             15       Kurtosis       1.432516
```

## Multiple Regression Analysis: Estimation

```
. tabulate naffairs

  number of |
    affairs |
within last |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        451       75.04       75.04
          1 |         34        5.66       80.70
          2 |         17        2.83       83.53
          3 |         19        3.16       86.69
          7 |         42        6.99       93.68
         12 |         38        6.32      100.00
------------+-----------------------------------
      Total |        601      100.00
```
## Multiple Regression Analysis: Estimation

• It is unfortunately the case, however, that age and years of marriage are highly
correlated:

```
. correlate age yrsmar
(obs=601)

             |      age  yrsmarr
-------------+------------------
         age |   1.0000
     yrsmarr |   0.7775   1.0000
```

## Multiple Regression Analysis: Estimation

• Regressing the number of affairs on age, it appears that older people are slightly more
likely to have affairs:

```
      Source |       SS       df       MS              Number of obs =     601
-------------+------------------------------           F(  1,   599) =    5.48
       Model |   59.219586     1   59.219586           Prob > F      =  0.0195
    Residual |  6469.86194   599  10.8011051           R-squared     =  0.0091
-------------+------------------------------           Adj R-squared =  0.0074
       Total |  6529.08153   600  10.8818026           Root MSE      =  3.2865

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .033822   .0144444     2.34   0.020     .0054541    .0621899
       _cons |    .357114   .4880375     0.73   0.465    -.6013585    1.315586
------------------------------------------------------------------------------
```

## Multiple Regression Analysis: Estimation

• People who have been married longer are more likely to have affairs, as well:

```
      Source |       SS       df       MS              Number of obs =     601
-------------+------------------------------           F(  1,   599) =   21.67
       Model |  227.929033     1  227.929033           Prob > F      =  0.0000
    Residual |   6301.1525   599  10.5194533           R-squared     =  0.0349
-------------+------------------------------           Adj R-squared =  0.0333
       Total |  6529.08153   600  10.8818026           Root MSE      =  3.2434

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     yrsmarr |   .1106286   .0237664     4.65   0.000     .0639529    .1573043
       _cons |   .5512198   .2351106     2.34   0.019     .0894785    1.012961
------------------------------------------------------------------------------
```
## Multiple Regression Analysis: Estimation

• However, when both are put together, the coefficient on age is inconclusive – its sign
changes. This suggests that there is a high variance in the estimate of this coefficient.

```
. regress naffairs age yrsmar

      Source |       SS       df       MS              Number of obs =     601
-------------+------------------------------           F(  2,   598) =   12.86
       Model |  269.275536     2  134.637768           Prob > F      =  0.0000
    Residual |  6259.80599   598   10.467903           R-squared     =  0.0412
-------------+------------------------------           Adj R-squared =  0.0380
       Total |  6529.08153   600  10.8818026           Root MSE      =  3.2354

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0449423   .0226134    -1.99   0.047    -.0893536    -.000531
     yrsmarr |   .1688902   .0377022     4.48   0.000     .0948454     .242935
       _cons |   1.534838   .5476808     2.80   0.005     .4592266     2.61045
------------------------------------------------------------------------------
```

Break
## Multiple Regression Analysis: Estimation

• Last time, we examined some consequences of misspecifying a multiple regression
model.
• In particular, we examined the implications of omitting an “important variable” from
the regression.
• x2 is an “important variable” if (a) it is correlated with x1 and (b) $\beta_2$ is something other
than zero.
• In this case, the sign of the omitted variable bias is determined by the signs of (a) and
(b).
• So if the correlation is negative, and $\beta_2$ is positive, the estimate of $\beta_1$ will be negatively
biased.

## Multiple Regression Analysis: Estimation

• We also examined the sampling variance of the OLS estimator in the multiple
regression case.
• In general, an estimate of the parameter $\beta_k$ will be more precise if:
– $x_k$ has a high variance
– The number of data points (n) is high
– $x_k$ is less correlated with the other x variables (i.e., $R_k^2$ is low).

## Multiple Regression Analysis: Estimation
• When the five assumptions are true (linear form, random sampling, zero conditional
mean, no multicollinearity, conditional homoskedasticity) the ordinary least squares
estimator is BLUE.
• BLUE = Best Linear Unbiased Estimator
• “Best” means that the OLS estimator has the minimum possible variance of any linear
estimator.

## Multiple Regression Analysis: Estimation

• This result is known as the Gauss-Markov Theorem.
• If conditional homoskedasticity is violated, the parameter estimates are still unbiased,
but the OLS estimator isn’t the most efficient any more.
• The practice of altering the estimator to be more efficient when heteroskedasticity is
present is known as weighted least squares.
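• As a rough illustration of the weighted least squares idea (a sketch, not from the lecture): if the error variance is known up to scale, say $\mathrm{Var}(u \mid x) = \sigma^2 h(x)$, dividing every variable by $\sqrt{h(x)}$ makes the transformed errors homoskedastic, and OLS on the transformed data is the WLS estimator. The variance function $h(x) = x$ and all parameter values below are made-up assumptions.

```python
import math
import random

random.seed(5)
n = 5000
x = [random.uniform(1.0, 5.0) for _ in range(n)]
# Heteroskedastic model: y = 2 + 1*x + u, with Var(u|x) = x (assumed known form).
y = [2.0 + 1.0 * xi + random.gauss(0.0, math.sqrt(xi)) for xi in x]

# Transform by 1/sqrt(h(x)) with h(x) = x; the transformed errors have variance 1.
w = [1.0 / math.sqrt(xi) for xi in x]
ys = [yi * wi for yi, wi in zip(y, w)]
z0 = w[:]                                  # transformed "intercept" regressor
z1 = [xi * wi for xi, wi in zip(x, w)]     # transformed slope regressor

# OLS on the transformed data (no separate intercept): solve the 2x2 normal equations.
a11 = sum(v * v for v in z0)
a12 = sum(p * q for p, q in zip(z0, z1))
a22 = sum(v * v for v in z1)
c1 = sum(p * q for p, q in zip(z0, ys))
c2 = sum(p * q for p, q in zip(z1, ys))
det = a11 * a22 - a12 * a12
b0_wls = (c1 * a22 - a12 * c2) / det
b1_wls = (a11 * c2 - a12 * c1) / det
print(b0_wls, b1_wls)    # close to the true values 2.0 and 1.0
```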

## Multiple Regression Analysis: Estimation

• As a further example of multiple regression analysis we will look at characteristics of
law schools and the starting salaries of their graduates.
• Graduates from more highly ranked law schools often earn more than those from
lower ranked law schools.
• Is it because of the law school, or because better law schools admit better applicants in
the first place?

```
. d

Contains data from D:\Econometrics\Statafiles\LAWSCH85.DTA
  obs:           156
 vars:            21                          12 Mar 1999 15:06
 size:         7,800 (99.0% of memory free)
---------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
rank            int    %9.0g                  law school ranking
salary          float  %9.0g                  median starting salary
cost            int    %9.0g                  law school cost
LSAT            int    %9.0g                  median LSAT score
GPA             float  %9.0g                  median college GPA
libvol          int    %9.0g                  no. volumes in lib., 1000s
---------------------------------------------------------------------------
```
## Multiple Regression Analysis: Estimation
• The dependent variable will be the logarithm of median starting salary for new
graduates.
• The independent variables measuring the quality of incoming students will be LSAT
scores and college GPAs.
• The independent variables measuring the quality of the school will be the logarithm of
the number of books in the school’s library, the logarithm of the cost of attending the
school and the media ranking of the law school.

## Multiple Regression Analysis: Estimation

• To start, we will try to find the relationship between graduates’ starting salaries and
school quality variables without any measure of the quality of entering students.

## Multiple Regression Analysis: Estimation

```
. generate lsalary=ln(salary)

. generate lcost=ln(cost)

. generate llibvol=ln(libvol)

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =  211.04
       Model |   8.8224849     3   2.9408283           Prob > F      =  0.0000
    Residual |  1.90907945   137  .013934886           R-squared     =  0.8221
-------------+------------------------------           Adj R-squared =  0.8182
       Total |  10.7315643   140  .076654031           Root MSE      =  .11805

------------------------------------------------------------------------------
     lsalary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     llibvol |   .1291507   .0325187     3.97   0.000     .0648471    .1934543
       lcost |   .0265127   .0295489     0.90   0.371    -.0319182    .0849437
        rank |  -.0041712   .0002976   -14.02   0.000    -.0047596   -.0035829
       _cons |   9.880132   .3433113    28.78   0.000     9.201258    10.55901
------------------------------------------------------------------------------
```

## Multiple Regression Analysis: Estimation

• Now let’s try a more complete model, including measures of student quality.

$$\ln(salary) = \beta_0 + \beta_1 \ln(libvol) + \beta_2 \ln(cost) + \beta_3\, rank + \beta_4\, GPA + \beta_5\, LSAT + u$$

• We expect the coefficient $\beta_3$ to be negative, but the others should all be positive.

## Multiple Regression Analysis: Estimation
```
. regress lsalary llibvol lcost rank GPA LSAT

      Source |       SS       df       MS              Number of obs =     136
-------------+------------------------------           F(  5,   130) =  138.23
       Model |  8.73362207     5  1.74672441           Prob > F      =  0.0000
    Residual |  1.64272974   130  .012636383           R-squared     =  0.8417
-------------+------------------------------           Adj R-squared =  0.8356
       Total |  10.3763518   135  .076861865           Root MSE      =  .11241

------------------------------------------------------------------------------
     lsalary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     llibvol |   .0949932   .0332543     2.86   0.005     .0292035     .160783
       lcost |   .0375538   .0321061     1.17   0.244    -.0259642    .1010718
        rank |  -.0033246   .0003485    -9.54   0.000     -.004014   -.0026352
         GPA |   .2475239    .090037     2.75   0.007     .0693964    .4256514
        LSAT |   .0046965   .0040105     1.17   0.244    -.0032378    .0126308
       _cons |   8.343226   .5325192    15.67   0.000       7.2897    9.396752
------------------------------------------------------------------------------
```

## Multiple Regression Analysis: Inference

• So far we have discussed only the mean and variance of the parameter estimates in the
multiple regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$
• We have not said anything about the distribution of the OLS estimators other than that.

## Multiple Regression Analysis: Inference

• As a review, under the four basic assumptions of the linear regression model, the OLS
estimates of the parameters are unbiased:

$$E(\hat{\beta}_j) = \beta_j$$

## Multiple Regression Analysis: Inference

• When conditional homoskedasticity is assumed:
– The variance of the OLS estimators can be written as

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\left( \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \right) \left(1 - R_j^2\right)}$$

– Additionally, the OLS estimators are BLUE – that is, among all estimators that are
both linear and unbiased, they have the minimum sampling variance possible.
## Multiple Regression Analysis: Inference
• These five assumptions are collectively known as the Gauss-Markov assumptions.
• With a sixth assumption, the entire sampling distribution of the estimator can be
characterized.
• The assumption, known as the normality assumption, states that u is independent of the
explanatory variables and:

$$u \sim \mathrm{Normal}(0, \sigma^2)$$

## Multiple Regression Analysis: Inference

• The form of the multiple regression model under these six assumptions can be
summarized succinctly as

$$y \mid x_1, x_2, \ldots, x_k \sim \mathrm{Normal}\!\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,\ \sigma^2\right)$$

• where the true values of y are normal random variables with means equal to fitted values
constructed with the true parameters and variances all equal to $\sigma^2$.

## Multiple Regression Analysis: Inference

• These six assumptions together are known as the classical linear model assumptions.
• Collectively, they imply that the sampling distributions of the OLS estimators are
normal:

$$\hat{\beta}_j \sim \mathrm{Normal}\!\left(\beta_j,\ \mathrm{Var}(\hat{\beta}_j)\right)$$

## Multiple Regression Analysis: Inference

• As an example, consider if the error term u is independent of the x variables and takes
on the values -2, -1, 0, 1, and 2 with equal probability.
• This error term satisfies the Gauss-Markov assumptions.
• However, it violates the CLM assumptions.
## Multiple Regression Analysis: Inference

• When the estimator is normally distributed, it can also be standardized:

$$\hat{\beta}_j \sim \mathrm{Normal}\!\left(\beta_j,\ \mathrm{Var}(\hat{\beta}_j)\right) \;\Rightarrow\; \frac{\hat{\beta}_j - \beta_j}{\mathrm{Se}(\hat{\beta}_j)} \sim \mathrm{Normal}(0, 1)$$

## Multiple Regression Analysis: Inference

• When an estimate of the standard error must be used, as is almost always the case, the
standardization is

$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\mathrm{Se}}(\hat{\beta}_j)} \sim t_{n-k-1}$$
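• For a simple regression (k = 1), this t-statistic can be computed by hand from the formulas in this lecture. The sketch below uses simulated data with made-up parameter values:

```python
import math
import random

random.seed(3)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.3 * a + random.gauss(0, 1) for a in x]

# OLS estimates of intercept and slope.
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx

# Residual variance estimate with n - k - 1 = n - 2 degrees of freedom.
rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
sigma2_hat = rss / (n - 2)
se_b1 = math.sqrt(sigma2_hat / sxx)   # estimated standard error of the slope

t = (b1 - 0) / se_b1   # t-statistic for H0: beta1 = 0, distributed t_{n-2} under H0
print(b1, se_b1, t)
```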

## Multiple Regression Analysis: Inference

• This standardization is used in hypothesis testing.
• Example of a hypothesis test:
• Two candidates, A and B, are running in an election. The official results say that
candidate B won the election with 54% of the vote.
• Candidate A thinks the election is rigged and hires a polling agency to ask 100 people
how they voted. The polling agency does so and 53 of the people it polls say they
voted for candidate A.

## Multiple Regression Analysis: Inference

• There are two alternatives: the election results are accurate, or they aren’t.
• Suppose that $\pi$ represents the proportion of people that voted for candidate A (46%).
• An example of a null hypothesis is the hypothesis that the election results are
accurate:

$$H_0: \pi = 0.46$$

## Multiple Regression Analysis: Inference
• An example of an alternative hypothesis is that it is not:

$$H_A: \pi \neq 0.46$$

• Candidate A brings his poll results to the local magistrate, who devises a statistical test
to find out if candidate A has evidence beyond a reasonable doubt that the election was
rigged.

## Multiple Regression Analysis: Inference

• The standard for “beyond a reasonable doubt” most commonly used is the 5%
significance test.
• In other words, the test can at most have a 5% chance of rejecting the null hypothesis
(the results are accurate) if the null hypothesis is true.

## Multiple Regression Analysis: Inference

• Suppose that the hypothesis is true ($\pi = 0.46$). The next question is, what is the
sampling distribution of candidate A’s estimate if the hypothesis is TRUE?
• One way to find this out is through simulations.

[Figure: histogram of xb, the simulated number of poll respondents voting for A under the null hypothesis; the distribution is centered near 46.]
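• The simulation described above can be sketched in a few lines (a reconstruction under the null $\pi = 0.46$, not the original Stata code):

```python
import random

random.seed(4)
p0 = 0.46        # null hypothesis: A's true vote share
n_poll = 100     # people polled
n_sims = 50_000  # simulated polls

def one_poll():
    """Simulate how many of 100 respondents say they voted for A, under H0."""
    return sum(random.random() < p0 for _ in range(n_poll))

draws = sorted(one_poll() for _ in range(n_sims))

# Empirical 5th and 95th percentiles give a two-sided 10% rejection rule.
lo = draws[int(0.05 * n_sims)]
hi = draws[int(0.95 * n_sims)]
print(lo, hi)   # roughly 38 and 54, matching the summarize output on a later slide
```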
## Multiple Regression Analysis: Inference
• A test statistic is a function of the random sample of data.
• The outcome of this function is used to create a rejection rule for the null hypothesis.
• Usually the rejection rule is to reject the null hypothesis if the outcome of the function
exceeds some critical value.

## Multiple Regression Analysis: Inference

• Let’s look at the simulated data and see if we can construct a rejection rule for the null
hypothesis that $\pi = 0.46$.

```
. summarize xb, detail

                             xb
-------------------------------------------------------------
      Percentiles      Smallest
 1%           34             24
 5%           38             25
10%           40             26       Obs              100000
25%           43             26       Sum of Wgt.      100000

50%           46                      Mean            45.9947
                      Largest        Std. Dev.      5.001752
75%           49             66
90%           52             67       Variance       25.01752
95%           54             68       Skewness       .0185113
99%           58             68       Kurtosis       2.989094
```

## Multiple Regression Analysis: Inference

• A two-sided test might be to reject the null hypothesis if the poll results are greater or
less than certain values.
• For example, a 10% significance test would be to reject the null hypothesis if less than
38 people polled voted for A, or more than 54 people did.
• There is a 10% chance of a Type I error – rejecting the null hypothesis even if it is
true.
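• Since each respondent is an independent draw, the exact size of the rule “reject if fewer than 38 or more than 54 people voted for A” can be computed from the Binomial(100, 0.46) distribution. This check is a sketch, not part of the original slides; the rule’s exact size is close to, though not exactly, 10%.

```python
from math import comb

p0, n = 0.46, 100

def pmf(k):
    """Binomial(100, 0.46) probability of exactly k votes for A in the poll."""
    return comb(n, k) * p0 ** k * (1 - p0) ** (n - k)

# Type I error probability of the two-sided rule: reject if X < 38 or X > 54.
type1 = sum(pmf(k) for k in range(0, 38)) + sum(pmf(k) for k in range(55, n + 1))
print(type1)   # roughly 0.09, close to the stated 10% significance level
```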
## Multiple Regression Analysis: Inference
• Reducing the probability of Type I errors, however, increases the probability of
making a Type II error – failing to reject the null hypothesis if it is false.
• For example, suppose that 52% of the population actually voted for A. If so, it is still
fairly unlikely that a single poll will show 58 or more people voting for A. If the poll
shows that 53 people voted for A, the magistrate fails to reject the null hypothesis, and
a Type II error has been made.

## Multiple Regression Analysis: Inference

• To summarize, a null hypothesis is rejected if the test statistic falls beyond the critical
values.
• If not we fail to reject the hypothesis.
• The percent chance of a Type I error (rejecting the hypothesis if it is true) is the
significance level of the test.
• Failing to reject a hypothesis is NOT “accepting” the hypothesis.

## Multiple Regression Analysis: Inference

• For example, suppose that the cannon in front of the Vice-Chancellor’s house has been
stolen.
• The police know it was stolen at 2 in the morning.
• At the time, many people were taking part in a dance party nearby. A friend of yours
was there as well.

## Multiple Regression Analysis: Inference

• Since your friend was there, if that is all the evidence you have, you fail to reject the
null hypothesis that your friend stole the cannon.
• However you have no evidence that your friend DID steal the cannon, either.
• Therefore you cannot “accept” the null hypothesis that your friend stole the cannon.