
Linear Regression with One Regressor


A state implements tough new penalties on drunk drivers: What is the effect on highway fatalities? A school district cuts the size of its elementary school classes: What is the effect on its students' standardized test scores? You successfully complete one more year of college classes: What is the effect on your future earnings?

All three of these questions are about the unknown effect of changing one variable, X (X being penalties for drunk driving, class size, or years of schooling), on another variable, Y (Y being highway deaths, student test scores, or earnings).

This chapter introduces the linear regression model relating one variable, X, to another, Y. This model postulates a linear relationship between X and Y; the slope of the line relating X and Y is the effect of a one-unit change in X on Y. Just as the mean of Y is an unknown characteristic of the population distribution of Y, the slope of the line relating X and Y is an unknown characteristic of the population joint distribution of X and Y. The econometric problem is to estimate this slope, that is, to estimate the effect on Y of a unit change in X, using a sample of data on these two variables.

This chapter describes methods for estimating this slope using a random sample of data on X and Y. For instance, using data on class sizes and test scores from different school districts, we show how to estimate the expected effect on test scores of reducing class sizes by, say, one student per class. The slope and the intercept of the line relating X and Y can be estimated by a method called ordinary least squares (OLS).

4.1 The Linear Regression Model


The superintendent of an elementary school district must decide whether to hire additional teachers, and she wants your advice. If she hires the teachers, she will reduce the number of students per teacher (the student-teacher ratio) by two. She faces a trade-off. Parents want smaller classes so that their children can receive more individualized attention. But hiring more teachers means spending more money, which is not to the liking of those paying the bill! So she asks you: If she cuts class sizes, what will the effect be on student performance?


In many school districts, student performance is measured by standardized tests, and the job status or pay of some administrators can depend in part on how well their students do on these tests. We therefore sharpen the superintendent's question: If she reduces the average class size by two students, what will the effect be on standardized test scores in her district?

A precise answer to this question requires a quantitative statement about changes. If the superintendent changes the class size by a certain amount, what would she expect the change in standardized test scores to be? We can write this as a mathematical relationship using the Greek letter beta, β_ClassSize, where the subscript ClassSize distinguishes the effect of changing the class size from other effects. Thus,

    β_ClassSize = change in TestScore / change in ClassSize = ΔTestScore / ΔClassSize,  (4.1)

where the Greek letter Δ (delta) stands for "change in." That is, β_ClassSize is the change in the test score that results from changing the class size divided by the change in the class size.

If you were lucky enough to know β_ClassSize, you would be able to tell the superintendent that decreasing class size by one student would change districtwide test scores by β_ClassSize. You could also answer the superintendent's actual question, which concerned changing class size by two students per class. To do so, rearrange Equation (4.1) so that

    ΔTestScore = β_ClassSize × ΔClassSize.  (4.2)

Suppose that β_ClassSize = −0.6. Then a reduction in class size of two students per class would yield a predicted change in test scores of (−0.6) × (−2) = 1.2; that is, you would predict that test scores would rise by 1.2 points as a result of the reduction in class sizes by two students per class.
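This prediction is a one-line computation; here is a minimal sketch in Python, using the hypothetical coefficient value assumed in the text:

```python
# Predicted change in test scores from Equation (4.2), using the
# hypothetical value beta_ClassSize = -0.6 assumed in the text.
beta_class_size = -0.6
delta_class_size = -2                                  # cut class size by two students
delta_test_score = beta_class_size * delta_class_size
print(delta_test_score)                                # 1.2 points
```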

Equation (4.1) is the definition of the slope of a straight line relating test scores and class size. This straight line can be written

    TestScore = β₀ + β_ClassSize × ClassSize,  (4.3)

where β₀ is the intercept of this straight line and, as before, β_ClassSize is the slope. According to Equation (4.3), if you knew β₀ and β_ClassSize, not only would you be able to determine the change in test scores at a district associated with a change in class size, but you also would be able to predict the average test score itself for a given class size.


When you propose Equation (4.3) to the superintendent, she tells you that something is wrong with this formulation. She points out that class size is just one of many facets of elementary education and that two districts with the same class sizes will have different test scores for many reasons. One district might have better teachers or it might use better textbooks. Two districts with comparable class sizes, teachers, and textbooks still might have very different student populations; perhaps one district has more immigrants (and thus fewer native English speakers) or wealthier families. Finally, she points out that even if two districts are the same in all these ways they might have different test scores for essentially random reasons having to do with the performance of the individual students on the day of the test. She is right, of course; for all these reasons, Equation (4.3) will not hold exactly for all districts. Instead, it should be viewed as a statement about a relationship that holds on average across the population of districts.
A version of this linear relationship that holds for each district must incorporate these other factors influencing test scores, including each district's unique characteristics (for example, quality of their teachers, background of their students, how lucky the students were on test day). One approach would be to list the most important factors and to introduce them explicitly into Equation (4.3) (an idea we return to in Chapter 6). For now, however, we simply lump all these "other factors" together and write the relationship for a given district as

    TestScore = β₀ + β_ClassSize × ClassSize + other factors.  (4.4)

Thus the test score for the district is written in terms of one component, β₀ + β_ClassSize × ClassSize, that represents the average effect of class size on scores in the population of school districts, and a second component that represents all other factors.

Although this discussion has focused on test scores and class size, the idea expressed in Equation (4.4) is much more general, so it is useful to introduce more general notation. Suppose you have a sample of n districts. Let Y_i be the average test score in the i-th district, let X_i be the average class size in the i-th district, and let u_i denote the other factors influencing the test score in the i-th district. Then Equation (4.4) can be written more generally as

    Y_i = β₀ + β₁X_i + u_i  (4.5)

for each district (that is, i = 1, ..., n), where β₀ is the intercept of this line and β₁ is the slope. [The general notation β₁ is used for the slope in Equation (4.5) instead of β_ClassSize because this equation is written in terms of a general variable X_i.]


Equation (4.5) is the linear regression model with a single regressor, in which Y is the dependent variable and X is the independent variable or the regressor.

The first part of Equation (4.5), β₀ + β₁X_i, is the population regression line or the population regression function. This is the relationship that holds between Y and X on average over the population. Thus, if you knew the value of X, according to this population regression line you would predict that the value of the dependent variable, Y, is β₀ + β₁X.

The intercept β₀ and the slope β₁ are the coefficients of the population regression line, also known as the parameters of the population regression line. The slope β₁ is the change in Y associated with a unit change in X. The intercept is the value of the population regression line when X = 0; it is the point at which the population regression line intersects the Y axis. In some econometric applications, the intercept has a meaningful economic interpretation. In other applications, the intercept has no real-world meaning; for example, when X is the class size, strictly speaking the intercept is the predicted value of test scores when there are no students in the class! When the real-world meaning of the intercept is nonsensical, it is best to think of it mathematically as the coefficient that determines the level of the regression line.

The term u_i in Equation (4.5) is the error term. The error term incorporates all of the factors responsible for the difference between the district's average test score and the value predicted by the population regression line. This error term contains all the other factors besides X that determine the value of the dependent variable, Y, for a specific observation, i. In the class size example, these other factors include all the unique features of the district that affect the performance of its students on the test, including teacher quality, student economic background, luck, and even any mistakes in grading the test.

The linear regression model and its terminology are summarized in Key Concept 4.1.

Figure 4.1 summarizes the linear regression model with a single regressor for seven hypothetical observations on test scores (Y) and class size (X). The population regression line is the straight line β₀ + β₁X. The population regression line slopes down (β₁ < 0), which means that districts with lower student-teacher ratios (smaller classes) tend to have higher test scores. The intercept β₀ has a mathematical meaning as the value of the Y axis intersected by the population regression line, but, as mentioned earlier, it has no real-world meaning in this example.

Because of the other factors that determine test performance, the hypothetical observations in Figure 4.1 do not fall exactly on the population regression line. For example, the value of Y for district #1, Y₁, is above the population regression line. This means that test scores in district #1 were better than predicted by the population regression line, so the error term for that district, u₁, is positive. In contrast, Y₂ is below the population regression line, so test scores for that district were worse than predicted, and u₂ < 0.


KEY CONCEPT 4.1
Terminology for the Linear Regression Model with a Single Regressor

The linear regression model is

    Y_i = β₀ + β₁X_i + u_i,

where

the subscript i runs over observations, i = 1, ..., n;
Y_i is the dependent variable, the regressand, or simply the left-hand variable;
X_i is the independent variable, the regressor, or simply the right-hand variable;
β₀ + β₁X is the population regression line or the population regression function;
β₀ is the intercept of the population regression line;
β₁ is the slope of the population regression line; and
u_i is the error term.

FIGURE 4.1 Scatterplot of Test Score vs. Student-Teacher Ratio (Hypothetical Data)

[Figure: seven hypothetical observations plotted as test score (Y), from 600 to 700, against student-teacher ratio (X), from 10 to 30, with the population regression line β₀ + β₁X drawn through them.] The scatterplot shows hypothetical observations for seven school districts. The population regression line is β₀ + β₁X. The vertical distance from the i-th point to the population regression line is Y_i − (β₀ + β₁X_i), which is the population error term u_i for the i-th observation.


Now return to your problem as advisor to the superintendent: What is the expected effect on test scores of reducing the student-teacher ratio by two students per teacher? The answer is easy: The expected change is (−2) × β_ClassSize. But what is the value of β_ClassSize?

4.2 Estimating the Coefficients of the Linear Regression Model

In a practical situation such as the application to class size and test scores, the intercept β₀ and slope β₁ of the population regression line are unknown. Therefore, we must use data to estimate the unknown slope and intercept of the population regression line.

This estimation problem is similar to others you have faced in statistics. For example, suppose you want to compare the mean earnings of men and women who recently graduated from college. Although the population mean earnings are unknown, we can estimate the population means using a random sample of male and female college graduates. Then the natural estimator of the unknown population mean earnings for women, for example, is the average earnings of the female college graduates in the sample.

The same idea extends to the linear regression model. We do not know the population value of β_ClassSize, the slope of the unknown population regression line relating X (class size) and Y (test scores). But just as it was possible to learn about the population mean using a sample of data drawn from that population, so is it possible to learn about the population slope β_ClassSize using a sample of data.

The data we analyze here consist of test scores and class sizes in 1999 in 420 California school districts that serve kindergarten through eighth grade. The test score is the districtwide average of reading and math scores for fifth graders. Class size can be measured in various ways. The measure used here is one of the broadest, which is the number of students in the district divided by the number of teachers, that is, the districtwide student-teacher ratio. These data are described in more detail in Appendix 4.1.
Table 4.1 summarizes the distributions of test scores and class sizes for this sample. The average student-teacher ratio is 19.6 students per teacher, and the standard deviation is 1.9 students per teacher.


TABLE 4.1 Summary of the Distribution of Student-Teacher Ratios and Fifth-Grade Test Scores for 420 K-8 Districts in California in 1999

                                                         Percentile
                       Average   Standard    10%     25%     40%     50%       60%     75%     90%
                                 Deviation                           (median)
Student-teacher ratio   19.6       1.9       17.3    18.6    19.3    19.7      20.1    20.9    21.9
Test score             654.2      19.1      630.4   640.0   649.1   654.5     659.4   666.7   679.1

The 10th percentile of the distribution of the student-teacher ratio is 17.3 (that is, only 10% of districts have student-teacher ratios below 17.3), while the district at the 90th percentile has a student-teacher ratio of 21.9.

A scatterplot of these 420 observations on test scores and the student-teacher ratio is shown in Figure 4.2. The sample correlation is −0.23, indicating a weak negative relationship between the two variables. Although larger classes in this sample tend to have lower test scores, there are other determinants of test scores that keep the observations from falling perfectly along a straight line.
FIGURE 4.2 Scatterplot of Test Score vs. Student-Teacher Ratio (California School District Data)

[Figure: test score, from 600 to 720, plotted against student-teacher ratio, from 10 to 30, for 420 California school districts.] There is a weak negative relationship between the student-teacher ratio and test scores: The sample correlation is −0.23.


Despite this low correlation, if one could somehow draw a straight line through these data, then the slope of this line would be an estimate of β_ClassSize based on these data. One way to draw the line would be to take out a pencil and a ruler and to "eyeball" the best line you could. While this method is easy, it is very unscientific, and different people will create different estimated lines.

How, then, should you choose among the many possible lines? By far the most common way is to choose the line that produces the "least squares" fit to these data, that is, to use the ordinary least squares (OLS) estimator.

The Ordinary Least Squares Estimator


The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X.

As discussed in Section 3.1, the sample average, Ȳ, is the least squares estimator of the population mean, E(Y); that is, Ȳ minimizes the total squared estimation mistakes Σᵢ₌₁ⁿ (Y_i − m)² among all possible estimators m [see Expression (3.2)].

The OLS estimator extends this idea to the linear regression model. Let b₀ and b₁ be some estimators of β₀ and β₁. The regression line based on these estimators is b₀ + b₁X, so the value of Y_i predicted using this line is b₀ + b₁X_i. Thus the mistake made in predicting the i-th observation is Y_i − (b₀ + b₁X_i) = Y_i − b₀ − b₁X_i. The sum of these squared prediction mistakes over all n observations is

    Σᵢ₌₁ⁿ (Y_i − b₀ − b₁X_i)².  (4.6)

The sum of the squared mistakes for the linear regression model in Expression (4.6) is the extension of the sum of the squared mistakes for the problem of estimating the mean in Expression (3.2). In fact, if there is no regressor, then b₁ does not enter Expression (4.6) and the two problems are identical except for the different notation [m in Expression (3.2), b₀ in Expression (4.6)]. Just as there is a unique estimator, Ȳ, that minimizes Expression (3.2), so is there a unique pair of estimators of β₀ and β₁ that minimize Expression (4.6).
The estimators of the intercept and slope that minimize the sum of squared mistakes in Expression (4.6) are called the ordinary least squares (OLS) estimators of β₀ and β₁.

OLS has its own special notation and terminology. The OLS estimator of β₀ is denoted β̂₀, and the OLS estimator of β₁ is denoted β̂₁. The OLS regression line, also called the sample regression line or sample regression function, is the straight line constructed using the OLS estimators: β̂₀ + β̂₁X. The predicted value of Y_i given X_i, based on the OLS regression line, is Ŷ_i. The residual for the i-th observation is the difference between Y_i and its predicted value: û_i = Y_i − Ŷ_i.


KEY CONCEPT 4.2
The OLS Estimator, Predicted Values, and Residuals

The OLS estimators of the slope β₁ and the intercept β₀ are

    β̂₁ = Σᵢ₌₁ⁿ (X_i − X̄)(Y_i − Ȳ) / Σᵢ₌₁ⁿ (X_i − X̄)²  (4.7)

    β̂₀ = Ȳ − β̂₁X̄.  (4.8)

The OLS predicted values Ŷ_i and residuals û_i are

    Ŷ_i = β̂₀ + β̂₁X_i, i = 1, ..., n  (4.9)

    û_i = Y_i − Ŷ_i, i = 1, ..., n.  (4.10)

The estimated intercept (β̂₀), slope (β̂₁), and residual (û_i) are computed from a sample of n observations of X_i and Y_i, i = 1, ..., n. These are estimates of the unknown true population intercept (β₀), slope (β₁), and error term (u_i).

The OLS estimators, β̂₀ and β̂₁, are sample counterparts of the population coefficients, β₀ and β₁. Similarly, the OLS regression line β̂₀ + β̂₁X is the sample counterpart of the population regression line β₀ + β₁X, and the OLS residuals û_i are sample counterparts of the population errors u_i.
You could compute the OLS estimators β̂₀ and β̂₁ by trying different values of b₀ and b₁ repeatedly until you find those that minimize the total squared mistakes in Expression (4.6); they are the least squares estimates. This method would be quite tedious, however. Fortunately, there are formulas, derived by minimizing Expression (4.6) using calculus, that streamline the calculation of the OLS estimators.

The OLS formulas and terminology are collected in Key Concept 4.2. These formulas are implemented in virtually all statistical and spreadsheet programs. These formulas are derived in Appendix 4.2.
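Because the calculation in Key Concept 4.2 is purely mechanical, it is easy to sketch in code. The following is a minimal NumPy implementation of Equations (4.7) through (4.10); the function name, variable names, and simulated data are ours for illustration (they are not the California data set), and in practice you would rely on a statistical package:

```python
import numpy as np

def ols(x, y):
    """OLS estimates for the single-regressor model Y_i = b0 + b1*X_i + u_i.

    Implements Key Concept 4.2: the slope (4.7) is the ratio of the sample
    covariance of X and Y to the sample variance of X; the intercept (4.8)
    makes the line pass through the point of means.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # Eq. (4.7)
    beta0 = y_bar - beta1 * x_bar                                         # Eq. (4.8)
    y_hat = beta0 + beta1 * x                                             # Eq. (4.9), predicted values
    u_hat = y - y_hat                                                     # Eq. (4.10), residuals
    return beta0, beta1, y_hat, u_hat

# Illustration on simulated data (NOT the actual California data set):
rng = np.random.default_rng(0)
str_ratio = rng.normal(19.6, 1.9, size=420)                  # hypothetical student-teacher ratios
score = 699 - 2.3 * str_ratio + rng.normal(0, 18, size=420)  # hypothetical test scores
beta0, beta1, y_hat, u_hat = ols(str_ratio, score)
print(beta0, beta1)  # estimates should be near the assumed values 699 and -2.3
```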


OLS Estimates of the Relationship Between Test Scores and the Student-Teacher Ratio

When OLS is used to estimate a line relating the student-teacher ratio to test scores using the 420 observations in Figure 4.2, the estimated slope is −2.28 and the estimated intercept is 698.9. Accordingly, the OLS regression line for these 420 observations is

    TestScore^ = 698.9 − 2.28 × STR,  (4.11)

where TestScore is the average test score in the district and STR is the student-teacher ratio. The "^" over TestScore in Equation (4.11) indicates that it is the predicted value based on the OLS regression line. Figure 4.3 plots this OLS regression line superimposed over the scatterplot of the data previously shown in Figure 4.2.

The slope of −2.28 means that an increase in the student-teacher ratio by one student per class is, on average, associated with a decline in districtwide test scores by 2.28 points on the test. A decrease in the student-teacher ratio by two students per class is, on average, associated with an increase in test scores of 4.56 points [= −2 × (−2.28)]. The negative slope indicates that more students per teacher (larger classes) is associated with poorer performance on the test.

FIGURE 4.3 The Estimated Regression Line for the California Data

[Figure: the OLS regression line TestScore^ = 698.9 − 2.28 × STR superimposed on the scatterplot of test score (600 to 720) against student-teacher ratio (10 to 30).] The estimated regression line shows a negative relationship between test scores and the student-teacher ratio. If class sizes fall by one student, the estimated regression predicts that test scores will increase by 2.28 points.


It is now possible to predict the districtwide test score given a value of the student-teacher ratio. For example, for a district with 20 students per teacher, the predicted test score is 698.9 − 2.28 × 20 = 653.3. Of course, this prediction will not be exactly right because of the other factors that determine a district's performance. But the regression line does give a prediction (the OLS prediction) of what test scores would be for that district, based on its student-teacher ratio, absent those other factors.
Is this estimate of the slope large or small? To answer this, we return to the superintendent's problem. Recall that she is contemplating hiring enough teachers to reduce the student-teacher ratio by 2. Suppose her district is at the median of the California districts. From Table 4.1, the median student-teacher ratio is 19.7 and the median test score is 654.5. A reduction of two students per class, from 19.7 to 17.7, would move her student-teacher ratio from the 50th percentile to very near the 10th percentile. This is a big change, and she would need to hire many new teachers. How would it affect test scores?

According to Equation (4.11), cutting the student-teacher ratio by 2 is predicted to increase test scores by approximately 4.6 points; if her district's test scores are at the median, 654.5, they are predicted to increase to 659.1. Is this improvement large or small? According to Table 4.1, this improvement would move her district from the median to just short of the 60th percentile. Thus a decrease in class size that would place her district close to the 10% with the smallest classes would move her test scores from the 50th to the 60th percentile. According to these estimates, at least, cutting the student-teacher ratio by a large amount (two students per teacher) would help and might be worth doing depending on her budgetary situation, but it would not be a panacea.

What if the superintendent were contemplating a far more radical change, such as reducing the student-teacher ratio from 20 students per teacher to 5? Unfortunately, the estimates in Equation (4.11) would not be very useful to her. This regression was estimated using the data in Figure 4.2, and, as the figure shows, the smallest student-teacher ratio in these data is 14. These data contain no information on how districts with extremely small classes perform, so these data alone are not a reliable basis for predicting the effect of a radical move to such an extremely low student-teacher ratio.

Why Use the OLS Estimator?

There are both practical and theoretical reasons to use the OLS estimators β̂₀ and β̂₁. Because OLS is the dominant method used in practice, it has become the common language for regression analysis throughout economics, finance (see "The 'Beta' of a Stock" box), and the social sciences more generally. Presenting results using OLS (or its variants discussed later in this book) means that you are "speaking the same language" as other economists and statisticians. The OLS formulas are built into virtually all spreadsheet and statistical software packages, making OLS easy to use.



The "Beta" of a Stock

A fundamental idea of modern finance is that an investor needs a financial incentive to take a risk. Said differently, the expected return¹ on a risky investment, R, must exceed the return on a safe, or risk-free, investment, R_f. Thus the expected excess return, R − R_f, on a risky investment, like owning stock in a company, should be positive.

At first it might seem like the risk of a stock should be measured by its variance. Much of that risk, however, can be reduced by holding other stocks in a "portfolio", in other words, by diversifying your financial holdings. This means that the right way to measure the risk of a stock is not by its variance but rather by its covariance with the market.

The capital asset pricing model (CAPM) formalizes this idea. According to the CAPM, the expected excess return on an asset is proportional to the expected excess return on a portfolio of all available assets (the "market portfolio"). That is, the CAPM says that

    R − R_f = β(R_m − R_f),  (4.12)

where R_m is the expected return on the market portfolio and β is the coefficient in the population regression of R − R_f on R_m − R_f. In practice, the risk-free return is often taken to be the rate of interest on short-term U.S. government debt. According to the CAPM, a stock with a β < 1 has less risk than the market portfolio and therefore has a lower expected excess return than the market portfolio. In contrast, a stock with a β > 1 is riskier than the market portfolio and thus commands a higher expected excess return.

The "beta" of a stock has become a workhorse of the investment industry, and you can obtain estimated betas for hundreds of stocks on investment firm Web sites. Those betas typically are estimated by OLS regression of the actual excess return on the stock against the actual excess return on a broad market index.

The table below gives estimated betas for seven U.S. stocks. Low-risk producers of consumer staples like Kellogg have stocks with low betas; riskier stocks have high betas.

Company                                    Estimated β
Wal-Mart (discount retailer)                   0.3
Kellogg (breakfast cereal)                     0.5
Waste Management (waste disposal)              0.6
Verizon (telecommunications)                   0.6
Microsoft (software)                           1.0
Best Buy (electronic equipment retailer)       1.3
Bank of America (bank)                         2.4

Source: SmartMoney.com.

¹The return on an investment is the change in its price plus any payout (dividend) from the investment as a percentage of its initial price. For example, a stock bought on January 1 for $100, which then paid a $2.50 dividend during the year and sold on December 31 for $105, would have a return of R = [($105 − $100) + $2.50]/$100 = 7.5%.
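As a sketch of how such a beta is estimated, the OLS slope formula of Key Concept 4.2 can be applied to excess returns. The return series below are simulated stand-ins, with an assumed true beta of 1.3; they are not actual stock data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated monthly excess returns (stand-ins for actual data),
# with an assumed true beta of 1.3 for the stock.
market_excess = rng.normal(0.005, 0.04, size=120)                     # R_m - R_f
stock_excess = 1.3 * market_excess + rng.normal(0.0, 0.06, size=120)  # R - R_f

# The estimated beta is the OLS slope of Equation (4.7) applied to excess returns.
x_dev = market_excess - market_excess.mean()
beta_hat = np.sum(x_dev * (stock_excess - stock_excess.mean())) / np.sum(x_dev ** 2)
print(round(beta_hat, 2))  # should be close to the assumed beta of 1.3
```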



The OLS estimators also have desirable theoretical properties. They are analogous to the desirable properties, studied in Section 3.1, of Ȳ as an estimator of the population mean. Under the assumptions introduced in Section 4.4, the OLS estimator is unbiased and consistent. The OLS estimator is also efficient among a certain class of unbiased estimators; however, this efficiency result holds under some additional special conditions, and further discussion of this result is deferred until Section 5.5.

4.3 Measures of Fit


Having estimated a linear regression, you might wonder how well that regression line describes the data. Does the regressor account for much or for little of the variation in the dependent variable? Are the observations tightly clustered around the regression line, or are they spread out?

The R² and the standard error of the regression measure how well the OLS regression line fits the data. The R² ranges between 0 and 1 and measures the fraction of the variance of Y_i that is explained by X_i. The standard error of the regression measures how far Y_i typically is from its predicted value.
The R²

The regression R² is the fraction of the sample variance of Y_i explained by (or predicted by) X_i. The definitions of the predicted value and the residual (see Key Concept 4.2) allow us to write the dependent variable Y_i as the sum of the predicted value, Ŷ_i, plus the residual û_i:

    Y_i = Ŷ_i + û_i.  (4.13)

In this notation, the R² is the ratio of the sample variance of Ŷ_i to the sample variance of Y_i.

Mathematically, the R² can be written as the ratio of the explained sum of squares to the total sum of squares. The explained sum of squares (ESS) is the sum of squared deviations of the predicted values of Y_i, Ŷ_i, from their average, and the total sum of squares (TSS) is the sum of squared deviations of Y_i from its average:

    ESS = Σᵢ₌₁ⁿ (Ŷ_i − Ȳ)²  (4.14)

    TSS = Σᵢ₌₁ⁿ (Y_i − Ȳ)².  (4.15)


Equation (4.14) uses the fact that the sample average OLS predicted value equals Ȳ (proven in Appendix 4.3).

The R² is the ratio of the explained sum of squares to the total sum of squares:

    R² = ESS/TSS.  (4.16)

Alternatively, the R² can be written in terms of the fraction of the variance of Y_i not explained by X_i. The sum of squared residuals, or SSR, is the sum of the squared OLS residuals:

    SSR = Σᵢ₌₁ⁿ û_i².  (4.17)

It is shown in Appendix 4.3 that TSS = ESS + SSR. Thus the R² also can be expressed as 1 minus the ratio of the sum of squared residuals to the total sum of squares:

    R² = 1 − SSR/TSS.  (4.18)

Finally, the R² of the regression of Y on the single regressor X is the square of the correlation coefficient between Y and X.

The R² ranges between 0 and 1. If β̂₁ = 0, then X_i explains none of the variation of Y_i, and the predicted value of Y_i based on the regression is just the sample average of Y_i. In this case, the explained sum of squares is zero and the sum of squared residuals equals the total sum of squares; thus the R² is zero. In contrast, if X_i explains all of the variation of Y_i, then Ŷ_i = Y_i for all i and every residual is zero (that is, û_i = 0), so that ESS = TSS and R² = 1. In general, the R² does not take on the extreme values of 0 or 1 but falls somewhere in between. An R² near 1 indicates that the regressor is good at predicting Y_i, while an R² near 0 indicates that the regressor is not very good at predicting Y_i.

The Standard Error of the Regression

The standard error of the regression (SER) is an estimator of the standard deviation of the regression error u_i. The units of u_i and Y_i are the same, so the SER is a measure of the spread of the observations around the regression line, measured in the units of the dependent variable.


For example, if the units of the dependent variable are dollars, then the SER measures the magnitude of a typical deviation from the regression line, that is, the magnitude of a typical regression error, in dollars. Because the regression errors u₁, ..., uₙ are unobserved, the SER is computed using their sample counterparts, the OLS residuals û₁, ..., ûₙ. The formula for the SER is

    SER = s_û, where s_û² = (1/(n − 2)) Σᵢ₌₁ⁿ û_i² = SSR/(n − 2),  (4.19)

where the formula for s_û² uses the fact (proven in Appendix 4.3) that the sample average of the OLS residuals is zero.
The formula for the SER in Equation (4.19) is similar to the formula for the sample standard deviation of Y given in Equation (3.7) in Section 3.2, except that Y_i − Ȳ in Equation (3.7) is replaced by û_i and the divisor in Equation (3.7) is n − 1, whereas here it is n − 2. The reason for using the divisor n − 2 here (instead of n) is the same as the reason for using the divisor n − 1 in Equation (3.7): It corrects for a slight downward bias introduced because two regression coefficients were estimated. This is called a "degrees of freedom" correction because two coefficients were estimated (β₀ and β₁), two "degrees of freedom" of the data were lost, so the divisor in this factor is n − 2. (The mathematics behind this is discussed in Section 5.6.) When n is large, the difference between dividing by n, by n − 1, or by n − 2 is negligible.
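A minimal sketch of these measures of fit, implementing Equations (4.14) through (4.19) directly; the function name and the made-up numbers are ours for illustration:

```python
import numpy as np

def measures_of_fit(y, y_hat, u_hat):
    """R^2 and SER for a single-regressor OLS fit, per Equations (4.14)-(4.19)."""
    y, y_hat, u_hat = (np.asarray(a, dtype=float) for a in (y, y_hat, u_hat))
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares, Eq. (4.15)
    ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares, Eq. (4.14)
    ssr = np.sum(u_hat ** 2)               # sum of squared residuals, Eq. (4.17)
    r2 = ess / tss                         # Eq. (4.16); equals 1 - ssr/tss since TSS = ESS + SSR
    ser = np.sqrt(ssr / (n - 2))           # Eq. (4.19), with the degrees-of-freedom divisor n - 2
    return r2, ser

# Tiny check on made-up numbers with a nearly perfect fit:
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.0, 4.1, 4.9])
print(measures_of_fit(y, y_hat, y - y_hat))
```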

Application to the Test Score Data


Equation (4.11) reports the regression line, estimated using the California test score data, relating the standardized test score (TestScore) to the student-teacher ratio (STR). The R² of this regression is 0.051, or 5.1%, and the SER is 18.6.
The R² of 0.051 means that the regressor STR explains 5.1% of the variance of the dependent variable TestScore. Figure 4.3 superimposes this regression line on the scatterplot of the TestScore and STR data. As the scatterplot shows, the student-teacher ratio explains some of the variation in test scores, but much variation remains unaccounted for.

The SER of 18.6 means that the standard deviation of the regression residuals is 18.6, where the units are points on the standardized test. Because the standard deviation is a measure of spread, the SER of 18.6 means that there is a large spread of the scatterplot in Figure 4.3 around the regression line as measured in points on the test. This large spread means that predictions of test scores made using only the student-teacher ratio for that district will often be wrong by a large amount.


What should we make of this low R² and large SER? The fact that the R² of this regression is low (and the SER is large) does not, by itself, imply that this regression is either "good" or "bad." What the low R² does tell us is that other important factors influence test scores. These factors could include differences in the student body across districts, differences in school quality unrelated to the student-teacher ratio, or luck on the test. The low R² and high SER do not tell us what these factors are, but they do indicate that the student-teacher ratio alone explains only a small part of the variation in test scores in these data.

4.4 The Least Squares Assumptions


This section presents a set of three assumptions on the linear regression model and the sampling scheme under which OLS provides an appropriate estimator of the unknown regression coefficients, β₀ and β₁. Initially, these assumptions might appear abstract. They do, however, have natural interpretations, and understanding these assumptions is essential for understanding when OLS will, and will not, give useful estimates of the regression coefficients.

Assumption #1: The Conditional Distribution of u_i Given X_i Has a Mean of Zero

The first of the three least squares assumptions is that the conditional distribution of u_i given X_i has a mean of zero. This assumption is a formal mathematical statement about the "other factors" contained in u_i and asserts that these other factors are unrelated to X_i in the sense that, given a value of X_i, the mean of the distribution of these other factors is zero.

This assumption is illustrated in Figure 4.4. The population regression is the relationship that holds on average between class size and test scores in the population, and the error term u_i represents the other factors that lead test scores at a given district to differ from the prediction based on the population regression line. As shown in Figure 4.4, at a given value of class size, say 20 students per class, sometimes these other factors lead to better performance than predicted (u_i > 0) and sometimes to worse performance (u_i < 0), but on average over the population the prediction is right. In other words, given X_i = 20, the mean of the distribution of u_i is zero. In Figure 4.4, this is shown as the distribution of u_i being centered on the population regression line at X_i = 20 and, more generally, at other values x of X_i as well. Said differently, the distribution of u_i, conditional on X_i = x, has a mean of zero; stated mathematically, E(u_i | X_i = x) = 0, or, in somewhat simpler notation, E(u_i | X_i) = 0.


FIGURE 4.4 The Conditional Probability Distributions and the Population Regression Line

[Figure: test score (600 to 720) against student-teacher ratio (10 to 30), with the conditional distributions of test scores drawn at X = 15, 20, and 25 and their means E(Y | X = 15), E(Y | X = 20), and E(Y | X = 25) on the population regression line.] The figure shows the conditional probability distributions of test scores for districts with class sizes of 15, 20, and 25 students. The mean of the conditional distribution of test scores, given the student-teacher ratio, E(Y | X), is the population regression line β₀ + β₁X. At a given value of X, Y is distributed around the regression line and the error, u = Y − (β₀ + β₁X), has a conditional mean of zero for all values of X.

As shown in Figure 4.4, the assumption that E(u_i | X_i) = 0 is equivalent to assuming that the population regression line is the conditional mean of Y_i given X_i (a mathematical proof of this is left as Exercise 4.6).
The conditional mean of u in a randomized controlled experiment. In a randomized controlled experiment, subjects are randomly assigned to the treatment group (X = 1) or to the control group (X = 0). The random assignment typically is done using a computer program that uses no information about the subject, ensuring that X is distributed independently of all personal characteristics of the subject. Random assignment makes X and u independent, which in turn implies that the conditional mean of u given X is zero.

In observational data, X is not randomly assigned in an experiment. Instead, the best that can be hoped for is that X is as if randomly assigned, in the precise sense that E(u_i | X_i) = 0. Whether this assumption holds in a given empirical application with observational data requires careful thought and judgment, and we return to this issue repeatedly.


Correlation and conditional mean. Recall from Section 2.3 that if the conditional mean of one random variable given another is zero, then the two random variables have zero covariance and thus are uncorrelated [Equation (2.27)]. Thus the conditional mean assumption E(u_i | X_i) = 0 implies that X_i and u_i are uncorrelated, or corr(X_i, u_i) = 0. Because correlation is a measure of linear association, this implication does not go the other way; even if X_i and u_i are uncorrelated, the conditional mean of u_i given X_i might be nonzero. However, if X_i and u_i are correlated, then it must be the case that E(u_i | X_i) is nonzero. It is therefore often convenient to discuss the conditional mean assumption in terms of possible correlation between X_i and u_i. If X_i and u_i are correlated, then the conditional mean assumption is violated.

Assumption #2: (X_i, Y_i), i = 1, ..., n, Are Independently and Identically Distributed

The second least squares assumption is that (X_i, Y_i), i = 1, ..., n, are independently and identically distributed (i.i.d.) across observations. As discussed in Section 2.5 (Key Concept 2.5), this assumption is a statement about how the sample is drawn. If the observations are drawn by simple random sampling from a single large population, then (X_i, Y_i), i = 1, ..., n, are i.i.d. For example, let X be the age of a worker and Y be his or her earnings, and imagine drawing a person at random from the population of workers. That randomly drawn person will have a certain age and earnings (that is, X and Y will take on some values). If a sample of n workers is drawn from this population, then (X_i, Y_i), i = 1, ..., n, necessarily have the same distribution. If they are drawn at random they are also distributed independently from one observation to the next; that is, they are i.i.d.

The i.i.d. assumption is a reasonable one for many data collection schemes. For example, survey data from a randomly chosen subset of the population typically can be treated as i.i.d.

Not all sampling schemes produce i.i.d. observations on (X_i, Y_i), however. One example is when the values of X are not drawn from a random sample of the population but rather are set by a researcher as part of an experiment. For example, suppose a horticulturalist wants to study the effects of different organic weeding methods (X) on tomato production (Y) and accordingly grows different plots of tomatoes using different organic weeding techniques. If she picks the techniques (the level of X) to be used on the i-th plot and applies the same technique to the i-th plot in all repetitions of the experiment, then the value of X_i does not change from one sample to the next. Thus X_i is nonrandom (although the outcome Y_i is random), so the sampling scheme is not i.i.d.


The results presented in this chapter developed for i.i.d. regressors are also true if the regressors are nonrandom. The case of a nonrandom regressor is, however, quite special. For example, modern experimental protocols would have the horticulturalist assign the level of X to the different plots using a computerized random number generator, thereby circumventing any possible bias by the horticulturalist (she might use her favorite weeding method for the tomatoes in the sunniest plot). When this modern experimental protocol is used, the level of X is random and (X_i, Y_i) are i.i.d.

Another example of non-i.i.d. sampling is when observations refer to the same unit of observation over time. For example, we might have data on inventory levels (Y) at a firm and the interest rate at which the firm can borrow (X), where these data are collected over time from a specific firm; for example, they might be recorded four times a year (quarterly) for 30 years. This is an example of time series data, and a key feature of time series data is that observations falling close to each other in time are not independent but rather tend to be correlated with each other; if interest rates are low now, they are likely to be low next quarter. This pattern of correlation violates the "independence" part of the i.i.d. assumption. Time series data introduce a set of complications that are best handled after developing the basic tools of regression analysis.

Assumption #3: Large Outliers Are Unlikely

The third least squares assumption is that large outliers, that is, observations with values of X_i, Y_i, or both that are far outside the usual range of the data, are unlikely. Large outliers can make OLS regression results misleading. This potential sensitivity of OLS to extreme outliers is illustrated in Figure 4.5 using hypothetical data.

In this book, the assumption that large outliers are unlikely is made mathematically precise by assuming that X and Y have nonzero finite fourth moments: 0 < E(X_i⁴) < ∞ and 0 < E(Y_i⁴) < ∞. Another way to state this assumption is that X and Y have finite kurtosis.

The assumption of finite kurtosis is used in the mathematics that justify the large-sample approximations to the distributions of the OLS test statistics. We encountered this assumption in Chapter 3 when discussing the consistency of the sample variance. Specifically, Equation (3.9) states that the sample variance s_Y² is a consistent estimator of the population variance σ_Y². If Y₁, ..., Yₙ are i.i.d. and the fourth moment of Y_i is finite, then the law of large numbers in Key Concept 2.6 applies to the average (1/n) Σᵢ₌₁ⁿ (Y_i − μ_Y)², a key step in the proof in Appendix 3.3 showing that s_Y² is consistent.

One source of large outliers is data entry errors, such as a typographical error or incorrectly using different units for different observations.

FIGURE 4.5 The Sensitivity of OLS to Large Outliers

[Figure: hypothetical data with X from 30 to 70 and Y from 0 to 2000, with one outlier; two OLS regression lines are drawn, one including and one excluding the outlier.] This hypothetical data set has one outlier. The OLS regression line estimated with the outlier shows a strong positive relationship between X and Y, but the OLS regression line estimated without the outlier shows no relationship.

Imagine collecting data on the height of students in meters, but inadvertently recording one student's height in centimeters instead. One way to find outliers is to plot your data. If you decide that an outlier is due to a data entry error, then you can either correct the error or, if that is impossible, drop the observation from your data set.

Data entry errors aside, the assumption of finite kurtosis is a plausible one in many applications with economic data. Class size is capped by the physical capacity of a classroom; the best you can do on a standardized test is to get all the questions right and the worst you can do is to get all the questions wrong. Because class size and test scores have a finite range, they necessarily have finite kurtosis. More generally, commonly used distributions such as the normal distribution have finite fourth moments. Still, as a mathematical matter, some distributions have infinite fourth moments, and this assumption rules out those distributions. If the assumption of finite fourth moments holds, then it is unlikely that statistical inferences using OLS will be dominated by a few observations.
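The sensitivity illustrated in Figure 4.5 is easy to reproduce. Here is a minimal sketch with simulated data whose true slope is zero, plus one hypothetical data entry error:

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope, Equation (4.7)."""
    x_dev = x - x.mean()
    return np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(30, 70, size=50)
y = 100 + rng.normal(0, 20, size=50)  # true slope is zero: Y is unrelated to X

print(ols_slope(x, y))                # near 0

# One hypothetical data entry error (e.g., a value recorded in the wrong units):
x_out = np.append(x, 68.0)
y_out = np.append(y, 2000.0)
print(ols_slope(x_out, y_out))        # strongly positive; the single outlier dominates
```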

Use of the Least Squares Assumptions

The three least squares assumptions for the linear regression model are summarized in Key Concept 4.3. The least squares assumptions play twin roles, and we return to them repeatedly throughout this textbook.


KEY CONCEPT 4.3
The Least Squares Assumptions

Y_i = β₀ + β₁X_i + u_i, i = 1, ..., n, where

1. The error term u_i has conditional mean zero given X_i: E(u_i | X_i) = 0;
2. (X_i, Y_i), i = 1, ..., n, are independent and identically distributed (i.i.d.) draws from their joint distribution; and
3. Large outliers are unlikely: X_i and Y_i have nonzero finite fourth moments.


Their first role is mathematical: If these assumptions hold, then, as is shown in the next section, in large samples the OLS estimators have sampling distributions that are normal. In turn, this large-sample normal distribution lets us develop methods for hypothesis testing and constructing confidence intervals using the OLS estimators.

Their second role is to organize the circumstances that pose difficulties for OLS regression. As we will see, the first least squares assumption is the most important to consider in practice. One reason why the first least squares assumption might not hold in practice is discussed in Chapter 6, and additional reasons are discussed in Section 9.2.

It is also important to consider whether the second assumption holds in an application. Although it plausibly holds in many cross-sectional data sets, the independence assumption is inappropriate for time series data. Therefore, the regression methods developed under assumption 2 require modification for some applications with time series data.

The third assumption serves as a reminder that OLS, just like the sample mean, can be sensitive to large outliers. If your data set contains large outliers, you should examine those outliers carefully to make sure those observations are correctly recorded and belong in the data set.

4.5 Sampling Distribution of the OLS Estimators

Because the OLS estimators β̂₀ and β̂₁ are computed from a randomly drawn sample, the estimators themselves are random variables with a probability distribution, the sampling distribution, that describes the values they could take over different possible random samples. This section presents these sampling distributions.


In small samples, these distributions are complicated, but in large samples, they
are approximately normal because of the central limit theorem.

The Sampling Distribution of the OLS Estimators

Review of the sampling distribution of Ȳ. Recall the discussion in Sections 2.5 and 2.6 about the sampling distribution of the sample average, Ȳ, an estimator of the unknown population mean of Y, μ_Y. Because Ȳ is calculated using a randomly drawn sample, Ȳ is a random variable that takes on different values from one sample to the next; the probability of these different values is summarized in its sampling distribution. Although the sampling distribution of Ȳ can be complicated when the sample size is small, it is possible to make certain statements about it that hold for all n. In particular, the mean of the sampling distribution is μ_Y, that is, E(Ȳ) = μ_Y, so Ȳ is an unbiased estimator of μ_Y. If n is large, then more can be said about the sampling distribution. In particular, the central limit theorem (Section 2.6) states that this distribution is approximately normal.

The sampling distribution of β̂₀ and β̂₁. These ideas carry over to the OLS estimators β̂₀ and β̂₁ of the unknown intercept β₀ and slope β₁ of the population regression line. Because the OLS estimators are calculated using a random sample, β̂₀ and β̂₁ are random variables that take on different values from one sample to the next; the probability of these different values is summarized in their sampling distributions.

Although the sampling distribution of β̂₀ and β̂₁ can be complicated when the sample size is small, it is possible to make certain statements about it that hold for all n. In particular, the means of the sampling distributions of β̂₀ and β̂₁ are β₀ and β₁. In other words, under the least squares assumptions in Key Concept 4.3,

    E(β̂₀) = β₀ and E(β̂₁) = β₁;  (4.20)

that is, β̂₀ and β̂₁ are unbiased estimators of β₀ and β₁. The proof that β̂₁ is unbiased is given in Appendix 4.3, and the proof that β̂₀ is unbiased is left as Exercise 4.7.

If the sample is sufficiently large, by the central limit theorem the sampling distribution of β̂₀ and β̂₁ is well approximated by the bivariate normal distribution (Section 2.4). This implies that the marginal distributions of β̂₀ and β̂₁ are normal in large samples.

This argument invokes the central limit theorem. Technically, the central limit theorem concerns the distribution of averages (like Ȳ). If you examine the numerator in Equation (4.7) for β̂₁, you will see that it, too, is a type of average, not a simple average, like Ȳ, but an average of the product (Y_i − Ȳ)(X_i − X̄). As discussed further in Appendix 4.3, the central limit theorem applies to this average so that, like the simpler average Ȳ, it is normally distributed in large samples.


KEY CONCEPT 4.4
Large-Sample Distributions of β̂₀ and β̂₁

If the least squares assumptions in Key Concept 4.3 hold, then in large samples β̂₀ and β̂₁ have a jointly normal sampling distribution. The large-sample normal distribution of β̂₁ is N(β₁, σ²_β̂₁), where the variance of this distribution, σ²_β̂₁, is

    σ²_β̂₁ = (1/n) · var[(X_i − μ_X)u_i] / [var(X_i)]².  (4.21)

The large-sample normal distribution of β̂₀ is N(β₀, σ²_β̂₀), where

    σ²_β̂₀ = (1/n) · var(H_i u_i) / [E(H_i²)]², where H_i = 1 − [μ_X / E(X_i²)] X_i.  (4.22)

The normal approximation to the distribution of the OLS estimators in large samples is summarized in Key Concept 4.4. (Appendix 4.3 summarizes the derivation of these formulas.) A relevant question in practice is how large n must be for these approximations to be reliable. In Section 2.6, we suggested that n = 100 is sufficiently large for the sampling distribution of Ȳ to be well approximated by a normal distribution, and sometimes smaller n suffices. This criterion carries over to the more complicated averages appearing in regression analysis. In virtually all modern econometric applications, n > 100, so we will treat the normal approximations to the distributions of the OLS estimators as reliable unless there are good reasons to think otherwise.
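A small Monte Carlo experiment makes the unbiasedness result in Equation (4.20) and the variance formula in Equation (4.21) concrete. The parameter values and distributions below are assumptions chosen for illustration:

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope, Equation (4.7)."""
    x_dev = x - x.mean()
    return np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)

rng = np.random.default_rng(3)
beta0, beta1, n = 5.0, 2.0, 200       # assumed population coefficients and sample size
slopes = np.empty(5000)
for s in range(5000):                 # draw 5000 random samples from the population
    x = rng.normal(0.0, 1.0, size=n)
    u = rng.normal(0.0, 1.0, size=n)  # independent of x, so E(u|X) = 0 by construction
    slopes[s] = ols_slope(x, beta0 + beta1 * x + u)

print(slopes.mean())  # close to beta1 = 2, illustrating unbiasedness, Eq. (4.20)
print(slopes.std())   # close to 1/sqrt(200), about 0.071, as implied by Eq. (4.21)
```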

The results in Key Concept 4.4 imply that the OLS estimators are consistent; that is, when the sample size is large, β̂₀ and β̂₁ will be close to the true population coefficients β₀ and β₁ with high probability. This is because the variances σ²_β̂₀ and σ²_β̂₁ of the estimators decrease to zero as n increases (n appears in the denominator of the formulas for the variances), so the distribution of the OLS estimators will be tightly concentrated around their means, β₀ and β₁, when n is large.

Another implication of the distributions in Key Concept 4.4 is that, in general, the larger is the variance of X_i, the smaller is the variance σ²_β̂₁ of β̂₁. Mathematically, this implication arises because the variance of β̂₁ in Equation (4.21) is inversely proportional to the square of the variance of X_i: the larger is var(X_i), the larger is the denominator in Equation (4.21), so the smaller is σ²_β̂₁.


FIGURE 4.6 The Variance of β̂₁ and the Variance of X

[Figure: 150 artificial data points on X (97 to 103) and Y (194 to 206).] The colored dots represent a set of X_i's with a small variance. The black dots represent a set of X_i's with a large variance. The regression line can be estimated more accurately with the black dots than with the colored dots.

To get a better sense of why this is so, look at Figure 4.6, which presents a scatterplot of 150 artificial data points on X and Y. The data points indicated by the colored dots are the 75 observations closest to X̄. Suppose you were asked to draw a line as accurately as possible through either the colored or the black dots; which would you choose? It would be easier to draw a precise line through the black dots, which have a larger variance than the colored dots. Similarly, the larger the variance of X, the more precise is β̂₁.
The distributions in Key Concept 4.4 also imply that the smaller is the variance of the error u_i, the smaller is the variance of β̂₁. This can be seen mathematically in Equation (4.21) because u_i enters the numerator, but not the denominator, of σ²_β̂₁: If all u_i were smaller by a factor of one-half but the X's did not change, then σ_β̂₁ would be smaller by a factor of one-half and σ²_β̂₁ would be smaller by a factor of one-fourth (Exercise 4.13). Stated less mathematically, if the errors are smaller (holding the X's fixed), then the data will have a tighter scatter around the population regression line, so its slope will be estimated more precisely.

The normal approximation to the sampling distribution of β̂₀ and β̂₁ is a powerful tool. With this approximation in hand, we are able to develop methods for making inferences about the true population values of the regression coefficients using only a sample of data.


4.6 Conclusion

This chapter has focused on the use of ordinary least squares to estimate the intercept and slope of a population regression line using a sample of n observations on a dependent variable, Y, and a single regressor, X. There are many ways to draw a straight line through a scatterplot, but doing so using OLS has several virtues. If the least squares assumptions hold, then the OLS estimators of the slope and intercept are unbiased, are consistent, and have a sampling distribution with a variance that is inversely proportional to the sample size n. Moreover, if n is large, then the sampling distribution of the OLS estimator is normal.

These important properties of the sampling distribution of the OLS estimator hold under the three least squares assumptions.

The first assumption is that the error term in the linear regression model has a conditional mean of zero, given the regressor X. This assumption implies that the OLS estimator is unbiased.

The second assumption is that (X_i, Y_i) are i.i.d., as is the case if the data are collected by simple random sampling. This assumption yields the formula, presented in Key Concept 4.4, for the variance of the sampling distribution of the OLS estimator.

The third assumption is that large outliers are unlikely. Stated more formally, X and Y have finite fourth moments (finite kurtosis). The reason for this assumption is that OLS can be unreliable if there are large outliers. Taken together, the three least squares assumptions imply that the OLS estimator is normally distributed in large samples as described in Key Concept 4.4.

The results in this chapter describe the sampling distribution of the OLS estimator. By themselves, however, these results are not sufficient to test a hypothesis about the value of β₁ or to construct a confidence interval for β₁. Doing so requires an estimator of the standard deviation of the sampling distribution, that is, the standard error of the OLS estimator. This step, moving from the sampling distribution of β̂₁ to its standard error, hypothesis tests, and confidence intervals, is taken in the next chapter.

Summary

1. The population regression line, β₀ + β₁X, is the mean of Y as a function of the value of X. The slope, β₁, is the expected change in Y associated with a one-unit change in X. The intercept, β₀, determines the level (or height) of the regression line. Key Concept 4.1 summarizes the terminology of the population linear regression model.



2. The population regression line can be estimated using sample observations (Y_i, X_i), i = 1, ..., n, by ordinary least squares (OLS). The OLS estimators of the regression intercept and slope are denoted β̂₀ and β̂₁.

3. The R² and standard error of the regression (SER) are measures of how close the values of Y_i are to the estimated regression line. The R² is between 0 and 1, with a larger value indicating that the Y_i's are closer to the line. The standard error of the regression is an estimator of the standard deviation of the regression error.

4. There are three key assumptions for the linear regression model: (1) The regression errors, u_i, have a mean of zero conditional on the regressors X_i; (2) the sample observations are i.i.d. random draws from the population; and (3) large outliers are unlikely. If these assumptions hold, the OLS estimators β̂₀ and β̂₁ are (1) unbiased, (2) consistent, and (3) normally distributed when the sample is large.

Key Terms

linear regression model with a single regressor (152)
dependent variable (152)
independent variable (152)
regressor (152)
population regression line (152)
population regression function (152)
population intercept (152)
population slope (152)
population coefficients (152)
parameters (152)
error term (152)
ordinary least squares (OLS) estimators (156)
OLS regression line (156)
sample regression line (156)
sample regression function (156)
predicted value (156)
residual (157)
regression $R^2$ (161)
explained sum of squares (ESS) (161)
total sum of squares (TSS) (161)
sum of squared residuals (SSR) (162)
standard error of the regression (SER) (162)
least squares assumptions (164)

Review the Concepts

4.1 Explain the difference between $\hat\beta_1$ and $\beta_1$; between the residual $\hat u_i$ and the regression error $u_i$; and between the OLS predicted value $\hat Y_i$ and $E(Y_i \mid X_i)$.

4.2 For each least squares assumption, provide an example in which the assumption is valid, then provide an example in which the assumption fails.


4.3 Sketch a hypothetical scatterplot of data for an estimated regression with $R^2 = 0.9$. Sketch a hypothetical scatterplot of data for a regression with $R^2 = 0.5$.

Exercises

4.1 Suppose that a researcher, using data on class size ($CS$) and average test scores from 100 third-grade classes, estimates the OLS regression

$$\widehat{TestScore} = 520.4 - 5.82 \times CS, \quad R^2 = 0.08, \quad SER = 11.5.$$

a. A classroom has 22 students. What is the regression's prediction for that classroom's average test score?

b. Last year a classroom had 19 students, and this year it has 23 students. What is the regression's prediction for the change in the classroom average test score?

c. The sample average class size across the 100 classrooms is 21.4. What is the sample average of the test scores across the 100 classrooms? (Hint: Review the formulas for the OLS estimators.)

d. What is the sample standard deviation of test scores across the 100 classrooms? (Hint: Review the formulas for the $R^2$ and SER.)


4.2 Suppose that a random sample of 200 twenty-year-old men is selected from a population and that these men's height and weight are recorded. A regression of weight on height yields

$$\widehat{Weight} = -99.41 + 3.94 \times Height, \quad R^2 = 0.81, \quad SER = 10.2,$$

where $Weight$ is measured in pounds and $Height$ is measured in inches.

a. What is the regression's weight prediction for someone who is 70 in. tall? 65 in. tall? 74 in. tall?

b. A man has a late growth spurt and grows 1.5 in. over the course of a year. What is the regression's prediction for the increase in this man's weight?

c. Suppose that instead of measuring weight and height in pounds and inches, these variables are measured in centimeters and kilograms. What are the regression estimates from this new centimeter-kilogram regression? (Give all results: estimated coefficients, $R^2$, and SER.)


4.3 A regression of average weekly earnings ($AWE$, measured in dollars) on age (measured in years) using a random sample of college-educated full-time workers aged 25-65 yields the following:

$$\widehat{AWE} = 696.7 + 9.6 \times Age, \quad R^2 = 0.023, \quad SER = 624.1.$$

a. Explain what the coefficient values 696.7 and 9.6 mean.

b. The standard error of the regression (SER) is 624.1. What are the units of measurement for the SER? (Dollars? Years? Or is SER unit-free?)

c. The regression $R^2$ is 0.023. What are the units of measurement for the $R^2$? (Dollars? Years? Or is $R^2$ unit-free?)

d. What is the regression's predicted earnings for a 25-year-old worker? A 45-year-old worker?

e. Will the regression give reliable predictions for a 99-year-old worker? Why or why not?

f. Given what you know about the distribution of earnings, do you think it is plausible that the distribution of errors in the regression is normal? (Hint: Do you think that the distribution is symmetric or skewed? What is the smallest value of earnings, and is it consistent with a normal distribution?)

g. The average age in this sample is 41.6 years. What is the average value of $AWE$ in the sample? (Hint: Review Key Concept 4.2.)
4.4 Read the box "The 'Beta' of a Stock" in Section 4.2.

a. Suppose that the value of $\beta$ is greater than 1 for a particular stock. Show that the variance of $(R - R_f)$ for this stock is greater than the variance of $(R_m - R_f)$.

b. Suppose that the value of $\beta$ is less than 1 for a particular stock. Is it possible that the variance of $(R - R_f)$ for this stock is greater than the variance of $(R_m - R_f)$? (Hint: Don't forget the regression error.)

c. In a given year, the rate of return on 3-month Treasury bills is 3.5% and the rate of return on a large diversified portfolio of stocks (the S&P 500) is 7.3%. For each company listed in the table in the box, use the estimated value of $\beta$ to estimate the stock's expected rate of return.
4.5 A professor decides to run an experiment to measure the effect of time pressure on final exam scores. He gives each of the 400 students in his course the same final exam, but some students have 90 minutes to complete the exam while others have 120 minutes. Each student is randomly assigned one of the examination times based on the flip of a coin. Let $Y_i$ denote the number of points scored on the exam by the $i$th student ($0 \le Y_i \le 100$), let $X_i$ denote the amount of time that the student has to complete the exam ($X_i = 90$ or 120), and consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$.

a. Explain what the term $u_i$ represents. Why will different students have different values of $u_i$?

b. Explain why $E(u_i \mid X_i) = 0$ for this regression model.

c. Are the other assumptions in Key Concept 4.3 satisfied? Explain.

d. The estimated regression is $\hat Y_i = 49 + 0.24 X_i$.

i. Compute the estimated regression's prediction for the average score of students given 90 minutes to complete the exam. Repeat for 120 minutes and 150 minutes.

ii. Compute the estimated gain in score for a student who is given an additional 10 minutes on the exam.

4.6 Show that the first least squares assumption, $E(u_i \mid X_i) = 0$, implies that $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$.

4.7 Show that $\hat\beta_0$ is an unbiased estimator of $\beta_0$. (Hint: Use the fact that $\hat\beta_1$ is unbiased, which is shown in Appendix 4.3.)

4.8 Suppose that all of the regression assumptions in Key Concept 4.3 are satisfied except that the first assumption is replaced with $E(u_i \mid X_i) = 2$. Which parts of Key Concept 4.4 continue to hold? Which change? Why? (Is $\hat\beta_1$ normally distributed in large samples with the mean and variance given in Key Concept 4.4? What about $\hat\beta_0$?)

4.9 a. A linear regression yields $\hat\beta_1 = 0$. Show that $R^2 = 0$.

b. A linear regression yields $R^2 = 0$. Does this imply that $\hat\beta_1 = 0$?

4.10 Suppose that $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $(X_i, u_i)$ are i.i.d. and $X_i$ is a Bernoulli random variable with $\Pr(X = 1) = 0.20$. When $X = 1$, $u_i$ is $N(0, 4)$; when $X = 0$, $u_i$ is $N(0, 1)$.

a. Show that the regression assumptions in Key Concept 4.3 are satisfied.

b. Derive an expression for the large-sample variance of $\hat\beta_1$. [Hint: Evaluate the terms in Equation (4.21).]
4.11 Consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$.

a. Suppose you know that $\beta_0 = 0$. Derive a formula for the least squares estimator of $\beta_1$.


b. Suppose you know that $\beta_0 = 4$. Derive a formula for the least squares estimator of $\beta_1$.

4.12 a. Show that the regression $R^2$ in the regression of $Y$ on $X$ is the squared value of the sample correlation between $X$ and $Y$. That is, show that $R^2 = r_{XY}^2$.

b. Show that the $R^2$ from the regression of $Y$ on $X$ is the same as the $R^2$ from the regression of $X$ on $Y$.

c. Show that $\hat\beta_1 = r_{XY}(s_Y/s_X)$, where $r_{XY}$ is the sample correlation between $X$ and $Y$, and $s_Y$ and $s_X$ are the sample standard deviations of $Y$ and $X$.

4.13 Suppose that $Y_i = \beta_0 + \beta_1 X_i + \kappa u_i$, where $\kappa$ is a non-zero constant and $(Y_i, X_i)$ satisfy the three least squares assumptions. Show that the large-sample variance of $\hat\beta_1$ is given by $\sigma^2_{\hat\beta_1} = \kappa^2\,\mathrm{var}[(X_i - \mu_X)u_i]/\{n[\mathrm{var}(X_i)]^2\}$. [Hint: This equation is the variance given in Equation (4.21) multiplied by $\kappa^2$.]

4.14 Show that the sample regression line passes through the point $(\bar X, \bar Y)$.
Empirical Exercises
E4.1 On the text Web site http://www.pearsonglobaleditions.com/stock, you will find a data file CPS08 that contains an extended version of the data set used in Table 3.1 for 2008. It contains data for full-time, full-year workers, age 25-34, with a high school diploma or B.A./B.S. as their highest degree. A detailed description is given in CPS08_Description, also available on the Web site. (These are the same data as in CPS92_08 but are limited to the year 2008.) In this exercise, you will investigate the relationship between a worker's age and earnings. (Generally, older workers have more job experience, leading to higher productivity and earnings.)

a. Run a regression of average hourly earnings (AHE) on age (Age). What is the estimated intercept? What is the estimated slope? Use the estimated regression to answer this question: How much do earnings increase as workers age by 1 year?

b. Bob is a 26-year-old worker. Predict Bob's earnings using the estimated regression. Alexis is a 30-year-old worker. Predict Alexis's earnings using the estimated regression.

c. Does age account for a large fraction of the variance in earnings across individuals? Explain.
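If you are working in Python rather than a dedicated econometrics package, a regression such as the one in E4.1(a) might be run as in the minimal sketch below. The sketch assumes the CPS08 data have been saved locally as CPS08.csv with columns named AHE and Age; those storage details are assumptions, not part of the exercise, so adjust them to match your copy of the data.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Load the CPS08 data (assumed here to be stored as a CSV with columns AHE and Age).
    cps = pd.read_csv("CPS08.csv")

    # Estimate AHE_i = beta0 + beta1 * Age_i + u_i by OLS.
    results = smf.ols("AHE ~ Age", data=cps).fit()
    print(results.params)    # estimated intercept and slope
    print(results.rsquared)  # fraction of the variance of AHE explained by Age

    # Predicted earnings for a 26-year-old (Bob) and a 30-year-old (Alexis).
    print(results.predict(pd.DataFrame({"Age": [26, 30]})))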


E4.2 On the text Web site http://www.pearsonglobaleditions.com/stock, you will find a data file TeachingRatings that contains data on course evaluations, course characteristics, and professor characteristics for 463 courses at the University of Texas at Austin.¹ A detailed description is given in TeachingRatings_Description, also available on the Web site. One of the characteristics is an index of the professor's "beauty" as rated by a panel of six judges. In this exercise, you will investigate how course evaluations are related to the professor's beauty.

a. Construct a scatterplot of average course evaluations (Course_Eval) on the professor's beauty (Beauty). Does there appear to be a relationship between the variables?

b. Run a regression of average course evaluations (Course_Eval) on the professor's beauty (Beauty). What is the estimated intercept? What is the estimated slope? Explain why the estimated intercept is equal to the sample mean of Course_Eval. (Hint: What is the sample mean of Beauty?)

c. Professor Watson has an average value of Beauty, while Professor Stock's value of Beauty is one standard deviation above the average. Predict Professor Stock's and Professor Watson's course evaluations.

d. Comment on the size of the regression's slope. Is the estimated effect of Beauty on Course_Eval large or small? Explain what you mean by "large" and "small."

e. Does Beauty explain a large fraction of the variance in evaluations across courses? Explain.

¹ These data were provided by Professor Daniel Hamermesh of the University of Texas at Austin and were used in his paper with Amy Parker, "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity," Economics of Education Review, August 2005, 24(4): 369-376.

E4.3 On the text Web site http://www.pearsonglobaleditions.com/stock, you will find a data file CollegeDistance that contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. In this exercise, you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college. (Proximity to college lowers the cost of education, so that students who live closer to a four-year college should, on average, complete


more years of higher education.) A detailed description is given in CollegeDistance_Description, also available on the Web site.²

a. Run a regression of years of completed education (ED) on distance to the nearest college (Dist), where Dist is measured in tens of miles. (For example, Dist = 2 means that the distance is 20 miles.) What is the estimated intercept? What is the estimated slope? Use the estimated regression to answer this question: How does the average value of years of completed schooling change when colleges are built close to where students go to high school?

b. Bob's high school was 20 miles from the nearest college. Predict Bob's years of completed education using the estimated regression. How would the prediction change if Bob lived 10 miles from the nearest college?

c. Does distance to college explain a large fraction of the variance in educational attainment across individuals? Explain.

d. What is the value of the standard error of the regression? What are the units for the standard error (meters, grams, years, dollars, cents, or something else)?

² These data were provided by Professor Cecilia Rouse of Princeton University and were used in her paper "Democratization or Diversion? The Effect of Community Colleges on Educational Attainment," Journal of Business and Economic Statistics, April 1995, 12(2): 217-224.

E4.4 On the text Web site http://www.pearsonglobaleditions.com/stock, you will find a data file Growth that contains data on average growth rates from 1960 through 1995 for 65 countries, along with variables that are potentially related to growth. A detailed description is given in Growth_Description, also available on the Web site. In this exercise, you will investigate the relationship between growth and trade.³

a. Construct a scatterplot of average annual growth rate (Growth) on the average trade share (TradeShare). Does there appear to be a relationship between the variables?

b. One country, Malta, has a trade share much larger than the other countries. Find Malta on the scatterplot. Does Malta look like an outlier?

c. Using all observations, run a regression of Growth on TradeShare. What is the estimated slope? What is the estimated intercept? Use the


regression to predict the growth rate for a country with a trade share of 0.5 and with a trade share equal to 1.0.

d. Estimate the same regression excluding the data from Malta. Answer the same questions as in (c).

e. Where is Malta? Why is the Malta trade share so large? Should Malta be included in or excluded from the analysis?

³ These data were provided by Professor Ross Levine of Brown University and were used in his paper with Thorsten Beck and Norman Loayza, "Finance and the Sources of Growth," Journal of Financial Economics, 2000, 58: 261-300.

APPENDIX

4.1 The California Test Score Data Set

The California Standardized Testing and Reporting data set contains data on test performance, school characteristics, and student demographic backgrounds. The data used here are from all 420 K-6 and K-8 districts in California with data available for 1999. Test scores are the average of the reading and math scores on the Stanford 9 Achievement Test, a standardized test administered to fifth-grade students. School characteristics (averaged across the district) include enrollment, number of teachers (measured as "full-time equivalents"), number of computers per classroom, and expenditures per student. The student-teacher ratio used here is the number of students in the district divided by the number of full-time equivalent teachers. Demographic variables for the students also are averaged across the district. The demographic variables include the percentage of students who are in the public assistance program CalWorks (formerly AFDC), the percentage of students who qualify for a reduced price lunch, and the percentage of students who are English learners (that is, students for whom English is a second language). All of these data were obtained from the California Department of Education (www.cde.ca.gov).

APPENDIX

4.2 Derivation of the OLS Estimators

This appendix uses calculus to derive the formulas for the OLS estimators given in Key Concept 4.2. To minimize the sum of squared prediction mistakes $\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2$ [Equation (4.6)], first take the partial derivatives with respect to $b_0$ and $b_1$:

$$\frac{\partial}{\partial b_0}\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2 = -2\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i) \text{ and} \quad (4.23)$$

$$\frac{\partial}{\partial b_1}\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2 = -2\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)X_i. \quad (4.24)$$

The OLS estimators, $\hat\beta_0$ and $\hat\beta_1$, are the values of $b_0$ and $b_1$ that minimize $\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2$ or, equivalently, the values of $b_0$ and $b_1$ for which the derivatives in Equations (4.23) and (4.24) equal zero. Accordingly, setting these derivatives equal to zero, collecting terms, and dividing by $n$ shows that the OLS estimators, $\hat\beta_0$ and $\hat\beta_1$, must satisfy the two equations

$$\bar Y - \hat\beta_0 - \hat\beta_1\bar X = 0 \text{ and} \quad (4.25)$$

$$\frac{1}{n}\sum_{i=1}^{n}X_i Y_i - \hat\beta_0\bar X - \hat\beta_1\frac{1}{n}\sum_{i=1}^{n}X_i^2 = 0. \quad (4.26)$$

Solving this pair of equations for $\hat\beta_0$ and $\hat\beta_1$ yields

$$\hat\beta_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}X_i Y_i - \bar X\,\bar Y}{\frac{1}{n}\sum_{i=1}^{n}X_i^2 - (\bar X)^2} = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2} \quad (4.27)$$

$$\hat\beta_0 = \bar Y - \hat\beta_1\bar X. \quad (4.28)$$

Equations (4.27) and (4.28) are the formulas for $\hat\beta_0$ and $\hat\beta_1$ given in Key Concept 4.2; the formula $\hat\beta_1 = s_{XY}/s_X^2$ is obtained by dividing the numerator and denominator in Equation (4.27) by $n - 1$.
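As a quick numerical check on this derivation, the closed-form solutions (4.27) and (4.28) can be compared with a library routine that minimizes the same sum of squared prediction mistakes. The minimal sketch below uses simulated data; all names and parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=50)
    Y = 3.0 + 1.5 * X + rng.normal(size=50)

    # Equations (4.27) and (4.28): closed-form OLS estimators.
    beta1_hat = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    beta0_hat = Y.mean() - beta1_hat * X.mean()

    # np.polyfit minimizes the identical sum of squared prediction mistakes,
    # so it must return the same slope and intercept.
    slope, intercept = np.polyfit(X, Y, deg=1)
    assert np.allclose([beta1_hat, beta0_hat], [slope, intercept])
    print(beta0_hat, beta1_hat)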

APPENDIX

4.3 Sampling Distribution of the OLS Estimator

In this appendix, we show that the OLS estimator $\hat\beta_1$ is unbiased and, in large samples, has the normal sampling distribution given in Key Concept 4.4.

Representation of $\hat\beta_1$ in Terms of the Regressors and Errors

We start by providing an expression for $\hat\beta_1$ in terms of the regressors and errors. Because $Y_i = \beta_0 + \beta_1 X_i + u_i$, $Y_i - \bar Y = \beta_1(X_i - \bar X) + (u_i - \bar u)$, so the numerator of the formula for $\hat\beta_1$ in Equation (4.27) is

$$\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y) = \sum_{i=1}^{n}(X_i - \bar X)[\beta_1(X_i - \bar X) + (u_i - \bar u)] = \beta_1\sum_{i=1}^{n}(X_i - \bar X)^2 + \sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u). \quad (4.29)$$

Now $\sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u) = \sum_{i=1}^{n}(X_i - \bar X)u_i - \sum_{i=1}^{n}(X_i - \bar X)\bar u = \sum_{i=1}^{n}(X_i - \bar X)u_i$, where the final equality follows from the definition of $\bar X$, which implies that $\sum_{i=1}^{n}(X_i - \bar X)\bar u = [\sum_{i=1}^{n}X_i - n\bar X]\bar u = 0$. Substituting $\sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u) = \sum_{i=1}^{n}(X_i - \bar X)u_i$ into the final expression in Equation (4.29) yields $\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y) = \beta_1\sum_{i=1}^{n}(X_i - \bar X)^2 + \sum_{i=1}^{n}(X_i - \bar X)u_i$. Substituting this expression in turn into the formula for $\hat\beta_1$ in Equation (4.27) yields

$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^{n}(X_i - \bar X)u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2}. \quad (4.30)$$
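Because Equation (4.30) is an exact algebraic identity rather than an approximation, it can be verified to machine precision on any simulated sample. A minimal sketch (illustrative names and parameter values):

    import numpy as np

    rng = np.random.default_rng(2)
    n, beta0, beta1 = 200, 1.0, 2.0
    X = rng.normal(size=n)
    u = rng.normal(size=n)
    Y = beta0 + beta1 * X + u

    # Left-hand side: beta1_hat computed from Equation (4.27).
    beta1_hat = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()

    # Right-hand side of Equation (4.30): beta1 plus the error-driven term.
    rhs = beta1 + ((X - X.mean()) * u).sum() / ((X - X.mean()) ** 2).sum()
    assert np.isclose(beta1_hat, rhs)  # the identity holds up to rounding error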

Proof That $\hat\beta_1$ Is Unbiased

The expectation of $\hat\beta_1$ is obtained by taking the expectation of both sides of Equation (4.30). Thus,

$$E(\hat\beta_1) = \beta_1 + E\left[\frac{\sum_{i=1}^{n}(X_i - \bar X)u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right] = \beta_1 + E\left[\frac{\sum_{i=1}^{n}(X_i - \bar X)E(u_i \mid X_1, \ldots, X_n)}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right] = \beta_1, \quad (4.31)$$

where the second equality in Equation (4.31) follows by using the law of iterated expectations (Section 2.3). By the second least squares assumption, $u_i$ is distributed independently of $X$ for all observations other than $i$, so $E(u_i \mid X_1, \ldots, X_n) = E(u_i \mid X_i)$. By the first least squares assumption, however, $E(u_i \mid X_i) = 0$. It follows that the conditional expectation in large brackets in the second line of Equation (4.31) is zero, so that $E(\hat\beta_1 - \beta_1 \mid X_1, \ldots, X_n) = 0$. Equivalently, $E(\hat\beta_1 \mid X_1, \ldots, X_n) = \beta_1$; that is, $\hat\beta_1$ is conditionally unbiased, given $X_1, \ldots, X_n$.

By the law of iterated expectations, $E(\hat\beta_1 - \beta_1) = E[E(\hat\beta_1 - \beta_1 \mid X_1, \ldots, X_n)] = 0$, so that $E(\hat\beta_1) = \beta_1$; that is, $\hat\beta_1$ is unbiased.

Large-Sample Normal Distribution of the OLS Estimator

The large-sample normal approximation to the limiting distribution of $\hat\beta_1$ (Key Concept 4.4) is obtained by considering the behavior of the final term in Equation (4.30).

First consider the numerator of this term. Because $\bar X$ is consistent, if the sample size is large, $\bar X$ is nearly equal to $\mu_X$. Thus, to a close approximation, the term in the numerator of Equation (4.30) is the sample average $\bar v$, where $v_i = (X_i - \mu_X)u_i$. By the first least squares assumption, $v_i$ has a mean of zero. By the second least squares assumption, $v_i$ is i.i.d. The variance of $v_i$ is $\sigma_v^2 = \mathrm{var}[(X_i - \mu_X)u_i]$, which, by the third least squares assumption, is nonzero and finite. Therefore, $\bar v$ satisfies all the requirements of the central limit theorem (Key Concept 2.7). Thus $\bar v/\sigma_{\bar v}$ is, in large samples, distributed $N(0, 1)$, where $\sigma_{\bar v}^2 = \sigma_v^2/n$. Thus the distribution of $\bar v$ is well approximated by the $N(0, \sigma_v^2/n)$ distribution.

Next consider the expression in the denominator in Equation (4.30); this is the sample variance of $X$ (except dividing by $n$ rather than $n - 1$, which is inconsequential if $n$ is large). As discussed in Section 3.2 [Equation (3.8)], the sample variance is a consistent estimator of the population variance, so in large samples it is arbitrarily close to the population variance of $X$.

Combining these two results, we have that, in large samples, $\hat\beta_1 - \beta_1 \cong \bar v/\mathrm{var}(X_i)$, so that the sampling distribution of $\hat\beta_1$ is, in large samples, $N(\beta_1, \sigma^2_{\hat\beta_1})$, where $\sigma^2_{\hat\beta_1} = \mathrm{var}(\bar v)/[\mathrm{var}(X_i)]^2 = \mathrm{var}[(X_i - \mu_X)u_i]/\{n[\mathrm{var}(X_i)]^2\}$, which is the expression in Equation (4.21).
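These large-sample results are straightforward to see in a Monte Carlo experiment. The sketch below (illustrative parameter values; it deliberately uses non-normal errors with finite fourth moments) draws many samples, estimates the slope in each, and checks that the estimates are centered at $\beta_1$, have variance close to Equation (4.21), and are approximately normally distributed.

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps, beta0, beta1 = 500, 4000, 1.0, 2.0
    slopes = np.empty(reps)

    for r in range(reps):
        X = rng.normal(0.0, 1.0, size=n)  # mu_X = 0, var(X) = 1
        u = rng.standard_t(df=8, size=n)  # non-normal errors with finite kurtosis
        Y = beta0 + beta1 * X + u
        slopes[r] = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()

    # Equation (4.21): sigma^2 = var[(X - mu_X)u] / (n * var(X)^2). With X and u
    # independent, var[(X - mu_X)u] = var(X) * var(u), and the variance of a
    # t-distribution with 8 degrees of freedom is 8/(8 - 2).
    sigma2 = 1.0 * (8 / 6) / (n * 1.0 ** 2)
    print(slopes.mean(), beta1)  # approximately equal: beta1_hat is unbiased
    print(slopes.var(), sigma2)  # approximately equal: matches Equation (4.21)

    z = (slopes - slopes.mean()) / slopes.std()
    print((np.abs(z) < 1.96).mean())  # close to 0.95, as the normal approximation implies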

Some Additional Algebraic Facts About OLS

The OLS residuals and predicted values satisfy

$$\frac{1}{n}\sum_{i=1}^{n}\hat u_i = 0, \quad (4.32)$$

$$\frac{1}{n}\sum_{i=1}^{n}\hat Y_i = \bar Y, \quad (4.33)$$

$$\sum_{i=1}^{n}\hat u_i X_i = 0 \text{ and } s_{\hat u X} = 0, \text{ and} \quad (4.34)$$

$$TSS = SSR + ESS. \quad (4.35)$$

Equations (4.32) through (4.35) say that the sample average of the OLS residuals is zero; the sample average of the OLS predicted values equals $\bar Y$; the sample covariance $s_{\hat u X}$ between the OLS residuals and the regressors is zero; and the total sum of squares is the sum of the squared residuals and the explained sum of squares [the ESS, TSS, and SSR are defined in Equations (4.14), (4.15), and (4.17)].

To verify Equation (4.32), note that the definition of $\hat\beta_0$ lets us write the OLS residuals as $\hat u_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i = (Y_i - \bar Y) - \hat\beta_1(X_i - \bar X)$; thus

$$\sum_{i=1}^{n}\hat u_i = \sum_{i=1}^{n}(Y_i - \bar Y) - \hat\beta_1\sum_{i=1}^{n}(X_i - \bar X).$$

But the definitions of $\bar Y$ and $\bar X$ imply that $\sum_{i=1}^{n}(Y_i - \bar Y) = 0$ and $\sum_{i=1}^{n}(X_i - \bar X) = 0$, so $\sum_{i=1}^{n}\hat u_i = 0$.

To verify Equation (4.33), note that $Y_i = \hat Y_i + \hat u_i$, so $\sum_{i=1}^{n}Y_i = \sum_{i=1}^{n}\hat Y_i + \sum_{i=1}^{n}\hat u_i = \sum_{i=1}^{n}\hat Y_i$, where the second equality is a consequence of Equation (4.32).

To verify Equation (4.34), note that $\sum_{i=1}^{n}\hat u_i = 0$ implies $\sum_{i=1}^{n}\hat u_i X_i = \sum_{i=1}^{n}\hat u_i(X_i - \bar X)$, so

$$\sum_{i=1}^{n}\hat u_i X_i = \sum_{i=1}^{n}[(Y_i - \bar Y) - \hat\beta_1(X_i - \bar X)](X_i - \bar X) = \sum_{i=1}^{n}(Y_i - \bar Y)(X_i - \bar X) - \hat\beta_1\sum_{i=1}^{n}(X_i - \bar X)^2 = 0, \quad (4.36)$$

where the final equality in Equation (4.36) is obtained using the formula for $\hat\beta_1$ in Equation (4.27). This result, combined with the preceding results, implies that $s_{\hat u X} = 0$.

Equation (4.35) follows from the previous results and some algebra:

$$TSS = \sum_{i=1}^{n}(Y_i - \bar Y)^2 = \sum_{i=1}^{n}(Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 = \sum_{i=1}^{n}(Y_i - \hat Y_i)^2 + \sum_{i=1}^{n}(\hat Y_i - \bar Y)^2 + 2\sum_{i=1}^{n}(Y_i - \hat Y_i)(\hat Y_i - \bar Y) = SSR + ESS + 2\sum_{i=1}^{n}\hat u_i\hat Y_i - 2\bar Y\sum_{i=1}^{n}\hat u_i = SSR + ESS, \quad (4.37)$$

where the final equality follows from $\sum_{i=1}^{n}\hat u_i\hat Y_i = \sum_{i=1}^{n}\hat u_i(\hat\beta_0 + \hat\beta_1 X_i) = \hat\beta_0\sum_{i=1}^{n}\hat u_i + \hat\beta_1\sum_{i=1}^{n}\hat u_i X_i = 0$ by the previous results.
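Each of the facts (4.32) through (4.35) is an exact finite-sample identity, so all of them can be confirmed on a single simulated data set. A minimal sketch (illustrative names and values):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.uniform(size=100)
    Y = 0.5 + 2.0 * X + rng.normal(size=100)

    # OLS estimates, fitted values, and residuals from Equations (4.27) and (4.28).
    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    b0 = Y.mean() - b1 * X.mean()
    Y_hat = b0 + b1 * X
    u_hat = Y - Y_hat

    assert np.isclose(u_hat.sum(), 0.0)        # Equation (4.32)
    assert np.isclose(Y_hat.mean(), Y.mean())  # Equation (4.33)
    assert np.isclose((u_hat * X).sum(), 0.0)  # Equation (4.34)

    tss = ((Y - Y.mean()) ** 2).sum()
    ess = ((Y_hat - Y.mean()) ** 2).sum()
    ssr = (u_hat ** 2).sum()
    assert np.isclose(tss, ssr + ess)          # Equation (4.35)
    print("all four OLS identities verified")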