
Fundamentals of statistics and empirics

Linear Regression with one regressor

SW, Chapter 4

MEPS – Preparatory and Orientation Weeks 31


The simple linear regression model:
Outline

1. Basic Idea
2. Ordinary Least Squares Estimation
3. Measures of Fit
4. Least Squares Assumptions
5. Sampling Distribution
6. Hypothesis Testing
7. Confidence Interval
8. Binary Regressor
9. Heteroskedasticity and Homoskedasticity
10. Concerns regarding OLS Estimator



The simple linear regression model:
1. Basic Idea

• We want to find the regression line that fits our scatter plot best.


• The slope of the regression line is the expected effect on Y of a unit change in X.

• In our example, X is the class size and Y is the test score.

• With the regression model, we determine

– whether there is a (statistically significant) relation between X and Y,

– how strong this relation is,

– whether the effect of X on Y is causal.


• Estimation:
– How should we draw a line through the data to estimate the slope?
• Answer: ordinary least squares (OLS).

• Hypothesis testing:
– How do we test whether the slope is zero, i.e. whether X has no effect on Y?

• Confidence intervals:
– How to construct a confidence interval for the slope?


• The regression line: Test score = b0 + b1 STR

b1 = slope of the regression line

   = ∆Test score / ∆STR

   = change in test score for a unit change in STR

• We would like to know the value of b1.

• Since we don’t know b1, we must estimate it using data.


Yi = b0 + b1 Xi + ui,   i = 1, …, n

• We have n observations (Xi, Yi), i = 1, …, n.

• X is the independent variable or regressor (also called the explanatory variable).

• Y is the dependent variable or regressand (also called the explained variable).
• b0 = intercept
• b1 = slope
• ui = the regression error

• The regression error consists of omitted factors: in general, other factors that influence Y besides the variable X. The regression error also includes errors in the measurement of Y.



The simple linear regression model:
2. The Ordinary Least Squares Estimation

• How can we estimate b0 and b1 from data?

• We will focus on the least squares (“ordinary least squares”, or “OLS”) estimator of the unknown parameters b0 and b1. The OLS estimator solves

min over b0, b1:  ∑_{i=1}^n (Yi − b0 − b1 Xi)²

• The OLS estimator minimizes the average squared difference between the actual values of Yi and the predictions (“predicted values”) based on the estimated line.

• The result is the OLS estimators of b0 and b1.
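As a sketch of what OLS does, the closed-form estimators can be computed directly; the data below are made up for illustration (not the California data set), and numpy is used only for convenience:

```python
import numpy as np

# Made-up illustrative data (not the California test-score data).
x = np.array([15.0, 17.0, 19.0, 21.0, 23.0, 25.0])        # regressor, e.g. class size
y = np.array([680.0, 675.0, 665.0, 662.0, 655.0, 650.0])  # regressand, e.g. test score

# Closed-form solution of  min over b0, b1 of sum (y_i - b0 - b1*x_i)^2:
#   slope     = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
#   intercept = y_bar - slope * x_bar
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

print(f"b0_hat = {b0_hat:.3f}, b1_hat = {b1_hat:.3f}")
```

As a cross-check, `np.polyfit(x, y, 1)` returns the same slope and intercept.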


• Application to the California Test Score – Class Size data

• Estimated slope: b̂1 = −2.28

• Estimated intercept: b̂0 = 698.9

• Estimated regression line: Test score = 698.9 − 2.28 STR


Interpretation of the estimated slope and intercept

• Test Score = 698.9 – 2.28 STR

• An increase in the student-teacher ratio by 1 reduces test scores on average by 2.28 points.

That is, ∆Test score / ∆STR = −2.28
• The intercept (taken literally) means that, according to this estimated
line, districts with zero students per teacher would have a (predicted)
test score of 698.9. But this interpretation of the intercept makes no
sense – it extrapolates the line outside the range of the data – here, the
intercept is not economically meaningful.


Predicted values & residuals:

• One of the districts in the data set is Antelope, CA, for which STR = 19.33 and
Test Score = 657.8.
• predicted value: Ŷ_Antelope = 698.9 − 2.28 × 19.33 = 654.8
• residual: û_Antelope = 657.8 − 654.8 = 3.0
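The Antelope numbers can be reproduced from the estimated line (the coefficients 698.9 and −2.28 are taken from the slide; the script is just the arithmetic):

```python
# Coefficients of the estimated regression line (from the slide).
b0_hat, b1_hat = 698.9, -2.28

# Antelope, CA: observed student-teacher ratio and test score.
str_antelope, score_antelope = 19.33, 657.8

y_hat = b0_hat + b1_hat * str_antelope  # predicted value
u_hat = score_antelope - y_hat          # residual = actual - predicted

print(round(y_hat, 1))  # 654.8
print(round(u_hat, 1))  # 3.0
```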

• STATA output
regress testscr str, robust

Regression with robust standard errors Number of obs = 420


F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
--------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-----------------------------------------------------------------------------



The simple linear regression model:
3. Measures of fit

Two regression statistics provide complementary measures of how well the regression line “fits” or explains the data:

• regression R²: measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit)

• standard error of the regression (SER): measures the magnitude of a typical regression residual in the units of Y


Yi = Ŷi + ûi

• Variance decomposition:

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ)² + ∑_{i=1}^n ûi²

TSS = ESS + SSR


• TSS = total sum of squares

• ESS = explained sum of squares

• SSR = sum of squared residuals

• (Here, we use that the sample average of the Ŷi equals Ȳ and that the sample average of the ûi is zero.)
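The identity TSS = ESS + SSR can be checked numerically for any OLS fit; the data here are again made up for illustration:

```python
import numpy as np

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS fit in closed form.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum(u_hat ** 2)               # sum of squared residuals

print(np.isclose(tss, ess + ssr))  # the decomposition holds
```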


Regression R² = fraction of the variation of Y that is explained by X:

R² = ESS / TSS = ∑_{i=1}^n (Ŷi − Ȳ)² / ∑_{i=1}^n (Yi − Ȳ)² = 1 − SSR / TSS
• 0 ≤ R² ≤ 1

• R² = 0 means ESS = 0 (b̂1 = 0), i.e. no variation is explained

• R² = 1 means ESS = TSS and SSR = 0, i.e. all data points lie on the regression line

• For a regression with a single X, R² is the square of the correlation coefficient between X and Y:

R² = [corr(X, Y)]²
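For the single-regressor case, R² = [corr(X, Y)]² can likewise be verified numerically (made-up data again):

```python
import numpy as np

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.8, 5.2, 5.9])

# OLS fit and R^2 = ESS / TSS.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared sample correlation coefficient.
corr = np.corrcoef(x, y)[0, 1]

print(np.isclose(r2, corr ** 2))  # equal for single-regressor OLS
```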

Standard error of the regression (SER)

• The SER measures the spread of the distribution of u.

SER = √( 1/(n−2) ∑_{i=1}^n (ûi − û̄)² ) = √( 1/(n−2) ∑_{i=1}^n ûi² )

(the second equality holds because the sample average of the ûi is zero)
• The SER is (almost) the sample standard deviation of the OLS residuals.
• The SER has the units of u, which are the units of Y
• It measures the average “size” of the OLS residual (the average “mistake”
made by the OLS regression line)
• Why n − 2? A degrees-of-freedom correction for the number of estimated parameters (here two: b0 and b1). (In large samples it is irrelevant whether we divide by n, n − 1, or n − 2.)

MEPS – Preparatory and Orientation Weeks 1-48


The simple linear regression model:
3. Measures of fit

Root mean squared error (RMSE)

• The root mean squared error (RMSE) is closely related to the SER:

RMSE = √( (1/n) ∑_{i=1}^n ûi² )

• The RMSE measures the same thing as the SER; the minor difference is division by n instead of n − 2.
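The difference between the SER and the RMSE is only the divisor; a small sketch with made-up residuals:

```python
import numpy as np

# Made-up OLS residuals for illustration.
u_hat = np.array([1.5, -2.0, 0.5, -1.0, 1.0, 0.0])
n = len(u_hat)

ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))  # divides by n - 2
rmse = np.sqrt(np.sum(u_hat ** 2) / n)       # divides by n

print(ser > rmse)  # True: the SER is always slightly larger
```

As n grows the two converge, which is the sense in which the choice of divisor is irrelevant in large samples.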


TestScore = 698.9 − 2.28 STR,  R² = 0.05,  SER = 18.6

• STR explains only a small fraction of the variation in test scores (5.1%).
• The SER is relatively large: the data deviate substantially from the regression line.
• As a result, predictions based on this regression line are quite imprecise.
• There must be other factors affecting test scores. Which ones?

