Topic 2

Econometrics I (Universitat Pompeu Fabra)
ECONOMETRICS I
TOPIC 2: INTRODUCTION TO LINEAR REGRESSION
LINEAR REGRESSION WITH ONE REGRESSOR
POPULATION REGRESSION LINE
- SIMPLEST MODEL:
  $Y_i = \beta_0 + \beta_1 X_i + U_i$, where $i = 1, \dots, n$
  Here β0 and β1 are fixed quantities, β0 + β1 Xi is the non-random component and Ui is the disturbance term.
- VARIABLES:
  - X is the independent (or explanatory) variable, also called the regressor.
  - Y is the dependent variable.
  - β0 is the intercept parameter (the value of the line when X = 0). Sometimes it has a meaningful interpretation; in other cases it just acts as the level (height) of the regression line.
  - β1 is the causal effect of X on Y: the parameter that gives the change in Y for a unit change in X, holding other factors constant. In the linear model the change in Y is the same for all changes in X, no matter what the initial level of X.
  - Ui collects the unobserved factors that influence Y, other than the variable X. Ui is also called the “error term”. Strictly speaking, “residual” refers to its estimated counterpart, defined further on.
- GRAPHICALLY:
  X1, X2, X3, and X4 are four hypothetical values of the explanatory variable.
  If the relationship between Y and X were exact, the corresponding values of Y would be represented by the points Q1 – Q4 on the line. The disturbance term causes the actual values of Y to be different.
  In the diagram, the disturbance term has been assumed to be positive in the first and fourth observations and negative in the other two, with the result that, if one plots the actual values of Y against the values of X, one obtains the points P1 – P4.
  It must be emphasized that in practice the P points are all one can see. The actual values of β0 and β1, and hence the location of the Q points, are unknown, as are the values of the disturbance term in the observations.
In our example:
How can we find the value of these betas? Using the OLS estimators.
OLS ESTIMATORS = the population regression line can be estimated using sample observations by ordinary least squares (OLS). The
OLS estimators of the regression intercept and slope are denoted 𝛽̂0 and 𝛽̂1 .
Laura Aparicio 14
Suppose that you are given the four observations on X and Y represented in the previous figure and you are asked to obtain estimates of the values of β0 and β1. As a rough approximation, you could do this by plotting the four P points and drawing a line to fit them as best you can.
This has been done in the following figure:
- The intersection of the line with the Y-axis provides an estimate of the intercept β0, which will be denoted $\hat{\beta}_0$.
- The slope of the line provides an estimate of the slope coefficient β1, which will be denoted $\hat{\beta}_1$.
The fitted line will be written as: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
Drawing a regression line by eye is all very well, but it leaves a lot to subjective judgment. The question arises, is there a way of
calculating good estimates of β0 and β1 algebraically?
[1] Define what is known as a residual for each observation: the difference between the actual value of Y in any observation and the fitted value given by the regression line.
$\hat{u}_i = Y_i - \hat{Y}_i$
[2] We substitute the fitted line:
$\hat{u}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$
[3] Hence the residual in each observation depends on our choice of $\hat{\beta}_0$ and $\hat{\beta}_1$. Obviously, we wish to fit the regression line, that is, choose $\hat{\beta}_0$ and $\hat{\beta}_1$, in such a way as to make the residuals as SMALL AS POSSIBLE. We need to devise a criterion of fit that takes account of the size of all the residuals simultaneously. There are a number of possible criteria, some of which work better than others. One way of overcoming the problem is to minimize RSS, the sum of the squares of the residuals.
$RSS = \hat{u}_1^2 + \hat{u}_2^2 + \hat{u}_3^2 + \hat{u}_4^2$
The smaller one can make RSS, the better the fit, according to this criterion. If one could reduce RSS to 0, one would have a perfect fit, for this would imply that all the residuals are equal to 0. The line would go through all the points, but of course in general the disturbance term makes this impossible.
WHY USE OLS, RATHER THAN SOME OTHER ESTIMATOR? The OLS estimator has some desirable properties: under certain
assumptions, it is unbiased (that is, E(𝛽̂1 ) = β1), and it has a tighter sampling distribution than some other candidate estimators of β1.
Importantly, this is what everyone uses.
Let’s now consider the GENERAL CASE where there are n observations on two variables X and Y and, supposing Y to depend on X, we will fit the equation:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
The fitted value of the dependent variable in observation i, $\hat{Y}_i$, will be $\hat{\beta}_0 + \hat{\beta}_1 X_i$, and the residual $\hat{u}_i$ will be $Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$. We wish to choose $\hat{\beta}_0$ and $\hat{\beta}_1$ so as to minimize the residual sum of squares, RSS, given by:
$RSS = \hat{u}_1^2 + \dots + \hat{u}_n^2 = \sum_{i=1}^{n} \hat{u}_i^2$
We will find that RSS is minimized when:
$\hat{\beta}_1 = \frac{cov(X, Y)}{Var(X)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
MAIN CONCEPTS
OBJECTIVE: The task of regression analysis is to obtain estimates of β0 and β1, and hence an estimate of the location of the line, given the P points. Moreover, if you were concerned only with measuring the effect of X on Y, it would be much more convenient if the disturbance term did not exist. But in fact, part of each change in Y is due to a change in u, and this makes life more difficult (u is sometimes described as noise).
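The closed-form OLS formulas can be sketched in a few lines of Python. This is only an illustration: the four data points and the function names `ols_fit` and `rss` are hypothetical, not taken from the course material.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS formulas."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx                  # cov(X, Y) / Var(X)
    beta0_hat = y_bar - beta1_hat * x_bar    # Y_bar - beta1_hat * X_bar
    return beta0_hat, beta1_hat

def rss(x, y, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical sample of four observations (not the data in the figure).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 3.5, 5.5, 6.0]

b0_hat, b1_hat = ols_fit(x, y)           # roughly 1.75 and 1.1
rss_ols = rss(x, y, b0_hat, b1_hat)      # roughly 0.45
rss_by_eye = rss(x, y, 2.0, 1.0)         # a line "drawn by eye" has a larger RSS
```

Any other candidate line, such as the one "drawn by eye" above, gives an RSS at least as large as the OLS line, which is exactly the sense in which OLS is the best fit under this criterion.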
INTERPRETATION
There are two stages in the interpretation of a regression equation: (1) to turn the equation into words so that it can be understood by a non-econometrician and (2) to decide whether this literal interpretation should be taken at face value or whether the relationship should be investigated further.
EXAMPLE:
SLOPE: It indicates that, as S increases by one unit (of S), EARNINGS increases by 1.07 units (of EARNINGS). Since S is measured
in years, and EARNINGS is measured in dollars per hour, the coefficient of S implies that hourly earnings increase by $1.07 for
every extra year of schooling.
CONSTANT TERM: Strictly speaking, it indicates the predicted level of EARNINGS when S is 0. Sometimes the constant will have a clear meaning, but sometimes not. If the sample values of the explanatory variable are a long way from 0, extrapolating the regression line back to 0 may be dangerous. Even if the regression line gives a good fit for the sample of observations, there is no guarantee that it will continue to do so when extrapolated to the left or to the right.
In this case a literal interpretation of the constant would lead to the nonsensical conclusion that an individual with no
schooling would have hourly earnings of –$1.39. In this data set, no individual had less than six years of schooling and only
three failed to complete elementary school, so it is not surprising that extrapolation to 0 leads to trouble.
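As a small arithmetic sketch, the two coefficients quoted in this example (constant –1.39, slope 1.07) can be combined into a prediction function. The function name `predicted_earnings` is hypothetical, and the numbers are used only to illustrate the interpretation above.

```python
def predicted_earnings(years_of_schooling):
    """Predicted hourly earnings (in dollars) from the fitted line in the example."""
    return -1.39 + 1.07 * years_of_schooling

# One extra year of schooling raises the prediction by the slope, $1.07.
gain_from_extra_year = predicted_earnings(13) - predicted_earnings(12)

# Extrapolating to S = 0 reproduces the nonsensical negative wage.
earnings_at_zero = predicted_earnings(0)
```

Within the range of the data (six years of schooling or more) the predictions are positive; the negative value only appears when the line is pushed far outside the sample.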
MEASURES OF FIT = A natural question is how well the regression line “fits” or explains the data. There are two regression statistics that provide complementary measures of the quality of fit:
- Regression R²: measures the fraction of the variance of Y that is explained by X. It is unit-less and ranges between 0 (no fit) and 1 (perfect fit).
- Standard error of the regression (SER): an estimator of the standard deviation of the regression error.
REGRESSION (R2)
First, we need to recall some nice properties (grey box).
[1] We have seen that, after running a regression, we can split the value of Yi in each observation into two components, $\hat{Y}_i$ and $\hat{u}_i$:
$Y_i = \hat{Y}_i + \hat{u}_i$
[2] We can use this to decompose the variance of Y:
$Var(Y) = Var(\hat{Y} + \hat{u}) = Var(\hat{Y}) + Var(\hat{u}) + 2\,Cov(\hat{Y}, \hat{u})$
[3] Now it so happens that $Cov(\hat{Y}, \hat{u})$ must be equal to 0 (see the box). Hence we obtain:
$Var(Y) = Var(\hat{Y}) + Var(\hat{u})$
This means that we can decompose the variance of Y into two parts: $Var(\hat{Y})$, the part "explained" by the regression line, and $Var(\hat{u})$, the "unexplained" part.
[4] In view of this, $Var(\hat{Y})/Var(Y)$ is the proportion of the variance explained by the regression line. This proportion is known as the coefficient of determination or, more usually, R²:
$R^2 = \frac{Var(\hat{Y})}{Var(Y)}$
- The maximum value of R² is 1. This occurs when the regression line fits the observations exactly, so that $\hat{Y}_i = Y_i$ in all observations and all the residuals are 0. Then $Var(\hat{Y}) = Var(Y)$ and $Var(\hat{u})$ is 0, and one has a perfect fit.
- If there is no apparent relationship between the values of Y and X in the sample, R² will be close to 0.
Often it is convenient to decompose the variance as “sums of squares”:
STANDARD ERROR OF THE REGRESSION (SER)
The standard error of the regression is (almost) the sample standard deviation of the OLS residuals:
CHARACTERISTICS
- It has the units of $\hat{u}$, which are the units of Y.
- It measures the spread of the distribution of $\hat{u}$.
- It measures the average “size” of the OLS residual (the average “mistake” made by the OLS regression line).
The root mean squared error (RMSE) is closely related to the SER:
IMPORTANT! A low R2 and large SER do NOT imply that our regression is either “good” or “bad”. What they tell us is that other
important factors influence Y. Moreover, they do NOT tell us what these factors are, but they do indicate that X alone explains only
a small part of the variation in Y in these data.
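The measures of fit can be computed directly from the residuals. The sketch below assumes the usual textbook definitions, which are not spelled out in the text above: R² = ESS/TSS, SER = sqrt(SSR/(n − 2)) and RMSE = sqrt(SSR/n). The data and the function name `fit_statistics` are hypothetical.

```python
from math import sqrt
from statistics import mean

def fit_statistics(x, y):
    """Return R^2, SER and RMSE for a one-regressor OLS fit."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    resid = [yi - yh for yi, yh in zip(y, y_hat)]

    tss = sum((yi - y_bar) ** 2 for yi in y)      # total sum of squares
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)  # explained sum of squares
    ssr = sum(u ** 2 for u in resid)              # sum of squared residuals

    return {
        "r2": ess / tss,
        "ser": sqrt(ssr / (n - 2)),  # divides by n - 2 (two estimated coefficients)
        "rmse": sqrt(ssr / n),       # divides by n
        "tss": tss,
        "ssr": ssr,
    }

# Hypothetical sample.
stats = fit_statistics([1.0, 2.0, 3.0, 4.0], [3.0, 3.5, 5.5, 6.0])
```

The only difference between SER and RMSE in this sketch is the divisor (n − 2 versus n), so the two are nearly identical when n is large.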
APPLICATION TO TEST-SCORES AND CLASS-SIZE
Interpretation of the estimated slope and intercept
SLOPE: Districts with one more student per teacher have, on average, test scores that are 2.28 points lower.
INTERCEPT: Taken literally, the intercept means that, according to this estimated line, districts with zero students per teacher would have a (predicted) test score of 698.9. This interpretation makes no sense, because it extrapolates the line outside the range of the data. In this application, the intercept is not itself economically meaningful.
SPECIAL CASE OF A DUMMY VARIABLE
So far we have seen how to estimate the slope of the population regression function using the OLS estimator $\hat{\beta}_1$. But under what conditions can we interpret $\hat{\beta}_1$ in a causal way? And how should we interpret it if this condition fails? Moreover, the OLS regression line is an estimate, computed using our sample of data; a different sample would have given a different value of $\hat{\beta}_1$.
How can we:
- quantify the sampling uncertainty associated with $\hat{\beta}_1$?
- use $\hat{\beta}_1$ to test hypotheses such as β1 = 0?
- construct a confidence interval for β1?
KEY ASSUMPTIONS OF THE MODEL = OLS provides an appropriate estimator of the unknown regression coefficients, β0 and β1, under
these three assumptions:
ASSUMPTION #1: THE CONDITIONAL DISTRIBUTION OF Ui GIVEN Xi HAS A MEAN OF ZERO
It means that the “other factors” contained in Ui are unrelated to Xi in the sense that, given a value of Xi, the mean of the distribution of these other factors is zero. This assumption is illustrated in Figure 4.4:
- At a given value of class size, say 20 students per class, sometimes these other factors lead to better performance than predicted (Ui > 0) and sometimes to worse performance (Ui < 0), but on average over the population the prediction is right.
- In other words, given Xi = 20, and, more generally, at other values x of Xi as well, the mean of the distribution of Ui is zero. This is shown as the distribution of Ui being centred on the population regression line.
As shown in Figure 4.4, the assumption that $E(u_i | X_i) = 0$ is equivalent to assuming that the population regression line is the conditional mean of Yi given Xi.
Moreover, it could be understood as two conditions in one:
- $E(u_i | X_i = 1) = E(u_i | X_i = 2) = \dots$ → Changes in X (class size) should never have an impact on the average of Ui.
- $E(u_i | X_i = x) = 0$ → On average our regression model predicts the truth.
EXPERIMENTAL DATA
In a randomized controlled experiment, subjects are randomly assigned to the treatment group (X = 1) or to the control group (X = 0). The random assignment typically is done using a computer program that uses no information about the subject, ensuring that X is distributed independently of all personal characteristics of the subject. Random assignment makes X and U independent, which in turn implies that the conditional mean of U given X is zero.

OBSERVATIONAL DATA
In observational data, X is not randomly assigned in an experiment. Instead, the best that can be hoped for is that X is as if randomly assigned, in the precise sense that $E(u_i | X_i) = 0$. Whether this assumption holds in a given empirical application with observational data requires careful thought and judgement. It’s not very realistic! In other words, we can find cases with observational data where this assumption doesn’t hold.

CORRELATION AND CONDITIONAL MEAN (EXTREMELY IMPORTANT!!)
If the conditional mean of one random variable given another is zero, then the two random variables have zero covariance and are uncorrelated.
- Conditional mean = 0 → correlation = 0.
- Correlation = 0 → the conditional mean can take any value (zero or nonzero).
- Correlation different from 0 → the conditional mean is nonzero.
If X and U are correlated, then the conditional mean assumption is violated. And if they are uncorrelated, we can’t be sure.
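A small worked example shows why zero correlation does not guarantee a zero conditional mean. The construction below is hypothetical and chosen only for illustration: X takes the values –1, 0 and 1 with equal probability, and U = X² − E(X²), so U is fully determined by X.

```python
from fractions import Fraction

# Hypothetical discrete distribution: X is -1, 0 or 1 with probability 1/3 each.
xs = [-1, 0, 1]
p = Fraction(1, 3)

e_x2 = sum(p * x ** 2 for x in xs)              # E(X^2) = 2/3
u = {x: Fraction(x ** 2) - e_x2 for x in xs}    # U = X^2 - E(X^2)

e_x = sum(p * x for x in xs)                    # E(X) = 0
e_u = sum(p * u[x] for x in xs)                 # E(U) = 0
cov_xu = sum(p * (x - e_x) * (u[x] - e_u) for x in xs)

# U is a deterministic function of X, so E(U | X = x) is simply u[x].
cond_mean_u = dict(u)
```

Here `cov_xu` is exactly zero, yet E(U | X = 0) = –2/3 and E(U | X = ±1) = 1/3, so the conditional mean assumption fails even though X and U are uncorrelated.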
ASSUMPTION #2: (Xi, Yi), i = 1, …, n ARE INDEPENDENTLY AND IDENTICALLY DISTRIBUTED
This assumption is a statement about how the sample is drawn. This arises automatically if the entity (individual, district) is sampled
by simple random sampling: the entity is selected then, for that entity, X and Y are observed (recorded).
EXAMPLES: NON-I.I.D SAMPLING
[1] The main place we will encounter non-i.i.d. sampling is when data are recorded over time (“time series data”). This will
introduce some extra complications.
a. Example: data on inventory levels (Y) at a firm and the interest rate at which the firm can borrow (X), where these data are collected over time from a specific firm (four times per year during 30 years). A key feature of time series data is that observations falling close to each other in time are not independent but rather tend to be correlated with each other; if interest rates are low now, they are likely to be low next quarter. This pattern of correlation violates the “independence” part of the i.i.d. assumption.
[2] Another instance of non-i.i.d. sampling is when observations belonging to a group or cluster have unobservable variables in
common.
ASSUMPTION #3: LARGE OUTLIERS ARE UNLIKELY
Large outliers (that is, observations with values of Xi, Yi or both that are far
outside the usual range of the data) are unlikely. Large outliers can make OLS
regression results misleading. This potential sensitivity of OLS to extreme outliers
is illustrated in the following figure:
Mathematically: we assume X and Y have nonzero finite fourth moments:
$0 < E(X_i^4) < \infty$ and $0 < E(Y_i^4) < \infty$
This holds, for example, whenever our variables can only take values in a finite range (example: class size is capped by the physical capacity of a classroom; the best you can do on a standardized test is to get all the questions right and the worst you can do is to get all the questions wrong).
In conclusion, if the assumption of finite fourth moments holds, then it is unlikely that statistical inferences using OLS will be dominated
by a few observations.
The least squares assumptions play twin roles:
[1] FIRST ROLE: If these assumptions hold, then, as is shown in the next section, in large samples the OLS estimators have sampling distributions that are approximately normal, which allows us to develop methods for hypothesis testing and to construct confidence intervals.
[2] SECOND ROLE: They allow us to organize the circumstances that pose difficulties for OLS regression:
a. Assumption #1: It is the most important to consider in practice because in several cases it may not hold.
b. Assumption #2: Although it holds in many datasets, the independence assumption is inappropriate for time series data. Therefore, in these cases we will need to modify the methods used.
c. Assumption #3: If your dataset contains large outliers, you should examine those outliers carefully to make sure those observations are correctly recorded and belong in the data set (there can be data entry errors like height in meters or centimetres).
Remember that $\hat{\beta}_0$ and $\hat{\beta}_1$ are the OLS estimators of the unknown intercept β0 and slope β1 of the population regression line. Because the OLS estimators are calculated using a random sample, $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables that take on different values from one sample to the next; the probability of these different values is summarized in their sampling distributions. Under the three Least Squares Assumptions and when the sample is LARGE:
- UNBIASED: The exact (finite sample) sampling distribution of $\hat{\beta}_1$ has mean β1 (“$\hat{\beta}_1$ is an unbiased estimator of β1”), and $Var(\hat{\beta}_1)$ is inversely proportional to n.
- Other than its mean and variance, the exact distribution of $\hat{\beta}_1$ is complicated and depends on the distribution of (X, U).
  o The larger Var(Xi), the smaller the variance of $\hat{\beta}_1$: it’s easier to draw a precise line when X has a large variance.
  o The smaller Var(Ui), the smaller the variance of $\hat{\beta}_1$: if the errors are small, the data will be tighter around the line.
- CONSISTENT: $\hat{\beta}_1 \xrightarrow{p} \beta_1$; in other words, these estimators are consistent (when the sample is large, our estimators will be near the true population coefficients) (LLN).
- NORMALLY DISTRIBUTED: $\frac{\hat{\beta}_1 - E(\hat{\beta}_1)}{\sqrt{Var(\hat{\beta}_1)}}$ is approximately distributed as N(0,1) if the sample is sufficiently large, even if the original distribution wasn’t normal (CLT).
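These sampling properties can be illustrated with a small Monte Carlo sketch. Everything in it is an assumption made for the illustration: a hypothetical population with β0 = 1 and β1 = 2, X uniform on [0, 10], and non-normal (uniform) errors drawn independently of X.

```python
import random
from statistics import mean, pvariance

def ols_slope(x, y):
    """Closed-form OLS slope estimate."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def simulate_slopes(n, reps, beta0=1.0, beta1=2.0, seed=0):
    """Draw `reps` random samples of size n and return the slope estimates."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0, 10) for _ in range(n)]
        # Mean-zero, non-normal errors that do not depend on x.
        u = [rng.uniform(-3, 3) for _ in range(n)]
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        slopes.append(ols_slope(x, y))
    return slopes

small = simulate_slopes(n=25, reps=500)
large = simulate_slopes(n=100, reps=500)

mean_small, var_small = mean(small), pvariance(small)
mean_large, var_large = mean(large), pvariance(large)
```

In this setup both averages of the estimates sit close to the true slope of 2 (unbiasedness), and quadrupling n cuts the variance of the estimates to roughly a quarter (variance inversely proportional to n). A histogram of `large` would also look approximately bell-shaped, even though the errors are uniform.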
PROPERTY: Unbiasedness of $\hat{\beta}_1$, and $Var(\hat{\beta}_1)$ is inversely proportional to n
BEFORE: the estimator depends on X and Y.
NOW: the estimator depends on X and U.
Cov(X, U): if assumption #1 holds, then cov(X, U) = 0, so on average the estimator recovers the true β1.
Now we calculate the expectation of the expression we’ve obtained:
Finally, regarding the variance:
The variance measures how much doubt you have about what you’ve predicted.
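The argument behind this property can be sketched with the standard textbook derivation. It is written here under the three least squares assumptions; the large-sample variance expression in the last line is the usual textbook result and is quoted rather than derived.

```latex
\begin{align*}
\hat{\beta}_1
  &= \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}
  && \text{(depends on $X$ and $Y$)} \\
Y_i-\bar{Y}
  &= \beta_1 (X_i-\bar{X}) + (U_i-\bar{U})
  && \text{(from } Y_i = \beta_0 + \beta_1 X_i + U_i \text{)} \\
\hat{\beta}_1
  &= \beta_1 + \frac{\sum_{i=1}^{n}(X_i-\bar{X})\,U_i}{\sum_{i=1}^{n}(X_i-\bar{X})^2}
  && \text{(depends on $X$ and $U$, since } \textstyle\sum_i (X_i-\bar{X})\,\bar{U} = 0 \text{)} \\
E(\hat{\beta}_1 \mid X_1,\dots,X_n)
  &= \beta_1 + \frac{\sum_{i=1}^{n}(X_i-\bar{X})\,E(U_i \mid X_i)}{\sum_{i=1}^{n}(X_i-\bar{X})^2}
   = \beta_1
  && \text{(assumptions \#1 and \#2)} \\
Var(\hat{\beta}_1)
  &\approx \frac{1}{n}\,\frac{Var\big[(X_i-\mu_X)\,U_i\big]}{\big[Var(X_i)\big]^2}
  && \text{(large $n$)}
\end{align*}
```

Taking expectations over X in the fourth line gives $E(\hat{\beta}_1) = \beta_1$, and the factor 1/n in the last line is what makes the variance inversely proportional to the sample size.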
PROPERTY: Consistency of $\hat{\beta}_1$: $\hat{\beta}_1 \xrightarrow{p} \beta_1$
These values are fixed parameters, so the covariance between them and a variable is equal to 0. More importantly, the expected value of a fixed parameter is exactly the parameter.
$\bar{X} \xrightarrow{p} \mu_X$
$s_X^2 \xrightarrow{p} \sigma_X^2$
$s_{XY} \xrightarrow{p} \sigma_{XY}$
The CLT and the LLN allow us to link the parameters (fixed values) of a population with sample estimators.
PROPERTY: Approximation to a normal distribution of $\hat{\beta}_1$ with large n
Additional notes:
CONCLUSION
Until now we have focused on the use of ordinary least squares to estimate the intercept and slope of a population regression line
using a sample of n observations on a dependent variable, Y, and a single regressor, X.
There are many ways to draw a straight line through a scatterplot, but doing so using OLS has several virtues. If the least squares assumptions hold, then the OLS estimators of the slope and the intercept are unbiased, consistent and have a sampling distribution with a variance that is inversely proportional to the sample size n. Moreover, if n is large, then the sampling distribution of the OLS estimator is approximately normal.
The results we’ve obtained describe the sampling distribution of the OLS estimator. By themselves, however, these results are not sufficient to test a hypothesis about the value of β1 or to construct a confidence interval for β1. Doing so requires an estimator of the standard deviation of the sampling distribution (that is, the standard error of the OLS estimator), which is what we will do in the next sections.
SOME ADDITIONAL ALGEBRAIC FACTS ABOUT OLS
(4.32) $\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i = 0$: the SAMPLE AVERAGE of the OLS residuals is zero
[1] Estimated linear regression model:
$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i$
[2] Isolate $\hat{u}_i$:
$\hat{u}_i = Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i]$
[3] Substitute $\hat{\beta}_0$ using $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$:
$\hat{u}_i = Y_i - [\bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X_i]$
[4] Rearrange the expression:
$\hat{u}_i = Y_i - [\bar{Y} + (X_i - \bar{X})\hat{\beta}_1] = (Y_i - \bar{Y}) - (X_i - \bar{X})\hat{\beta}_1$
[5] Summation:
$\sum_{i=1}^{n}\hat{u}_i = \sum_{i=1}^{n}(Y_i - \bar{Y}) - \hat{\beta}_1\sum_{i=1}^{n}(X_i - \bar{X}) = \sum_{i=1}^{n}Y_i - \sum_{i=1}^{n}\bar{Y} - \hat{\beta}_1\left(\sum_{i=1}^{n}X_i - \sum_{i=1}^{n}\bar{X}\right)$
[6] We know that the summation of a mean is just n · mean (the mean is always the same, so it acts as a constant):
$\sum_{i=1}^{n}\hat{u}_i = \sum_{i=1}^{n}Y_i - n\bar{Y} - \hat{\beta}_1\left(\sum_{i=1}^{n}X_i - n\bar{X}\right)$
[7] Take out the common factor n:
$\sum_{i=1}^{n}\hat{u}_i = n\left(\frac{1}{n}\sum_{i=1}^{n}Y_i - \bar{Y}\right) - n\hat{\beta}_1\left(\frac{1}{n}\sum_{i=1}^{n}X_i - \bar{X}\right)$
[8] It is easy to see that $\frac{1}{n}\sum_{i=1}^{n}Y_i - \bar{Y} = 0$ because $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_i$, and the same happens with X. Finally, we divide both sides by n:
$\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i = 0 - \hat{\beta}_1 \cdot 0 = 0$
(4.33) $\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i = \bar{Y}$: the SAMPLE AVERAGE of the OLS predicted values equals $\bar{Y}$
[1] Estimated linear regression model:
$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i$
[2] We know that the estimated regression line is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, and we substitute it in the previous expression:
$Y_i = \hat{Y}_i + \hat{u}_i$
[3] Summation:
$\sum_{i=1}^{n}Y_i = \sum_{i=1}^{n}\hat{Y}_i + \sum_{i=1}^{n}\hat{u}_i$
[4] We already know from (4.32) that $\sum_{i=1}^{n}\hat{u}_i = 0$:
$\sum_{i=1}^{n}Y_i = \sum_{i=1}^{n}\hat{Y}_i$
[5] Finally, we use the formula of the mean, $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_i$:
$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_i = \frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i$
(4.34) $\sum_{i=1}^{n}\hat{u}_i X_i = 0$: the SAMPLE COVARIANCE between the OLS residuals and the regressor is zero
[1] We know that $\sum_{i=1}^{n}\hat{u}_i X_i = \sum_{i=1}^{n}\hat{u}_i (X_i - \bar{X})$.
a. WHY? $\sum_{i=1}^{n}\hat{u}_i (X_i - \bar{X}) = \sum_{i=1}^{n}\hat{u}_i X_i - \bar{X}\sum_{i=1}^{n}\hat{u}_i = \sum_{i=1}^{n}\hat{u}_i X_i - \bar{X}\cdot 0 = \sum_{i=1}^{n}\hat{u}_i X_i$
[2] Substitute $\hat{u}_i = (Y_i - \bar{Y}) - (X_i - \bar{X})\hat{\beta}_1$:
$\sum_{i=1}^{n}\hat{u}_i X_i = \sum_{i=1}^{n}\left[(Y_i - \bar{Y}) - (X_i - \bar{X})\hat{\beta}_1\right](X_i - \bar{X})$
[3] Develop:
$= \sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X}) - \hat{\beta}_1\sum_{i=1}^{n}(X_i - \bar{X})^2$
[4] Substitute $\hat{\beta}_1 = \frac{cov(X,Y)}{Var(X)} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$:
$= \sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X}) - \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sum_{i=1}^{n}(X_i - \bar{X})^2$
[5] Finally, develop:
$= \sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X}) - \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = 0$
(4.35) $TSS = ExplainedSS + ResidualSS$
[1] We know that:
$TSS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$
[2] Add and subtract $\hat{Y}_i$:
$TSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2$
[3] Substitute:
a. $A = Y_i - \hat{Y}_i$
b. $B = \hat{Y}_i - \bar{Y}$
$TSS = \sum_{i=1}^{n}(A + B)^2$
[4] Develop:
$TSS = \sum_{i=1}^{n}(A^2 + B^2 + 2AB) = \sum_{i=1}^{n}A^2 + \sum_{i=1}^{n}B^2 + 2\sum_{i=1}^{n}AB$
[5] Finally, we know that:
a. $\sum_{i=1}^{n}A^2 = ResidualSS$. It’s the sum of the squares of the differences between the true observations and the predicted observations.
b. $\sum_{i=1}^{n}B^2 = ExplainedSS$. It’s the sum of the squares of the differences between the predicted observations and the mean.
c. $\sum_{i=1}^{n}AB = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = \sum_{i=1}^{n}\hat{u}_i(\hat{Y}_i - \bar{Y}) = \sum_{i=1}^{n}\hat{u}_i\hat{Y}_i - \bar{Y}\sum_{i=1}^{n}\hat{u}_i = \sum_{i=1}^{n}\hat{u}_i\hat{Y}_i - 0 = \sum_{i=1}^{n}\hat{u}_i(\hat{\beta}_0 + \hat{\beta}_1 X_i) = \hat{\beta}_0\sum_{i=1}^{n}\hat{u}_i + \hat{\beta}_1\sum_{i=1}^{n}\hat{u}_i X_i = 0 + 0 = 0$
$TSS = ResidualSS + ExplainedSS + 0$
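The four algebraic facts (4.32) to (4.35) hold in every sample, so they can be checked numerically. The sketch below uses a hypothetical data set; any other sample with some variation in X should give the same result up to floating-point rounding.

```python
from statistics import mean

# Hypothetical sample.
x = [1.0, 2.0, 4.0, 7.0, 8.0]
y = [2.0, 3.5, 4.0, 8.5, 9.0]

x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]              # OLS predicted values
resid = [yi - yh for yi, yh in zip(y, y_hat)]   # OLS residuals

tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((yh - y_bar) ** 2 for yh in y_hat)
rss = sum(u ** 2 for u in resid)

# Each entry should be zero up to rounding error.
checks = {
    "(4.32) mean of residuals": mean(resid),
    "(4.33) mean of fitted values minus mean of Y": mean(y_hat) - y_bar,
    "(4.34) sum of residual times X": sum(u * xi for u, xi in zip(resid, x)),
    "(4.35) TSS - ESS - RSS": tss - ess - rss,
}
```

Note that these are properties of the OLS arithmetic itself, not of the population: they hold whether or not the least squares assumptions are satisfied.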
REGRESSION WITH A SINGLE REGRESSOR: HYPOTHESIS TESTS AND CONFIDENCE INTERVALS
Hypothesis testing for regression coefficients is analogous to hypothesis testing for the population mean: use the t-statistic to calculate the p-value and either reject or fail to reject the null hypothesis. Like a confidence interval for the population mean, a 95% confidence interval for a regression coefficient is computed as the estimator ±1.96 standard errors.
HYPOTHESIS TESTING
First of all, we need to state precisely the null and alternative hypotheses before starting the test:
Recall:
STEP 1: Compute the standard error of $\hat{\beta}_1$, $SE(\hat{\beta}_1)$. Although the formula is complicated, in applications the standard error is computed by regression software.
STEP 2: Compute the t-statistic:
STEP 3: Compute the p-value, in other words, the probability of observing a value of $\hat{\beta}_1$ at least as different from $\beta_{1,0}$ as the estimate actually computed, assuming that the null hypothesis is correct.
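The three steps can be sketched as follows, assuming the usual large-sample setup: a two-sided test of H0: β1 = β1,0, the t-statistic t = (β̂1 − β1,0) / SE(β̂1), and a standard normal approximation for the p-value. The slope –2.28 is the test-score estimate quoted in these notes; the standard error of 0.5 is a hypothetical value chosen only for illustration, and the function name `t_test` is also hypothetical.

```python
from math import erfc, sqrt

def t_test(beta1_hat, se, beta1_null=0.0, z_crit=1.96):
    """Two-sided large-sample test of H0: beta1 = beta1_null.

    Returns the t-statistic, the normal-approximation p-value and the
    95% confidence interval beta1_hat +/- 1.96 * SE.
    """
    t = (beta1_hat - beta1_null) / se
    # For a standard normal Z: P(|Z| > |t|) = 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    p_value = erfc(abs(t) / sqrt(2))
    ci = (beta1_hat - z_crit * se, beta1_hat + z_crit * se)
    return t, p_value, ci

# Slope from the notes; the standard error is a hypothetical number.
t, p_value, ci = t_test(beta1_hat=-2.28, se=0.5)

reject_by_t = abs(t) > 1.96
reject_by_p = p_value < 0.05
reject_by_ci = not (ci[0] <= 0.0 <= ci[1])
```

With these inputs the three decision rules agree: the t-statistic is far from zero, the p-value is below 0.05, and the confidence interval excludes 0.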
EXAMPLE: TEST SCORES AND STR, CALIFORNIA DATA
Using STATA we obtain the following table and we observe that there are three equivalent ways to decide whether or not to reject the null hypothesis:
- Looking at the t-statistic: if its absolute value is bigger than 1.96 we reject.
- Looking at the p-value: if it’s lower than 0.05 we reject.
- Looking at the confidence interval: if 0 (the value under the null hypothesis) is not included then we reject.
CONFIDENCE INTERVALS: In 95% of all samples that might be drawn, the confidence interval will contain the true value of the population parameter.
At the same time, it can be defined as the set of values that can’t be rejected using a two-sided hypothesis test with a 5% significance level.
RECALL!! When X is binary, the regression model can be used to estimate and test hypotheses about the difference between the
population means of the “X=0” and the “X=1” group.
Our only assumption about the distribution of Ui conditional on Xi is that it has a mean of zero (the first least squares assumption). If, furthermore, the variance of this conditional distribution does NOT depend on Xi, then the errors are said to be homoskedastic.
We’re going to discuss:
- Its theoretical implications: What are heteroskedasticity and homoskedasticity?
- The simplified formulas for the standard errors of the OLS estimators: Mathematical implications
- The risks you run if you use these simplified formulas in practice: What does this mean in practice?
RECALL SOME PROPERTIES (Y = wages, X = years of school):
- $Var(Y)$ → You calculate the variance of the wages of the whole population.
- $Var(Y | X = 12)$ → You calculate the variance only for the group that satisfies the condition on X [we’re fixing X, but Y keeps changing].
- $Var(X)$ → You calculate the variance of the years of school of the whole population.
- $Var(X | X = 12) = 0$ → We’re fixing X, so the variance of X is equal to 0.
Therefore, $Var(Y|X) = Var(\beta_0 + \beta_1 X + U | X) = Var(\beta_0 | X) + Var(\beta_1 X | X) + Var(U|X) = 0 + 0 + Var(U|X) = Var(U|X)$
The variance of fixed parameters is always 0.
[1] WHAT ARE HETEROSKEDASTICITY AND HOMOSKEDASTICITY?
HOMOSKEDASTICITY: If $Var(U | X = x)$ is constant, that is, the variance of the conditional distribution of U given X does NOT depend on X, then U is said to be homoskedastic. All the conditional distributions are equally wide.
HETEROSKEDASTICITY: If $Var(U | X = x)$ is NOT constant, that is, the variance of the conditional distribution of U given X depends on X, then U is said to be heteroskedastic. For example, the conditional distribution of Ui spreads out as x increases.
- EXAMPLE 1 [NOT SURE, MAYBE HOMOSKEDASTICITY]: California test scores
- EXAMPLE 2 [HETEROSKEDASTICITY]: Average hourly earnings vs years of education
  - First, on average, the longer you study, the higher the wage you’ll have.
  - Analysis of the variance of the conditional distribution of U given X:
    - If you go to school for less than 10 years, your wage will be around 0-20.
    - If you go to school for more than 10 years, your wage will be between 0-60.
  - Therefore, the conditional distribution of Ui spreads out as x increases.
[2] MATHEMATICAL IMPLICATIONS
Heteroskedasticity and homoskedasticity affect only the variance of the estimator, in other words, the Standard Error (SE) and all the values calculated using the SE (t-statistic, confidence intervals…).
HETEROSKEDASTICITY (the conditional variance changes with X)
- X = 1 → Var = 15
- X = 2 → Var = 20
HOMOSKEDASTICITY (the conditional variance is the same for every X)
- X = 1 → Var = 15
- X = 2 → Var = 15
Now, because U and X are independent, we can express $Var(v) = Var(X) \cdot Var(U)$, so:
$Var(\hat{\beta}_1) = \frac{Var(X) \cdot Var(U)}{n \cdot Var(X)^2} = \frac{Var(U)}{n \cdot Var(X)}$
[3] WHAT DOES THIS MEAN IN PRACTICE?
- If the errors are homoskedastic and you use the heteroskedastic formula for standard errors (the one we derived), you are OK.
- If the errors are heteroskedastic and you use the homoskedasticity-only formula for standard errors, the standard errors are WRONG.
The two formulas coincide (when n is large) in the special case of homoskedasticity.
The bottom line: you should ALWAYS use the heteroskedasticity-based formulas – these are conventionally called the
heteroskedasticity-robust standard errors.
MAIN IDEA! In general, the error ui is heteroskedastic (that is, the variance of ui at a given value of Xi, 𝑣𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥), depends on x).
A special case is when the error is homoskedastic (that is 𝑣𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥) is constant). Homoskedasticity-only standard errors do NOT
produce valid statistical inferences when the errors are heteroskedastic, but heteroskedasticity-robust standard errors do.
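The difference between the two kinds of standard error can be sketched numerically. The formulas below are the usual textbook versions for one regressor and are an assumption here, since these notes do not write them out: the homoskedasticity-only variance is s²_û / Σ(Xi − X̄)² with s²_û = SSR/(n − 2), and the heteroskedasticity-robust variance is (1/n) · [(1/(n − 2)) Σ(Xi − X̄)² û²_i] / [(1/n) Σ(Xi − X̄)²]². The simulated population and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def slope_standard_errors(x, y):
    """Return (beta1_hat, homoskedasticity-only SE, heteroskedasticity-robust SE)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    dx = [xi - x_bar for xi in x]
    s_xx = sum(d * d for d in dx)
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only formula: one common error variance for all X.
    s2_u = sum(u * u for u in resid) / (n - 2)
    se_homo = sqrt(s2_u / s_xx)

    # Robust formula: each squared residual is weighted by (X_i - X_bar)^2.
    numerator = sum((d * u) ** 2 for d, u in zip(dx, resid)) / (n - 2)
    denominator = (s_xx / n) ** 2
    se_robust = sqrt(numerator / denominator / n)
    return b1, se_homo, se_robust

def simulate_sample(n, heteroskedastic, seed):
    """Hypothetical population: Y = 1 + 2X + U with X uniform on [0, 10]."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    if heteroskedastic:
        u = [xi * rng.gauss(0, 1) for xi in x]   # spread of U grows with X
    else:
        u = [5 * rng.gauss(0, 1) for _ in x]     # same spread for every X
    y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]
    return x, y

x_het, y_het = simulate_sample(5000, heteroskedastic=True, seed=1)
b1_het, se_homo_het, se_robust_het = slope_standard_errors(x_het, y_het)
```

In the heteroskedastic sample the homoskedasticity-only standard error understates the uncertainty, while in a homoskedastic sample the two formulas give nearly the same number, which is why using the robust version is the safe default.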
EXTRA! If the three least squares assumptions hold AND if the regression errors are homoskedastic, then the OLS estimator is BLUE (Best Linear Unbiased Estimator). Moreover, if the three least squares assumptions hold, if the regression errors are homoskedastic AND if the regression errors are normally distributed, then the OLS t-statistic computed using homoskedasticity-only standard errors has a Student t distribution when the null hypothesis is true. The difference between the Student t distribution and the normal distribution is negligible if the sample size is moderate or large.
CONCLUSION
Returning to the California test score data set, there is a negative relationship between the student-teacher ratio and test scores, but is this relationship necessarily a causal one? Districts with lower STR have, on average, higher test scores. But does this mean that reducing the STR will, in fact, increase scores?
There is, in fact, reason to worry that it might not. Hiring more teachers, after all, costs money, so wealthier school districts can
better afford small classes. Moreover, students at wealthier schools also have other advantages over their poorer neighbours,
including better facilities, newer books, and better-paid teachers.
What’s more, California has a large immigrant community; these immigrants tend to be poorer than the overall population, and, in
many cases, their children are not native English speakers. It thus might be that our negative estimated relationship between test
scores and the STR is a consequence of large classes being found in conjunction with many other factors that are, in fact, the
real cause of the lower test scores.
These other factors or “omitted variables”, could mean that the OLS analysis done so far has little value. Indeed, it could be
misleading: changing the STR alone would not change these other factors that determine a child’s performance at school. To
address this problem, we need a method that will allow us to isolate the effect on test scores of changing the STR, holding these
other factors constant. The method is MULTIPLE REGRESSION ANALYSIS.