Chapter 5 (cont.)
Regression and correlation
Independence - Dependence
When two characteristics are studied simultaneously in a sample, we
may suppose that one influences the other in some way. Examples are
height and weight, or hours of study and the grade on an exam.
The main purpose of regression analysis is to discover how they relate.
[Figure: scale of the degree of association between the variables, from statistical independence (-) through statistical dependence to functional dependence (+).]
Scatter plots: A graphical tool to see if a relation exists.
Given two variables X and Y measured on the same elements of the
population, the scatter plot is simply a two-dimensional graph with
one variable on one axis (the abscissa) and the other variable on the
other axis (the ordinate). If the variables are correlated, the points
show a trend. If there is no correlation, the points form a shapeless
cloud scattered over the graph.
[Figure: scatter plot showing a positive association: if X increases, Y increases.]
Scatter plots / Regression line
The relationship between two variables can be represented by the line of best
fit to the data. This line is called the regression line, and its slope can be
positive or negative.
The regression line is calculated by applying the method of least squares to
the two variables. This line minimizes the sum of the squared residuals, i.e.,
the sum of the squared differences between the values calculated by the
equation of the line and the actual values of the series.
y = a + bx
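[Figure: observed points $(x_1, y_1), \dots, (x_n, y_n)$ scattered around the regression line, with the intercept, the slope, the fitted values $\hat{y}_i$ and the errors $u_i$ marked.]
$$y_i = a + b x_i + u_i, \qquad u_i = y_i - \hat{y}_i$$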
Error
We call “u” the disturbance or error: the difference between the observed
value of the endogenous variable (y) and the estimated value $\hat{y}_i$ that
we obtain through the regression line.
$$\hat{y}_i = a + b x_i$$
The line is obtained by minimizing the sum of the squared disturbances. Why
are they squared? Squaring prevents positive and negative disturbances from
cancelling each other out in the sum.
$$\min_{a,\,b} \sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
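As a minimal sketch of this objective, the following Python snippet evaluates the sum of squared disturbances for two candidate lines. The data and the function name `sum_squared_errors` are invented for illustration and do not come from the course material.

```python
def sum_squared_errors(a, b, xs, ys):
    """Return the sum of squared disturbances u_i = y_i - (a + b * x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# An arbitrary candidate line, and the least squares line for this data.
sse_candidate = sum_squared_errors(2.0, 0.5, xs, ys)
sse_least_squares = sum_squared_errors(2.2, 0.6, xs, ys)

# Without squaring, the disturbances of the least squares line cancel out.
plain_sum = sum(y - (2.2 + 0.6 * x) for x, y in zip(xs, ys))

print(sse_candidate, sse_least_squares, plain_sum)
```

The least squares line gives the smaller sum of squares, while the plain (unsquared) sum of its disturbances is zero up to rounding.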
In the simple linear regression model, the function chosen to approximate the
relationship between the variables is a line, i.e. y = a + bx, where a and b are
parameters. This line is called the regression line of Y on X.
We estimate this line by the method of least squares. For each value of X
we have two values of Y: the observed value $y_i$ and the theoretical value
$y_i^* = a + b x_i$. We should minimize the errors:
$$\sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \;\to\; \text{MINIMIZE}$$
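[Figure annotations: $y^*$ is the value that we approximate for “y” with the linear regression; the errors are due to approximating with a linear function.]
$$\frac{\partial}{\partial a} \sum_i (y_i - a - b x_i)^2 = -2 \sum_i (y_i - a - b x_i) = 0$$
$$\sum_i y_i = n a + b \sum_i x_i \;\Rightarrow\; a = \bar{y} - b \bar{x}$$
$$\frac{\partial}{\partial b} \sum_i (y_i - a - b x_i)^2 = -2 \sum_i (y_i - a - b x_i)\, x_i = 0$$
$$\sum_i x_i y_i = a \sum_i x_i + b \sum_i x_i^2 = (\bar{y} - b \bar{x}) \sum_i x_i + b \sum_i x_i^2 = n \bar{x} \bar{y} - b n \bar{x}^2 + b \sum_i x_i^2$$
$$\frac{\sum_i x_i y_i}{n} - \bar{x} \bar{y} = b \left( \frac{\sum_i x_i^2}{n} - \bar{x}^2 \right) \;\Rightarrow\; b = \frac{\sigma_{xy}}{\sigma_x^2}$$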
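The closed-form result for the slope and the constant can be sketched in Python as follows. This is only an illustration: it assumes population moments (division by n), and the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for y = a + b * x, using moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance written as "mean of products minus product of means".
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    b = cov_xy / var_x          # slope
    a = mean_y - b * mean_x     # constant (intercept)
    return a, b


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)

# Both first-order conditions should hold (up to rounding) at the solution.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
condition_a = sum(residuals)
condition_b = sum(u * x for u, x in zip(residuals, xs))
print(a, b, condition_a, condition_b)
```

For this data the sketch gives a slope of about 0.6 and a constant of about 2.2, and both first-order conditions are zero up to floating-point rounding.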
Regression line
• If the model is: $y = a + bx + u$
The explanatory variable is “x”, and the dependent variable is “y”; that is, the economic
model indicates that a change in “x” is expected to change “y”, and not the reverse.
• We calculate the slope: $b = \dfrac{\sigma_{xy}}{\sigma_x^2} = \dfrac{s_{xy}}{s_x^2}$
… and the constant: $a = \bar{y} - b \bar{x}$
• We use the slope to answer the question: If x increased
by one unit (of x), by how many units (of y) would y
change?
• We use the constant (or intercept) to answer the
question: If x were zero, what would the predicted
value of y be?
[In some cases interpreting the constant is not meaningful, for example, if x cannot take the
value 0 or if there are no values of x close to zero.]
Regression line
• If the model is: $x = a' + b'y + u$
Here the explanatory variable is “y”, and the dependent variable is “x”. We will
never use this notation, except to answer the following question:
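$$RV = \sigma_u^2, \qquad EV = \sigma_R^2 = \sigma_Y^2 - \sigma_u^2$$
$$\sigma_Y^2 = \sigma_R^2 + \sigma_u^2 \quad\Longleftrightarrow\quad TV = EV + RV$$
$$R^2 = 1 - \frac{\sigma_u^2}{\sigma_Y^2} = \frac{\sigma_Y^2 - \sigma_u^2}{\sigma_Y^2} = \frac{\sigma_R^2}{\sigma_Y^2}, \qquad R^2 = 1 - \frac{RV}{TV} = \frac{EV}{TV}$$
Here TV is the total variance of Y, EV the variance explained by the regression and RV the residual variance.
The coefficient of determination $R^2$ indicates the proportion of the
variance of Y that is explained by X.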
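A minimal Python sketch of this decomposition, using population variances (division by n). The function name `variance_decomposition` and the data are hypothetical and serve only as an illustration.

```python
def variance_decomposition(xs, ys):
    """Return (TV, EV, RV, R^2) for the regression line of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]
    tv = sum((y - mean_y) ** 2 for y in ys) / n               # total variance
    ev = sum((f - mean_y) ** 2 for f in fitted) / n           # explained variance
    rv = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / n    # residual variance
    return tv, ev, rv, ev / tv


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
tv, ev, rv, r_squared = variance_decomposition(xs, ys)
print(tv, ev, rv, r_squared)
```

For this data, roughly 60% of the variance of Y is explained by X, and the explained and residual variances add up to the total variance.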
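$$R^2 = b\,b' = \frac{\sigma_{xy}}{\sigma_x^2} \cdot \frac{\sigma_{xy}}{\sigma_y^2} = \left( \frac{\sigma_{xy}}{\sigma_x \sigma_y} \right)^2 = r_{xy}^2$$
$$-1 \le r \le 1, \qquad 0 \le r^2 \le 1, \qquad 0 \le R^2 \le 1$$
[Figure: scatter plots for $r = -1$, $-1 < r < 0$, $r = 0$, $0 < r < 1$ and $r = 1$.]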
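The identity between the product of the two slopes and the squared correlation coefficient can be checked numerically. The sketch below assumes population moments; the function name `slopes_and_correlation` and the data are hypothetical.

```python
import math


def slopes_and_correlation(xs, ys):
    """Return (b, b_prime, r): slope of Y on X, slope of X on Y, correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    b = cov_xy / var_x            # regression of Y on X
    b_prime = cov_xy / var_y      # regression of X on Y
    r = cov_xy / math.sqrt(var_x * var_y)
    return b, b_prime, r


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, b_prime, r = slopes_and_correlation(xs, ys)
print(b * b_prime, r ** 2)
```

Both printed values agree up to rounding, and r lies between -1 and 1.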
Given a value $x_0$ of the variable "X" that has not been observed, calculate the
corresponding value of "Y":
$$\hat{y}_0 = a + b x_0$$
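As a small illustration, a prediction is just an evaluation of the fitted line. The coefficients and the value of x below are hypothetical, as is the function name `predict`.

```python
def predict(a, b, x0):
    """Predicted value of Y at x0 on the line y = a + b * x."""
    return a + b * x0


# Hypothetical coefficients, standing in for a previously estimated line.
a, b = 2.2, 0.6
x0 = 6  # a value of X that was not observed in the sample
y0_hat = predict(a, b, x0)
print(y0_hat)
```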
Using the variance and covariance
for samples in chapters 4-5.
• In chapter 3 we saw the difference between the variance of a population and
that of a sample.
• In chapter 5 we learned how to calculate the measures using formulas for
the population.
• In addition, we introduced formulas to calculate the variance of the
population using the moments.
• In an additional document (in Campus Extens) you can find how to calculate
the variance of a sample using the moments.
• We also show that calculating the correlation coefficient and the slope of a
regression line gives the same result whether we use the formulas for a
population or for a sample. (Note: we must be consistent; we cannot, for
example, calculate the covariance for a population and then use the sample
variance of x when calculating the slope of a regression line.)
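The point about consistency can be illustrated with a short Python sketch. The helper names `covariance` and `slope`, the `ddof` parameter (0 for population formulas, 1 for sample formulas) and the data are assumptions made for this example.

```python
def covariance(xs, ys, ddof):
    """Covariance with divisor n - ddof (ddof=0: population, ddof=1: sample)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - ddof)


def slope(xs, ys, ddof):
    """Slope of Y on X; the variance of x is the covariance of x with itself."""
    return covariance(xs, ys, ddof) / covariance(xs, xs, ddof)


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_population = slope(xs, ys, ddof=0)
slope_sample = slope(xs, ys, ddof=1)
# Mixing a population covariance with a sample variance gives a different value.
slope_mixed = covariance(xs, ys, ddof=0) / covariance(xs, xs, ddof=1)
print(slope_population, slope_sample, slope_mixed)
```

The divisor cancels when numerator and denominator use the same convention, so the first two slopes coincide; the mixed version does not.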
[The sample and population formulas give the same result for the slope and the constant, but…]
[Formula for the adjusted coefficient of determination $R^2_{adj}$, in terms of $R^2$, $n$ and $k$.]
• $R^2$ cannot decrease if we add variables to our model, but
this can happen for $R^2_{adj}$.
• $R^2_{adj}$ penalizes adding variables. If an additional variable
contributes little (or nothing) to explaining the variation in the
dependent variable, $R^2_{adj}$ can become smaller. Note also
that it may be negative, for example, if the correlation
coefficient is zero.
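The behaviour described in these bullets can be sketched in Python. The formula used here is an assumption: a common definition is $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k}$, with n observations and k estimated parameters including the constant. Other texts count only the explanatory variables and use n - k - 1 instead, so check which convention the course uses. The function name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 under the ASSUMED formula 1 - (1 - R^2) * (n - 1) / (n - k).

    n: number of observations.
    k: number of estimated parameters, including the constant (assumption).
    """
    if n <= k:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r_squared) * (n - 1) / (n - k)


# Same R^2, but one more variable in the model: the adjusted value drops.
adj_two_params = adjusted_r_squared(0.6, n=5, k=2)
adj_three_params = adjusted_r_squared(0.6, n=5, k=3)
# With R^2 = 0 the adjusted value is negative.
adj_zero = adjusted_r_squared(0.0, n=5, k=2)
print(adj_two_params, adj_three_params, adj_zero)
```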