
Analysis of Economic Data (20606)

Chapter 5 (cont.)
Regression and correlation
Independence - Dependence
When two characteristics are studied simultaneously in a sample, one
may influence the other in some way. Examples are height and weight,
or hours of study and the grade on an exam.
The main purpose of regression analysis is to discover how they are related.

Two variables can be considered:

• Independent variables → No relation exists (one of them cannot be used to explain changes in the other)
• Functional dependence → Y = f(X)
• Statistical dependence

[Figure: scale of the degree of association between the variables, running from statistical independence (-) through statistical dependence to functional dependence (+).]
Scatter plots: A graphical tool to see if a relation exists.
Given two variables X and Y measured on the same elements of the
population, the scatter plot is simply a two-dimensional graph with
one variable on the horizontal axis (the abscissa) and the other on the
vertical axis (the ordinate). If the variables are correlated, the points
show a trend. If there is no correlation, the points form a shapeless
cloud scattered over the graph.

[Figure: scatter plot showing a positive association. If X increases, Y increases.]
Scatter plots / Regression line
The relationship between two variables can be represented by the line of best
fit to the data. This line is called the regression line, and its slope can be
positive or negative.
Scatter plots / Regression line
The regression line is calculated by applying the method of least squares to the
two variables. This line minimizes the sum of the squared residuals, i.e., the
squared differences between the values calculated from the equation of the line
and the actual values of the series.

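$$ y = a + bx $$
where a is the intercept and b is the slope of the regression line.

[Figure: the points (x_1, y_1), ..., (x_n, y_n) scattered around the regression line. For each x_i, the error u_i is the vertical distance between the observed value y_i and the fitted value ŷ_i on the line.]

$$ y_i = a + b x_i + u_i, \qquad u_i = y_i - \hat{y}_i $$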
Error
We call “u” the disturbance or error: the difference between the observed
value of the endogenous variable (y) and the estimated value that we obtain
from the regression line, $\hat{y}_i$.

$$ \hat{y}_i = a + b x_i $$
The line is obtained by minimizing the sum of the squared disturbances.
Why are they squared?

$$ \sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

$$ \min_{a,\,b} \left\{ \sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \right\} $$
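One common answer to the question of why the disturbances are squared is that positive and negative errors cancel out when they are simply added. The sketch below uses made-up data (the values and the helper name `residuals` are illustrative assumptions, not taken from the course). It compares two different lines that both pass through the point of means: the plain sum of errors is zero for both (up to rounding), while the sum of squared errors tells them apart.

```python
# Illustrative data (an assumption for this example, not from the course).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def residuals(a, b):
    """Errors u_i = y_i - (a + b * x_i) for the line y = a + b x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Two different lines through the point of means (x_bar, y_bar) = (3, 4).
flat = residuals(4.0, 0.0)    # y = 4
tilted = residuals(2.2, 0.6)  # y = 2.2 + 0.6 x

for name, u in [("flat", flat), ("tilted", tilted)]:
    plain_sum = sum(u)
    squared_sum = sum(e ** 2 for e in u)
    print(f"{name}: sum of errors = {plain_sum:.2f}, "
          f"sum of squared errors = {squared_sum:.2f}")
```

The plain sum cannot distinguish the two lines, whereas the squared sum is smaller for the tilted line, which follows the data more closely.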
In the simple linear regression model, the function chosen to approximate the
relationship between the variables is a line, i.e. y = a + bx, where a and b are
parameters. This line is called the regression line of Y on X.

We estimate this line using the method of least squares. For a given value x_i of X,
we have two values of Y: the observed one, y_i, and the theoretical one, y_i* = a + b x_i.
We minimize the errors:

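$$ \sum_{i=1}^{n} (y_i - y_i^*)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \quad \rightarrow \quad \text{MINIMIZE} $$

Here y_i* is the value that we approximate for “y” with the linear regression, and y_i - y_i* is the error due to approximating with a linear function.

Derivative with respect to a:

$$ -2 \sum_i (y_i - a - b x_i) = 0 \;\Rightarrow\; \sum_i y_i = na + b \sum_i x_i \;\Rightarrow\; a = \bar{y} - b\bar{x} $$

Derivative with respect to b:

$$ -2 \sum_i (y_i - a - b x_i)\, x_i = 0 \;\Rightarrow\; \sum_i x_i y_i = a \sum_i x_i + b \sum_i x_i^2 $$

Substituting $a = \bar{y} - b\bar{x}$:

$$ \sum_i x_i y_i = (\bar{y} - b\bar{x}) \sum_i x_i + b \sum_i x_i^2 = n\bar{x}\bar{y} - b\,n\bar{x}^2 + b \sum_i x_i^2 $$

$$ \frac{\sum_i x_i y_i}{n} - \bar{x}\bar{y} = b \left( \frac{\sum_i x_i^2}{n} - \bar{x}^2 \right) \;\Rightarrow\; \sigma_{xy} = b\,\sigma_x^2 \;\Rightarrow\; b = \frac{\sigma_{xy}}{\sigma_x^2} $$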
Regression line
• If the model is: $y = a + bx + u$
The explanatory variable is “x”, and the dependent variable is “y”; that is, the economic
model indicates that a change in “x” is expected to change “y”, and not the reverse.
• We calculate the slope: $b = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{s_{xy}}{s_x^2}$
… and the constant: $a = \bar{y} - b\bar{x}$
• We use the slope to answer the question: If x increases
by one unit (of x), by how many units (of y) does y
change?
• We use the constant (or intercept) to answer the
question: If x were zero, what would the predicted
value of y be?
[In some cases the constant is not relevant to interpret, for example, if x cannot take the
value 0 or if there are no values of x close to zero.]
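As a minimal sketch of the slope and constant formulas, the function below computes b as the covariance divided by the variance of x, and a from the two means. The function name `fit_line` and the hours/grades data are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = a + b x, using b = s_xy / s_x^2 and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    b = cov_xy / var_x
    a = y_bar - b * x_bar
    return a, b

hours = [1, 2, 3, 4, 5]   # hypothetical hours of study
grades = [2, 4, 5, 4, 5]  # hypothetical exam grades
a, b = fit_line(hours, grades)
print(f"constant a = {a:.2f}, slope b = {b:.2f}")
# prints: constant a = 2.20, slope b = 0.60
```

With these made-up numbers, one more hour of study is associated with 0.6 more grade points, and the predicted grade for zero hours is 2.2. Since no observation is near zero hours, the constant should be read with the caution noted above.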
Regression line
• If the model is: $x = a' + b'y + u$
Here the explanatory variable is “y”, and the dependent variable is “x”. We will
never use this notation, except to answer the following question:

• Is $b = b'$ and $a = a'$?

• In general, no, because the averages
and the variances of “x” and “y” can be
different! Here the slope would be:
$$ b' = \frac{\sigma_{xy}}{\sigma_y^2} = \frac{s_{xy}}{s_y^2} $$
Coefficient of determination
Residual variance: the variance of the residuals (errors) of the regression model.

$$ RV = \sigma_u^2 = \frac{\sum_{i=1}^{n} u_i^2 n_i}{N} $$

If RV is “large”, the residuals are on average large and the dependence is
weak, but what is actually “large”?

Marginal variance: the total variance of Y or X.
If we divide the residual variance by the total variance of Y, we eliminate the
problem of the units of measurement.

$$ \frac{RV}{TV} = \frac{\sigma_u^2}{\sigma_y^2} $$

This measure helps to determine if the regression explains much or little of the
variance of Y, but in the opposite sense: closer to 0 means that more is explained.
We want a measure that is higher when a larger share of the variance is
explained, so we prefer to use the coefficient of determination, $R^2$.
Coefficient of determination
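$TV = \sigma_y^2$ decomposes into:

$$ RV = \sigma_u^2 $$
$$ EV = \sigma_R^2 = \sigma_y^2 - \sigma_u^2 $$

$$ R^2 = 1 - \frac{\sigma_u^2}{\sigma_y^2} = \frac{\sigma_y^2 - \sigma_u^2}{\sigma_y^2} = \frac{\sigma_R^2}{\sigma_y^2}, \qquad R^2 = 1 - \frac{RV}{TV} = \frac{EV}{TV} $$

$$ \sigma_y^2 = \sigma_R^2 + \sigma_u^2, \qquad TV = EV + RV $$

The coefficient of determination $R^2$ indicates the proportion of the
variance of Y that is explained by X.

It is useful for judging whether the regression line represents a good fit.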
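The decomposition TV = EV + RV can be checked numerically. The following is a sketch on made-up data, using population formulas throughout; the helper name `variance` and the data values are assumptions for illustration only.

```python
def variance(values):
    """Population variance: mean of the squares minus the squared mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(v * v for v in values) / n - mean ** 2

xs = [1, 2, 3, 4, 5]   # illustrative data
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares line y = a + b x.
b = (sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar) / variance(xs)
a = y_bar - b * x_bar

fitted = [a + b * x for x in xs]
errors = [y - f for y, f in zip(ys, fitted)]

tv = variance(ys)       # total variance of y
rv = variance(errors)   # residual variance
ev = variance(fitted)   # explained variance
r2 = 1 - rv / tv

print(f"TV = {tv:.2f}, EV = {ev:.2f}, RV = {rv:.2f}, R^2 = {r2:.2f}")
```

For these numbers the regression explains 60% of the variance of y, and EV plus RV adds up to TV.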


Coefficient of determination
When there is only one explanatory (independent)
variable and we have the population:

$$ R^2 = b\,b' = \frac{\sigma_{xy}}{\sigma_x^2} \cdot \frac{\sigma_{xy}}{\sigma_y^2} = \left( \frac{\sigma_{xy}}{\sigma_x \sigma_y} \right)^2 = r_{xy}^2 $$

That is, we can calculate the linear correlation coefficient squared to
obtain the coefficient of determination.
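A short numerical check of this identity, again on made-up data (the values are an assumption for the example): the two slopes b and b' differ, but their product equals the squared correlation coefficient.

```python
import math

xs = [1, 2, 3, 4, 5]   # illustrative data
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
var_x = sum(x * x for x in xs) / n - x_bar ** 2
var_y = sum(y * y for y in ys) / n - y_bar ** 2

b = cov_xy / var_x          # slope of the regression of y on x
b_prime = cov_xy / var_y    # slope of the regression of x on y
r = cov_xy / math.sqrt(var_x * var_y)

print(f"b = {b:.3f}, b' = {b_prime:.3f}")
print(f"b * b' = {b * b_prime:.3f}, r^2 = {r ** 2:.3f}")
```

Here b is 0.6 and b' is 1.0, so the two regressions are not the same line, yet b times b' and r squared agree.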
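In the case of two-dimensional distributions:

$$ -1 \le r \le 1, \qquad 0 \le r^2 \le 1, \qquad 0 \le R^2 \le 1 $$

[Figure: scatter plots for r = -1, -1 < r < 0, r = 0, 0 < r < 1 and r = 1. The slope of the regression line is negative when r < 0, zero when r = 0 and positive when r > 0.]

$$ \hat{y}_i = a + b x_i $$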

One purpose of estimating a regression line can be to predict the outcome of one
variable (y) for a given value of the other variable (x). The prediction of Y for X
= x0 is simply the value obtained when we replace x by x0 in the
estimated regression line. The reliability of this prediction is greater when
the coefficient of determination is higher. That is, the prediction is more reliable
the higher the proportion of the variance of y that is explained by x.

Given a value of the variable "X" that has not been observed, calculate the
corresponding value of "Y":

$$ \hat{y}_0 = a + b x_0 $$
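A minimal sketch of such a prediction, assuming a constant and a slope that have already been estimated (the values 2.2 and 0.6 and the function name `predict` are hypothetical):

```python
def predict(a, b, x0):
    """Predicted value y0_hat = a + b * x0 from an estimated regression line."""
    return a + b * x0

a, b = 2.2, 0.6   # constant and slope, assumed already estimated
x0 = 6            # a value of x that was not observed
y0_hat = predict(a, b, x0)
print(f"predicted y for x0 = {x0}: {y0_hat:.1f}")
```

How much weight to give this number depends on the coefficient of determination of the fitted line, as noted above.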
Using the variance and covariance
for samples in chapters 4-5.
• In chapter 3 we saw the difference between the variance of a population and
that of a sample.
• In chapter 5 we learned how to calculate the measures using formulas for
the population.
• In addition we introduced formulas to calculate the variance of the
population using the moments.

• In an additional document (in Campus Extens) you can find how to calculate
the variance of a sample using the moments.
• We also show that calculating the correlation coefficient and the slope of a
regression line gives the same result whether we use the formulas for a population
or for a sample. (Note: we cannot, for example, calculate the covariance for a
population and then use the variance of x for a sample when we calculate the
slope of a regression line.)
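The note above can be illustrated with a small sketch on made-up data (the function name `slope` and the values are assumptions for the example). The divisor cancels when the covariance and the variance use the same one, so the slope is identical, while mixing the two divisors changes the result.

```python
def slope(xs, ys, divisor):
    """Slope cov(x, y) / var(x), using the same divisor in both terms."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / divisor
    var_x = sum((x - x_bar) ** 2 for x in xs) / divisor
    return cov_xy / var_x

xs = [1, 2, 3, 4, 5]   # illustrative data
ys = [2, 4, 5, 4, 5]
n = len(xs)

b_population = slope(xs, ys, n)       # population formulas (divide by n)
b_sample = slope(xs, ys, n - 1)       # sample formulas (divide by n - 1)

# Mixing a population covariance with a sample variance gives a different number.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
cov_pop = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
var_sample = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
b_mixed = cov_pop / var_sample

print(f"population: {b_population:.2f}, sample: {b_sample:.2f}, mixed: {b_mixed:.2f}")
```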
[The sample and population formulas give the same result of the slope and the constant, but…]

What is the difference between analyzing
a population and a sample?
• If we have the entire population, the regression gives us
the parameters that interest us: α, the constant in the
population, and β, the slope in the population.
• If we have a random sample, the regression only gives
us an estimate of the parameters that interest us. Here
we need inferential statistics: we must calculate
confidence intervals or test hypotheses. For example,
can we reject the hypothesis that the slope in the
population is zero?
• The coefficient of determination calculated for a sample
differs from the one obtained with the formulas for the population.
Adjusted coefficient of
determination
• If we have a sample, it is not correct to use the
coefficient of determination $R^2$, because we
use a residual variance based on an estimated
regression line, i.e., we do not know the
constant and the slope of the population.
• If we start from a sample we should use the
adjusted coefficient of determination, $R_{adj}^2$. We
will show two ways to calculate it.
Adjusted coefficient of
determination
• We use the formulas for the sample variance in the formula of $R^2$:

$$ R_{adj}^2 = 1 - \frac{s_u^2}{s_Y^2} = 1 - \frac{\dfrac{n}{n-K} \left( \dfrac{\sum_{i=1}^{n} u_i^2 n_i}{n} \right)}{\dfrac{n}{n-1} \left( \dfrac{\sum_{i=1}^{n} y_i^2 n_i}{n} - \bar{y}^2 \right)} $$

• In the formula for the variance of a sample we have lost one degree of freedom because we do not know the mean of the population.
• In the formula of the residual variance we lose K degrees of freedom, where K is the number of parameters in our regression.
• If we have a constant in the model and one explanatory variable, we have K = 2.
• Why didn’t we include $\bar{u}^2$ in the formula for the residual variance?
Adjusted coefficient of
determination
• Another way to obtain the same result is to adjust $R^2$:

$$ R_{adj}^2 = 1 - \left( \frac{n-1}{n-K} \right) (1 - R^2) $$

• $R^2$ cannot decrease if we add variables to our model, but this can happen for $R_{adj}^2$.
• $R_{adj}^2$ penalizes adding variables. If an additional variable
contributes little (or nothing) to explaining the variation in the
dependent variable, $R_{adj}^2$ can become smaller. Note also
that it may be negative, for example, if the correlation
coefficient is zero.
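The two ways of calculating the adjusted coefficient of determination can be compared in a short sketch. The data, the fitted values of a and b, and the function names are assumptions made for this example; K = 2 because the model has a constant and one explanatory variable.

```python
def adjusted_r2_from_variances(errors, ys, k):
    """1 - s_u^2 / s_y^2, with n - k and n - 1 degrees of freedom."""
    n = len(ys)
    y_bar = sum(ys) / n
    s_u2 = sum(u ** 2 for u in errors) / (n - k)
    s_y2 = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    return 1 - s_u2 / s_y2

def adjusted_r2_from_r2(r2, n, k):
    """1 - (n - 1) / (n - k) * (1 - R^2)."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

xs = [1, 2, 3, 4, 5]   # illustrative data
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6        # least-squares constant and slope for these data
errors = [y - (a + b * x) for x, y in zip(xs, ys)]

n, k = len(ys), 2
y_bar = sum(ys) / n
r2 = 1 - sum(u ** 2 for u in errors) / sum((y - y_bar) ** 2 for y in ys)

way_1 = adjusted_r2_from_variances(errors, ys, k)
way_2 = adjusted_r2_from_r2(r2, n, k)
no_fit = adjusted_r2_from_r2(0.0, n, k)   # the case R^2 = 0

print(f"R^2 = {r2:.3f}, adjusted (way 1) = {way_1:.3f}, adjusted (way 2) = {way_2:.3f}")
print(f"adjusted R^2 when R^2 = 0: {no_fit:.3f}")
```

Both routes give the same value, which is lower than the unadjusted coefficient, and the last line shows the negative value that appears when the correlation coefficient is zero.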
