
Linear regression and correlation

Amine Hadji

Leiden University

March 1, 2022
Outline

• Regression line for the sample

• Correlation

• Hypothesis testing of linear relationship

• Multiple regression (standard deviation and hypothesis testing)

• R²
Scatter plot
Regression line in sample
Relationship between variables

• Positive association: the two variables tend to increase/decrease together

• Negative association: the two variables tend to move in opposite directions

• Linear relationship: the pattern of the relationship between the variables resembles a straight line

• Outlier: a point in the scatterplot that has an unusual combination of data values
“Alternative” facts
Nonlinearity & Outliers
Regression line in sample
Prediction: Regression line can be used to predict the unknown value of y for any
individual given the individual’s x value.

The formula for the (sample) regression line:

ŷ = b0 + b1 x,
• ŷ : predicted (estimated) y

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for x = 0)

• b1 : slope of the straight line in the sample (i.e. how much ŷ changes for a one-unit increase of x). Its sign determines whether the line is increasing or decreasing.
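
A minimal code sketch of how the line is used for prediction, assuming the handspan-data estimates quoted later in these slides (b0 = −3, b1 = 0.35); in general the coefficients come from least squares:

```python
# Sketch: predicting y with a fitted sample regression line.
# b0 and b1 are the handspan-data values quoted later in the slides.

def predict(x, b0=-3.0, b1=0.35):
    """Predicted value yhat = b0 + b1 * x."""
    return b0 + b1 * x

print(predict(60))  # 18.0 cm, the predicted handspan for height 60
```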
Regression line in sample
Residual error
Usually, the predicted value ŷ ≠ y, the observed value:
Residual / Prediction error: y − ŷ.
Least squares estimation
The residuals for the handspan data:
• For x1 = 71 we have y1 = 23.5 and ŷ1 = b0 + 71b1. The residual is e1 = 23.5 − b0 − 71b1.

• For x2 = 69 we have y2 = 22 and ŷ2 = b0 + 69b1. The residual is e2 = 22 − b0 − 69b1, and so on.

The intercept b0 and slope b1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en² = Σ_{i=1}^n (yi − b0 − b1 xi)².

They are called the least squares estimators.
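
A minimal sketch of the SSE criterion, using only the two handspan observations quoted above; a full data set would include all n pairs:

```python
# Sketch: sum of squared residuals for candidate values of b0 and b1.

def sse(b0, b1, xs, ys):
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [71, 69]     # heights quoted on the slide
ys = [23.5, 22]   # handspans quoted on the slide
print(sse(-3.0, 0.35, xs, ys))  # SSE at b0 = -3, b1 = 0.35
```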


Least squares estimators

Least squares estimators:


b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,    b0 = ȳ − b1 x̄.

Example: Handspan data


ŷ = −3 + 0.35x
For instance, for height 60 the predicted average handspan is −3 + 0.35 × 60 = 18 cm.
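
A minimal sketch of the estimator formulas; the (x, y) pairs below are illustrative values, not the actual handspan data:

```python
# Sketch: least squares estimators b1 and b0 from the formulas above.
from statistics import mean

def least_squares(xs, ys):
    xbar, ybar = mean(xs), mean(ys)
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

xs = [60, 65, 68, 71, 74]            # hypothetical heights
ys = [18.0, 20.0, 21.5, 23.5, 24.0]  # hypothetical handspans (cm)
print(least_squares(xs, ys))
```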
Goodness of fit
Correlation
Correlation: measure of the strength and direction of a linear relationship between two
quantitative variables
• strength - how close the points are to a straight line.

• direction - whether one variable increases or decreases as the other variable increases

Formula:
r = (1 / (n − 1)) Σ_{i=1}^n ((xi − x̄) / sx) ((yi − ȳ) / sy)

• xi , yi : the x (or y ) measurement for the ith observation.

• x̄, ȳ : the mean of the x (or y ) measurements.

• sx , sy : the standard deviation of the x (or y ) measurements.
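
A minimal sketch of the correlation formula, with hypothetical data:

```python
# Sketch: sample correlation r from the formula above.
from statistics import mean, stdev

def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

print(correlation([60, 65, 68, 71, 74], [18.0, 20.0, 21.5, 23.5, 24.0]))
```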


Correlation - Properties

• The correlation coefficient r is always between −1 and 1.

• Strength is indicated by the magnitude of the correlation.

• Direction is indicated by the sign of the correlation.

• r = 0 means that the best fitting line is a horizontal line.

• r is invariant to scaling of x or y (e.g. from inches to cm).


Correlations
Strong vs. Weak Correlation
Squared correlation

Interpretation:
• r close to −1 or 1 ⇒ r² close to 1

• r²: proportion of variation of y explained by x.


Formula: r² = 1 − SSE/SSTO, where

• SSTO (Sum of Squares Total): Σ_{i=1}^n (yi − ȳ)².

• SSE (Sum of Squared Errors): Σ_{i=1}^n (yi − ŷi)².
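
A minimal sketch of r² = 1 − SSE/SSTO; the fitted values ŷ are assumed to come from a regression line and the numbers are hypothetical:

```python
# Sketch: squared correlation as explained variation, r² = 1 - SSE/SSTO.

def r_squared(ys, yhats):
    ybar = sum(ys) / len(ys)
    ssto = sum((y - ybar) ** 2 for y in ys)               # total variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))  # unexplained variation
    return 1 - sse / ssto

ys = [18.0, 20.0, 21.5, 23.5, 24.0]
yhats = [18.6, 20.0, 21.4, 22.8, 24.2]  # hypothetical fitted values
print(r_squared(ys, yhats))
```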
Squared correlation
Problems in Regression

• Influential observations - observations with extreme values can have a big impact on correlation

• Inappropriately Combining Groups - two distinct groups may show misleading results

• Curvilinearity - linear regression for nonlinear data leads to bad predictions

• Extrapolation - no guarantee the linear relationship continues beyond the range of the observed data
Influential obs.
Non-linearity
Inappropriate combination
Extrapolation
Interpretation of observed correlation

“Correlation does not prove causation.”


• Rule for Concluding Cause and Effect: cause-and-effect relationships can be
inferred from randomized experiments, not from observational studies.
• Confounding variables

• Other explanatory variables


Causation
Estimation of Standard deviation

Standard deviation in Regression:


a measure of the typical difference between the observed y and the predicted ŷ. It can be estimated as

s = √( SSE / (n − 2) ) = √( Σ_{i=1}^n (yi − ŷi)² / (n − 2) ).

If we did not know anything about the xi , the standard deviation would be:
s = √( Σ_{i=1}^n (yi − ȳ)² / (n − 1) ).
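
A minimal sketch of both estimates, with hypothetical observed and fitted values:

```python
# Sketch: regression standard deviation s = sqrt(SSE / (n - 2)), compared with
# the ordinary sample standard deviation that ignores the x values.
from math import sqrt

def regression_sd(ys, yhats):
    n = len(ys)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))
    return sqrt(sse / (n - 2))

def plain_sd(ys):
    n = len(ys)
    ybar = sum(ys) / n
    return sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

ys = [18.0, 20.0, 21.5, 23.5, 24.0]
yhats = [18.6, 20.0, 21.4, 22.8, 24.2]  # hypothetical fitted values
print(regression_sd(ys, yhats), plain_sd(ys))
```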
Statistical Significance - Linear Relationship
The regression line of the population is: y = β0 + β1 x.
Question: Is the slope zero, i.e. is there any relationship between the variables?

Hypothesis testing:

H0 : β1 = 0,    H1 : β1 ≠ 0.

Test statistic:

t = (sample statistic − null value) / standard error = (b1 − 0) / se(b1).

Remark: the p-value can be calculated using the t-distribution table (with n − 2 degrees of freedom).
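
A minimal sketch of the test, assuming the estimate b1, its standard error se(b1), and the sample size n are already available; scipy is used only to look up the t-distribution:

```python
# Sketch: t-test for H0: beta1 = 0 using the t-distribution with n - 2 df.
from scipy import stats

def slope_test(b1, se_b1, n):
    t = (b1 - 0) / se_b1
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
    return t, p

print(slope_test(b1=0.35, se_b1=0.05, n=100))  # hypothetical values
```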
Multivariate regression

The formula for the multivariate regression line is

ŷ = b0 + b1 x1 + b2 x2 + ... + bp−1 xp−1

• x1 , ..., xp−1 : explanatory variables

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for all xj = 0)

• b1 , ..., bp−1 : slopes corresponding to x1 , ..., xp−1 respectively
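
A minimal sketch of fitting such a line by least squares with NumPy; the data are hypothetical:

```python
# Sketch: multivariate least squares with a column of ones for the intercept.
import numpy as np

X = np.array([[12, 5], [16, 2], [10, 8], [14, 3]], dtype=float)  # x1, x2 per row
y = np.array([2.3, 2.9, 2.0, 2.6])

A = np.column_stack([np.ones(len(y)), X])        # design matrix [1, x1, x2]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # b0, b1, b2
print(coeffs)
```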


Examples

Example 1: Wage dependence on education and experience.

log(wage) = b0 + b1 (education) + b2 (work experience).


Example 2: Connection between behavior variable and GPA.

GPA = b0 + b1 (study hours) + b2 (classes missed) + b3 (work hours).


Omitted variables

Problem:
the effect of an omitted relevant variable can be picked up by other explanatory variables.
Examples:
• work experience & education ⇒ wage

• area density & magnitude of earthquake ⇒ death tolls

• height of father & height of mother ⇒ height of child


Multivariate regression - Assumptions

• No outliers

• The errors ei are normally distributed

ei = yi − (b0 + b1 x1,i + b2 x2,i + ... + bp−1 xp−1,i )

• The spread of the errors ei does not depend on the explanatory variables (i.e. they are homoskedastic)

• The sample should be representative of the population
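
A minimal sketch of checking two of these assumptions on the residuals; the Shapiro-Wilk test and a residual plot are one common choice, not the only one, and the residuals below are simulated stand-ins:

```python
# Sketch: diagnostics for normality and homoskedasticity of the errors e_i.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fitted = np.linspace(1, 10, 50)
resid = rng.normal(0, 1, size=50)   # stand-in residuals for illustration

print(stats.shapiro(resid))         # test of normality of the errors
plt.scatter(fitted, resid)          # look for constant spread around zero
plt.axhline(0)
plt.show()
```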


Calvin and Hobbes
Sample Regression

The coefficients b0 , ..., bp−1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en²


They are called the least squares estimators (LSE).
Hypothesis Testing

Question: Does the explanatory variable xk significantly influence the response?


Hypothesis testing:
H0 : βk = 0,    Ha : βk ≠ 0.

Test statistic:

t = (bk − 0) / se(bk),

Remark: the p-value can be calculated using the t-distribution table (with n − p degrees of freedom).
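
A minimal sketch of the coefficient tests; statsmodels is one possible tool (not prescribed in the slides) and the data are hypothetical:

```python
# Sketch: t statistics and p-values for each coefficient in a multiple regression.
import numpy as np
import statsmodels.api as sm

X = np.array([[12, 5], [16, 2], [10, 8], [14, 3], [13, 4], [15, 1]], dtype=float)
y = np.array([2.3, 2.9, 2.0, 2.6, 2.5, 2.8])

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.tvalues)  # t = bk / se(bk) for each coefficient
print(results.pvalues)  # two-sided p-values, n - p degrees of freedom
```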
Estimation of Standard deviation and R²
s = √( Σ_{i=1}^n (yi − ŷi)² / (n − p) ),

where p is the number of parameters in the multiple regression model.

R² = 1 − SSE / SSTO,

• SSTO: Σ_{i=1}^n (yi − ȳ)²,

• SSE: Σ_{i=1}^n (yi − ŷi)².
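
A minimal sketch computing s and R² for a multiple regression fit; p counts all parameters including the intercept, and the data are hypothetical:

```python
# Sketch: s = sqrt(SSE / (n - p)) and R² = 1 - SSE/SSTO after a least squares fit.
from math import sqrt
import numpy as np

X = np.array([[12, 5], [16, 2], [10, 8], [14, 3], [13, 4], [15, 1]], dtype=float)
y = np.array([2.3, 2.9, 2.0, 2.6, 2.5, 2.8])

A = np.column_stack([np.ones(len(y)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ coeffs

sse = float(np.sum((y - yhat) ** 2))
ssto = float(np.sum((y - y.mean()) ** 2))
n, p = A.shape                              # p = number of parameters
print(sqrt(sse / (n - p)), 1 - sse / ssto)  # s and R²
```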
Problems with R²

• the more explanatory variables are added, the more R² increases

• general phenomenon called overfitting (i.e. explaining the noise)

• Math problem - if the number of explanatory variables equals the number of observations ⇒ R² = 1.
Adjusted R²

If p is the number of explanatory variables

R²adj = 1 − (1 − R²) (n − 1) / (n − p − 1) = R² − (1 − R²) · p / (n − p − 1).

• increases only if the additional explanatory variable is not uncorrelated with the response (i.e. it adds enough explanatory power)

• difficult to interpret
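
A minimal sketch of the adjusted R² formula, with hypothetical values; p is the number of explanatory variables, as above:

```python
# Sketch: adjusted R² penalizes additional explanatory variables.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.80, n=50, p=3))   # few variables
print(adjusted_r2(r2=0.80, n=50, p=20))  # same R², many variables => lower value
```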
