
Linear regression and correlation

Amine Hadji

Leiden University

March 1, 2022
Outline

• Regression line for the sample

• Correlation

• Hypothesis testing of linear relationship

• Multiple regression (standard deviation and hypothesis testing)

• R²
Scatter plot
Regression line in sample
Relationship between variables

• Positive association: the two variables tend to increase/decrease together

• Negative association: the two variables tend to move in opposite directions

• Linear relationship: the pattern of the relationship between the variables resembles a straight line

• Outlier: a point in the scatterplot that has an unusual combination of data values
“Alternative” facts
Nonlinearity & Outliers
Regression line in sample
Prediction: Regression line can be used to predict the unknown value of y for any
individual given the individual’s x value.

The formula for the (sample) regression line:

ŷ = b0 + b1 x,
• ŷ : predicted (estimated) y

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for x = 0)

• b1 : slope of the straight line in the sample (i.e. how much ŷ changes for a one-unit increase of x). Its sign determines whether the line is increasing or decreasing.
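
A minimal code sketch of how the line is used for prediction, assuming the handspan-data estimates quoted later in these slides (b0 = −3, b1 = 0.35); in general the coefficients come from least squares:

```python
# Sketch: predicting y with a fitted sample regression line.
# b0 and b1 are the handspan-data values quoted later in the slides.

def predict(x, b0=-3.0, b1=0.35):
    """Predicted value yhat = b0 + b1 * x."""
    return b0 + b1 * x

print(predict(60))  # 18.0 cm, the predicted handspan for height 60
```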
Regression line in sample
Residual error
Usually, the predicted value ŷ ≠ y, the observed value:
Residual / Prediction error: y − ŷ.
Least squares estimation
The residuals for the handspan data:
• For x1 = 71 we have y1 = 23.5 and ŷ1 = b0 + 71b1. The residual is e1 = 23.5 − b0 − 71b1.

• For x2 = 69 we have y2 = 22 and ŷ2 = b0 + 69b1. The residual is e2 = 22 − b0 − 69b1, and so on.

The intercept b0 and slope b1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en² = Σ_{i=1}^n (yi − b0 − b1 xi)².

They are called the least squares estimators.
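
A minimal sketch of the SSE criterion, using only the two handspan observations quoted above; a full data set would include all n pairs:

```python
# Sketch: sum of squared residuals for candidate values of b0 and b1.

def sse(b0, b1, xs, ys):
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [71, 69]     # heights quoted on the slide
ys = [23.5, 22]   # handspans quoted on the slide
print(sse(-3.0, 0.35, xs, ys))  # SSE at b0 = -3, b1 = 0.35
```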


Least squares estimators

Least squares estimators:


b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,    b0 = ȳ − b1 x̄.

Example: Handspan data


ŷ = −3 + 0.35x
For instance, for height 60 the predicted average handspan is −3 + 0.35 × 60 = 18 cm.
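
A minimal sketch of the estimator formulas; the (x, y) pairs below are illustrative values, not the actual handspan data:

```python
# Sketch: least squares estimators b1 and b0 from the formulas above.
from statistics import mean

def least_squares(xs, ys):
    xbar, ybar = mean(xs), mean(ys)
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

xs = [60, 65, 68, 71, 74]            # hypothetical heights
ys = [18.0, 20.0, 21.5, 23.5, 24.0]  # hypothetical handspans (cm)
print(least_squares(xs, ys))
```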
Goodness of fit
Correlation
Correlation: measure of the strength and direction of a linear relationship between two
quantitative variables
• strength - how close the points are to a straight line.

• direction - whether one variable increases or decreases as the other variable increases

Formula:
r = (1 / (n − 1)) Σ_{i=1}^n ((xi − x̄) / sx) ((yi − ȳ) / sy)

• xi , yi : the x (or y ) measurement for the ith observation.

• x̄, ȳ : the mean of the x (or y ) measurements.

• sx , sy : the standard deviation of the x (or y ) measurements.
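
A minimal sketch of the correlation formula, with hypothetical data:

```python
# Sketch: sample correlation r from the formula above.
from statistics import mean, stdev

def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

print(correlation([60, 65, 68, 71, 74], [18.0, 20.0, 21.5, 23.5, 24.0]))
```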


Correlation - Properties

• The correlation coefficient r is always between −1 and 1.

• Strength is indicated by the magnitude of the correlation.

• Direction is indicated by the sign of the correlation.

• r = 0 means that the best fitting line is a horizontal line.

• r is invariant to scaling of x or y (e.g. from inches to cm).


Correlations
Strong vs. Weak Correlation
Squared correlation

Interpretation:
• r close to −1 or 1 ⇒ r² close to 1

• r²: proportion of variation of y explained by x.


Formula: r² = 1 − SSE/SSTO, where

• SSTO (Sum of Squares Total): Σ_{i=1}^n (yi − ȳ)².

• SSE (Sum of Squared Errors): Σ_{i=1}^n (yi − ŷi)².
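
A minimal sketch of r² = 1 − SSE/SSTO; the fitted values ŷ are assumed to come from a regression line and the numbers are hypothetical:

```python
# Sketch: squared correlation as explained variation, r² = 1 - SSE/SSTO.

def r_squared(ys, yhats):
    ybar = sum(ys) / len(ys)
    ssto = sum((y - ybar) ** 2 for y in ys)               # total variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))  # unexplained variation
    return 1 - sse / ssto

ys = [18.0, 20.0, 21.5, 23.5, 24.0]
yhats = [18.6, 20.0, 21.4, 22.8, 24.2]  # hypothetical fitted values
print(r_squared(ys, yhats))
```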
Squared correlation
Problems in Regression

• Influential observations - observations with extreme values can have a big impact on correlation

• Inappropriately Combining Groups - two distinct groups may show misleading results

• Curvilinearity - linear regression for nonlinear data leads to bad predictions

• Extrapolation - no guarantee the linear relationship continues beyond the range of the observed data
Influential obs.
Non-linearity
Inappropriate combination
Extrapolation
Interpretation of observed correlation

“Correlation does not prove causation.”


• Rule for Concluding Cause and Effect: cause-and-effect relationships can be
inferred from randomized experiments, not from observational studies.
• Confounding variables

• Other explanatory variables


Causation
Estimation of Standard deviation

Standard deviation in Regression:


a measure of the typical difference between the observed y and the predicted ŷ. It can be estimated as

s = √( SSE / (n − 2) ) = √( Σ_{i=1}^n (yi − ŷi)² / (n − 2) ).

If we did not know anything about the xi , the standard deviation would be:
s = √( Σ_{i=1}^n (yi − ȳ)² / (n − 1) ).
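
A minimal sketch of both estimates, with hypothetical observed and fitted values:

```python
# Sketch: regression standard deviation s = sqrt(SSE / (n - 2)), compared with
# the ordinary sample standard deviation that ignores the x values.
from math import sqrt

def regression_sd(ys, yhats):
    n = len(ys)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))
    return sqrt(sse / (n - 2))

def plain_sd(ys):
    n = len(ys)
    ybar = sum(ys) / n
    return sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

ys = [18.0, 20.0, 21.5, 23.5, 24.0]
yhats = [18.6, 20.0, 21.4, 22.8, 24.2]  # hypothetical fitted values
print(regression_sd(ys, yhats), plain_sd(ys))
```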
Statistical Significance - Linear Relationship
The regression line of the population is: y = β0 + β1 x.
Question: Is the slope zero, i.e. is there any relationship between the variables?

Hypothesis testing:

H0 : β1 = 0,    H1 : β1 ≠ 0.

Test statistic:

t = (sample statistic − null value) / standard error = (b1 − 0) / se(b1).

Remark: the p-value can be calculated using the t-distribution table (with n − 2 degrees of freedom).
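
A minimal sketch of the test, assuming the estimate b1, its standard error se(b1), and the sample size n are already available; scipy is used only to look up the t-distribution:

```python
# Sketch: t-test for H0: beta1 = 0 using the t-distribution with n - 2 df.
from scipy import stats

def slope_test(b1, se_b1, n):
    t = (b1 - 0) / se_b1
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
    return t, p

print(slope_test(b1=0.35, se_b1=0.05, n=100))  # hypothetical values
```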
Multivariate regression

The formula for the multivariate regression line is

ŷ = b0 + b1 x1 + b2 x2 + ... + bp−1 xp−1

• x1 , ..., xp−1 : explanatory variables

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for all xj = 0)

• b1 , ..., bp−1 : slopes corresponding to x1 , ..., xp−1 respectively
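
A minimal sketch of fitting such a line by least squares with NumPy; the data are hypothetical:

```python
# Sketch: multivariate least squares with a column of ones for the intercept.
import numpy as np

X = np.array([[12, 5], [16, 2], [10, 8], [14, 3]], dtype=float)  # x1, x2 per row
y = np.array([2.3, 2.9, 2.0, 2.6])

A = np.column_stack([np.ones(len(y)), X])        # design matrix [1, x1, x2]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # b0, b1, b2
print(coeffs)
```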


Examples

Example 1: Wage dependence on education and experience.

log(wage) = b0 + b1 (education) + b2 (work experience).


Example 2: Connection between behavior variable and GPA.

GPA = b0 + b1 (study hours) + b2 (classes missed) + b3 (work hours).


Omitted variables

Problem:
the effect of an omitted relevant variable can be picked up by other explanatory variables.
Examples:
• work experience & education ⇒ wage

• area density & magnitude of earthquake ⇒ death tolls

• height of father & height of mother ⇒ height of child


Multivariate regression - Assumptions

• No outliers

• The errors ei are normally distributed

ei = yi − (b0 + b1 x1,i + b2 x2,i + ... + bp−1 xp−1,i )

• The spread of the errors ei does not depend on the explanatory variables (i.e. they are homoskedastic)

• The sample should be representative of the population
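
A minimal sketch of checking two of these assumptions on the residuals; the Shapiro-Wilk test and a residual plot are one common choice, not the only one, and the residuals below are simulated stand-ins:

```python
# Sketch: diagnostics for normality and homoskedasticity of the errors e_i.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fitted = np.linspace(1, 10, 50)
resid = rng.normal(0, 1, size=50)   # stand-in residuals for illustration

print(stats.shapiro(resid))         # test of normality of the errors
plt.scatter(fitted, resid)          # look for constant spread around zero
plt.axhline(0)
plt.show()
```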


Calvin and Hobbes
Sample Regression

The coefficients b0 , ..., bp−1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en²


They are called the least squares estimators (LSE).
Hypothesis Testing

Question: Does the explanatory variable xk significantly influence the response?


Hypothesis testing:
H0 : βk = 0,    Ha : βk ≠ 0.

Test statistic:

t = (bk − 0) / se(bk),

Remark: the p-value can be calculated using the t-distribution table (with n − p degrees of freedom).
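
A minimal sketch of the coefficient tests; statsmodels is one possible tool (not prescribed in the slides) and the data are hypothetical:

```python
# Sketch: t statistics and p-values for each coefficient in a multiple regression.
import numpy as np
import statsmodels.api as sm

X = np.array([[12, 5], [16, 2], [10, 8], [14, 3], [13, 4], [15, 1]], dtype=float)
y = np.array([2.3, 2.9, 2.0, 2.6, 2.5, 2.8])

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.tvalues)  # t = bk / se(bk) for each coefficient
print(results.pvalues)  # two-sided p-values, n - p degrees of freedom
```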
Estimation of Standard deviation and R²
s = √( Σ_{i=1}^n (yi − ŷi)² / (n − p) ),

where p is the number of parameters in the multiple regression model.

R² = 1 − SSE / SSTO,

• SSTO: Σ_{i=1}^n (yi − ȳ)²,

• SSE: Σ_{i=1}^n (yi − ŷi)².
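
A minimal sketch computing s and R² for a multiple regression fit; p counts all parameters including the intercept, and the data are hypothetical:

```python
# Sketch: s = sqrt(SSE / (n - p)) and R² = 1 - SSE/SSTO after a least squares fit.
from math import sqrt
import numpy as np

X = np.array([[12, 5], [16, 2], [10, 8], [14, 3], [13, 4], [15, 1]], dtype=float)
y = np.array([2.3, 2.9, 2.0, 2.6, 2.5, 2.8])

A = np.column_stack([np.ones(len(y)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ coeffs

sse = float(np.sum((y - yhat) ** 2))
ssto = float(np.sum((y - y.mean()) ** 2))
n, p = A.shape                              # p = number of parameters
print(sqrt(sse / (n - p)), 1 - sse / ssto)  # s and R²
```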
Problems with R²

• the more explanatory variables are added, the more R² increases

• general phenomenon called overfitting (i.e. explaining the noise)

• Math problem - if the number of explanatory variables equals the number of observations ⇒ R² = 1.
Adjusted R²

If p is the number of explanatory variables

R²adj = 1 − (1 − R²) (n − 1) / (n − p − 1) = R² − (1 − R²) · p / (n − p − 1).

• increases only if the additional explanatory variable is not uncorrelated with the response (i.e. it adds enough explanatory power)

• difficult to interpret
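
A minimal sketch of the adjusted R² formula, with hypothetical values; p is the number of explanatory variables, as above:

```python
# Sketch: adjusted R² penalizes additional explanatory variables.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.80, n=50, p=3))   # few variables
print(adjusted_r2(r2=0.80, n=50, p=20))  # same R², many variables => lower value
```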
