
Week 6: Correlation and Regression

PROF. MICHAEL DONG


CALIFORNIA STATE UNIVERSITY LONG BEACH

FALL 2020
1
Road Map
1. Correlation Analysis
2. Simple Linear Regression
3. Multiple Linear Regression

2
Very helpful statistics videos
https://www.youtube.com/channel/UCs3IhN8VOA_5WxpAgbSmFkg/playlists
By Stephanie Glen

3
1. Correlation Analysis

4
Scatter Plots
Check Python demonstration
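A minimal sketch of the kind of Python demonstration referenced here, assuming matplotlib and synthetic data (all variable names and values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y is linearly related to x plus noise
rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=100)
y = 1.5 * x + rng.normal(0, 3, size=100)

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter plot of y against x")
plt.show()
```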

5
Correlation Analysis
In contrast to a scatter plot, which graphically depicts the relationship between two data series, correlation analysis expresses this same relationship using a single number. The correlation coefficient is a measure of how closely related two data series are. In particular, the correlation coefficient measures the direction and extent of linear association between two variables.

6
Correlation Analysis

7
Correlation Analysis

8
Correlation Calculation
To study historical or sample correlations, we need to use the sample covariance. The sample covariance of X and Y, for a sample of size n, is

$cov(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

We then need to calculate the sample variance of X to obtain its sample standard deviation, $s_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}}$ (and likewise $s_Y$).

The formula for computing the sample correlation coefficient is

$r = \frac{cov(X, Y)}{s_X \, s_Y}$
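These formulas can be checked directly in Python; a minimal sketch with synthetic data, comparing the hand computation with numpy's built-in corrcoef:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.8 * x + rng.normal(scale=0.5, size=30)
n = len(x)

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # sample covariance
s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))        # sample std. dev. of X
s_y = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))        # sample std. dev. of Y
r = cov_xy / (s_x * s_y)                                    # sample correlation

print(r, np.corrcoef(x, y)[0, 1])  # the two values agree
```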

9
Example

10
Example

11
Limitations of correlation analysis
Correlation measures the linear association between two variables, but it may not always be reliable. Two variables can have a strong nonlinear relation and still have a very low correlation.

For example, B = (A − 4)² is an exact nonlinear relation, yet the correlation between A and B can be close to zero; contrast this with a linear relation such as B = A, where the correlation is 1.
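This point is easy to verify: in the sketch below, B is an exact function of A, yet the sample correlation is essentially zero because the values of A are symmetric around 4 (the values are illustrative):

```python
import numpy as np

A = np.arange(0, 9)             # 0, 1, ..., 8: symmetric around 4
B = (A - 4) ** 2                # exact nonlinear relation
print(np.corrcoef(A, B)[0, 1])  # ~0: correlation misses the nonlinear relation
```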

12
Nonlinear relationship

13
Outliers
In the scatter plot in Figure 6, most of the data lie clustered together with little discernible relation between the two variables. Two cases, however (the two circled observations), stand out from the rest. In one of those cases, inflation was extremely low at almost −2 percent, and in the other case, stock returns were strongly negative at almost −17 percent. These observations are outliers. If we compute the correlation coefficient for the entire data sample, that correlation is −0.0350. If we eliminate the two outliers, however, the correlation is −0.1489.
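A sketch of the same effect with synthetic data (these numbers are illustrative, not the Figure 6 data): adding two extreme observations noticeably changes the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
stock_returns = rng.normal(1.0, 2.0, size=60)  # unrelated to inflation by construction
inflation = rng.normal(3.0, 0.5, size=60)

# Two extreme observations, analogous to the circled points in Figure 6
stock_returns = np.append(stock_returns, [-17.0, 2.0])
inflation = np.append(inflation, [3.0, -2.0])

print("with outliers:   ", np.corrcoef(stock_returns, inflation)[0, 1])
print("without outliers:", np.corrcoef(stock_returns[:-2], inflation[:-2])[0, 1])
```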

14
Spurious Correlation/Regression
The term spurious correlation has been used to refer to 1) correlation between two variables
that reflects chance relationships in a particular data set, 2) correlation induced by a calculation
that mixes each of two variables with a third, and 3) correlation between two variables arising
not from a direct relation between them but from their relation to a third variable.
As an example of the second kind of spurious correlation, two variables that are uncorrelated may appear correlated if each is divided by a third variable (demonstrated in the sketch below).
As an example of the third kind of spurious correlation, height may be positively correlated with
the extent of a person’s vocabulary, but the underlying relationships are between age and height
and between age and vocabulary.
Investment professionals must be cautious in basing investment strategies on high correlations. Spurious correlation may suggest investment strategies that appear profitable but would not actually be so if implemented.
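The second kind of spurious correlation is easy to reproduce; in the sketch below, X, Y, and Z are mutually independent by construction, yet the ratios X/Z and Y/Z are clearly correlated through the shared divisor:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(10, 1, size=5000)
Y = rng.normal(10, 1, size=5000)
Z = rng.normal(10, 1, size=5000)   # the shared third variable

print(np.corrcoef(X, Y)[0, 1])          # ~0: X and Y are independent
print(np.corrcoef(X / Z, Y / Z)[0, 1])  # clearly positive: induced by dividing by Z
```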

15
Testing the Significance of the Correlation Coefficient
Given n observations, the test statistic for H0: ρ = 0 against Ha: ρ ≠ 0 is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, which follows a t-distribution with n − 2 degrees of freedom when the null hypothesis is true.
16
Tests Concerning the Correlation
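A sketch of this test in Python; scipy.stats.pearsonr reports the correlation together with the two-sided p-value of this t-test, so the hand computation can be checked against it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=40)
y = 0.4 * x + rng.normal(size=40)
n = len(x)

r, p_value = stats.pearsonr(x, y)
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)  # test statistic by hand
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value, n - 2 df

print(t_stat, p_manual, p_value)  # the two p-values agree
```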

17
2. Linear Regression

18
2.1 Simple Regression Model

19
Some Reference Books
Introductory Econometrics pdf:
https://economics.ut.ac.ir/documents/3030266/14100645/Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf

20
The Simple Regression Model
Definition of the simple regression model
◦ "Explains variable y in terms of variable x"
Names: Simple Linear Regression, or Univariate Linear Regression, or Linear Regression with only one independent variable

21
Definition
The variables y and x have several different names, used interchangeably, as follows:
◦ y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand;
◦ x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.)
◦ The terms "dependent variable" and "independent variable" are frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables.

22
The Simple Regression Model (2 of 39)

Interpretation of the simple linear regression model
◦ The model $y = \beta_0 + \beta_1 x + u$ explains how y varies with changes in x: holding the other factors in u fixed ($\Delta u = 0$), $\Delta y = \beta_1 \Delta x$.

The simple linear regression model is rarely applicable in practice, but its discussion is useful for pedagogical reasons.

23
The Simple Regression Model (4 of 39)

When is there a causal interpretation?
◦ Conditional mean independence assumption: the average value of the unobserved factors u is unrelated to the value of the explanatory variable, $E(u \mid x) = E(u) = 0$

Example: wage equation $wage = \beta_0 + \beta_1 \, educ + u$, where $\beta_1$ measures the change in hourly wage for one more year of education, holding all other factors fixed

24
The Simple Regression Model (5 of 39)

Population regression function (PRF)
◦ The conditional mean independence assumption implies that $E(y \mid x) = \beta_0 + \beta_1 x$

This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable.

25
The Simple Regression Model (6 of 39)

26
The Simple Regression Model
Deriving the ordinary least squares estimates
◦ In order to estimate the regression model, one needs data
◦ A random sample of n observations, $\{(x_i, y_i): i = 1, \dots, n\}$

27
Estimation Method: Ordinary Least Squares
Deriving the ordinary least squares (OLS) estimators
Defining regression residuals: $\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$

Minimize the sum of the squared regression residuals: $\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \hat{u}_i^2$

OLS estimators: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
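A sketch of the estimators computed directly from these formulas, checked against numpy.polyfit (synthetic data with true intercept 2 and true slope 3):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x                # regression residuals

print(b0, b1)                          # close to the true 2 and 3
print(u_hat.sum(), (x * u_hat).sum())  # algebraic properties: both ~0
print(np.polyfit(x, y, 1))             # [slope, intercept] agrees
```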

28
Estimation Method: Ordinary Least Squares
OLS fits a regression line through the data points as well as possible, in the sense of minimizing the sum of squared residuals

29
The Simple Regression Model (11 of 39)

30
The Simple Regression Model (14 of 39)

Properties of OLS on any sample of data

Fitted values and residuals: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $\hat{u}_i = y_i - \hat{y}_i$

Algebraic properties of OLS regression: the residuals sum to zero, $\sum_{i=1}^{n} \hat{u}_i = 0$; they are uncorrelated with the regressor in the sample, $\sum_{i=1}^{n} x_i \hat{u}_i = 0$; and the point $(\bar{x}, \bar{y})$ always lies on the regression line.

31
What each item means in OLS

32
Assumptions of the Linear Regression Model
To be able to draw valid conclusions from a linear regression model with a single independent variable, we need to make the following six assumptions, known as the classic normal linear regression model assumptions:
1. The relationship between the dependent variable Y and the independent variable X is linear in the parameters.
2. The independent variable X is not random.
3. The expected value of the error term is 0.
4. The variance of the error term is the same for all observations (homoskedasticity).
5. The error term is uncorrelated across observations.
6. The error term is normally distributed.

33
Zero conditional mean assumption
The error term u has an expected value of zero given any value of the explanatory variable: $E(u \mid x) = 0$

34
The standard error of the estimate/regression
The formula for the standard error of estimate for a linear regression model with one independent variable is

$SEE = \sqrt{\frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - 2}}$

where $\hat{u}_i$ are the regression residuals and n − 2 is the degrees of freedom.

35
Goodness-of-Fit
Goodness of fit
◦ How well does the explanatory variable explain the dependent variable?

Measures of variation: total sum of squares $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$, explained sum of squares $SSE = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$, and residual sum of squares $SSR = \sum_{i=1}^{n}\hat{u}_i^2$, with $SST = SSE + SSR$. The R-squared of the regression, $R^2 = SSE/SST = 1 - SSR/SST$, is the fraction of the sample variation in y that is explained by x; see the sketch below.
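A sketch computing these sums of squares on synthetic data and confirming both identities:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                    # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)         # residual sum of squares

print(sst, sse + ssr)                  # SST = SSE + SSR
print(sse / sst, 1 - ssr / sst)        # R-squared, computed both ways
```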

36
Goodness-of-Fit

37
Goodness-of-Fit

38
Estimator Properties
Expected values and variances of the OLS estimators
The estimated regression coefficients are random variables because they are calculated from a
random sample

The question is what the estimators estimate on average and how large their variability is in repeated samples

39
Estimator Properties
Standard assumptions for the linear regression model
Assumption SLR.1 (Linear in parameters): in the population, y is related to x as $y = \beta_0 + \beta_1 x + u$

Assumption SLR.2 (Random sampling): the data are a random sample $\{(x_i, y_i): i = 1, \dots, n\}$ drawn from the population model

40
Estimator Properties
Assumptions for the linear regression model (cont.)
Assumption SLR.3 (Sample variation in the explanatory variable): the sample outcomes on x are not all the same value

Assumption SLR.4 (Zero conditional mean): the error has an expected value of zero given any value of the explanatory variable, $E(u \mid x) = 0$

41
Estimator Properties
Theorem 2.1 (Unbiasedness of OLS): under assumptions SLR.1 – SLR.4, $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$

Interpretation of unbiasedness
◦ The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random draw.
◦ However, on average, they will be equal to the values that characterize the true relationship between y and x in the population.
◦ "On average" means if sampling were repeated, i.e., if drawing the random sample and doing the estimation were repeated many times.
◦ In a given sample, estimates may differ considerably from the true values.

42
Estimator Properties
Variances of the OLS estimators
◦ Depending on the sample, the estimates will be nearer or farther away from the true population values.
◦ How far can we expect our estimates to be away from the true population values on average (= sampling
variability)?
◦ Sampling variability is measured by the estimators' variances

Assumption SLR.5 (Homoskedasticity): the error has the same variance given any value of the explanatory variable, $Var(u \mid x) = \sigma^2$

43
Estimator Properties
Graphical illustration of homoskedasticity

44
Estimator Properties
An example of heteroskedasticity: wage and education

45
Estimator Variance Properties
Theorem 2.2 (Variances of the OLS estimators)
Under assumptions SLR.1 – SLR.5:

$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x}$ and $Var(\hat{\beta}_0) = \frac{\sigma^2 \, n^{-1}\sum_{i=1}^{n} x_i^2}{SST_x}$

Conclusion:
◦ The sampling variability of the estimated regression coefficients is higher the larger the variance of the unobserved factors, and lower the greater the variation in the explanatory variable.

46
Estimator Variance Properties
Estimating the error variance
An unbiased estimator of the error variance divides by the degrees of freedom n − 2, since two parameters are estimated: $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2 = \frac{SSR}{n-2}$

47
Estimator Variance Properties
Theorem 2.3 (Unbiasedness of the error variance): under assumptions SLR.1 – SLR.5, $E(\hat{\sigma}^2) = \sigma^2$

Calculation of standard errors for regression coefficients: $se(\hat{\beta}_1) = \hat{\sigma}/\sqrt{SST_x}$ and $se(\hat{\beta}_0) = \hat{\sigma}\sqrt{n^{-1}\sum_{i=1}^{n} x_i^2 / SST_x}$

The estimated standard deviations of the regression coefficients are called “standard errors.” They
measure how precisely the regression coefficients are estimated.
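A sketch tying these pieces together: the error-variance estimate and both standard errors computed by hand, checked against statsmodels (synthetic data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=80)
y = 1.0 + 0.5 * x + rng.normal(size=80)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x

sigma2_hat = np.sum(u_hat ** 2) / (n - 2)              # unbiased error-variance estimate
sst_x = np.sum((x - x.mean()) ** 2)
se_b1 = np.sqrt(sigma2_hat / sst_x)                    # standard error of the slope
se_b0 = np.sqrt(sigma2_hat * np.mean(x ** 2) / sst_x)  # standard error of the intercept

results = sm.OLS(y, sm.add_constant(x)).fit()
print(se_b0, se_b1)
print(results.bse)  # statsmodels' standard errors agree
```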

48
Hypothesis Testing
Under the classic normal linear regression assumptions, $t = (\hat{\beta}_1 - \beta_{1,0})/se(\hat{\beta}_1)$ follows a t-distribution with n − 2 degrees of freedom, where $\beta_{1,0}$ is the hypothesized value; the usual test of H0: β1 = 0 therefore uses $t = \hat{\beta}_1/se(\hat{\beta}_1)$, rejecting when |t| exceeds the critical value.

49
Analysis of Variance (ANOVA)
Analysis of variance (ANOVA) is a statistical procedure for dividing the total variability of a variable into components that can be attributed to different sources.

An important statistical test conducted in analysis of variance is the F-test. The F-statistic tests whether all the slope coefficients in a linear regression are equal to 0. In a regression with one independent variable, this is a test of the null hypothesis H0: b1 = 0 against the alternative hypothesis Ha: b1 ≠ 0, using $F = \frac{SSE/1}{SSR/(n-2)}$, the ratio of the mean explained (regression) sum of squares to the mean squared residual, with 1 and n − 2 degrees of freedom; a sketch follows below.
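A sketch of the F-test using statsmodels; with a single independent variable the F-statistic equals the square of the slope's t-statistic:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=60)
y = 0.3 + 0.7 * x + rng.normal(size=60)

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.fvalue, results.f_pvalue)  # F-test that all slope coefficients are 0
print(results.tvalues[1] ** 2)           # equals the F-statistic with one regressor
```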

50
Analysis of Variance (ANOVA)

51
Prediction Intervals
A prediction interval for y at a given value X of the independent variable is $\hat{y} \pm t_c \, s_f$, where $t_c$ is the critical t-value and the estimated variance of the forecast error is $s_f^2 = s^2\left[1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{(n-1)\, s_x^2}\right]$, with s the standard error of estimate.
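statsmodels can produce these intervals directly; a sketch on synthetic data (the obs_ci_* columns of summary_frame are the prediction-interval bounds):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=40)
y = 2.0 + 1.5 * x + rng.normal(size=40)

results = sm.OLS(y, sm.add_constant(x)).fit()

# 95% prediction intervals at two new values of the independent variable
x_new = sm.add_constant(np.array([0.0, 1.0]), has_constant="add")
pred = results.get_prediction(x_new).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
```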

52
Q&A
53
