FALL 2020
1. Correlation Analysis
Very helpful statistics videos
https://www.youtube.com/channel/UCs3IhN8VOA_5WxpAgbSmFkg/playlists
By Stephanie Glen
Scatter Plots
Check Python demonstration
Correlation Analysis
In contrast to a scatter plot, which graphically depicts the relationship between two data series, correlation analysis expresses the same relationship as a single number. The correlation coefficient is a measure of how closely related two data series are; in particular, it measures the direction and extent of linear association between two variables.
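As a quick sketch (the return series below are invented for illustration, not from the text), the correlation coefficient can be computed with NumPy's `np.corrcoef`:

```python
import numpy as np

# Hypothetical monthly returns for two assets (illustrative data only)
x = np.array([0.02, -0.01, 0.03, 0.00, 0.04, -0.02])
y = np.array([0.01, -0.02, 0.02, 0.01, 0.05, -0.03])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # a strong positive correlation for these data
```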
Correlation Calculation
To study historical or sample correlations, we need the sample covariance. For a sample of size n, the sample covariance of X and Y is

$\operatorname{Cov}(X,Y)=\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)$

We then calculate the sample variance of X (and of Y) to obtain the sample standard deviations; dividing the sample covariance by the product of the two sample standard deviations gives the sample correlation coefficient, $r = \operatorname{Cov}(X,Y)/(s_X s_Y)$.
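A minimal sketch of these formulas computed directly, with made-up data (an exactly linear pair, so the correlation should be 1):

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance with the n - 1 denominator, as in the formula above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def sample_corr(x, y):
    """Sample correlation: covariance over the two sample standard deviations."""
    sx = np.sqrt(sample_cov(x, x))  # sample std dev of x
    sy = np.sqrt(sample_cov(y, y))  # sample std dev of y
    return sample_cov(x, y) / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly y = 2x, so correlation is 1
print(sample_cov(x, y), sample_corr(x, y))
```

`sample_cov(x, x)` is just the sample variance, which is why the same helper serves for both.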
Limitations of correlation analysis
Correlation measures the linear association between two variables, but it may not always be reliable: two variables can have a strong nonlinear relation and still have a very low correlation. For example, B = (A − 4)^2 is an exact nonlinear relation, yet A and B can have a correlation near zero, in contrast to a linear relation.
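A small numeric check of this point: taking A symmetric around 4, the exact relation B = (A − 4)² produces a correlation of essentially zero despite a perfect functional relationship.

```python
import numpy as np

A = np.arange(0.0, 9.0)   # 0, 1, ..., 8 -- symmetric around 4
B = (A - 4.0) ** 2        # exact (deterministic) nonlinear relation

r = np.corrcoef(A, B)[0, 1]
print(r)  # essentially 0: correlation misses the nonlinear dependence
```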
Nonlinear relationship
Outliers
In the scatter plot in Figure 6, most of the data lie clustered together with little discernible relation between the two variables. Two cases, however (the two circled observations), stand out from the rest. In one of those cases, inflation was extremely low at almost −2 percent, and in the other case, stock returns were strongly negative at almost −17 percent. These observations are outliers. If we compute the correlation coefficient for the entire data sample, that correlation is −0.0350; if we eliminate the two outliers, however, the correlation is −0.1489.
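The sensitivity of the correlation coefficient to outliers can be reproduced with made-up numbers (illustrative only, not the Figure 6 data). Here two outliers of the same flavor as the circled observations mask a clear negative relation in the cluster:

```python
import numpy as np

# A small cluster with a visible negative relation (invented data)
inflation = np.array([2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8])
returns   = np.array([9.0, 8.5, 9.5, 8.0, 9.0, 8.5, 7.5, 8.5, 7.0, 8.0])

# Two outliers: very low inflation in one case, very negative returns in the other
infl_all = np.append(inflation, [-2.0, 3.0])
ret_all  = np.append(returns,   [ 8.0, -17.0])

r_with    = np.corrcoef(infl_all, ret_all)[0, 1]   # full sample
r_without = np.corrcoef(inflation, returns)[0, 1]  # outliers removed
print(r_with, r_without)  # roughly -0.11 vs -0.65
```

Just two observations are enough to shrink the measured correlation dramatically, which is why inspecting the scatter plot alongside the coefficient matters.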
Spurious Correlation/Regression
The term spurious correlation has been used to refer to 1) correlation between two variables
that reflects chance relationships in a particular data set, 2) correlation induced by a calculation
that mixes each of two variables with a third, and 3) correlation between two variables arising
not from a direct relation between them but from their relation to a third variable.
As an example of the second kind of spurious correlation, two variables that are uncorrelated
may be correlated if divided by a third variable.
As an example of the third kind of spurious correlation, height may be positively correlated with
the extent of a person’s vocabulary, but the underlying relationships are between age and height
and between age and vocabulary.
Investment professionals must be cautious in basing investment strategies on high correlations.
Spurious correlation may suggest investment strategies that appear profitable but would not be profitable if implemented.
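The second kind of spurious correlation is easy to simulate: two independent series become strongly correlated once each is divided by a common third variable. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# x and y are drawn independently, so their correlation is near zero
x = rng.normal(5.0, 1.0, n)
y = rng.normal(5.0, 1.0, n)
# z is a third, unrelated variable used as a common divisor
z = rng.uniform(0.1, 1.0, n)

r_raw   = np.corrcoef(x, y)[0, 1]
r_ratio = np.corrcoef(x / z, y / z)[0, 1]  # dividing by z induces correlation
print(r_raw, r_ratio)  # near zero vs strongly positive
```

Both ratios are dominated by the common factor 1/z, so they co-move even though x and y carry no relation at all.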
Testing the Significance of the Correlation Coefficient
Tests Concerning the Correlation
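The usual test statistic for H0: ρ = 0 is t = r√(n − 2)/√(1 − r²), with n − 2 degrees of freedom. A minimal sketch:

```python
import math

def corr_t_stat(r, n):
    """t-statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Example: sample correlation of 0.5 from 32 observations
t = corr_t_stat(0.5, 32)
print(t)  # compare with the critical t value for 30 df
```

For r = 0.5 and n = 32 the statistic works out to √10 ≈ 3.16, well above typical 5 percent critical values, so the correlation would be judged significant.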
2. Linear Regression
2.1 Simple Regression Model
Some Reference Books
Introductory Econometrics pdf:
https://economics.ut.ac.ir/documents/3030266/14100645/Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf
The Simple Regression Model
Definition of the simple regression model
◦ "Explains variable y in terms of variable x"
Names: Simple Linear Regression, Univariate Linear Regression, or Linear Regression with only one independent variable
Definition
The variables y and x have several different names, used interchangeably:
◦ y: dependent variable, explained variable, response variable, predicted variable, or regressand
◦ x: independent variable, explanatory variable, control variable, predictor variable, or regressor
The Simple Regression Model (2 of 39)
The simple linear regression model is rarely applicable in practice but its discussion is useful for
pedagogical reasons.
The Simple Regression Model (5 of 39)
This means that the average value of the dependent variable can be expressed as a linear
function of the explanatory variable.
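In standard notation, this is the population regression function:

E(y | x) = β0 + β1x

so the model assumes the conditional mean of y is linear in x.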
The Simple Regression Model
Deriving the ordinary least squares estimates
◦ In order to estimate the regression model one needs data
◦ A random sample of n observations
Estimation Method: Ordinary Least Squares
Deriving the ordinary least squares (OLS) estimators
Defining regression residuals
OLS estimators
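A minimal sketch of the OLS estimators for simple regression: the slope is the sample covariance of x and y over the sample variance of x, and the intercept follows from the sample means (the data below are made up so the answer is exact).

```python
import numpy as np

def ols_simple(x, y):
    """OLS estimators (b0, b1) for the model y = b0 + b1*x + u."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()  # regression line passes through the means
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # exact line y = 2x
b0, b1 = ols_simple(x, y)
print(b0, b1)  # 0.0 2.0
```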
Estimation Method: Ordinary Least Squares
OLS fits a regression line through the data points as well as possible
What each item means in OLS
Assumptions of the Linear Regression Model
To draw valid conclusions from a linear regression model with a single independent variable, we need the following six assumptions, known as the classic normal linear regression model assumptions:
1. The relationship between the dependent variable Y and the independent variable X is linear in the parameters b0 and b1.
2. The independent variable X is not random.
3. The expected value of the error term is 0.
4. The variance of the error term is the same for all observations (homoskedasticity).
5. The error term is uncorrelated across observations.
6. The error term is normally distributed.
Zero conditional mean assumption
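Stated formally (standard notation), the zero conditional mean assumption is E(u | x) = 0: the unobserved factors average out to zero at every value of the explanatory variable. Combined with the model equation, it gives E(y | x) = β0 + β1x.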
The standard error of the estimate/regression
The formula for the standard error of estimate (SEE) for a linear regression model with one independent variable is

$\text{SEE}=\sqrt{\dfrac{\sum_{i=1}^{n}\hat u_i^{2}}{n-2}}$

where the $\hat u_i$ are the regression residuals.
Goodness-of-Fit
Goodness of fit
◦ How well does the explanatory variable explain the dependent variable?
Measures of variation:
◦ Total sum of squares (SST): total sample variation in y
◦ Explained sum of squares (SSE): sample variation in the fitted values
◦ Residual sum of squares (SSR): sample variation in the residuals
◦ SST = SSE + SSR, and the R-squared of the regression is R² = SSE/SST = 1 − SSR/SST
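The variation decomposition and R² on a tiny made-up sample (values chosen so the arithmetic is easy to verify):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

# OLS fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y - y_hat) ** 2)          # residual (unexplained) variation
sse = np.sum((y_hat - y.mean()) ** 2)   # explained variation

r2 = 1 - ssr / sst
print(sst, sse, ssr, r2)  # SST=5, SSE=3.2, SSR=1.8, R^2=0.64 (up to rounding)
```

For simple regression R² is also the squared sample correlation between x and y: here r = 0.8, and 0.8² = 0.64.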
Estimator Properties
Expected values and variances of the OLS estimators
The estimated regression coefficients are random variables because they are calculated from a
random sample
The question is what the estimators estimate on average and how large their variability will be in repeated samples
Estimator Properties
Standard assumptions for the linear regression model
Assumption SLR.1 (Linear in parameters): in the population, y = β0 + β1x + u
Estimator Properties
Assumptions for the linear regression model (cont.)
Assumption SLR.3 (Sample variation in the explanatory variable): the sample outcomes of x are not all the same value
Estimator Properties
Theorem 2.1 (Unbiasedness of OLS)
Interpretation of unbiasedness
◦ The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random
draw.
◦ However, on average, they will be equal to the values that characterize the true relationship between y and x
in the population.
◦ “On average” means: if sampling were repeated, i.e., if drawing the random sample and doing the estimation were repeated many times.
◦ In a given sample, estimates may differ considerably from true values.
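A Monte Carlo sketch of unbiasedness, using simulated data with an assumed true slope of 2: individual estimates scatter around the truth, but their average over many repeated samples recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0_true, beta1_true = 1.0, 2.0
x = np.linspace(0.0, 10.0, 50)           # fixed design across replications

slopes = []
for _ in range(2000):
    u = rng.normal(0.0, 1.0, x.size)     # fresh error draw each replication
    y = beta0_true + beta1_true * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    slopes.append(b1)

print(np.mean(slopes))  # close to the true slope of 2
```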
Estimator Properties
Variances of the OLS estimators
◦ Depending on the sample, the estimates will be nearer or farther away from the true population values.
◦ How far can we expect our estimates to be away from the true population values on average (= sampling
variability)?
◦ Sampling variability is measured by the estimators' variances
Estimator Properties
Graphical illustration of homoskedasticity
Estimator Properties
An example for heteroskedasticity: Wage and education
Estimator Variance Properties
Theorem 2.2 (Variances of the OLS estimators)
Under assumptions SLR.1–SLR.5:

$\operatorname{Var}(\hat\beta_1)=\dfrac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}$

Conclusion:
◦ The sampling variability of the estimated regression coefficients is higher the larger the variability of the unobserved factors, and lower the higher the variation in the explanatory variable.
Estimator Variance Properties
Estimating the error variance
An unbiased estimator of the error variance $\sigma^2$ replaces the unobserved errors with the OLS residuals and corrects the degrees of freedom: $\hat\sigma^{2}=\dfrac{1}{n-2}\sum_{i=1}^{n}\hat u_i^{2}=\dfrac{\text{SSR}}{n-2}$
Estimator Variance Properties
Theorem 2.3 (Unbiasedness of the error variance)
The estimated standard deviations of the regression coefficients are called “standard errors.” They
measure how precisely the regression coefficients are estimated.
Hypothesis Testing
Analysis of Variance (ANOVA)
Analysis of variance (ANOVA) is a statistical procedure for dividing the total variability of a variable into
components that can be attributed to different sources.
An important statistical test conducted in analysis of variance is the F-test. The F-statistic tests whether all the slope coefficients in a linear regression are equal to 0. In a regression with one independent variable, this is a test of the null hypothesis H0: b1 = 0 against the alternative hypothesis Ha: b1 ≠ 0.
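With one independent variable, the F-statistic is the mean explained sum of squares over the mean squared residual, F = (SSE/1)/(SSR/(n − 2)). A sketch on a small made-up sample:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
n = x.size

# OLS fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares, 1 df
ssr = np.sum((y - y_hat) ** 2)          # residual sum of squares, n - 2 df

F = (sse / 1) / (ssr / (n - 2))
print(F)  # about 3.56 with 1 and 2 degrees of freedom
```

For simple regression this F-statistic equals the square of the t-statistic on the slope coefficient, so the F-test and the two-sided t-test of H0: b1 = 0 agree.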
Prediction Intervals
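For reference, the standard prediction interval for a new observation at $X = X_0$ in simple linear regression is

$\hat Y \pm t_{\alpha/2,\,n-2}\, s_f,\qquad s_f^{2}=s^{2}\left[1+\frac{1}{n}+\frac{(X_0-\bar X)^{2}}{\sum_{i=1}^{n}(X_i-\bar X)^{2}}\right]$

where $s^2$ is the squared standard error of estimate; the interval widens as $X_0$ moves away from $\bar X$.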
Q&A