
Chapter 5: Correlation and Linear Regression

Phan Thi Khanh Van

E-mail: khanhvanphan@hcmut.edu.vn

June 27, 2020

(Phan Thi Khanh Van) Chap 5: Correlation and Linear Regression June 27, 2020 1 / 19
Table of Contents

1 Covariance and correlation

2 Linear regression method


Empirical Models
Least square method
Linear regression method

Covariance and correlation

Expected Value of a Function of Two Random Variables

$$E[h(X, Y)] = \begin{cases} \displaystyle\sum_{x}\sum_{y} h(x, y)\, f_{XY}(x, y), & \text{if } X, Y \text{ are discrete},\\[6pt] \displaystyle\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h(x, y)\, f_{XY}(x, y)\, dx\, dy, & \text{if } X, Y \text{ are continuous.} \end{cases}$$

Covariance
The covariance between the random variables X and Y, denoted as cov(X, Y) or σXY, is

$$\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y.$$

Remark
If X and Y are independent, then cov(X, Y) = 0. However, a covariance of zero does not necessarily mean that the variables are independent: X and Y may have a nonlinear relationship.

Example
Find cov(X, Y), σX, σY, if the joint probability mass function of X, Y is

(x, y):       (1, 1)   (1.5, 2)   (1.5, 3)   (2.5, 4)   (3, 5)
f_XY(x, y):   1/4      1/8        1/4        1/4        1/8

µX = 1 · 1/4 + 1.5 · 1/8 + 1.5 · 1/4 + 2.5 · 1/4 + 3 · 1/8 = 1.8125.
µY = 1 · 1/4 + 2 · 1/8 + 3 · 1/4 + 4 · 1/4 + 5 · 1/8 = 2.875.
E(XY) = 1 · 1 · 1/4 + 1.5 · 2 · 1/8 + 1.5 · 3 · 1/4 + 2.5 · 4 · 1/4 + 3 · 5 · 1/8 = 6.125.
σXY = E(XY) − µX µY = 6.125 − 1.8125 · 2.875 = 0.9141.
σX = √(V(X)) = √(E(X²) − µX²) = √0.4961 ≈ 0.7043.
σY = √(V(Y)) = √(E(Y²) − µY²) = √1.8594 ≈ 1.3636.
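These figures can be reproduced numerically; a minimal Python sketch that evaluates expectations over the joint pmf of the example:

```python
import math

# Joint pmf from the example: {(x, y): f_XY(x, y)}
pmf = {(1.0, 1.0): 1/4, (1.5, 2.0): 1/8, (1.5, 3.0): 1/4,
       (2.5, 4.0): 1/4, (3.0, 5.0): 1/8}

def expect(h):
    """E[h(X, Y)]: sum of h(x, y) * f_XY(x, y) over the support."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())

mu_x = expect(lambda x, y: x)                    # 1.8125
mu_y = expect(lambda x, y: y)                    # 2.875
cov = expect(lambda x, y: x * y) - mu_x * mu_y   # ≈ 0.9141
sd_x = math.sqrt(expect(lambda x, y: x**2) - mu_x**2)   # ≈ 0.7043
sd_y = math.sqrt(expect(lambda x, y: y**2) - mu_y**2)   # ≈ 1.3636
rho = cov / (sd_x * sd_y)                        # ≈ 0.9518
print(cov, sd_x, sd_y, rho)
```

The same `expect` helper also anticipates the correlation ρXY computed on the next slide.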
Correlation
The correlation between the random variables X and Y, denoted as ρXY, is

$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sqrt{V(X)V(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{XY} \le 1.$$

Example
Find the correlation of X and Y in the previous example.

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{0.9141}{0.7043 \cdot 1.3636} \approx 0.9518.$$
Remark
If ρXY is near +1 (or −1), the points in the joint probability distribution of X and Y that receive positive probability tend to fall along a line of positive (or negative) slope. If ρXY ≠ 0, X and Y are said to be correlated.
Covariance and correlation are measures of the linear relationship between random variables.
Regression analysis
The collection of statistical tools that are used to model and explore
relationships between variables that are related in a nondeterministic
manner is called regression analysis.

For example, in a chemical process, suppose that the yield of the product
is related to the process-operating temperature. Regression analysis can be
used to build a model to predict yield at a given temperature level. This
model can also be used for process optimization, such as finding the level
of temperature that maximizes yield, or for process control purposes.

Simple linear regression model
$$Y = \beta_0 + \beta_1 x + \varepsilon,$$
where ε is a random error term; the slope β1 and intercept β0 of the line are called regression coefficients. The model is simple because it has only one independent (regressor) variable x and one dependent (response) variable Y.

For a fixed x, the random component ε on the right-hand side of the model determines the properties of Y. Suppose that ε has mean 0 and variance σ². Then

$$E(Y \mid x) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x,$$
$$V(Y \mid x) = V(\beta_0 + \beta_1 x + \varepsilon) = V(\varepsilon) = \sigma^2.$$

Thus the true regression model µ_{Y|x} = β0 + β1 x is a line of mean values.
If we have no theoretical knowledge of the relationship between x and y, we base the choice of model on inspection of a scatter diagram, as with the oxygen purity data. The regression model is then called an empirical model (based on experience rather than theory).
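The conditional mean and variance above can be illustrated by simulation; a minimal sketch, where the values of β0, β1, and σ are illustrative (chosen loosely on the oxygen purity scale) rather than taken from the slides:

```python
import random
import statistics

random.seed(0)
beta0, beta1, sigma = 74.28, 14.95, 1.09   # illustrative parameter values
x = 1.2                                     # fix the regressor

# Draw many observations of Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for _ in range(100_000)]

# Sample mean approximates E(Y|x) = beta0 + beta1*x,
# sample variance approximates V(Y|x) = sigma^2.
print(statistics.mean(ys), beta0 + beta1 * x)
print(statistics.variance(ys), sigma**2)
```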
Least square method

Least square method
The least squares method is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the squares of the vertical deviations. For a line y = mx + b with vertical deviations d_i, we have to find m and b such that

$$\sum_{i=1}^{n} d_i^2 \to \min.$$
Least squares estimates in the simple linear regression
Suppose that we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn). We have to find the linear regression model for the data as

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \dots, n.$$

The sum of the squares of the deviations of the observations from the true regression line is

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \to \min.$$

The least squares estimators β̂0, β̂1 must satisfy

$$\frac{\partial L}{\partial \beta_0}\Big|_{\hat\beta_0, \hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0,$$
$$\frac{\partial L}{\partial \beta_1}\Big|_{\hat\beta_0, \hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\, x_i = 0,$$

which is equivalent to the normal equations

$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.$$
Least squares estimates in the simple linear regression

The least squares estimates of the intercept and slope in the simple linear regression model are

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{i=1}^{n} y_i\Big)}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2},$$

where x̄ = (1/n) Σ x_i and ȳ = (1/n) Σ y_i.
The fitted or estimated regression line is therefore

$$\hat y = \hat\beta_0 + \hat\beta_1 x.$$
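The estimates can be computed directly from the summary sums; a minimal Python sketch (the numbers plugged in below are the oxygen purity summaries used on the next slide):

```python
def ls_estimates(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least squares slope and intercept from summary sums."""
    sxy = sum_xy - sum_x * sum_y / n   # corrected cross sum S_xy
    sxx = sum_x2 - sum_x**2 / n        # corrected sum of squares S_xx
    b1 = sxy / sxx                     # slope estimate beta1-hat
    b0 = sum_y / n - b1 * sum_x / n    # intercept: y-bar - b1 * x-bar
    return b0, b1

# Oxygen purity summaries (n = 20)
b0, b1 = ls_estimates(20, 23.92, 1843.21, 2214.6566, 29.2892)
print(b0, b1)   # ≈ 74.2833, 14.9475
```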
Least squares estimates in the simple linear regression

Example: Oxygen Purity

n = 20,
Σ x_i = 23.92, Σ y_i = 1843.21,
x̄ = 1.196, ȳ = 92.1605,
Σ y_i² = 170044.5321,
Σ x_i² = 29.2892,
Σ x_i y_i = 2214.6566.
β̂1 = 14.9475, β̂0 = 74.2833.
If x = 1.5%, then
ŷ ≈ 74.2833 + 14.9475 · 1.5 ≈ 96.7045.
Least squares estimates in the simple linear regression

Estimator of Variance
An unbiased estimator of σ² (the variance of the error term ε) is

$$\hat\sigma^2 = \frac{\Big(\sum_{i=1}^{n} y_i^2 - n\bar y^2\Big) - \hat\beta_1\Big(\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\big(\sum_{i=1}^{n} x_i\big)\big(\sum_{i=1}^{n} y_i\big)\Big)}{n - 2}.$$

Example: Oxygen Purity

For the oxygen purity data, this estimator gives σ̂² ≈ 1.1805.

Example
The biochemical oxygen demand (BOD) test is conducted over a period of time in days. The resulting data for X: time (days) and Y: BOD (mg/liter) follow:

x | 1   | 2   | 4   | 6   | 8   | 10  | 12  | 14  | 16  | 18  | 20
y | 0.6 | 0.7 | 1.5 | 1.9 | 2.1 | 2.6 | 2.9 | 3.7 | 3.5 | 3.7 | 3.8

a) Assuming that a simple linear regression model is appropriate, fit the regression model relating BOD y to time x. What is the estimate of σ²?
b) What is the estimate of the expected BOD level when the time is 15 days?
c) What change in mean BOD is expected when the time changes by three days?
d) Suppose that the time used is six days. Calculate the fitted value of y and the corresponding residual.
e) Calculate the fitted ŷ_i for each value of x_i used to fit the model. Then construct a graph of ŷ_i versus the corresponding observed values y_i and comment on what this plot would look like if the relationship between y and x were a deterministic (no random error) straight line.

a) n = 11, Σ x_i = 111, Σ y_i = 27, Σ x_i² = 1541, Σ y_i² = 80.36, Σ x_i y_i = 347.4.

$$\hat\beta_1 = \frac{\sum x_i y_i - \frac{1}{n}\big(\sum x_i\big)\big(\sum y_i\big)}{\sum x_i^2 - \frac{1}{n}\big(\sum x_i\big)^2} = \frac{347.4 - \frac{111 \cdot 27}{11}}{1541 - \frac{111^2}{11}} = 0.1781.$$

β̂0 = ȳ − β̂1 x̄ = 0.6578.
The estimate of σ²: σ̂² = (S_yy − β̂1 S_xy)/(n − 2) ≈ 0.0825.
b) If x = 15, then ŷ = 0.6578 + 0.1781 · 15 ≈ 3.3293.
c) If Δx = 3, then Δŷ = 0.1781 · Δx = 0.5343.
d) If x = 6, then ŷ = 0.6578 + 0.1781 · 6 ≈ 1.7264.
The corresponding residual: e = y − ŷ = 1.9 − 1.7264 = 0.1736.

e)
x | 1    | 2    | 4    | 6    | 8    | 10   | 12  | 14   | 16   | 18   | 20
ŷ | 0.84 | 1.01 | 1.37 | 1.73 | 2.08 | 2.44 | 2.8 | 3.15 | 3.51 | 3.86 | 4.22

If the relationship between y and x were deterministic, every observed y_i would equal its fitted ŷ_i, so the plot of ŷ_i versus y_i would be a straight line of slope 1 through the origin.
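The whole computation in parts a), d), and e) can be reproduced from the raw data; a minimal Python sketch:

```python
# BOD example: fit the simple linear regression from the raw data.
x = [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
y = [0.6, 0.7, 1.5, 1.9, 2.1, 2.6, 2.9, 3.7, 3.5, 3.7, 3.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n   # S_xy
sxx = sum(xi**2 for xi in x) - sx**2 / n                   # S_xx
syy = sum(yi**2 for yi in y) - sy**2 / n                   # S_yy

b1 = sxy / sxx                       # slope, ≈ 0.1781
b0 = sy / n - b1 * sx / n            # intercept, ≈ 0.6578
sigma2 = (syy - b1 * sxy) / (n - 2)  # error-variance estimate

fitted = [b0 + b1 * xi for xi in x]        # part e): fitted values
resid6 = y[x.index(6)] - (b0 + b1 * 6)     # part d): residual at x = 6
print(b0, b1, sigma2, resid6)
```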

Thank you for your attention!
