
Chapter 5: Correlation and Linear Regression

Phan Thi Khanh Van

E-mail: khanhvanphan@hcmut.edu.vn

June 27, 2020

(Phan Thi Khanh Van) Chap 5: Correlation and Linear Regression June 27, 2020 1 / 19
Table of Contents

1 Covariance and correlation

2 Linear regression method


Empirical Models
Least square method
Linear regression method

Covariance and correlation

Expected Value of a Function of Two Random Variables

$$E[h(X, Y)] = \begin{cases} \displaystyle\sum_{x}\sum_{y} h(x, y)\, f_{XY}(x, y), & \text{if } X, Y \text{ are discrete},\\[6pt] \displaystyle\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h(x, y)\, f_{XY}(x, y)\, dx\, dy, & \text{if } X, Y \text{ are continuous.} \end{cases}$$

Covariance
The covariance between the random variables X and Y, denoted as cov(X, Y) or σXY, is

$$\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y.$$

Remark
If X and Y are independent, then cov(X, Y) = 0. However, a covariance of zero does not necessarily mean that the variables are independent: X and Y may have a nonlinear relationship.

Example
Find cov(X, Y), σX, σY, if the joint probability mass function of X, Y is

(x, y):       (1, 1)   (1.5, 2)   (1.5, 3)   (2.5, 4)   (3, 5)
f_XY(x, y):   1/4      1/8        1/4        1/4        1/8

µX = 1 · 1/4 + 1.5 · 1/8 + 1.5 · 1/4 + 2.5 · 1/4 + 3 · 1/8 = 1.8125.
µY = 1 · 1/4 + 2 · 1/8 + 3 · 1/4 + 4 · 1/4 + 5 · 1/8 = 2.875.
E(XY) = 1 · 1 · 1/4 + 1.5 · 2 · 1/8 + 1.5 · 3 · 1/4 + 2.5 · 4 · 1/4 + 3 · 5 · 1/8 = 6.125.
σXY = E(XY) − µX µY = 6.125 − 1.8125 · 2.875 = 0.9141.
σX = √(V(X)) = √(E(X²) − µX²) = √0.4961 ≈ 0.7043.
σY = √(V(Y)) = √(E(Y²) − µY²) = √1.8594 ≈ 1.3636.
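These figures can be reproduced numerically; a minimal Python sketch that evaluates expectations over the joint pmf of the example:

```python
import math

# Joint pmf from the example: {(x, y): f_XY(x, y)}
pmf = {(1.0, 1.0): 1/4, (1.5, 2.0): 1/8, (1.5, 3.0): 1/4,
       (2.5, 4.0): 1/4, (3.0, 5.0): 1/8}

def expect(h):
    """E[h(X, Y)]: sum of h(x, y) * f_XY(x, y) over the support."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())

mu_x = expect(lambda x, y: x)                    # 1.8125
mu_y = expect(lambda x, y: y)                    # 2.875
cov = expect(lambda x, y: x * y) - mu_x * mu_y   # ≈ 0.9141
sd_x = math.sqrt(expect(lambda x, y: x**2) - mu_x**2)   # ≈ 0.7043
sd_y = math.sqrt(expect(lambda x, y: y**2) - mu_y**2)   # ≈ 1.3636
rho = cov / (sd_x * sd_y)                        # ≈ 0.9518
print(cov, sd_x, sd_y, rho)
```

The same `expect` helper also anticipates the correlation ρXY computed on the next slide.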
Correlation
The correlation between the random variables X and Y, denoted as ρXY, is

$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sqrt{V(X)V(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{XY} \le 1.$$

Example
Find the correlation of X and Y in the previous example.

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{0.9141}{0.7043 \cdot 1.3636} \approx 0.9518.$$
Remark
If ρXY is near +1 (or −1), the points in the joint probability distribution of X and Y that receive positive probability tend to fall along a line of positive (or negative) slope. If ρXY ≠ 0, X and Y are said to be correlated.
Covariance and correlation are measures of the linear relationship between random variables.
Regression analysis
The collection of statistical tools that are used to model and explore
relationships between variables that are related in a nondeterministic
manner is called regression analysis.

For example, in a chemical process, suppose that the yield of the product
is related to the process-operating temperature. Regression analysis can be
used to build a model to predict yield at a given temperature level. This
model can also be used for process optimization, such as finding the level
of temperature that maximizes yield, or for process control purposes.

Simple linear regression model
$$Y = \beta_0 + \beta_1 x + \varepsilon,$$
where ε is a random error term; the slope β1 and intercept β0 of the line are called regression coefficients. The model is simple because it has only one independent (regressor) variable x and one dependent (response) variable Y.

For a fixed x, the random component ε on the right-hand side of the model determines the properties of Y. Suppose that ε has mean 0 and variance σ². Then

$$E(Y \mid x) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x,$$
$$V(Y \mid x) = V(\beta_0 + \beta_1 x + \varepsilon) = V(\varepsilon) = \sigma^2.$$

Thus the true regression model µ_{Y|x} = β0 + β1 x is a line of mean values.
If we have no theoretical knowledge of the relationship between x and y, we base the choice of model on inspection of a scatter diagram, as with the oxygen purity data. The regression model is then called an empirical model (based on experience rather than theory).
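The conditional mean and variance above can be illustrated by simulation; a minimal sketch, where the values of β0, β1, and σ are illustrative (chosen loosely on the oxygen purity scale) rather than taken from the slides:

```python
import random
import statistics

random.seed(0)
beta0, beta1, sigma = 74.28, 14.95, 1.09   # illustrative parameter values
x = 1.2                                     # fix the regressor

# Draw many observations of Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for _ in range(100_000)]

# Sample mean approximates E(Y|x) = beta0 + beta1*x,
# sample variance approximates V(Y|x) = sigma^2.
print(statistics.mean(ys), beta0 + beta1 * x)
print(statistics.variance(ys), sigma**2)
```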
Least square method

Least square method
The least squares method is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the squares of the vertical deviations. For a line y = mx + b with vertical deviations d_i, we have to find m and b such that

$$\sum_{i=1}^{n} d_i^2 \to \min.$$
Least squares estimates in the simple linear regression
Suppose that we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn). We have to find the linear regression model for the data as

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \dots, n.$$

The sum of the squares of the deviations of the observations from the true regression line is

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \to \min.$$

The least squares estimators β̂0, β̂1 must satisfy

$$\frac{\partial L}{\partial \beta_0}\Big|_{\hat\beta_0, \hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0,$$
$$\frac{\partial L}{\partial \beta_1}\Big|_{\hat\beta_0, \hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\, x_i = 0,$$

which is equivalent to the normal equations

$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.$$
Least squares estimates in the simple linear regression

The least squares estimates of the intercept and slope in the simple linear regression model are

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{i=1}^{n} y_i\Big)}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2},$$

where x̄ = (1/n) Σ x_i and ȳ = (1/n) Σ y_i.
The fitted or estimated regression line is therefore

$$\hat y = \hat\beta_0 + \hat\beta_1 x.$$
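The estimates can be computed directly from the summary sums; a minimal Python sketch (the numbers plugged in below are the oxygen purity summaries used on the next slide):

```python
def ls_estimates(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least squares slope and intercept from summary sums."""
    sxy = sum_xy - sum_x * sum_y / n   # corrected cross sum S_xy
    sxx = sum_x2 - sum_x**2 / n        # corrected sum of squares S_xx
    b1 = sxy / sxx                     # slope estimate beta1-hat
    b0 = sum_y / n - b1 * sum_x / n    # intercept: y-bar - b1 * x-bar
    return b0, b1

# Oxygen purity summaries (n = 20)
b0, b1 = ls_estimates(20, 23.92, 1843.21, 2214.6566, 29.2892)
print(b0, b1)   # ≈ 74.2833, 14.9475
```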
Least squares estimates in the simple linear regression

Example: Oxygen Purity

n = 20,
Σ x_i = 23.92, Σ y_i = 1843.21,
x̄ = 1.196, ȳ = 92.1605,
Σ y_i² = 170044.5321,
Σ x_i² = 29.2892,
Σ x_i y_i = 2214.6566.
β̂1 = 14.9475, β̂0 = 74.2833.
If x = 1.5%, then
ŷ ≈ 74.2833 + 14.9475 · 1.5 ≈ 96.7045.
Least squares estimates in the simple linear regression

Estimator of Variance
An unbiased estimator of σ² (the variance of the error term ε) is

$$\hat\sigma^2 = \frac{\Big(\sum_{i=1}^{n} y_i^2 - n\bar y^2\Big) - \hat\beta_1\Big(\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\big(\sum_{i=1}^{n} x_i\big)\big(\sum_{i=1}^{n} y_i\big)\Big)}{n - 2}.$$

Example: Oxygen Purity

For the oxygen purity data, this estimator gives σ̂² ≈ 1.1805.

Example
The biochemical oxygen demand (BOD) test is conducted over a period of time in days. The resulting data for X: time (days) and Y: BOD (mg/liter) follow:

x | 1   | 2   | 4   | 6   | 8   | 10  | 12  | 14  | 16  | 18  | 20
y | 0.6 | 0.7 | 1.5 | 1.9 | 2.1 | 2.6 | 2.9 | 3.7 | 3.5 | 3.7 | 3.8

a) Assuming that a simple linear regression model is appropriate, fit the regression model relating BOD y to time x. What is the estimate of σ²?
b) What is the estimate of the expected BOD level when the time is 15 days?
c) What change in mean BOD is expected when the time changes by three days?
d) Suppose that the time used is six days. Calculate the fitted value of y and the corresponding residual.
e) Calculate the fitted ŷ_i for each value of x_i used to fit the model. Then construct a graph of ŷ_i versus the corresponding observed values y_i and comment on what this plot would look like if the relationship between y and x were a deterministic (no random error) straight line.

a) n = 11, Σ x_i = 111, Σ y_i = 27, Σ x_i² = 1541, Σ y_i² = 80.36, Σ x_i y_i = 347.4.

$$\hat\beta_1 = \frac{\sum x_i y_i - \frac{1}{n}\big(\sum x_i\big)\big(\sum y_i\big)}{\sum x_i^2 - \frac{1}{n}\big(\sum x_i\big)^2} = \frac{347.4 - \frac{111 \cdot 27}{11}}{1541 - \frac{111^2}{11}} = 0.1781.$$

β̂0 = ȳ − β̂1 x̄ = 0.6578.
The estimate of σ²: σ̂² = (S_yy − β̂1 S_xy)/(n − 2) ≈ 0.0825.
b) If x = 15, then ŷ = 0.6578 + 0.1781 · 15 ≈ 3.3293.
c) If Δx = 3, then Δŷ = 0.1781 · Δx = 0.5343.
d) If x = 6, then ŷ = 0.6578 + 0.1781 · 6 ≈ 1.7264.
The corresponding residual: e = y − ŷ = 1.9 − 1.7264 = 0.1736.

e)
x | 1    | 2    | 4    | 6    | 8    | 10   | 12  | 14   | 16   | 18   | 20
ŷ | 0.84 | 1.01 | 1.37 | 1.73 | 2.08 | 2.44 | 2.8 | 3.15 | 3.51 | 3.86 | 4.22

If the relationship between y and x were deterministic, every observed y_i would equal its fitted ŷ_i, so the plot of ŷ_i versus y_i would be a straight line of slope 1 through the origin.
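The whole computation in parts a), d), and e) can be reproduced from the raw data; a minimal Python sketch:

```python
# BOD example: fit the simple linear regression from the raw data.
x = [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
y = [0.6, 0.7, 1.5, 1.9, 2.1, 2.6, 2.9, 3.7, 3.5, 3.7, 3.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n   # S_xy
sxx = sum(xi**2 for xi in x) - sx**2 / n                   # S_xx
syy = sum(yi**2 for yi in y) - sy**2 / n                   # S_yy

b1 = sxy / sxx                       # slope, ≈ 0.1781
b0 = sy / n - b1 * sx / n            # intercept, ≈ 0.6578
sigma2 = (syy - b1 * sxy) / (n - 2)  # error-variance estimate

fitted = [b0 + b1 * xi for xi in x]        # part e): fitted values
resid6 = y[x.index(6)] - (b0 + b1 * 6)     # part d): residual at x = 6
print(b0, b1, sigma2, resid6)
```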

Thank you for your attention!
