LECTURE NOTES NO. 8 MATH 235


Chapter 8

Simple Linear Regression and Correlation

 Model and examine the relationship between a dependent variable and one or more independent variables (predictors).
 Study what causes the variation in the dependent variable.
 Study the effect of X on Y.
 Study the dependence of Y on X.
 In a simple regression model, we have one independent variable.
 In multiple regression, we have two or more independent variables.

Simple Linear Regression Model
The simple linear regression model is a
model that describes a linear relationship
between two variables.
The independent variable (regressor,
predictor) is denoted by X.
The dependent (response) variable is
denoted by Y.
Examples:
e.g.1 : Y : Blood Pressure
X : Age
e.g.2: Y : Yield of a chemical process
X : Reaction temperature
Assuming a linear relationship, the regression model is written as

Y = β0 + β1 X

β0: Value of Y for X = 0 (Y-intercept)
β1: Change in Y due to a change of one unit in X (Slope)
The above model describes an exact
relationship between X and Y.
In practical situations, the relationship
between X and Y is not exact.

Scatter plots show different trends.

The appropriate model is written as

Y = β0 + β1 X + ε

Y: Dependent (response) variable
X: Independent variable (predictor)

β0 and β1: Regression Coefficients


β1: Change in Y due to a change of one
unit in X (Slope)

β0: Value of Y for X = 0 (Y-intercept)

ε : Random Error Term
The regression coefficients β0 and β1 are unknown parameters.

PEARSON CORRELATION COEFFICIENT (r)

The correlation coefficient (Pearson r) measures the degree of correspondence (association) between variables. That is, the Pearson correlation coefficient (r) assesses the degree to which these variables are linearly related in a sample.
We denote the correlation coefficient of the population by ρ and that of the sample by r, which is defined by:
r = ∑(x_i − x̄)(y_i − ȳ) / [√∑(x_i − x̄)² · √∑(y_i − ȳ)²],   −1 ≤ r ≤ 1

(all sums run over i = 1, ..., n)

Equivalently,

r = S_xy / (√S_xx √S_yy)
S_xy = ∑ x_i y_i − (∑ x_i)(∑ y_i) / n

S_xx = ∑ x_i² − (∑ x_i)² / n

S_yy = ∑ y_i² − (∑ y_i)² / n
The inequality 1  r 1 is indicating that the r
cannot be out side of the range of –1 and 1.
Properties of Pearson's Correlation
1. The value of r falls between −1 and +1.
2. A positive value of r indicates that as one variable increases, the other variable increases. A negative value of r indicates that as one variable increases, the other variable decreases. If r = 0, then there is no linear relationship between the two variables.
3. The magnitude of r indicates the strength of the association between the two variables. As r gets closer to either −1 or +1, the strength of the association becomes greater.
4. r measures only the linear relationship between X and Y.

Interpretation of the size of a correlation

Several authors have offered guidelines for the interpretation of a correlation coefficient, e.g.:
- Small (Low) correlation: 0.1 < |r| ≤ 0.3
- Medium (Moderate) correlation: 0.3 < |r| ≤ 0.5
- Large (High) correlation: 0.5 < |r| ≤ 1.0
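As a small illustration, these cutoffs can be encoded in a helper function. This is a minimal sketch in Python; the language and the function name are our own choices, not part of the notes:

```python
def correlation_strength(r):
    """Classify |r| using the guideline cutoffs above."""
    a = abs(r)
    if a > 0.5:
        return "large (high)"
    if a > 0.3:
        return "medium (moderate)"
    if a > 0.1:
        return "small (low)"
    return "negligible"  # below 0.1 is not covered by the guidelines

print(correlation_strength(-0.42))  # medium (moderate)
```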

Example: Consider the data obtained from a


chemical process where the yield of the process
is thought to be related to the reaction
temperature (see the table below).

[Scatterplot of Yield vs. Temperature: yield (about 120 to 220) rises roughly linearly with temperature (50 to 100).]

Obs    X (Temp)   Y (Yield)   X²       Y²       XY
1      50         122         2500     14884    6100
2      53         118         2809     13924    6254
3      54         128         2916     16384    6912
4      55         121         3025     14641    6655
5      56         125         3136     15625    7000
6      59         136         3481     18496    8024
7      62         144         3844     20736    8928
8      65         142         4225     20164    9230
9      67         149         4489     22201    9983
10     71         161         5041     25921    11431
11     72         167         5184     27889    12024
12     74         168         5476     28224    12432
13     75         162         5625     26244    12150
14     76         171         5776     29241    12996
15     79         175         6241     30625    13825
16     80         182         6400     33124    14560
17     82         180         6724     32400    14760
18     85         183         7225     33489    15555
19     87         188         7569     35344    16356
20     90         200         8100     40000    18000
21     93         194         8649     37636    18042
22     94         206         8836     42436    19364
23     95         207         9025     42849    19665
24     97         210         9409     44100    20370
25     100        219         10000    47961    21900
Totals 1871       4158        145705   714538   322516
∑ x_i = 1871,   ∑ x_i² = 145705,   x̄ = 74.84

∑ y_i = 4158,   ∑ y_i² = 714538,   ȳ = 166.32

∑ x_i y_i = 322516

(all sums run over i = 1, ..., n with n = 25)

S_xy = ∑ x_i y_i − (∑ x_i)(∑ y_i)/n = 322516 − (1871)(4158)/25 = 11331.28

S_xx = ∑ x_i² − (∑ x_i)²/n = 145705 − (1871)²/25 = 5679.36
S_yy = ∑ y_i² − (∑ y_i)²/n = 714538 − (4158)²/25 = 22979.44
r = S_xy / (√S_xx √S_yy) = 11331.28 / (√5679.36 √22979.44) = 0.99
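A minimal Python check of this computation, using only the column totals from the table (the variable names are our own):

```python
import math

# Column totals from the table (n = 25 observations)
n = 25
sum_x, sum_y = 1871, 4158
sum_x2, sum_y2, sum_xy = 145705, 714538, 322516

S_xy = sum_xy - sum_x * sum_y / n   # 11331.28
S_xx = sum_x2 - sum_x ** 2 / n      # 5679.36
S_yy = sum_y2 - sum_y ** 2 / n      # 22979.44

r = S_xy / (math.sqrt(S_xx) * math.sqrt(S_yy))
print(round(r, 2))  # 0.99
```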
The Model

Y_i = β0 + β1 x_i + ε_i,   i = 1, ..., n
Assumptions

1. The ε_1, ε_2, …, ε_n are independent.

2. Var(ε_i) = σ², i = 1, ..., n (constant variance)

3. The ε_1, ε_2, …, ε_n follow a normal distribution.

Estimation of β0, β1 and σ²

The sample data can be used to estimate E(Y) by estimating β0 and β1 by β̂0 and β̂1.
Specifically,

Ŷ = β̂0 + β̂1 X

where Ŷ is the fitted (predicted) value; the above model is called the estimated model.

Least squares estimates of β0 and β1

Need to minimize the error sum of squares

Q(β0, β1) = ∑ (Y_i − β0 − β1 x_i)²

with respect to β0 and β1.

The values of β0 and β1 (say, β̂0 and β̂1) that minimize Q(β0, β1) are the least squares estimates (LSEs), and are given by solving the equations

n β̂0 + β̂1 ∑ x_i = ∑ Y_i

β̂0 ∑ x_i + β̂1 ∑ x_i² = ∑ x_i Y_i

These equations are called the normal equations. Their solution is

β̂1 = S_xy / S_xx,

β̂0 = ȳ − β̂1 x̄

where

S_xy = ∑ x_i y_i − (∑ x_i)(∑ y_i)/n

S_xx = ∑ x_i² − (∑ x_i)²/n

S_yy = ∑ y_i² − (∑ y_i)²/n
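The normal-equation solution above translates directly into code. A minimal sketch, assuming NumPy is available (the function name is our own, not from the notes):

```python
import numpy as np

def least_squares(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    S_xy = np.sum(x * y) - x.sum() * y.sum() / n
    S_xx = np.sum(x ** 2) - x.sum() ** 2 / n
    b1 = S_xy / S_xx               # slope estimate
    b0 = y.mean() - b1 * x.mean()  # intercept estimate
    return b0, b1
```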

Fitted values

ŷ_i = β̂0 + β̂1 x_i,   i = 1, ..., n

Residuals

e_i = y_i − ŷ_i,   i = 1, ..., n
(actual value − fitted value)

Estimation of the error variance (σ²)

Error sum of squares: SSE = ∑ (y_i − ŷ_i)²

σ̂² = s² = SSE / (n − 2) = ∑ (y_i − ŷ_i)² / (n − 2)

It can be shown that

s² = (S_yy − β̂1 S_xy) / (n − 2)

The standard error of β̂1 is

S.E.(β̂1) = σ̂ / √S_xx = s / √S_xx

where

s² = ∑ (y_i − ŷ_i)² / (n − 2) = MSE

is the estimate of σ². (MSE: Mean Square Error)
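Continuing the sketch above, s² and the slope's standard error could be computed as follows (the helper name is our own, not from the notes):

```python
import numpy as np

def error_variance_and_se(x, y, b0, b1):
    """Return (s2, se_b1): the MSE and the standard error of the slope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    resid = y - (b0 + b1 * x)                # e_i = y_i - yhat_i
    s2 = np.sum(resid ** 2) / (len(x) - 2)   # SSE / (n - 2)
    S_xx = np.sum((x - x.mean()) ** 2)
    se_b1 = np.sqrt(s2 / S_xx)               # s / sqrt(S_xx)
    return s2, se_b1
```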

Assuming normality,

t = (β̂1 − β1) / (s / √S_xx)

has a t distribution with d.f. = n − 2.

Result:

A (1 − α)100% C.I. for β1 is given by

β̂1 ± t_{α/2, n−2} · s / √S_xx
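The interval can be computed with SciPy's t quantile function. A sketch under the assumption that b1 and se_b1 come from the helpers above:

```python
from scipy import stats

def slope_ci(b1, se_b1, n, alpha=0.05):
    """(1 - alpha)100% confidence interval for the slope beta1."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t_{alpha/2, n-2}
    return b1 - t_crit * se_b1, b1 + t_crit * se_b1
```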
18
The standard error of β̂0 is

S.E.(β̂0) = s √(1/n + x̄²/S_xx)

Assuming normality,

t = (β̂0 − β0) / S.E.(β̂0)

has a t distribution with d.f. = n − 2.

Result:

A (1 − α)100% C.I. for β0 is given by

β̂0 ± t_{α/2, n−2} S.E.(β̂0)

Result:

E(s²) = E( ∑ (y_i − ŷ_i)² / (n − 2) ) = σ²

i.e., s² is an unbiased estimator of σ².

Testing Hypotheses
Test

H0: β1 = 0

H1: β1 ≠ 0

If you reject H0, then the regressor X truly influences the response Y.

Test statistic

T = β̂1 / S.E.(β̂1)

Reject H0 if |t| > t_{α/2, n−2}.

Test

H0: β0 = 0

H1: β0 ≠ 0

Test statistic

T = β̂0 / S.E.(β̂0)

Reject H0 if |t| > t_{α/2, n−2}.
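Both tests have the same shape, so one helper covers them. A hedged sketch (SciPy assumed; the function name is our own):

```python
from scipy import stats

def t_test(estimate, se, n):
    """Two-sided t test of H0: coefficient = 0; returns (t, p-value)."""
    t = estimate / se
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # survival function = 1 - CDF
    return t, p
```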

Quality of the fitted model

SST = ∑ (y_i − ȳ)²

SST: Total sum of squares (variability in the Y variable)

SSE = ∑ (y_i − ŷ_i)²

SSE: Error sum of squares

Does the data fit the model adequately? Will the model predict the response well enough?

R²: Coefficient of determination (a measure of fit of the regression line)

R² = 1 − SSE/SST,   0 ≤ R² ≤ 1

R² is the proportion of variation in the response data that is explained by the model.
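R² can be computed directly from the observed and fitted values. A minimal sketch (NumPy assumed; the function name is our own):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE/SST."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = np.sum((y - y_hat) ** 2)      # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1 - sse / sst
```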

Prediction:

ŷ(x0) = β̂0 + β̂1 x0

(the estimated or predicted value of y at x0)

Example (continued): We now fit the regression model to the chemical process data above. From the earlier computations, S_xy = 11331.28, S_xx = 5679.36, and S_yy = 22979.44, with x̄ = 74.84 and ȳ = 166.32.
β̂1 = S_xy / S_xx = 11331.28 / 5679.36 = 1.9951

β̂0 = ȳ − β̂1 x̄ = 166.32 − 1.9951(74.84) = 17

The regression line:

Yield = 17 + 1.9951 × Temperature
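A quick Python check of this arithmetic, using the quantities computed earlier:

```python
# Summary quantities from the example above
S_xy, S_xx = 11331.28, 5679.36
x_bar, y_bar = 74.84, 166.32

b1 = S_xy / S_xx          # slope, about 1.9951
b0 = y_bar - b1 * x_bar   # intercept, about 17.0
print(round(b1, 4), round(b0, 2))
```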

The estimated error variance:

σ̂² = s² = ∑ (y_i − ŷ_i)² / (n − 2)

or

s² = (S_yy − β̂1 S_xy) / (n − 2)

s² = (22979.44 − 1.9951(11331.28)) / (25 − 2) ≈ 372/23 ≈ 16.16

s ≈ 4.02
Standard Errors

S.E.(β̂1) = s / √S_xx = 4.02 / √5679.36 = 0.05334

S.E.(β̂0) = s √(1/n + x̄²/S_xx) = 4.02 √(1/25 + (74.84)²/5679.36) = 4.072
A 95% C.I. for β1 is given by

β̂1 ± t_{α/2, n−2} S.E.(β̂1)

1.9951 ± t_{0.025, 23}(0.05334) = 1.9951 ± 2.069(0.05334)

(1.89, 2.11)

Test

H0: β1 = 0

H1: β1 ≠ 0

The value of the test statistic:

t = β̂1 / S.E.(β̂1) = 1.9951 / 0.05334 = 37.41

t_{α/2, n−2} = t_{0.025, 23} = 2.069

Since t = 37.41 > 2.069, we reject H0.

The temperature has a significant effect on the yield.
SST = ∑ (y_i − ȳ)² = S_yy = 22979.44

SSE = ∑ (y_i − ŷ_i)² = 372

Coefficient of determination:

R² = 1 − SSE/SST = 1 − 372/22979.44 = 0.984 = 98.4%

98.4% of the variation in chemical yield is explained by temperature.

Predict the chemical yield at temperature = 73:

Yield = 17 + 1.9951(73) = 162.64

Statistical Analysis Using Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.991880955
R Square           0.983827829
Adjusted R Square  0.983124692
Standard Error     4.019665911
Observations       25

            Coefficients   Standard Error   t Stat     P-value
Intercept   17.00159173    4.071997         4.175246   0.000363789
X Temp      1.99516847     0.053338         37.40583   4.173E-22
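The same output can be reproduced in Python with the statsmodels package (a sketch; statsmodels is our choice of tool, not part of the notes):

```python
import numpy as np
import statsmodels.api as sm

# Data from the table above
temp = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75,
        76, 79, 80, 82, 85, 87, 90, 93, 94, 95, 97, 100]
yield_ = [122, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162,
          171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210, 219]

X = sm.add_constant(np.array(temp, float))  # adds the intercept column
fit = sm.OLS(np.array(yield_, float), X).fit()
print(fit.params)    # intercept ≈ 17.0, slope ≈ 1.995
print(fit.rsquared)  # ≈ 0.984
```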
Checking linearity

1) Scatterplot of X vs. Y.

2) After the model is fit, the residuals can sometimes tell us that the relationship was not linear, i.e., that the original data have a nonlinear relationship.

Suppose you fit a linear regression model to such data; then the residual plot (a plot of residuals vs. the x-values) will show a clear pattern. This probably suggests fitting a quadratic model. A sketch of such a residual plot is given below.
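A hedged sketch of the residual plot in Python (matplotlib assumed; the function and variable names are our own):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(x, y, b0, b1):
    """Plot residuals vs. x; a visible pattern suggests a nonlinear fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    resid = y - (b0 + b1 * x)            # e_i = y_i - yhat_i
    plt.scatter(x, resid)
    plt.axhline(0, color="gray", linestyle="--")
    plt.xlabel("x")
    plt.ylabel("residual")
    plt.title("Residuals vs. x")
    plt.show()
```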