Linear Regression and Correlation Analysis
Chapter Goals
After completing this chapter, you should be
able to:
[Figure: scatter plots of y against x illustrating curvilinear relationships and weak relationships]
Correlation Coefficient
(continued)
Features of ρ and r:
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
Examples of Approximate
r Values
[Figure: scatter plots of y against x with r = -1, r = -.6, r = 0, r = +.3, and r = +1]
Calculating the
Correlation Coefficient
Sample correlation coefficient:
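$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{[\sum (x - \bar{x})^2][\sum (y - \bar{y})^2]}} $$
or the algebraic equivalent:
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} $$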
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Calculation Example
Tree Height (y)    Trunk Diameter (x)
35                 8
49                 9
27                 7
33                 6
60                 13
21                 7
45                 11
51                 12
Calculation Example
Tree Height (y)    Trunk Diameter (x)    xy      y²       x²
35                 8                     280     1225     64
49                 9                     441     2401     81
27                 7                     189     729      49
33                 6                     198     1089     36
60                 13                    780     3600     169
21                 7                     147     441      49
45                 11                    495     2025     121
51                 12                    612     2601     144
Σy = 321           Σx = 73               Σxy = 3142   Σy² = 14111   Σx² = 713
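Calculation Example
(continued)
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} = \frac{8(3142) - (73)(321)}{\sqrt{[8(713) - (73)^2][8(14111) - (321)^2]}} = 0.886 $$
[Figure: scatter plot of Tree Height, y against Trunk Diameter, x]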
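The calculation can be checked with a short script. This is a minimal sketch in plain Python, not part of the original slides; the function name `correlation` is illustrative, and the single-digit trunk diameters (8, 9, 7, 6, 7) are recovered from the xy and x² columns of the calculation table.

```python
import math

# Tree data from the calculation example: y = tree height, x = trunk diameter.
# Single-digit diameters are recovered from the xy and x^2 columns (assumption).
heights = [35, 49, 27, 33, 60, 21, 45, 51]
diameters = [8, 9, 7, 6, 13, 7, 11, 12]


def correlation(x, y):
    """Sample correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = correlation(diameters, heights)
print(round(r, 3))  # 0.886
```

The intermediate sums match the table totals (Σx = 73, Σy = 321, Σxy = 3142), which is a quick way to confirm the data were entered correctly.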
Excel Output
Excel Correlation Output
Tools / data analysis / correlation
Correlation between
Tree Height and Trunk Diameter
Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
Test statistic
$$ t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} $$
(with n - 2 degrees of freedom)
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05, d.f. = 8 - 2 = 6
$$ t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.886}{\sqrt{\frac{1 - .886^2}{8 - 2}}} = 4.68 $$
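Test statistic: t = 4.68, with d.f. = 8 - 2 = 6
Critical values for α/2 = .025: -tα/2 = -2.4469 and tα/2 = 2.4469
Decision: Reject H0 (4.68 > 2.4469)
Conclusion: There is evidence of a linear relationship at the 5% level of significance
[Figure: t distribution with rejection regions below -2.4469 and above 2.4469; the test statistic 4.68 falls in the upper rejection region]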
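The test can be reproduced numerically. The sketch below is illustrative only: it takes r = .886 and n = 8 from the tree example, and uses the critical value 2.4469 quoted in the slides rather than computing it from a t distribution.

```python
import math

r = 0.886            # sample correlation from the tree example
n = 8                # sample size
t_critical = 2.4469  # two-tailed critical value from the slides (alpha = .05, d.f. = 6)

# Test statistic for H0: rho = 0
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))

# Two-tailed decision rule: reject H0 when |t| exceeds the critical value
reject_h0 = abs(t_stat) > t_critical

print(round(t_stat, 2), reject_h0)  # 4.68 True
```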
Introduction to Regression
Analysis
[Figure: scatter plot showing no relationship between x and y]
Population Linear
Regression
The population regression model:
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where:
y = Dependent variable
β0 = Population y intercept
β1 = Population slope coefficient
x = Independent variable
ε = Random error term, or residual
Linear component: β0 + β1x
Random error component: ε
Population Linear
Regression
(continued)
[Figure: population regression line y = β0 + β1x + ε, with intercept = β0 and slope = β1, showing the observed value of y for xi, the predicted value of y for xi, and the random error for this x value]
Linear Regression
Assumptions
Estimated Regression
Model
The sample regression line provides an estimate of the population regression line
$$ \hat{y}_i = b_0 + b_1 x $$
where:
ŷi = Estimated (or predicted) y value
b0 = Estimate of the regression intercept
b1 = Estimate of the regression slope
x = Independent variable
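Least squares criterion: b0 and b1 are the values that minimize the sum of squared residuals
$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (b_0 + b_1 x))^2 $$
The least squares estimates are:
$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
algebraic equivalent:
$$ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} $$
and
$$ b_0 = \bar{y} - b_1 \bar{x} $$
or
$$ b_0 = \frac{\sum y \sum x^2 - \sum x \sum xy}{n \sum x^2 - (\sum x)^2} $$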
Interpretation of the
Slope and the Intercept
House Price (y)    Square Feet (x)
245                1400
312                1600
279                1700
308                1875
199                1100
219                1550
405                2350
324                2450
319                1425
255                1700
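The least squares formulas for b1 and b0 can be applied directly to this sample. The following is a minimal sketch in plain Python; the function name `least_squares` is illustrative and the first column is assumed to be house price (y), as the later regression output suggests.

```python
# Sample data: y = house price, x = square feet
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    # b0 = y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(sqft, prices)
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")  # b0 = 98.24833, b1 = 0.10977
```

These values agree with the Intercept and Square Feet coefficients reported in the Excel regression output.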
Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
              df    SS            MS            F          Significance F
Regression    1     18934.9348    18934.9348    11.0848    0.01039
Residual      8     13665.5652    1708.1957
Total         9     32600.5000

              Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept     98.24833        58.03348          1.69296    0.12892    -35.57720    232.07386
Square Feet   0.10977         0.03297           3.32938    0.01039    0.03374      0.18580
Graphical Presentation
[Figure: scatter plot of house price against square feet with the fitted regression line; intercept = 98.248]
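Interpretation of the
Intercept, b0
house price = 98.24833 + 0.10977 (square feet)
b0 = 98.24833 is the estimated mean value of y when x is zero
Interpretation of the
Slope Coefficient, b1
house price = 98.24833 + 0.10977 (square feet)
b1 = 0.10977 is the estimated change in the mean house price for each additional square foot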
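SST = Total sum of squares
SSE = Sum of squares error
SSR = Sum of squares regression
$$ SST = \sum (y - \bar{y})^2 $$
$$ SSE = \sum (y - \hat{y})^2 $$
$$ SSR = \sum (\hat{y} - \bar{y})^2 $$
where:
ȳ = Mean of the dependent variable
y = Observed value of the dependent variable
ŷ = Predicted value of y for the given x
[Figure: variation in y at Xi, showing yi, ȳ, and SSE = Σ(yi - ŷi)²]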
Coefficient of
Determination, R²
$$ R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}} $$
where
$$ 0 \le R^2 \le 1 $$
Coefficient of determination and correlation:
$$ R^2 = r^2 $$
where:
R² = Coefficient of determination
r = Simple correlation coefficient
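The sums of squares and R² can be computed from the house price sample. This is a sketch under the assumption that the fitted line comes from the least squares formulas for b1 and b0; variable names are illustrative. It also checks the standard decomposition SST = SSR + SSE, which holds for a least squares fit with an intercept.

```python
# Sample data: y = house price, x = square feet
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Least squares fit
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
sxx = sum((x - x_bar) ** 2 for x in sqft)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in sqft]

# Sums of squares
sst = sum((y - y_bar) ** 2 for y in prices)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

r_squared = ssr / sst

print(f"SST = {sst:.4f}")        # SST = 32600.5000
print(f"SSR = {ssr:.4f}")        # SSR = 18934.9348
print(f"SSE = {sse:.4f}")        # SSE = 13665.5652
print(f"R^2 = {r_squared:.5f}")  # R^2 = 0.58082
```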
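Examples of Approximate
R² Values
[Figure: scatter plots of y against x with all points on the fitted line; R² = 1]
[Figure: scatter plots of y against x with points scattered around the fitted line; 0 < R² < 1]
[Figure: scatter plot of y against x with R² = 0]
R² = 0: No linear relationship between x and y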
Excel Output
$$ R^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082 $$
The regression output is the same as in the earlier Excel Output; the entries used here are:
R Square         0.58082
Regression SS    18934.9348
Total SS         32600.5000
$$ s = \sqrt{\frac{SSE}{n - k - 1}} $$
where:
SSE = Sum of squares error
n = Sample size
k = Number of independent variables in the model
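$$ s_{b_1} = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}} $$
where:
$$ s = \sqrt{\frac{SSE}{n - 2}} $$
s = Sample standard error of the estimate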
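Both standard errors can be reproduced for the house price example. The sketch below is illustrative: it takes SSE = 13665.5652 from the Excel ANOVA table rather than recomputing it, and uses the square feet column for Σ(x - x̄)².

```python
import math

# Square feet (x) from the house price sample
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

sse = 13665.5652  # residual sum of squares from the Excel ANOVA table
n = len(sqft)     # sample size
k = 1             # one independent variable (simple regression)

# Standard error of the estimate: s = sqrt(SSE / (n - k - 1))
s = math.sqrt(sse / (n - k - 1))

# Standard error of the slope: s_b1 = s / sqrt(sum((x - x_bar)^2))
x_bar = sum(sqft) / n
sxx = sum((x - x_bar) ** 2 for x in sqft)
sb1 = s / math.sqrt(sxx)

print(f"s = {s:.5f}")      # s = 41.33032
print(f"sb1 = {sb1:.5f}")  # sb1 = 0.03297
```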
Excel Output
s = 41.33032
sb1 = 0.03297
The regression output is the same as in the earlier Excel Output; the entries used here are:
Standard Error (Regression Statistics)    41.33032
Standard Error (Square Feet)              0.03297
[Figure: scatter plots comparing small s with large s, and small sb1 with large sb1]
Test statistic
$$ t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad \text{d.f.} = n - 2 $$
where:
b1 = Sample regression slope coefficient
β1 = Hypothesized slope
sb1 = Estimator of the standard error of the slope
Example, using the house price (y) and square feet (x) sample:
H0: β1 = 0
HA: β1 ≠ 0
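              Coefficients (b1)    Standard Error (sb1)    t Stat     P-value
Intercept     98.24833             58.03348                1.69296    0.12892
Square Feet   0.10977              0.03297                 3.32938    0.01039

d.f. = 10 - 2 = 8
Critical values for α/2 = .025: -tα/2 = -2.3060 and tα/2 = 2.3060
Test statistic: t = 3.329
Decision: Reject H0 (3.329 > 2.3060)
Conclusion: There is sufficient evidence that square footage affects house price
[Figure: t distribution with rejection regions below -2.3060 and above 2.3060; the test statistic 3.329 falls in the upper rejection region]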
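The slope test can be reproduced from the reported coefficients. This sketch is illustrative only: it uses the rounded values b1 = 0.10977 and sb1 = 0.03297 from the Excel output, a hypothesized slope of zero, and the critical value 2.3060 quoted in the slides.

```python
b1 = 0.10977          # sample slope from the Excel output
sb1 = 0.03297         # standard error of the slope from the Excel output
beta1_null = 0.0      # hypothesized slope under H0
t_critical = 2.3060   # two-tailed critical value from the slides (alpha = .05, d.f. = 8)

# t = (b1 - beta1) / s_b1
t_stat = (b1 - beta1_null) / sb1

# Two-tailed decision rule
reject_h0 = abs(t_stat) > t_critical

print(round(t_stat, 3), reject_h0)  # 3.329 True
```

Because the inputs are rounded, the result matches the Excel t Stat of 3.32938 only to about three decimal places.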
$$ b_1 \pm t_{\alpha/2}\, s_{b_1}, \qquad \text{d.f.} = n - 2 $$
              Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept     98.24833        58.03348          1.69296    0.12892    -35.57720    232.07386
Square Feet   0.10977         0.03297           3.32938    0.01039    0.03374      0.18580

The 95% confidence interval for the slope is (0.03374, 0.18580)
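The interval limits follow from the formula b1 ± tα/2 sb1. A minimal sketch, assuming the rounded Excel values for b1 and sb1 and the critical value 2.3060 used in the slope test (d.f. = 8):

```python
b1 = 0.10977         # slope estimate from the Excel output
sb1 = 0.03297        # standard error of the slope
t_critical = 2.3060  # t value for alpha/2 = .025 with d.f. = 10 - 2 = 8

# Confidence interval: b1 +/- t * s_b1
margin = t_critical * sb1
lower = b1 - margin
upper = b1 + margin

print(f"{lower:.5f} to {upper:.5f}")  # 0.03374 to 0.18580
```

The interval excludes zero, which is consistent with rejecting H0: β1 = 0 at the 5% level.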
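Confidence interval estimate for the mean of y, given xp:
$$ \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} $$
Prediction interval estimate for an individual y, given xp:
$$ \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} $$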
Interval Estimates
for Different Values of x
[Figure: fitted line ŷ = b0 + b1x plotted against x, with the confidence interval for the mean of y, given xp, and the prediction interval for an individual y, given xp]
Confidence interval estimate for the mean of y, given xp (house price and square feet sample):
$$ \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} = 317.85 \pm 37.12 $$
Estimation of Individual
Values: Example
Prediction Interval Estimate for y|xp
$$ \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} = 317.85 \pm 102.28 $$
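Both margins can be reproduced from the sample. The sketch below is hedged: the slides as extracted do not state xp, so xp = 2000 square feet is an assumption, chosen because it reproduces the quoted point estimate 317.85 (up to rounding of the coefficients) and both margins. The values s = 41.33032 and t = 2.3060 are taken from the earlier output.

```python
import math

# Square feet (x) from the house price sample
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

s = 41.33032         # standard error of the estimate (Excel output)
t_critical = 2.3060  # t value for alpha/2 = .025, d.f. = 8
x_p = 2000           # ASSUMPTION: x value of interest, not stated in the slides

n = len(sqft)
x_bar = sum(sqft) / n
sxx = sum((x - x_bar) ** 2 for x in sqft)

# Shared distance term: (x_p - x_bar)^2 / sum((x - x_bar)^2)
distance = (x_p - x_bar) ** 2 / sxx

# Margin for the mean of y given x_p (confidence interval)
ci_margin = t_critical * s * math.sqrt(1 / n + distance)

# Margin for an individual y given x_p (prediction interval)
pi_margin = t_critical * s * math.sqrt(1 + 1 / n + distance)

print(round(ci_margin, 2), round(pi_margin, 2))  # 37.12 102.28
```

The prediction margin is wider because the extra 1 under the square root accounts for the variability of an individual observation around the mean.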
In Excel, use
PHStat | regression | simple linear regression
Check the "confidence and prediction interval for X=" box and enter the x-value and confidence level desired
[Figure: PHStat dialog showing the input values]
Residual Analysis
Purposes
Examine for linearity assumption
Examine for constant variance for all levels of x
Evaluate normal distribution assumption
[Figure: residual plots against x contrasting a not linear pattern with a linear pattern]
[Figure: residual plots against x contrasting non-constant variance with constant variance]
Excel Output
RESIDUAL OUTPUT
Observation    Predicted House Price    Residuals
1              251.92316                -6.923162
2              273.87671                38.12329
3              284.85348                -5.853484
4              304.06284                3.937162
5              218.99284                -19.99284
6              268.38832                -49.38832
7              356.20251                48.79749
8              367.17929                -43.17929
9              254.6674                 64.33264
10             284.85348                -29.85348
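The residual output can be regenerated from the sample data. This is a minimal sketch in plain Python, assuming the least squares fit of house price on square feet; variable names are illustrative.

```python
# Sample data: y = house price, x = square feet
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Least squares fit
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar

# Predicted values and residuals (observed minus predicted)
predicted = [b0 + b1 * x for x in sqft]
residuals = [y - y_hat for y, y_hat in zip(prices, predicted)]

for i, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(f"{i:>2}  {y_hat:10.5f}  {e:10.5f}")
```

Plotting `residuals` against `sqft` gives the kind of residual plot used to check the linearity and constant variance assumptions; for a least squares fit with an intercept the residuals sum to zero.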
Chapter Summary