Professional Documents
Culture Documents
y y
x x
y y
x x
Scatter Plot Examples
(continued)
Strong relationships Weak relationships
y y
x x
y y
x x
Scatter Plot Examples
(continued)
No relationship
x
Correlation Coefficient
(continued)
x x x
r = -1 r = -.6 r=0
y y
x x
r = +.3 r = +1
Calculating the
Correlation Coefficient
Sample correlation coefficient:
r
( x x)( y y)
[ ( x x ) ][ ( y y ) ]
2 2
Tree n xy x y
Height, r
y 70 [n( x 2 ) ( x)2 ][n( y 2 ) ( y)2 ]
60
8(3142) (73)(321)
50
40
[8(713) (73)2 ][8(14111) (321) 2 ]
30
0.886
20
10
0
r = 0.886 → relatively strong positive
0 2 4 6 8 10 12 14
linear association between x and y
Trunk Diameter, x
Excel Output
Excel Correlation Output
Tools / data analysis / correlation…
Correlation between
Tree Height and Trunk Diameter
Significance Test for
Correlation
• Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
• Test statistic
r
t
2
– 1 r (with n – 2 degrees of
freedom)
n2
Example: Produce Stores
Is there evidence of a linear relationship
between tree height and trunk diameter at
the .05 level of significance?
r .886 Decision:
t 4.68
1 r 2 1 .886 2 Reject H0
y β0 β1x ε
Variable
y y β0 β1x ε
Observed Value
of y for xi
εi Slope = β1
Predicted Value
Random Error
of y for xi
for this x value
Intercept = β0
xi x
Estimated Regression Model
The sample regression line provides an estimate of
the population regression line
ŷ i b0 b1x variable
e 2
(y ŷ) 2
(y (b 0
2
b1x))
The Least Squares Equation
• The formulas for b1 and b0 are:
b1
( x x )( y y )
(x x) 2
algebraic
equivalent: and
xy x y
b1 n b0 y b1 x
x 2
( x ) 2
n
Interpretation of the
Slope and the Intercept
• b0 is the estimated average value of
y when the value of x is zero
ANOVA Significance
df SS MS F F
18934.934 11.084
Regression 1 18934.9348 8 8 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coefficien P- Upper
ts Standard Error t Stat value Lower 95% 95%
0.1289 232.0738
Graphical Presentation
• House price model: scatter plot and
regression line
450
400
House Price ($1000s)
350
Slope
300
250
= 0.10977
200
150
100
50
Intercept 0
= 98.248 0 500 1000 1500 2000 2500 3000
Square Feet
Xi x
Coefficient of Determination, R2
• The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by
variation in the independent variable
Coefficient of determination
2 SSR sum of squares explained by regression
R
SST total sum of squares
y
R2 = 1
x
R = +1
2
Examples of Approximate
R Values
2
y
0 < R2 < 1
x
Examples of Approximate
R Values
2
R2 = 0
y
No linear relationship
between x and y:
ANOVA Significance
df SS MS F F
18934.934 11.084
Regression 1 18934.9348 8 8 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coefficien P- Upper
ts Standard Error t Stat value Lower 95% 95%
0.1289 232.0738
Standard Error of Estimate
• The standard deviation of the variation of
observations around the regression line is
estimated by
SSE
s
n k 1
Where
SSE = Sum of squares error
n = Sample size
k = number of independent variables in the
model
The Standard Deviation of the
Regression Slope
• The standard error of the regression slope
coefficient (b1) is estimated by
sε sε
sb1
(x x) 2
x 2
( x)2
n
where:
sb1 = Estimate of the standard error of the least squares slope
SSE
sε = Sample standard error of the estimate
n2
Excel Output
Regression Statistics sε 41.33032
Multiple R 0.76211
R Square 0.58082
Adjusted R
Square 0.52842 sb1 0.03297
Standard Error 41.33032
Observations 10
ANOVA Significance
df SS MS F F
18934.934 11.084
Regression 1 18934.9348 8 8 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coefficien P- Upper
ts Standard Error t Stat value Lower 95% 95%
0.1289 232.0738
Comparing Standard Errors
Variation of observed y values Variation in the slope of regression
from the regression line lines from different possible samples
y y
y y
2
1 (x p x)
ŷ t /2sε
n (x x) 2
Confidence Interval for
an Individual y, Given x
Confidence interval estimate for an
Individual value of y given a particular xp
2
1 (x p x)
ŷ t /2 sε 1
n (x x) 2
Confidence
Interval for
+ b x the mean of
y = b0
1
y, given xp
x
x xp
Example: House Prices
House Price Estimated Regression Equation:
Square Feet
in $1000s
(x)
(y) house price 98.25 0.1098 (sq.ft.)
245 1400
312 1600
279 1700 Predict the price for a house
308 1875
with 2000 square feet
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Example: House Prices (continued)
Predict the price for a house
with 2000 square feet:
98.25 0.1098(2000)
317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
Estimation of Mean Values:
Example
Confidence Interval Estimate for E(y)|xp
Find the 95% confidence interval for the average
price of 2,000 square-foot houses
Predicted Price Yi = 317.85 ($1,000s)
1 (x p x)2
ŷ t α/2 sε 317.85 37.12
n (x x) 2
1 (x p x)2
ŷ t α/2 sε 1 317.85 102.28
n (x x) 2
y y
x x
residuals
x residuals x
Not Linear
Linear
Residual Analysis for
Constant Variance
y y
x x
residuals
x residuals x
RESIDUAL OUTPUT
House Price Model Residual Plot
Predicted
House
80
Price Residuals
1 251.92316 -6.923162 60
3 284.85348 -5.853484 20
4 304.06284 3.937162 0
0 1000 2000 3000
5 218.99284 -19.99284 -20
9 254.6674 64.33264
10 284.85348 -29.85348
Summary
• Introduced correlation analysis
• Discussed correlation to measure the
strength of a linear association
• Introduced simple linear regression analysis
• Calculated the coefficients for the simple
linear regression equation
• measures of variation (R2 and sε)
• Addressed assumptions of regression and
correlation
Summary
(continued)