
Chapter 17

Simple Linear Regression and Correlation
17.1 Introduction
In this chapter we employ regression analysis to examine the relationship among quantitative variables.
The technique is used to predict the value of one variable (the dependent variable, y) based on the value of other variables (the independent variables x1, x2, ..., xk).
17.2 The Model
The first-order linear model
$y = \beta_0 + \beta_1 x + \varepsilon$
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (β1 = Rise/Run)
ε = error variable
β0 and β1 are unknown and are therefore estimated from the data.
[Figure: a straight line with y-intercept β0 and slope β1 = Rise/Run]
17.3 Estimating the Coefficients
The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics,
producing a straight line that cuts into the data.
The question is: Which straight line fits best?
[Figure: scatter diagram of y against x with a candidate straight line through the points]
The best line is the one that minimizes the sum of squared vertical differences between the points and the line.
Let us compare two lines for the data points (1,2), (2,4), (3,1.5), (4,3.2).
First line (taking the values 1, 2, 3, 4 at x = 1, 2, 3, 4):
Sum of squared differences = (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89
Second line (horizontal, at y = 2.5):
Sum of squared differences = (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
[Figure: the four data points with the two candidate lines]
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:
$b_1 = \frac{\operatorname{cov}(X,Y)}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
The regression equation that estimates the equation of the first-order linear model is:
$\hat{y} = b_0 + b_1 x$
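As a minimal sketch of these formulas, the following Python snippet computes b0 and b1 for the four points used in the two-line comparison, (1,2), (2,4), (3,1.5), (4,3.2), and then evaluates the sum of squared differences for the resulting line. The function names `least_squares_line` and `sum_squared_differences` are illustrative assumptions, not part of the original material.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) from the sample covariance and the sample variance of x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x          # slope: cov(X, Y) / s_x^2
    b0 = y_bar - b1 * x_bar      # intercept: y_bar - b1 * x_bar
    return b0, b1


def sum_squared_differences(xs, ys, b0, b1):
    """Sum of squared vertical differences between the points and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# The four points from the comparison of two lines.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]

b0, b1 = least_squares_line(xs, ys)
sse = sum_squared_differences(xs, ys, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sum of squared differences = {sse:.3f}")
```

For these four points the least squares line gives a sum of squared differences below the 3.99 obtained for the horizontal line at 2.5, as expected of the minimizing line.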
Example 17.1 Relationship between odometer reading and a used car's selling price.
A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
A random sample of 100 cars is selected, and the data recorded.
Find the regression line.

Car   Odometer   Price
1     37388      5318
2     44758      5061
3     45833      5008
4     30862      5795
5     31705      5784
6     34010      5359
.     .          .
.     .          .
.     .          .

Independent variable x: odometer reading
Dependent variable y: selling price
Solution
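Solving by hand
To calculate b0 and b1 we need to calculate several statistics first:
$\bar{x} = 36,009.45$; $s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = 43,528,688$
$\bar{y} = 5,411.41$; $\operatorname{cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} = -1,356,256$
where n = 100.
$b_1 = \frac{\operatorname{cov}(X,Y)}{s_x^2} = \frac{-1,356,256}{43,528,688} = -.0312$
$b_0 = \bar{y} - b_1 \bar{x} = 5411.41 - (-.0312)(36,009.45) = 6,533$
$\hat{y} = b_0 + b_1 x = 6,533 - .0312x$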
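The hand calculation can be checked with a few lines of Python. This is only a sketch that plugs in the summary statistics quoted above; the negative sign on the covariance is taken from the negative slope reported in the spreadsheet output for this example, and the variable names are illustrative.

```python
# Summary statistics quoted in the hand solution of Example 17.1.
n = 100
x_bar = 36009.45        # mean odometer reading
y_bar = 5411.41         # mean selling price
var_x = 43528688.0      # s_x^2
cov_xy = -1356256.0     # cov(X, Y); negative because price falls as mileage rises

b1 = cov_xy / var_x             # slope
b0 = y_bar - b1 * x_bar         # intercept

print(f"b1 = {b1:.4f}")
print(f"b0 = {b0:.0f}")
```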
Using the computer (see file Xm17-01.xls)
Tools > Data analysis > Regression > [Shade the y range and the x range] > OK
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.806308
R Square            0.650132
Adjusted R Square   0.646562
Standard Error      151.5688
Observations        100

ANOVA
             df   SS        MS         F          Significance F
Regression   1    4183528   4183528    182.1056   4.4435E-24
Residual     98   2251362   22973.09
Total        99   6434890

             Coefficients   Standard Error   t Stat     P-value
Intercept    6533.383       84.51232         77.30687   1.22E-89
Odometer     -0.03116       0.002309         -13.4947   4.44E-24

[Figure: scatter diagram of Price against Odometer with the fitted line ŷ = 6,533 - .0312x]
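[Figure: Price against Odometer with the fitted line extended back to x = 0, where it meets the axis at 6533; there are no data between 0 and about 19,000 miles]

ŷ = 6,533 - .0312x

The intercept is b0 = 6,533. Do not interpret the intercept as the price of cars that have not been driven, because the sample contains no cars with odometer readings near zero.
The slope of the line is b1 = -.0312. For each additional mile on the odometer, the price decreases by an average of $0.0312.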
17.4 Error Variable: Required Conditions

The error ε is a critical part of the regression model.
Four requirements involving the distribution of ε must be satisfied.
The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σε for all values of x.
The errors associated with different values of y are all independent.
From the first three assumptions we have:
y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σε.
The standard deviation remains constant, but the mean value changes with x:
E(y|x1) = β0 + β1x1 = μ1
E(y|x2) = β0 + β1x2 = μ2
E(y|x3) = β0 + β1x3 = μ3
[Figure: normal curves with equal spread centered on the line at x1, x2 and x3]
17.5 Assessing the Model
The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
Consequently, it is important to assess how well
the linear model fits the data.
Several methods are used to assess the model:
Testing and/or estimating the coefficients.
Using descriptive measurements.
Sum of squares for errors
This is the sum of squared differences between the points and the regression line.
It can serve as a measure of how well the line fits the data.
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
A shortcut formula is
$SSE = (n-1)\left( s_y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2} \right)$
This statistic plays a role in every statistical technique we employ to assess the model.
Standard error of estimate
The mean error is equal to zero.
If σε is small, the errors tend to be close to zero (close to the mean error). Then the model fits the data well.
Therefore, we can use σε as a measure of the suitability of using a linear model.
An unbiased estimator of σε^2 is given by sε^2.
Standard error of estimate:
$s_\varepsilon = \sqrt{\frac{SSE}{n-2}}$
Example 17.2
Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.
Solution
$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} = \frac{6,434,890}{99} = 64,999$
With cov(X,Y) and s_x^2 calculated before,
$SSE = (n-1)\left( s_y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2} \right) = 99\left( 64,999 - \frac{(-1,356,256)^2}{43,528,688} \right) = 2,251,363$
Thus,
$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{2,251,363}{98}} = 151.6$
It is hard to assess the model based on sε even when compared with the mean value of y:
sε = 151.6, ȳ = 5,411.4
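The shortcut calculation can be reproduced as follows. This is a sketch that uses the summary statistics of Example 17.1; the variable names are illustrative, and small differences from the rounded figures in the hand solution come from rounding s_y^2.

```python
import math

# Summary statistics from Example 17.1.
n = 100
total_ss = 6434890.0            # sum of (y_i - y_bar)^2
var_y = total_ss / (n - 1)      # s_y^2
var_x = 43528688.0              # s_x^2
cov_xy = -1356256.0             # cov(X, Y)

# Shortcut formula: SSE = (n - 1) * (s_y^2 - cov(X, Y)^2 / s_x^2)
sse = (n - 1) * (var_y - cov_xy ** 2 / var_x)

# Standard error of estimate: sqrt(SSE / (n - 2))
s_eps = math.sqrt(sse / (n - 2))

print(f"SSE = {sse:,.0f}")
print(f"standard error of estimate = {s_eps:.1f}")
```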
Testing the slope
When no linear relationship exists between two variables, the regression line should be horizontal.
[Figure: two scatter diagrams, one with a sloping line and one with a horizontal line]
Linear relationship: different inputs (x) yield different outputs (y). The slope is not equal to zero.
No linear relationship: different inputs (x) yield the same output (y). The slope is equal to zero.
We can draw inference about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)
The test statistic is
$t = \frac{b_1 - \beta_1}{s_{b_1}}$ where $s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$
is the standard error of b1.
If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n - 2.
Solution
Solving by hand
To compute t we need the values of b1 and sb1.
b1 = -.0312
$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}} = \frac{151.6}{\sqrt{(99)(43,528,688)}} = .00231$
$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-.0312 - 0}{.00231} = -13.49$
There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Using the computer
            Coefficients    Standard Error   t Stat     P-value
Intercept   6533.383035     84.51232199      77.30687   1.22E-89
Odometer    -0.031157739    0.002308896      -13.4947   4.44E-24
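A short sketch of the slope test for the odometer data is given below. It uses the rounded values from the hand solution, so the t statistic comes out near -13.5 rather than exactly the -13.4947 of the spreadsheet output. The critical value 1.984 is the t.025,98 quoted in Examples 17.6 and 17.7; the variable names are illustrative.

```python
import math

# Rounded inputs from the hand solution of Example 17.1.
n = 100
b1 = -0.0312            # estimated slope
s_eps = 151.6           # standard error of estimate
var_x = 43528688.0      # s_x^2
t_critical = 1.984      # t_{.025, 98}, taken from the text

# Standard error of b1 and the test statistic for H0: beta1 = 0.
s_b1 = s_eps / math.sqrt((n - 1) * var_x)
t = (b1 - 0) / s_b1

# Two-tail test at the 5% significance level.
reject_h0 = abs(t) > t_critical

print(f"s_b1 = {s_b1:.5f}, t = {t:.2f}, reject H0: {reject_h0}")
```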
Coefficient of determination
When we want to measure the strength of the linear relationship, we use the coefficient of determination.
$R^2 = \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2 s_y^2}$ or $R^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2}$
To understand the significance of this coefficient note:
The overall variability in y is explained in part by the regression model and remains in part unexplained (the error).
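Two data points (x1,y1) and (x2,y2) of a certain sample are shown.
[Figure: the two points, the regression line, and the horizontal line at ȳ]
Total variation in y = Variation explained by the regression line + Unexplained variation (error)
$(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2$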
Variation in y = SSR + SSE
R2 measures the proportion of the variation in y that is explained by the variation in x.
$R^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2} = \frac{\sum (y_i - \bar{y})^2 - SSE}{\sum (y_i - \bar{y})^2} = \frac{SSR}{\sum (y_i - \bar{y})^2}$
R2 takes on any value between zero and one.
R2 = 1: Perfect match between the line and the data points.
R2 = 0: There is no linear relationship between x and y.
Example 17.4
Find the coefficient of determination for Example 17.1; what does this statistic tell you about the model?
Solution
Solving by hand:
$R^2 = \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2 s_y^2} = \frac{[-1,356,256]^2}{(43,528,688)(64,999)} = .6501$
Using the computer
From the regression output we have
Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      151.57
Observations        100
65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
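Both forms of the coefficient of determination can be checked against each other. The sketch below uses the summary statistics and the ANOVA sums of squares reported for Example 17.1; the variable names are illustrative.

```python
# Summary statistics and sums of squares from Example 17.1.
var_x = 43528688.0      # s_x^2
var_y = 64999.0         # s_y^2 (rounded, as in the hand solution)
cov_xy = -1356256.0     # cov(X, Y)
sse = 2251362.0         # residual sum of squares from the ANOVA table
total_ss = 6434890.0    # total sum of squares from the ANOVA table

# R^2 from the covariance form.
r2_from_cov = cov_xy ** 2 / (var_x * var_y)

# R^2 from the sums of squares: 1 - SSE / sum (y_i - y_bar)^2.
r2_from_sse = 1 - sse / total_ss

print(f"R^2 (covariance form)     = {r2_from_cov:.4f}")
print(f"R^2 (sum of squares form) = {r2_from_sse:.4f}")
```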
17.6 Finance Application: Market Model

One of the most important applications of linear regression is the market model.
It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm).
$R = \beta_0 + \beta_1 R_m + \varepsilon$
R = rate of return on a particular stock
Rm = rate of return on some major stock index
The beta coefficient (β1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
Example 17.5 The market model
Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange.
Data consisted of monthly percentage return for Nortel and monthly percentage return for all the stocks.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.560079
R Square            0.313688
Adjusted R Square   0.301855
Standard Error      0.063123
Observations        60

ANOVA
             df   SS         MS         F          Significance F
Regression   1    0.10563    0.10563    26.50969   3.27E-06
Residual     58   0.231105   0.003985
Total        59   0.336734

             Coefficients   Standard Error   t Stat     P-value
Intercept    0.012818       0.008223         1.558903   0.12446
TSE          0.887691       0.172409         5.148756   3.27E-06

The TSE coefficient is a measure of the stock's market-related risk. In this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.
R Square is a measure of the total risk embedded in the Nortel stock that is market-related. Specifically, 31.37% of the variation in Nortel's return is explained by the variation in the TSE's returns.
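The figures in this output are tied together by the formulas of Section 17.5. As a sketch, the following snippet recomputes the t statistic for the beta coefficient and the R Square value from the other entries of the table; the variable names are illustrative.

```python
# Entries taken from the Nortel regression output.
beta_estimate = 0.887691    # TSE coefficient
beta_std_error = 0.172409   # standard error of the TSE coefficient
ss_regression = 0.10563     # regression sum of squares
ss_total = 0.336734         # total sum of squares

# t statistic for H0: beta1 = 0 is the coefficient divided by its standard error.
t_stat = beta_estimate / beta_std_error

# R^2 is the share of the total variation explained by the regression.
r_square = ss_regression / ss_total

print(f"t = {t_stat:.4f}")
print(f"R^2 = {r_square:.4f}")
```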
17.7 Using the Regression Equation
Before using the regression model, we need to assess how well it fits the data.
If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
Illustration
Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1).
$\hat{y} = 6533 - .0312x = 6533 - .0312(40,000) = 5,285$
Prediction interval and confidence interval
Two intervals can be used to discover how closely the predicted value will match the true value of y.
Prediction interval - for a particular value of y,
Confidence interval - for the expected value of y.
The prediction interval:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The prediction interval is wider than the confidence interval.
Example 17.6 Interval estimates for the car auction price
Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
Solution
The dealer would like to predict the price of a single car.
The prediction interval (95%), with t.025,98 = 1.984:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$[6533 - .0312(40000)] \pm 1.984(151.6)\sqrt{1 + \frac{1}{100} + \frac{(40,000 - 36,009)^2}{4,309,340,160}} = 5,285 \pm 303$
The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.
Solution
The dealer needs to estimate the mean price per car.
The confidence interval (95%):
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$[6533 - .0312(40000)] \pm 1.984(151.6)\sqrt{\frac{1}{100} + \frac{(40,000 - 36,009)^2}{4,309,340,160}} = 5,285 \pm 35$
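The two interval estimates differ only by the extra 1 under the square root. The following sketch computes both half-widths for a car with 40,000 miles, using the rounded inputs of Example 17.6; the function name `interval_half_widths` and its parameter names are illustrative assumptions.

```python
import math


def interval_half_widths(x_g, n, x_bar, sum_sq_x, s_eps, t_crit):
    """Return (prediction, confidence) half-widths at the given value x_g."""
    # Common term: 1/n + (x_g - x_bar)^2 / sum (x_i - x_bar)^2
    core = 1 / n + (x_g - x_bar) ** 2 / sum_sq_x
    prediction = t_crit * s_eps * math.sqrt(1 + core)   # single value of y
    confidence = t_crit * s_eps * math.sqrt(core)       # expected value of y
    return prediction, confidence


# Rounded inputs as used in the worked example.
b0, b1 = 6533, -0.0312
x_g = 40000
y_hat = b0 + b1 * x_g

pred, conf = interval_half_widths(
    x_g, n=100, x_bar=36009, sum_sq_x=4309340160, s_eps=151.6, t_crit=1.984
)

print(f"point prediction: {y_hat:,.0f}")
print(f"prediction interval: {y_hat:,.0f} +/- {pred:.0f}")
print(f"confidence interval: {y_hat:,.0f} +/- {conf:.0f}")
```

Calling the same function with x_g equal to the sample mean makes the second term vanish, which gives the shortest interval of each kind.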
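The effect of the given value of x on the interval
As xg moves away from x̄ the interval becomes longer. That is, the shortest interval is found at xg = x̄.
With ŷ = b0 + b1xg, the confidence interval is
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval when xg = x̄:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n}}$
The confidence interval when xg = x̄ ± 1:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{1^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval when xg = x̄ ± 2:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{2^2}{\sum (x_i - \bar{x})^2}}$
[Figure: confidence bands around the regression line, narrowest at x̄ and widening at x̄ ± 1 and x̄ ± 2]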
17.8 Coefficient of correlation
The coefficient of correlation is used to measure the
strength of association between two variables.
The coefficient values range between -1 and 1.
If r = -1 (negative association) or r = +1 (positive
association) every point falls on the regression line.
If r = 0 there is no linear pattern.
The coefficient can be used to test for a linear relationship between two variables.
Testing the coefficient of correlation
When there is no linear relationship between two variables, ρ = 0.
The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0
The test statistic is:
$t = r \sqrt{\frac{n-2}{1 - r^2}}$
The statistic is Student t distributed with d.f. = n - 2, provided the variables are bivariate normally distributed.
Here r is the sample coefficient of correlation, calculated by
$r = \frac{\operatorname{cov}(X,Y)}{s_x s_y}$
[Figure: scatter diagram of Y against X]
Example 17.7 Testing for linear relationship
Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 17.1.
Solution
We test
H0: ρ = 0
H1: ρ ≠ 0
Solving by hand:
The rejection region is |t| > tα/2,n-2 = t.025,98, which is approximately 1.984.
The sample coefficient of correlation is r = cov(X,Y)/(sx sy) = -.806.
The value of the t statistic is
$t = r \sqrt{\frac{n-2}{1 - r^2}} = -13.49$
Conclusion: There is sufficient evidence at α = 5% to infer that there is a linear relationship between the two variables.
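The test statistic can be reproduced directly from r and n. The sketch below uses the rounded r = -.806, so the result agrees with -13.49 only to about two decimal places; the variable names are illustrative.

```python
import math

n = 100
r = -0.806              # sample coefficient of correlation (rounded)
t_critical = 1.984      # t_{.025, 98}, as quoted in the example

# Test statistic for H0: rho = 0.
t = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-tail rejection region: |t| > t_critical.
reject_h0 = abs(t) > t_critical

print(f"t = {t:.2f}, reject H0: {reject_h0}")
```

This is the same value, up to rounding, as the t statistic for the slope in the same data.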
Spearman rank correlation coefficient

The Spearman rank test is used to test whether a relationship exists between variables in cases where
at least one variable is ranked, or
both variables are quantitative but the normality requirement is not satisfied.
The hypotheses are:
H0: ρs = 0
H1: ρs ≠ 0
The test statistic is
$r_s = \frac{\operatorname{cov}(a,b)}{s_a s_b}$
a and b are the ranks of the data.
For a large sample (n > 30) rs is approximately normally distributed:
$z = r_s \sqrt{n-1}$
Example 17.8

A production manager wants to examine the relationship between
aptitude test score given prior to hiring, and
performance rating three months after starting work.
A random sample of 20 production workers was selected. The test scores as well as the performance ratings were recorded.
Employee   Aptitude test   Performance rating
1          59              3
2          47              2
3          58              4
4          66              3
5          77              2
.          .               .
.          .               .
.          .               .
Aptitude test scores range from 0 to 100. Performance ratings range from 1 to 5.

Solution
The problem objective is to analyze the relationship between two variables.
Performance rating is ranked.
The hypotheses are:
H0: ρs = 0
H1: ρs ≠ 0
The test statistic is rs, and the rejection region is |rs| > rcritical (taken from the Spearman rank correlation table).
Employee   Aptitude test   Rank(a)   Performance rating   Rank(b)
1          59              9         3                    10.5
2          47              3         2                    3.5
3          58              8         4                    17
4          66              14        3                    10.5
5          77              20        2                    3.5
.          .               .         .                    .
.          .               .         .                    .
.          .               .         .                    .
Ties are broken by averaging the ranks.

Solving by hand
Rank each variable separately.
Calculate sa = 5.92; sb = 5.50; cov(a,b) = 12.34.
Thus rs = cov(a,b)/[sa sb] = .379.
The critical value for α = .05 and n = 20 is .450.
Conclusion: Do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.
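The ranking step with averaged ties, and the final division, can be sketched as follows. Only five of the 20 employees are listed in the table, so the ranks printed for that subset are not the Rank(b) values above, which depend on the full sample; rs itself is recomputed from the quoted summary values. The function name `average_ranks` is an illustrative assumption.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at position pos.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1   # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


# Performance ratings of the five listed employees only (partial data).
performance_subset = [3, 2, 4, 3, 2]
print(average_ranks(performance_subset))

# rs from the summary values quoted in the hand solution (full sample, n = 20).
s_a, s_b, cov_ab = 5.92, 5.50, 12.34
r_s = cov_ab / (s_a * s_b)
critical_value = 0.450
reject_h0 = abs(r_s) > critical_value
print(f"r_s = {r_s:.3f}, reject H0: {reject_h0}")
```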
17.9 Regression Diagnostics - I
The three conditions required for the validity of
the regression analysis are:
The error variable is normally distributed.
The error variance is constant for all values of x.
The errors are independent of each other.
How can we diagnose violations of these conditions?
Residual Analysis

Examining the residuals (or standardized residuals), we can identify violations of the required conditions.
Example 17.1 - continued
Nonnormality.
Use Excel to obtain the standardized residual histogram.
Examine the histogram and look for a bell-shaped diagram with mean close to zero.
RESIDUAL OUTPUT (a partial list of standard residuals)
Observation   Residuals       Standard Residuals
1             -50.45749927    -0.334595895
2             -77.82496482    -0.516076186
3             -97.33039568    -0.645421421
4             223.2070978     1.480140312
5             238.4730715     1.58137268

For each residual we calculate the standard deviation as follows:
$s_{r_i} = s_\varepsilon \sqrt{1 - h_i}$ where $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$
Standardized residual i = Residual i / Standard deviation

[Figure: histogram of the standardized residuals, with bins from -2.5 to 2.5 and More, and frequencies from 0 to 40]
We can also apply the Lilliefors test or the χ2 test of normality.
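As a sketch of the standardization formula, the snippet below recomputes the standardized residual of observation 1 (odometer reading 37,388, residual -50.457) from the summary statistics of Example 17.1. The result agrees with the listed -0.3346 to about three decimal places; the spreadsheet may scale residuals slightly differently, which is an assumption rather than something the source states. The function name `standardized_residual` is illustrative.

```python
import math


def standardized_residual(residual, x_i, n, x_bar, sum_sq_x, s_eps):
    """Residual divided by its standard deviation s_eps * sqrt(1 - h_i)."""
    h_i = 1 / n + (x_i - x_bar) ** 2 / sum_sq_x
    return residual / (s_eps * math.sqrt(1 - h_i))


# Observation 1 of Example 17.1 and the sample summary statistics.
n = 100
x_bar = 36009.45
sum_sq_x = (n - 1) * 43528688.0     # sum (x_j - x_bar)^2 = (n - 1) * s_x^2
s_eps = 151.5688

value = standardized_residual(-50.45749927, 37388, n, x_bar, sum_sq_x, s_eps)
print(f"standardized residual of observation 1 = {value:.4f}")
```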
Heteroscedasticity

When the requirement of a constant variance is violated we have heteroscedasticity.
[Figure: residuals plotted against ŷ; the spread increases with ŷ]
When the requirement of a constant variance is not violated we have homoscedasticity.
[Figure: residuals plotted against ŷ; the spread of the data points does not change much]
[Figure: a second plot of residuals against ŷ, with an even spread of the residuals]
As far as the even spread is concerned, this is a much better situation.
Nonindependence of error variables

A time series is constituted if data were collected over time.
Examining the residuals over time, no pattern should be observed if the errors are independent.
When a pattern is detected, the errors are said to be autocorrelated.
Autocorrelation can be detected by graphing the residuals against time.
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure, left panel: residuals against time. Note the runs of positive residuals, replaced by runs of negative residuals.]
[Figure, right panel: residuals against time. Note the oscillating behavior of the residuals around zero.]
Outliers
An outlier is an observation that is unusually small or
large.
Several possibilities need to be investigated when an
outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
Identify outliers from the scatter diagram.
It is customary to suspect that an observation is an outlier if its |standard residual| > 2.
[Figure, left panel: an outlier. The outlier causes a shift in the regression line.]
[Figure, right panel: an influential observation. Some outliers may be very influential.]
Procedure for regression diagnostics

Develop a model that has a theoretical basis.
Gather data for the two variables in the model.
Draw the scatter diagram to determine whether a linear model appears to be appropriate.
Check the required conditions for the errors.
Assess the model fit.
If the model fits the data, use the regression equation.