
Business Statistics - Session 9

Correlation and Simple Regression


Bivariate Data

Often a manager or decision maker is interested in the relationship between two variables.

Two descriptive measures of the relationship between two variables are the covariance and the correlation coefficient.
Bivariate Data

Examples:

1. Is there any relationship between the variation in the time spent on training an employee and the performance of salespeople on the job?

2. Is there any relationship between the variation in the amounts of money spent on advertising in an area and the sales volume in that area?

3. Is there any relationship between the variation in the frequency of numbers that are played in a lottery and the relative frequency of numbers that win in the lottery?
Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.


Covariance

The covariance is computed as follows:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad \text{for samples}$$

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \quad \text{for populations}$$
Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
Correlation Coefficient

The correlation coefficient is computed as follows:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} \quad \text{for samples} \qquad \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{for populations}$$

where $s_{xy}$ is the sample covariance and $\sigma_{xy}$ is the population covariance.
Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The closer the correlation is to zero, the weaker the relationship.
Covariance and Correlation Coefficient

Example: Golfing Study

A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving Distance (yds.)   Average 18-Hole Score
            277.6                          69
            259.5                          71
            269.1                          70
            267.0                          70
            255.6                          71
            272.9                          69
Covariance and Correlation Coefficient

Example: Golfing Study

    x        y     (xi - x̄)   (yi - ȳ)   (xi - x̄)(yi - ȳ)
  277.6     69      10.65      -1.0         -10.65
  259.5     71      -7.45       1.0          -7.45
  269.1     70       2.15       0              0
  267.0     70       0.05       0              0
  255.6     71     -11.35       1.0         -11.35
  272.9     69       5.95      -1.0          -5.95

Average    267.0   70.0                Total -35.40
Std. Dev.  8.2192  0.8944
Covariance and Correlation Coefficient

Example: Golfing Study

Sample Covariance:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$

Sample Correlation Coefficient:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(0.8944)} = -0.9631$$
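As a quick check, here is a minimal Python sketch that reproduces these values (Python and NumPy are assumptions; the slides work by hand and in Excel):

```python
import numpy as np

# Golfing study data: average driving distance (yds.) and average 18-hole score
distance = np.array([277.6, 259.5, 269.1, 267.0, 255.6, 272.9])
score = np.array([69, 71, 70, 70, 71, 69], dtype=float)

n = len(distance)
# Sample covariance: sum of deviation products divided by n - 1
s_xy = np.sum((distance - distance.mean()) * (score - score.mean())) / (n - 1)

# Sample correlation: covariance divided by the product of sample std devs (ddof=1)
r_xy = s_xy / (distance.std(ddof=1) * score.std(ddof=1))

print(s_xy)  # -7.08
print(r_xy)  # approximately -0.9631
```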
Simple Linear Regression Model

The equation that describes how y is related to x and an error term is called the regression model.

The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model, and
ε is a random variable called the error term.
Simple Linear Regression Equation

Positive Linear Relationship

[Figure: E(y) versus x; the regression line has intercept β0 and a positive slope β1]
Simple Linear Regression Equation

Negative Linear Relationship

[Figure: E(y) versus x; the regression line has intercept β0 and a negative slope β1]
Simple Linear Regression Equation

No Relationship

[Figure: E(y) versus x; the regression line is horizontal, with intercept β0 and slope β1 = 0]
Types of Relationships

[Figure: four panels of Y versus X scatter plots, two showing linear relationships and two showing curvilinear relationships]
Estimated Simple Linear Regression Equation

The estimated simple linear regression equation is

$$\hat{y} = b_0 + b_1 x$$

The graph is called the estimated regression line.

b0 is the y-intercept of the line.
b1 is the slope of the line.
ŷ is the estimated value of y for a given x value.
Least Squares Method

• Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2$$

where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
Least Squares Method

• Slope for the Estimated Regression Equation

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
Least Squares Method

• y-Intercept for the Estimated Regression Equation

$$b_0 = \bar{y} - b_1 \bar{x}$$

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = total number of observations
Simple Linear Regression

Example: Reed Auto Sales

Reed Auto periodically has a special week-long sale. As part of the advertising campaign, Reed runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown on the next slide.
Simple Linear Regression

Example: Reed Auto Sales

Number of TV Ads   Number of Cars Sold
       1                  14
       3                  24
       2                  18
       1                  17
       3                  27
Estimated Regression Equation

Slope for the Estimated Regression Equation:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{20}{4} = 5$$

y-Intercept for the Estimated Regression Equation:

$$b_0 = \bar{y} - b_1 \bar{x} = 20 - 5(2) = 10$$

Estimated Regression Equation:

$$\hat{y} = 10 + 5x$$
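The least-squares formulas above are easy to verify in code. A minimal sketch, again assuming Python with NumPy:

```python
import numpy as np

# Reed Auto data: number of TV ads (x) and cars sold (y)
x = np.array([1, 3, 2, 1, 3], dtype=float)
y = np.array([14, 24, 18, 17, 27], dtype=float)

# Least-squares slope and intercept from the formulas on the previous slides
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b1, b0)  # 5.0 10.0, i.e. y-hat = 10 + 5x
```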
Scatter Diagram and Trend Line

[Figure: scatter diagram of Cars Sold versus TV Ads with the fitted trend line y = 5x + 10]
Simple Linear Regression Model (continued)

[Figure: Y versus X for one observation, showing the observed value of Y for Xi, the predicted value of Y for Xi on the regression line, the random error εi between them, the slope β, and the intercept α]
Coefficient of Determination

• Relationship Among SST, SSR, SSE

SST = SSR + SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error (the difference between the observed values and what we predict from the line of best fit)
Coefficient of Determination

The coefficient of determination is:

r2 = SSR/SST

where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination

Example: Reed Auto Sales

r2 = SSR/SST = 100/114 = .8772

The regression relationship is very strong; about 88% of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of Determination}}$$

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

where:
b1 = the slope of the estimated regression equation $\hat{y} = b_0 + b_1 x$
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

The sign of b1 in the equation ŷ = 10 + 5x is "+".

$$r_{xy} = +\sqrt{.8772} = +.9366$$
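The sums of squares behind these numbers can be checked with the same hypothetical NumPy setup as before:

```python
import numpy as np

# Reed Auto data and fitted values from y-hat = 10 + 5x
x = np.array([1, 3, 2, 1, 3], dtype=float)
y = np.array([14, 24, 18, 17, 27], dtype=float)
y_hat = 10 + 5 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares: 114
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares: 100
sse = np.sum((y - y_hat) ** 2)         # error sum of squares: 14 = SST - SSR

r_squared = ssr / sst                  # 0.8772
r = np.sign(5) * np.sqrt(r_squared)    # +0.9366, sign taken from b1

print(sst, ssr, sse, r_squared, r)
```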
Simple Linear Regression Example: Data

House Price in $1000s (Y)   Square Feet (X)
         245                    1400
         312                    1600
         279                    1700
         308                    1875
         199                    1100
         219                    1550
         405                    2350
         324                    2450
         319                    1425
         255                    1700
Simple Linear Regression Example: Scatter Plot

House price model: Scatter Plot

[Figure: scatter plot of House Price ($1000s) versus Square Feet]
Simple Linear Regression Example: Using Excel

[Figure: Excel screenshot showing how the regression is run]
Simple Linear Regression Example: Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df    SS           MS           F        Significance F
Regression    1    18934.9348   18934.9348   11.0848  0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
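The same output can be reproduced outside Excel. A minimal sketch, assuming Python with NumPy and statsmodels (tools the slides themselves do not use):

```python
import numpy as np
import statsmodels.api as sm

# House price data from the earlier slide: price in $1000s (Y), size in square feet (X)
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)

# Ordinary least squares with an intercept term
model = sm.OLS(price, sm.add_constant(sqft)).fit()

print(model.params)    # [98.24833, 0.10977] -> house price = 98.24833 + 0.10977 (square feet)
print(model.rsquared)  # 0.58082, matching Excel's "R Square"
```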
Simple Linear Regression Example: Graphical Representation

House price model: Scatter Plot and Prediction Line

[Figure: scatter plot of House Price ($1000s) versus Square Feet with the fitted prediction line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)
Simple Linear Regression Example: Interpretation of b0

house price = 98.24833 + 0.10977 (square feet)

• b0 is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

• Because a house cannot have a square footage of 0, b0 has no practical application here
Simple Linear Regression Example: Interpreting b1

house price = 98.24833 + 0.10977 (square feet)

• b1 estimates the change in the mean value of Y as a result of a one-unit increase in X

• Here, b1 = 0.10977 tells us that the mean price of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Simple Linear Regression Example: Making Predictions

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 × $1000 = $317,850.
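The same arithmetic in a few lines of Python (a standalone sketch using the rounded coefficients from this slide; the unrounded coefficients give about 317.79 rather than 317.85):

```python
# Prediction from the fitted house price model, using the slide's rounded coefficients
b0, b1 = 98.25, 0.1098   # intercept and slope, rounded as on the slide
sqft = 2000

price_thousands = b0 + b1 * sqft
print(price_thousands)          # 317.85 ($1000s)
print(price_thousands * 1000)   # 317850.0, i.e. $317,850
```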
Simple Linear Regression Example: Making Predictions

• When using a regression model for prediction, only predict within the relevant range of data

[Figure: scatter plot of House Price ($1000s) versus Square Feet marking the relevant range for interpolation; do not try to extrapolate beyond the range of observed X's]
Coefficient of Determination, r2

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

• The coefficient of determination is also called r-squared and is denoted as r2

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

note: $0 \le r^2 \le 1$
Simple Linear Regression Example: Coefficient of Determination, r2 in Excel

$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$

58.08% of the variation in house prices is explained by variation in square feet.

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df    SS           MS           F        Significance F
Regression    1    18934.9348   18934.9348   11.0848  0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
MCQ

1. Suppose that we know the height of a student but do not know her weight. We use an estimating equation to determine an estimate of her weight based on her height. We can therefore surmise that:
(a) Weight is the independent variable.
(b) Height is the dependent variable.
(c) The relationship between weight and height is an inverse one.
(d) None of these.
(e) (b) and (c) but not (a).

2. Suppose the estimating equation Y = 5 - 2X has been calculated for a set of data. Which of the following is true for this situation?
(a) The Y-intercept of the line is 2.
(b) The slope of the line is negative.
(c) The line represents an inverse relationship.
(d) All of these.
(e) (b) and (c) but not (a).
Multiple Regression Model

The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is called the multiple regression model.

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where:
β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term
Multiple Regression Model

Example: Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
Multiple Regression Model

Exper. Score Salary Exper. Score Salary


4 78 24 9 88 38
7 100 43 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29
10 84 38 8 87 34
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30 3 70 28.2
6 91 33 3 89 30
Multiple Regression Model

Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1x1 + β2x2 + ε

where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of b0, b1, b2

[Diagram: the input data (x1, x2, y) are fed into a computer package for solving multiple regression problems, which outputs the least squares estimates b0, b1, b2, along with R2, etc.]
Solving for the Estimates of b0, b1, b2

Excel Worksheet (showing partial data entered)

Programmer   Experience (yrs)   Test Score   Salary ($K)
    1               4               78          24.0
    2               7              100          43.0
    3               1               86          23.7
    4               5               82          34.3
    5               8               86          35.8
    6              10               84          38.0
    7               0               75          22.2
    8               1               80          23.1
Solving for the Estimates of b0, b1, b2

[Figure: Excel's Regression dialog box]
Solving for the Estimates of b0, b1, b2

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478
Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
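As a cross-check on the Excel output, here is a minimal sketch assuming Python with NumPy and statsmodels (not tools the slides use) to fit the same model to the 20 observations:

```python
import numpy as np
import statsmodels.api as sm

# Programmer salary survey data: experience (yrs), aptitude test score, salary ($1000s)
exper = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                  9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                  88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
                   38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30])

# Ordinary least squares with an intercept and two predictors
X = sm.add_constant(np.column_stack([exper, score]))
model = sm.OLS(salary, X).fit()

print(model.params)        # [3.174, 1.404, 0.251] -> SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
print(model.rsquared)      # about 0.834
print(model.rsquared_adj)  # about 0.815
```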


Interpreting the Coefficients

In multiple regression analysis, we interpret each regression coefficient as follows:

bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.
Interpreting the Coefficients

b1 = 1.404

Salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).
Interpreting the Coefficients

b2 = 0.251

Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the years of experience are held constant).
Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Multiple Coefficient of Determination

Excel's ANOVA Output

             df    SS          MS         F          Significance F
Regression    2    500.3285    250.1643   42.76013   2.32774E-07
Residual     17    99.45697    5.85041
Total        19    599.7855

SSR = 500.3285, SST = 599.7855
Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418
Adjusted Multiple Coefficient of Determination

$$R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

$$R_a^2 = 1 - (1 - .834179)\frac{20 - 1}{20 - 2 - 1} = .814671$$

where:
n is the number of observations,
p is the number of independent variables.
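The adjustment is a one-line computation; a trivial standalone Python sketch:

```python
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
r2, n, p = 0.834179, 20, 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2_adj)  # about 0.814671
```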
Adjusted Multiple Coefficient of Determination

Excel's Regression Statistics

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.913334059
R Square             0.834179103
Adjusted R Square    0.814670762
Standard Error       2.418762076
Observations         20
