
Business Statistics - Session 9

Correlation and Simple Regression


Bivariate Data

Often a manager or decision maker is interested in the relationship between two variables.

Two descriptive measures of the relationship between two variables are the covariance and the correlation coefficient.
Bivariate Data

Examples:

1. Is there any relationship between the variation in the time spent on training an employee and the performance of salespeople on the job?

2. Is there any relationship between the variation in the amounts of money spent on advertising in an area and the sales volume in that area?

3. Is there any relationship between the variation in the frequency of numbers that are played in a lottery and the relative frequency of numbers that win in the lottery?
Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.


Covariance

The covariance is computed as follows:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad \text{for samples}$$

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \quad \text{for populations}$$
Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
Correlation Coefficient

The correlation coefficient is computed as follows:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} \quad \text{for samples} \qquad \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{for populations}$$

where $s_{xy}$ is the sample covariance and $\sigma_{xy}$ is the population covariance.
Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The closer the correlation is to zero, the weaker the relationship.
Covariance and Correlation Coefficient

Example: Golfing Study

A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving Distance (yds.)   Average 18-Hole Score
            277.6                          69
            259.5                          71
            269.1                          70
            267.0                          70
            255.6                          71
            272.9                          69
Covariance and Correlation Coefficient

Example: Golfing Study

    x        y     (xi - x̄)   (yi - ȳ)   (xi - x̄)(yi - ȳ)
  277.6     69      10.65      -1.0         -10.65
  259.5     71      -7.45       1.0          -7.45
  269.1     70       2.15       0              0
  267.0     70       0.05       0              0
  255.6     71     -11.35       1.0         -11.35
  272.9     69       5.95      -1.0          -5.95

Average    267.0   70.0                Total -35.40
Std. Dev.  8.2192  0.8944
Covariance and Correlation Coefficient

Example: Golfing Study

Sample Covariance:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$

Sample Correlation Coefficient:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(0.8944)} = -0.9631$$
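As a quick check, here is a minimal Python sketch that reproduces these values (Python and NumPy are assumptions; the slides work by hand and in Excel):

```python
import numpy as np

# Golfing study data: average driving distance (yds.) and average 18-hole score
distance = np.array([277.6, 259.5, 269.1, 267.0, 255.6, 272.9])
score = np.array([69, 71, 70, 70, 71, 69], dtype=float)

n = len(distance)
# Sample covariance: sum of deviation products divided by n - 1
s_xy = np.sum((distance - distance.mean()) * (score - score.mean())) / (n - 1)

# Sample correlation: covariance divided by the product of sample std devs (ddof=1)
r_xy = s_xy / (distance.std(ddof=1) * score.std(ddof=1))

print(s_xy)  # -7.08
print(r_xy)  # approximately -0.9631
```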
Simple Linear Regression Model

The equation that describes how y is related to x and an error term is called the regression model.

The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model, and
ε is a random variable called the error term.
Simple Linear Regression Equation

Positive Linear Relationship

[Figure: E(y) versus x; the regression line has intercept β0 and a positive slope β1]
Simple Linear Regression Equation

Negative Linear Relationship

[Figure: E(y) versus x; the regression line has intercept β0 and a negative slope β1]
Simple Linear Regression Equation

No Relationship

[Figure: E(y) versus x; the regression line is horizontal, with intercept β0 and slope β1 = 0]
Types of Relationships

[Figure: four panels of Y versus X scatter plots, two showing linear relationships and two showing curvilinear relationships]
Estimated Simple Linear Regression Equation

The estimated simple linear regression equation is

$$\hat{y} = b_0 + b_1 x$$

The graph is called the estimated regression line.

b0 is the y-intercept of the line.
b1 is the slope of the line.
ŷ is the estimated value of y for a given x value.
Least Squares Method

• Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2$$

where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
Least Squares Method

• Slope for the Estimated Regression Equation

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
Least Squares Method

• y-Intercept for the Estimated Regression Equation

$$b_0 = \bar{y} - b_1 \bar{x}$$

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = total number of observations
Simple Linear Regression

Example: Reed Auto Sales

Reed Auto periodically has a special week-long sale. As part of the advertising campaign, Reed runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown on the next slide.
Simple Linear Regression

Example: Reed Auto Sales

Number of TV Ads   Number of Cars Sold
       1                  14
       3                  24
       2                  18
       1                  17
       3                  27
Estimated Regression Equation

Slope for the Estimated Regression Equation:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{20}{4} = 5$$

y-Intercept for the Estimated Regression Equation:

$$b_0 = \bar{y} - b_1 \bar{x} = 20 - 5(2) = 10$$

Estimated Regression Equation:

$$\hat{y} = 10 + 5x$$
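The least-squares formulas above are easy to verify in code. A minimal sketch, again assuming Python with NumPy:

```python
import numpy as np

# Reed Auto data: number of TV ads (x) and cars sold (y)
x = np.array([1, 3, 2, 1, 3], dtype=float)
y = np.array([14, 24, 18, 17, 27], dtype=float)

# Least-squares slope and intercept from the formulas on the previous slides
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b1, b0)  # 5.0 10.0, i.e. y-hat = 10 + 5x
```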
Scatter Diagram and Trend Line

[Figure: scatter diagram of Cars Sold versus TV Ads with the fitted trend line y = 5x + 10]
Simple Linear Regression Model (continued)

[Figure: Y versus X for one observation, showing the observed value of Y for Xi, the predicted value of Y for Xi on the regression line, the random error εi between them, the slope β, and the intercept α]
Coefficient of Determination

• Relationship Among SST, SSR, SSE

SST = SSR + SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error (the difference between the observed values and what we predict from the line of best fit)
Coefficient of Determination

The coefficient of determination is:

r2 = SSR/SST

where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination

Example: Reed Auto Sales

r2 = SSR/SST = 100/114 = .8772

The regression relationship is very strong; about 88% of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of Determination}}$$

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

where:
b1 = the slope of the estimated regression equation $\hat{y} = b_0 + b_1 x$
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

The sign of b1 in the equation ŷ = 10 + 5x is "+".

$$r_{xy} = +\sqrt{.8772} = +.9366$$
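The sums of squares behind these numbers can be checked with the same hypothetical NumPy setup as before:

```python
import numpy as np

# Reed Auto data and fitted values from y-hat = 10 + 5x
x = np.array([1, 3, 2, 1, 3], dtype=float)
y = np.array([14, 24, 18, 17, 27], dtype=float)
y_hat = 10 + 5 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares: 114
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares: 100
sse = np.sum((y - y_hat) ** 2)         # error sum of squares: 14 = SST - SSR

r_squared = ssr / sst                  # 0.8772
r = np.sign(5) * np.sqrt(r_squared)    # +0.9366, sign taken from b1

print(sst, ssr, sse, r_squared, r)
```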
Simple Linear Regression Example: Data

House Price in $1000s (Y)   Square Feet (X)
         245                    1400
         312                    1600
         279                    1700
         308                    1875
         199                    1100
         219                    1550
         405                    2350
         324                    2450
         319                    1425
         255                    1700
Simple Linear Regression Example: Scatter Plot

House price model: Scatter Plot

[Figure: scatter plot of House Price ($1000s) versus Square Feet]
Simple Linear Regression Example: Using Excel

[Figure: Excel screenshot showing how the regression is run]
Simple Linear Regression Example: Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df    SS           MS           F        Significance F
Regression    1    18934.9348   18934.9348   11.0848  0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
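The same output can be reproduced outside Excel. A minimal sketch, assuming Python with NumPy and statsmodels (tools the slides themselves do not use):

```python
import numpy as np
import statsmodels.api as sm

# House price data from the earlier slide: price in $1000s (Y), size in square feet (X)
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)

# Ordinary least squares with an intercept term
model = sm.OLS(price, sm.add_constant(sqft)).fit()

print(model.params)    # [98.24833, 0.10977] -> house price = 98.24833 + 0.10977 (square feet)
print(model.rsquared)  # 0.58082, matching Excel's "R Square"
```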
Simple Linear Regression Example: Graphical Representation

House price model: Scatter Plot and Prediction Line

[Figure: scatter plot of House Price ($1000s) versus Square Feet with the fitted prediction line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)
Simple Linear Regression Example: Interpretation of b0

house price = 98.24833 + 0.10977 (square feet)

• b0 is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

• Because a house cannot have a square footage of 0, b0 has no practical application here
Simple Linear Regression Example: Interpreting b1

house price = 98.24833 + 0.10977 (square feet)

• b1 estimates the change in the mean value of Y as a result of a one-unit increase in X

• Here, b1 = 0.10977 tells us that the mean price of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Simple Linear Regression Example: Making Predictions

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 × $1000 = $317,850.
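The same arithmetic in a few lines of Python (a standalone sketch using the rounded coefficients from this slide; the unrounded coefficients give about 317.79 rather than 317.85):

```python
# Prediction from the fitted house price model, using the slide's rounded coefficients
b0, b1 = 98.25, 0.1098   # intercept and slope, rounded as on the slide
sqft = 2000

price_thousands = b0 + b1 * sqft
print(price_thousands)          # 317.85 ($1000s)
print(price_thousands * 1000)   # 317850.0, i.e. $317,850
```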
Simple Linear Regression Example: Making Predictions

• When using a regression model for prediction, only predict within the relevant range of data

[Figure: scatter plot of House Price ($1000s) versus Square Feet marking the relevant range for interpolation; do not try to extrapolate beyond the range of observed X's]
Coefficient of Determination, r2

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

• The coefficient of determination is also called r-squared and is denoted as r2

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

note: $0 \le r^2 \le 1$
Simple Linear Regression Example: Coefficient of Determination, r2 in Excel

$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$

58.08% of the variation in house prices is explained by variation in square feet.

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df    SS           MS           F        Significance F
Regression    1    18934.9348   18934.9348   11.0848  0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
MCQ

1. Suppose that we know the height of a student but do not know her weight. We use an estimating equation to determine an estimate of her weight based on her height. We can therefore surmise that:
(a) Weight is the independent variable.
(b) Height is the dependent variable.
(c) The relationship between weight and height is an inverse one.
(d) None of these.
(e) (b) and (c) but not (a).

2. Suppose the estimating equation Y = 5 - 2X has been calculated for a set of data. Which of the following is true for this situation?
(a) The Y-intercept of the line is 2.
(b) The slope of the line is negative.
(c) The line represents an inverse relationship.
(d) All of these.
(e) (b) and (c) but not (a).
Multiple Regression Model

The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is called the multiple regression model.

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where:
β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term
Multiple Regression Model

Example: Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
Multiple Regression Model

Exper. Score Salary Exper. Score Salary


4 78 24 9 88 38
7 100 43 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29
10 84 38 8 87 34
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30 3 70 28.2
6 91 33 3 89 30
Multiple Regression Model

Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1x1 + β2x2 + ε

where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of b0, b1, b2

[Diagram: the input data (x1, x2, y) are fed into a computer package for solving multiple regression problems, which outputs the least squares estimates b0, b1, b2, along with R2, etc.]
Solving for the Estimates of b0, b1, b2

Excel Worksheet (showing partial data entered)

Programmer   Experience (yrs)   Test Score   Salary ($K)
    1               4               78          24.0
    2               7              100          43.0
    3               1               86          23.7
    4               5               82          34.3
    5               8               86          35.8
    6              10               84          38.0
    7               0               75          22.2
    8               1               80          23.1
Solving for the Estimates of b0, b1, b2

[Figure: Excel's Regression dialog box]
Solving for the Estimates of b0, b1, b2

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478
Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
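As a cross-check on the Excel output, here is a minimal sketch assuming Python with NumPy and statsmodels (not tools the slides use) to fit the same model to the 20 observations:

```python
import numpy as np
import statsmodels.api as sm

# Programmer salary survey data: experience (yrs), aptitude test score, salary ($1000s)
exper = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                  9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                  88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
                   38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30])

# Ordinary least squares with an intercept and two predictors
X = sm.add_constant(np.column_stack([exper, score]))
model = sm.OLS(salary, X).fit()

print(model.params)        # [3.174, 1.404, 0.251] -> SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
print(model.rsquared)      # about 0.834
print(model.rsquared_adj)  # about 0.815
```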


Interpreting the Coefficients

In multiple regression analysis, we interpret each regression coefficient as follows:

bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.
Interpreting the Coefficients

b1 = 1.404

Salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).
Interpreting the Coefficients

b2 = 0.251

Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the years of experience are held constant).
Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Multiple Coefficient of Determination

Excel's ANOVA Output

             df    SS          MS         F          Significance F
Regression    2    500.3285    250.1643   42.76013   2.32774E-07
Residual     17    99.45697    5.85041
Total        19    599.7855

SSR = 500.3285, SST = 599.7855
Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418
Adjusted Multiple Coefficient of Determination

$$R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

$$R_a^2 = 1 - (1 - .834179)\frac{20 - 1}{20 - 2 - 1} = .814671$$

where:
n is the number of observations,
p is the number of independent variables.
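The adjustment is a one-line computation; a trivial standalone Python sketch:

```python
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
r2, n, p = 0.834179, 20, 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2_adj)  # about 0.814671
```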
Adjusted Multiple Coefficient of Determination

Excel's Regression Statistics

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.913334059
R Square             0.834179103
Adjusted R Square    0.814670762
Standard Error       2.418762076
Observations         20
