
In this chapter, you learn:

• How to use regression analysis to predict the value of a
dependent variable based on an independent variable
• The meaning of the regression coefficients b0 and b1
• How to evaluate the assumptions of regression analysis and
know what to do if the assumptions are violated
• To make inferences about the slope and correlation
coefficient
• To estimate mean values and predict individual values
Simple Linear Regression

• Managerial decisions are often based on the
relationship between two or more variables.
• Regression analysis can be used to develop an
equation showing how the variables are related.
• The variable being predicted is called the dependent
variable and is denoted by y.
• The variables being used to predict the value of the
dependent variable are called the independent
variables and are denoted by x.
Simple Linear Regression

• Simple linear regression involves one independent
variable and one dependent variable.
• The relationship between the two variables is
approximated by a straight line.
• Regression analysis involving two or more
independent variables is called multiple regression.
• A scatter plot can be used to show the relationship between
two variables
• Correlation analysis is used to measure the strength of the
association (linear relationship) between two variables
• Correlation is only concerned with strength of the relationship
• No causal effect is implied with correlation
Introduction to Regression
Analysis
• Regression analysis is used to:
• Predict the value of a dependent variable based on the
value of at least one independent variable
• Explain the impact of changes in an independent variable
on the dependent variable
Dependent variable: the variable we wish to
predict or explain
Independent variable: the variable used to predict
or explain the dependent
variable
[Figure: scatter plots of Y against X illustrating types of relationships: linear and curvilinear relationships, strong and weak relationships, and no relationship.]
Simple Linear Regression
Model
• Only one independent variable, X
• Relationship between X and Y is described by a
linear function
• Changes in Y are assumed to be related to changes in
X
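Simple Linear Regression
Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where:
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term

β0 + β1Xi is the linear component; εi is the random error component.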
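Simple Linear Regression
Model (continued)

[Figure: the population regression line, with intercept β0 and slope β1, plotted with Y against X. For a given Xi, the random error εi is the vertical distance between the observed value of Y and the predicted value of Y on the line.]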
Simple Linear Regression
Equation (Prediction Line)
The simple linear regression equation provides an estimate of the
population regression line:

$$\hat{Y}_i = b_0 + b_1 X_i$$

where:
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i
Estimation Process
1. Regression model: Y = β0 + β1X + ε, with unknown parameters β0 and β1.
2. Sample data: n paired observations (x1, y1), ..., (xn, yn).
3. Sample statistics b0 and b1 are computed from the sample data, giving
the estimated regression equation Ŷ = b0 + b1X.
4. b0 and b1 provide estimates of β0 and β1.
b0 and b1 are obtained by finding the values that
minimize the sum of the squared differences
between Y and Ŷ:

$$\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where:
Yi = observed value of the dependent variable
for the ith observation
Ŷi = estimated value of the dependent variable
for the ith observation
Least Squares Method
Slope and intercept for the Estimated Regression Equation

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \quad \text{and} \quad b_0 = \bar{Y} - b_1 \bar{X}$$

where:
Xi = value of independent variable for ith
observation
Yi = value of dependent variable for ith
observation
X̄ = mean of the independent variable
Ȳ = mean of the dependent variable
Simple Linear Regression
Example: Reed Auto Sales
Reed Auto periodically has a special week-long sale. As part of the
advertising campaign, Reed runs one or more television commercials during
the weekend preceding the sale. Data from a sample of 5 previous sales are
shown below:
X = number of TV ads, Y = number of cars sold, Ŷ = estimated sales = 10 + 5X,
error (ɛ) = actual sales - estimated sales = Y - Ŷ

   X     Y    (X-X̄)   (Y-Ȳ)   (X-X̄)(Y-Ȳ)   (X-X̄)^2    Ŷ     Y-Ŷ
   1    14     -1      -6         6           1       15     -1
   3    24      1       4         4           1       25     -1
   2    18      0      -2         0           0       20     -2
   1    17     -1      -3         3           1       15      2
   3    27      1       7         7           1       25      2
Sum 10  100                       20          4

X̄ = 2, Ȳ = 20
Estimated Regression Equation

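• Slope for the Estimated Regression Equation

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{20}{4} = 5$$

• y-Intercept for the Estimated Regression Equation

$$b_0 = \bar{Y} - b_1 \bar{X} = 20 - 5(2) = 10$$

• Estimated Regression Equation

$$\hat{Y} = b_0 + b_1 X, \quad \text{i.e.,} \quad \hat{Y} = 10 + 5X$$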
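The least squares calculation above can be checked with a few lines of Python. This is a minimal sketch using only the five Reed Auto observations from the slide; the helper name `least_squares` is an assumption made for illustration, not something from the original material.

```python
# Reed Auto Sales sample from the slide: TV ads (x) and cars sold (y).
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]


def least_squares(x, y):
    """Return (b0, b1) for the line y-hat = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
print(f"y-hat = {b0:g} + {b1:g}x")
```

The slope and intercept should agree with the hand calculation (slope 5, intercept 10).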
Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
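For reference, the three sums of squares can be written out explicitly. The definitions below are the standard ones and match the column headings used in the sum of squares calculation for this example; the index i running over the n observations is notation assumed here.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \\
SSR &= \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \\
SSE &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
\end{aligned}
```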
Sum of Squares calculation

X = number of TV ads, Y = number of cars sold, Ŷ = estimated sales = 10 + 5X,
error (ɛ) = actual sales (Y) - estimated sales (Ŷ)

   X     Y    (X-X̄)   (Y-Ȳ)    Ŷ     Y-Ŷ    (Y-Ȳ)^2   (Ŷ-Ȳ)^2   (Y-Ŷ)^2
   1    14     -1      -6      15     -1       36        25         1
   3    24      1       4      25     -1       16        25         1
   2    18      0      -2      20     -2        4         0         4
   1    17     -1      -3      15      2        9        25         4
   3    27      1       7      25      2       49        25         4

X̄ = 2, Ȳ = 20, SST = 114, SSR = 100, SSE = 14


Coefficient of Determination

• The coefficient of determination is:

$$r^2 = \frac{SSR}{SST} \quad \text{or} \quad r^2 = 1 - \frac{SSE}{SST}$$

where:
SSR = sum of squares due to regression
SSE = sum of squares due to errors
SST = total sum of squares
Coefficient of Determination

r² = SSR/SST = 100/114 = .8772

The regression relationship is very strong; 87.72%
of the variability in the number of cars sold can be
explained by the linear relationship between the
number of TV ads and the number of cars sold.
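As a rough check of the sums of squares and the coefficient of determination, the sketch below recomputes them from the Reed Auto data. It assumes the estimated regression equation with intercept 10 and slope 5 from the example; variable names are illustrative only.

```python
# Reed Auto Sales sample: TV ads (x) and cars sold (y).
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10, 5  # estimated regression equation taken from the example

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]  # estimated sales for each observation

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # due to regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # due to error

r2 = ssr / sst  # equivalently 1 - sse / sst
print(sst, ssr, sse, round(r2, 4))
```

The identity SST = SSR + SSE should hold for these values, and both forms of the r² formula give the same result.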
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

where:
b1 = the slope of the estimated regression
equation Ŷ = b0 + b1X
r² = the coefficient of determination
Sample Correlation Coefficient

The sign of b1 in the equation Ŷ = 10 + 5X is “+”.

$$r_{xy} = +\sqrt{.8772} = +.9366$$
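The correlation coefficient for the Reed Auto data can be obtained two ways: from the sign of the slope and the square root of r², or directly from the data. The sketch below does both as a cross-check; the direct formula used is the usual Pearson product-moment formula, which is an addition here rather than something shown on the slides.

```python
import math

# Reed Auto Sales sample: TV ads (x) and cars sold (y).
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]

# Route 1: sign of the slope times the square root of r^2 (values from the example).
r2 = 100 / 114  # SSR / SST
b1 = 5          # slope of the estimated regression equation
r_from_r2 = math.copysign(math.sqrt(r2), b1)

# Route 2: Pearson formula computed directly from the data.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r_direct = sxy / math.sqrt(sxx * syy)

print(round(r_from_r2, 4), round(r_direct, 4))
```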
Assumptions about the Error Term ε

1. The error ε is a random variable with mean of zero.

2. The variance of ε, denoted by σ², is the same for
all values of the independent variable.

3. The values of ε are independent.

4. The error ε is a normally distributed random
variable.
Testing for Significance
To test for a significant regression relationship, we
must conduct a hypothesis test to determine whether
the value of β1 is zero.

Two tests are commonly used: the t test and the F test.

Both the t test and the F test require an estimate of σ²,
the variance of ε in the regression model.
Testing for Significance
• An Estimate of σ²
The mean square error (MSE) provides the estimate
of σ², and the notation s² is also used.

s² = MSE = SSE/(n-k-1)

where:
n is the number of observations
k is the number of independent variables (k = 1 in simple
linear regression, so s² = SSE/(n-2))
Testing for Significance
• An Estimate of σ
• To estimate σ we take the square root of s²: s = √MSE.
• The resulting s is called the standard error of
the estimate.
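For the Reed Auto example, the mean square error and the standard error of the estimate follow directly from SSE = 14 with n = 5 observations and k = 1 independent variable. A minimal sketch, with variable names chosen for illustration:

```python
import math

sse = 14  # sum of squares due to error from the Reed Auto example
n = 5     # number of observations
k = 1     # number of independent variables

mse = sse / (n - k - 1)  # s^2, the estimate of the error variance
s = math.sqrt(mse)       # standard error of the estimate

print(round(mse, 3), round(s, 4))
```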
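Testing for Significance: t Test
• Hypotheses

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

• Test Statistic

$$t = \frac{b_1}{s_{b_1}} \quad \text{where} \quad s_{b_1} = \frac{s}{\sqrt{\sum (X_i - \bar{X})^2}}$$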
Testing for Significance: t Test

• Rejection Rule

Reject H0 if p-value < α
or t < $-t_{\alpha/2}$ or t > $t_{\alpha/2}$

where:
$t_{\alpha/2}$ is based on a t distribution
with n - 2 degrees of freedom
Testing for Significance: t Test

1. Determine the hypotheses. H0: β1 = 0, Ha: β1 ≠ 0

2. Specify the level of significance. α = .05

3. Select the test statistic. t = b1/s_b1

4. State the rejection rule. Reject H0 if p-value < .05
or |t| > 3.182 (with
3 degrees of freedom)
Testing for Significance: t Test

5. Compute the value of the test statistic. t = b1/s_b1 = 5/1.08 = 4.63

6. Determine whether to reject H0.

t = 4.541 provides an area of .01 in the upper
tail. Hence, the p-value is less than .02. (Also,
t = 4.63 > 3.182.) We can reject H0.
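The t statistic for the Reed Auto example can be reproduced as follows. This sketch assumes the usual formula for the estimated standard deviation of the slope, s divided by the square root of the sum of squared deviations of X, and takes the critical value 3.182 from the rejection rule above rather than computing it.

```python
import math

# Quantities from the Reed Auto example.
b1 = 5     # estimated slope
sse = 14   # sum of squares due to error
sxx = 4    # sum of squared deviations of X from its mean
n = 5      # number of observations

s = math.sqrt(sse / (n - 2))  # standard error of the estimate
s_b1 = s / math.sqrt(sxx)     # estimated standard deviation of b1
t = b1 / s_b1                 # test statistic

t_critical = 3.182            # t value for alpha/2 = .025 with 3 d.f., from the slide
reject_h0 = abs(t) > t_critical

print(round(s_b1, 2), round(t, 2), reject_h0)
```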
Confidence Interval for β1
• We can use a 95% confidence interval for β1 to test
the hypotheses just used in the t test.
• H0 is rejected if the hypothesized value of β1 is not
included in the confidence interval for β1.
Confidence Interval for β1
• The form of a confidence interval for β1 is:

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

where b1 is the point estimator, $t_{\alpha/2}\, s_{b_1}$ is the margin
of error, and $t_{\alpha/2}$ is the t value providing an area
of α/2 in the upper tail of a t distribution
with n - 2 degrees of freedom
Confidence Interval for β1
• Rejection Rule
Reject H0 if 0 is not included in
the confidence interval for β1.
• 95% Confidence Interval for β1
$b_1 \pm t_{\alpha/2}\, s_{b_1}$ = 5 +/- 3.182(1.08) = 5 +/- 3.44
or 1.56 to 8.44
• Conclusion
0 is not included in the confidence interval.
Reject H0
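The interval above can be recomputed without rounding the standard deviation of the slope first. A small sketch, again taking the critical value 3.182 as given in the example; the variable names are illustrative.

```python
import math

b1 = 5                       # point estimator of the slope
s = math.sqrt(14 / 3)        # standard error of the estimate, sqrt(SSE / (n - 2))
s_b1 = s / math.sqrt(4)      # estimated standard deviation of b1
t_critical = 3.182           # t value for alpha/2 = .025 with 3 d.f., from the slide

margin = t_critical * s_b1   # margin of error
lower, upper = b1 - margin, b1 + margin

# H0: beta1 = 0 is rejected when 0 lies outside the interval.
reject_h0 = not (lower <= 0 <= upper)
print(round(lower, 2), round(upper, 2), reject_h0)
```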
Testing for Significance: F Test

• Hypotheses

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

• Test Statistic

F = MSR/MSE

MSR = SSR/k

MSE = SSE/(n-k-1)

k = number of independent variables


Testing for Significance: F Test

• Rejection Rule

Reject H0 if
p-value < α
or F > $F_{\alpha}$
where:
$F_{\alpha}$ is based on an F distribution with
1 degree of freedom in the numerator and
n - 2 degrees of freedom in the denominator
Testing for Significance: F Test

1. Determine the hypotheses. H0: β1 = 0, Ha: β1 ≠ 0

2. Specify the level of significance. α = .05

3. Select the test statistic. F = MSR/MSE

4. State the rejection rule. Reject H0 if p-value < .05
or F > 10.13 (with 1 d.f.
in numerator and
3 d.f. in denominator)
Testing for Significance: F Test

5. Compute the value of the test statistic.

F = MSR/MSE = 100/4.667 = 21.43

6. Determine whether to reject H0.


F = 17.44 provides an area of .025 in the upper
tail. Thus, the p-value corresponding to F = 21.43
is less than .025. Hence, we reject H0.
The statistical evidence is sufficient to conclude
that we have a significant relationship between the
number of TV ads aired and the number of cars sold.
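The F statistic for the Reed Auto example can be recomputed in the same way. The sketch below takes the critical value 10.13 from the rejection rule above and also checks the known identity that, with a single independent variable, F equals the square of the t statistic.

```python
import math

ssr, sse = 100, 14  # sums of squares from the Reed Auto example
n, k = 5, 1         # observations and independent variables

msr = ssr / k             # mean square due to regression
mse = sse / (n - k - 1)   # mean square error
f_stat = msr / mse

f_critical = 10.13        # F value for alpha = .05 with (1, 3) d.f., from the slide
reject_h0 = f_stat > f_critical

# With one independent variable, F should equal t squared.
t = 5 / (math.sqrt(mse) / math.sqrt(4))
print(round(f_stat, 2), round(t ** 2, 2), reject_h0)
```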
Example: Data were collected from a sample of 10 Armand’s Pizza Parlor restaurants located near college
campuses. Obtain the estimated regression line and estimate quarterly sales of an outlet near a campus with
30,000 students.
X = student population (1000s), Y = quarterly sales ($1000s), Ŷ = 60 + 5X, error = Y - Ŷ

   X     Y    (X-X̄)   (Y-Ȳ)   (X-X̄)(Y-Ȳ)   (X-X̄)^2    Ŷ     Y-Ŷ    (Y-Ȳ)^2   (Ŷ-Ȳ)^2   (Y-Ŷ)^2
   2    58    -12     -72       864         144       70    -12     5184      3600       144
   6   105     -8     -25       200          64       90     15      625      1600       225
   8    88     -6     -42       252          36      100    -12     1764       900       144
   8   118     -6     -12        72          36      100     18      144       900       324
  12   117     -2     -13        26           4      120     -3      169       100         9
  16   137      2       7        14           4      140     -3       49       100         9
  20   157      6      27       162          36      160     -3      729       900         9
  20   169      6      39       234          36      160      9     1521       900        81
  22   149      8      19       152          64      170    -21      361      1600       441
  26   202     12      72       864         144      190     12     5184      3600       144
Sum                            2840         568                    15730     14200      1530

X̄ = 14, Ȳ = 130, SST = 15730, SSR = 14200, SSE = 1530
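The estimated regression line and the requested prediction for a campus with 30,000 students can be sketched in Python as follows. The data are the ten observations in the table; note that X = 30 lies outside the observed range of 2 to 26 (thousands of students), so the prediction is an extrapolation and should be read with that caveat.

```python
# Armand's Pizza Parlors sample: student population (1000s) and quarterly sales ($1000s).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Estimated quarterly sales ($1000s) for a campus with 30 (thousand) students.
students = 30
predicted_sales = b0 + b1 * students
print(f"y-hat = {b0:g} + {b1:g}x; predicted sales at x = {students}: {predicted_sales:g}")
```

Under the fitted line, the estimate is in thousands of dollars, matching the units of Y.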
[Figure: Estimated regression line of quarterly sales. Quarterly sales ($1000s) plotted against No. students (1000s), with fitted line f(x) = 5x + 60 and R² = 0.9.]

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.950122955
R Square 0.90273363
Adjusted R Square 0.890575334
Standard Error 13.82931669
Observations 10

ANOVA
  df SS MS F Significance F
Regression 1 14200 14200 74.24836601 2.54887E-05
Residual/Error 8 1530 191.25
Total 9 15730      

  Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 60 9.22603481 6.503335532 0.000187444 38.72472558 81.27527442 38.72472558 81.27527442
X Variable 1 5 0.580265238 8.616749156 2.54887E-05 3.661905962 6.338094038 3.661905962 6.338094038

RESIDUAL OUTPUT

Observation Predicted Y Residuals


1 70 -12
2 90 15
3 100 -12
4 100 18
5 120 -3
6 140 -3
7 160 -3
8 160 9
9 170 -21
10 190 12
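Several figures in the summary output can be reproduced from the formulas earlier in the chapter. The sketch below recomputes R Square, the Standard Error, the standard error and t Stat of the slope, the F statistic, and the residuals for the Armand's data; the variable names are illustrative, and p-values are left out because they need a statistics library.

```python
import math

# Armand's Pizza Parlors sample: student population (1000s) and quarterly sales ($1000s).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
n, k = len(x), 1
b0, b1 = 60, 5  # coefficients reported in the summary output

x_bar = sum(x) / n
y_bar = sum(y) / n
y_hat = [b0 + b1 * xi for xi in x]                  # "Predicted Y" column
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # "Residuals" column

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum(e ** 2 for e in residuals)
ssr = sst - sse
sxx = sum((xi - x_bar) ** 2 for xi in x)

r_square = ssr / sst
mse = sse / (n - k - 1)
std_error = math.sqrt(mse)          # "Standard Error" under Regression Statistics
s_b1 = std_error / math.sqrt(sxx)   # standard error of the slope
t_stat = b1 / s_b1
f_stat = (ssr / k) / mse

print(round(r_square, 4), round(std_error, 4), round(s_b1, 4),
      round(t_stat, 4), round(f_stat, 4))
```

The residuals sum to zero, as expected for a least squares fit that includes an intercept.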
Practice Exercises

1. Data on advertising expenditures and revenue (in thousands of dollars) for the Four Seasons Restaurant are as
follows:
Advt exp($1000s) Revenue($1000s)
1 19
2 32
4 44
6 40
10 52
14 53
20 54

a. Let x equal advertising expenditures and y equal revenue. Use the method of least squares to develop a
straight-line approximation of the relationship between the two variables.
b. Test whether revenue and advertising expenditures are related at the 0.05 level of significance.
c. Test whether the estimated regression coefficient is significant at the 0.05 level of significance.
d. Construct a 95% confidence interval for the regression coefficient (0.05 level of significance).
2. Concur Technologies, Inc., is a large expense-management company located in Redmond, Washington. The Wall
Street Journal asked Concur to examine the data from 8.3 million expense reports to provide insights regarding
business travel expenses. Their analysis of the data showed that New York was the most expensive city, with an
average daily hotel room rate of $198 and an average amount spent on entertainment, including group meals and
tickets for shows, sports, and other events, of $172. In comparison, the U.S. averages for these two categories were
$89 for the room rate and $99 for entertainment. The following table shows the average daily hotel room rate and the
amount spent on entertainment for a random sample of 9 of the 25 most visited U.S. cities.
City Room rate($) Entertainment($)
Boston 148 161
Denver 96 105
Nashville 91 101
New Orleans 110 142
Phoenix 90 100
San Diego 102 120
San Francisco 136 167
San Jose 90 140
Tampa 82 98
(i) Develop a scatter diagram for these data with the room rate as the independent variable.
(ii) What does the scatter diagram developed in part (i) indicate about the relationship between the two variables?
(iii) Develop the least squares estimated regression equation.
(iv) Provide an interpretation for the slope of the estimated regression equation.
(v) The average room rate in Chicago is $128, considerably higher than the U.S. average. Predict the
entertainment expense per day for Chicago.
