
QUANTITATIVE ANALYSIS (ISD 551)
Lecturer: Dr. Emmanuel Quansah
Department: Supply Chain and Information Systems - KSB
Office: SF 25, KSB Undergraduate Block
Module 5: REGRESSION MODELS
Learning Objectives (1 of 2)
After completing this module, students will be able to:
1. Identify variables, visualize them in a scatter diagram,
and use them in a regression model.
2. Develop simple linear regression equations from
sample data and interpret the slope and intercept.
3. Calculate the coefficient of determination and the
coefficient of correlation and interpret their
meanings.
4. Interpret the F test in a linear regression model.
5. Use computer software for regression analysis.
6. Develop a multiple regression model and use it for
prediction purposes.
Module Outline
1. Introduction and some terminology
2. Scatter Diagrams
3. Simple Linear Regression
4. Measuring the Fit of the Regression Model
5. Assumptions of the Regression Model
6. Testing the Model for Significance
7. Using Computer Software for Regression
8. Multiple Regression Analysis
Introduction (1 of 2)
• Regression analysis – very valuable tool for a manager
• Understand the relationship between variables
• Predict the value of one variable based on another variable
• Simple linear regression models have only two variables
• Multiple regression models have more than one
independent variable
Introduction: Some Terminology
• Dependent Variable - A dependent or outcome variable
is any variable that changes its value in response to
another variable
• Independent Variable - An independent or predictor or
explanatory variable is a variable that influences another
variable
• Correlation - a statistical method used to determine whether
a linear relationship exists between two variables. It allows
you to find out if there is a statistically significant
relationship between the TWO variables
• Regression - a statistical method used to describe the nature
of the relationship (causation) between two variables. It
allows you to make predictions based on the relationship that
exists between the two variables.
Introduction (2 of 2)
• The variable to be predicted is called the dependent variable
or response variable
• Its value depends on the value of the independent variable(s),
which are also called explanatory or predictor variables
The General Idea
• Simple Regression Analysis
• Considers the relationship between two variables: an
independent variable (X) and a dependent variable (Y), i.e.,
a single explanatory variable and a response variable
The General Idea
• Multiple regression simultaneously considers the influence of
multiple explanatory variables (X1, X2, …, Xk) on a response
variable Y
• The intent is to look at the independent effect of each
variable as well as the combined effect
Scatter Diagram
• A scatter diagram or scatter plot is often used to investigate
the relationship between variables
• Independent variable normally plotted on the X axis
• Dependent variable normally plotted on the Y axis
Triple A Construction (1 of 7)
• Triple A Construction renovates old homes
• The dollar volume of renovation work is dependent on the area
payroll

TABLE 4.1 Triple A Construction Company Sales and Local Payroll

TRIPLE A'S SALES ($100,000s)   LOCAL PAYROLL ($100,000,000s)
6                              3
8                              4
9                              6
5                              4
4.5                            2
9.5                            5
Triple A Construction (2 of 7)
FIGURE 4.1 Scatter Diagram of Triple A Construction Company Data
Simple Linear Regression (1 of 2)
• Regression models are used to test relationships between
variables
• They include a random error term

Y = β0 + β1X + ε

where
Y = dependent variable (response)
X = independent variable (predictor or explanatory)
β0 = intercept (value of Y when X = 0)
β1 = slope of the regression line
ε = random error
Simple Linear Regression (2 of 2)
• True values for the slope and intercept are not known
• Estimated using sample data

Ŷ = b0 + b1X

where
Ŷ = predicted value of Y
b0 = estimate of β0, based on sample results
b1 = estimate of β1, based on sample results
Triple A Construction (3 of 7)
• Predict sales based on area payroll
Y = Sales
X = Area payroll
– The line in Figure 4.1 minimizes the errors

Error = (Actual value) − (Predicted value)

e = Y − Ŷ

– Regression analysis minimizes the sum of squared errors
– Least-squares regression
Triple A Construction (4 of 7)
• Formulas for simple linear regression, intercept and slope

Ŷ = b0 + b1X

X̄ = ΣX ÷ n = average (mean) of X values
Ȳ = ΣY ÷ n = average (mean) of Y values

b1 = Σ(X − X̄)(Y − Ȳ) ÷ Σ(X − X̄)²
b0 = Ȳ − b1X̄
Triple A Construction (5 of 7)
TABLE 4.2 Regression Calculations for Triple A Construction

Y      X      (X − X̄)²          (X − X̄)(Y − Ȳ)
6      3      (3 − 4)² = 1      (3 − 4)(6 − 7) = 1
8      4      (4 − 4)² = 0      (4 − 4)(8 − 7) = 0
9      6      (6 − 4)² = 4      (6 − 4)(9 − 7) = 4
5      4      (4 − 4)² = 0      (4 − 4)(5 − 7) = 0
4.5    2      (2 − 4)² = 4      (2 − 4)(4.5 − 7) = 5
9.5    5      (5 − 4)² = 1      (5 − 4)(9.5 − 7) = 2.5
ΣY = 42   ΣX = 24   Σ(X − X̄)² = 10   Σ(X − X̄)(Y − Ȳ) = 12.5
Ȳ = 42 ÷ 6 = 7       X̄ = 24 ÷ 6 = 4
Triple A Construction (6 of 7)
• Regression calculations

X̄ = ΣX ÷ n = 24 ÷ 6 = 4
Ȳ = ΣY ÷ n = 42 ÷ 6 = 7

b1 = Σ(X − X̄)(Y − Ȳ) ÷ Σ(X − X̄)² = 12.5 ÷ 10 = 1.25
b0 = Ȳ − b1X̄ = 7 − (1.25)(4) = 2

Therefore Ŷ = 2 + 1.25X
Triple A Construction (7 of 7)
• Using the regression line for prediction

sales = 2 + 1.25(payroll)

If the payroll next year is $600 million (X = 6):

Ŷ = 2 + 1.25(6) = 9.5, or sales of $950,000
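The Triple A calculation above can be reproduced in a few lines of code. This is a minimal sketch in standard-library Python (the slides themselves use Excel for this):

```python
# Least-squares fit for the Triple A Construction data, using the
# module's formulas: b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
# and b0 = Ybar - b1 * Xbar.

payroll = [3, 4, 6, 4, 2, 5]      # X, local payroll ($100,000,000s)
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y, sales ($100,000s)

n = len(payroll)
x_bar = sum(payroll) / n          # 24 / 6 = 4
y_bar = sum(sales) / n            # 42 / 6 = 7

num = sum((x - x_bar) * (y - y_bar) for x, y in zip(payroll, sales))  # 12.5
den = sum((x - x_bar) ** 2 for x in payroll)                          # 10
b1 = num / den                    # slope = 1.25
b0 = y_bar - b1 * x_bar           # intercept = 2

# Predict sales when next year's payroll is $600 million (X = 6)
y_pred = b0 + b1 * 6
print(b0, b1, y_pred)             # 2.0 1.25 9.5
```

The output reproduces the slide's result: Ŷ = 2 + 1.25X, and predicted sales of 9.5 ($950,000) at a payroll of 6.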
Simple Linear Regression Problem Data
Question: Given the data below, what is the simple linear
regression model that can be used to predict sales?

Week   Sales
1      150
2      157
3      162
4      166
5      177

Resulting regression model: Ŷ = 143.5 + 6.3X

Plotting the regression-generated forecasts against the actual
sales gives a chart with Sales and Forecast lines against Period
(weeks 1 through 5).
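As a check, the reported model Ŷ = 143.5 + 6.3X can be recomputed from the five weekly observations with the same least-squares formulas (a sketch, not part of the original slides):

```python
# Fit sales = b0 + b1 * week by least squares for the 5-week data set.
weeks = [1, 2, 3, 4, 5]
sales = [150, 157, 162, 166, 177]

n = len(weeks)
x_bar = sum(weeks) / n            # 3
y_bar = sum(sales) / n            # 162.4

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, sales)) \
     / sum((x - x_bar) ** 2 for x in weeks)   # 63 / 10 = 6.3
b0 = y_bar - b1 * x_bar                       # 162.4 - 6.3*3 = 143.5
print(round(b0, 1), round(b1, 1))             # 143.5 6.3
```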
Measuring the Fit of the Regression Model (1 of 5)
• Regression models can be developed for any variables X and Y
• How helpful is the model in predicting Y?
• With the average error, positive and negative errors cancel
each other out
• Three measures of variability:
• SST – Total variability about the mean
• SSE – Variability about the regression line
• SSR – Variability that is explained by the model
Measuring the Fit of the Regression Model (2 of 5)
• Sum of squares total

SST = Σ(Y − Ȳ)²

• Sum of squares error

SSE = Σe² = Σ(Y − Ŷ)²

• Sum of squares regression

SSR = Σ(Ŷ − Ȳ)²

• An important relationship

SST = SSR + SSE
Measuring the Fit of the Regression Model (3 of 5)
TABLE 4.3 Sum of Squares for Triple A Construction

Y     X    (Y − Ȳ)²           Ŷ                    (Y − Ŷ)²   (Ŷ − Ȳ)²
6     3    (6 − 7)² = 1       2 + 1.25(3) = 5.75   0.0625     1.5625
8     4    (8 − 7)² = 1       2 + 1.25(4) = 7.00   1          0
9     6    (9 − 7)² = 4       2 + 1.25(6) = 9.50   0.25       6.25
5     4    (5 − 7)² = 4       2 + 1.25(4) = 7.00   4          0
4.5   2    (4.5 − 7)² = 6.25  2 + 1.25(2) = 4.50   0          6.25
9.5   5    (9.5 − 7)² = 6.25  2 + 1.25(5) = 8.25   1.5625     1.5625
Ȳ = 7      Σ(Y − Ȳ)² = 22.5                        Σ(Y − Ŷ)² = 6.875   Σ(Ŷ − Ȳ)² = 15.625
           SST = 22.5                              SSE = 6.875         SSR = 15.625
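The three sums of squares in Table 4.3 can be verified directly, along with the identity SST = SSR + SSE (a plain-Python sketch, not from the original slides):

```python
# Sums of squares for the Triple A Construction model Y-hat = 2 + 1.25X.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]

y_bar = sum(sales) / len(sales)               # 7
y_hat = [2 + 1.25 * x for x in payroll]       # predicted values

sst = sum((y - y_bar) ** 2 for y in sales)               # total variability
sse = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained

print(sst, sse, ssr)          # 22.5 6.875 15.625
assert sst == ssr + sse       # the key identity SST = SSR + SSE
```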
Measuring the Fit of the Regression Model (4 of 5)
For Triple A Construction:
SST = 22.5
SSE = 6.875
SSR = 15.625
Measuring the Fit of the Regression Model (5 of 5)
FIGURE 4.2 Deviations from the Regression Line and from the Mean
Coefficient of Determination (1 of 2)
• The proportion of the variability in Y explained by the
regression equation
• The coefficient of determination is r²

r² = SSR ÷ SST = 1 − SSE ÷ SST

• For Triple A Construction

r² = 15.625 ÷ 22.5 = 0.6944
Coefficient of Determination (2 of 2)
About 69% of the variability in Y is explained by the equation
based on payroll (X)
Correlation Coefficient
• An expression of the strength of the linear relationship
• Always between +1 and −1
• The correlation coefficient is r

r = ±√r²  (r takes the sign of the slope b1)

• For Triple A Construction

r = √0.6944 = 0.8333
Four Values of the Correlation Coefficient
FIGURE 4.3 Four Values of the Correlation Coefficient
Estimating the Variance (1 of 2)
• Errors are assumed to have a constant variance (σ²), usually
unknown
• Estimated using the mean squared error (MSE), s²

s² = MSE = SSE ÷ (n − k − 1)

where
n = number of observations in the sample
k = number of independent variables
Estimating the Variance (2 of 2)
• For Triple A Construction

s² = MSE = SSE ÷ (n − k − 1) = 6.8750 ÷ (6 − 1 − 1)
   = 6.8750 ÷ 4 = 1.7188

• Estimate the standard deviation, s
• This is called the standard error of the estimate or the
standard deviation of the regression

s = √MSE = √1.7188 = 1.31


Testing the Model for Significance (1 of 4)
• When the sample size is too small, you can get
good values for MSE and r2 even if there is no
relationship between the variables
• Testing the model for significance helps determine if the
values are meaningful
• Performing a statistical hypothesis test
Testing the Model for Significance (2 of 4)
• We start with the general linear model

Y = β0 + β1X + ε

• The null hypothesis is that there is no linear relationship
between X and Y (β1 = 0)
• The alternate hypothesis is that there is a linear
relationship (β1 ≠ 0)
• If the null hypothesis can be rejected, we conclude that there
is a relationship
• We use the F statistic
Testing the Model for Significance (3 of 4)
• The F statistic is based on the MSE and MSR

MSR = SSR ÷ k

where
k = number of independent variables in the model
• The F statistic is

F = MSR ÷ MSE

This describes an F distribution with:
degrees of freedom for the numerator = df1 = k
degrees of freedom for the denominator = df2 = n − k − 1
Testing the Model for Significance (4 of 4)
• If there is very little error, the MSE will be small and the
F statistic will be large – the model is useful
• If the F statistic is large, the significance level (p-value)
will be low – the result is unlikely to have occurred by
chance
• When the F value is large, we can reject the null hypothesis,
accept that there is a linear relationship between X and Y,
and treat the values of the MSE and r² as meaningful
Steps in a Hypothesis Test (1 of 2)
1. Specify null and alternative hypotheses

H0: β1 = 0
H1: β1 ≠ 0

2. Select the level of significance (α)
Common values are 0.01 and 0.05
3. Calculate the value of the test statistic

F = MSR ÷ MSE
Steps in a Hypothesis Test (2 of 2)
4. Make a decision using one of the following methods
a) Reject the null hypothesis if the test statistic is greater
than the F value from the table in Appendix D; otherwise, do
not reject the null hypothesis:

Reject if Fcalculated > Fα,df1,df2
df1 = k
df2 = n − k − 1

b) Reject the null hypothesis if the observed significance
level, or p-value, is less than the level of significance (α);
otherwise, do not reject the null hypothesis:

p-value = P(F > calculated test statistic)
Reject if p-value < α
Triple A Construction (1 of 3)
Step 1:
H0: β1 = 0 (no linear relationship between X and Y)
H1: β1 ≠ 0 (linear relationship exists between X and Y)
Step 2:
Select α = 0.05
Step 3:
– Calculate the value of the test statistic

MSR = SSR ÷ k = 15.6250 ÷ 1 = 15.6250
F = MSR ÷ MSE = 15.6250 ÷ 1.7188 = 9.09
Triple A Construction (2 of 3)
• Step 4:
• Reject the null hypothesis if the test statistic is greater
than the F value in Appendix D
df1 = k = 1
df2 = n − k − 1 = 6 − 1 − 1 = 4
The value of F associated with a 5% level of significance and
with degrees of freedom 1 and 4 is found in Appendix D:
F0.05,1,4 = 7.71
Fcalculated = 9.09
Reject H0 because 9.09 > 7.71
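The F test above can be reproduced numerically; the critical value 7.71 is the Appendix D table value quoted on the slide (a sketch, not from the original materials):

```python
# F test for the Triple A Construction regression.
ssr, sse = 15.625, 6.875      # sums of squares from Table 4.3
n, k = 6, 1                   # 6 observations, 1 independent variable

msr = ssr / k                 # 15.625
mse = sse / (n - k - 1)       # 1.71875
f_calc = msr / mse            # about 9.09

f_crit = 7.71                 # F(0.05, 1, 4) from Appendix D (per the slides)
print(round(f_calc, 2), f_calc > f_crit)   # 9.09 True
```

Since 9.09 > 7.71, the null hypothesis is rejected, matching the conclusion on the slide.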
Triple A Construction (3 of 3)
FIGURE 4.5 F Distribution for Triple A Construction Test for
Significance
• We can conclude there is a statistically significant
relationship between X and Y
• The r² value of 0.69 means about 69% of the variability in
sales (Y) is explained by local payroll (X)
Analysis of Variance (ANOVA) Table
• With software models, an ANOVA table is typically created that
shows the observed significance level (p-value) for the
calculated F value
• This can be compared to the level of significance (α) to make
a decision

TABLE 4.4 Analysis of Variance Table for Regression

             DF         SS    MS                        F          SIGNIFICANCE F
Regression   k          SSR   MSR = SSR ÷ k             MSR ÷ MSE  P(F > MSR ÷ MSE)
Residual     n − k − 1  SSE   MSE = SSE ÷ (n − k − 1)
Total        n − 1      SST
ANOVA for Triple A Construction
PROGRAM 4.1C Excel 2016 Output for Triple A Construction Example

P(F > 9.0909) = 0.0394


Using Software (1 of 10)
PROGRAM 4.1A Accessing the Regression Option in Excel 2016
Using Software (2 of 10)
PROGRAM 4.1B Data Input for Regression in Excel 2016
Using Software (3 of 10)
PROGRAM 4.1C Excel 2016 Output for Triple A Construction Example
Using Software (4 of 10)
PROGRAM 4.2A Using Excel QM for Regression
Using Software (5 of 10)
PROGRAM 4.2B Initializing the Spreadsheet in Excel QM
Using Software (6 of 10)
PROGRAM 4.2C Input and Results for Regression in Excel QM
Using Software (7 of 10)
PROGRAM 4.3A QM for Windows Regression Option in Forecasting Module
Using Software (8 of 10)
PROGRAM 4.3B QM for Windows Screen to Initialize the Problem
Using Software (9 of 10)
PROGRAM 4.3C Data Input for Triple A Construction Example
Using Software (10 of 10)
PROGRAM 4.3D QM for Windows Output for Triple A Construction Example
Multiple Regression Analysis (1 of 2)
• Extensions to the simple linear model
• Models with more than one independent variable

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where
Y = dependent variable (response variable)
Xi = ith independent variable (predictor or explanatory variable)
β0 = intercept (value of Y when all Xi = 0)
βi = coefficient of the ith independent variable
k = number of independent variables
ε = random error
Multiple Regression Analysis (2 of 2)
• To estimate these values, a sample is taken and the following
equation developed

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

where
Ŷ = predicted value of Y
b0 = sample intercept (an estimate of β0)
bi = sample coefficient of the ith variable (an estimate of βi)
Multiple Regression Model…1
• Formula for multiple linear regression with two independent
variables (X1, X2):

Ŷ = b0 + b1X1 + b2X2

• Multiple regression equation with k independent variables:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

Here Ŷ is the estimated (or predicted) value of Y, b0 is the
estimated intercept, and b1, …, bk are the estimated slope
coefficients.
Multiple Regression Model…2
• A simple regression model (one independent variable) fits a
regression line in 2-dimensional space
• A multiple regression model with two explanatory variables,
Ŷ = b0 + b1X1 + b2X2, fits a regression plane in 3-dimensional
space
Multiple Regression Model…3
• Formula for multiple linear regression with two independent
variables:

Ŷ = a + b1X1 + b2X2

• We will be calculating a, b1 and b2 using Excel software
Multiple Regression Example…1
A distributor of frozen dessert pies wants to evaluate factors
thought to influence demand

Dependent variable: Pie sales (units per week)
Independent variables: Price (in $)
                       Advertising ($100s)
Data are collected for 15 weeks
Multiple Regression Example…2

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Multiple regression equation:
• Sales = b0 + b1(Price) + b2(Advertising)
• Sales = b0 + b1X1 + b2X2
where X1 = Price and X2 = Advertising
Estimating a Multiple Linear Regression Equation using Excel
• Excel will be used to generate the coefficients and measures
of goodness of fit for multiple regression
• Excel: Tools / Data Analysis... / Regression
Multiple Regression Equation: 2-Variable Example, Excel

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

Sales = 306.526 − 24.975(X1) + 74.131(X2)

ANOVA        df    SS          MS          F         Significance F
Regression   2     29460.027   14730.013   6.53861   0.01201
Residual     12    27033.306   2252.776
Total        14    56493.333

(The Excel output also reports, for each coefficient, the
standard error, t statistic, p-value, and 95% confidence limits.)

Multiple Regression Equation: 2-Variable Example

Sales = 306.526 − 24.975(X1) + 74.131(X2)

where Sales is in number of pies per week, Price is in $, and
Advertising is in $100s.

• b1 = −24.975: sales will decrease, on average, by 24.975 pies
per week for each $1 increase in selling price, net of the
effects of changes due to advertising
• b2 = 74.131: sales will increase, on average, by 74.131 pies
per week for each $100 increase in advertising, net of the
effects of changes due to price
Multiple Regression Equation: 2-Variable Example
Predict sales for a week in which the selling price is $5.50 and
advertising is $350:

Sales = 306.526 − 24.975(X1) + 74.131(X2)
      = 306.526 − 24.975(5.50) + 74.131(3.5)
      = 428.62

Predicted sales is 428.62 pies. Note that Advertising is in
$100s, so $350 means that X2 = 3.5.
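Outside Excel, the same coefficients can be recovered with an ordinary least-squares solve. A sketch assuming NumPy is available (the variable names simply mirror the example):

```python
import numpy as np

# Pie sales data from the 15-week table: sales, price ($), advertising ($100s)
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490,
         340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00,
         7.20, 7.90, 5.90, 5.00, 7.00]
advert = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0,
          3.5, 3.2, 4.0, 3.5, 2.7]

# Design matrix with an intercept column; solve min ||Xb - y||^2
X = np.column_stack([np.ones(len(sales)), price, advert])
b, *_ = np.linalg.lstsq(X, np.array(sales, dtype=float), rcond=None)
b0, b1, b2 = b                 # roughly 306.526, -24.975, 74.131

# Predict sales at price $5.50 and advertising $350 (X2 = 3.5)
pred = b0 + b1 * 5.50 + b2 * 3.5
print(round(b0, 3), round(b1, 3), round(b2, 3), round(pred, 2))
```

The solve reproduces the Excel coefficients and the 428.62-pie prediction to rounding error.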
Jenny Wilson Realty (1 of 9)
• Develop a model to determine the suggested listing price for
houses based on the size and age of the house

Ŷ = b0 + b1X1 + b2X2

where
Ŷ = predicted value of dependent variable (selling price)
b0 = Y intercept
X1 and X2 = values of the two independent variables (square
footage and age) respectively
b1 and b2 = slopes for X1 and X2 respectively
• Select a sample of houses that have sold recently and record
the data
Jenny Wilson Real Estate Data
TABLE 4.5 Jenny Wilson Real Estate Data
SELLING PRICE ($) SQUARE FOOTAGE AGE CONDITION
95,000 1,926 30 Good
119,000 2,069 40 Excellent
124,800 1,720 30 Excellent
135,000 1,396 15 Good
142,000 1,706 32 Mint
145,000 1,847 38 Mint
159,000 1,950 27 Mint
165,000 2,323 30 Excellent
182,000 2,285 26 Mint
183,000 3,752 35 Good
200,000 2,300 18 Good
211,000 2,525 17 Good
215,000 3,800 40 Excellent
219,000 1,740 12 Mint
Jenny Wilson Realty (2 of 9)
PROGRAM 4.4A Input Screen for Jenny Wilson Realty Multiple
Regression in Excel 2016
Jenny Wilson Realty (3 of 9)
PROGRAM 4.4B Excel 2016 Output Screen for Jenny Wilson Realty
Multiple Regression Example

Ŷ = b0 + b1X1 + b2X2 = 146,630.89 + 43.82X1 − 2898.69X2
Evaluating the Multiple Regression Model (1 of 2)
• Similar to simple linear regression models
• The p-value for the F test and r² are interpreted the same way
• The hypothesis is different because there is more than one
independent variable
• The F test investigates whether all the coefficients are equal
to 0 at the same time
Evaluating the Multiple Regression Model (2 of 2)
• To determine which independent variables are significant,
tests are performed for each variable

H0: βi = 0
H1: βi ≠ 0

• The test statistic is calculated, and if the p-value is lower
than the level of significance (α), the null hypothesis is
rejected
Jenny Wilson Realty (4 of 9)
• Full model is statistically significant
• Useful in predicting selling price

p-value for F test = 0.002, r² = 0.6719

• Are both variables significant?
• For X1 (square footage):
H0: β1 = 0
H1: β1 ≠ 0
For α = 0.05, p-value = 0.0013, so the null hypothesis is
rejected
• For X2 (age):
For α = 0.05, p-value = 0.0039, so the null hypothesis is
rejected
Jenny Wilson Realty (5 of 9)
Both square footage and age are helpful in predicting the
selling price
Questions???
