MMGT6012
Business Tools for Management
TOPIC 5: Multiple Regression Modelling

Dr. Matthew Beck
ITLS, Business School
The University of Sydney

5. Multiple Regression Modelling
Assumptions of Regression
Linear relationship between X and Y
No multicollinearity
– Independent variables are not highly correlated with each other
Normality of Error
– Error values (ε) are normally distributed for any given value of X
Homoscedasticity
– The probability distribution of the errors has constant variance
Independence of Errors
– Error values are statistically independent

The Simple Linear Regression Function
In the population the regression model is:
Yi = β0 + β1Xi + εi
In the sample the regression model is:
Ŷi = b0 + b1Xi

[Figure: scatterplot of Y against X with the fitted line Yi = β0 + β1Xi + εi, showing the intercept β0, the slope β1, and the random error εi for a given Xi as the gap between the observed and predicted values of Y]
Objective is to minimise all errors!

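"Minimise all errors" is usually made precise as ordinary least squares: choose the intercept and slope that minimise the sum of squared errors. A sketch of that objective and its standard solution for a single X, writing b0 and b1 for the sample estimates (notation assumed here, matching the Two Variable Model slide):

```latex
\min_{b_0,\,b_1} \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2
\quad\Longrightarrow\quad
b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}
```
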
The Multiple Linear Regression Function
In the population the regression model (for k independent variables) is:
Yi = β0 + β1X1i + β2X2i + ⋯ + βkXki + εi
In the sample the regression model is:
Ŷi = b0 + b1X1i + b2X2i + ⋯ + bkXki

Two Variable Model
Ŷ = b0 + b1X1 + b2X2
[Figure: three-dimensional plot of the fitted regression plane, with axes X1, X2 and Y]

Multiple Regression Example
A local golf store wants to evaluate factors thought to influence demand for boxes of golf balls
Dependent Variable:
– Golf Ball Sales
Independent Variables:
– Price (in $)
– Advertising (in $100)
Data are collected for 15 weeks

Multiple Regression Example
Sales = β0 + β1(Price) + β2(Advert)

Week  Sales  Price  Advert
1     450    33     3.3
2     560    45     3.3
3     450    48     3.0
4     530    48     4.5
5     450    41     3.0
6     480    45     4.0
7     530    27     3.0
8     570    38     3.7
9     550    42     3.5
10    590    30     4.0
11    440    43     3.5
12    400    47     3.2
13    540    35     4.0
14    550    30     3.5
15    400    42     2.7

Visualising the relationships
[Figures: scatterplots of Sales vs. Price and Sales vs. Advertising]

How well does the model fit the data?

– R² reports the proportion of the variation in Y explained by the variation in the X variables
– 51.9% of the variation in sales is explained by variations in price and advertising

Are the coefficients significant?

– Can use either the sig value (< 0.05) or the absolute t statistic (> 1.96 in large samples; about 2.18 here, with 15 − 2 − 1 = 12 degrees of freedom)

Use the coefficients to construct the regression equation:

– Sales = 406.9 – 4.149(Price) + 73.773(Advert)
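
As a rough check, the R² and the coefficients quoted above can be reproduced from the 15-week data table with ordinary least squares. This is a minimal sketch in standard-library Python rather than the SPSS workflow used in class; the helper name `ols` and the variable names are illustrative assumptions.

```python
# (Sales, Price in $, Advertising in $100) for weeks 1-15, copied from the data table.
data = [
    (450, 33, 3.3), (560, 45, 3.3), (450, 48, 3.0), (530, 48, 4.5), (450, 41, 3.0),
    (480, 45, 4.0), (530, 27, 3.0), (570, 38, 3.7), (550, 42, 3.5), (590, 30, 4.0),
    (440, 43, 3.5), (400, 47, 3.2), (540, 35, 4.0), (550, 30, 3.5), (400, 42, 2.7),
]


def ols(x_rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in x_rows]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]


sales = [row[0] for row in data]
x_rows = [row[1:] for row in data]
b0, b_price, b_advert = ols(x_rows, sales)

# R-square = 1 - SSE / SST
fitted = [b0 + b_price * p + b_advert * a for p, a in x_rows]
mean_sales = sum(sales) / len(sales)
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))
sst = sum((y - mean_sales) ** 2 for y in sales)
r_squared = 1 - sse / sst

print(f"Intercept {b0:.1f}, Price {b_price:.3f}, Advert {b_advert:.3f}")
print(f"R-square = {r_squared:.3f}")
```

Run as written, the estimates should agree with the slide's figures to rounding.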

Sales = 406.9 – 4.149(Price) + 73.773(Advert)
– As price increases by 1 unit ($1), sales decrease by 4.1 units, holding advertising constant
– As advertising increases by 1 unit ($100), sales increase by 73.8 units, holding price constant

Use the regression model to predict sales:

– What if price was set to $40 per box and advertising was $450?
Note that advertising was measured in hundreds of dollars, so convert $450 to 4.5 “hundreds”
Sales = 406.9 – 4.149(Price) + 73.773(Advert)
Sales = 406.9 – 4.149(40) + 73.773(4.5)
Sales = 572.9 boxes of golf balls
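
The same prediction can be written as a small function, which makes the unit conversion for advertising explicit. A minimal sketch using the fitted equation from the slide; the function and argument names are illustrative.

```python
def predict_sales(price, advert_hundreds):
    """Predicted weekly boxes sold, using the fitted equation from the slide."""
    return 406.9 - 4.149 * price + 73.773 * advert_hundreds


advert_dollars = 450
# Advertising enters the model in hundreds of dollars, so divide by 100 first.
prediction = predict_sales(price=40, advert_hundreds=advert_dollars / 100)
print(round(prediction, 1))  # 572.9
```

Passing 450 instead of 4.5 would inflate the prediction by a factor of 100 on the advertising term, which is the mistake the note above warns against.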

Class Activity
Within what range of sales are you 95% confident that the actual sales of golf balls will fall?

Summarising What We Know
We know that we do the following for regression:

– Assess model fit
– Assess significance of coefficients
– Interpret model
– Make predictions

Adding to Our Knowledge
There are several issues with multiple regression modelling:

– R-square never decreases when a new X variable is added to the model
– Using multiple t-tests to individually test each coefficient one at a time
increases the chance of making a Type I error
– How do we compare the impacts of different independent variables?
– How do we add categorical variables to our regression model?
– How do we remove variables that are insignificant?

Adjusted R-square
R-square:
– Never decreases when a new X variable is added to the model
Adjusted R-Square:
– Calibrates the R2 based on how many X variables we are using
– Does using an extra X (k = number of X’s) add any benefit?
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)
where n = number of observations and k = number of X variables
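
To see the penalty at work, the adjustment can be applied to the golf ball model (R² = 0.519, n = 15, k = 2). A minimal sketch; the second call uses a hypothetical R² of 0.53 for a model with a third X, a value that does not come from the slides.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for a model with k X variables fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


two_x = adjusted_r2(0.519, n=15, k=2)   # golf ball model from the slides
three_x = adjusted_r2(0.53, n=15, k=3)  # hypothetical: one more X, slightly higher R-square

print(round(two_x, 3), round(three_x, 3))
```

In this illustration R² rises from 0.519 to 0.53, yet adjusted R² falls, so the extra variable would not be judged worth adding.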

Adjusted R-square
Adjusted R-square is used for comparing model fit:

– Used to determine which model fits the data better
– Can only compare models with the same Y (dependent) variable
R-Square is used for interpreting model fit:

– Used to interpret how well the chosen model is performing

ANOVA (F test)
Using a t-test:
– Every time we use a t-test we allow for a 5% chance of a mistake
– The more times we do a test the more likely it is we will make a mistake
ANOVA (F test):
– Test of overall model significance
– H0: β1 = β2 = … = βk = 0 (no linear relationship)
– H1: at least one βi ≠ 0 (at least one Xi affects Y)
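
Two quick calculations sit behind this slide. The overall F statistic can be computed from R², n and k, and the chance of at least one false positive grows with the number of separate tests. A minimal sketch; the familywise formula assumes the tests are independent, and the function names are illustrative.

```python
def f_statistic(r2, n, k):
    """Overall F statistic from R-square, n observations and k X variables."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def familywise_error(tests, alpha=0.05):
    """Chance of at least one Type I error across independent tests at level alpha."""
    return 1 - (1 - alpha) ** tests


f_value = f_statistic(0.519, n=15, k=2)  # golf ball model from the slides
print(round(f_value, 2))
print(round(familywise_error(2), 4), round(familywise_error(5), 4))
```

For the golf ball model F is roughly 6.47, which exceeds the 5% critical value of about 3.89 from a standard F table with 2 and 12 degrees of freedom, so H0 would be rejected.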

ANOVA (F test)

Standardised Betas
Slope Coefficients:
– Tell us the average change in Y for a one unit change in X
– They are a function of how X is measured!
Predicting Public Transport Use:

– Distance to work is in kilometres
– Age is in years
– Income is in dollars
– Number of people in family is in ???

Standardised Betas
Recall our discussion of the Normal distribution:

– We could standardise data to remove units of measurement
z = (x − μ) / σ
where x = value of interest, μ = mean of the data, σ = standard deviation
We can do the same to the data in regression:

– We could convert columns of data to z-scores
– Use the z-scores as the X variables in the regression model
– These slope coefficients would have a standardised unit of measure
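
When both Y and the X's are converted to z-scores (an assumption here, matching how SPSS reports its "Standardized Coefficients" column), refitting is equivalent to rescaling each unstandardised slope by sd(X) / sd(Y). A minimal sketch on the golf ball data, using the slopes from the earlier slide; the variable and function names are illustrative.

```python
from statistics import stdev

# Columns of the 15-week golf ball data table.
sales = [450, 560, 450, 530, 450, 480, 530, 570, 550, 590, 440, 400, 540, 550, 400]
price = [33, 45, 48, 48, 41, 45, 27, 38, 42, 30, 43, 47, 35, 30, 42]
advert = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

b_price, b_advert = -4.149, 73.773  # unstandardised slopes from the slide


def standardised_beta(b, x, y):
    """Rescale an unstandardised slope into standard-deviation units."""
    return b * stdev(x) / stdev(y)


beta_price = standardised_beta(b_price, price, sales)
beta_advert = standardised_beta(b_advert, advert, sales)
print(round(beta_price, 3), round(beta_advert, 3))
```

The raw slopes differ by a factor of about 18, but in standard-deviation units the two effects are of similar size, with advertising slightly larger in absolute value.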

Standardised Betas
Only absolute values matter:
Standardised Betas are used for comparison NOT interpretation

Dummy Coding
Adding continuous variables is easy:

– Coefficients tell you how Y changes as X changes by one unit
– Have numerical properties ($10 is twice as much as $5)
Can categorical variables change by one unit?

– Male = 1
– Female = 2
– Does being female mean you are twice a male?

Dummy Coding
Regression modelling relies on numerical properties:

– We need a method to separate the numerical properties from the numbers
that are used to represent different categories of outcomes
Dummy coding is the process of including categorical variables:

– You create a “dummy” variable for each category within a variable
– A numerical proxy for the presence or absence of a category
Typically:
– 0 = absence of category/characteristic
– 1 = presence of category/characteristic

Dummy Coding – Simple Example
Predicting income based on gender:

– Female = 0
– Male = 1
Coefficients (a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B          Std. Error         Beta       t          Sig.
1  (Constant)    450.694    1.033                         436.194    .000
   GENDER        109.010    1.393              .999       78.243     .000
a. Dependent Variable: INCOME

Class Activity
Income = 450.69 + 109.01(Gender)
– How much do you predict females (0) earn?
– How much do you predict males (1) earn?
– What is the difference in weekly income?

Dummy Coding – Visual Example
Does seasonality affect golf ball sales?
– Winter = 0
– Summer = 1

Ball Sales  Price  Season
275         36     0
338         42     1
420         30     1
254         31     0
373         47     1
418         34     1
199         38     0
243         45     0
259         30     0
403         40     1
343         41     1
242         33     0
191         39     0
217         45     0
351         34     1
364         32     1
283         33     0
330         47     1
324         43     1

Scatterplot ignoring Season:
[Figure: scatterplot of Sales vs. Price]

Dummy Coding – Visual Example
Looking at sales in Summer and Winter side-by-side:
[Figures: scatterplots of Sales vs. Price (Winter) and Sales vs. Price (Summer)]

Looking at sales in Summer and Winter on the one graph:
[Figure: scatterplot of Sales vs. Price showing both seasons]

The effect of price is the same no matter what season it is:

– Just in summer we sell a constant amount more
Y = b0 + b1X1 + b2X2
– X2 = 1 → b0 + b1X1 + b2(1) → b0 + b1X1 + b2 → (b0 + b2) + b1X1
– X2 = 0 → b0 + b1X1 + b2(0) → b0 + b1X1 + 0 → b0 + b1X1
Same slope, different constant.
A dummy variable gives the average difference between two categories
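
The parallel-lines idea can be tried on the season data. This is a minimal standard-library sketch, not the SPSS output shown in class; it uses the 19 rows of the data table as reproduced in these notes, so its estimates may differ slightly from the slide. The helper `ols` and the two line functions are illustrative names.

```python
# (Ball sales, Price, Season) with Winter = 0 and Summer = 1.
data = [
    (275, 36, 0), (338, 42, 1), (420, 30, 1), (254, 31, 0), (373, 47, 1),
    (418, 34, 1), (199, 38, 0), (243, 45, 0), (259, 30, 0), (403, 40, 1),
    (343, 41, 1), (242, 33, 0), (191, 39, 0), (217, 45, 0), (351, 34, 1),
    (364, 32, 1), (283, 33, 0), (330, 47, 1), (324, 43, 1),
]


def ols(x_rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in x_rows]  # prepend the intercept column
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]


sales = [row[0] for row in data]
b0, b_price, b_season = ols([row[1:] for row in data], sales)


def winter_line(price):
    """Season = 0: constant is b0."""
    return b0 + b_price * price


def summer_line(price):
    """Season = 1: same slope, constant shifted up by b_season."""
    return (b0 + b_season) + b_price * price


print(round(b_price, 2), round(b_season, 1))
print(round(summer_line(40) - winter_line(40), 1))  # equals b_season at any price
```

The gap between the two lines is the dummy coefficient regardless of the price chosen, which is what "same slope, different constant" means in practice.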

Looking at regression output from SPSS:
What is the regression equation?

– On average, how many more golf balls do we sell in summer each week?

Dummy Coding – Multiple Categories
Thus far we have only looked at a variable with two categories:

– How many variables (X’s) were used in the regression equation?
– We compared the impact of one category (1) against a base (0)
If we have a variable with more than two categories:

– Create (k-1) new variables
– Where k is the # of categories

Dummy Coding – Multiple Categories
Sales  Colour  Colour code
1623   White   2
1259   Yellow  1
68     Pink    3
110    Pink    3
346    Pink    3
1780   White   2
1335   Yellow  1
1861   White   2
1697   White   2
1870   White   2
83     Pink    3
323    Pink    3
1163   Yellow  1
1720   White   2
1090   Yellow  1
1716   White   2
1300   Yellow  1
1886   White   2
1548   White   2

[Figure: scatterplot of Sales against Colour code with fitted trend line y = -563.36x + 2303.6, R² = 0.3865]

[Table: the same data with dummy-variable columns Yellow, White and Pink added]

Dummy Coding – Multiple Categories
The variable with the zeros is referred to as the base variable:

– In the previous slide our base variable was the colour white
– I generally use the mode category as the base
– We compare the impact of each category against this base
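
A minimal sketch of the (k-1) coding for the colour data, assuming White is the base. It relies on a standard result: when the dummies are the only X's in the model, the least-squares constant equals the base category's mean and each dummy coefficient equals that category's mean minus the base mean. The function and variable names are illustrative.

```python
# (Sales, Colour) pairs copied from the data table.
data = [
    (1623, "White"), (1259, "Yellow"), (68, "Pink"), (110, "Pink"), (346, "Pink"),
    (1780, "White"), (1335, "Yellow"), (1861, "White"), (1697, "White"),
    (1870, "White"), (83, "Pink"), (323, "Pink"), (1163, "Yellow"), (1720, "White"),
    (1090, "Yellow"), (1716, "White"), (1300, "Yellow"), (1886, "White"),
    (1548, "White"),
]

base = "White"                  # the base category gets no column of its own
categories = ["Yellow", "Pink"]  # k - 1 = 2 dummy variables for k = 3 colours


def dummy_row(colour):
    """Return the [Yellow, Pink] dummies; White is coded as all zeros."""
    return [1 if colour == c else 0 for c in categories]


dummies = [dummy_row(colour) for _, colour in data]


def mean_sales(colour):
    values = [s for s, c in data if c == colour]
    return sum(values) / len(values)


# With only the dummies in the model, the coefficients follow from group means.
b0 = mean_sales(base)
coefficients = {c: mean_sales(c) - b0 for c in categories}

print(round(b0, 1), {c: round(v, 1) for c, v in coefficients.items()})
```

Each coefficient reads as "sales for this colour relative to White", so a negative value means the colour sells less than the base.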
SPSS output for this regression model:

Class Activity
1. How well does the model now fit compared to previously?
2. What is the regression equation?
3. How do you interpret the regression coefficients?

Stepwise Regression
The process we use to remove insignificant variables
1. Perform regression and put all your X variables in the model
2. Identify the X variable with the biggest sig value
3. Remove it!!
4. Re-run the regression and keep going one by one until only significant variables remain
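
The steps above can be sketched as a loop. Within one model every coefficient shares the same degrees of freedom, so the X with the biggest sig value is also the X with the smallest absolute t statistic; this sketch therefore works with t and the 1.96 rule of thumb rather than computing sig values. As an assumption made purely for illustration, it treats the week number from the golf ball table as a third candidate X, which the slides do not do. All function names are illustrative.

```python
import math

columns = {
    "Week": list(range(1, 16)),
    "Price": [33, 45, 48, 48, 41, 45, 27, 38, 42, 30, 43, 47, 35, 30, 42],
    "Advert": [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7],
}
sales = [450, 560, 450, 530, 450, 480, 530, 570, 550, 590, 440, 400, 540, 550, 400]


def fit_ols(cols, y):
    """Return (coefficients, t statistics) as dicts keyed by name, 'const' first."""
    names = ["const"] + list(cols)
    n, p = len(y), len(names)
    rows = [[1.0] + [cols[name][i] for name in cols] for i in range(n)]
    # Reduce [X'X | I] to [I | (X'X)^-1] by Gauss-Jordan elimination.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(p):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    inv = [row[p:] for row in aug]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    b = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    t = [b[i] / math.sqrt(s2 * inv[i][i]) for i in range(p)]
    return dict(zip(names, b)), dict(zip(names, t))


def backward_eliminate(cols, y, t_crit=1.96):
    """Drop the weakest X, refit, and repeat until every |t| exceeds t_crit."""
    kept = dict(cols)
    while kept:
        _, t = fit_ols(kept, y)
        weakest = min(kept, key=lambda name: abs(t[name]))
        if abs(t[weakest]) > t_crit:
            break
        del kept[weakest]
    return list(kept)


kept = backward_eliminate(columns, sales)
print(kept)
```

Refitting after each removal matters because the remaining t statistics change once a variable leaves the model.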

Quick Summary of Topic 5
Measures of Model Performance:

– R-square: Percent of variation in Y explained by variations in the X’s
– Adjusted R-square: Used to compare performance of different models of Y
– ANOVA: Used to assess overall model significance
Coefficients:
– T-test: Used to determine if coefficients are significant or not
– Stepwise: Used to remove insignificant X’s one at a time
– Unstandardised: Used to create regression line and measure impact of X
– Standardised: Used to compare relative impacts of the different X
– Interpretation of slope coefficient for continuous vs. categorical X’s differs

Class Activity
1. Locate the "Class Data – Clean.xlsx" data file
2. On this data identify a Y variable of interest:

– Conduct regression modelling on the Y variable
3. Create a PowerPoint presentation that shows:

– Your initial model and a list of what variables you removed
– Your final model, interpretation of model performance
– Interpretation of slope coefficients and what it tells you about the Y variable
4. Email to matthew.beck@sydney.edu.au
– GROUP NAME in subject line
