
CHAPTER 6 (PART II)

Trendlines and Regression Analysis

Prepared by: Nur Liyana Mohamed Yousop


MULTIPLE LINEAR REGRESSION

• A linear regression model with more than one independent variable is called a multiple linear
regression model.
ESTIMATED MULTIPLE REGRESSION EQUATION

• We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, …, bk, and then use the model:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

• The partial regression coefficients represent the expected change in the dependent
variable when the associated independent variable is increased by one unit while
the values of all other independent variables are held constant.
EXCEL REGRESSION TOOL

• The independent variables in the spreadsheet must be in contiguous columns.


• So, you may have to manually move the columns of data around before applying the tool.
• Key differences:
• Multiple R and R Square are called the multiple correlation coefficient and the coefficient
of multiple determination, respectively, in the context of multiple regression.
• ANOVA tests for significance of the entire model. That is, it computes an F-statistic for testing the hypotheses:
H0 : ß1 = ß2 = … = ßk = 0
H1 : at least one ßj is not equal to 0
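For reference, the F-statistic in the ANOVA output is the ratio of explained to unexplained variance, where k is the number of independent variables and n is the number of observations:

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))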
INTERPRETING REGRESSION RESULTS

Data: Salary Data


• Predict current salary using the following indicators:
  • Beginning salary
  • Previous experience (in months) when hired
  • Total years of education
• Sample: 100 employees in a firm
INTERPRETING REGRESSION RESULTS

Regression model:

Ŷ = -4139.2377 + 1.7302X1 - 10.9071X2 + 719.1221X3

Where:

Y : Current Salary
X1 : Beginning Salary
X2 : Previous Experience (months)
X3 : Total years of education
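The same model can be fitted outside Excel as a cross-check. Below is a minimal Python sketch using statsmodels; the file name salary_data.xlsx and the column names are assumptions for illustration, not taken from the actual workbook.

```python
# Hypothetical sketch: fitting the salary MLR model with statsmodels.
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("salary_data.xlsx")  # assumed file name

X = df[["beginning_salary", "prev_experience", "education_years"]]
X = sm.add_constant(X)                  # adds the intercept term b0
y = df["current_salary"]

model = sm.OLS(y, X).fit()
print(model.summary())                  # coefficients, R2, ANOVA F-test
```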
INTERPRETING REGRESSION RESULTS

The R2 value of 0.8031 indicates that 80.31% of the variation in current salary (DV) is explained by the IVs. This also indicates that the remaining 19.69% of the variation is explained by other variables. From the model, it is evident that total years of education has a larger impact on current salary compared to the other variables.
INTERPRETING REGRESSION RESULTS

The ANOVA test is slightly different for MLR compared to SLR. The model significance test is as follows:
H0 : ß1 = ß2 = ..... = ßn = 0
H1 : At least one ßm is not equal to 0
Since the p-value of Significance F = 0.000 < α = 5%, we reject the null hypothesis. Therefore, it is conclusive that at least one slope is statistically different from zero: the independent variables explain variation in the dependent variable, and the model is a good fit.
MODEL BUILDING ISSUES

• A good regression model should include only significant independent variables.


• However, it is not always clear exactly what will happen when we add or remove variables from a
model; variables that are (or are not) significant in one model may (or may not) be significant in
another.
• Therefore, you should not consider dropping all insignificant variables at one time, but rather take
a more structured approach.

• Adding an independent variable to a regression model will always result in an R2 equal to or greater than the R2 of the original model.

• Adjusted R2 reflects both the number of independent variables and the sample size, and may either increase or decrease when an independent variable is added or dropped. An increase in adjusted R2 indicates that the model has improved.
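For reference, adjusted R2 is computed from R2, the sample size n, and the number of independent variables k:

Adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1)

Because the penalty grows with k, adding a variable with little explanatory power can lower adjusted R2 even though R2 rises.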
SYSTEMATIC MODEL BUILDING APPROACH

1. Construct a model with all available independent variables. Check for significance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted R2. (Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant (a Python sketch of this loop follows below).
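Here is a minimal sketch of the backward-elimination loop above, assuming the data sit in a DataFrame df with the dependent variable in a column named y; the helper name backward_eliminate is made up for illustration.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, target="y", alpha=0.05):
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[target], X).fit()
        pvals = model.pvalues.drop("const")   # intercept is not a candidate
        worst = pvals.idxmax()                # step 2: largest p-value
        if pvals[worst] <= alpha:             # step 4: all significant, stop
            return model
        # step 3: remove one variable at a time and watch adjusted R2
        print(f"dropping {worst} (p = {pvals[worst]:.4f}), "
              f"adj. R2 before drop = {model.rsquared_adj:.4f}")
        predictors.remove(worst)
    raise ValueError("no significant predictors remain")
```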
IDENTIFYING THE BEST REGRESSION MODEL

Banking Data
Relationship between average bank balance and age, education, income, home value, and wealth.
Result: Home Value has the largest p-value; drop it and re-run the regression.
MULTICOLLINEARITY

Multicollinearity occurs when there are strong correlations among the independent variables, so that they can predict each other better than the dependent variable.
• When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable; the signs of coefficients may be the opposite of what they should be, making it difficult to interpret regression coefficients; and p-values can be inflated.

Correlations exceeding ±0.7 may indicate multicollinearity

The variance inflation factor is a better indicator, but not computed in Excel.
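Both screens can be run in Python. Below is a hedged sketch, assuming a DataFrame X that holds only the independent variables (the variable name is a placeholder); variance_inflation_factor is statsmodels' implementation of the VIF that Excel lacks.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Pairwise correlations against the +/-0.7 rule of thumb
# (the diagonal is always 1, so ignore it when reading the output)
corr = X.corr()
print(corr[corr.abs() > 0.7])

# VIF per variable; values above roughly 5-10 are a common warning sign
Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name != "const":
        print(name, variance_inflation_factor(Xc.values, i))
```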
IDENTIFYING POTENTIAL MULTICOLLINEARITY

Colleges and Universities correlation matrix: none of the correlations exceed the recommended threshold of ±0.7.

Banking Data correlation matrix: large correlations exist.


IDENTIFYING POTENTIAL MULTICOLLINEARITY

Salary Data? What do you think?
MODEL 1: BEFORE DROPPING ANY VARIABLES

MODEL 2: DROP HOME VALUE

Bank regression after removing Home Value

✓ Adjusted R2 improves slightly from 94.41% to 94.43%.


✓ All X variables are significant (p-value < α = 5%).
MODEL 3: DROP WEALTH

✓ If we remove Wealth from the model, the adjusted R2 drops to 91.93% and Education is no longer significant.
MODEL 4: DROP HOME VALUE AND WEALTH

✓ If we remove Home Value and Wealth from the model, the adjusted R2 drops to 92.01% and Education is no longer significant.
MODEL 5: DROP HOME VALUE, WEALTH AND EDUCATION

✓ Dropping Home Value, Wealth and Education, leaving only Age and Income in the model, results in an adjusted R2 of 92.02%, and all variables are significant.
MODEL 6: DROP HOME VALUE AND INCOME

• If we remove Income from the model instead of Wealth, the adjusted R2 drops only to 93.45%, and all remaining variables (Age, Education, and Wealth) are significant.
SUMMARY

             Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
Age          0.0000**   0.0000**   0.0000**   0.0000**   0.0000**   0.0000**
Education    0.0541*    0.0039**   0.4112     0.3392     -          0.0000**
Income       0.0005**   0.0000**   0.0000**   0.0000**   0.0000**   -
Home Value   0.4075     -          0.9157     -          -          -
Wealth       0.0000**   0.0000**   -          -          -          0.0000**

R2           0.9469     0.9465     0.9225     0.9225     0.9218     0.9365
Adj-R2       0.9441     0.9443     0.9193     0.9201     0.9202     0.9345

Entries for the variables are p-values (* significant at 10%, ** significant at 5%); "-" means the variable is not in that model.
PRACTICAL ISSUES IN TRENDLINE AND REGRESSION MODELING

• Identifying the best regression model often requires experimentation and trial and error.
• The independent variables selected should make sense in attempting to explain the dependent variable.
• Logic should guide your model development. In many applications, behavioral, economic, or
physical theory might suggest that certain variables should belong in a model.
• Additional variables increase R2 and, therefore, help to explain a larger proportion of the
variation.
• Even though a variable with a large p-value is not statistically significant, it could simply be the
result of sampling error and a modeler might wish to keep it.
• Good models are as simple as possible (the principle of parsimony).
OVERFITTING

Overfitting means fitting a model too closely to the sample data at the
risk of not fitting it well to the population in which we are interested.

In multiple regression, if we add too many terms to the model, then the
model may not adequately predict other values from the population.

Overfitting can be mitigated by using good logic, intuition, theory, and parsimony.
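Beyond judgment, a holdout check gives a quantitative guard. Here is a minimal sketch using scikit-learn cross-validation, where X and y stand in for the feature matrix and target (names assumed); a large gap between in-sample R2 and the cross-validated R2 is a symptom of overfitting.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
# R2 measured on held-out folds rather than the data the model was fit on
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```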
CHECKING ASSUMPTIONS FOR MULTIPLE LINEAR REGRESSION

CHECKING ASSUMPTIONS

Assumption: Linearity (linear relationship between IV and DV)
Verification: Examine the scatter diagram (should appear linear) and the residual plot (should appear random).
If the assumption is met: residuals are randomly scattered about zero and do not exhibit a specific pattern.

Assumption: Normality of Errors (errors are normally distributed with mean = 0)
Verification: View a histogram of standardized residuals, or apply a formal goodness-of-fit test (e.g. Pearson chi-square, Jarque-Bera, and others).
If the assumption is met: the residuals show a bell-shaped distribution.
CHECKING ASSUMPTIONS

Assumption: Homoscedasticity (constant variance: the variance around the regression line is similar for all values of the IVs)
Verification: Examine the residual plot.
If the assumption is met: there will not be dramatic differences in the spread of the data for different values of the IVs.

Assumption: Independence of Errors (no autocorrelation: the error terms should not be correlated with one another; if they are, the problem of autocorrelation exists)
Verification: Durbin-Watson statistic.
If the assumption is met: no autocorrelation, indicated by 1.5 ≤ D ≤ 2.5.

• The Durbin-Watson statistic d takes on values between 0 and 4. A value of d = 2 means there is no autocorrelation. A value substantially below 2 (and especially a value less than 1) means that the data is positively autocorrelated. A value substantially above 2 means that the data is negatively autocorrelated.
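As one way to run these checks outside Excel, here is a hedged Python sketch against a fitted statsmodels result named model (the name follows the earlier sketches and is an assumption):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

resid = model.resid                 # residuals from the fitted OLS model
fitted = model.fittedvalues

# Linearity and homoscedasticity: residuals vs fitted values should form
# a patternless band of roughly constant spread around zero
plt.scatter(fitted, resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

# Normality of errors: the histogram should look roughly bell-shaped
plt.hist(resid, bins=20)
plt.show()

# Independence of errors: a value near 2 (roughly 1.5 to 2.5 by the
# slide's rule of thumb) indicates no autocorrelation
print("Durbin-Watson:", durbin_watson(resid))
```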
MODEL 2: ESTIMATED MODEL

Bank regression after removing Home Value:

Balance = b0 + b1 Age + b3 Education + b4 Income + b5 Wealth

Balance = -12,432.4567 + 325.0653 Age + 773.3800 Education + 0.1597 Income + 0.0730 Wealth
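A quick way to use the estimated equation is to compute a prediction directly. This sketch hard-codes the Model 2 coefficients from the slide; the customer profile passed in is invented for illustration.

```python
# Model 2 coefficients from the slide; the example inputs are made up
def predicted_balance(age, education, income, wealth):
    return (-12_432.4567 + 325.0653 * age + 773.3800 * education
            + 0.1597 * income + 0.0730 * wealth)

# e.g. a hypothetical 40-year-old with 16 years of education
print(predicted_balance(age=40, education=16, income=60_000, wealth=150_000))
```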
CHECKING REGRESSION ASSUMPTIONS FOR MODEL 2

• Linearity
• Linear trend in the scatterplot
CONTINUED…

• Linearity
• No pattern in the residual plot
CONTINUED…

• Normality of Errors
• The residual histogram appears slightly skewed, but this is not a serious departure from normality.
• Excel: Data → Data Analysis → Histogram

[Figure: histogram of standardized residuals; frequency by bin, bins from -3 to 3]
CONTINUED…

• Homoscedasticity
• Residual plot shows no serious difference in the spread of the data for different X
values.
CONTINUED…

• Homoscedasticity
• The variances along the line of best fit remain similar.
CONTINUED…

• Autocorrelation
REGRESSION WITH CATEGORICAL VARIABLES

Regression analysis requires numerical data, so categorical data can be included as independent variables only after being coded numerically using dummy variables.
• For variables with 2 categories, code them as 0 and 1.
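A minimal pandas sketch of this coding, assuming a DataFrame df with a column MBA holding the strings "yes" and "no" (both names are assumptions):

```python
import pandas as pd

# Map the two categories to 1/0 so the regression can use the column
df["MBA"] = (df["MBA"] == "yes").astype(int)

# For variables with more than two categories, get_dummies creates one
# 0/1 column per category (drop one level to avoid perfect collinearity):
# pd.get_dummies(df["department"], drop_first=True)
```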
A MODEL WITH CATEGORICAL VARIABLES

Data: Employee Salaries
• Employee Salaries provides data for 35 employees.
• Predict Salary using Age and MBA (coded as yes = 1, no = 0).

Model:

Salary = b0 + b1 Age + b2 MBA
RESULTS FROM DUMMY REGRESSION

• Salary = 893.5876 + 1044.1460 Age + 14,767.2316 MBA
• If MBA = 0, Salary = 893.5876 + 1044.1460 Age
• If MBA = 1, Salary = 15,660.8192 + 1044.1460 Age
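The dummy coefficient acts as an intercept shift: 893.5876 + 14,767.2316 = 15,660.8192, so holding an MBA raises predicted salary by 14,767.23 at every age while the slope on Age is unchanged.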
RESULTS FROM DUMMY REGRESSION

• The coefficient of multiple determination (R2) is 0.9528. For our sample problem, this means 95.28% of salary variation can be explained by Age and MBA. The predicted equation fits the data well.
DUMMY REGRESSION ASSUMPTIONS

Assumptions?
END OF CHAPTER 6
