
CHAPTER 6 (PART II)

Trendlines and Regression Analysis

Prepared by: Nur Liyana Mohamed Yousop


MULTIPLE LINEAR REGRESSION

• A linear regression model with more than one independent variable is called a multiple linear
regression model.
ESTIMATED MULTIPLE REGRESSION EQUATION

• We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, …, bk, and then use the model:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

• The partial regression coefficients represent the expected change in the dependent
variable when the associated independent variable is increased by one unit while
the values of all other independent variables are held constant.
EXCEL REGRESSION TOOL

• The independent variables in the spreadsheet must be in contiguous columns.


• So, you may have to manually move the columns of data around before applying the tool.
• Key differences:
• Multiple R and R Square are called the multiple correlation coefficient and the coefficient
of multiple determination, respectively, in the context of multiple regression.
• ANOVA tests for significance of the entire model. That is, it computes an F-statistic for testing the hypotheses:
H0 : ß1 = ß2 = … = ßk = 0
H1 : at least one ßj is not equal to 0
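For reference, the F-statistic in the ANOVA output is the ratio of explained to unexplained variance, where k is the number of independent variables and n is the number of observations:

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))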
INTERPRETING REGRESSION RESULTS

Data: Salary Data


• Predict current salary using the following indicators:
  • Beginning salary
  • Previous experience (in months) when hired
  • Total years of education
• Sample: 100 employees in a firm
INTERPRETING REGRESSION RESULTS

Regression model:

Ŷ = -4139.2377 + 1.7302X1 - 10.9071X2 + 719.1221X3

Where:

Y : Current Salary
X1 : Beginning Salary
X2 : Previous Experience (months)
X3 : Total years of education
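The same model can be fitted outside Excel as a cross-check. Below is a minimal Python sketch using statsmodels; the file name salary_data.xlsx and the column names are assumptions for illustration, not taken from the actual workbook.

```python
# Hypothetical sketch: fitting the salary MLR model with statsmodels.
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("salary_data.xlsx")  # assumed file name

X = df[["beginning_salary", "prev_experience", "education_years"]]
X = sm.add_constant(X)                  # adds the intercept term b0
y = df["current_salary"]

model = sm.OLS(y, X).fit()
print(model.summary())                  # coefficients, R2, ANOVA F-test
```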
INTERPRETING REGRESSION RESULTS

The R2 value of 0.8031 indicates that 80.31% of the variation in current salary (DV) is explained by the IVs. This also indicates that the remaining 19.69% of the variation is explained by other variables. From the model, it is evident that total years of education has a larger impact on current salary compared to the other variables.
INTERPRETING REGRESSION RESULTS

The ANOVA test is slightly different for MLR compared to SLR. The model significance test is as follows:
H0 : ß1 = ß2 = ..... = ßn = 0
H1 : At least one ßm is not equal to 0
Since the p-value of Significance F = 0.000 < α = 5%, we reject the null hypothesis. Therefore, it is conclusive that at least one slope is statistically different from zero: the independent variables explain variation in the dependent variable, and the model is a good fit.
MODEL BUILDING ISSUES

• A good regression model should include only significant independent variables.


• However, it is not always clear exactly what will happen when we add or remove variables from a
model; variables that are (or are not) significant in one model may (or may not) be significant in
another.
• Therefore, you should not consider dropping all insignificant variables at one time, but rather take
a more structured approach.

• Adding an independent variable to a regression model will always result in an R2 equal to or greater than the R2 of the original model.

• Adjusted R2 reflects both the number of independent variables and the sample size, and may either increase or decrease when an independent variable is added or dropped. An increase in adjusted R2 indicates that the model has improved.
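For reference, adjusted R2 is computed from R2, the sample size n, and the number of independent variables k:

Adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1)

Because the penalty grows with k, adding a variable with little explanatory power can lower adjusted R2 even though R2 rises.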
SYSTEMATIC MODEL BUILDING APPROACH

1. Construct a model with all available independent variables. Check for significance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted R2. (Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant (a Python sketch of this loop follows below).
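Here is a minimal sketch of the backward-elimination loop above, assuming the data sit in a DataFrame df with the dependent variable in a column named y; the helper name backward_eliminate is made up for illustration.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, target="y", alpha=0.05):
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[target], X).fit()
        pvals = model.pvalues.drop("const")   # intercept is not a candidate
        worst = pvals.idxmax()                # step 2: largest p-value
        if pvals[worst] <= alpha:             # step 4: all significant, stop
            return model
        # step 3: remove one variable at a time and watch adjusted R2
        print(f"dropping {worst} (p = {pvals[worst]:.4f}), "
              f"adj. R2 before drop = {model.rsquared_adj:.4f}")
        predictors.remove(worst)
    raise ValueError("no significant predictors remain")
```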
IDENTIFYING THE BEST REGRESSION MODEL

Banking Data
Relationship between average bank balance and age, education, income, home value, and wealth.
Result: Home Value has the largest p-value; drop it and re-run the regression.
MULTICOLLINEARITY

Multicollinearity occurs when there are strong correlations among the independent variables, so that they can predict each other better than the dependent variable.
• When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable; the signs of coefficients may be the opposite of what they should be, making it difficult to interpret regression coefficients; and p-values can be inflated.

Correlations exceeding ±0.7 may indicate multicollinearity

The variance inflation factor is a better indicator, but not computed in Excel.
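Both screens can be run in Python. Below is a hedged sketch, assuming a DataFrame X that holds only the independent variables (the variable name is a placeholder); variance_inflation_factor is statsmodels' implementation of the VIF that Excel lacks.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Pairwise correlations against the +/-0.7 rule of thumb
# (the diagonal is always 1, so ignore it when reading the output)
corr = X.corr()
print(corr[corr.abs() > 0.7])

# VIF per variable; values above roughly 5-10 are a common warning sign
Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name != "const":
        print(name, variance_inflation_factor(Xc.values, i))
```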
IDENTIFYING POTENTIAL MULTICOLLINEARITY

Colleges and Universities correlation matrix: none of the correlations exceed the recommended threshold of ±0.7.

Banking Data correlation matrix: large correlations exist.


IDENTIFYING POTENTIAL MULTICOLLINEARITY

Salary Data? What do you think?
MODEL 1: BEFORE DROPPING ANY VARIABLES

MODEL 2: DROP HOME VALUE

Bank regression after removing Home Value

✓ Adjusted R2 improves slightly from 94.41% to 94.43%.


✓ All X variables are significant (p-value < α = 5%).
MODEL 3: DROP WEALTH

✓ If we remove Wealth from the model, the adjusted R2 drops to 91.93% and Education is no longer significant.
MODEL 4: DROP HOME VALUE AND WEALTH

✓ If we remove Home Value and Wealth from the model, the adjusted R2 drops to 92.01% and Education is no longer significant.
MODEL 5: DROP HOME VALUE, WEALTH AND EDUCATION

✓ Dropping Home Value, Wealth and Education, leaving only Age and Income in the model, results in an adjusted R2 of 92.02%, and all variables are significant.
MODEL 6: DROP HOME VALUE AND INCOME

• If we remove Income from the model instead of Wealth, the adjusted R2 drops only to 93.45%, and all remaining variables (Age, Education, and Wealth) are significant.
SUMMARY

             Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
Age          0.0000**   0.0000**   0.0000**   0.0000**   0.0000**   0.0000**
Education    0.0541*    0.0039**   0.4112     0.3392     -          0.0000**
Income       0.0005**   0.0000**   0.0000**   0.0000**   0.0000**   -
Home Value   0.4075     -          0.9157     -          -          -
Wealth       0.0000**   0.0000**   -          -          -          0.0000**

R2           0.9469     0.9465     0.9225     0.9225     0.9218     0.9365
Adj-R2       0.9441     0.9443     0.9193     0.9201     0.9202     0.9345

Entries for the variables are p-values (* significant at 10%, ** significant at 5%); "-" means the variable is not in that model.
PRACTICAL ISSUES IN TRENDLINE AND REGRESSION MODELING

• Identifying the best regression model often requires experimentation and trial and error.
• The independent variables selected should make sense in attempting to explain the dependent variable.
• Logic should guide your model development. In many applications, behavioral, economic, or
physical theory might suggest that certain variables should belong in a model.
• Additional variables increase R2 and, therefore, help to explain a larger proportion of the
variation.
• Even though a variable with a large p-value is not statistically significant, it could simply be the
result of sampling error and a modeler might wish to keep it.
• Good models are as simple as possible (the principle of parsimony).
OVERFITTING

Overfitting means fitting a model too closely to the sample data at the
risk of not fitting it well to the population in which we are interested.

In multiple regression, if we add too many terms to the model, then the
model may not adequately predict other values from the population.

Overfitting can be mitigated by using good logic, intuition, theory, and parsimony.
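Beyond judgment, a holdout check gives a quantitative guard. Here is a minimal sketch using scikit-learn cross-validation, where X and y stand in for the feature matrix and target (names assumed); a large gap between in-sample R2 and the cross-validated R2 is a symptom of overfitting.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
# R2 measured on held-out folds rather than the data the model was fit on
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```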
CHECKING ASSUMPTIONS FOR MULTIPLE LINEAR REGRESSION

CHECKING ASSUMPTIONS

Assumption: Linearity (linear relationship between IV and DV)
Verification: Examine the scatter diagram (should appear linear) and the residual plot (should appear random).
If the assumption is met: residuals are randomly scattered about zero and do not exhibit a specific pattern.

Assumption: Normality of Errors (errors are normally distributed with mean = 0)
Verification: View a histogram of standardized residuals, or apply a formal goodness-of-fit test (e.g. Pearson chi-square, Jarque-Bera, and others).
If the assumption is met: the residuals show a bell-shaped distribution.
CHECKING ASSUMPTIONS

Assumption: Homoscedasticity (constant variance: the variance around the regression line is similar for all values of the IVs)
Verification: Examine the residual plot.
If the assumption is met: there will not be dramatic differences in the spread of the data for different values of the IVs.

Assumption: Independence of Errors (no autocorrelation: the error terms should not be correlated with one another; if they are, the problem of autocorrelation exists)
Verification: Durbin-Watson statistic.
If the assumption is met: no autocorrelation, indicated by 1.5 ≤ D ≤ 2.5.

• The Durbin-Watson statistic d takes on values between 0 and 4. A value of d = 2 means there is no autocorrelation. A value substantially below 2 (and especially a value less than 1) means that the data is positively autocorrelated. A value substantially above 2 means that the data is negatively autocorrelated.
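As one way to run these checks outside Excel, here is a hedged Python sketch against a fitted statsmodels result named model (the name follows the earlier sketches and is an assumption):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

resid = model.resid                 # residuals from the fitted OLS model
fitted = model.fittedvalues

# Linearity and homoscedasticity: residuals vs fitted values should form
# a patternless band of roughly constant spread around zero
plt.scatter(fitted, resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

# Normality of errors: the histogram should look roughly bell-shaped
plt.hist(resid, bins=20)
plt.show()

# Independence of errors: a value near 2 (roughly 1.5 to 2.5 by the
# slide's rule of thumb) indicates no autocorrelation
print("Durbin-Watson:", durbin_watson(resid))
```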
MODEL 2: ESTIMATED MODEL

Bank regression after removing Home Value:

Balance = b0 + b1 Age + b3 Education + b4 Income + b5 Wealth

Balance = -12,432.4567 + 325.0653 Age + 773.3800 Education + 0.1597 Income + 0.0730 Wealth
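A quick way to use the estimated equation is to compute a prediction directly. This sketch hard-codes the Model 2 coefficients from the slide; the customer profile passed in is invented for illustration.

```python
# Model 2 coefficients from the slide; the example inputs are made up
def predicted_balance(age, education, income, wealth):
    return (-12_432.4567 + 325.0653 * age + 773.3800 * education
            + 0.1597 * income + 0.0730 * wealth)

# e.g. a hypothetical 40-year-old with 16 years of education
print(predicted_balance(age=40, education=16, income=60_000, wealth=150_000))
```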
CHECKING REGRESSION ASSUMPTIONS FOR MODEL 2

• Linearity
• Linear trend in the scatterplot
CONTINUED…

• Linearity
• No pattern in the residual plot
CONTINUED…

• Normality of Errors
• The residual histogram appears slightly skewed, but this is not a serious departure from normality.
• Excel: Data → Data Analysis → Histogram

[Figure: histogram of standardized residuals; frequency by bin, bins from -3 to 3]
CONTINUED…

• Homoscedasticity
• Residual plot shows no serious difference in the spread of the data for different X
values.
CONTINUED…

• Homoscedasticity
• The variances along the line of best fit remain similar.
CONTINUED…

• Autocorrelation
REGRESSION WITH CATEGORICAL VARIABLES

Regression analysis requires numerical data, so categorical data can be included as independent variables only after being coded numerically using dummy variables.
• For variables with 2 categories, code them as 0 and 1.
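A minimal pandas sketch of this coding, assuming a DataFrame df with a column MBA holding the strings "yes" and "no" (both names are assumptions):

```python
import pandas as pd

# Map the two categories to 1/0 so the regression can use the column
df["MBA"] = (df["MBA"] == "yes").astype(int)

# For variables with more than two categories, get_dummies creates one
# 0/1 column per category (drop one level to avoid perfect collinearity):
# pd.get_dummies(df["department"], drop_first=True)
```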
A MODEL WITH CATEGORICAL VARIABLES

Data: Employee Salaries
• Employee Salaries provides data for 35 employees.
• Predict Salary using Age and MBA (coded as yes = 1, no = 0).

Model:

Salary = b0 + b1 Age + b2 MBA
RESULTS FROM DUMMY REGRESSION

• Salary = 893.5876 + 1044.1460 Age + 14,767.2316 MBA
• If MBA = 0, Salary = 893.5876 + 1044.1460 Age
• If MBA = 1, Salary = 15,660.8192 + 1044.1460 Age
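The dummy coefficient acts as an intercept shift: 893.5876 + 14,767.2316 = 15,660.8192, so holding an MBA raises predicted salary by 14,767.23 at every age while the slope on Age is unchanged.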
RESULTS FROM DUMMY REGRESSION

• The coefficient of multiple determination (R2) is 0.9528. For our sample problem, this means 95.28% of salary variation can be explained by Age and MBA. The predicted equation fits the data well.
DUMMY REGRESSION ASSUMPTIONS

Assumptions?
END OF CHAPTER 6
