• A linear regression model with more than one independent variable is called a multiple linear regression model.
ESTIMATED MULTIPLE REGRESSION EQUATION
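In standard notation, the estimated multiple regression equation with k independent variables is:

Ŷ = b0 + b1 X1 + b2 X2 + ... + bk Xk

where each bj is the sample estimate of the corresponding population coefficient βj.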
• The partial regression coefficients represent the expected change in the dependent
variable when the associated independent variable is increased by one unit while
the values of all other independent variables are held constant.
EXCEL REGRESSION TOOL
Y = β0 + β1 X1 + β2 X2 + β3 X3

where:
Y : Current Salary
X1 : Beginning Salary
X2 : Previous Experience (months)
X3 : Total years of education
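For readers working outside Excel, a minimal Python sketch of the same regression using statsmodels, with synthetic stand-in data (the variable names and values here are hypothetical, not the course dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
# Synthetic stand-in for the salary data (hypothetical values).
df = pd.DataFrame({
    "beginning_salary": rng.uniform(20_000, 50_000, n),  # X1
    "prev_experience": rng.integers(0, 240, n),          # X2, months
    "education_years": rng.integers(10, 20, n),          # X3
})
df["current_salary"] = (5_000 + 1.5 * df["beginning_salary"]
                        + 20 * df["prev_experience"]
                        + 800 * df["education_years"]
                        + rng.normal(0, 3_000, n))       # Y

X = sm.add_constant(df[["beginning_salary", "prev_experience", "education_years"]])
result = sm.OLS(df["current_salary"], X).fit()
print(result.summary())  # analogous to Excel's SUMMARY OUTPUT table
```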
INTERPRETING REGRESSION RESULTS
The R2 value of 0.8031 indicates that 80.31% of the variation in the current salary (DV) is explained by the IVs; the remaining 19.69% is attributable to factors outside the model. From the model, it is also evident that total years of education has a larger impact on current salary than the other variables.
INTERPRETING REGRESSION RESULTS
The ANOVA test for MLR differs slightly from the one for SLR. The test of model significance is as follows:

H0 : β1 = β2 = ... = βk = 0
H1 : At least one βj is not equal to 0

Since the p-value (Significance F) = 0.000 < α = 5%, we reject the null hypothesis. Therefore, at least one slope is statistically different from zero: the independent variables explain variation in the dependent variable, and the model is a good fit.
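Continuing the hypothetical statsmodels sketch above, Excel's "Significance F" corresponds to the p-value of this overall F-test and can be read directly off the fitted result:

```python
# `result` is the fitted OLS model from the earlier sketch (hypothetical data).
# Overall F-test: H0: beta1 = beta2 = ... = betak = 0.
print("F statistic    :", result.fvalue)
print("Significance F :", result.f_pvalue)
alpha = 0.05
print("Reject H0" if result.f_pvalue < alpha else "Fail to reject H0")
```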
MODEL BUILDING ISSUES
1. Construct a model with all available independent variables. Check the significance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate the adjusted R2. (Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.) A sketch of this procedure appears below.
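A minimal sketch of this backward-elimination loop, assuming a pandas DataFrame whose columns are the candidate independent variables plus the target (the function and column handling are illustrative, not the course's exact procedure):

```python
import statsmodels.api as sm

def backward_eliminate(df, target, alpha=0.05):
    """Drop the least significant IV one at a time until all are significant."""
    features = [c for c in df.columns if c != target]
    result = None
    while features:
        X = sm.add_constant(df[features])
        result = sm.OLS(df[target], X).fit()
        pvals = result.pvalues.drop("const")   # p-values of the IVs only
        worst = pvals.idxmax()                 # step 2: largest p-value
        if pvals[worst] <= alpha:              # all IVs significant: stop
            break
        print(f"dropping {worst} (p = {pvals[worst]:.3f}), "
              f"adjusted R2 before drop = {result.rsquared_adj:.4f}")
        features.remove(worst)                 # step 3: remove ONE variable, refit
    return result
```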
Banking Data
Relationship between average bank balance and age, education, income, home value, and wealth.
Result: Home value has the largest p-value; drop it and re-run the regression.
MULTICOLLINEARITY
Multicollinearity occurs when there are strong correlations among the independent variables, such that they predict each other better than they predict the dependent variable.
• When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable; the signs of coefficients may be the opposite of what they should be, making regression coefficients difficult to interpret; and p-values can be inflated.
The variance inflation factor (VIF) is a better indicator, but it is not computed by Excel's regression tool.
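Although Excel does not report VIFs, they are easy to compute in Python. A sketch with deliberately correlated synthetic predictors (names and values hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 100
age = rng.uniform(25, 70, n)
income = 1_000 * age + rng.normal(0, 5_000, n)    # deliberately tied to age
wealth = 5 * income + rng.normal(0, 20_000, n)    # and to income
X = sm.add_constant(pd.DataFrame({"age": age, "income": income, "wealth": wealth}))

# A common rule of thumb flags VIF > 10 (some texts use > 5) as serious.
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.1f}")
```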
IDENTIFYING POTENTIAL MULTICOLLINEARITY
In the Colleges and Universities correlation matrix, none of the correlations exceed the recommended threshold of ±0.7.
Salary Data: what do you think?
MODEL 1: BEFORE DROPPING ANY VARIABLES
MODEL 2: DROP HOME VALUE
✓ If we remove Home Value and Wealth from the model, the adjusted R2 drops to 92.01%, and Education is no longer significant.
MODEL 5: DROP HOME VALUE, WEALTH AND EDUCATION
✓ Dropping Home Value, Wealth, and Education, leaving only Age and Income in the model, results in an adjusted R2 of 92.02%, and all variables are significant.
MODEL 6: DROP HOME VALUE AND INCOME
• Identifying the best regression model often requires experimentation and trial and error.
• The independent variables selected should make sense in attempting to explain the dependent variable.
• Logic should guide your model development. In many applications, behavioral, economic, or
physical theory might suggest that certain variables should belong in a model.
• Additional variables increase R2 and, therefore, help to explain a larger proportion of the
variation.
• Even though a variable with a large p-value is not statistically significant, it could simply be the
result of sampling error and a modeler might wish to keep it.
• Good models are as simple as possible (the principle of parsimony).
OVERFITTING
Overfitting means fitting a model too closely to the sample data at the
risk of not fitting it well to the population in which we are interested.
In multiple regression, if we add too many terms to the model, then the
model may not adequately predict other values from the population.
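A small numeric illustration of overfitting (synthetic data, not the course dataset): adding pure-noise predictors drives the in-sample R2 up while out-of-sample fit collapses.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k_noise = 40, 15
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 2, n)            # true model uses only x

def r2(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

X_small = np.column_stack([np.ones(n), x])                         # true model
X_big = np.column_stack([X_small, rng.normal(size=(n, k_noise))])  # + noise terms

train = np.arange(n) < n // 2          # first half to fit, second half held out
for name, X in [("small", X_small), ("big", X_big)]:
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    print(f"{name:5s} train R2 = {r2(y[train], X[train] @ beta):.3f}  "
          f"test R2 = {r2(y[~train], X[~train] @ beta):.3f}")
```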
Balance = β0 + β1 Age + β3 Education + β4 Income + β5 Wealth
• Linearity
• No pattern in the residual plot (see the sketch below)
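A sketch of that residual plot in Python, reusing the fitted `result` object from the earlier (hypothetical) statsmodels example:

```python
import matplotlib.pyplot as plt

# A patternless horizontal band around zero supports the linearity assumption.
plt.scatter(result.fittedvalues, result.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()
```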
CONTINUED…
• Normality of Errors
• Residual histogram appears slightly skewed but is not a serious departure
• Data → Data Analysis → Histogram
[Figure: histogram of residuals; x-axis BIN (-3 to 3, plus More), y-axis Frequency]
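The same histogram can be produced in Python instead of Excel's Histogram tool (again reusing the hypothetical `result` object; the -3 to 3 bins assume standardized residuals, as on the slide):

```python
import numpy as np
import matplotlib.pyplot as plt

std_resid = result.resid / result.resid.std()  # standardize to match the bins
plt.hist(std_resid, bins=np.arange(-3, 4))     # bin edges at -3, -2, ..., 3
plt.xlabel("Bin")
plt.ylabel("Frequency")
plt.title("Histogram of standardized residuals")
plt.show()
```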
CONTINUED…
• Homoscedasticity
• The residual plot shows no serious difference in the spread of the data for different X values.
CONTINUED…
• Homoscedasticity
• Autocorrelation (see the Durbin-Watson sketch below)
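One common numeric check for autocorrelation is the Durbin-Watson statistic (not part of Excel's regression output); a sketch using the hypothetical `result` object:

```python
from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest no autocorrelation; values near 0 or 4 suggest
# positive or negative autocorrelation, respectively.
print("Durbin-Watson:", durbin_watson(result.resid))
```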
REGRESSION WITH CATEGORICAL VARIABLES
Data: Employee Salaries. Predict Salary using Age and MBA (coded as yes = 1, no = 0).
• The Employee Salaries dataset provides data for 35 employees.
The model is:

Salary = β0 + β1 Age + β2 MBA
RESULTS FROM DUMMY REGRESSION
• Salary = 893.5876 + 1044.1460 Age + 14767.2316 MBA
• If MBA = 0: Salary = 893.5876 + 1044.1460 Age
• If MBA = 1: Salary = 15,660.8192 + 1044.1460 Age
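A sketch of the dummy-variable regression in Python with synthetic stand-in data (hypothetical values; the coefficients will only roughly echo the slide's numbers):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 35
df = pd.DataFrame({
    "age": rng.integers(25, 60, n),
    "mba": rng.integers(0, 2, n),        # dummy: yes = 1, no = 0
})
df["salary"] = (900 + 1_000 * df["age"] + 15_000 * df["mba"]
                + rng.normal(0, 2_000, n))

X = sm.add_constant(df[["age", "mba"]])
result = sm.OLS(df["salary"], X).fit()
print(result.params)
# The MBA coefficient is a constant shift: both groups share one slope on
# age, and the MBA = 1 line sits that many dollars above the MBA = 0 line.
```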
RESULTS FROM DUMMY REGRESSION
• The coefficient of multiple determination (R2) is 0.9528. For our sample problem, this means 95.28% of the variation in salary can be explained by Age and MBA. The estimated equation fits the data quite well.
DUMMY REGRESSION ASSUMPTIONS
Assumptions???
END OF CHAPTER 6