
Session 10: Multiple Linear Regression - Model Building

Dr. Mahesh K C
Multiple Linear Regression (MLR)
• Simple linear regression examines the relationship between a single predictor and a response.
• Multiple linear regression models the relationship between a set of predictors and a single continuous response.
• It typically provides improved precision for estimation and prediction.
• The model uses a plane (or hyperplane) to approximate the relationship between the predictor set and the response.

• Predictors are typically continuous.
• Categorical predictors can be included through the use of indicator (dummy) variables (a short sketch follows below).
• Here, the plane/hyperplane represents a linear surface in p dimensions.
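A minimal R sketch (with made-up illustrative data, not from the slides) of how a categorical predictor enters the model through a dummy variable, which lm() creates automatically from a factor:

```r
# Illustrative data: a hypothetical response with one continuous and one
# categorical predictor.
df <- data.frame(
  revenue = c(120, 150, 90, 200, 110, 170),
  beds    = c(40, 60, 30, 80, 35, 65),
  rural   = factor(c("yes", "no", "yes", "no", "yes", "no"))
)

fit <- lm(revenue ~ beds + rural, data = df)
summary(fit)       # the 'ruralyes' row is the dummy (indicator) coefficient
model.matrix(fit)  # shows the 0/1 indicator column lm() generated
```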

The Multiple Regression Model
• Multiple Regression Model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$

where β0, β1, β2, …, βp are model parameters whose true values are unknown and ε represents the error term.
• Model parameters are estimated from the data set using the method of least squares.
• The estimated regression plane: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$

• We interpret the coefficient bi as the "estimated change in the response variable for a unit increase in xi, with all remaining predictors held constant."
• The quantity (y − ŷ) measures the error in prediction and is called the residual. In multiple regression, the residual equals the vertical distance between the data point and the regression plane (or hyperplane).
• Coefficient of Determination (R²): represents the proportion of variability in the response variable accounted for by its linear relationship with the predictor set.
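As a sketch of how this looks in R (the data frame `mydata` and the predictor names are illustrative, not from the slides):

```r
# Fit a multiple linear regression by least squares.
fit <- lm(y ~ x1 + x2 + x3, data = mydata)

coef(fit)               # b0, b1, b2, b3: estimated intercept and slopes
head(residuals(fit))    # residuals y - y-hat, one per record
summary(fit)$r.squared  # coefficient of determination R^2
```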

An Example
• Consider an MLR model to estimate miles per gallon (mpg) from weight (wt) and engine displacement (disp).
• [Figure: 3D scatter plot of mpg against wt and disp]
• The estimated regression equation: $\widehat{mpg} = b_0 + b_1\,wt + b_2\,disp$
• [Figure: the fitted plane approximating the data in 3D]
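The variables match R's built-in mtcars data, so the example can be reproduced directly; the 3D plot below assumes the scatterplot3d package, which the slides do not name:

```r
# Fit mpg on weight and displacement (mtcars ships with R).
fit <- lm(mpg ~ wt + disp, data = mtcars)
coef(fit)  # roughly b0 = 34.96, b1 = -3.35, b2 = -0.018

# 3D scatter plot with the fitted plane overlaid.
# install.packages("scatterplot3d")  # if needed
library(scatterplot3d)
s3d <- scatterplot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
                     xlab = "wt", ylab = "disp", zlab = "mpg")
s3d$plane3d(fit)
```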

Model Assumptions

• Zero-Mean Assumption: The error term ε is a random variable with mean zero, i.e., E(ε) = 0.
• Constant-Variance Assumption: The variance of ε is constant regardless of the values of x1, x2, …, xp. This assumption is also known as homoscedasticity.
• Independence Assumption: The values of ε are independent.
• Normality Assumption: The error term ε is a normally distributed random variable.
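These assumptions are usually checked graphically from the residuals; a minimal sketch for any fitted lm object:

```r
# fit is any lm object.
plot(fit, which = 1)  # residuals vs fitted: zero mean and constant variance
plot(fit, which = 2)  # normal Q-Q plot of residuals: normality
# Independence is usually judged from the study design; for time-ordered
# data, residuals plotted in observation order can reveal dependence.
plot(residuals(fit), type = "b")
```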

Coefficient of Determination (R2) and Adjusted R2
• Would we expect a higher R² value when using two predictors rather than one?
• Yes: R² never decreases when an additional predictor is included. When the new predictor is useful, R² increases substantially; otherwise it may increase by only a small or negligible amount.
• The largest R² may therefore occur for the model with the most predictors rather than the best predictors.
• The adjusted R² measure "adjusts" R² by penalizing models that include non-useful predictors.
• If R²adj is noticeably smaller than R², this indicates that at least one predictor in the model may be extraneous and should be considered for omission.
• Models should be evaluated based on R²adj rather than R².
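For reference, the standard adjustment (not spelled out on the slide) is $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where n is the number of records and p the number of predictors. Both quantities are reported by R's model summary:

```r
s <- summary(fit)  # fit is any lm object
s$r.squared        # R^2
s$adj.r.squared    # adjusted R^2

# The adjustment computed by hand, for comparison:
n <- length(residuals(fit))
p <- length(coef(fit)) - 1
1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)  # equals s$adj.r.squared
```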

Inference on Regression: The t-test and F-test
• The t-test is used to test the significance of the individual predictors in the regression model.
• Hypotheses for each predictor: H0: βi = 0 against H1: βi ≠ 0 (i = 1, 2, 3, …, p).
• Reject the null hypothesis when the p-value < level of significance (α).

• The F-test is used to assess the overall significance of the regression model.
• Hypotheses for the F-test: H0: β1 = β2 = … = βp = 0 against H1: at least one βi ≠ 0.
• The ANOVA table:

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---------------------|----------------|--------------------|-------------|---|
| Regression | SSR | p | MSR = SSR / p | F = MSR / MSE |
| Error (Residual) | SSE | n - p - 1 | MSE = SSE / (n - p - 1) | |
| Total | SST = SSR + SSE | n - 1 | | |

Reject H0 if the p-value < α.
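Both tests are part of R's standard regression output; a sketch for any lm fit:

```r
s <- summary(fit)
s$coefficients  # t statistic and p-value for each individual beta_i
s$fstatistic    # overall F statistic with numerator and denominator df

# p-value of the overall F-test:
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)

anova(fit)  # sequential sums of squares underlying the ANOVA table
```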

Multi-collinearity
• Multicollinearity is a condition in which two or more predictors are correlated.
• It leads to instability in the solution space, with possibly incoherent results.
• A data set with severe multicollinearity may have a significant F-test while none of the t-tests for the individual predictors is significant.
• Multicollinearity produces high variability in the coefficient estimates (b1, b2, …).
• Highly correlated variables tend to overemphasize a particular component of the regression model, since that component is essentially counted more than once.
• The multicollinearity issue can be identified by examining the correlation structure among the predictors.
• One may use a matrix (scatterplot) plot to inspect this correlation structure, as sketched below.
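A sketch of both checks in R (the data frame and column names are illustrative):

```r
preds <- mydata[, c("x1", "x2", "x3")]  # the predictor columns
pairs(preds)  # matrix (scatterplot) plot of the predictor set
cor(preds)    # pairwise correlation matrix
```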

Multicollinearity Cont'd
• Consider an MLR model with two predictors: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$
• If the predictors x1 and x2 are uncorrelated (orthogonal), they form a solid basis upon which the response surface rests firmly, providing stable coefficients b1 and b2 (see figure A) with small variability SE(b1) and SE(b2).
• If x1 and x2 are correlated (the multicollinear situation), so that as one of them increases, so does the other, the predictors no longer form a solid basis on which the response surface rests firmly. The fit is unstable, providing highly variable coefficients b1 and b2 (see figure B) due to highly inflated values of SE(b1) and SE(b2). The simulation sketch below illustrates this effect.
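A small simulation (not from the slides) makes the inflation visible by comparing the two cases:

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)

x2_a <- rnorm(n)                 # case A: x2 uncorrelated with x1
x2_b <- x1 + rnorm(n, sd = 0.1)  # case B: x2 strongly correlated with x1

y_a <- 2 + x1 + x2_a + rnorm(n)
y_b <- 2 + x1 + x2_b + rnorm(n)

summary(lm(y_a ~ x1 + x2_a))$coefficients  # small SE(b1), SE(b2)
summary(lm(y_b ~ x1 + x2_b))$coefficients  # greatly inflated SE(b1), SE(b2)
```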

Does a method exist to identify multicollinearity in a regression model?
• The Variance Inflation Factor (VIF) measures how strongly the ith predictor xi is related to the remaining predictors:

$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}, \quad i = 1, 2, 3, \ldots, p$

where Ri² is the R² obtained by regressing xi on the remaining predictors.
• When xi is completely uncorrelated with the remaining predictors, Ri² = 0, which gives the minimum value VIFi = 1. Conversely, VIFi increases without bound as Ri² approaches 1.
• A large VIFi produces an inflated standard error for the corresponding estimate, degrading the precision of that estimate.
• As a rule of thumb, VIFi > 5 indicates moderate and VIFi > 10 severe multicollinearity.
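A sketch of computing VIFs in R; the slides do not name a package, and car::vif() is one common choice:

```r
library(car)  # assumption: the car package is installed
vif(fit)      # one VIF per predictor in the fitted model

# Manual version for a single predictor x1 (names illustrative):
# regress x1 on the remaining predictors and apply the formula.
r2_1 <- summary(lm(x1 ~ x2 + x3, data = mydata))$r.squared
1 / (1 - r2_1)  # VIF for x1
```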

Some Guidelines for model building using multiple linear regression
Step 1: Detect (using the VIF criterion) and eliminate multicollinearity (if present) by dropping variables. Drop one variable at a time until multicollinearity is eliminated.
Step 2: Run the regression and check for influential observations, outliers and high-leverage observations.
Step 3: If one or more influential observations/outliers/high-leverage observations are present, delete one of them, rerun the regression and go back to Step 2.
Step 4: Keep doing this until there are no further influential observations/outliers/high-leverage observations, or until 10% (or 5%, case by case) of the data has been removed.
Step 5: Check the regression assumptions of linearity, normality, homoscedasticity and independence of the residuals.
Step 6: If some of the assumptions in Step 5 are violated, try using transformations. If no transformation can be found that corrects the violations, then STOP.

Step 7: When all the regression assumptions are met, look at the p-value of the F-test. If it is not significant, then STOP.
Step 8: If the p-value of the F-test is significant, look at the p-values of the individual coefficients. If some of the p-values are not significant, choose one of the variables with a non-significant p-value, drop it from the model and run the regression again.
Step 9: Repeat Step 8 until the p-values of all the coefficients are significant. (A code sketch of these checks follows below.)
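A condensed sketch of these checks with standard lm() helper functions; `mydata` is illustrative, and the thresholds follow the guidelines above:

```r
fit <- lm(y ~ ., data = mydata)

library(car)
vif(fit)                                 # Step 1: drop variables with VIF > 5

which(cooks.distance(fit) > 1)           # Steps 2-4: influential observations
which(abs(rstandard(fit)) > 2)           # Steps 2-4: outliers
p <- length(coef(fit)) - 1
n <- nrow(mydata)
which(hatvalues(fit) > 2 * (p + 1) / n)  # Steps 2-4: high-leverage points

plot(fit, which = 1:2)  # Step 5: residuals-vs-fitted and Q-Q plots
summary(fit)            # Steps 7-9: F-test p-value and individual t-tests
```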

Model Building: Health Care Revenue data
• These data were collected by the Department of Health and Social Services (DHSS) of the state of New Mexico and cover 52 of the 60 licensed facilities in New Mexico in 1998. Definitions of the variables are given below. The RURAL variable indicates whether the facility is located in a rural or non-rural area.
| Variable | Definition |
|----------|------------|
| RURAL | Rural home (1), non-rural home (0) |
| BED | Number of beds in home |
| MCDAYS | Annual medical in-patient days (hundreds) |
| TDAYS | Annual total patient days (hundreds) |
| PCREV | Annual total patient care revenue ($100) |
| NSAL | Annual nursing salaries ($100) |
| FEXP | Annual facilities expenditure ($100) |

• DHSS is interested in predicting patient care revenue from the other facility characteristics.
Model Building: HCR Data
• Objective: Build a model to predict patient care revenue.
• Response Variable: PCREV
• Continuous Predictors: BED, MCDAYS, TDAYS, NSAL, FEXP
• Categorical Predictor (dummy): RURAL
• Total records: 52
• Total Variables: 7 (6 predictors)
• No missing values

• Step 1: Check multicollinearity. Dropped the variable TDAYS, whose VIF = 8.47 > 5. Five predictors remain: BED, MCDAYS, NSAL, FEXP and RURAL. A repeated check showed the resulting model (lm2) to be free of multicollinearity.
• Step 2a: Check for influential observations. Since no observation has a Cook's distance greater than 1, there are no influential observations. The largest Cook's distance is 0.77.
Model Building Cont’d
• Step 2b: Check for outliers. The following standardized residuals exceed 2: 3.74, 3.55, 2.76 and 2.94. These records were deleted one by one, updating the data set each time. The updated data set at this stage (HRev5) has 48 records on 5 predictors, with a largest standardized residual of 1.94.
• Step 2c: Check for leverage values. One leverage value of 0.66 exceeds the threshold 2(p+1)/n = 12/48 = 0.25. After removing the corresponding record, the updated data set (HRev6, 47 records) was tested again, and the leverage problem still persists.
• Since we have already deleted 5 records (about 10% of the 52 records), we stop the deletions at this stage and proceed to Step 3. The data set (HRev6) now has 47 records and 5 predictors.
• Step 3: Check the regression assumptions. The standardized-residuals-versus-fitted plot and the Q-Q plot show that the assumptions are roughly met, so we proceed to Step 4.
• Step 4a: We checked the significance of the variables and found that the predictor RURAL is not significant at the 5% level. We removed it and updated the data (HRev7).
• Step 4b: A further check showed that FEXP is not significant at 5%. We removed it and updated the data (HRev8).
• Step 4c: All remaining predictors are now significant, and the overall model is also significant. (A hedged code sketch of these steps follows below.)
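A hedged sketch of these steps in R. `HRev` stands for the DHSS data frame (not distributed with the slides), and the intermediate object names (lm1, lm2, HRev8, lm9) mirror the slide narrative rather than actual code:

```r
lm1 <- lm(PCREV ~ BED + MCDAYS + TDAYS + NSAL + FEXP + RURAL, data = HRev)
car::vif(lm1)                     # Step 1: TDAYS has VIF = 8.47 > 5, drop it
lm2 <- update(lm1, . ~ . - TDAYS)

max(cooks.distance(lm2))          # Step 2a: 0.77 < 1, no influential points
which(abs(rstandard(lm2)) > 2)    # Step 2b: outliers, removed one at a time
# ... after the Step 2 deletions and dropping RURAL and FEXP (Steps 4a-4b):
lm9 <- lm(PCREV ~ BED + MCDAYS + NSAL, data = HRev8)
summary(lm9)                      # Step 4c: all predictors significant
```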
Summary Results of Model 9

| Variables | Estimates | Std. error | Pr(>\|t\|) |
|-----------|-----------|------------|------------|
| Intercept | -2056.49 | 833.18 | |
| BED | 82.92 | 16.36 | 8.12e-06 |
| MCDAYS | 15.67 | 4.46 | 0.00106 |
| NSAL | 1.44 | 0.28 | 7.87e-06 |

Residual standard error: 1817 on 43 degrees of freedom. Multiple R-squared: 0.9045, Adjusted R-squared: 0.8978. F-statistic: 135.7 on 3 and 43 DF, p-value: < 2.2e-16.

Final MLR model:

PCREV = -2056.49 + 82.92·BED + 15.67·MCDAYS + 1.44·NSAL
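A short usage sketch, assuming the final fit is stored as lm9 and using made-up predictor values (all revenue figures are in $100 units):

```r
new_facility <- data.frame(BED = 60, MCDAYS = 50, NSAL = 3000)
predict(lm9, newdata = new_facility)
# By hand: -2056.49 + 82.92*60 + 15.67*50 + 1.44*3000 = 8022.21 ($100s)
```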

