Model Building
Dr. Mahesh K C 1
Multiple Linear Regression (MLR)
• Simple linear regression examines the relationship between a single predictor and the
response.
• Multiple regression models the relationship between a set of predictors and a single
continuous response.
• Using several predictors typically provides improved precision for estimation and prediction.
• The model uses a plane (two predictors) or hyperplane (more than two) to approximate
the relationship between the predictor set and the single response.
The Multiple Regression Model
• Multiple regression model:
  y = β0 + β1x1 + β2x2 + … + βpxp + ε
where β1, β2, …, βp are the model parameters whose true values remain unknown and ε
represents the error term.
• The model parameters are estimated from the data set using the method of least squares.
• The estimated regression plane: ŷ = b0 + b1x1 + b2x2 + … + bpxp
• We interpret the coefficient bi as the "estimated change in the response variable per unit
increase in xi, when all remaining predictors are held constant."
• The quantity (y – ŷ) measures the prediction error and is called the residual. The residual
equals the vertical distance between the data point and the regression plane (or
hyperplane) in multiple regression.
• Coefficient of Determination (R2): represents the proportion of variability in the response
variable accounted for by its linear relationship with the predictor set.
An Example
• Consider an MLR model to estimate miles per gallon (mpg) based on
weight (wt) and displacement (disp).
• A 3D scatter plot of mpg against wt and disp illustrates the fitted plane.
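The slides fit mpg on wt and disp from the mtcars data; as a stand-in (the data set itself is not reproduced here), the sketch below generates synthetic wt/disp/mpg columns with known coefficients and recovers b0, b1, b2 by least squares:

```python
import numpy as np

# Illustrative sketch with synthetic data standing in for mtcars-style
# columns: define mpg from known coefficients plus noise, then recover
# the estimated regression plane coefficients (b0, b1, b2) by least squares.
rng = np.random.default_rng(0)
n = 100
wt = rng.uniform(1.5, 5.5, n)      # vehicle weight (1000 lbs)
disp = rng.uniform(70, 470, n)     # engine displacement (cu. in.)
mpg = 35.0 - 3.0 * wt - 0.02 * disp + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), wt, disp])   # design matrix with intercept
b, *_ = np.linalg.lstsq(X, mpg, rcond=None)   # b = (b0, b1, b2)
print(b)   # estimates close to the true (35, -3, -0.02)
```

Plotting (wt, disp, mpg) in 3D with this fitted plane reproduces the picture described on the slide.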
Model Assumptions
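The assumptions named later in these slides (Step 5) are linearity, normality, homoscedasticity and independence of the residuals. A minimal numpy-only sketch of residual diagnostics on synthetic data:

```python
import numpy as np

# Fit an MLR on synthetic data, then inspect the residuals: their mean and
# their correlation with the fitted values are ~0 by construction of OLS,
# and their spread should look roughly constant (homoscedasticity).
rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
resid = y - fitted

print(resid.mean())                       # ~0 (exactly, up to rounding)
print(np.corrcoef(resid, fitted)[0, 1])   # ~0: no pattern against fitted values
print(np.std(resid))                      # roughly constant spread (~1 here)
```

In practice these checks are done graphically (residuals vs. fitted plot, normal Q-Q plot of the residuals).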
Coefficient of Determination (R2) and Adjusted R2
• Would we expect a higher R2 value when using two predictors rather than one?
• Yes: R2 always increases when an additional predictor is included. When the new predictor
is useful, R2 increases substantially; otherwise, R2 increases by only a small or negligible
amount.
• The largest R2 therefore tends to occur for models with the most predictors, not
necessarily the best predictors.
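Adjusted R2 corrects for this: adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), so it can fall when a useless predictor is added. A sketch on synthetic data (the noise predictor is an illustrative stand-in):

```python
import numpy as np

# R^2 never decreases when a predictor is added, but adjusted
# R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1) penalises predictors
# that do not pull their weight.
rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
noise_pred = rng.normal(size=n)            # predictor unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

def r2_and_adj(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    p = X.shape[1] - 1                     # predictors, excluding intercept
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

X1 = np.column_stack([np.ones(n), x1])                 # one predictor
X2 = np.column_stack([np.ones(n), x1, noise_pred])     # plus a useless one
r2_small, adj_small = r2_and_adj(X1, y)
r2_big, adj_big = r2_and_adj(X2, y)
print(r2_small, r2_big)   # r2_big >= r2_small always holds
print(adj_small, adj_big)
```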
Inference on Regression: The t-test and F-test
• The t-test assesses the significance of each individual predictor in the regression model;
the F-test assesses the overall significance of the model.
• Hypothesis test for predictor i: H0: βi = 0 against H1: βi ≠ 0 (i = 1, 2, 3, …, p)
• Reject the null hypothesis when the p-value < level of significance (α).
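The t-statistic behind this test is ti = bi / SE(bi), with SE taken from the diagonal of s²(XᵀX)⁻¹. A numpy-only sketch on synthetic data (in practice a regression package reports these p-values directly):

```python
import numpy as np

# Compute coefficient t-statistics by hand: t_i = b_i / SE(b_i).
# x2 is constructed to be unrelated to y (beta2 = 0 in truth), so its
# t-statistic is typically near 0 while x1's is very large.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # truly unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])      # estimate of error variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = b / se
print(t)   # (t for intercept, x1, x2); |t| > ~2 means significant at alpha ~ 0.05
```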
Multi-collinearity
• Multicollinearity is the condition where two or more predictors are correlated with one
another.
• It leads to instability in the solution space, with possibly incoherent results.
• A data set with severe multicollinearity may yield a significant F-test even though none
of the t-tests for the individual predictors is significant.
• Multicollinearity produces high variability in the coefficient estimates (b1, b2, …).
Multicollinearity Contd.
• Consider an MLR model with two predictors: ŷ = b0 + b1x1 + b2x2
• If the predictors x1 and x2 are uncorrelated (orthogonal), they form a solid basis upon
which the response surface y rests firmly, providing stable coefficients b1 and b2
(see figure A) with small variability SE(b1) and SE(b2).
• If the predictors x1 and x2 are correlated (a multicollinear situation), so that as one of them
increases, so does the other, then the predictors no longer form a solid basis on which
the response surface y rests firmly (unstable), providing highly variable coefficients b1 and b2
(see figure B) due to highly inflated values of SE(b1) and SE(b2).
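The instability contrasted in figures A and B can be reproduced numerically: fit the same model once with near-orthogonal predictors and once with highly correlated ones, and compare SE(b1). A sketch on synthetic data:

```python
import numpy as np

# SE(b1) under orthogonal predictors (figure A) vs. highly correlated
# predictors (figure B): the correlated case inflates SE(b1) many-fold.
rng = np.random.default_rng(4)
n = 200

def se_b1(x1, x2):
    y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - 3)
    return np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

x1 = rng.normal(size=n)
x2_orth = rng.normal(size=n)               # ~uncorrelated with x1
x2_corr = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1

se_orth = se_b1(x1, x2_orth)
se_corr = se_b1(x1, x2_corr)
print(se_orth, se_corr)   # the correlated case gives a much larger SE(b1)
```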
Does a method exist to identify multicollinearity in a regression model?
• The Variance Inflation Factor (VIF) measures how strongly the ith
predictor xi is correlated with the remaining predictor variables:
  VIFi = 1 / (1 − Ri2), i = 1, 2, 3, …, p
where Ri2 is the R2 from regressing xi on the other predictors.
• As a rule of thumb, VIFi > 5 indicates moderate and VIFi > 10 severe
multicollinearity, respectively.
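The VIF formula above translates directly into code: regress each predictor on the others, take Ri², and report 1/(1 − Ri²). A minimal numpy implementation, demonstrated on synthetic predictors where x3 is nearly a copy of x1:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors for the columns of X (no intercept column):
    regress each x_i on the remaining predictors and return 1/(1 - R_i^2)."""
    n, p = X.shape
    out = []
    for i in range(p):
        xi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        b, *_ = np.linalg.lstsq(others, xi, rcond=None)
        resid = xi - others @ b
        r2 = 1 - resid @ resid / ((xi - xi.mean()) @ (xi - xi.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # nearly a copy of x1
v = vif(np.column_stack([x1, x2, x3]))
print(v)   # x1 and x3 exceed 10 (severe); x2 stays near 1
```

(statsmodels provides an equivalent `variance_inflation_factor` helper.)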
Some Guidelines for model building using multiple linear regression
Step 1: Detect multicollinearity (using the VIF criterion) and, if present, eliminate it by
dropping variables. Drop one variable at a time until multicollinearity is eliminated.
Step 2: Run the regression and check for influential observations, outliers and high
leverage observations.
Step 3: If one or more influential observations/outliers/high leverage observations
are present, delete one of them, rerun the regression and go back to Step 2.
Step 4: Keep doing this until no further influential observations/outliers/high
leverage observations remain, or 10% (or 5%, case by case) of the data has been removed.
Step 5: Check the regression assumptions of linearity, normality, homoscedasticity and
independence of the residuals.
Step 6: If some of the assumptions in Step 5 are violated, try using transformations.
If no transformation can be found that corrects the violations, then STOP.
Step 7: When all the regression assumptions are met, look at the p-value of
the F-test. If it is not significant, then STOP.
Step 8: If the p-value of the F-test is significant, look at the p-values of the
individual coefficients. If some of the p-values are not significant, then
choose one of the variables with a non-significant p-value, drop it from the
model and run the regression again.
Step 9: Repeat Step 8 until the p-values of all the coefficients are significant.
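Steps 8–9 amount to backward elimination, which can be sketched as a loop. In this numpy-only sketch, |t| < 2 stands in for "p-value not significant at α ≈ 0.05" (a real analysis would use exact p-values from a regression package):

```python
import numpy as np

def backward_eliminate(cols, y, t_crit=2.0):
    """Steps 8-9 as a loop. cols: dict of predictor name -> 1-D array.
    Repeatedly fit, drop the predictor with the smallest |t| if it is
    below t_crit, and stop when all remaining predictors are significant."""
    cols = dict(cols)
    n = len(y)
    while cols:
        names = list(cols)
        X = np.column_stack([np.ones(n)] + [cols[k] for k in names])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        s2 = resid @ resid / (n - X.shape[1])
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        t = np.abs(b / se)[1:]              # skip the intercept
        worst = int(np.argmin(t))
        if t[worst] >= t_crit:
            break                           # all predictors significant: stop
        del cols[names[worst]]              # Step 8: drop weakest, refit
    return list(cols)

# Synthetic example: y depends on a and b; c is pure noise.
rng = np.random.default_rng(7)
n = 150
a, b_, c = rng.normal(size=(3, n))
y = 3.0 * a - 2.0 * b_ + rng.normal(0, 0.5, n)
keep = backward_eliminate({"a": a, "b": b_, "c": c}, y)
print(sorted(keep))
```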
Model Building: Health Care Revenue data
• These data were collected by the Department of Health and Social Services (DHSS)
of the state of New Mexico and cover 52 of the 60 licensed facilities in New
Mexico in 1998. Definitions of the variables are given below. The location
of each facility is recorded as rural or non-rural.
Variable Definition
RURAL Rural home (1), non-rural home (0)
BED Number of beds in the home
MCDAYS Annual medical in-patient days (hundreds)
TDAYS Annual total patient days (hundreds)
PCREV Annual total patient care revenue ($100)
NSAL Annual nursing salaries ($100)
FEXP Annual facilities expenditure ($100)
• DHSS is interested in predicting patient care revenue from the other facility
characteristics.
Model Building: HCR Data
• Objective: Build a model to predict patient care revenue.
• Response Variable: PCREV
• Continuous Predictors: BED, MCDAYS, TDAYS, NSAL, FEXP
• Categorical Predictor (dummy): RURAL
• Total records: 52
• Total Variables: 7 (6 predictors)
• No missing values
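The model set-up above can be sketched in code. The DHSS file itself is not reproduced on these slides, so a small synthetic frame with the same column names stands in; RURAL is already a 0/1 dummy, so it enters the design matrix directly alongside the continuous predictors:

```python
import numpy as np

# Sketch of the HCR design matrix (synthetic placeholder data, 52 records
# as in the real file). PCREV here is a generated stand-in response, used
# only to show the mechanics of fitting the 6-predictor model.
rng = np.random.default_rng(6)
n = 52
data = {
    "RURAL":  rng.integers(0, 2, n).astype(float),   # 0/1 dummy predictor
    "BED":    rng.uniform(20, 120, n),
    "MCDAYS": rng.uniform(0, 50, n),
    "TDAYS":  rng.uniform(50, 400, n),
    "NSAL":   rng.uniform(100, 900, n),
    "FEXP":   rng.uniform(50, 500, n),
}
predictors = ["BED", "MCDAYS", "TDAYS", "NSAL", "FEXP", "RURAL"]
X = np.column_stack([np.ones(n)] + [data[k] for k in predictors])
PCREV = X @ rng.normal(size=7) + rng.normal(0, 1, n)   # placeholder response
b, *_ = np.linalg.lstsq(X, PCREV, rcond=None)
print(dict(zip(["const"] + predictors, np.round(b, 3))))
```

On the real data this fit would then be taken through the guideline steps above (VIF check, influential-observation check, assumption checks, F-test and t-tests).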
Summary Results of Model 9