Dr. Mahesh K C 1
Regression with Categorical Predictors Using Indicator Variables
• Categorical variables can be included in a regression model through the use of indicator variables.
• Example: Consider the Cars data set. The variables mpg, cylinders, cubicinches, hp, weightlbs and
time.to.60 are continuous, and brand is a categorical variable with three levels: US, Japan and Europe.
The variable “year” is not considered.
• For regression, a categorical variable with k categories is transformed into k – 1 indicator (dummy)
variables. An indicator variable is binary: it equals 1 when the observation belongs to the category and
0 otherwise.
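• The brand variable is transformed into two indicator (dummy) variables, which leaves Europe as the
reference level (0 on both):
$$C_1 = \begin{cases} 1 & \text{if country is Japan} \\ 0 & \text{otherwise} \end{cases} \qquad C_2 = \begin{cases} 1 & \text{if country is US} \\ 0 & \text{otherwise} \end{cases}$$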
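The indicator coding above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `encode_brand` and the sample rows are made up for the example, and the level labels are assumed to be spelled exactly "US", "Japan" and "Europe".

```python
def encode_brand(brands):
    """Return (C1, C2) indicator lists for a list of brand labels.

    C1 = 1 for Japan, C2 = 1 for US. Europe is the reference level and
    gets 0 in both columns, so k = 3 levels need only k - 1 = 2 indicators.
    """
    levels = {"US", "Japan", "Europe"}  # assumed spellings of the three levels
    c1, c2 = [], []
    for b in brands:
        label = b.strip()
        if label not in levels:
            raise ValueError(f"unexpected brand level: {label!r}")
        c1.append(1 if label == "Japan" else 0)
        c2.append(1 if label == "US" else 0)
    return c1, c2


# Made-up rows purely for illustration (not taken from the Cars data set).
brands = ["US", "Japan", "Europe", "US"]
C1, C2 = encode_brand(brands)
print(C1)
print(C2)
```

A third indicator for Europe is not added because it would always equal 1 − C1 − C2, which makes it redundant once the model has an intercept.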
Estimated Regression Equation with Categorical Predictors
• Including the indicator variables in the model produces the estimated regression equation:
$$\widehat{mpg} = b_0 + b_1(\text{cylinders}) + b_2(\text{cubicinches}) + b_3(\text{hp}) + b_4(\text{weightlbs}) + b_5(\text{time.to.60}) + b_6 C_1 + b_7 C_2$$
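To see how the indicator coefficients behave, here is a small sketch that fits a reduced version of the equation above (one continuous predictor plus C1 and C2) by ordinary least squares with NumPy. The data are made up and noise-free, generated from known coefficients so the fit can be checked. Neither the values nor the simplification to a single continuous predictor comes from the slides.

```python
import numpy as np

# Made-up, noise-free rows purely for illustration (not the Cars data).
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 2200.0])
brand = ["Europe", "Japan", "US", "Europe", "Japan", "US"]

# Indicator variables with Europe as the reference level.
c1 = np.array([1.0 if b == "Japan" else 0.0 for b in brand])
c2 = np.array([1.0 if b == "US" else 0.0 for b in brand])

# Design matrix: intercept, weight, C1, C2.
X = np.column_stack([np.ones_like(weight), weight, c1, c2])

# Response generated from assumed coefficients so the fit can be verified.
true_b = np.array([30.0, -0.005, 4.0, -2.0])
mpg = X @ true_b

# Ordinary least squares fit.
b, *_ = np.linalg.lstsq(X, mpg, rcond=None)
print(np.round(b, 4))
```

In this sketch the coefficient on C1 is the shift in predicted mpg for a Japanese car relative to a European car of the same weight, and the coefficient on C2 is the corresponding shift for a US car.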
Variable Selection Methods
• Several variable selection methods available.
• Assist analyst in determining which variables to include in model.
• Algorithms help select predictors leading to optimal model.
• Four variable selection methods:
(1) Forward Selection
(2) Backwards Elimination
(3) Stepwise Selection
(4) Best Subsets
The Partial F-Test (Theory optional)
• Suppose the model has predictors x1,…,xp and we consider adding an additional predictor x*.
• Calculate the sequential (extra) sum of squares from adding x*, given that x1,…,xp are already in the model.
• Full sum of squares SSFull: regression sum of squares with x1,…,xp, x* in the model.
• Reduced sum of squares SSReduced: regression sum of squares with x1,…,xp in the model.
• Therefore, the extra sum of squares SSExtra is given by
$$SS_{Extra} = SS(x^* \mid x_1, x_2, \ldots, x_p) = SS_{Full} - SS_{Reduced}$$
• Hypotheses for the partial F-test:
– H0: SSExtra associated with x* does not contribute significantly to the model
– Ha: SSExtra associated with x* does contribute significantly to the model
• Test statistic for the partial F-test:
$$F(x^* \mid x_1, x_2, \ldots, x_p) = \frac{SS_{Extra}}{MSE_{Full}}$$
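The partial F-statistic can be computed directly from two least-squares fits. The sketch below is an illustration with NumPy on synthetic data. The helper names `sse` and `partial_f` are invented for this example. It uses the fact that both models share the same total sum of squares, so the difference in regression sums of squares equals the difference in residual sums of squares.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def partial_f(X_reduced, x_star, y):
    """Partial F-statistic for adding the single column x_star to X_reduced."""
    X_full = np.column_stack([X_reduced, x_star])
    sse_reduced = sse(X_reduced, y)
    sse_full = sse(X_full, y)
    # SS_Extra = SS_Full - SS_Reduced (regression SS) = SSE_Reduced - SSE_Full
    ss_extra = sse_reduced - sse_full
    n, n_cols = X_full.shape          # n_cols counts the intercept too
    mse_full = sse_full / (n - n_cols)
    return ss_extra / mse_full


# Synthetic data for illustration: y depends on x1 and x2 but not on x3.
rng = np.random.default_rng(0)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
f_x2 = partial_f(np.column_stack([ones, x1]), x2, y)      # useful predictor
f_x3 = partial_f(np.column_stack([ones, x1, x2]), x3, y)  # pure-noise predictor
print(f_x2, f_x3)
```

A large value such as the one for x2 leads to rejecting H0, while a small value such as the one for x3 does not. The decision itself compares the statistic with an F distribution with 1 numerator degree of freedom and the residual degrees of freedom of the full model.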
Backwards Elimination Procedure
• Procedure begins with all variables in model.
• Step 1:
Perform regression on full model with all variables
For example, assume model has x1,…,x4
• Step 2:
For each variable in model perform partial F-test
Select variable with smallest partial F-statistic, denoted Fmin
• Step 3:
If Fmin is not significant, remove the associated variable from the model and return to Step 2
Otherwise, if Fmin is significant, stop the algorithm and report the current model
If this is the first pass, the current model is the full model
If it is not the first pass, the current model is the full model reduced by one or more variables
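The three steps above can be sketched as a short loop. This is a rough illustration under stated assumptions, not the procedure used to produce the results in these slides. The function names are invented. Significance is decided by comparing Fmin with a user-supplied critical value `f_crit` instead of a p-value. Each column is tested on its own, so a multi-level categorical predictor, whose indicators would normally be tested together, is not handled.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def backward_eliminate(X, y, names, f_crit):
    """Backward elimination on the columns of X (n x p, no intercept column).

    f_crit is an assumed critical value for the partial F-statistic; the
    exact 5% value depends on the residual degrees of freedom.
    """
    n = len(y)
    ones = np.ones((n, 1))
    keep = list(range(X.shape[1]))            # Step 1: start with all variables
    while keep:
        X_full = np.hstack([ones, X[:, keep]])
        sse_full = sse(X_full, y)
        mse_full = sse_full / (n - X_full.shape[1])
        # Step 2: partial F for each variable, given the others in the model
        f_stats = []
        for j in keep:
            others = [k for k in keep if k != j]
            X_red = np.hstack([ones, X[:, others]])
            f_stats.append((sse(X_red, y) - sse_full) / mse_full)
        i_min = int(np.argmin(f_stats))
        # Step 3: stop if F_min is significant, otherwise drop that variable
        if f_stats[i_min] >= f_crit:
            break
        del keep[i_min]
    return [names[k] for k in keep]


# Synthetic data for illustration: only x1 and x2 affect y.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4"]
selected = backward_eliminate(X, y, names, f_crit=4.0)  # 4.0: rough 5% cutoff
print(selected)
```

The cutoff of 4.0 is only an approximation to the 5% critical value of an F distribution with 1 numerator degree of freedom and a moderate number of residual degrees of freedom. A careful analysis would look up the exact value or use p-values.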
Backwards Elimination Applied to Cars Data Set
• We begin with all predictors (excluding the predictor “Year”) included in the model.
• The partial F-statistic is calculated for each predictor. The smallest F-statistic, Fmin (= 0.5132), is
associated with cubicinches. Since Fmin is not significant at the 5% level, cubicinches is dropped.
• On the second pass, cylinders is eliminated because its Fmin (= 0.4425) is not significant at the
5% level.
• On the third pass, time.to.60 is dropped because its Fmin (= 1.7229) is not significant at the
5% level.
• After the third pass, all remaining predictors are significant at the 5% level.
• The procedure terminates with Model B:
$$\text{Model B}: \quad \widehat{mpg} = b_0 + b_1(\text{hp}) + b_2(\text{weightlbs}) + b_6(\text{brand})$$
• Variable selection methods usually reduce multicollinearity, but it is still
worth checking for it in the final model.
• Using Model B, check for influential observations, outliers and high-leverage points.
• Check the regression assumptions. If they are violated, one may try transforming
the response variable, the predictors, or both.
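One common way to check for multicollinearity is the variance inflation factor (VIF), computed as 1 / (1 − R²) from regressing each predictor on the others. The slides do not name a specific diagnostic, so the choice of VIF here is an assumption, and the helper name `vif` and the synthetic data are illustrative only.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n x p, no intercept column)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / sst
        factors.append(1.0 / (1.0 - r2))
    return factors


# Synthetic data for illustration: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Values near 1 indicate little collinearity. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the threshold is a convention and not a formal test.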