
Model Selection

FEATURE ENGINEERING

What data scientists spend the most time doing

 In an overwhelming number of cases, the raw data obtained in any business domain is crude, unorganised and irrelevant. Also, data may be missing, or expressed in different scales with different units, and is not suitable for direct analysis.
 Hence, data needs to be cleaned and organised.

[Pie chart: Cleaning and organising data: 60%; Collecting data sets: 19%; Mining data for patterns: 9%; Others: 5%; Refining algorithms: 4%; Building training sets: 3%]
FEATURE ENGINEERING

What data scientists spend the most time doing

 Feature engineering refers to the steps followed to convert raw data into data that can be used for model building and analysis.
 It is essential because:
● The input features should be formatted properly for an algorithm to work (algorithm compatibility) and
● It can improve the model performance.

[Pie chart repeated from the previous slide]
FEATURE ENGINEERING

Feature Engineering
● Numeric Features: Scaling, Removing Outliers, Imputation, Binning, Transformation
● Categorical Features: One Hot Encoding, Label Encoding, Mean Encoding

 Here, we will discuss a few popular feature engineering techniques for numeric and categorical features.
NUMERIC FEATURES – SCALING

 Most often, features do not have similar ranges or units of measurements.


 Suppose there are two features, Age (in years) and Income (in ₹ lakhs), which may affect expenditure.
 If we fit a linear regression, then Income would appear to have a greater impact on Expenditure than Age, since the scale of measurement for Income (10,00,000 to 25,00,000) is much larger than that for Age (20-50). Their units are also different.
 But that does not necessarily mean Income is more significant than Age. The algorithm does not know that.
 To avoid this issue, all the features are brought to the same scale or level.
 This can be achieved in many ways. Two popular approaches include:
● Min-max normalisation and
● Z-score normalisation
 Nevertheless, if the exclusive raw influence of a single feature on the response is of interest, scaling may be avoided. When prediction through model building is the sole purpose, scaling is recommended.
NUMERIC FEATURES – SCALING

 Min-max normalisation:
$m_i = \dfrac{x_i - x_{min}}{x_{max} - x_{min}}, \quad i = 1, \ldots, n$
● The above transformation ensures that all observations lie in the range 0-1.
 Z-score normalisation:
$Z_i = \dfrac{x_i - \mu}{\sigma}, \quad i = 1, \ldots, n$
● In the absence of $\mu$ and $\sigma$, we may use the sample mean and standard deviation:
$Z_i = \dfrac{x_i - \bar{x}}{s}, \quad i = 1, \ldots, n$
● The above transformation makes observations comparable, since they are unit-free and scaled against two important parameters of the distribution.
● If the data distribution is normal, the $Z_i$ values will lie within (-3, 3) about 99.7% of the time.
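
Below is a minimal R sketch of both approaches on an assumed income vector (the values are made up for illustration):

# Assumed illustrative data: annual income in rupees
income <- c(1000000, 1400000, 1800000, 2200000, 2500000)

# Min-max normalisation: maps every value into the range 0-1
min_max <- (income - min(income)) / (max(income) - min(income))

# Z-score normalisation using the sample mean and standard deviation
z_score <- (income - mean(income)) / sd(income)
# The built-in scale() performs the same centring and scaling in one step
z_score_builtin <- as.numeric(scale(income))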
NUMERIC FEATURES – OUTLIER HANDLING

 First of all, outliers should not be removed without proper inspection.


 Outliers do indicate a certain change or anomaly in the process, and they impact model
performance. For example, in linear regression, outliers can drastically change the model
parameters, resulting in nonsensical predictions.
 The outliers themselves can also be separated out and analysed on their own.
 Nevertheless, outlier detection could be a challenge.
 We could apply two statistical methods:
● Percentile-based method and
● Standard deviation-based method
 Outliers can be removed or capped, or analysed separately.
NUMERIC FEATURES – OUTLIER HANDLING

 Visual inspection with a histogram or a scatter plot may indicate outliers.
 We could apply two statistical methods:
● Percentile-based method:
● Observations lying above the 97.5th percentile or below the 2.5th percentile could be treated as outliers
● A box-and-whisker plot could be applied
● Standard-deviation-based method:
● Observations lying beyond 3 to 6 standard deviations from the mean can be treated as outliers

[Box plot of the feature X, with a few points lying well above the upper whisker]
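
A rough R sketch of both detection rules on an assumed numeric vector x, with the cut-offs used above (2.5th/97.5th percentiles, 3 standard deviations):

set.seed(1)
x <- c(rnorm(200, mean = 500, sd = 100), 3000)   # assumed data with one extreme value

# Percentile-based method: flag values outside the 2.5th-97.5th percentile band
cuts <- quantile(x, probs = c(0.025, 0.975))
is_outlier <- x < cuts[1] | x > cuts[2]

x_removed <- x[!is_outlier]                      # option 1: remove
x_capped  <- pmin(pmax(x, cuts[1]), cuts[2])     # option 2: cap at the cut-offs

# Standard-deviation-based method: flag values more than 3 SDs from the mean
is_outlier_sd <- abs(x - mean(x)) > 3 * sd(x)

boxplot(x, main = "Box plot of x")               # visual check, as in the slide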
NUMERIC FEATURES – IMPUTATION

 There are many ways to handle missing data. Statistical theory plays a big role in the imputation
of missing data.
 Missing data gives rise to serious problems with the estimation of model parameters since the
data is incomplete. It may produce biased estimates.
 We will discuss some common approaches for handling missing data. These include the following:
● Dropping observations or features
● Imputing by the average or the median value of the non-missing observations
● Using regression to impute
● Using a distribution structure to impute (multiple imputation)
NUMERIC FEATURES – IMPUTATION
[Sample data table with features X3-X6; blank cells indicate missing values. X6 has more than half of its values missing, and one highlighted row has every value missing.]

 Dropping observations or features:
● We may drop the feature X6 from the analysis, as this column has many missing observations (>50%)
● We may drop the highlighted observation (row), since all of its values are missing
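
A short R sketch of these two dropping rules on a small assumed data frame (NA marks a missing value):

# Assumed illustrative data frame with missing values
df <- data.frame(X3 = c(94, 55, NA, 62, 51),
                 X4 = c(26, 90, NA, 68, NA),
                 X5 = c( 6,  9, NA, 56, 90),
                 X6 = c(NA, 83, NA, NA, NA))

# Drop any feature with more than 50% of its values missing (X6 here)
na_share <- colMeans(is.na(df))
df <- df[, na_share <= 0.5, drop = FALSE]

# Drop any observation (row) in which every remaining value is missing
all_missing <- rowSums(!is.na(df)) == 0
df <- df[!all_missing, , drop = FALSE]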
NUMERIC FEATURES – IMPUTATION
[Sample data table with features X3-X6; the highlighted missing value in X5 is filled in with the median of the observed X5 values.]

 Average or median value imputation:
● A missing value may be imputed with the average or the median of the rest of the observations corresponding to that feature
● Here, the missing value in X5 is imputed using the median of the remaining values of X5
● Imputing with the median is slightly safer, in the sense that the median is less sensitive to outliers
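
A minimal R sketch of median imputation, assuming X5 is a numeric vector with a missing entry:

x5 <- c(6, 9, 95, 56, 90, 78, 37, 7, 84, NA, 73)    # assumed values with one NA

# Replace every missing value with the median of the observed values
x5[is.na(x5)] <- median(x5, na.rm = TRUE)

# Mean imputation works the same way:
# x5[is.na(x5)] <- mean(x5, na.rm = TRUE)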
NUMERIC FEATURES – IMPUTATION
[Sample data table with response Y and features X3, X4 and X5; some X4 values are missing.]

 Regression-based imputation:
● Fit a linear regression model between Y and X4, ignoring the rows with missing values completely
● Then use the fitted line to interpolate or predict the missing values of X4
● The fitted line is Y = 653.055 + 3.1529 X4
● Substitute the values of Y for which the X4 values are missing into the equation
● Impute the missing values
● Non-linear regression can also be applied
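
A sketch of these steps in R on a small assumed data frame (its fitted coefficients will differ from the slide's 653.055 and 3.1529):

# Assumed illustrative data: Y observed everywhere, X4 missing in some rows
dat <- data.frame(Y  = c(694.85, 1008.2, 1192.2, 931.8, 330.3, 945.8),
                  X4 = c(26, 90, 97, 68, NA, NA))

# Fit Y ~ X4 on the complete rows only (lm() drops rows with NA by default)
fit <- lm(Y ~ X4, data = dat)
b   <- coef(fit)                        # intercept and slope of the fitted line

# Invert the fitted line, Y = b0 + b1 * X4, to fill in the missing X4 values
miss <- is.na(dat$X4)
dat$X4[miss] <- (dat$Y[miss] - b["(Intercept)"]) / b["X4"]

# Equivalently, one could fit X4 ~ Y and impute with predict()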
NUMERIC FEATURES – IMPUTATION
[Sample data table with response Y and features X3, X4 and X5; some X4 values are missing.]

 Distribution-based imputation:
● Plot a histogram of X4 using the available values
● Depending upon the shape of the distribution, fit some known distribution, such as a normal or an exponential distribution
● Estimate the parameters of the distribution
● Generate random numbers from this distribution, say, 1,000 of them
● Take the average of these numbers and put it in the place of a single missing value
● Repeat the above steps for the other missing points
● In the absence of a known distribution, bootstrapping can be applied to impute the value
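
A rough R sketch of this procedure, assuming the observed X4 values look approximately normal (the data below is made up for illustration):

set.seed(42)
x4 <- c(26, 90, 97, 68, 29, 46, 18, 94, 52, 80, 2, 6, NA, NA)   # assumed values

obs   <- x4[!is.na(x4)]
mu    <- mean(obs)        # estimate the parameters of the assumed normal
sigma <- sd(obs)          # distribution from the observed values

# For each missing entry, draw 1,000 values from the fitted distribution
# and impute with their average
for (i in which(is.na(x4))) {
  x4[i] <- mean(rnorm(1000, mean = mu, sd = sigma))
}

# With no obvious parametric shape, bootstrap from the observed values instead:
# x4[i] <- mean(sample(obs, 1000, replace = TRUE))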
NUMERIC FEATURES – BINNING

 Binning refers to creating categories or groups using numerical values.
 For example, if we have Income values between ₹10,00,000 and ₹25,00,000, we can bin them into groups such as <10,00,000, 10,00,000-15,00,000, 15,00,000-20,00,000 and >20,00,000.
 The groups may have equal or unequal widths.
 Through binning, we get rid of small observational fluctuations or errors.
 We consider every observation in a bin to have similar characteristics.
 By doing so, the model is somewhat protected from overfitting.
 However, since we lose some of the detailed information, performance may suffer.
 When data has outliers, binning can help by grouping all the observations beyond a certain threshold into one bin.
 Histograms use binning.

[Histogram with bins <8, 8-13, 13-18, 18-23, 23-28 and >28, showing a symmetric distribution]
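
A short R sketch of the income binning described above, using cut() with assumed break points:

income <- c(1050000, 1230000, 1480000, 1725000, 1980000, 2320000, 2480000)  # assumed values in rupees

income_bin <- cut(income,
                  breaks = c(-Inf, 1000000, 1500000, 2000000, Inf),
                  labels = c("<10L", "10L-15L", "15L-20L", ">20L"))

table(income_bin)   # counts per bin; the bins need not have equal widths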
NUMERIC FEATURES – FEATURE TRANSFORMATION

 In many scenarios, taking a suitable transformation helps.
 Consider a feature that shows the skewed distribution on the right.
 Most observations are clustered towards the smaller values.
 This could have an impact on the model to be fitted, since more weightage would effectively be given to the lower values than to the higher ones.
 Also, an increase of one unit in the lower values may not have the same impact on the response as an increase of one unit in the higher values would have.
 To eliminate this bias, a suitable transformation, such as a logarithmic transformation, may be considered.

[Histogram (frequency vs. value, roughly 0.0-0.6) of a heavily right-skewed feature]
NUMERIC FEATURES – FEATURE TRANSFORMATION

 By taking a log transformation, the skewness is reduced greatly.
 Sometimes the values may contain 0; since log(0) is undefined, log(1 + x) is used instead.
 Whenever data is skewed, a log transformation, a square-root transformation or some other power transformation could help.
 A power transformation means:
$X^{*} = X^{k}$
 where k can be any real number except 0.
 Taking a suitable transformation tends to adjust for the effect of skewness on the model parameters.

[Histogram of the log-transformed feature (values roughly between -12 and 0), now far less skewed]
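
A minimal R sketch on assumed right-skewed data, showing the log(1 + x) and square-root options mentioned above:

set.seed(7)
x <- rexp(5000, rate = 10)    # assumed skewed feature, clustered near small values

x_log  <- log1p(x)            # log(1 + x), safe even when x contains exact zeros
x_sqrt <- sqrt(x)             # a power transformation with k = 1/2

hist(x,     main = "Original (skewed)")
hist(x_log, main = "After log(1 + x)")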
CATEGORICAL FEATURES – ONE HOT ENCODING

 It is essential to convert categorical variables into numeric variables, since most machine learning algorithms work with numbers.
 One-hot encoding is the most popular way to handle categorical variables.
 Suppose X is a categorical variable with only the categories 'Delhi', 'Mumbai', 'Chennai' and 'Kolkata'.
 We create three variables (called dummy variables), X1, X2 and X3, from X:
● X1 = 1 if Delhi; else, X1 = 0
● X2 = 1 if Mumbai; else, X2 = 0
● X3 = 1 if Chennai; else, X3 = 0

X         X1   X2   X3
Delhi      1    0    0
Mumbai     0    1    0
Chennai    0    0    1
Kolkata    0    0    0
Delhi      1    0    0
Kolkata    0    0    0
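
A brief R sketch that reproduces the dummy table above with model.matrix(); the city values come from the table, while the column names and level order are assumed conventions:

X <- factor(c("Delhi", "Mumbai", "Chennai", "Kolkata", "Delhi", "Kolkata"),
            levels = c("Kolkata", "Delhi", "Mumbai", "Chennai"))  # Kolkata first = base category

dummies <- model.matrix(~ X)[, -1]        # build dummy columns, drop the intercept
colnames(dummies) <- c("X1", "X2", "X3")  # X1 = Delhi, X2 = Mumbai, X3 = Chennai
dummies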
CATEGORICAL FEATURES – ONE HOT ENCODING

 X1 = 0, X2 = 0 and X3 = 0 means we are talking about the last category, that is, 'Kolkata'.
 Creating dummy variables like this works in two ways:
● The individual, exclusive effect of each category can be studied
● It gets rid of the problem of linear dependence among the variables
 Recall that in linear regression, the output has N-1 rows corresponding to the categories of a categorical variable with N categories.
 The missing category is the base category, which is essentially X1 = 0, X2 = 0 and X3 = 0 in this case.
 The coefficients are interpreted relative to the base category.
 This type of encoding works well with nominal data.

(The dummy-variable table shown here is the same as on the previous slide.)
CATEGORICAL FEATURES – LABEL ENCODING

 For ordinal data, where there is an ordered arrangement of the categories, we can use label encoding.
 For customer ratings, the label encoding shown on the right may be used.
 Under this encoding, the difference between Bad and Poor is the same as the difference between Poor and Fair.
 So, the increments are assumed to be linear, which may not actually be the case.
 However, this may still work well with ordinal data.

X           X1
Bad          0
Poor         1
Fair         2
Good         3
Great        4
Excellent    5
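
A small R sketch of this encoding using an ordered factor; the category names follow the table, while the sample ratings are assumed:

ratings <- c("Good", "Bad", "Excellent", "Fair", "Poor", "Great", "Good")   # assumed sample

lev <- c("Bad", "Poor", "Fair", "Good", "Great", "Excellent")
X1  <- as.integer(factor(ratings, levels = lev, ordered = TRUE)) - 1

X1   # Bad = 0, Poor = 1, ..., Excellent = 5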
CATEGORICAL FEATURES – MEAN ENCODING

 This encoding technique is guided by the response variable.
 Suppose, for a classification problem, Y is the binary response taking the values 0 or 1.
 Delhi occurs thrice in the data, with the Y values 0, 0 and 1.
 We take the average of 0, 0 and 1 (the proportion of 1s), which is 0.33.
 Delhi will be coded as 0.33.
 Similarly, Mumbai occurs twice, with the Y values 1 and 0.
 Mumbai will be coded as the average of 1 and 0, which is 0.5.
 This helps with capturing the relationship between X and Y as well as the intra-category structure. Nevertheless, it runs the risk of overfitting the model, since a part of the response information is already being captured by the encoding.

X          Y    Code
Delhi      0    0.33
Mumbai     1    0.5
Chennai    0    0
Kolkata    0    0.66
Delhi      0    0.33
Kolkata    1    0.66
Delhi      1    0.33
Mumbai     0    0.5
Kolkata    1    0.66
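
A compact R sketch of mean encoding with ave(), where X and Y mirror the table above:

X <- c("Delhi", "Mumbai", "Chennai", "Kolkata", "Delhi",
       "Kolkata", "Delhi", "Mumbai", "Kolkata")
Y <- c(0, 1, 0, 0, 0, 1, 1, 0, 1)

# Replace each category by the mean of Y (the proportion of 1s) within that category
code <- ave(Y, X, FUN = mean)
round(code, 2)   # Delhi -> 0.33, Mumbai -> 0.5, Chennai -> 0, Kolkata -> 0.67

# In practice, the category means are computed on the training data only,
# to limit the overfitting risk mentioned above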
SUMMARY

⚫ Scaling
⚫ Outlier Handling
⚫ Imputation
⚫ Binning
⚫ Feature Transformation
⚫ Encoding
FEATURE SELECTION
Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
45.72 92 38 94 26 6 23 53 2 58 98 2 61
91.18 53 65 55 90 9 83 27 11 0 97 52 88
84.94 66 44 93 97 95 74 34 93 84 36 27 85
86.37 38 84 62 68 56 68 84 24 32 48 41 80
44.84 63 59 51 80 90 3 10 37 96 18 11 24
53.95 81 86 85 29 78 0 11 90 54 72 41 35
49.26 53 58 100 2 37 22 25 31 70 67 34 73
70.31 86 35 54 6 7 70 41 92 69 92 52 97
52.65 68 57 16 46 84 19 84 4 41 17 67 63
49.09 36 51 30 29 87 17 91 98 60 29 42 82
93.07 87 83 33 18 73 96 36 83 85 2 35 48
101.84 54 91 98 94 9 100 57 40 54 47 22 65
87.61 82 79 74 52 60 80 71 23 23 25 42 42

Feature Set
 Feature selection refers to selecting significant features from the full set of
features and getting rid of redundant features
 Irrelevant features could impact predictions and performance, and distort the relationship between X and Y
 Feature selection sometimes helps with managing overfitting
 It keeps the model relatively simple
 It enables the algorithm to train faster by reducing the model complexity
FEATURE SELECTION

Feature Selection
● Filter Methods: Pairwise Correlation, Hypothesis Testing, Information Gain, Variance Threshold
● Wrapper Methods: Forward Selection, Backward Elimination, Stepwise Method
● Embedded Methods: Cross Validation, Random Forest

 Let us discuss each of them briefly
 This is not an exhaustive list but a list of some popular ones
FILTER METHODS

⚫ Filter methods are based on individual features or the relationship between the features and the response

⚫ Generally, a criterion to filter on is set, e.g., correlation, variance, p-value, etc.

⚫ If a feature does not satisfy the criterion, then it is eliminated from the model

⚫ These methods are independent of the machine learning algorithm that we employ and are based simply on some statistical or numeric measures

⚫ These methods are simple and easy to implement


FILTER METHOD – CORRELATION
 Check the pairwise correlations between the variables
 The correlations of Y with X5, X6 and X12 appear significant
 These three features, X5, X6 and X12, could be retained for model building
 X1, X8 and X11 have very low correlation values with Y. These features can be eliminated
 The correlation between X10 and X5 also appears significant. There could be some multicollinearity
 You may additionally perform tests of hypothesis on the correlations
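
A rough R sketch of such a correlation filter; the data frame data1 below is simulated for illustration, so the retained features will differ from the slide:

set.seed(1)
X <- as.data.frame(matrix(runif(12 * 100, 0, 100), ncol = 12))
names(X) <- paste0("X", 1:12)
data1 <- cbind(Y = 0.4 * X$X5 + 0.6 * X$X6 + rnorm(100, sd = 15), X)   # assumed data

cor_mat    <- cor(data1)                                   # pairwise correlation matrix
cor_with_y <- cor_mat["Y", setdiff(colnames(data1), "Y")]  # correlation of each feature with Y

# Keep features whose absolute correlation with Y exceeds a chosen threshold
keep <- names(cor_with_y)[abs(cor_with_y) > 0.3]
keep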
FILTER METHOD – HYPOTHESIS TESTING

 We can perform hypothesis testing:


● For a numeric response that is normally distributed, perform a pairwise independent two-sample t-test against a categorical feature
● For a numeric response that may or may NOT be normally distributed, perform the pairwise non-parametric Mann-Whitney U test (also known as the Wilcoxon rank sum test) against a categorical feature
● For categorical variables (both response and feature), perform Pearson's chi-squared test
FILTER METHOD – HYPOTHESIS TESTING

 For any test, check for the p-value


 A p-value less than 0.05 suggests that the corresponding feature has a significant
impact on the response
 Hence, the feature should be retained for model building

Wilcoxon rank sum test with continuity correction

data: Y0 and Y1
W = 0, p-value = 3.375e-06
alternative hypothesis: true location shift is not equal to 0
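
A short R sketch of these tests on assumed toy data (output such as the Wilcoxon result above is what wilcox.test() prints):

set.seed(3)
y     <- c(rnorm(30, 50, 5), rnorm(30, 60, 5))      # assumed numeric response
group <- factor(rep(c("A", "B"), each = 30))        # assumed binary categorical feature

t.test(y ~ group)        # two-sample t-test (response assumed normal)
wilcox.test(y ~ group)   # Mann-Whitney U / Wilcoxon rank sum test

# For two categorical variables, Pearson's chi-squared test of independence
x1 <- sample(c("Yes", "No"),   60, replace = TRUE)
x2 <- sample(c("Low", "High"), 60, replace = TRUE)
chisq.test(table(x1, x2))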
WRAPPER METHODS

⚫ Wrapper methods refer to techniques that search over different subsets of features (searching the feature space) and evaluate the model fit to the data using some criteria (AIC, p-value)

⚫ They select the optimum subset based on the set criteria

⚫ The criteria may use the p-value or the AIC value to select features

⚫ Wrapper methods can be computationally expensive, since they search almost the entire feature space multiple times

⚫ Feature interactions are considered


WRAPPER METHODS

The following feature selection methods come under wrapper methods:

⚫ Forward selection – Begins with the null model and keeps adding the most significant feature one at a time

⚫ Backward elimination – Begins with the full model and keeps eliminating the least significant feature one at a time

⚫ Stepwise method – A combination of forward selection and backward elimination. Here, the most important features can be added and the least important features removed at each step
WRAPPER METHODS
 For example, let us consider Y (binary) as the response and three features, X1, X2 and X3
 Suppose we fit a logistic regression model
 Backward elimination:
● It starts with the full model Y with X1, X2 and X3
● X1 is dropped from the model and the AIC is calculated. If the change in the AIC is large,
it indicates that X1 is an important predictor. X2 and X3 are tried in the same way, one at a time
● Say, for X2, the change in the AIC is not significant. Then X2 is dropped from the model
● The new logistic regression model is Y with X1 and X3
● Again, X1 and X3 are dropped sequentially and the change in the AIC is checked
● The process above is repeated until all the redundant features have been eliminated
 In linear regression, this elimination is traditionally performed using the individual t-test p-values
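
A hedged R sketch of this backward-elimination loop using step() on a simulated logistic regression (the data and the true coefficients are made up):

set.seed(5)
dat <- data.frame(X1 = rnorm(200), X2 = rnorm(200), X3 = rnorm(200))
dat$Y <- rbinom(200, 1, plogis(1.5 * dat$X1 - 2 * dat$X3))   # X2 is irrelevant by construction

full <- glm(Y ~ X1 + X2 + X3, data = dat, family = binomial)

# step() repeatedly drops the feature whose removal lowers the AIC the most,
# and stops when no further removal improves the AIC
reduced <- step(full, direction = "backward", trace = FALSE)
summary(reduced)   # typically retains X1 and X3 only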
WRAPPER METHOD

 We notice that from X1 to X14, only X3, X4, X6, X8 and X10 are selected by the stepwise method
in linear regression
 The selection criterion used is the AIC

stepAIC(mod1, direction = "both", trace = FALSE)

Call:
lm(formula = Y ~ X3 + X4 + X6 + X8 + X10, data = data1)

Coefficients:
(Intercept) X3 X4 X6 X8 X10
16.3708 5.4039 0.1723 8.5645 0.2386 -0.2473
