Model Selection 2
FEATURE ENGINEERING
[Diagram: Feature Engineering splits into techniques for Numeric Features and Categorical Features]
Here, we will discuss a few popular feature engineering techniques for numeric and categorical features.
NUMERIC FEATURES – SCALING
Min-max normalization:
$$m_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}, \quad i = 1, \ldots, n$$
● The above transformation ensures all observations are mapped into the range [0, 1].
Z-score normalization:
$$Z_i = \frac{x_i - \mu}{\sigma}, \quad i = 1, \ldots, n$$
● In the absence of the population parameters $\mu$ and $\sigma$, we may use the sample mean $\bar{x}$ and sample standard deviation $s$:
$$Z_i = \frac{x_i - \bar{x}}{s}, \quad i = 1, \ldots, n$$
● The above transformation makes observations comparable, since they are unit-free and scaled against two important parameters of the distribution (its location $\mu$ and spread $\sigma$).
● If the data are normally distributed, the $Z_i$ values will lie within (−3, 3) about 99.7% of the time.
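A minimal R sketch of both scaling methods; the vector x below is made up:

x <- c(12, 45, 7, 88, 30, 61)          # a made-up numeric feature

m <- (x - min(x)) / (max(x) - min(x))  # min-max: all values land in [0, 1]

z <- (x - mean(x)) / sd(x)             # z-score with sample mean and sd

z2 <- as.numeric(scale(x))             # scale() gives the same z-scores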
NUMERIC FEATURES – OUTLIER HANDLING
[Figure: box plot]
A box plot is a common tool for spotting outliers: observations more than 1.5 × IQR below the first quartile or above the third quartile fall beyond the whiskers and are flagged as potential outliers. These may be removed, capped (winsorized), or investigated further.
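A minimal R sketch of the box-plot (1.5 × IQR) rule on a made-up vector:

x <- c(10, 12, 11, 13, 9, 95, 12, 10)      # made-up data with one extreme point

q     <- quantile(x, c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

x[x < lower | x > upper]                   # flags 95 as an outlier

boxplot.stats(x)$out                       # similar whisker rule, built in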
NUMERIC FEATURES – IMPUTATION
There are many ways to handle missing data, and statistical theory plays a big role in the imputation of missing values.
Missing data causes serious problems in estimating model parameters: the data are incomplete, and the resulting estimates may be biased.
We will discuss some common approaches for handling missing data. These include the following:
● Dropping observations or features
● Imputing by the average or the median value of the non-missing observations
● Using regression to impute
● Using a distribution structure to impute (multiple imputation)
NUMERIC FEATURES – IMPUTATION
[Table: features X3–X6 with scattered missing values; the X6 column is more than 50% missing, and one highlighted row has all values missing]

Dropping observations or features:
● We may drop the feature X6 from the analysis, as this column has many missing observations (>50%)
● We may drop the highlighted observation, since all of its values are missing
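A minimal R sketch of both dropping rules, using a small stand-in for the slide's table:

df <- data.frame(                        # made-up stand-in for the slide's table
  X3 = c(94, 55, NA, 62),
  X4 = c(26, 90, NA, 68),
  X5 = c( 6,  9, NA, 56),
  X6 = c(NA, 83, NA, NA)
)

df <- df[, colMeans(is.na(df)) <= 0.5]   # drop features that are >50% missing
df <- df[rowSums(!is.na(df)) > 0, ]      # drop rows where everything is missing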
NUMERIC FEATURES – IMPUTATION
[Table: the same features X3–X6; the missing entry in X5 is shown replaced by 60, the median of the remaining X5 values]

Average or median value imputation:
● A missing value may be imputed by the average or the median of the remaining observations of that feature
● Here, the missing value in X5 is imputed using the median of the remaining values of X5
● Imputing by the median is slightly safer, in the sense that the median is less sensitive to outliers
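A sketch of median imputation in R on made-up columns:

df <- data.frame(X5 = c(6, 9, 95, 56, NA, 78, 37),
                 X6 = c(NA, 83, 74, 68, 3, 0, 22))   # made-up values

impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)             # fill NAs with the median
  x
}
df[] <- lapply(df, impute_median)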
NUMERIC FEATURES – IMPUTATION
[Table: response Y and features X3–X5, with several X4 values missing]

Regression-based imputation:
● Fit a linear regression model between Y and X4, ignoring the missing observations completely
● Then use the fitted line to interpolate or predict the missing values of X4
● The fitted line is $Y = 653.055 + 3.1529\,X_4$
● Substitute the values of Y for which X4 is missing into the equation
● Impute the missing values by solving the equation for X4
● Non-linear regression can also be applied
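A sketch of this approach in R; the made-up Y and X4 below stand in for the table, so the fitted coefficients will differ from 653.055 and 3.1529:

Y  <- c(694.85, 1008.2, 1192.2, 330.3, 742.7, 283.3)   # made-up stand-ins
X4 <- c(26, 90, 97, NA, 2, 46)

fit <- lm(Y ~ X4)                    # rows with a missing X4 are dropped
b   <- coef(fit)                     # b[1] = intercept, b[2] = slope

miss     <- is.na(X4)
X4[miss] <- (Y[miss] - b[1]) / b[2]  # invert the fitted line to recover X4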
NUMERIC FEATURES – IMPUTATION
[Table: response Y and features X3–X5, with several X4 values missing]

Distribution-based imputation:
● Plot a histogram of X4 using the available values
● Depending upon the shape of the distribution, fit some known distribution, such as a normal or an exponential distribution
● Estimate the parameters of the distribution
● Generate random numbers from this distribution, say, 1,000 times
● Take the average of these numbers and put it in place of a single missing value
● Repeat the above steps for the other missing points
● In the absence of a known distribution, bootstrapping can be applied to impute the value
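A sketch in R, assuming a normal distribution fits the observed X4 values (which are made up here):

X4  <- c(26, 90, 97, NA, 2, 46, 29, NA, 18, 94, 52)  # made-up, with gaps
obs <- X4[!is.na(X4)]

mu <- mean(obs)                       # fit a normal by estimating its parameters
s  <- sd(obs)

set.seed(1)
for (i in which(is.na(X4))) {
  X4[i] <- mean(rnorm(1000, mu, s))   # average of 1,000 draws per missing value
}

# with no plausible distribution, bootstrap instead:
# X4[i] <- mean(sample(obs, 1000, replace = TRUE))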
NUMERIC FEATURES – BINNING
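Binning groups a numeric feature into a small number of intervals, turning it into an ordinal feature. A minimal R sketch with cut(), using equal-width and then equal-frequency (quantile) bins on a made-up vector:

x <- c(0.02, 0.05, 0.08, 0.10, 0.15, 0.30, 0.55, 0.07, 0.12, 0.04)

bins_width <- cut(x, breaks = 4)                     # four equal-width bins

bins_freq  <- cut(x, breaks = quantile(x, seq(0, 1, 0.25)),
                  include.lowest = TRUE)             # roughly equal counts
table(bins_freq)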
NUMERIC FEATURES – FEATURE TRANSFORMATION

[Figure: right-skewed histogram; frequency axis from 0 to 2,000, value axis from 0.0 to 0.6]

Most observations are clustered towards the smaller values.
This can affect the model to be fitted, since more weight would effectively be given to the lower values than to the higher ones.
Also, an increase of one unit among the lower values may not have the same impact on the response as an increase of one unit among the higher values.
To eliminate this bias, a suitable transformation, such as a logarithmic transformation, may be considered.
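A minimal R sketch; the right-skewed feature is simulated here with an exponential sample:

set.seed(1)
x <- rexp(5000, rate = 10)    # simulated right-skewed feature, clustered near 0

x_log <- log1p(x)             # log(1 + x) handles zeros safely
# hist(x); hist(x_log)        # the transformed histogram is far less skewed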
NUMERIC FEATURES – FEATURE TRANSFORMATION
● X2 = 1 if Mumbai; else, X2 = 0
● X3 = 1 if Chennai; else, X3 = 0
X X1 X2 X3
Delhi 1 0 0
Mumbai 0 1 0
Chennai 0 0 1
Kolkata 0 0 0
Delhi 1 0 0
Kolkata 0 0 0
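A sketch of the same encoding in R with model.matrix(); Kolkata is releveled to be the all-zero baseline, as in the table:

X <- factor(c("Delhi", "Mumbai", "Chennai", "Kolkata", "Delhi", "Kolkata"))
X <- relevel(X, ref = "Kolkata")     # make Kolkata the all-zero baseline

dummies <- model.matrix(~ X)[, -1]   # drop the intercept column
dummies                              # one 0/1 column per remaining city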
CATEGORICAL FEATURES – ONE HOT ENCODING
⚫⚫ Scaling
⚫⚫ Outlier Handling
⚫⚫ Imputation
⚫⚫ Binning
⚫⚫ Feature Transformation
⚫⚫ Encoding
FEATURE SELECTION
Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
45.72 92 38 94 26 6 23 53 2 58 98 2 61
91.18 53 65 55 90 9 83 27 11 0 97 52 88
84.94 66 44 93 97 95 74 34 93 84 36 27 85
86.37 38 84 62 68 56 68 84 24 32 48 41 80
44.84 63 59 51 80 90 3 10 37 96 18 11 24
53.95 81 86 85 29 78 0 11 90 54 72 41 35
49.26 53 58 100 2 37 22 25 31 70 67 34 73
70.31 86 35 54 6 7 70 41 92 69 92 52 97
52.65 68 57 16 46 84 19 84 4 41 17 67 63
49.09 36 51 30 29 87 17 91 98 60 29 42 82
93.07 87 83 33 18 73 96 36 83 85 2 35 48
101.84 54 91 98 94 9 100 57 40 54 47 22 65
87.61 82 79 74 52 60 80 71 23 23 25 42 42
Feature set: a sample dataset with response Y and candidate features X1–X12
Feature selection refers to selecting significant features from the full set of features and discarding redundant ones.
Irrelevant features can degrade predictions and performance and distort the relationship between X and Y.
Feature selection also helps manage overfitting.
It keeps the model relatively simple.
It enables the algorithm to train faster by reducing model complexity.
FEATURE SELECTION
As an example of a filter-style check, a Wilcoxon rank-sum test can be used to test whether a feature's values differ between the two response groups Y0 and Y1:

data:  Y0 and Y1
W = 0, p-value = 3.375e-06
alternative hypothesis: true location shift is not equal to 0

The very small p-value indicates a significant location shift between the groups, so the feature discriminates well between them.
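The output above comes from a call like the one below; the vectors here are made-up stand-ins for the slide's data:

Y0 <- c(5.1, 4.8, 5.5, 4.9, 5.0)
Y1 <- c(7.2, 6.9, 7.8, 7.1, 7.4)

wilcox.test(Y0, Y1)   # W = 0 here as well, since the two groups do not overlap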
WRAPPER METHODS
● The selection criteria may use the p-value or the AIC value to select features
We notice that, out of X1 to X12, only X3, X4, X6, X8 and X10 are selected by the stepwise method in linear regression.
The selection criterion used is AIC.
Call:
lm(formula = Y ~ X3 + X4 + X6 + X8 + X10, data = data1)

Coefficients:
(Intercept)       X3       X4       X6       X8      X10
    16.3708   5.4039   0.1723   8.5645   0.2386  -0.2473
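A sketch of how such a stepwise fit can be reproduced with step() in R; data1 is simulated below as a stand-in for the slide's dataset, so the selected features will differ:

set.seed(1)
data1 <- as.data.frame(matrix(runif(100 * 12, 0, 100), ncol = 12))
names(data1) <- paste0("X", 1:12)
data1$Y <- 16 + 5.4 * data1$X3 + 0.17 * data1$X4 + rnorm(100, sd = 10)

full  <- lm(Y ~ ., data = data1)
final <- step(full, direction = "both", trace = FALSE)   # AIC-based stepwise
summary(final)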