
Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and it is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.

Feature engineering is an informal topic, but it is considered essential in applied machine learning.

Steps for reducing variables

1. Remove variables having more than 80% missing values

2. Remove variables having zero variance
3. Weight of Evidence (WOE) and Information Value (IV)

4. Variable clustering
   a. PCA (Principal Component Analysis)
5. Multicollinearity
   a. VIF = 1/(1 - R^2); a value > 5 flags a problematic variable
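The mechanical steps above (dropping high-missingness and zero-variance variables, then computing VIF) can be sketched as below. This is an illustrative sketch, not code from the original text: the function names and thresholds are assumptions, and the VIF is computed directly from the 1/(1 - R^2) formula by regressing each column on the others.

```python
import numpy as np
import pandas as pd

def reduce_variables(df, missing_thresh=0.8):
    # Step 1: drop variables with more than 80% missing values
    df = df.loc[:, df.isna().mean() <= missing_thresh]
    # Step 2: drop variables with zero variance (a single unique value)
    df = df.loc[:, df.nunique(dropna=True) > 1]
    return df

def vif(df):
    # VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    # column j on all the remaining columns (with an intercept)
    out = {}
    X = df.dropna().to_numpy(dtype=float)
    for j, col in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2) if r2 < 1 else np.inf
    return pd.Series(out)
```

Variables whose VIF exceeds 5 would then be candidates for removal, per step 5a.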

Validating a Statistical Model

KS & Lift
Ways to measure the KS statistic

Method 1: Decile Method
This method is the most common way to calculate the KS statistic for validating a binary predictive model. See the steps below.
You need two variables before calculating KS. The first is the dependent variable, which should be binary. The second is the predicted probability score generated by the statistical model.
Create deciles based on the predicted probability column, meaning divide the probabilities into 10 parts. The first decile should contain the highest probability scores.
Calculate the cumulative % of events and non-events in each decile, and then compute the difference between these two cumulative distributions.
KS is where the difference is maximum.
If the maximum KS falls in the top 3 deciles and its score is above 40, the model is considered a good predictive model. At the same time, it is important to validate the model by checking other performance metrics as well, to confirm that the model is not suffering from an overfitting problem.
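The decile steps above can be sketched in pandas as follows. This is a minimal illustration under assumed names (ks_table is not from the original text); decile 1 holds the highest scores, and KS is the maximum gap between the two cumulative distributions.

```python
import numpy as np
import pandas as pd

def ks_table(y_true, y_score, n_bins=10):
    # Rank scores descending so decile 1 = highest predicted probability
    df = pd.DataFrame({"y": y_true, "score": y_score})
    ranks = df["score"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, n_bins, labels=list(range(1, n_bins + 1)))
    grp = df.groupby("decile", observed=True)["y"].agg(
        events="sum", total="count")
    grp["non_events"] = grp["total"] - grp["events"]
    # Cumulative % of events and non-events per decile
    grp["cum_event_pct"] = grp["events"].cumsum() / grp["events"].sum()
    grp["cum_nonevent_pct"] = grp["non_events"].cumsum() / grp["non_events"].sum()
    # KS per decile = gap between the two cumulative distributions;
    # the KS statistic is the maximum of this column
    grp["ks"] = (grp["cum_event_pct"] - grp["cum_nonevent_pct"]).abs()
    return grp
```

`ks_table(y, scores)["ks"].max()` then gives the KS statistic (multiply by 100 to compare against the "above 40" rule of thumb).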

Lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by the average response.

For example, suppose a population has an average response rate of 5%, but a certain model (or
rule) has identified a segment with a response rate of 20%. Then that segment would have a lift
of 4.0 (20%/5%).
Typically, the modeler seeks to divide the population into quantiles and rank the quantiles by
lift. Organizations can then consider each quantile, and by weighing the predicted response rate
(and associated financial benefit) against the cost, they can decide whether to market to that
quantile or not.
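The quantile-lift procedure described above can be sketched as below; the function name decile_lift and the use of deciles (rather than some other quantile) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def decile_lift(y_true, y_score, n_bins=10):
    # Rank by predicted score, decile 1 = highest scores,
    # then compare each decile's response rate to the overall rate
    df = pd.DataFrame({"y": y_true, "score": y_score})
    ranks = df["score"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, n_bins, labels=list(range(1, n_bins + 1)))
    overall_rate = df["y"].mean()
    # Lift per decile = decile response rate / population response rate;
    # e.g. a segment responding at 20% vs a 5% population rate has lift 4.0
    return df.groupby("decile", observed=True)["y"].mean() / overall_rate
```

Ranking the resulting series from highest to lowest lift reproduces the quantile ranking the modeler uses to decide which segments are worth targeting.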
