
Feature Engineering

Spatial refers to space. Temporal refers to time.


Discrete vs continuous data
• Both data types are important for statistical analysis. However, there are some
notable differences between the two that need to be understood before
drawing any conclusions or making assumptions about the data type at hand.
Discrete data vs. continuous data:
• Discrete data takes specific, countable values; continuous data takes any measured value within a specific range.
• Common examples of discrete data are the number of students, the number of children, shoe size, and so on; common examples of continuous data are height, weight, length, time, temperature, age, and so on.
• Ordinal values and integer values represent discrete data; decimal numbers and fractions represent continuous data.
• Discrete data is easily counted on something as simple as a number line; continuous data requires more in-depth measurement tools and methods such as curves and skews.
• Discrete data remains constant over a specific time interval; continuous data varies over time and can have different values at any given point.
What is continuous data?
• Continuous data is a type of numerical data that refers to the unspecified number of
possible measurements between two realistic points.

• These numbers are not always clean and tidy like those in discrete data, as they're usually collected from precise measurements.

• Over time, measuring a particular subject allows us to create a defined range, where we can reasonably expect to collect more data.

• Continuous data is all about accuracy. Variables in these data sets often carry decimal points, with the digits to the right of the decimal extending as far as the measurement allows.

• This level of detail is paramount for scientists, doctors, and manufacturers, to name a few.
Examples of continuous data
• Some examples of continuous data include:
• The weight of newborn babies
• The daily wind speed
• The temperature of a freezer

• When you think of experiments or studies involving constant measurements, they're likely to be continuous variables to some extent.
• If you have a number like “2.86290” anywhere on a spreadsheet, it's not a
number you could have quickly arrived at yourself — think measurement devices
like stopwatches, scales, thermometers, and the like.
Key characteristics of continuous data
• Unlike discrete data, continuous data can be either numeric or distributed over
date and time.
• This data type uses advanced statistical analysis methods taking into account
the infinite number of possible values. Key characteristics of continuous
data are:
• Continuous data changes over time and can have different values at different
time intervals.
• Continuous data is made up of random variables, which may or may not
be whole numbers.
• Continuous data is measured using data analysis methods such as line graphs,
skews, and so on.
• Regression analysis is one of the most common types
of continuous data analysis.
Feature engineering is the most important aspect of data science model development. There are several categories of features in a raw dataset.

Features can be text, date/time, categorical, and continuous variables. For a machine learning model, the dataset needs to be processed into numerical vectors so that it can be trained with an ML algorithm.

The objective of this article is to demonstrate feature engineering techniques to transform a categorical feature into a continuous feature and vice versa.

• Feature Binning: conversion of a continuous variable into a categorical feature.

• Feature Encoding: conversion of a categorical variable into numerical features.
Feature Encoding:
• Feature Encoding is used for the transformation of a categorical feature
into a numerical variable. Most of the ML algorithms cannot handle
categorical variables and hence it is important to do feature encoding.
There are many encoding techniques used for feature engineering:
• 1. Label Encoding:
• Label encoding is an encoding technique to transform categorical variables
into numerical variables by assigning a numerical value to each of the
categories. Label encoding can be used for Ordinal variables.
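• A minimal sketch of label encoding with scikit-learn's LabelEncoder; the "size" column and its values are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal column, used only to illustrate the idea
df = pd.DataFrame({"size": ["small", "medium", "large", "medium", "small"]})

encoder = LabelEncoder()
# Each distinct category receives an integer label (assigned in sorted order)
df["size_label"] = encoder.fit_transform(df["size"])
print(df)
print(encoder.classes_)  # position in this array = assigned integer label
```

• Note that LabelEncoder assigns labels in sorted order, which may not match the true ordinal order of the categories; ordinal encoding (next) addresses this.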
• 2. Ordinal encoding:
• Ordinal encoding is an encoding technique to transform an original
categorical variable to a numerical variable by ensuring the ordinal
nature of the variables is sustained.
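• A minimal sketch using scikit-learn's OrdinalEncoder with an explicit category order; the column and its ordering are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["small", "medium", "large", "medium", "small"]})

# Passing the category order explicitly preserves the ordinal nature
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_ordinal"] = encoder.fit_transform(df[["size"]]).ravel()
print(df)  # small -> 0.0, medium -> 1.0, large -> 2.0
```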
• 3. Frequency encoding:
• Frequency encoding is an encoding technique to transform an original
categorical variable to a numerical variable by considering the
frequency distribution of the data. It can be useful for nominal
features.
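• A minimal frequency-encoding sketch in pandas; the "city" column is a made-up nominal feature:

```python
import pandas as pd

df = pd.DataFrame({"city": ["delhi", "mumbai", "delhi", "pune", "delhi", "mumbai"]})

# Replace each category by its relative frequency in the data
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df)
```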
• 4. Binary encoding:
• Binary encoding is an encoding technique to transform an original categorical variable to
a numerical variable by encoding the categories as Integer and then converted into
binary code. This method is preferable for variables having a large number of categories.
• For a variable with 100 categories, label encoding creates 100 distinct integer labels, one per category, whereas binary encoding needs only 7 binary columns (since 2^7 = 128 ≥ 100).
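• A minimal sketch of the idea done by hand with pandas (libraries such as category_encoders offer a BinaryEncoder that does this directly); the "product" column is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"product": ["p1", "p2", "p3", "p4", "p5", "p2"]})

# Step 1: assign an integer code to each category (as in label encoding)
codes = df["product"].astype("category").cat.codes.to_numpy()

# Step 2: write each code in binary and spread the bits over a few columns
n_bits = max(1, int(codes.max()).bit_length())
for bit in range(n_bits):
    df[f"product_bin_{bit}"] = (codes >> bit) & 1

print(df)  # roughly log2(#categories) columns instead of one per category
```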
• 5. One hot encoding:
• One hot encoding splits each category into its own column. It creates k different columns, one per category, and for each row sets the matching column to 1 and the remaining columns to 0.
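• A minimal one-hot encoding sketch with pandas.get_dummies; the "colour" column is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(df["colour"], prefix="colour")
df = pd.concat([df, one_hot], axis=1)
print(df)
```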
• 6. Target Mean encoding:
• Mean encoding is one of the best techniques to transform categorical variables into numerical variables as it
takes the target class label into account. The basic idea is to replace the categorical variable with the mean of
its corresponding target variable.
• Here the categorical variable that needs to be encoded is the independent variable (IV) and the target class
label is the dependent variable (DV).
• Steps for mean encoding:
• Select a category
• Group by the category and obtain aggregated sum (= a)
• Group by the category and obtain aggregated total count (= b)
• Numerical value for that category = a/b
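• A minimal sketch of the a/b steps above using a pandas groupby; "city" (the IV) and "bought" (the DV) are made-up names:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["delhi", "mumbai", "delhi", "pune", "mumbai", "delhi"],
    "bought": [1, 0, 1, 0, 1, 0],
})

# mean = (sum of target per category, a) / (count per category, b)
means = df.groupby("city")["bought"].mean()
df["city_mean_enc"] = df["city"].map(means)
print(df)
```

• In practice the means are computed on the training split only, to avoid leaking the target into the features.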
Feature Binning:

• Binning or discretization is used for the transformation of a continuous or numerical variable into a categorical feature. Binning a continuous variable introduces non-linearity and tends to improve the performance of the model. It can also be used to identify missing values or outliers.
• There are two types of binning:
• Unsupervised Binning: Equal width binning, Equal frequency binning
• Supervised Binning: Entropy-based binning
• Unsupervised Binning:
• Unsupervised binning is a category of binning that transforms a numerical or continuous variable into categorical bins without taking the target class label into account. Unsupervised binning is of two categories:
• 1. Equal Width Binning:
• This algorithm divides the continuous variable into several categories (bins), each covering a range of the same width.
• Notations:
x = number of categories
w = width of a category
max, min = maximum and minimum of the list
• The width of each bin is w = (max - min) / x.
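• A minimal equal-width binning sketch using pandas.cut; the age values and bin labels are illustrative:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 67], name="age")

# bins=3 splits the range [min, max] into 3 intervals of equal width w
age_bin = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(pd.concat([ages, age_bin.rename("age_bin")], axis=1))
```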
• 2. Equal frequency binning:
• This algorithm divides the data into several categories containing approximately the same number of values, so the data values are distributed equally across the formed categories.
• Notations:
x = number of categories
freq = frequency of a category (number of values per bin)
n = number of values in the data
• Each bin therefore holds roughly freq = n / x values.
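• A minimal equal-frequency binning sketch using pandas.qcut; the income values are illustrative:

```python
import pandas as pd

income = pd.Series([12, 15, 18, 22, 30, 35, 41, 55, 70, 90, 120, 150], name="income")

# q=4 puts roughly n / 4 values into each bin (quartiles)
income_bin = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])
print(pd.concat([income, income_bin.rename("income_bin")], axis=1))
```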
• Supervised Binning:
• Supervised binning is a type of binning that transforms a numerical or continuous variable into a categorical variable taking the target class label into account.
• It refers to the target class label when selecting discretization cut
points. Entropy-based binning is a type of supervised binning.

• 1. Entropy-based Binning:
• The entropy-based binning algorithm categorizes the continuous or numerical variable so that the majority of values in a bin or category belong to the same class label.
• It calculates the entropy of the target class labels and selects the split with maximum information gain.
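• A minimal sketch of entropy-based splitting for a single cut point and a binary target; the data and helper names are made up for illustration:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of an array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut_point(x, y):
    # Try every candidate threshold and keep the one with maximum information gain
    base = entropy(y)
    best_gain, best_cut = -1.0, None
    for cut in np.unique(x)[:-1]:
        left, right = y[x <= cut], y[x > cut]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - cond
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

# Small values mostly belong to class 0, large values to class 1
x = np.array([1, 2, 3, 4, 10, 11, 12, 13])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(best_cut_point(x, y))  # chosen threshold and its information gain
```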
Problem with One Hot Encoding while handling a large number of categories
Speaker 1: What's new in handling categorical data? We have the Label Encoding and One Hot Encoding techniques from Scikit-learn. Well, I have OHE to save me here too.

Speaker 2: Well, I too always rely on one hot encoding (OHE) for my categorical data, but one problem forced me to wonder whether it is a good enough technique every time. Let's talk about a non-ordinal categorical column. What do you think we should do if we get thousands of categories in a column?

Speaker 1: Don't you think it will create thousands of new columns/features? Your algorithm or CPU will get scared to see that many features for a single piece of information (kidding). Do you know, or can you think of, any better way to deal with it?

Speaker 2: Yeah, you got it, genius! That is my point here: COD (the curse of dimensionality). With this COD you may be limited to using linear models only (need to check this).

Speaker 1: Yeah, I didn't give much thought to the thousands of columns it will generate; it will increase the dimensionality. Use PCA for dimensionality reduction. But wait, it doesn't seem right to blow up the dimensionality first and then use PCA. I think you've got something on this, right?

Speaker 2: Yes, I think so, and it should work. It is bin counting, and some call it response encoding. It helps to reduce the number of output features for a categorical column. It doesn't care how many different categories a column has; it gives a number of columns equal to the number of categories in the target variable. So, if your problem is binary classification, it will give you a two-column representation only, irrespective of the number of unique values in the categorical column.

Speaker 1: And how does it help? It sounds interesting.

Speaker 2: This encoding technique doesn't encode the categorical values directly, but gathers some statistics about them, say the conditional probability, odds ratio, or log odds ratio, and then it makes a count table and inserts these statistics into it.

Speaker 1: Go on… It seems like a lot of work before you get what you want. Which library function do you use to implement this?

Speaker 2: I didn't find anything implemented as such, so I have written my own function. Yes, it adds a slight delay to the learning pipeline, but it works well for Kaggle competitions.

The idea of bin counting is deviously simple: rather than using the value of the categorical variable as the feature, use the conditional probability of the target under that value. In other words, instead of encoding the identity of the categorical value, we compute the association statistics between that value and the target that we wish to predict. If you are familiar with naive Bayes classifiers, this statistic should ring a bell, because it is the conditional probability of the class under the assumption that all features are independent.

In short, bin counting converts a categorical variable into statistics about its values. It turns a large, sparse, binary representation of the categorical variable, such as that produced by one-hot encoding, into a very small, dense, real-valued numeric representation.
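A minimal sketch of this bin-counting / response-encoding idea; the "device" column and binary "click" target are made-up names used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "device": ["a", "a", "b", "b", "b", "c", "c", "a"],
    "click":  [1, 0, 1, 1, 0, 0, 0, 1],
})

# Count table: P(target = k | category) for every category and every class k
count_table = pd.crosstab(df["device"], df["click"], normalize="index")
count_table.columns = [f"p_click_{c}" for c in count_table.columns]

# Join the statistics back: two dense columns, however many categories exist
encoded = df.join(count_table, on="device")
print(encoded)
```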
What is Binning?
• Binning is a technique that
accomplishes exactly what it sounds
like. It will take a column with
continuous numbers and place the
numbers in “bins” based on ranges that
we determine.
• This will give us a new categorical
variable feature.

• For instance, let's say we have a DataFrame of cars. We'll focus on the MPG column and make 3 bins: fuel-efficient, average fuel efficiency, and gas guzzlers.
Making the Bins

• Our first step will be to determine the ranges for these bins. This can be tricky as
it can be done in a few different ways.
• One way to do this is to divide the bins up evenly based on the distribution of
values. Basically, we would go through the same process as if we were creating
a histogram.
• Since Python makes generating histograms easy with matplotlib, let’s visualize
this data in a histogram and set the bins parameter to 3.
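A minimal sketch of that step; the MPG values are made up, standing in for the cars DataFrame described above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"mpg": [9, 12, 14, 17, 18, 21, 24, 26, 29, 33, 36, 41]})

# Visualize the distribution with the bins parameter set to 3
df["mpg"].plot.hist(bins=3)
plt.xlabel("MPG")
plt.show()

# The same ranges can then be used to create the categorical feature
df["fuel_economy"] = pd.cut(df["mpg"], bins=3,
                            labels=["gas guzzler", "average", "fuel efficient"])
print(df)
```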
Pros and Cons of Bin Counting approach
Bin-counting Scheme
• The encoding schemes we discussed so far, work quite well on categorical
data in general, but they start causing problems when the number of distinct
categories in any feature becomes very large.
• Essentially, for any categorical feature with m distinct labels, you get m separate features.
• This can easily increase the size of the feature set causing problems like
storage issues, model training problems with regard to time, space and
memory.
• Besides this, we also have to deal with ‘curse of dimensionality’ where
basically with an enormous number of features and not enough
representative samples, model performance starts getting affected often
leading to overfitting.
• The bin-counting scheme is a useful scheme for dealing with categorical
variables having many categories.
• In this scheme, instead of using the actual label values for encoding, we use
probability based statistical information about the value and the actual
target or response value which we aim to predict in our modeling efforts.

• A simple example would be based on past historical data for IP addresses and the ones which were used in DDoS attacks; we can build probability values for a DDoS attack being caused by any of the IP addresses.
• Using this information, we can encode an input feature that captures, if the same IP address appears in the future, the probability of it causing a DDoS attack. This scheme needs historical data as a prerequisite and is an elaborate one.
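A minimal sketch of that IP-address example; the log data and column names are entirely hypothetical:

```python
import pandas as pd

# Hypothetical historical log: source IP and whether it took part in a DDoS attack
history = pd.DataFrame({
    "ip":     ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3"],
    "attack": [1, 1, 0, 1, 0],
})

# P(attack | ip) estimated from past data
p_attack = history.groupby("ip")["attack"].mean()

# Encode incoming traffic; unseen IPs fall back to the global attack rate
new_traffic = pd.DataFrame({"ip": ["10.0.0.1", "10.0.0.3", "10.0.0.9"]})
new_traffic["p_ddos"] = new_traffic["ip"].map(p_attack).fillna(history["attack"].mean())
print(new_traffic)
```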
Feature Hashing Scheme
• The feature hashing scheme is another useful feature engineering scheme for
dealing with large scale categorical features.
• In this scheme, a hash function is typically used with the number of encoded
features pre-set (as a vector of pre-defined length) such that the hashed
values of the features are used as indices in this pre-defined vector and values
are updated accordingly.
• Since a hash function maps a large number of values into a small finite set of values, multiple different values might create the same hash, which is termed a collision.
• Typically, a signed hash function is used so that the sign of the value obtained
from the hash is used as the sign of the value which is stored in the final
feature vector at the appropriate index.
• This should ensure fewer collisions and less accumulation of error due to collisions.
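A minimal sketch with scikit-learn's FeatureHasher; the city names and the vector length of 8 are illustrative choices:

```python
from sklearn.feature_extraction import FeatureHasher

# The output length is fixed up front, regardless of how many categories appear
hasher = FeatureHasher(n_features=8, input_type="string")

cities = [["delhi"], ["mumbai"], ["pune"], ["delhi"]]  # one list of strings per row
hashed = hasher.transform(cities)

# Each row becomes an 8-dimensional vector; the signed hash picks index and sign
print(hashed.toarray())
```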
Dealing with categorical features with high
cardinality: Feature Hashing
• Many machine learning algorithms are
not able to use non-numeric data.
• Usually, these features are represented
by strings, and we need some way of
transforming them into numbers before
using scikit-learn’s algorithms.
• The different ways of doing this are
called encodings.
Feature hashing in a tweet example.
