elitedatascience.com/feature-engineering
In the previous overview, you learned a reliable framework for cleaning your dataset. We
fixed structural errors, handled missing data, and filtered observations.
In this guide, we'll see how we can perform feature engineering to help out our
algorithms and improve model performance.
Remember, out of all the core steps, data scientists usually spend the most time on
feature engineering:
Feature engineering is about creating new input features from your existing ones.
In general, you can think of data cleaning as a process of subtraction and feature
engineering as a process of addition.
This is often one of the most valuable tasks a data scientist can do to improve model
performance, for 3 big reasons:
1. You can isolate and highlight key information, which helps your algorithms "focus"
on what’s important.
2. You can bring in your own domain expertise.
3. Most importantly, once you understand the "vocabulary" of feature engineering, you
can bring in other people’s domain expertise!
In this lesson, we will introduce several heuristics to help spark new ideas.
Before moving on, we just want to note that this is not an exhaustive compendium of
feature engineering, because the possibilities for this step are limitless.
The good news is that this skill will naturally improve as you gain more experience.
Getting classy.
Try to think of specific information you might want to isolate. Here, you have a lot of
"creative freedom."
Going back to our example with the real-estate dataset, let's say you remembered that
the housing crisis occurred in the same timeframe...
As you might suspect, "domain knowledge" is very broad and open-ended. At some point,
you'll get stuck or exhaust your ideas.
That's where these next few steps come in. These are a few specific heuristics that can
help spark more.
Joining forces.
The first of these heuristics is checking to see if you can create any interaction features
that make sense. These are combinations of two or more features.
By the way, in some contexts, "interaction terms" must be products between two
variables. In our context, interaction features can be products, sums, or differences
between two features.
A general tip is to look at each pair of features and ask yourself, "Could I combine this
information in a way that might be even more useful?"
Example (real-estate)
Let's say we already had a feature called 'num_schools', i.e. the number of
schools within 5 miles of a property.
Let's say we also had the feature 'median_school', i.e. the median quality score of
those schools.
However, we might suspect that what's really important is having many school
options, but only if they are good.
Well, to capture that interaction, we could simply create a new
feature: 'school_score' = 'num_schools' x 'median_school'
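Sketched in pandas, this is a one-line multiplication of the two columns (the sample values below are hypothetical, just to illustrate the shape of the data):

```python
import pandas as pd

# Hypothetical real-estate data with the two features from the example
df = pd.DataFrame({
    'num_schools': [3, 0, 5],
    'median_school': [4.0, 0.0, 2.5],
})

# Interaction feature: many school options matter most when they're good,
# so multiply count by quality
df['school_score'] = df['num_schools'] * df['median_school']
print(df['school_score'].tolist())  # [12.0, 0.0, 12.5]
```

Notice how a property with zero schools nearby gets a score of zero regardless of quality, which is exactly the interaction we wanted to capture.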
The next heuristic we’ll consider is grouping sparse classes.
Sparse classes (in categorical features) are those that have very few total observations.
They can be problematic for certain machine learning algorithms, making models prone to
overfitting.
To begin, we can group similar classes. In the chart above, the 'exterior_walls' feature
has several classes that are quite similar.
We might want to group 'Wood Siding', 'Wood Shingle', and 'Wood' into a single
class. In fact, let's just label all of them as 'Wood'.
Next, we can group the remaining sparse classes into a single 'Other' class, even if
there's already an 'Other' class.
We'd group 'Concrete Block', 'Stucco', 'Masonry', 'Other', and 'Asbestos shingle' into
just 'Other'.
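A minimal pandas sketch of both grouping steps, using a small hypothetical sample of the 'exterior_walls' values:

```python
import pandas as pd

# Hypothetical sample of the 'exterior_walls' feature
df = pd.DataFrame({'exterior_walls': [
    'Wood Siding', 'Brick', 'Wood Shingle', 'Stucco',
    'Wood', 'Concrete Block', 'Masonry', 'Asbestos shingle',
]})

# Step 1: group similar classes into a single 'Wood' class
df['exterior_walls'] = df['exterior_walls'].replace(
    ['Wood Siding', 'Wood Shingle', 'Wood'], 'Wood')

# Step 2: group the remaining sparse classes into 'Other'
df['exterior_walls'] = df['exterior_walls'].replace(
    ['Concrete Block', 'Stucco', 'Masonry', 'Other', 'Asbestos shingle'],
    'Other')

print(df['exterior_walls'].value_counts())
```

After both steps, this sample collapses to just three classes: 'Wood', 'Brick', and 'Other'.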
Here's how the class distributions look after combining similar and other classes:
After combining sparse classes, we have fewer unique classes, but each one has more
observations.
Often, an eyeball test is enough to decide if you want to group certain classes together.
Most machine learning algorithms cannot directly handle categorical features. Specifically,
they cannot handle text values.
Dummy variables are a set of binary (0 or 1) variables that each represent a single class
from a categorical feature.
The information you represent is exactly the same, but this numeric representation allows
you to pass the technical requirements for algorithms.
In the example above, after grouping sparse classes, we were left with 8 classes, which
translate to 8 dummy variables:
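In pandas, pd.get_dummies handles this conversion directly. A small sketch with three classes instead of the full eight (the idea is identical, just fewer columns):

```python
import pandas as pd

# Hypothetical observations after grouping sparse classes
df = pd.DataFrame({'exterior_walls': ['Brick', 'Wood', 'Other']})

# One binary (0/1) column per class; cast to int for clean 0/1 values
dummies = pd.get_dummies(df['exterior_walls']).astype(int)
print(dummies)
```

Each row has exactly one 1 (its original class) and 0s everywhere else, so the information is unchanged; it's just numeric now.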
(The 3rd column depicts an example for an observation with brick walls)
Unused features are those that don’t make sense to pass into our machine learning
algorithms. Examples include:
ID columns
Features that wouldn't be available at the time of prediction
Other text descriptions
Redundant features would typically be those that have been replaced by other features
that you’ve added during feature engineering.
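Dropping both kinds is a single call in pandas. In this sketch (all column names hypothetical), 'property_id' is an unused ID column, and the two school features are now redundant because 'school_score' replaced them:

```python
import pandas as pd

# Hypothetical analytical table mid-way through feature engineering
df = pd.DataFrame({
    'property_id': [101, 102],      # ID column: unused
    'tx_price': [250000, 310000],
    'num_schools': [3, 5],
    'median_school': [4.0, 2.5],
})
df['school_score'] = df['num_schools'] * df['median_school']

# Drop the unused ID column and the two features made redundant
# by the engineered 'school_score'
df = df.drop(columns=['property_id', 'num_schools', 'median_school'])
print(df.columns.tolist())  # ['tx_price', 'school_score']
```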
"Would someone please ask Alex to get off the ping-pong table? We're waiting to play!"
After completing Data Cleaning and Feature Engineering, you'll have transformed your
raw dataset into an analytical base table (ABT). We call it an "ABT" because it's what
you'll be building your models on.
As a final tip: Not all of the features you engineer need to be winners. In fact, you’ll often
find that many of them don’t improve your model. That’s fine because one highly
predictive feature makes up for 10 duds.
The key is choosing machine learning algorithms that can automatically select the best
features among many options (built-in feature selection).
This will allow you to avoid overfitting your model despite providing many input features.
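One concrete example of built-in feature selection is L1 (Lasso) regularization, which drives the coefficients of unhelpful features to exactly zero. A small scikit-learn sketch on synthetic data (one predictive feature plus nine "duds"):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first feature is truly predictive; the other 9 are duds
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# L1 penalty shrinks the dud coefficients to exactly zero
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_.round(2))
```

Tree ensembles (e.g. random forests) achieve a similar effect by simply never splitting on uninformative features.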
Additional Resources
Best Practices for Feature Engineering
Ready to roll up your sleeves and take your skills to the next level? Join the Machine
Learning Accelerator today.