feature_engine Documentation
Release 1.1.2
Feature-engine Developers
1 Installation
3 Feature-engine’s Transformers
4 Getting Help
6 Contributing
7 Open Source
Bibliography
Index
Feature-engine is a Python library with multiple transformers to engineer features for use in machine learning models.
Feature-engine preserves Scikit-learn functionality with the methods fit() and transform() to learn parameters from
the data and then transform it.
Feature-engine includes transformers for:
• Missing data imputation
• Categorical variable encoding
• Discretisation
• Variable transformation
• Outlier capping or removal
• Variable combination
• Variable selection
Feature-engine allows you to select the variables you want to transform within each transformer. This way, different
engineering procedures can be easily applied to different feature subsets.
Feature-engine transformers can be assembled within the Scikit-learn pipeline, therefore making it possible to save and
deploy one single object (.pkl) with the entire machine learning pipeline. Check Quick Start for an example, on the
navigation panel on the left.
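For instance, here is a minimal sketch of such a pipeline (the toy dataframe and the Lasso model are illustrative, not part of the original example):
import numpy as np
import pandas as pd
import joblib
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer

# toy data with missing values in a numerical feature
X = pd.DataFrame({'age': [20, np.nan, 35, 40], 'income': [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([1.0, 2.0, 3.0, 4.0])

# one single object holds the feature engineering step and the model
pipe = Pipeline([
    ('imputer', MeanMedianImputer(imputation_method='median', variables=['age'])),
    ('lasso', Lasso(alpha=0.01, random_state=0)),
])
pipe.fit(X, y)

# save and deploy the entire pipeline as one .pkl
joblib.dump(pipe, 'pipe.pkl')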
Would you like to know more about what is unique about Feature-engine?
This article provides a nice summary:
• Feature-engine: A new open source Python package for feature engineering.
CHAPTER
ONE
INSTALLATION
Feature-engine is a Python 3 package and works well with 3.6 or later. Earlier versions have not been tested. The
simplest way to install Feature-engine is from PyPI with pip:
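pip install feature-engine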
Feature-engine is an active project and routinely publishes new releases. To upgrade Feature-engine to the latest version,
use pip like this:
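pip install -U feature-engine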
If you’re using Anaconda, you can install the Anaconda Feature-engine package:
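conda install -c conda-forge feature_engine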
CHAPTER
TWO
• Website.
• Feature Engineering for Machine Learning, Online Course.
• Feature Selection for Machine Learning, Online Course.
• Python Feature Engineering Cookbook.
• Feature-engine: A new open-source Python package for feature engineering.
• Practical Code Implementations of Feature Engineering for Machine Learning with Python.
In Spanish:
• Ingeniería de variables para machine learning, Online Course.
• Ingeniería de variables, MachinLenin, online talk.
More resources in the Learning Resources sections on the navigation panel on the left.
CHAPTER
THREE
FEATURE-ENGINE’S TRANSFORMERS
CHAPTER
FOUR
GETTING HELP
Can’t get something to work? Here are places where you can find help.
1. The docs (you’re here!).
2. Stack Overflow. If you ask a question, please mention “feature_engine” in it.
3. If you are enrolled in the Feature Engineering for Machine Learning course on Udemy, post a question in a
relevant section.
4. If you are enrolled in the Feature Selection for Machine Learning course on Udemy, post a question in a relevant
section.
5. Join our Gitter community. You can ask questions here as well.
6. Ask a question in the repo by filing an issue.
CHAPTER
FIVE
Check if there’s already an open issue on the topic. If not, open a new issue with your bug report, suggestion or new
feature request.
CHAPTER
SIX
CONTRIBUTING
CHAPTER
SEVEN
OPEN SOURCE
7.1 Quick Start
If you’re new to Feature-engine, this guide will get you started. Feature-engine transformers have the methods fit()
and transform() to learn parameters from the data and then modify the data. They work just like any Scikit-learn
transformer.
7.1.1 Installation
Feature-engine is a Python 3 package and works well with 3.6 or later. Earlier versions have not been tested. The
simplest way to install Feature-engine is from PyPI with pip:
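pip install feature-engine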
Note that Feature-engine is an active project and routinely publishes new releases. In order to upgrade Feature-engine
to the latest version, use pip as follows.
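pip install -U feature-engine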
If you’re using Anaconda, you can install the Anaconda Feature-engine package:
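conda install -c conda-forge feature_engine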
Once installed, you should be able to import Feature-engine without an error, both in Python and in Jupyter notebooks.
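For example:
import feature_engine
feature_engine.__version__  # prints the installed version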
This is an example of how to use Feature-engine’s transformers to perform missing data imputation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets (representative steps; the original snippet was truncated)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

# set up the imputer, learn the median from the train set and impute
median_imputer = MeanMedianImputer(imputation_method='median', variables=['LotFrontage'])
median_imputer.fit(X_train)
train_t = median_imputer.transform(X_train)

# compare the variable distribution before and after the imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Feature-engine’s transformers can be assembled within a Scikit-learn pipeline. This way, we can store our feature
engineering pipeline in one object and save it in one pickle (.pkl). Here is an example of how to do it:
from math import sqrt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from feature_engine.imputation import MeanMedianImputer

# load dataset
data = pd.read_csv('houseprice.csv')
# continuous variables (discrete: numerical variables with few distinct values)
discrete = [var for var in data.columns if data[var].dtype != 'O' and data[var].nunique() < 20]
numerical = [var for var in data.columns if data[var].dtype != 'O']
numerical = [
    var for var in numerical if var not in discrete
    and var not in ['Id', 'SalePrice']
]
# a representative pipeline; the original example included further
# Feature-engine imputation and encoding steps that were lost in extraction
price_pipe = Pipeline([
    # impute numerical variables with the median
    ('imputer', MeanMedianImputer(imputation_method='median', variables=numerical)),
    # scale features
    ('scaler', MinMaxScaler()),
    # Lasso
    ('lasso', Lasso(random_state=2909, alpha=0.005))
])
# train on the log-transformed target; for this sketch we use only the numerical columns
price_pipe.fit(X_train[numerical], np.log(y_train))
# predict
pred_train = price_pipe.predict(X_train[numerical])
pred_test = price_pipe.predict(X_test[numerical])
# Evaluate
print('Lasso Linear Model train mse: {}'.format(
mean_squared_error(y_train, np.exp(pred_train))))
print('Lasso Linear Model train rmse: {}'.format(
sqrt(mean_squared_error(y_train, np.exp(pred_train)))))
print()
print('Lasso Linear Model test mse: {}'.format(
mean_squared_error(y_test, np.exp(pred_test))))
print('Lasso Linear Model test rmse: {}'.format(
sqrt(mean_squared_error(y_test, np.exp(pred_test)))))
plt.scatter(y_test, np.exp(pred_test))
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
plt.show()
7.2 Installation
Feature-engine is a Python 3 package and works well with 3.6 or later. Earlier versions have not been tested. The
simplest way to install Feature-engine is from PyPI with pip, Python’s preferred package installer:
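pip install feature-engine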
Feature-engine is an active project and routinely publishes new releases with new or updated transformers. To upgrade
Feature-engine to the latest version, use pip like this:
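pip install -U feature-engine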
If you’re using Anaconda, you can take advantage of the conda utility to install the Anaconda Feature-engine package:
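conda install -c conda-forge feature_engine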
7.3 Getting Help
Can’t get something to work? Here are places where you can find help.
1. The docs (you’re here!).
2. Stack Overflow. If you ask a question, please tag it with “feature-engine”.
3. If you are enrolled in the Feature Engineering for Machine Learning course on Udemy, post a question in a
relevant section.
4. If you are enrolled in the Feature Selection for Machine Learning course on Udemy, post a question in a relevant
section.
5. Join our Gitter community. You can ask questions here as well.
6. Ask a question in the repo by filing an issue.
7.4 Datasets
The user guide and examples included in Feature-engine’s documentation are based on these 3 datasets:
Titanic dataset
We use the dataset available in openML which can be downloaded from here.
Ames House Prices dataset
We use the data set created by Professor Dean De Cock: Dean De Cock (2011), “Ames, Iowa: Alternative to the Boston
Housing Data as an End of Semester Regression Project”, Journal of Statistics Education, Vol. 19, No. 3.
The examples are based on a copy of the dataset available on Kaggle.
The original data and documentation can be found here:
• Documentation
• Data
Credit Approval dataset
We use the Credit Approval dataset from the UCI Machine Learning Repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of
Information and Computer Science.
To download the dataset visit this website and click on “crx.data” to download the data set.
To prepare the data for the examples:
import random
import pandas as pd
import numpy as np
# load data
data = pd.read_csv('crx.data', header=None)
# replace ? by np.nan
data = data.replace('?', np.nan)
Feature-engine’s missing data imputers replace missing data with parameters estimated from the data or with arbitrary
values pre-defined by the user.
7.5.1 MeanMedianImputer
API Reference
The MeanMedianImputer() works in two steps:
• It first learns the mean or median values of the selected variables (fit).
• It then replaces the missing data with the estimated mean / median (transform).
Parameters
imputation_method: str, default=median Desired method of imputation. Can take ‘mean’ or
‘median’.
variables: list, default=None The list of variables to be imputed. If None, the imputer will
select all variables of type numeric.
Attributes
Methods
fit(X, y=None)
Learn the mean or median values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: pandas series or None, default=None y is not needed in this imputation. You can pass
None or y.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError If there are no numerical variables in the df or the df is empty
transform(X)
Replace missing data with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe without missing
values in the selected variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe has a different number of features than the df used in fit()
Example
The MeanMedianImputer() replaces missing data with the mean or median of the variable. It works only with numerical
variables. You can pass a list of variables to impute, or the imputer will automatically select all numerical variables in
the train set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

# set up and fit the median imputer
# (representative setup; the original snippet was truncated)
median_imputer = MeanMedianImputer(imputation_method='median',
                                   variables=['LotFrontage', 'MasVnrArea'])
median_imputer.fit(X_train)
train_t = median_imputer.transform(X_train)

fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
7.5.2 ArbitraryNumberImputer
API Reference
transformer = ArbitraryNumberImputer(
    variables=['varA', 'varB'],
    arbitrary_number=99
)
Xt = transformer.fit_transform(X)
Alternatively, you can impute varA with 1 and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    imputer_dict={'varA': 1, 'varB': 99}
)
Xt = transformer.fit_transform(X)
Parameters
arbitrary_number: int or float, default=999 The number to be used to replace missing data.
variables: list, default=None The list of variables to be imputed. If None, the imputer will find
and select all numerical variables. This parameter is used only if imputer_dict is None.
imputer_dict: dict, default=None The dictionary of variables and the arbitrary numbers for
their imputation.
Attributes
See also:
feature_engine.imputation.EndTailImputer
Methods
fit(X, y=None)
This method does not learn any parameter. Checks dataframe and finds numerical variables, or checks that
the variables entered by user are numerical.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: None y is not needed in this imputation. You can pass None or y.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError If there are no numerical variables in the df or the df is empty
transform(X)
Replace missing data with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe without missing
values in the selected variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe has a different number of features than the df used in fit()
Example
The ArbitraryNumberImputer() replaces missing data with an arbitrary value determined by the user. It works only
with numerical variables. A list of variables can be indicated, or the imputer will automatically select all numerical
variables in the train set. A dictionary with variables and their arbitrary values can be indicated to use different arbitrary
values for variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import ArbitraryNumberImputer

# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

# impute two variables with the arbitrary number -999
# (representative setup; the original snippet was truncated)
arbitrary_imputer = ArbitraryNumberImputer(arbitrary_number=-999,
                                           variables=['LotFrontage', 'MasVnrArea'])
arbitrary_imputer.fit(X_train)
train_t = arbitrary_imputer.transform(X_train)

fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
7.5.3 EndTailImputer
API Reference
You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’ (we used
fold=3 in the examples above).
The imputer then replaces the missing data with the estimated values (transform).
Parameters
imputation_method: str, default=gaussian Method to be used to find the replacement values.
Can take ‘gaussian’, ‘iqr’ or ‘max’.
gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.
iqr: the imputer will use the IQR limits to find the values to replace missing data.
max: the imputer will use the maximum values to replace missing data. Note that if ‘max’
is passed, the parameter ‘tail’ is ignored.
tail: str, default=right Indicates if the values to replace missing data should be selected from
the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.
fold: int, default=3 Factor to multiply the std, the IQR or the Max values. Recommended values
are 2 or 3 for Gaussian, or 1.5 or 3 for IQR.
variables: list, default=None The list of variables to be imputed. If None, the imputer will find
and select all variables of type numeric.
Attributes
imputer_dict_: Dictionary with the values at the end of the distribution per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
fit(X, y=None)
Learn the values at the end of the variable distribution.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: pandas Series, default=None y is not needed in this imputation. You can pass None or
y.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError If there are no numerical variables in the df or the df is empty
transform(X)
Replace missing data with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe without missing
values in the selected variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe has a different number of features than the df used in fit()
Example
The EndTailImputer() replaces missing data with a value at the end of the distribution. The value can be determined
using the mean plus or minus a number of times the standard deviation, or using the inter-quartile range proximity rule.
The value can also be determined as a factor of the maximum value. See the API Reference above for more details.
You decide whether the missing data should be placed at the right or left tail of the variable distribution.
It works only with numerical variables. A list of variables can be indicated, or the imputer will automatically select all
numerical variables in the train set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import EndTailImputer

# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

# impute at the right tail, using the mean plus 3 standard deviations
# (representative setup; the original snippet was truncated)
imputer = EndTailImputer(imputation_method='gaussian', tail='right', fold=3,
                         variables=['LotFrontage', 'MasVnrArea'])
imputer.fit(X_train)
train_t = imputer.transform(X_train)

fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
7.5.4 CategoricalImputer
API Reference
class feature_engine.imputation.CategoricalImputer(imputation_method='missing',
fill_value='Missing', variables=None,
return_object=False, ignore_format=False)
The CategoricalImputer() replaces missing data in categorical variables by an arbitrary value or by the most
frequent category.
The CategoricalImputer() imputes by default only categorical variables (type ‘object’ or ‘categorical’).
You can pass a list of variables to impute, or alternatively, the imputer will find and impute all categorical
variables.
If you want to impute numerical variables with this transformer, there are 2 ways of doing it:
Option 1: Cast your numerical variables as object in the input dataframe, before passing it to the transformer.
Option 2: Set ignore_format=True. Note that if you do this and do not pass the list of variables to impute,
the imputer will automatically select and impute all variables in the dataframe.
Parameters
Attributes
imputer_dict_: Dictionary with most frequent category or arbitrary value per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
fit: Learn the most frequent category, or assign arbitrary value to variable.
transform: Impute missing data.
fit_transform: Fit to the data, then transform it.
fit(X, y=None)
Learn the most frequent category if the imputation method is set to frequent.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: pandas Series, default=None y is not needed in this imputation. You can pass None or
y.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError If there are no categorical variables in the df or the df is empty
transform(X)
Replace missing data with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe without missing
values in the selected variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe has a different number of features than the df used in fit()
Example
The CategoricalImputer() replaces missing data in categorical variables with an arbitrary value, like the string ‘Missing’,
or with the most frequent category.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import CategoricalImputer
# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)
# impute categorical variables with the string 'Missing' (representative setup)
imputer = CategoricalImputer(variables=['Alley', 'MasVnrType'])
test_t = imputer.fit(X_train).transform(X_test)
test_t['MasVnrType'].value_counts().plot.bar()
7.5.5 RandomSampleImputer
API Reference
Note that if the variables indicated in the random_state list are not numerical, the imputer will return an error. Note
also that the variables indicated as seed should not contain missing values.
This estimator stores a copy of the training set when the fit() method is called. Therefore, the object can become
quite heavy. Also, it may not be GDPR compliant if your training data set contains Personal Information. Please
check if this behaviour is allowed within your organisation.
Parameters
random_state: int, str or list, default=None The random_state can take an integer to set the
seed when extracting the random samples. Alternatively, it can take a variable name or a list
of variables, whose values will be used to determine the seed, observation per observation.
seed: str, default=’general’ Indicates whether the seed should be set for each observation with
missing values, or if one seed should be used to impute all observations in one go.
general: one seed will be used to impute the entire dataframe. This is equivalent to setting
the seed in pandas.sample(random_state).
observation: the seed will be set for each observation using the values of the variables
indicated in the random_state for that particular observation.
seeding_method: str, default=’add’ If more than one variable is indicated to seed the ran-
dom sampling per observation, you can choose to combine those values by addition or by
multiplication. Can take the values ‘add’ or ‘multiply’.
variables: list, default=None The list of variables to be imputed. If None, the imputer will
select all variables in the train set.
Attributes
X_: Copy of the training dataframe from which to extract the random samples.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
fit(X, y=None)
Makes a copy of the train set. Only stores a copy of the variables to impute. This copy is then used to
randomly extract the values to fill the missing data during transform.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset. Only a
copy of the indicated variables will be stored in the transformer.
y: None y is not needed in this imputation. You can pass None or y.
Returns
self
Raises
TypeError If the input is not a Pandas DataFrame
transform(X)
Replace missing data with random sample values extracted from the train set during fit.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe without missing
values in the selected variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
Example
The RandomSampleImputer() replaces missing data with a random sample extracted from the variable. It works with
both numerical and categorical variables. A list of variables can be indicated, or the imputer will automatically select
all variables in the train set.
A seed can be set to a pre-defined number and all observations will be replaced in batch. Alternatively, a seed can be
set using the values of 1 or more numerical variables. In this case, the observations will be imputed individually, one
at a time, using the values of the variables as a seed.
For example, if the observation shows variables color: np.nan, height: 152, weight: 52, and we set the imputer as:
RandomSampleImputer(random_state=['height', 'weight'],
                    seed='observation',
                    seeding_method='add')
then the value for color will be extracted using the seed 152 + 52, which is equivalent to:
observation.sample(1, random_state=int(152+52))
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import RandomSampleImputer

# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

# impute with random samples, using one fixed seed
# (representative setup; the original snippet was truncated)
imputer = RandomSampleImputer(random_state=29, variables=['LotFrontage', 'MasVnrArea'])
imputer.fit(X_train)
train_t = imputer.transform(X_train)

fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
7.5.6 AddMissingIndicator
API Reference
Attributes
variables_: List of variables for which the missing indicators will be created.
n_features_in_: The number of features in the train set used in fit.
Methods
fit: Learn the variables for which the missing indicators will be created
transform: Add the missing indicators.
fit_transform: Fit to the data, then transform it.
fit(X, y=None)
Learn the variables for which the missing indicators will be created.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: pandas Series, default=None y is not needed in this imputation. You can pass None or
y.
Returns
self.variables_ [list] The list of variables for which missing indicators will be added.
Raises
TypeError If the input is not a Pandas DataFrame
transform(X)
Add the binary missing indicators.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The dataframe to be transformed.
Returns
X_transformed [pandas dataframe of shape = [n_samples, n_features]] The dataframe con-
taining the additional binary variables. Binary variables are named with the original vari-
able name plus ‘_na’.
Return type: DataFrame
Example
The AddMissingIndicator() adds a binary variable that indicates if an observation is missing (missing indicator). It adds
missing indicators to both categorical and numerical variables. A list of variables for which to add a missing indicator
can be passed, or the imputer will automatically select all variables.
The imputer has the option to add missing indicators to all variables, or only to those that show
missing data in the train set, by setting the parameter missing_only=True.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import AddMissingIndicator
# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)
# add indicators only for variables with NA in the train set (representative setup)
imputer = AddMissingIndicator(missing_only=True)
train_t = imputer.fit(X_train).transform(X_train)
7.5.7 DropMissingData
API Reference
Attributes
variables_: List of variables for which the rows with NA will be deleted.
n_features_in_: The number of features in the train set used in fit.
Methods
fit: Learn the variables for which the rows with NA will be deleted
transform: Remove observations with NA
fit_transform: Fit to the data, then transform it.
return_na_data: Returns the dataframe with the rows that contain NA.
fit(X, y=None)
Learn the variables for which the rows with NA will be deleted.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: pandas Series, default=None y is not needed in this transformer. You can pass None or y.
Returns
self
Raises
TypeError If the input is not a Pandas DataFrame
transform(X)
Remove rows with missing values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The dataframe to be trans-
formed.
Returns
X_transformed: pandas dataframe The complete case dataframe for the selected vari-
ables, of shape [n_samples - rows_with_na, n_features]
Return type: DataFrame
Example
DropMissingData() deletes rows with missing values. It works with numerical and categorical variables. You can pass
a list of variables, or the transformer will select all variables. The transformer has the option to
learn the variables with missing data in the train set, and then remove observations with NA only in those variables.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import DropMissingData
# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)
# remove rows with NA in the indicated variables
# (representative setup; the original snippet was truncated)
missingdata_imputer = DropMissingData(variables=['LotFrontage', 'MasVnrArea'])
missingdata_imputer.fit(X_train)
train_t = missingdata_imputer.transform(X_train)
print(X_train.shape)
(1022, 79)
print(train_t.shape)
(829, 79)
7.6.1 OneHotEncoder
API Reference
The encoder first finds the categories to be encoded for each variable (fit). The encoder then creates one dummy
variable per category for each variable (transform).
Note
New categories in the data to transform, that is, those that did not appear in the training set, will be ignored (no
binary variable will be created for them). This means that observations with categories not present in the train
set, will be encoded as 0 in all the binary variables.
Also Note
The original categorical variables are removed from the returned dataset when we apply the transform() method.
In their place, the binary variables are returned.
Parameters
top_categories: int, default=None If None, a dummy variable will be created for each category
of the variable. Alternatively, we can indicate in top_categories the number of most
frequent categories to encode. In this case, dummy variables will be created only for those
popular categories and the rest will be ignored, i.e., they will show the value 0 in all the
binary variables.
drop_last: boolean, default=False Only used if top_categories = None. It indicates
whether to create dummy variables for all the categories (k dummies), or if set to True,
it will ignore the last binary variable and return k-1 dummies.
drop_last_binary: boolean, default=False Whether to return 1 or 2 dummy variables for bi-
nary categorical variables. When a categorical variable has only 2 categories, then the second
dummy variable created by one hot encoding can be completely redundant. Setting this pa-
rameter to True, will ensure that for every binary variable in the dataset, only 1 dummy is
created.
variables: list, default=None The list of categorical variables that will be encoded. If None,
the encoder will find and transform all variables of type object or categorical by default. You
can also make the transformer accept numerical variables, see the next parameter.
ignore_format: bool, default=False Whether the format in which the categorical variables are
cast should be ignored. If false, the encoder will automatically select variables of type object
or categorical, or check that the variables entered by the user are of type object or categori-
cal. If True, the encoder will select all variables or accept all variables entered by the user,
including those cast as numeric.
Attributes
encoder_dict_: Dictionary with the categories for which dummy variables will be created.
variables_: The group of variables that will be transformed.
variables_binary_: A list with binary variables identified from the data. That is, variables with only 2 categories.
n_features_in_: The number of features in the train set used in fit.
Notes
If the variables are intended for linear models, it is recommended to encode into k-1 or top categories. If the
variables are intended for tree based algorithms, it is recommended to encode into k or top n categories. If
feature selection will be performed, then also encode into k or top n categories. Linear models evaluate all
features during fit, while tree based models and many feature selection algorithms evaluate variables or groups
of variables separately. Thus, if encoding into k-1, the last variable / category will not be examined.
References
One hot encoding of top categories was described in the following article:
[1]
Methods
fit(X, y=None)
Learns the unique categories per variable. If top_categories is indicated, it will learn the most popular
categories. Alternatively, it learns all unique categories per variable.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just selected variables.
y: pandas series, default=None Target. It is not needed in this encoder. You can pass y or
None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in the df or the df is empty
• If the variable(s) contain null values
inverse_transform(X)
inverse_transform is not implemented for this transformer.
transform(X)
Replaces the categorical variables by the binary variables.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to transform.
Returns
X: pandas dataframe. The transformed dataframe. The shape of the dataframe will be dif-
ferent from the original as it includes the dummy variables in place of the original
categorical ones.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values.
• If the dataframe has a different number of features than the df used in fit()
Example
The OneHotEncoder() replaces categorical variables by a set of binary variables, one per unique category. The encoder
has the option to create k or k-1 binary variables, where k is the number of unique categories.
The encoder can also create binary variables for the n most popular categories, n being determined by the user. This
means, if we encode the 6 most popular categories, we will only create binary variables for those categories, and the
rest will be dropped.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# encode the 2 most frequent categories of each variable
# (representative setup; the original snippet was truncated)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
encoder = OneHotEncoder(top_categories=2, variables=['pclass', 'cabin', 'embarked'])
encoder.fit(X_train)
encoder.encoder_dict_
7.6.2 CountFrequencyEncoder
API Reference
Attributes
encoder_dict_: Dictionary with the count or frequency per category, per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
See also:
feature_engine.encoding.RareLabelEncoder
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try
grouping infrequent categories using the RareLabelEncoder().
Methods
fit(X, y=None)
Learn the counts or frequencies which will be used to replace the categories.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset. Can be
the entire dataframe, not just the variables to be transformed.
y: pandas Series, default = None y is not needed in this encoder. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in the df or the df is empty
• If the variable(s) contain null values
inverse_transform(X)
Convert the encoded variable back to the original values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The transformed dataframe.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The un-transformed
dataframe, with the categorical variables containing the original values.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
transform(X)
Replace categories with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The dataset to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The dataframe containing the
categories replaced by numbers.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
Warning If NAN were introduced after encoding.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import CountFrequencyEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# replace categories with their frequency in the train set
# (representative setup; the original snippet was truncated)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
encoder = CountFrequencyEncoder(encoding_method='frequency',
                                variables=['cabin', 'pclass', 'embarked'])
encoder.fit(X_train)
encoder.encoder_dict_
7.6.3 OrdinalEncoder
API Reference
The OrdinalEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list
of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or
‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is iden-
tical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the
categories to the mapped numbers (transform).
Parameters
encoding_method: str, default=’ordered’ Desired method of encoding.
‘ordered’: the categories are numbered in ascending order according to the target mean value
per category.
‘arbitrary’ : categories are numbered arbitrarily.
variables: list, default=None The list of categorical variables that will be encoded. If None,
the encoder will find and transform all variables of type object or categorical by default. You
can also make the transformer accept numerical variables, see the next parameter.
ignore_format: bool, default=False Whether the format in which the categorical variables are
cast should be ignored. If false, the encoder will automatically select variables of type object
or categorical, or check that the variables entered by the user are of type object or categori-
cal. If True, the encoder will select all variables or accept all variables entered by the user,
including those cast as numeric.
Attributes
encoder_dict_: Dictionary with the ordinal number per category, per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
See also:
feature_engine.encoding.RareLabelEncoder
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try
grouping infrequent categories using the RareLabelEncoder().
References
Encoding into integers ordered following target mean was discussed in the following talk at PyData London
2017:
[1]
Methods
fit(X, y=None)
Learn the numbers to be used to replace the categories in each variable.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to be encoded.
y: pandas series, default=None The Target. Can be None if encoding_method = ‘arbitrary’.
Otherwise, y needs to be passed when fitting the transformer.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in the df or the df is empty
• If the variable(s) contain null values
inverse_transform(X)
Convert the encoded variable back to the original values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The transformed dataframe.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The un-transformed
dataframe, with the categorical variables containing the original values.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
transform(X)
Replace categories with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The dataset to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The dataframe containing the
categories replaced by numbers.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
Warning If NAN were introduced after encoding.
Example
The OrdinalEncoder() replaces the categories by digits, starting from 0 to k-1, where k is the number of different
categories. If you select “arbitrary”, then the encoder will assign numbers as the labels appear in the variable (first
come first served). If you select “ordered”, the encoder will assign numbers following the mean of the target value for
that label. So labels for which the mean of the target is smallest will get the number 0, and those where the mean of the
target is highest will get the number k-1. This way, we create a monotonic relationship between the encoded variable
and the target.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OrdinalEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# encode following the target mean (representative setup; the original snippet was truncated)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
encoder = OrdinalEncoder(encoding_method='ordered', variables=['pclass', 'cabin', 'embarked'])
encoder.fit(X_train, y_train)
encoder.encoder_dict_
7.6.4 MeanEncoder
API Reference
Attributes
encoder_dict_: Dictionary with the target mean value per category per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
See also:
feature_engine.encoding.RareLabelEncoder
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try
grouping infrequent categories using the RareLabelEncoder().
References
[1]
Methods
fit: Learn the target mean value per category, per variable.
transform: Encode the categories to numbers.
fit_transform: Fit to the data, then transform it.
inverse_transform: Encode the numbers into the original categories.
fit(X, y)
Learn the mean value of the target for each category of the variable.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to be encoded.
y: pandas series The target.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in the df or the df is empty
• If the variable(s) contain null values
inverse_transform(X)
Convert the encoded variable back to the original values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The transformed dataframe.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The un-transformed
dataframe, with the categorical variables containing the original values.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
transform(X)
Replace categories with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The dataset to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The dataframe containing the
categories replaced by numbers.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
Warning If NAN were introduced after encoding.
Example
The MeanEncoder() replaces categories with the mean of the target per category. For example, if we are trying to
predict default rate, and our data has the variable city, with the categories London, Manchester and Bristol, and the default
rate per city is 0.1, 0.5, and 0.3, respectively, the encoder will replace London with 0.1, Manchester with 0.5 and Bristol
with 0.3.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import MeanEncoder
# load_titanic() as defined in the OneHotEncoder example above
data = load_titanic()
# representative setup; the original snippet was truncated
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
encoder = MeanEncoder(variables=['cabin', 'pclass', 'embarked'])
encoder.fit(X_train, y_train)
encoder.encoder_dict_
7.6.5 WoEEncoder
API Reference
ignore_format: bool, default=False Whether the format in which the categorical variables are
cast should be ignored. If false, the encoder will automatically select variables of type object
or categorical, or check that the variables entered by the user are of type object or categori-
cal. If True, the encoder will select all variables or accept all variables entered by the user,
including those cast as numeric.
Attributes
See also:
feature_engine.encoding.RareLabelEncoder
feature_engine.discretisation
Notes
Methods
fit(X, y)
Learn the WoE.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the categorical variables.
y: pandas series. Target, must be binary.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in df or df is empty
• If variable(s) contain null values.
• If y is not binary with values 0 and 1.
• If p(0) = 0 or p(1) = 0.
inverse_transform(X)
Convert the encoded variable back to the original values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The transformed dataframe.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The un-transformed
dataframe, with the categorical variables containing the original values.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
transform(X)
Replace categories with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The dataset to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The dataframe containing the
categories replaced by numbers.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
Warning If NAN were introduced after encoding.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import WoEEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# set up and fit the encoder on the train set
# (representative setup; the original snippet was truncated)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
woe_encoder = WoEEncoder(variables=['cabin', 'pclass', 'embarked'])
woe_encoder.fit(X_train, y_train)

# transform
train_t = woe_encoder.transform(X_train)
test_t = woe_encoder.transform(X_test)
woe_encoder.encoder_dict_
7.6.6 PRatioEncoder
API Reference
‘ratio’: p(1) / p(0)
‘log_ratio’: log( p(1) / p(0) )
Note
This categorical encoding is exclusive for binary classification.
For example, in the variable colour, if p(1) for the category blue is 0.8 and p(0) is 0.2, blue will be
replaced by 0.8 / 0.2 = 4 if ratio is selected, or log(0.8/0.2) = 1.386 if log_ratio is selected.
Note: division by 0 and log(0) are not defined. Thus, if p(0) = 0 when using ratio, or if
either p(0) = 0 or p(1) = 0 when using log_ratio, for any of the variables, the encoder will return an error.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list
of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or
‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is iden-
tical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the
categories into the mapped numbers (transform).
Parameters
encoding_method: str, default=’ratio’ Desired method of encoding.
‘ratio’ : probability ratio
‘log_ratio’ : log probability ratio
variables: list, default=None The list of categorical variables that will be encoded. If None,
the encoder will find and transform all variables of type object or categorical by default. You
can also make the transformer accept numerical variables, see the next parameter.
ignore_format: bool, default=False Whether the format in which the categorical variables are
cast should be ignored. If false, the encoder will automatically select variables of type object
or categorical, or check that the variables entered by the user are of type object or categori-
cal. If True, the encoder will select all variables or accept all variables entered by the user,
including those cast as numeric.
Attributes
encoder_dict_: Dictionary with the probability ratio per category per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
See also:
feature_engine.encoding.RareLabelEncoder
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try
grouping infrequent categories using the RareLabelEncoder().
Methods
fit(X, y)
Learn the numbers that should be used to replace the categories in each variable. That is, the ratio of
probabilities.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the categorical variables.
y: pandas series. Target, must be binary.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in df or df is empty
• If variable(s) contain null values.
• If y is not binary with values 0 and 1.
• If p(0) = 0 for ‘ratio’, or if any of p(0) or p(1) are 0 for ‘log_ratio’.
inverse_transform(X)
Convert the encoded variable back to the original values.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The transformed dataframe.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The un-transformed
dataframe, with the categorical variables containing the original values.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
transform(X)
Replace categories with the learned parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The dataset to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features]. The dataframe containing the
categories replaced by numbers.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
Warning If NAN were introduced after encoding.
Example
The PRatioEncoder() replaces the labels by the ratio of probabilities. It only works for binary classification.
The target probability ratio is given by: p(1) / p(0)
The log of the target probability ratio is: np.log( p(1) / p(0) )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import PRatioEncoder
# load_titanic() as defined in the OneHotEncoder example above
data = load_titanic()
# representative setup; the original snippet was truncated
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
pratio_encoder = PRatioEncoder(encoding_method='ratio', variables=['cabin', 'pclass', 'embarked'])
pratio_encoder.fit(X_train, y_train)
# transform
train_t = pratio_encoder.transform(X_train)
test_t = pratio_encoder.transform(X_test)
pratio_encoder.encoder_dict_
7.6.7 DecisionTreeEncoder
API Reference
ignore_format: bool, default=False Whether the format in which the categorical variables are
cast should be ignored. If false, the encoder will automatically select variables of type object
or categorical, or check that the variables entered by the user are of type object or categori-
cal. If True, the encoder will select all variables or accept all variables entered by the user,
including those cast as numeric.
Attributes
encoder_: sklearn Pipeline containing the ordinal encoder and the decision tree.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
See also:
sklearn.ensemble.DecisionTreeRegressor
sklearn.ensemble.DecisionTreeClassifier
feature_engine.discretisation.DecisionTreeDiscretiser
feature_engine.encoding.RareLabelEncoder
feature_engine.encoding.OrdinalEncoder
Notes
The authors originally designed this method to work with numerical variables. We can replace numerical vari-
ables with the predictions of a decision tree utilising the DecisionTreeDiscretiser().
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try
grouping infrequent categories using the RareLabelEncoder().
References
[1]
Methods
fit(X, y=None)
Fit a decision tree per variable.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The training input samples. Can
be the entire dataframe, not just the categorical variables.
y [pandas series.] The target variable. Required to train the decision tree and for ordered
ordinal encoding.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in the df or the df is empty
• If the variable(s) contain null values
inverse_transform(X)
inverse_transform is not implemented for this transformer.
transform(X)
Replace categorical variable by the predictions of the decision tree.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The input samples.
Returns
X [pandas dataframe of shape = [n_samples, n_features].] Dataframe with variables encoded
with decision tree predictions.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the dataframe is not of the same size as that used in fit()
Warning If NAN were introduced after encoding.
Example
The DecisionTreeEncoder() replaces categories in the variable with the predictions of a decision tree. The transformer
first encodes the categorical variable into a numerical variable using ordinal encoding. You have the option to have the
integers assigned to the categories as they appear in the variable, or ordered by the mean value of the target per category.
After this, the transformer fits a decision tree on this numerical variable to predict the target variable. Finally, the
original categorical variable is replaced by the predictions of the decision tree.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import DecisionTreeEncoder
# Load dataset: load_titanic() as defined in the OneHotEncoder example above
data = load_titanic()
# representative setup; the original snippet was truncated
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'], test_size=0.3, random_state=0)
encoder = DecisionTreeEncoder(regression=False, variables=['cabin', 'pclass', 'embarked'])
encoder.fit(X_train, y_train)
train_t = encoder.transform(X_train)
7.6.8 RareLabelEncoder
API Reference
Attributes
encoder_dict_: Dictionary with the frequent categories, i.e., those that will be kept, per variable.
variables_: The variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
fit(X, y=None)
Learn the frequent categories for each variable.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just selected variables
y: None y is not required. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame.
• If user enters non-categorical variables (unless ignore_format is True)
ValueError
• If there are no categorical variables in the df or the df is empty
• If the variable(s) contain null values
Warning If the number of categories in any one variable is less than that indicated in
n_categories.
inverse_transform(X)
inverse_transform is not implemented for this transformer yet.
transform(X)
Group infrequent categories. Replace infrequent categories by the string ‘Rare’ or any other name provided
by the user.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input samples.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe where rare cate-
gories have been grouped.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If user enters non-categorical variables (unless ignore_format is True)
Example
The RareLabelEncoder() groups infrequent categories together into one new category called ‘Rare’ or a different
string indicated by the user. We need to specify the minimum percentage of observations a category should show to be
preserved and the minimum number of unique categories a variable should have to be re-grouped.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import RareLabelEncoder

def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# group categories present in less than 5% of observations
# (representative setup; the original snippet was truncated)
encoder = RareLabelEncoder(tol=0.05, n_categories=4,
                           variables=['cabin', 'pclass', 'embarked'],
                           replace_with='Rare')
encoder.fit(data)
encoder.encoder_dict_
You can also specify the maximum number of categories that can be considered frequent using the max_n_categories
parameter. For example, for a variable var_A with these category counts:
A 10
B 10
C 2
D 1
Name: var_A, dtype: int64
grouping the categories below the minimum frequency returns:
A 10
B 10
C 2
Rare 1
Name: var_A, dtype: int64
and setting max_n_categories=2 returns:
A 10
B 10
Rare 3
Name: var_A, dtype: int64
Feature-engine’s variable transformers transform numerical variables with various mathematical transformations.
7.7.1 LogTransformer
API Reference
Attributes
Methods
fit(X, y=None)
This transformer does not learn parameters.
Selects the numerical variables and determines whether the logarithm can be applied on the selected vari-
ables, i.e., it checks that the variables are positive.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features]. The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
• If some variables contain zero or negative values
inverse_transform(X)
Convert the data back to the original representation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
• If some variables contain zero or negative values
transform(X)
Transform the variables with the logarithm.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has a different number of features than the df used in fit()
• If some variables contain zero or negative values
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.transformation import LogTransformer

# Load dataset
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

# representative setup; the original snippet was truncated
tf = LogTransformer(variables=['LotArea', 'GrLivArea'])
train_t = tf.fit(X_train).transform(X_train)

# un-transformed variable
X_train['LotArea'].hist(bins=50)
# transformed variable
train_t['LotArea'].hist(bins=50)
7.7.2 LogCpTransformer
API Reference
Attributes
Methods
fit(X, y=None)
Learn the constant C to add to the variable before the logarithm transformation if C=”auto”.
Select the numerical variables or check that the variables entered by the user are numerical. Then check
that the selected variables are positive after addition of C.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features]. The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
• If some variables contain zero or negative values after adding C
inverse_transform(X)
Convert the data back to the original representation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: Pandas dataframe The dataframe with the transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
transform(X)
Transform the variables with the logarithm of x plus a constant C.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
• If some variables contain zero or negative values after adding C
Example
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from feature_engine.transformation import LogCpTransformer

# Load dataset
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)

# split and set up the transformer (setup assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
tf = LogCpTransformer(variables=[7, 12], C='auto')
train_t = tf.fit_transform(X_train)

# learned constant C
tf.C_

# un-transformed variable
X_train[12].hist()

# transformed variable
train_t[12].hist()
7.7.3 ReciprocalTransformer
API Reference
class feature_engine.transformation.ReciprocalTransformer(variables=None)
The ReciprocalTransformer() applies the reciprocal transformation 1 / x to numerical variables.
The ReciprocalTransformer() only works with numerical variables with non-zero values. If a variable contains
the value 0, the transformer will raise an error.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and
transform all numerical variables.
Parameters
variables: list, default=None The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
Attributes
Methods
fit(X, y=None)
This transformer does not learn parameters.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features]. The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
• If some variables contain zero as values
inverse_transform(X)
Convert the data back to the original representation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
• If some variables contain zero values
transform(X)
Apply the reciprocal 1 / x transformation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
• If some variables contain zero values
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.transformation import ReciprocalTransformer

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# set up and fit the transformer (variable selection assumed)
tf = ReciprocalTransformer(variables=['LotArea', 'GrLivArea'])
train_t = tf.fit_transform(X_train)

# un-transformed variable
X_train['LotArea'].hist(bins=50)

# transformed variable
train_t['LotArea'].hist(bins=50)
7.7.4 PowerTransformer
API Reference
Attributes
Methods
fit(X, y=None)
This transformer does not learn parameters.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
inverse_transform(X)
Convert the data back to the original representation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas Dataframe The dataframe with the power transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
transform(X)
Apply the power transformation to the variables.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas Dataframe The dataframe with the power transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.transformation import PowerTransformer

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# set up and fit the transformer (exponent and variables assumed for illustration)
tf = PowerTransformer(variables=['LotArea', 'GrLivArea'], exp=0.5)
train_t = tf.fit_transform(X_train)

# un-transformed variable
X_train['LotArea'].hist(bins=50)

# transformed variable
train_t['LotArea'].hist(bins=50)
7.7.5 BoxCoxTransformer
API Reference
class feature_engine.transformation.BoxCoxTransformer(variables=None)
The BoxCoxTransformer() applies the BoxCox transformation to numerical variables.
The Box-Cox transformation is defined as:
• T(Y) = (Y**λ - 1) / λ, if λ != 0
• T(Y) = log(Y), if λ = 0
where Y is the response variable and λ is the transformation parameter. λ varies, typically from -5 to 5. In the
transformation, all values of λ in that range are considered and the optimal value for a given variable is selected.
The BoxCox transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/doc/
scipy/reference/generated/scipy.stats.boxcox.html
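For reference, the underlying SciPy function can be called directly; a minimal sketch:

import numpy as np
from scipy.stats import boxcox

x = np.random.lognormal(size=1000)   # strictly positive values
x_t, lmbda = boxcox(x)               # transformed data and the optimal lambda
print(lmbda)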
The BoxCoxTransformer() works only with numerical variables with strictly positive values (>0).
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and
transform all numerical variables.
Parameters
variables: list, default=None The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
Attributes
References
[1]
Methods
fit(X, y=None)
Learn the optimal lambda for the BoxCox transformation.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
• If some variables contain zero values
transform(X)
Apply the BoxCox transformation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
• If some variables contain negative values
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.transformation import BoxCoxTransformer

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# set up and fit the transformer (variable selection assumed)
tf = BoxCoxTransformer(variables=['LotArea', 'GrLivArea'])
train_t = tf.fit_transform(X_train)

# un-transformed variable
X_train['LotArea'].hist(bins=50)

# transformed variable
train_t['GrLivArea'].hist(bins=50)
7.7.6 YeoJohnsonTransformer
API Reference
class feature_engine.transformation.YeoJohnsonTransformer(variables=None)
The YeoJohnsonTransformer() applies the Yeo-Johnson transformation to the numerical variables.
The Yeo-Johnson transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/
doc/scipy/reference/generated/scipy.stats.yeojohnson.html
The YeoJohnsonTransformer() works only with numerical variables.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and
transform all numerical variables.
Parameters
variables: list, default=None The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
Attributes
lambda_dict_ Dictionary containing the best lambda for the Yeo-Johnson per variable.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
References
[1]
Methods
fit(X, y=None)
Learn the optimal lambda for the Yeo-Johnson transformation.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
transform(X)
Apply the Yeo-Johnson transformation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe The dataframe with the transformed variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the df has different number of features than the df used in fit()
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.transformation import YeoJohnsonTransformer

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# set up and fit the transformer (variable selection assumed)
tf = YeoJohnsonTransformer(variables=['LotArea', 'GrLivArea'])
train_t = tf.fit_transform(X_train)

# un-transformed variable
X_train['LotArea'].hist(bins=50)

# transformed variable
train_t['LotArea'].hist(bins=50)
Feature-engine’s variable discretisation transformers transform continuous numerical variables into discrete variables.
The discrete variables will contain contiguous intervals in the case of the equal frequency and equal width transformers.
The Decision Tree discretiser will return a discrete variable, in the sense that the new feature takes a finite number of
values.
7.8.1 EqualFrequencyDiscretiser
API Reference
Attributes
See also:
pandas.qcut https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
References
[1], [2]
Methods
fit(X, y=None)
Learn the limits of the equal frequency intervals.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset. Can be
the entire dataframe, not just the variables to be transformed.
y: None y is not needed in this discretiser. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
transform(X)
Sort the variable values into the intervals.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The transformed data with the
discrete variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the dataframe is not of the same size as the one used in fit()
Example
The EqualFrequencyDiscretiser() sorts the variable values into contiguous intervals of equal proportion of observations.
The limits of the intervals are calculated according to the quantiles. The number of intervals or quantiles should be
determined by the user. The transformer can return the variable as numeric or object (default = numeric).
The EqualFrequencyDiscretiser() works only with numerical variables. A list of variables can be indicated, or the
discretiser will automatically select all numerical variables in the train set.
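The interval limits correspond to quantiles, as computed for instance by pandas.qcut (see the reference above); a minimal sketch with synthetic data:

import numpy as np
import pandas as pd

s = pd.Series(np.random.lognormal(size=1000))

# 10 intervals holding (roughly) the same number of observations each
binned, limits = pd.qcut(s, q=10, retbins=True, labels=False)
print(binned.value_counts())   # ~100 observations per bin
print(limits)                  # the 11 quantile-based boundaries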
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import EqualFrequencyDiscretiser

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# discretise into 10 quantile-based intervals
disc = EqualFrequencyDiscretiser(q=10, variables=['LotArea', 'GrLivArea'])
train_t = disc.fit_transform(X_train)

disc.binner_dict_
{'LotArea': [-inf,
5007.1,
7164.6,
8165.700000000001,
8882.0,
9536.0,
10200.0,
11046.300000000001,
12166.400000000001,
14373.9,
inf],
'GrLivArea': [-inf,
912.0,
1069.6000000000001,
1211.3000000000002,
1344.0,
1479.0,
1603.2000000000003,
...
7.8.2 EqualWidthDiscretiser
API Reference
The EqualWidthDiscretiser() first finds the boundaries for the intervals for each variable. Then, it transforms the
variables, that is, sorts the values into the intervals.
Parameters
variables: list, default=None The list of numerical variables to transform. If None, the discretiser will automatically select all numerical type variables.
bins: int, default=10 Desired number of equal width intervals / bins.
return_object: bool, default=False Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default as False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you
wish to encode the returned bins, set return_object to True.
return_boundaries [bool, default=False] Whether the output should be the interval boundaries.
If True, it returns the interval boundaries. If False, it returns integers.
Attributes
See also:
pandas.cut https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
References
[1], [2]
Methods
fit(X, y=None)
Learn the boundaries of the equal width intervals / bins for each variable.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset. Can be
the entire dataframe, not just the variables to be transformed.
y: None y is not needed in this discretiser. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
transform(X)
Sort the variable values into the intervals.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The transformed data with the
discrete variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the dataframe is not of the same size as the one used in fit()
Example
The EqualWidthDiscretiser() sorts the variable values into contiguous intervals of equal size. The size of the interval
is calculated as:
( max(X) - min(X) ) / bins
where bins, which is the number of intervals, should be determined by the user. The transformer can return the variable
as numeric or object (default = numeric).
The EqualWidthDiscretiser() works only with numerical variables. A list of variables can be indicated, or the discretiser
will automatically select all numerical variables in the train set.
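The boundaries correspond to those computed by pandas.cut (see the reference above); a minimal sketch with synthetic data:

import numpy as np
import pandas as pd

s = pd.Series(np.random.lognormal(size=1000))

# 10 intervals of equal width: ( max(s) - min(s) ) / 10
binned, limits = pd.cut(s, bins=10, retbins=True, labels=False)
print(limits)                   # equally spaced boundaries
print(binned.value_counts())    # bin counts are generally unequal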
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import EqualWidthDiscretiser

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# discretise into 10 intervals of equal width
disc = EqualWidthDiscretiser(bins=10, variables=['LotArea', 'GrLivArea'])
train_t = disc.fit_transform(X_train)

disc.binner_dict_
{'LotArea': [-inf,
22694.5,
44089.0,
65483.5,
86878.0,
108272.5,
129667.0,
151061.5,
172456.0,
193850.5,
inf],
'GrLivArea': [-inf,
768.2,
1202.4,
1636.6,
2070.8,
2505.0,
2939.2,
3373.4,
3807.6,
4241.799999999999,
inf]}
# with equal width discretisation, each bin does not necessarily contain
# the same number of observations.
train_t.groupby('GrLivArea')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')
7.8.3 ArbitraryDiscretiser
API Reference
return_boundaries: bool, default=False Whether the output, that is the bin names / values,
should be the interval boundaries. If True, it returns the interval boundaries. If False, it
returns integers.
Attributes
See also:
pandas.cut https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
Methods
fit(X, y=None)
This transformer does not learn any parameter.
Check dataframe and variables. Checks that the user entered variables are in the train set and cast as
numerical.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset. Can be
the entire dataframe, not just the variables to be transformed.
y: None y is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
transform(X)
Sort the variable values into the intervals.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The dataframe to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The transformed data with the
discrete variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the dataframe is not of the same size as the one used in fit()
Example
The ArbitraryDiscretiser() sorts the variable values into contiguous intervals whose limits are arbitrarily defined by the user.
The user must provide a dictionary of {variable: list of interval limits} pairs when setting up the discretiser.
The ArbitraryDiscretiser() works only with numerical variables. The discretiser will check that the variables entered
by the user are present in the train set and cast as numerical.
First, let’s load a dataset and plot a histogram of a continuous variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from feature_engine.discretisation import ArbitraryDiscretiser
boston_dataset = load_boston()
data = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
data['LSTAT'].hist(bins=20)
plt.xlabel('LSTAT')
plt.ylabel('Number of observations')
plt.title('Histogram of LSTAT')
plt.show()
Now, let’s discretise the variable into arbitrarily determined intervals. We want the interval names as integers, so we
set return_boundaries to False.
user_dict = {'LSTAT': [0, 10, 20, 30, np.Inf]}  # interval limits (values assumed for illustration)

transformer = ArbitraryDiscretiser(
    binning_dict=user_dict, return_object=False, return_boundaries=False)
X = transformer.fit_transform(data)
X['LSTAT'].value_counts().plot.bar()
plt.xlabel('LSTAT - bins')
plt.ylabel('Number of observations')
plt.title('Discretised LSTAT')
plt.show()
Alternatively, we can return the interval limits in the discretised variable by setting return_boundaries to True.
transformer = ArbitraryDiscretiser(
binning_dict=user_dict, return_object=False, return_boundaries=True)
X = transformer.fit_transform(data)
X['LSTAT'].value_counts().plot.bar(rot=0)
plt.xlabel('LSTAT - bins')
plt.ylabel('Number of observations')
plt.title('Discretised LSTAT')
plt.show()
7.8.4 DecisionTreeDiscretiser
API Reference
Attributes
See also:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor
References
[1]
Methods
fit(X, y)
Fit the decision trees. One tree per variable to be transformed.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset. Can be
the entire dataframe, not just the variables to be transformed.
y: pandas series. Target variable. Required to train the decision tree.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any of the user provided variables are not numerical
ValueError
• If there are no numerical variables in the df or the df is empty
• If the variable(s) contain null values
transform(X)
Replaces the variable values with the predictions of the decision trees.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to transform.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The transformed data with the discrete variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values
• If the dataframe is not of the same size as the one used in fit()
Example
In the original article, each feature of the challenge dataset was recoded by training a decision tree of limited depth (2,
3 or 4) using that feature alone, and letting the tree predict the target. The probabilistic predictions of this decision tree
were used as an additional feature, that was now linearly (or at least monotonically) correlated with the target.
According to the authors, the addition of these new features had a significant impact on the performance of linear
models.
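Conceptually, the recoding can be reproduced with scikit-learn alone; a minimal sketch, assuming a single numeric feature and a regression target:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 500).reshape(-1, 1)          # a single feature
y = x.ravel() ** 2 + rng.normal(scale=5, size=500)  # target

# fit a shallow tree on the single feature, as in the article
tree = DecisionTreeRegressor(max_depth=2).fit(x, y)

# the tree's predictions become the recoded feature: they take at most
# 2**max_depth distinct values and are monotonically related to the target
x_disc = tree.predict(x)
print(np.unique(x_disc))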
In the following example, we recode 2 numerical variables using decision trees.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import DecisionTreeDiscretiser

# load dataset and split (setup assumed; the original snippet omitted it)
data = pd.read_csv('houseprice.csv')
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# set up the discretiser (parameter values assumed for illustration)
disc = DecisionTreeDiscretiser(cv=3, scoring='neg_mean_squared_error',
                               variables=['LotArea', 'GrLivArea'], regression=True)
train_t = disc.fit_transform(X_train, y_train)

disc.binner_dict_
Feature-engine’s outlier cappers cap maximum or minimum values of a variable at an arbitrary or derived value. The
OutlierTrimmer removes outliers from the dataset.
7.9.1 Winsorizer
API Reference
Attributes
right_tail_caps_: Dictionary with the maximum values at which variables will be capped.
left_tail_caps_ : Dictionary with the minimum values at which variables will be capped.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
fit(X, y=None)
Learn the values that should be used to replace outliers.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The training input samples.
y [pandas Series, default=None] y is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError If the input is not a Pandas DataFrame
transform(X)
Cap the variable values, that is, censor outliers.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe with the capped
variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe is not of same size as that used in fit()
Example
Censors variables at predefined minimum and maximum values. The minimum and maximum values can be calculated
in 1 of 3 different ways:

Gaussian limits:
    right tail: mean + 3 * std
    left tail: mean - 3 * std

IQR limits:
    right tail: 75th quantile + 3 * IQR
    left tail: 25th quantile - 3 * IQR
    where IQR is the inter-quartile range: 75th quantile - 25th quantile.

Percentiles or quantiles:
    right tail: 95th percentile
    left tail: 5th percentile

See the API Reference for more details.
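These limits can be computed directly; a minimal sketch with pandas, assuming a numeric Series s:

import numpy as np
import pandas as pd

s = pd.Series(np.random.lognormal(size=1000))

# Gaussian limits
right = s.mean() + 3 * s.std()
left = s.mean() - 3 * s.std()

# IQR limits
iqr = s.quantile(0.75) - s.quantile(0.25)
right = s.quantile(0.75) + 3 * iqr
left = s.quantile(0.25) - 3 * iqr

# percentile limits
right = s.quantile(0.95)
left = s.quantile(0.05)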
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.outliers import Winsorizer

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float')
    data['fare'].fillna(data['fare'].median(), inplace=True)
    data['age'] = data['age'].astype('float')
    data['age'].fillna(data['age'].median(), inplace=True)
    return data

data = load_titanic()

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# cap the right tail at mean + 3 * std, as implied by the outputs below
capper = Winsorizer(capping_method='gaussian', tail='right', fold=3,
                    variables=['age', 'fare'])
capper.fit(X_train)
train_t = capper.transform(X_train)

capper.right_tail_caps_
train_t[['fare', 'age']].max()
fare 174.781622
age 67.490484
dtype: float64
7.9.2 ArbitraryOutlierCapper
API Reference
class feature_engine.outliers.ArbitraryOutlierCapper(max_capping_dict=None,
min_capping_dict=None,
missing_values='raise')
The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value indicated
by the user.
You must provide the maximum or minimum values that will be used to cap each variable in a dictionary of the form {feature: capping value}.
Parameters
max_capping_dict: dictionary, default=None Dictionary containing the user specified capping values for the right tail of the distribution of each variable (maximum values).
min_capping_dict: dictionary, default=None Dictionary containing the user specified capping values for the left tail of the distribution of each variable (minimum values).
missing_values [string, default=’raise’] Indicates if missing values should be ignored or raised.
If missing_values='raise' the transformer will return an error if the training or the
datasets to transform contain missing values.
Attributes
right_tail_caps_: Dictionary with the maximum values at which variables will be capped.
left_tail_caps_: Dictionary with the minimum values at which variables will be capped.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
fit(X, y=None)
This transformer does not learn any parameter.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
y: pandas Series, default=None y is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError If the input is not a Pandas DataFrame
transform(X)
Cap the variable values, that is, censor outliers.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to be transformed.
Returns
X: pandas dataframe of shape = [n_samples, n_features] The dataframe with the capped
variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe is not of same size as that used in fit()
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.outliers import ArbitraryOutlierCapper

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float')
    data['fare'].fillna(data['fare'].median(), inplace=True)
    data['age'] = data['age'].astype('float')
    data['age'].fillna(data['age'].median(), inplace=True)
    return data

data = load_titanic()

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# cap fare at 200 and age at 50, as implied by the outputs below
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200},
                                min_capping_dict=None)
capper.fit(X_train)
train_t = capper.transform(X_train)

capper.right_tail_caps_
train_t[['fare', 'age']].max()
fare 200
age 50
dtype: float64
7.9.3 OutlierTrimmer
API Reference
Attributes
right_tail_caps_: Dictionary with the maximum values above which values will be removed.
left_tail_caps_ : Dictionary with the minimum values below which values will be removed.
variables_: The group of variables that will be transformed.
n_features_in_: The number of features in the train set used in fit.
Methods
transform(X)
Remove observations with outliers from the dataframe.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The data to be transformed.
Returns
X [pandas dataframe of shape = [n_samples, n_features]] The dataframe without outlier ob-
servations.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError If the dataframe is not of same size as that used in fit()
Example
Removes values beyond predefined minimum and maximum values from the data set. The minimum and maximum
values can be calculated in 1 of 3 different ways:

Gaussian limits:
    right tail: mean + 3 * std
    left tail: mean - 3 * std

IQR limits:
    right tail: 75th quantile + 3 * IQR
    left tail: 25th quantile - 3 * IQR
    where IQR is the inter-quartile range: 75th quantile - 25th quantile.

Percentiles or quantiles:
    right tail: 95th percentile
    left tail: 5th percentile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.outliers import OutlierTrimmer

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float')
    data['fare'].fillna(data['fare'].median(), inplace=True)
    data['age'] = data['age'].astype('float')
    data['age'].fillna(data['age'].median(), inplace=True)
    return data

data = load_titanic()

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# remove outliers using the IQR rule (setup assumed for illustration)
capper = OutlierTrimmer(capping_method='iqr', tail='both', fold=1.5,
                        variables=['fare', 'age'])
capper.fit(X_train)
train_t = capper.transform(X_train)

capper.right_tail_caps_
train_t[['fare', 'age']].max()
fare 65.0
age 53.0
dtype: float64
Feature-engine’s creation transformers combine features into new variables through various mathematical and other
methods.
7.10.1 MathematicalCombination
API Reference
class feature_engine.creation.MathematicalCombination(variables_to_combine,
math_operations=None,
new_variables_names=None,
missing_values='raise')
MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more
additional features as a result. That is, it sums, multiplies, takes the average, maximum, minimum or standard
deviation of a group of variables, and returns the result into new variables.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter,
number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:
transformer = MathematicalCombination(
variables_to_combine=[
'number_payments_first_quarter',
'number_payments_second_quarter',
'number_payments_third_quarter',
'number_payments_fourth_quarter'
],
math_operations=[
'sum',
'mean'
],
new_variables_names=[
'total_number_payments',
'mean_number_payments'
]
)
Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features total_number_payments and
mean_number_payments, plus the original set of variables.
Note: if some of the variables to combine have missing data and missing_values = 'ignore', the missing value
will be ignored in the computation. To be clear, if variables A, B and C have values 10, 20 and NA, and we
perform the sum, the result will be A + B = 30.
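This mirrors pandas' NaN-skipping aggregations; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10], 'B': [20], 'C': [np.nan]})
print(df[['A', 'B', 'C']].sum(axis=1))   # 30.0: the NaN is ignored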
Parameters
variables_to_combine: list The list of numerical variables to be combined.
math_operations: list, default=None The list of basic math operations to be used to create the
new features.
If None, all of [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’] will be performed over the
variables_to_combine. Alternatively, you can enter the list of operations to carry out.
Each operation should be a string and must be one of the elements in ['sum', 'prod',
'mean', 'std', 'max', 'min'].
Each operation will result in a new variable that will be added to the transformed dataset.
new_variables_names: list, default=None Names of the newly created variables. You can enter a name or a list of names for the newly created features (recommended). You must enter one name for each mathematical transformation indicated in the math_operations parameter. That is, if you want to perform mean and sum of features, you should enter 2 new variable names. If you perform only mean of features, enter 1 variable name. Alternatively, if you chose to perform all mathematical transformations, enter 6 new variable names.
The name of the variables indicated by the user should coincide with the order in which the
mathematical operations are initialised in the transformer. That is, if you set math_operations
= [‘mean’, ‘prod’], the first new variable name will be assigned to the mean of the variables
and the second variable name to the product of the variables.
If new_variables_names = None, the transformer will assign an arbitrary name to the newly created features, starting with the name of the mathematical operation, followed by the variables combined, separated by -.
missing_values: string, default='raise' Indicates if missing values should be ignored or raised. If 'raise', the transformer will return an error if the datasets to fit or transform contain missing values. If 'ignore', missing data will be ignored when performing the calculations.
Attributes
combination_dict_: Dictionary containing the mathematical operation to new variable name pairs.
math_operations_: List with the mathematical operations to be applied to the variables_to_combine.
n_features_in_: The number of features in the train set used in fit.
Notes
Although the transformer in essence allows us to combine any features with any of the allowed mathematical
operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical
examples within the financial sector are:
• Sum debt across financial products, i.e., credit cards, to obtain the total debt.
• Take the average payments to various financial products per month.
• Find the minimum payment made in any one month.
In insurance, we can sum the damage to various parts of a car to obtain the total damage.
Methods
fit(X, y=None)
This transformer does not learn parameters.
Perform dataframe checks. Creates dictionary of operation to new feature name pairs.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, or np.array. Defaults to None. It is not needed in this transformer. You
can pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any user provided variables in variables_to_combine are not numerical
ValueError If the variable(s) contain null values when missing_values = raise
transform(X)
Combine the variables with the mathematical operations.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to transform.
Returns
X: Pandas dataframe, shape = [n_samples, n_features + n_operations] The dataframe
with the original variables plus the new variables.
Raises
TypeError If the input is not a Pandas DataFrame
ValueError
• If the variable(s) contain null values when missing_values = raise
• If the dataframe is not of the same size as that used in fit()
Example
MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more addi-
tional features as a result. That is, it sums, multiplies, takes the average, maximum, minimum or standard deviation of
a group of variables and returns the result into new variables.
In this example, we sum 2 variables from the house prices dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.creation import MathematicalCombination

data = pd.read_csv('houseprice.csv').fillna(0)

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)
math_combinator = MathematicalCombination(
variables_to_combine=['LotFrontage', 'LotArea'],
math_operations = ['sum'],
new_variables_names = ['LotTotal']
)
math_combinator.fit(X_train, y_train)
X_train_ = math_combinator.transform(X_train)
print(math_combinator.combination_dict_)
{'LotTotal': 'sum'}
7.10.2 CombineWithReferenceFeature
API Reference
class feature_engine.creation.CombineWithReferenceFeature(variables_to_combine,
reference_variables, operations=['sub'],
new_variables_names=None,
missing_values='ignore')
CombineWithReferenceFeature() applies basic mathematical operations between a group of variables and one or
more reference features. It adds one or more additional features to the dataframe with the result of the operations.
In other words, CombineWithReferenceFeature() sums, multiplies, subtracts or divides a group of features to /
by a group of reference variables, and returns the result as new variables in the dataframe.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter,
number_payments_third_quarter, number_payments_fourth_quarter, and total_payments, we can use
CombineWithReferenceFeature() to determine the percentage of payments per quarter as follows:
transformer = CombineWithReferenceFeature(
variables_to_combine=[
'number_payments_first_quarter',
'number_payments_second_quarter',
'number_payments_third_quarter',
'number_payments_fourth_quarter',
],
reference_variables=['total_payments'],
operations=['div'],
new_variables_names=[
'perc_payments_first_quarter',
'perc_payments_second_quarter',
'perc_payments_third_quarter',
'perc_payments_fourth_quarter',
]
)
Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features indicated in the new_variables_names list, plus the
original set of variables.
Parameters
variables_to_combine: list The list of numerical variables to be combined with the reference variables.
reference_variables: list The list of numerical reference variables that will be added to, multiplied with, or subtracted from the variables_to_combine, or used as denominator for division.
operations: list, default=['sub'] The list of basic mathematical operations to be used in the transformation.
If None, all of [‘sub’, ‘div’,’add’,’mul’] will be performed. Alternatively, you can enter a list
of operations to carry out. Each operation should be a string and must be one of the elements
in ['sub', 'div','add', 'mul'].
Each operation will result in a new variable that will be added to the transformed dataset.
new_variables_names: list, default=None Names of the newly created variables. You can enter a list with the names for the newly created features (recommended). You must enter as many names as new features created by the transformer. The number of new features is the number of operations times the number of reference variables times the number of variables to combine.
Thus, if you want to perform 2 operations, sub and div, combining 4 variables with 2 reference variables, you should enter 2 x 4 x 2 = 16 new variable names.
The name of the variables indicated by the user should coincide with the order in which the
operations are performed by the transformer. The transformer will first carry out ‘sub’, then
‘div’, then ‘add’ and finally ‘mul’.
If new_variables_names is None, the transformer will assign an arbitrary name to the newly created features.
missing_values: string, default=’ignore’ Indicates if missing values should be ignored or
raised. If ‘ignore’, the transformer will ignore missing data when transforming the data. If
‘raise’ the transformer will return an error if the training or the datasets to transform contain
missing values.
Attributes
Notes
Although the transformer in essence allows us to combine any features with any of the allowed mathematical
operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical
examples within the financial sector are:
• Ratio between income and debt to create the debt_to_income_ratio.
• Subtraction of rent from income to obtain the disposable_income.
Methods
fit(X, y=None)
This transformer does not learn any parameter. Performs dataframe checks.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, or np.array. Default=None. It is not needed in this transformer. You can
pass y or None.
Returns
self
Raises
TypeError
• If the input is not a Pandas DataFrame
• If any user provided variables are not numerical
ValueError If any of the reference variables contain null values and the mathematical operation is 'div'.
transform(X)
Combine the variables with the mathematical operations.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The data to transform.
Returns
X: Pandas dataframe, shape = [n_samples, n_features + n_operations] The dataframe
with the operations results added as columns.
Example
CombineWithReferenceFeature() combines a group of variables with a group of reference variables utilizing basic
mathematical operations (subtraction, division, addition and multiplication), returning one or more additional features
in the dataframe as a result.
In this example, we subtract 2 variables from the house prices dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.creation import CombineWithReferenceFeature

data = pd.read_csv('houseprice.csv').fillna(0)

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)
combinator = CombineWithReferenceFeature(
variables_to_combine=['LotArea'],
reference_variables=['LotFrontage'],
operations = ['sub'],
new_variables_names = ['LotPartial']
)
combinator.fit(X_train, y_train)
X_train_ = combinator.transform(X_train)
print(X_train_[["LotPartial","LotFrontage","LotArea"]].head())
7.10.3 CyclicalTransformer
API Reference
Attributes
References
http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
Methods
fit(X, y=None)
Learns the maximum value of each of the cyclical variables.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training input samples.
Can be the entire dataframe, not just the variables to transform.
y: pandas Series, default=None It is not needed in this transformer. You can pass y or None.
Returns
self
Raises
TypeError If the input is not a Pandas DataFrame.
ValueError
• If some of the columns contain NaNs.
• If some of the mapping keys are not present in the variables.
transform(X)
Creates new features using the cyclical transformation.
Parameters
X: Pandas DataFrame of shape = [n_samples, n_features] The data to be transformed.
Returns
X: Pandas dataframe. The dataframe with the additional new features. The original variables are retained if drop_original is False, or dropped otherwise.
Raises
TypeError If the input is not Pandas DataFrame.
Example
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.creation import CyclicalTransformer

df = pd.DataFrame({
    'day': [6, 7, 5, 3, 1, 2, 4],
    'months': [3, 7, 9, 12, 4, 6, 12],
})

# set up the transformer (setup assumed; the original snippet omitted it)
cyclical = CyclicalTransformer(variables=None, drop_original=True)
X = cyclical.fit_transform(df)
print(cyclical.max_values_)
print(X.head())
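For reference, the cyclical transformation maps each value x of a variable to sin(2 * pi * x / max) and cos(2 * pi * x / max), where max is the learned maximum of that variable. A minimal numpy sketch of the same computation (column names assumed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'day': [6, 7, 5, 3, 1, 2, 4]})
max_value = df['day'].max()

df['day_sin'] = np.sin(2 * np.pi * df['day'] / max_value)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / max_value)
print(df.head())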
Feature-engine's feature selection transformers are used to drop subsets of variables, or, in other words, to select subsets of variables.
7.11.1 DropFeatures
API Reference
class feature_engine.selection.DropFeatures(features_to_drop)
DropFeatures() drops a list of variable(s) indicated by the user from the dataframe.
When is this transformer useful?
Sometimes, we create new variables by combining other variables in the dataset, for example, we obtain the variable
age by subtracting date_of_birth from date_of_application. After we obtain our new variable, we do
not need the date variables in the dataset any more. Thus, we can add DropFeatures() in the Pipeline to have
these removed, as sketched below.
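A minimal sketch of that pattern (step and column names assumed for illustration):

from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures

pipe = Pipeline([
    # ... earlier steps that derive 'age' from the two date columns ...
    ('drop_dates', DropFeatures(features_to_drop=['date_of_application', 'date_of_birth'])),
])

# pipe.fit_transform(X) returns X without the two date columns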
Parameters
Methods
fit(X, y=None)
This transformer does not learn any parameter.
Verifies that the input X is a pandas dataframe, and that the variables to drop exist in the training dataframe.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The input dataframe
y [pandas Series, default = None] y is not needed for this transformer. You can pass y or
None.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
The DropFeatures() drops a list of variables indicated by the user from the original dataframe. The user can pass a
single variable as a string, or a list of variables to be dropped.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropFeatures

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the transformer (features to drop assumed for illustration)
transformer = DropFeatures(features_to_drop=['sibsp', 'parch'])
train_t = transformer.fit_transform(X_train)

# original columns
X_train.columns

# transformed columns
train_t.columns
7.11.2 DropConstantFeatures
API Reference
Attributes
See also:
sklearn.feature_selection.VarianceThreshold
Notes
This transformer is similar in concept to the VarianceThreshold from Scikit-learn, but it evaluates the number of
unique values instead of the variance; see the sketch below.
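The underlying check can be expressed in a few lines; a minimal sketch, assuming a threshold tol equivalent to the transformer's tol parameter:

import pandas as pd

def quasi_constant_features(X: pd.DataFrame, tol: float = 0.98):
    # a feature is (quasi-)constant if its most frequent value
    # accounts for at least `tol` of the observations
    return [
        col for col in X.columns
        if X[col].value_counts(normalize=True, dropna=False).iloc[0] >= tol
    ]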
Methods
fit(X, y=None)
Find constant and quasi-constant features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe.
y: None y is not needed for this transformer. You can pass y or None.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
The DropConstantFeatures() drops constant and quasi-constant variables from a dataframe. By default, DropConstant-
Features drops only constant variables. This transformer works with both numerical and categorical variables.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# drop variables where a single value covers more than 70% of the observations
transformer = DropConstantFeatures(tol=0.7)
train_t = transformer.fit_transform(X_train)

transformer.constant_features_
['parch', 'embarked']
We see in the following code snippets that, for the variables parch and embarked, more than 70% of the observations
display the same value:
X_train['embarked'].value_counts() / len(X_train)
S 0.711790
C 0.197598
Q 0.090611
Name: embarked, dtype: float64
X_train['parch'].value_counts() / len(X_train)
0 0.771834
1 0.125546
2 0.086245
3 0.005459
4 0.004367
5 0.003275
6 0.002183
9 0.001092
Name: parch, dtype: float64
77% of the passengers had 0 parents or children aboard, and 71% embarked in S. Because of this, these features were deemed quasi-constant and removed.
7.11.3 DropDuplicateFeatures
API Reference
Attributes
Methods
fit(X, y=None)
Find duplicated features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe.
y: None y is not needed for this transformer. You can pass y or None.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
The DropDuplicateFeatures() finds and removes duplicated variables from a dataframe. The user can pass a list of
variables to examine, or alternatively the selector will examine all variables in the data set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropDuplicateFeatures

def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data = data[['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked']]
    # duplicate some columns so the selector has something to find (setup assumed)
    data = pd.concat([data, data[['sex', 'age', 'sibsp']]], axis=1)
    data.columns = ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'cabin',
                    'embarked', 'sex_dup', 'age_dup', 'sibsp_dup']
    return data

data = load_titanic()

# separate into train and test sets (split assumed)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'],
    test_size=0.3, random_state=0)

transformer = DropDuplicateFeatures()
train_t = transformer.fit_transform(X_train)
train_t.columns
transformer.features_to_drop_
transformer.duplicated_feature_sets_
7.11.4 DropCorrelatedFeatures
API Reference
Attributes
See also:
pandas.corr
feature_engine.selection.SmartCorrelationSelection
Methods
fit(X, y=None)
Find the correlated features.
Parameters
X [pandas dataframe of shape = [n_samples, n_features]] The training dataset.
y [pandas series. Default = None] y is not needed in this transformer. You can pass y or
None.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
The DropCorrelatedFeatures() finds and removes correlated variables from a dataframe. The user can pass a list of
variables to examine, or alternatively the selector will examine all numerical variables in the data set.
import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# toy dataset with correlated (redundant) features (setup assumed for illustration)
def make_data():
    X, y = make_classification(n_samples=1000, n_features=12, n_redundant=4,
                               n_clusters_per_class=1, class_sep=2, random_state=1)
    return pd.DataFrame(X, columns=['var_' + str(i) for i in range(12)])

X = make_data()

tr = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.8)
Xt = tr.fit_transform(X)

tr.correlated_feature_sets_
tr.features_to_drop_
print(Xt.head())
var_11
0 -1.989335
1 -1.309524
2 -0.852703
3 0.484649
4 -0.186530
(output truncated; only the last column is shown)
7.11.5 SmartCorrelatedSelection
API Reference
SmartCorrelatedSelection() returns a dataframe containing, from each group of correlated features, the selected
variable, plus all the original features that were not correlated with any other.
Correlation is calculated with pandas.corr().
SmartCorrelatedSelection() works only with numerical variables. Categorical variables will need to be encoded
to numerical or will be excluded from the analysis.
Parameters
variables: list, default=None The list of variables to evaluate. If None, the transformer will
evaluate all numerical variables in the dataset.
method: string or callable, default=’pearson’ Can take ‘pearson’, ‘spearman’, ‘kendall’ or
callable. It refers to the correlation method to be used to identify the correlated features.
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
• callable: callable with input two 1d ndarrays and returning a float.
For more details on this parameter visit the pandas.corr() documentation.
threshold: float, default=0.8 The correlation threshold above which a feature will be deemed
correlated with another one and removed from the dataset.
missing_values: str, default='ignore' Takes values 'raise' and 'ignore'. Whether missing values should raise an error or be ignored when determining correlation.
selection_method: str, default='missing_values' Takes the values 'missing_values', 'cardinality', 'variance' and 'model_performance'.
'missing_values': keeps the feature from the correlated group with the least missing observations.
'cardinality': keeps the feature from the correlated group with the highest cardinality.
'variance': keeps the feature from the correlated group with the highest variance.
'model_performance': trains a machine learning model using the correlated feature group and retains the feature with the highest importance.
estimator: object, default = None A Scikit-learn estimator for regression or classification.
scoring: str, default=’roc_auc’ Desired metric to optimise the performance of the estimator.
Comes from sklearn.metrics. See the model evaluation documentation for more options:
https://scikit-learn.org/stable/modules/model_evaluation.html
cv: int, cross-validation generator or an iterable, default=3 Determines the cross-validation
splitting strategy. Possible inputs for cv are:
• None, to use cross_validate’s default 5-fold cross validation
• int, to specify the number of folds in a (Stratified)KFold,
• CV splitter
– (https://scikit-learn.org/stable/glossary.html#term-CV-splitter)
• An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False, so the splits will be the same across calls.
Attributes
See also:
pandas.corr
feature_engine.selection.DropCorrelatedFeatures
Methods
fit(X, y=None)
Find the correlated feature groups. Determine which feature should be selected from each group.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The training dataset.
y: pandas series. Default = None y is needed if selection_method ==
‘model_performance’.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import SmartCorrelatedSelection

# toy dataset with correlated (redundant) features (setup assumed for illustration)
def make_data():
    X, y = make_classification(n_samples=1000, n_features=12, n_redundant=4,
                               n_clusters_per_class=1, class_sep=2, random_state=1)
    return pd.DataFrame(X, columns=['var_' + str(i) for i in range(12)])

X = make_data()

# keep, from each correlated group, the feature with the highest variance (setup assumed)
tr = SmartCorrelatedSelection(variables=None, method='pearson', threshold=0.8,
                              selection_method='variance')
Xt = tr.fit_transform(X)

tr.correlated_feature_sets_
tr.features_to_drop_
print(Xt.head())
var_7
0 -2.230170
1 -1.447490
2 -2.240741
3 -3.534861
4 -2.053965
(output truncated; only the last column is shown)
7.11.6 SelectByShuffling
API Reference
Attributes
Methods
fit(X, y)
Find the important features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe
y: array-like of shape (n_samples) Target variable. Required to train the estimator.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
The SelectByShuffling() selects important features, that is, features whose values, when permuted at random, produce
a decrease in the initial model performance.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectByShuffling
# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)
# set up the estimator and the selector (setup assumed; the original snippet omitted it)
linear_model = LinearRegression()
tr = SelectByShuffling(estimator=linear_model, scoring='r2', cv=3)
Xt = tr.fit_transform(X, y)
tr.initial_model_performance_
0.488702767247119
tr.performance_drifts_
{0: -0.02368121940502793,
1: 0.017909161264480666,
2: 0.18565460365508413,
3: 0.07655405817715671,
4: 0.4327180164470878,
5: 0.16394693824418372,
6: -0.012876023845921625,
7: 0.01048781540981647,
8: 0.3921465005640224,
9: -0.01427065640301245}
tr.features_to_drop_
[0, 1, 3, 6, 7, 9]
print(Xt.head())
1 2 3 4 5 7 8
0 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.002592 0.019908
1 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 -0.039493 -0.068330
2 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.002592 0.002864
7.11.7 SelectBySingleFeaturePerformance
API Reference
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False, so the splits will be the same across calls.
For more details check Scikit-learn’s cross_validate documentation
Attributes
Methods
fit(X, y)
Select features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe
y: array-like of shape (n_samples) Target variable. Required to train the estimator.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
The SelectBySingleFeaturePerformance() selects features based on the performance of machine learning models trained
on each feature individually. In other words, it selects features based on their individual predictive performance, as
returned by estimators trained on only that particular feature.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectBySingleFeaturePerformance
# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)
# set up the selector (setup assumed; the original snippet omitted it)
sel = SelectBySingleFeaturePerformance(estimator=LinearRegression(),
                                       scoring='r2', cv=3, threshold=0.01)
sel.fit(X, y)
sel.features_to_drop_
[1]
sel.feature_performance_
{0: 0.029231969375784466,
1: -0.003738551760264386,
2: 0.336620809987693,
3: 0.19219056680145055,
4: 0.037115559827549806,
5: 0.017854228256932614,
6: 0.15153886177526896,
7: 0.17721609966501747,
8: 0.3149462084418813,
9: 0.13876602125792703}
7.11.8 SelectByTargetMeanPerformance
API Reference
class feature_engine.selection.SelectByTargetMeanPerformance(variables=None,
scoring='roc_auc_score',
threshold=0.5, bins=5,
strategy='equal_width', cv=3,
random_state=None)
SelectByTargetMeanPerformance() uses the mean value of the target per category, or per interval if the variable is
numerical, as a proxy for target estimation. With this proxy and the real target, the selector determines a performance
metric for each feature, and then selects features based on this performance metric.
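A minimal sketch of that proxy for a single categorical variable and a binary target (toy data, assumed for illustration):

import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({
    'colour': ['red', 'red', 'blue', 'blue', 'green', 'green'],
    'target': [1, 1, 0, 1, 0, 0],
})

# mean target per category, used as the prediction proxy
proxy = df['colour'].map(df.groupby('colour')['target'].mean())
print(roc_auc_score(df['target'], proxy))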
Attributes
Methods
fit(X, y)
Find the important features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe
y: array-like of shape (n_samples) Target variable. Required to train the estimator.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from feature_engine.selection import SelectByTargetMeanPerformance
# load data
data = pd.read_csv('../titanic.csv')

# separate into train and test sets (split assumed; the original snippet omitted it)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1), data['survived'],
    test_size=0.3, random_state=0)
# feature engine automates the selection for both categorical and numerical
# variables
sel = SelectByTargetMeanPerformance(
variables=None,
scoring="roc_auc_score",
threshold=0.6,
bins=3,
strategy="equal_frequency",
cv=2,# cross validation
random_state=1, #seed for reproducibility
)
sel.variables_categorical_
sel.variables_numerical_
['age', 'fare']
sel.feature_performance_
{'pclass': 0.6802934787230475,
'sex': 0.7491365252482871,
'age': 0.5345141148737766,
'sibsp': 0.5720480307315783,
'parch': 0.5243557188989476,
'fare': 0.6600883312700917,
'cabin': 0.6379782658154696,
'embarked': 0.5672382248783936}
sel.features_to_drop_
['age', 'sibsp', 'parch', 'embarked']
# remove features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
X_train.shape, X_test.shape
7.11.9 RecursiveFeatureElimination
API Reference
Attributes
Methods
fit(X, y)
Find the important features. Note that the selector trains various models at each round of selection, so it
might take a while.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe
y: array-like of shape (n_samples) Target variable. Required to train the estimator.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Return type: DataFrame
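Conceptually, the selector trains a model on all features, records a baseline performance, and then removes features one by one, keeping a feature only if removing it degrades performance by more than a threshold. The following is a simplified sketch of that idea, not the class internals; the real selector also derives the elimination order from the feature importance of the initial model:
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression()
threshold = 0.01
# performance of the model trained with all features
baseline = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
selected = list(X.columns)
for col in X.columns:  # the real selector orders features by importance first
    candidate = [c for c in selected if c != col]
    perf = cross_val_score(clone(model), X[candidate], y, cv=3, scoring="r2").mean()
    if baseline - perf < threshold:
        # removing the feature barely hurts performance: drop it for good
        selected, baseline = candidate, perf
print(selected)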
Example
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureElimination
# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)
# initialize a linear regression estimator
linear_model = LinearRegression()
# initialize the feature selector; this set-up (r2 scoring) is assumed,
# consistent with the outputs shown below
tr = RecursiveFeatureElimination(estimator=linear_model, scoring="r2", cv=3)
# fit transformer
Xt = tr.fit_transform(X, y)
# performance of the model trained using all features
tr.initial_model_performance_
0.488702767247119
# performance drift observed when each feature is removed
tr.performance_drifts_
{0: -0.0032796652347705235,
9: -0.00028200591588534163,
6: -0.0006752869546966522,
7: 0.00013883578730117252,
1: 0.011956170569096924,
3: 0.028634492035512438,
5: 0.012639090879036363,
2: 0.06630127204137715,
8: 0.1093736570697495,
4: 0.024318093565432353}
# the features that will be removed
tr.features_to_drop_
[0, 6, 7, 9]
print(Xt.head())
1 3 5 2 8 4
0 0.050680 0.021872 -0.034821 0.061696 0.019908 -0.044223
1 -0.044642 -0.026328 -0.019163 -0.051474 -0.068330 -0.008449
2 0.050680 -0.005671 -0.034194 0.044451 0.002864 -0.045599
7.11.10 RecursiveFeatureAddition
API Reference
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass,
StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated
with shuffle=False, so the splits will be the same across calls.
For more details check Scikit-learn's cross_validate documentation.
Attributes
Methods
fit(X, y)
Find the important features. Note that the selector trains various models at each round of selection, so it
might take a while.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe
y: array-like of shape (n_samples) Target variable. Required to train the estimator.
Returns
self
transform(X)
Return dataframe with selected features.
Parameters
X: pandas dataframe of shape = [n_samples, n_features]. The input dataframe.
Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]
Pandas dataframe with the selected features.
Return type: DataFrame
Example
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureAddition
# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)
# initialize a linear regression estimator
linear_model = LinearRegression()
# initialize the feature selector; this set-up (r2 scoring) is assumed,
# consistent with the outputs shown below
tr = RecursiveFeatureAddition(estimator=linear_model, scoring="r2", cv=3)
# fit transformer
Xt = tr.fit_transform(X, y)
# performance of the model trained using all features
tr.initial_model_performance_
0.488702767247119
# performance drift observed when each feature is added
tr.performance_drifts_
{4: 0,
8: 0.2837159006046677,
2: 0.1377700238871593,
5: 0.0023329006089969906,
3: 0.0187608758643259,
1: 0.0027994385024313617,
7: 0.0026951300105543807,
6: 0.002683967832484757,
9: 0.0003040126429713075,
0: -0.007386876030245182}
# the features that will be removed
tr.features_to_drop_
[0, 6, 7, 9]
print(Xt.head())
4 8 2 3
0 -0.044223 0.019908 0.061696 0.021872
1 -0.008449 -0.068330 -0.051474 -0.026328
2 -0.045599 0.002864 0.044451 -0.005671
Feature-engine's Scikit-learn wrappers wrap Scikit-learn transformers, allowing their application to a selected
subset of features only.
7.12.1 SklearnTransformerWrapper
API Reference
Attributes
Methods
fit(X, y=None)
Fits the Scikit-learn transformer to the selected variables.
If you enter None in the variables parameter, all variables will be automatically transformed by the OneHotEncoder,
OrdinalEncoder or SimpleImputer. For the rest of the transformers, only the numerical variables will be selected
and transformed.
If you enter a list in the variables attribute, the SklearnTransformerWrapper will check that those variables exist
in the dataframe and are of type numeric for all transformers except the OneHotEncoder, OrdinalEncoder or
SimpleImputer, which also accept categorical variables.
Parameters
X: pandas dataframe of shape = [n_samples, n_features] The input dataframe
Return type: DataFrame
Raises
TypeError If the input is not a Pandas DataFrame
Example
Applies Scikit-learn transformers like the SimpleImputer, the OrdinalEncoder or most scalers to a selected subset
of features only.
In the next code snippet we show how to wrap the SimpleImputer from Scikit-learn to impute only the selected variables.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
data = pd.read_csv('houseprice.csv')
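The rest of this example is truncated in this build. A minimal sketch of the missing steps follows; the column names 'Id', 'SalePrice', 'LotFrontage' and 'MasVnrArea' are assumptions about the house prices dataset:
# separate the data into train and test sets; column names are assumed
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
# set up the wrapper with the SimpleImputer, restricted to two variables
imputer = SklearnTransformerWrapper(
    transformer=SimpleImputer(strategy='mean'),
    variables=['LotFrontage', 'MasVnrArea'])
# learn the mean values from the train set
imputer.fit(X_train)
# impute the selected variables in both data sets
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)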
In the next snippet of code we show how to wrap the StandardScaler from Scikit-learn to standardize only the selected
variables.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
data = pd.read_csv('houseprice.csv')
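As above, the remaining steps are truncated in this build; a minimal sketch follows, with the same assumed column names:
# separate the data into train and test sets; column names are assumed
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
# set up the wrapper with the StandardScaler, restricted to two variables
scaler = SklearnTransformerWrapper(
    transformer=StandardScaler(),
    variables=['LotFrontage', 'MasVnrArea'])
# learn the mean and standard deviation from the train set
scaler.fit(X_train)
# standardize the selected variables in both data sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)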
In the next snippet of code we show how to wrap the SelectKBest from Scikit-learn to select only a subset of the
variables.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_regression, SelectKBest
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
data = pd.read_csv('houseprice.csv')
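The train/test split and the definition of cols are missing from this build. A plausible completion follows, assuming cols should hold the numerical predictors and that 'Id' and 'SalePrice' are the identifier and target columns:
# separate the data into train and test sets; column names are assumed
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
# list the numerical variables to pass to SelectKBest; assumed definition
cols = [var for var in X_train.columns if X_train[var].dtypes != 'O']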
selector = SklearnTransformerWrapper(
    transformer=SelectKBest(f_regression, k=5),
    variables=cols)
selector.fit(X_train.fillna(0), y_train)
7.13 Tutorials
Coming Soon!
You can find some videos on how to use Feature-engine in our YouTube Feature-engine playlist. The list is a bit short
at the moment, apologies.
7.14 How To
Find Jupyter notebooks with examples of each transformer's functionality; within each folder, there is a notebook
showcasing one transformer.
7.15 Books
You can learn more about how to use Feature-engine and feature engineering in general in the following books:
7.16 Courses
You can learn more about how to use Feature-engine, and about feature engineering
and feature selection in general, in the following online courses:
7.17 Learning Resources
Articles and videos about Feature-engine, and about feature engineering and selection
in general.
7.17.1 Blogs
7.17.2 Videos
7.17.3 En Español
7.18 Roadmap
This document provides general directions on what the core contributors would
like to see developed in Feature-engine. As resources are limited, we can’t
promise when or if the transformers listed here will be included in the library.
We welcome all the help we can get to support this vision. If you are interested
in contributing, please get in touch.
7.18.1 Purpose
7.18.2 Vision
We are interested in adding a module that creates date and time related features
from datetime variables. This module would include transformers to extract all
possible date and time related features, like hour, minute, second, day, year,
is_weekend, etc. It would also include transformers to capture the elapsed time
between two or more variables.
We would also like to add a module that returns straightforward features from
simple text variables, to capture text complexity: for example, the number of
words, unique words, paragraphs and sentences, or measures of lexical complexity.
We would also consider integrating the Bag of Words and TF-IDF from sklearn
with a wrapper that returns a dataframe ready to use to train machine learning
models. Below we describe these new modules in more detail.
In addition, we are evaluating whether a module to extract features from time
series is possible within the current design of the package, and whether it adds
real value compared to the functionality already available in pandas and SciPy,
and in other well-established open source projects like tsfresh and featuretools.
[Image: overview of the time series transformations under consideration.]
7.18.5 Goals
7.19 About
7.19.1 History
Data scientists spend a huge amount of time on data pre-processing and transformation.
It would be great (we thought back in the day) to gather the most frequently used
data pre-processing techniques and transformations in a library, from which we
could pick and choose the transformation that we need, and use it just like we
would use any other sklearn class. This was the original vision for Feature-engine.
Feature-engine is an open source Python package originally designed to support
the online course Feature Engineering for Machine Learning in Udemy, but it has
now gained popularity and supports transformations beyond those taught in the
course. It was launched in 2017 and, since then, several releases have appeared
and a growing international community is beginning to lead the development.
7.19.2 Governance
7.19.4 Contributors
You can learn more about Feature-engine's contributors in the GitHub contributors page.
Coming soon
7.20 Contributing
We prefer to handle most contributions through the GitHub repository. You can
also join our mailing list.
1. Open issues.
2. Gitter community.
When you fork the repository, you create a copy of Feature-engine's source code
in your account, which you can edit. To fork Feature-engine's repository, click
the fork button in the upper right corner of Feature-engine's GitHub page.
Once you have forked the repository, follow these steps to set up your development
environment:
1. Clone your fork into your local machine:
$ git clone https://github.com/<YOURUSERNAME>/feature_engine
2. Set up an upstream remote from where you can pull the latest code changes
occurring in the main Feature-engine repository:
$ git remote add upstream https://github.com/feature-engine/feature_engine.git
$ git remote -v
origin    https://github.com/YOUR_USERNAME/feature_engine.git (fetch)
origin    https://github.com/YOUR_USERNAME/feature_engine.git (push)
upstream  https://github.com/feature-engine/feature_engine.git (fetch)
upstream  https://github.com/feature-engine/feature_engine.git (push)
Keep in mind that Feature-engine is being actively developed, so you may need
to update your fork regularly. See below for tips on Keeping your fork up to
date.
3. Optional but highly advisable: Create a virtual environment. Use any vir-
tual environment tool of your choice. Some examples include:
1. venv
2. conda environments
4. Change directory into the cloned repository:
$ cd feature_engine
5. Install Feature-engine in developer mode:
$ pip install -e .
This will add Feature-engine to your PYTHONPATH so your code edits are
automatically picked up, and there is no need to re-install the package after
each code change.
6. Install the additional dependencies for tests and documentation:
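The install command itself is missing from this build. Assuming the repository keeps its test and docs requirements in test_requirements.txt and docs/requirements.txt, it would look like:
$ pip install -r test_requirements.txt
$ pip install -r docs/requirements.txt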
7. Make sure that your local master is up to date with the remote master:
If you just cloned your fork, your local master should be up to date. If you
cloned your fork a while ago, the main repository has probably had some code
changes. To sync your fork's master to the main repository's master, read the
section Keeping your fork up to date below.
8. Create a new branch where you will develop your feature:
There are 3 things to keep in mind when creating a feature branch; a typical
command sequence is sketched below. First, give the branch a name that identifies
the feature you are going to build. Second, make sure you checked out your branch
from the master branch. Third, make sure your local master was updated with the
upstream master.
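A typical sequence, using myfeaturebranch as an example branch name (the same name used in the push example below):
$ git checkout master
$ git pull --rebase upstream master
$ git checkout -b myfeaturebranch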
9. Once your code is ready, commit your changes and push your branch to your
fork:
$ git add .
$ git commit -m "my commit message"
$ git push origin myfeaturebranch
This will add a new branch to your fork. In the commit message, be succinct:
describe what is being added and, if it resolves an issue, make sure to reference
the issue in the commit message (you can also do this from GitHub).
10. Go to your fork in GitHub; you will see the branch you just pushed and, next
to it, a button to create a PR. Go ahead and create a PR from your feature branch
to Feature-engine's master branch.
First thing, make a pull request (PR). Once you have written a bit of code for
your new feature, or bug fix, or example, or whatever task you are working on,
make a PR. The PR should be made from your feature_branch (in your fork),
to Feature-engine’s master branch in the main repository.
When you develop a new feature, a bug fix, or any other contribution, there are
a few things to consider:
1. Make regular code commits to your branch, locally.
2. Give clear messages to your commits, indicating which changes were made at
each commit (use present tense).
3. Try and push regularly to your fork, so that you don't lose your changes,
should a major catastrophe arise.
4. If your feature takes some time to develop, make sure you rebase
upstream/master onto your feature branch.
Once your contribution contains the new code, the tests, and ideally the
documentation, the review process will start. Likely, there will be some back and
forth until the final submission.
Once the submission is reviewed and provided the continuous integration tests
have passed and the code is up to date with Feature-engine’s master branch,
we will be ready to “Squash and Merge” your contribution into the master
branch of Feature-engine. “Squash and Merge” combines all of your commits
into a single commit which helps keep the history of the repository clean and
tidy.
Once your contribution has been merged into master, you will be listed as a
Feature-engine contributor :)
You can test the code functionality either in your development environment or
using tox. If you want to use tox:
1. Install tox in your development environment:
$ pip install tox
2. Change directory into the repository folder and run tox:
$ cd feature_engine
$ tox
If you prefer not to use tox, run the test suite directly with pytest from the
repository folder:
$ pytest
These commands will run all the test scripts within the test folder. Alternatively,
you can run specific scripts as follows:
1. Change into the tests folder:
$ cd tests
$ pytest test_categorical_encoder.py
If tests pass, your code is functional. If not, try and fix the issue following the
error messages. If stuck, get in touch.
When you’re collaborating using forks, it’s important to update your fork to
capture changes that have been made by other collaborators.
If your feature takes a few days or weeks to develop, it may happen that new
code changes are made to Feature-engine's master branch by other contributors.
Some of the files that are changed may be the same files you are working on.
Thus, it is really important that you pull and rebase the upstream master into
your feature branch fairly often. To keep your branches up to date, follow the
steps below; a consolidated sketch of the commands comes after them:
1. Check out your local master:
If your feature branch has uncommitted changes, it will ask you to commit or
stash those first.
2. Pull and rebase the upstream master on your local master:
Your master should be a copy of the upstream master. If it is not, some conflicting
files may appear. You will need to resolve these conflicts and continue the rebase.
3. Pull the changes to your fork:
The previous command will update your fork so that your fork’s master is in
sync with Feature-engine’s master. Now, you need to rebase master onto your
feature branch.
4. Check out your feature branch:
Again, if conflicts arise, try and resolve them and continue the rebase. Now
you are good to go to continue developing your feature.
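Put together, keeping a fork up to date looks like this sketch; myfeaturebranch is an example branch name, and the force push is only needed after rebasing an already published branch:
$ git checkout master
$ git pull --rebase upstream master
$ git push origin master
$ git checkout myfeaturebranch
$ git rebase master
$ git push -f origin myfeaturebranch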
Only Core contributors have write access to the repository, can review and can
merge pull requests. Some preferences for commit messages when merging in
pull requests:
• Make sure to use the “Squash and Merge” option in order to create a Git history
that is understandable.
• Keep the title of the commit short and descriptive; be sure it includes the PR
# and the issue #.
Update your local fork (see section Keeping your fork updated) and delete
the feature branch.
Well done and thank you very much for your support!
Releases
After a few features have been added to the master branch by yourself and other
contributors, we will merge master into a release branch, e.g. 0.6.X, to release
a new version of Feature-engine to PyPI.
First, make sure you have properly installed Sphinx and the required depen-
dencies.
1. If you haven’t done so, in your virtual environment, from the root folder of
the repository, install the requirements for the documentation:
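The commands are not shown in this build. Based on the description below, the steps are presumably installing the docs requirements and invoking Sphinx's html builder with docs as the source folder and build as the output folder:
$ pip install -r docs/requirements.txt
$ sphinx-build -b html docs build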
This command tells Sphinx that the documentation files are within the docs
folder, and that the html files should be placed in the build folder.
If everything worked fine, you can navigate the html files located in build.
Otherwise, troubleshoot through the error messages returned by Sphinx.
Good luck and get in touch if stuck!
7.22 Governance
Contributors
Core Contributors
Core Contributors are community members who are dedicated to the continued
development of the project through ongoing engagement with the community. Core
Contributors are expected to review code contributions, can approve and merge
pull requests, can decide on the fate of pull requests, and can be involved in
deciding major changes to the Feature-engine API. Core Contributors determine
who can join as a Core Contributor.
Feature-engine was founded by Soledad Galli, who at the time was solely
responsible for the initial prototypes, documentation, and dissemination of the
project. In the tradition of Python, Sole is referred to as the “benevolent
dictator for life” (BDFL) of the project, or simply, “the founder”. From a
governance perspective, the BDFL has a special role in that they provide vision,
thought leadership, and high-level direction for the project's members and
contributors. The BDFL has the authority to make all final decisions for the
Feature-engine project. However, in practice, the BDFL chooses to defer that
authority to the consensus of the community discussion channels and the Core
Contributors. The BDFL can also propose and vote for new Core Contributors.
Contributors
• Soledad Galli
This small release fixes a bug in how the OneHotEncoder handles binary
categorical variables when the parameter drop_last_binary is set to True. It
also ensures that the values in the OneHotEncoder.encoder_dict_ are lists of
categories and not arrays. These bugs were introduced in v1.1.0.
Bug fix
Contributors
Major changes
New transformer
Minor changes
Documentation
Contributors
• Hector Patino
• Andrew Tan
• Shubhmay Potdar
• Agustin Firpo
• Indy Navarro Vidal
• Ashok Kumar
• Chris Samiullah
• Soledad Galli
In this release, we enforce compatibility with Scikit-learn by adding the
check_estimator tests to all transformers in the package.
In order to pass the tests, we needed to modify some of the internal
functionality of Feature-engine transformers and create new attributes. We tried
to avoid breaking backwards compatibility as much as possible.
Major changes
Minor changes
New transformer
Code improvement
Documentation
Community
Contributors
• Nicolas Galli
• Pradumna Suryawanshi
• Elamraoui Sohayb
• Soledad Galli
New transformers
Bug Fix
Tutorials
• Creation: updated “how to” examples on how to combine variables into new
features (by Elamraoui Sohayb and Nicolas Galli)
• Kaggle Kernels: include links to Kaggle kernels
Bug Fix
Contributors
• Ashok Kumar
• Christopher Samiullah
• Nicolas Galli
• Nodar Okroshiashvili
• Pradumna Suryawanshi
• Sana Ben Driss
• Tejash Shah
• Tung Lee
• Soledad Galli
In this version, we made a major overhaul of the package, with code quality
improvements throughout the code base, unification of attributes and methods,
addition of new transformers and extended documentation. Read below for more
details.
Renaming of Modules
Renaming of Classes
We shortened the names of the categorical encoders, and also renamed other
classes to simplify the import syntax.
• Encoders: the word Categorical was removed from the class names. Now, instead
of MeanCategoricalEncoder, the class is called MeanEncoder. Instead of
RareLabelCategoricalEncoder it is RareLabelEncoder, and so on. Please check the
encoders documentation for more details.
• Imputers: the CategoricalVariableImputer is now called CategoricalImputer.
Renaming of Parameters
Tutorials
• Imputation: updated “how to” examples of missing data imputation (by Pradumna
Suryawanshi)
• Encoders: new and updated “how to” examples of categorical encoding (by Ashok
Kumar)
• Discretisation: new and updated “how to” examples of discretisation (by Ashok
Kumar)
• Variable transformation: updated “how to” examples on how to apply mathematical
transformations to variables (by Pradumna Suryawanshi)
Code Architecture
Documentation
Other Changes
Version 0.6.1
Version 0.6.0
Version 0.5.0
Version 0.4.3
Version 0.4.0
• Improved: the RareLabelEncoder now allows the user to define the name for the
label that will replace rare categories.
• Improved: all Feature-engine transformers (except missing data imputers) check
that the data sets do not contain missing values.
• Improved: the LogTransformer will raise an error if a variable has zero or
negative values.
• Improved: the ReciprocalTransformer now works with variables of type integer.
• Improved: the ReciprocalTransformer will raise an error if the variable
contains the value zero.
• Improved: the BoxCoxTransformer will raise an error if the variable contains
negative values.
• Improved: the OutlierCapper now finds and removes outliers based on percentiles.
• Improved: Feature-engine is now compatible with the latest releases of Pandas
and Scikit-learn.
Version 0.3.0