The Complete Guide to Data Preprocessing (Part 1)
https://pub.towardsai.net/the-complete-guide-to-data-preprocessing-3e0092b74016
Towards AI
1. Data cleaning
6. Discretization
In this part of the article, we will focus on steps 1–5, and in the second
part we will discuss steps 6–7.
The order of the steps outlined above may change according to the
requirements of the model and the characteristics of the data set. Nonetheless,
some of the steps depend on each other. For example, the data should
be normalized only after it is cleaned and the missing values and
outliers have been handled.
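For instance, these dependencies can be made explicit by chaining the steps in a Scikit-Learn Pipeline. A minimal sketch (the step names and the choice of transformers here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# The pipeline guarantees the order of the steps: missing values are
# imputed first, and only then is the data normalized
preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])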
The main modules in Scikit-Learn that are used for data preprocessing
are:
• sklearn.preprocessing provides various transformers for scaling, normalization, feature encoding, and discretization.
• sklearn.impute provides transformers for imputing missing values.
• sklearn.compose provides the ColumnTransformer for applying different transformers to different columns of the data set.
Data Cleaning
Data cleaning (or cleansing) involves correcting or removing incorrect,
inaccurate, inconsistent, irrelevant, or duplicate data from the data set.
These issues can arise from various sources, such as human error during
data entry, faulty measurements, or merging data from inconsistent sources.
A closely related task is handling missing values, for which Scikit-Learn
provides several imputers. For example, imputing missing values using the
mean of each feature:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X = [[np.nan, 5, np.nan], [2, 4, 10], [3, np.nan, 5]]
imputer.fit_transform(X)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
X = [[1, 2], [2, 4], [4, 8], [np.nan, 3], [5, np.nan]]
imputer.fit_transform(X)
array([[ 1.        ,  2.        ],
       [ 2.        ,  4.        ],
       [ 4.        ,  8.        ],
       [ 1.50000846,  3.        ],
       [ 5.        , 10.00000145]])
We can see that the imputer has learned that the second feature is
equal to twice the first one.
The following example replaces the missing values with the mean
feature value of the two nearest neighbors:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)
In general, the simple imputer performs worse than the more complex
imputers on weak models, while it works as well as or better than them
on powerful models.
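One way to check this claim on a given data set is to cross-validate pipelines that pair each imputer with a weak and a powerful model. A minimal sketch on synthetic data (the data set, models, and missingness rate are all illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.2] = np.nan  # mask 20% of the entries at random

for imputer in (SimpleImputer(strategy='mean'), IterativeImputer(max_iter=10)):
    for model in (Ridge(), RandomForestRegressor(random_state=0)):
        pipe = make_pipeline(imputer, model)
        score = cross_val_score(pipe, X, y, cv=3).mean()
        print(type(imputer).__name__, type(model).__name__, round(score, 3))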
from sklearn.preprocessing import OrdinalEncoder

# Sample input reconstructed to be consistent with the output below
X = [['LowIncome', 'BA'], ['HighIncome', 'PhD'], ['MediumIncome', 'BA']]
encoder = OrdinalEncoder()
encoder.fit_transform(X)
array([[1., 0.],
       [0., 1.],
       [2., 0.]])
Cons:
• It imposes an artificial order on the categories, which can mislead the model when the feature is nominal (has no natural ordering).
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoder.fit_transform(X).toarray()
In the transformed matrix, the first three columns are the encoding of
the first feature with categories
LowIncome/MediumIncome/HighIncome, and the last two columns
are the encoding of the second feature with categories BA/PhD.
Note that the encoder by default returns a SciPy sparse matrix (where
only nonzero values are stored), thus in order to display it we had to
convert it to a dense NumPy array by calling the toarray() method.
Cons:
• It can significantly increase the dimensionality of the data set when a feature has many categories.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer

X = np.random.RandomState(0).lognormal(size=500)
plt.hist(X, bins=50)

pt = PowerTransformer('box-cox')  # Box-Cox requires strictly positive values
plt.hist(pt.fit_transform(X.reshape(-1, 1)), bins=50)
Note that power transformations are not effective with every type
of distribution. For example, they do not work well with uniform
or bimodal distributions.
The Complete Guide to Data Preprocessing (Part 2)
https://pub.towardsai.net/the-complete-guide-to-data-preprocessing-part-2-96cbcd1b6d90
Discretization
Discretization transforms a continuous-valued feature into a discrete
one by partitioning the range of its values into a set of intervals or bins.
In Scikit-Learn, discretization is implemented by the KBinsDiscretizer
class. Its strategy parameter determines how the bin edges are computed:
• ‘uniform’: all the bins have the same width.
• ‘quantile’ (the default): all the bins have the same number of points.
• ‘kmeans’: the bins are determined by running a K-Means clustering on the data points.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# The sample data and the number of bins here are illustrative; the
# original values are not shown in the extract
X = np.array([[1], [3], [7], [11], [15], [24], [35], [46], [74], [86]])
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal')
discretizer.fit_transform(X)
discretizer.bin_edges_
The goal of feature scaling is to bring all the features to a common scale
or range. The most common approaches for feature scaling are min-max
scaling, standardization, and robust scaling.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = [[-1, 2, 3], [0.5, 6, 10], [0, 1, 8]]
scaler.fit_transform(X)
array([[0.        , 0.2       , 0.        ],
       [1.        , 1.        , 1.        ],
       [0.66666667, 0.        , 0.71428571]])
Pros:
• It guarantees that all the features will end up in the exact same range (by default [0, 1]).
Cons:
• It is very sensitive to outliers, since a single extreme value determines the range of the feature.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(X)
Pros:
• It is less sensitive to outliers than min-max scaling, since the scaled values are not confined by a single extreme minimum or maximum.
Cons:
• It does not produce a bounded range, and the mean and standard deviation it uses are themselves affected by outliers.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X = [[-1, 2, 3], [0.5, 6, 10], [0, 1, 8]]
scaler.fit_transform(X)
array([[-1.33333333,  0.        , -1.42857143],
       [ 0.66666667,  1.6       ,  0.57142857],
       [ 0.        , -0.4       ,  0.        ]])
Pros:
• It is robust to outliers, since it is based on the median and the interquartile range, which are barely affected by extreme values.
Cons:
• It does not bring the features to an identical range, so their scales may still differ somewhat.
Note that all three scalers preserve the shape of the original
distribution of the feature, since they are all linear transformations
(i.e., transformations of the form f(x) = ax + b).
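A quick way to verify this is to check that the scaled values are perfectly correlated with the original ones, which holds exactly for any linear transformation. A small illustrative check:

import numpy as np
from sklearn.preprocessing import StandardScaler

# A skewed sample; any 1-D data works here
x = np.random.RandomState(0).lognormal(size=1000).reshape(-1, 1)
z = StandardScaler().fit_transform(x)

# For f(x) = ax + b with a > 0, the correlation with x is exactly 1,
# so the shape of the distribution is unchanged
print(np.corrcoef(x.ravel(), z.ravel())[0, 1])  # 1.0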
To demonstrate these steps end to end, we will use the Titanic dataset,
where the goal is to predict which passengers survived the disaster. A
minimal loading sketch, assuming the data set is fetched from OpenML:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml

np.random.seed(0)

X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
X.head()
Let’s also check the data types of the features and whether there
are any missing values:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 name 1309 non-null object
2 sex 1309 non-null category
3 age 1046 non-null float64
4 sibsp 1309 non-null float64
5 parch 1309 non-null float64
6 ticket 1309 non-null object
7 fare 1308 non-null float64
8 cabin 295 non-null object
9 embarked 1307 non-null category
10 boat 486 non-null object
11 body 121 non-null float64
12 home.dest 745 non-null object
dtypes: category(2), float64(6), object(5)
memory usage: 115.4+ KB
We can see that out of the 13 features, 6 of them have a ‘float64’ data
type and 7 have a categorical type (‘category’ or ‘object’). However, the
data type by itself is not enough to indicate whether a feature is
numerical or not. For example, the ‘pclass’ feature, despite having a
float64 type, is in fact an ordinal feature with discrete values (1.0, 2.0
or 3.0).
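A quick check of the distinct values confirms this:

# 'pclass' only takes the values 1.0, 2.0 and 3.0
X['pclass'].unique()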
We can also observe that 7 of the features contain missing values: ‘age’,
‘fare’, ‘cabin’, ‘embarked’, ‘boat’, ‘body’, and ‘home.dest’.
Data Cleaning
The features ‘name’ and ‘ticket’ are mostly unique identifiers with little
predictive value, while ‘boat’, ‘body’, and ‘home.dest’ are either known
only after the outcome or contain too many missing values. We can easily
drop these columns by calling the drop() method of the DataFrame:

X.drop(['name', 'ticket', 'boat', 'body', 'home.dest'], axis=1, inplace=True)
X.head()
We can now do more advanced data exploration. First, let’s check the
correlations between the features, including the label:
# Merge the features and the label into one DataFrame
df = pd.concat([X, y.astype('float')], axis=1)

plt.figure(figsize=(6, 4))
# numeric_only=True restricts the correlation matrix to the numerical columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
The correlations between the features
We can see that the features ‘sibsp’ and ‘parch’ are weakly correlated
with the target, which suggests that some feature engineering may be
needed to extract more useful information from them (e.g., combine
them into a single feature of ‘family_size’).
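As a sketch, such a combined feature could be computed as follows (this step is illustrative and is not applied in the rest of the pipeline):

# Hypothetical feature engineering: family size = siblings/spouses +
# parents/children + the passenger themselves
X['family_size'] = X['sibsp'] + X['parch'] + 1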
Let’s also find if there are any outliers in the data by drawing a box
plot:
sns.boxplot(data=X)
We can see that there are many outliers in the ‘fare’ column. This
suggests that using a robust scaler to normalize the data would be a
better choice than a standard scaler.
y.value_counts(normalize=True)

0    0.618029
1    0.381971
Name: survived, dtype: float64
The classes are fairly balanced (61.8% of the passengers did not
survive, and 38.2% survived).
Data Preprocessing
Let’s identify the tasks we need to perform in order to prepare the data
set for modeling:
1. Impute the missing values in both the numerical and the categorical features.
2. Encode the categorical features as one-hot vectors.
3. Scale the numerical features to a common range.
We implement these steps with one pipeline per feature type and combine
them with a ColumnTransformer, as shown below.
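The imports and the feature lists used by the pipelines below are not shown in the extract; a reasonable reconstruction selects the columns by their data type:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.ensemble import RandomForestClassifier

# Assumption: numerical and categorical features are identified by dtype
num_features = X.select_dtypes(include='number').columns
cat_features = X.select_dtypes(exclude='number').columns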
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

num_transformer = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', RobustScaler())
])

preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])

model = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier())
])
Train-Test Split
Before training the model, we split the data set into 80% training and
20% test sets:
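A minimal sketch of the split, assuming a standard train_test_split (the exact random_state is not shown in the extract), followed by fitting the full pipeline on the training set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the preprocessing steps and the classifier in one call
model.fit(X_train, y_train)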
Model Evaluation
Let’s now evaluate the model both on the training and the test sets:
train_acc = model.score(X_train, y_train)
print(f'Train accuracy: {train_acc:.4f}')

test_acc = model.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.4f}')
Final Notes
You can find the code examples of this article on my GitHub:
https://github.com/roiyeho/medium/tree/main/data_preprocessing
Discretization and When to Use It
https://medium.com/@roiyeho/discretization-and-when-to-use-it-649db24e59d1
This article explains what discretization is, when to use it, and how to
apply it to your own data sets using Scikit-Learn.
Discretization Definition
For example, let’s say that we have an age feature with the following
values: 1, 3, 4, 7, 11, 12, 15, 17, 24, 31, 35, 36, 40, 41, 43, 46, 50, 74, 77,
86 and we want to discretize it into 5 bins.
In the equal-width approach, we take the range of the values (in this
case 86 - 1 = 85) and divide it by the number of bins (5), such that the
width of each bin is 85 / 5 = 17. Therefore, the bins in this case would
be [1, 18], (18, 35], (35, 52], (52, 69], and (69, 86].
On the other hand, in the equal-depth approach, each bin should have
the same number of values. Since we have a total of 20 age values, each
bin should have 4 values. Therefore, the bins in this case are [1, 7], (7,
17], (17, 36], (36, 46], and (46, 86].
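Both approaches can be reproduced with KBinsDiscretizer, using strategy='uniform' for equal width and strategy='quantile' for equal depth (a sketch; note that Scikit-Learn computes quantile edges by interpolation, so they may differ slightly from the hand-picked boundaries above):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([1, 3, 4, 7, 11, 12, 15, 17, 24, 31, 35, 36,
                 40, 41, 43, 46, 50, 74, 77, 86]).reshape(-1, 1)

# Equal-width bins: each bin is (86 - 1) / 5 = 17 wide
uniform = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
uniform.fit(ages)
print(uniform.bin_edges_[0])  # [ 1. 18. 35. 52. 69. 86.]

# Equal-depth bins: each bin holds 4 of the 20 values
quantile = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
quantile.fit(ages)
print(quantile.bin_edges_[0])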
Even when the machine learning model can handle continuous values,
discretization can often help it make better use of a continuous-valued
feature.
For example, imagine that we need to predict the price of a house given
its location, represented by its latitude and longitude. Let’s call these
two features x₀ and x₁. If we use these two features directly, our model
may learn an incorrect correlation between the house location and its
price.
For example, if we use linear regression for the prediction, our model’s
hypothesis is:

h(x) = w₀ + w₁x₀ + w₂x₁

This hypothesis is linear in the coordinates, so it can only represent a
price that increases or decreases monotonically with the latitude and the
longitude, which is rarely the case in practice. Instead, we can discretize
the two coordinates into bins and one-hot encode them, so that every bin
gets its own weight.
So now the linear regression model will be able to learn the correlation
between each bin (area) and the house price.
Discretization in Scikit-Learn
In Scikit-Learn, discretization is provided by the KBinsDiscretizer class,
which partitions each feature into bins and, depending on its encode
parameter, returns the bin indices as ordinal integers or as one-hot
vectors.
Discretization Example
For example, let’s build a regression model for the California housing
dataset available at Scikit-Learn. The goal in this data set is to predict
the median house value of a given district (house block) in California,
based on 8 different features of that district (such as the median
income or the average number of rooms per household).
To explore the data set, we merge the features (X) and the labels (y)
into a pandas DataFrame and display the first rows from the table:
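A minimal loading sketch, assuming the data set is fetched with fetch_california_housing and the target column is renamed to 'MedianValue' to match the code below:

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LinearRegression

X, y = fetch_california_housing(as_frame=True, return_X_y=True)

# Merge the features and the label into one DataFrame
df = pd.concat([X, y.rename('MedianValue')], axis=1)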
df.head()
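Before fitting a baseline model, we split the data into training and test sets (the split parameters here are illustrative):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)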
reg = LinearRegression()
reg.fit(X_train, y_train)
print(reg.score(X_train, y_train), reg.score(X_test, y_test))
# The number of bins is illustrative; the original value is not shown
encoder = KBinsDiscretizer(n_bins=20, encode='onehot-dense')
longitude_bins = encoder.fit_transform(df[['Longitude']])
print(longitude_bins)
[[0. 1. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
# Illustrative names for the bin columns
longitude_labels = [f'longitude_{i}' for i in range(longitude_bins.shape[1])]
longitude_df = pd.DataFrame(longitude_bins, columns=longitude_labels)

# Replace the original Longitude column with its one-hot encoded bins
df2 = pd.concat([df.drop('Longitude', axis=1), longitude_df], axis=1)
df2
Let’s do the same for the Latitude column:
latitude_bins = encoder.fit_transform(df[['Latitude']])
latitude_labels = [f'latitude_{i}' for i in range(latitude_bins.shape[1])]
latitude_df = pd.DataFrame(latitude_bins, columns=latitude_labels)
df3 = pd.concat([df2.drop('Latitude', axis=1), latitude_df], axis=1)
df3.head()
X = df3.drop('MedianValue', axis=1)
y = df3['MedianValue']
We split the data set again into training and test sets, and then fit our
model to the training set:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
reg.fit(X_train, y_train)
print(reg.score(X_train, y_train), reg.score(X_test, y_test))
Our R² score has significantly improved both on the training and the
test sets!
Final Notes
You can find the code example of this article on my GitHub repository:
https://github.com/roiyeho/medium/tree/main/discretization