
Artificial Intelligence and Machine Learning

PRACTICAL 3

Data Engineering - Feature Engineering

Categorical Feature Imputation and Encoding

Prepared by Nima Dema

29 November 2023

Table of Contents
0. Learning Objectives
1. Imputing Categorical features
1.1. Import required Libraries (Already done for you)
1.2. Load data from CSV file
1.3. Create new cdf containing categorical features
1.4. Check null values in cdf
1.5. Impute Categorical features
1.6. Drop features
2. Encoding Categorical features
2.1. Encode nominal features
2.2. Encode ordinal features
2.3. Encode target feature
2.4. Combine all encoded categorical features.

0. Learning Objectives
In this week’s practical session on data engineering, the focus is on categorical
feature engineering techniques, a crucial aspect of preparing data for machine
learning models. The goal is to equip you with the skills needed to handle
categorical features effectively, including imputing missing values and encoding
features in a form that models can use.

By the end of the lab, you should be able to:

➔ Apply feature engineering techniques to categorical features.
➔ Prepare categorical features for training a machine learning model.

1. Imputing Categorical features


INSTRUCTIONS:

➔ Load the data from the loan_train.csv file into a dataframe named df. Check the
datatypes of all the features. Is every feature stored in the type you expect?
If not, use the astype() method to convert it to the required type.
➔ From df, create a dataframe named cdf that contains only the categorical
features.
➔ Use an appropriate method to check for null values in the dataset.
➔ Use sklearn's SimpleImputer to impute the missing values with the most suitable
strategy.


1.1. Import required Libraries (Already done for you)


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

1.2. Load data from CSV file


…….
…….
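
As a rough illustration of this step, the sketch below loads a small CSV and inspects the inferred datatypes. It is not the lab solution: the inline CSV text is a hypothetical stand-in for loan_train.csv, and the columns ApplicantIncome, Credit_History and Loan_Status, as well as the choice to convert Credit_History, are assumptions made only for this example.

```python
import io
import pandas as pd

# Hypothetical stand-in for loan_train.csv so the sketch runs on its own;
# in the lab you would call pd.read_csv("loan_train.csv") instead.
csv_text = """Loan_ID,Gender,Married,Dependents,Education,ApplicantIncome,Credit_History,Loan_Status
LP001,Male,No,0,Graduate,5849,1.0,Y
LP002,Male,Yes,1,Graduate,4583,1.0,N
LP003,Female,,3+,Not Graduate,2583,0.0,Y
LP004,,Yes,2,Graduate,6000,,Y
"""
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the datatype pandas inferred for each column
print(df.dtypes)

# Credit_History is read as a float although it behaves like a yes/no flag.
# Treating it as categorical is an assumption made here for illustration.
df["Credit_History"] = df["Credit_History"].astype("object")
print(df.dtypes["Credit_History"])
```

Whether a given column should be converted depends on what it represents, so check each feature against its meaning before calling astype().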

1.3. Create new cdf containing categorical features


For this task, use the select_dtypes() method of the pandas dataframe with its
include parameter.

…….
…….
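
A minimal sketch of select_dtypes() on a toy dataframe follows. The data and the ApplicantIncome column are hypothetical, and the sketch assumes the categorical features are stored with the object dtype.

```python
import pandas as pd

# Hypothetical toy frame; the real df comes from loan_train.csv
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5849, 4583, 2583],
})
# Force object dtype for the text columns so the sketch behaves the same
# on pandas versions that infer a dedicated string dtype instead
df = df.astype({"Gender": "object", "Education": "object"})

# Keep only the columns whose dtype is object
cdf = df.select_dtypes(include="object")
print(cdf.columns.tolist())
```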

1.4. Check null values in cdf


Use the isna() or isnull() method to check for null values. You may find the sum()
method useful here.

…….
…….
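
The idea can be sketched on a toy dataframe (hypothetical data): isna() marks each missing cell as True, and sum() then counts the True values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
cdf = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Married": ["No", "Yes", np.nan, np.nan],
})

# Boolean mask of missing cells, summed column by column
null_counts = cdf.isna().sum()
print(null_counts)
```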

1.5. Impute Categorical features


For this task, you need SimpleImputer from sklearn (the import is already done for
you). After imputing, check for null values again to verify that the imputation
succeeded.

from sklearn.impute import SimpleImputer


…….
…….
…….
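
One possible shape of this step is sketched below on hypothetical data. It assumes that "most_frequent" is the suitable strategy, which is a common choice for categorical features because a mean or median is not defined for text values; treat that as an assumption rather than the required answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; missing values are np.nan
cdf = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Married": ["Yes", "Yes", np.nan, "No"],
})

# "most_frequent" replaces each missing value with the column's mode
imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(cdf)

# fit_transform returns a NumPy array, so rebuild the dataframe
cdf = pd.DataFrame(imputed, columns=cdf.columns, index=cdf.index)

# Verify that no null values remain
print(cdf.isna().sum())
```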


1.6. Drop features


The Loan_ID column contains a unique value for every training example, so it carries
no information a model can generalise from. Let's drop it from our dataframe. Use the
drop() method for this task.

…….
…….
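
A short sketch of the drop, using hypothetical rows; only the Loan_ID column name is taken from the text above.

```python
import pandas as pd

# Hypothetical toy frame
cdf = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "Gender": ["Male", "Female", "Male"],
})

# Confirm that the identifier is unique for every row
print(cdf["Loan_ID"].is_unique)

# drop() returns a new dataframe, so assign the result back
cdf = cdf.drop(columns=["Loan_ID"])
print(cdf.columns.tolist())
```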

2. Encoding Categorical features


INSTRUCTIONS:

➔ Select all nominal features in your cdf and apply one-hot encoding. Create
nominaldf, which contains the encoded nominal features.
➔ Select all ordinal features in your cdf and apply OrdinalEncoder to encode
them. Create ordinaldf, which contains the encoded ordinal features.
➔ Use LabelEncoder to encode the target feature.

2.1. Encode nominal features


The values of features such as Gender, Married, and Self_Employed do not possess any
order. They are nominal features. Encode them using OneHotEncoder from sklearn.
For this task, you need to import OneHotEncoder (already done for you). Then
convert the encoded features back to a dataframe, because sklearn returns an array
(for OneHotEncoder, a sparse matrix by default).

from sklearn.preprocessing import OneHotEncoder


# Create the one-hot encoder object
ohe = OneHotEncoder()

# Use the fit_transform method to transform your categorical features into numbers
…….
…….
# Now convert the encoded features into a dataframe
…….


…….

2.2. Encode ordinal features


The values of features such as Dependents, Education, and Property_Area possess a
natural ordering. They are ordinal features. Encode them using OrdinalEncoder from
sklearn. For this task, you need to import OrdinalEncoder (already done for you).
Then convert the encoded features back to a dataframe, because sklearn returns an
array.

from sklearn.preprocessing import OrdinalEncoder

categories_val = [['0', '1', '2', '3+'], ['Not Graduate', 'Graduate'], ['Rural', 'Semiurban', 'Urban']]

# Create the OrdinalEncoder object
oe = OrdinalEncoder(categories=categories_val)

# Use the fit_transform method to transform your categorical features into numbers
…….
…….
# Now convert the encoded features into a dataframe
…….
…….
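
A sketch of the ordinal encoding on hypothetical rows is given below. The ordinal_cols list is a name introduced here for illustration. The lists in categories_val must be given in the same order as the columns passed to the encoder, and each list runs from the lowest to the highest rank.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with the three ordinal features
cdf = pd.DataFrame({
    "Dependents": ["0", "3+", "1", "2"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

# Column order here must match the order of the lists in categories_val
ordinal_cols = ["Dependents", "Education", "Property_Area"]
categories_val = [
    ["0", "1", "2", "3+"],
    ["Not Graduate", "Graduate"],
    ["Rural", "Semiurban", "Urban"],
]

oe = OrdinalEncoder(categories=categories_val)
encoded = oe.fit_transform(cdf[ordinal_cols])

# Each value is replaced by its position in the matching category list
ordinaldf = pd.DataFrame(encoded, columns=ordinal_cols, index=cdf.index)
print(ordinaldf)
```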

2.3. Encode target feature


Encoding the target feature is optional, since many algorithms implemented in sklearn
accept a categorical target. In this step, however, we will encode the target
using LabelEncoder. Note that you don’t have to convert the encoded target to a
dataframe.

from sklearn.preprocessing import LabelEncoder


# Create the LabelEncoder object
le = LabelEncoder()

# Use the fit_transform method to transform your target feature
…….
…….
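
The sketch below shows LabelEncoder on a hypothetical target column. The column name Loan_Status and its Y/N values are assumptions, since the text above does not name the target.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column; the name Loan_Status is an assumption
df = pd.DataFrame({"Loan_Status": ["Y", "N", "Y", "Y"]})

le = LabelEncoder()
# Classes are sorted alphabetically and mapped to 0, 1, ...
target = le.fit_transform(df["Loan_Status"])

print(le.classes_)
print(target)
```

The result is a one-dimensional NumPy array, which can be assigned directly as a dataframe column.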


2.4. Combine all encoded categorical features.


After encoding our categorical features, we have two separate dataframes that store
the nominal and ordinal features. Since we need all of our features to train an ML
model, let's combine them and also add the target to the combined dataframe. The
dataframe that contains all the encoded categorical features is called
categoricaldf. For this task, use the pandas concat() function.

# nominaldf and ordinaldf are both dataframes, so we can concatenate them column-wise
categoricaldf = pd.concat([nominaldf, ordinaldf], axis=1)

# Add the encoded target variable to categoricaldf
categoricaldf['Target'] = target
categoricaldf.head()
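
One thing worth checking before this step is that both dataframes share the same index, because concat() with axis=1 aligns rows on index labels rather than on position. The sketch below uses hypothetical frames with deliberately mismatched indices to show the effect. Whether your own nominaldf and ordinaldf need reset_index() depends on how you built them.

```python
import pandas as pd

# Hypothetical frames: the second one kept a non-contiguous index
nominaldf = pd.DataFrame({"Gender_Male": [1.0, 0.0, 1.0]}, index=[0, 1, 2])
ordinaldf = pd.DataFrame({"Education": [1.0, 0.0, 1.0]}, index=[0, 2, 5])

# Rows are matched by index label, so unmatched labels produce NaN
misaligned = pd.concat([nominaldf, ordinaldf], axis=1)
print(misaligned.shape)

# Resetting both indices makes the rows line up by position
aligned = pd.concat(
    [nominaldf.reset_index(drop=True), ordinaldf.reset_index(drop=True)],
    axis=1,
)
print(aligned.shape)
```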

Note: We will reuse categoricaldf together with the numerical features that we will
work on in next week's lab practical.

THANK YOU ☺
