You are on page 1of 7

Outlier Treatments

Instructions:

Please share your answers filled inline in the word document. Submit code files wherever
applicable.

Please ensure you update all the details:

Name: _______HARI MACHAVARAPU__________________

Batch Id: ___________DSWDCMB 150622H____________


Topic: Data Pre-Processing

Problem Statement:
Most of the datasets have extreme values or exceptions in their observations. These
values affect the predictions (Accuracy) of the model in one way or the other, removing
these values is not a very good option. For these types of scenarios, we have various
techniques to treat such values.
Refer: https://360digitmg.com/mindmap-data-science

1. Prepare the dataset by performing the preprocessing techniques, to treat the


outliers.

© 2013 - 2021 360DigiTMG. All Rights Reserved.


CODE-
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv("C:/Users/hudso/Downloads/DataSets-Data Pre
Processing/DataSets/boston_data.csv")
df.dtypes
df.dtypes #types of data in dataset
df.describe() #to know dimensions of dataset

duplicates = df.duplicated() #to find duplicates in dataset


duplicates
sum(duplicates)
# no duplicates found

© 2013 - 2021 360DigiTMG. All Rights Reserved.


# to find outliers on graph
sns.boxplot(df.crim) #outliers are present
sns.boxplot(df.zn) #outliers are present
sns.boxplot(df.indus) #outliers are not present
sns.boxplot(df.nox) #outliers are not present
sns.boxplot(df.rm) #outliers are present
sns.boxplot(df.age) #outliers are not present
sns.boxplot(df.dis) #outliers are present
sns.boxplot(df.rad) #outliers are not present
sns.boxplot(df.tax) #outliers are not present
sns.boxplot(df.ptratio) #outliers are present
sns.boxplot(df.black) #outliers are present
sns.boxplot(df.lstat) #outliers are present
sns.boxplot(df.medv) #outliers are present

#find iqr for columns with outliers to relace them


#for crim column
IQR = df['crim'].quantile(0.75) - df['crim'].quantile(0.25)
lower_limit = df['crim'].quantile(0.25) - (IQR*1.5)
upper_limit = df['crim'].quantile(0.75) + (IQR*1.5)

© 2013 - 2021 360DigiTMG. All Rights Reserved.


#for zn column
IQR = df['zn'].quantile(0.75) - df['zn'].quantile(0.25)
lower_limit = df['zn'].quantile(0.25) - (IQR*1.5)
upper_limit = df['zn'].quantile(0.75) + (IQR*1.5)
#for rm column
IQR = df['rm'].quantile(0.75) - df['rm'].quantile(0.25)
lower_limit = df['rm'].quantile(0.25) - (IQR*1.5)
upper_limit = df['rm'].quantile(0.75) + (IQR*1.5)
#for dis column
IQR = df['dis'].quantile(0.75) - df['dis'].quantile(0.25)
lower_limit = df['dis'].quantile(0.25) - (IQR*1.5)
upper_limit = df['dis'].quantile(0.75) + (IQR*1.5)
#for ptratio
IQR = df['ptratio'].quantile(0.75) - df['ptratio'].quantile(0.25)
lower_limit = df['ptratio'].quantile(0.25) - (IQR*1.5)
upper_limit = df['ptratio'].quantile(0.75) + (IQR*1.5)
#for black
IQR = df['black'].quantile(0.75) - df['black'].quantile(0.25)
lower_limit = df['black'].quantile(0.25) - (IQR*1.5)
upper_limit = df['black'].quantile(0.75) + (IQR*1.5)
#for lstat
IQR = df['lstat'].quantile(0.75) - df['lstat'].quantile(0.25)
lower_limit = df['lstat'].quantile(0.25) - (IQR*1.5)
upper_limit = df['lstat'].quantile(0.75) + (IQR*1.5)
#for medv
IQR = df['medv'].quantile(0.75) - df['medv'].quantile(0.25)
lower_limit = df['medv'].quantile(0.25) - (IQR*1.5)
upper_limit = df['medv'].quantile(0.75) + (IQR*1.5)

© 2013 - 2021 360DigiTMG. All Rights Reserved.


#winsorization for replacing outliers
#pip install feature_engine
#install the package

from feature_engine.outliers import Winsorizer


#for crim
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['crim'])
df_crim = winsor.fit_transform(df[['crim']])
#for zn
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['zn'])
df_zn = winsor.fit_transform(df[['zn']])
#for rm
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['rm'])
df_rm = winsor.fit_transform(df[['rm']])
#for dis
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['dis'])
df_dis = winsor.fit_transform(df[['dis']])
#for ptratio
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['ptratio'])
df_ptratio = winsor.fit_transform(df[['ptratio']])
#for black
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['black'])
df_black = winsor.fit_transform(df[['black']])

© 2013 - 2021 360DigiTMG. All Rights Reserved.


#for lstat
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['lstat'])
df_lstat = winsor.fit_transform(df[['lstat']])
#for medv
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['medv'])
df_medv = winsor.fit_transform(df[['medv']])

#check for outliers again


sns.boxplot(df_crim.crim)
sns.boxplot(df_zn.zn)
sns.boxplot(df_rm.rm)
sns.boxplot(df_dis.dis)
sns.boxplot(df_ptratio.ptratio)
sns.boxplot(df_black.black)
sns.boxplot(df_lstat.lstat)
sns.boxplot(df_medv.medv)

Hints:
For each assignment, the solution should be submitted in the below format

© 2013 - 2021 360DigiTMG. All Rights Reserved.


1. Work on each feature to create a data dictionary as displayed in the image displayed
below:

2. Hint: Boston dataset is publicly available. Refer to Boston.csv file.


3. Research and perform all possible steps for obtaining solution
4. All the codes (executable programs) should execute without errors
5. Code modularization should be followed
6. Each line of code should have comments explaining the logic and why you are using that
function
7. Detailed explanation of your approach is mandatory

© 2013 - 2021 360DigiTMG. All Rights Reserved.

You might also like