DataPreparation - Outlier - Treatment ASSIGNMENT 1

Outlier Treatments
Instructions:
Please share your answers filled inline in the word document. Submit code files wherever
applicable.
Please ensure you update all the details:
Name: _______HARI MACHAVARAPU__________________
Batch Id: ___________DSWDCMB 150622H____________

Topic: Data Pre-Processing
Problem Statement:
Most of the datasets have extreme values or exceptions in their observations. These
values affect the predictions (Accuracy) of the model in one way or the other, removing
these values is not a very good option. For these types of scenarios, we have various
techniques to treat such values.
Refer: https://360digitmg.com/mindmap-data-science
1. Prepare the dataset by performing the preprocessing techniques, to treat the

outliers.
© 2013 - 2021 360DigiTMG. All Rights Reserved.

CODE-
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv("C:/Users/hudso/Downloads/DataSets-Data Pre
Processing/DataSets/boston_data.csv")
df.dtypes
df.dtypes #types of data in dataset
df.describe() #to know dimensions of dataset
duplicates = df.duplicated() #to find duplicates in dataset

duplicates
sum(duplicates)
# no duplicates found

# to find outliers on graph
sns.boxplot(df.crim) #outliers are present
sns.boxplot(df.zn) #outliers are present
sns.boxplot(df.indus) #outliers are not present
sns.boxplot(df.nox) #outliers are not present
sns.boxplot(df.rm) #outliers are present
sns.boxplot(df.age) #outliers are not present
sns.boxplot(df.dis) #outliers are present
sns.boxplot(df.rad) #outliers are not present
sns.boxplot(df.tax) #outliers are not present
sns.boxplot(df.ptratio) #outliers are present
sns.boxplot(df.black) #outliers are present
sns.boxplot(df.lstat) #outliers are present
sns.boxplot(df.medv) #outliers are present
#find iqr for columns with outliers to relace them

#for crim column
IQR = df['crim'].quantile(0.75) - df['crim'].quantile(0.25)
lower_limit = df['crim'].quantile(0.25) - (IQR*1.5)
upper_limit = df['crim'].quantile(0.75) + (IQR*1.5)

#for zn column
IQR = df['zn'].quantile(0.75) - df['zn'].quantile(0.25)
lower_limit = df['zn'].quantile(0.25) - (IQR*1.5)
upper_limit = df['zn'].quantile(0.75) + (IQR*1.5)
#for rm column
IQR = df['rm'].quantile(0.75) - df['rm'].quantile(0.25)
lower_limit = df['rm'].quantile(0.25) - (IQR*1.5)
upper_limit = df['rm'].quantile(0.75) + (IQR*1.5)
#for dis column
IQR = df['dis'].quantile(0.75) - df['dis'].quantile(0.25)
lower_limit = df['dis'].quantile(0.25) - (IQR*1.5)
upper_limit = df['dis'].quantile(0.75) + (IQR*1.5)
#for ptratio
IQR = df['ptratio'].quantile(0.75) - df['ptratio'].quantile(0.25)
lower_limit = df['ptratio'].quantile(0.25) - (IQR*1.5)
upper_limit = df['ptratio'].quantile(0.75) + (IQR*1.5)
#for black
IQR = df['black'].quantile(0.75) - df['black'].quantile(0.25)
lower_limit = df['black'].quantile(0.25) - (IQR*1.5)
upper_limit = df['black'].quantile(0.75) + (IQR*1.5)
#for lstat
IQR = df['lstat'].quantile(0.75) - df['lstat'].quantile(0.25)
lower_limit = df['lstat'].quantile(0.25) - (IQR*1.5)
upper_limit = df['lstat'].quantile(0.75) + (IQR*1.5)
#for medv
IQR = df['medv'].quantile(0.75) - df['medv'].quantile(0.25)
lower_limit = df['medv'].quantile(0.25) - (IQR*1.5)
upper_limit = df['medv'].quantile(0.75) + (IQR*1.5)

#winsorization for replacing outliers
#pip install feature_engine
#install the package
from feature_engine.outliers import Winsorizer

#for crim
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5,
variables=['crim'])
df_crim = winsor.fit_transform(df[['crim']])
#for zn
variables=['zn'])
df_zn = winsor.fit_transform(df[['zn']])
#for rm
variables=['rm'])
df_rm = winsor.fit_transform(df[['rm']])
#for dis
variables=['dis'])
df_dis = winsor.fit_transform(df[['dis']])
#for ptratio
variables=['ptratio'])
df_ptratio = winsor.fit_transform(df[['ptratio']])
#for black
variables=['black'])
df_black = winsor.fit_transform(df[['black']])

#for lstat
variables=['lstat'])
df_lstat = winsor.fit_transform(df[['lstat']])
#for medv
variables=['medv'])
df_medv = winsor.fit_transform(df[['medv']])
#check for outliers again

sns.boxplot(df_crim.crim)
sns.boxplot(df_zn.zn)
sns.boxplot(df_rm.rm)
sns.boxplot(df_dis.dis)
sns.boxplot(df_ptratio.ptratio)
sns.boxplot(df_black.black)
sns.boxplot(df_lstat.lstat)
sns.boxplot(df_medv.medv)
Hints:
For each assignment, the solution should be submitted in the below format

1. Work on each feature to create a data dictionary as displayed in the image displayed
below:
2. Hint: Boston dataset is publicly available. Refer to Boston.csv file.

3. Research and perform all possible steps for obtaining solution
4. All the codes (executable programs) should execute without errors
5. Code modularization should be followed
6. Each line of code should have comments explaining the logic and why you are using that
function
7. Detailed explanation of your approach is mandatory

DataPreparation - Outlier - Treatment ASSIGNMENT 1

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DataPreparation - Outlier - Treatment ASSIGNMENT 1

Uploaded by

Copyright:

Available Formats

Outlier Treatments

Please ensure you update all the details:

Name: _HARI MACHAVARAPU____________

Batch Id: _DSWDCMB 150622H__

1. Prepare the dataset by performing the preprocessing techniques, to treat the

© 2013 - 2021 360DigiTMG. All Rights Reserved.

duplicates = df.duplicated() #to find duplicates in dataset

© 2013 - 2021 360DigiTMG. All Rights Reserved.

#find iqr for columns with outliers to relace them

© 2013 - 2021 360DigiTMG. All Rights Reserved.

© 2013 - 2021 360DigiTMG. All Rights Reserved.

from feature_engine.outliers import Winsorizer

© 2013 - 2021 360DigiTMG. All Rights Reserved.

#check for outliers again

© 2013 - 2021 360DigiTMG. All Rights Reserved.

2. Hint: Boston dataset is publicly available. Refer to Boston.csv file.

© 2013 - 2021 360DigiTMG. All Rights Reserved.

You might also like