
Duplication Typecasting

Instructions:

Please fill in your answers inline in the Word document. Submit Python and R code
files wherever applicable.

Please ensure you update all the details:

Name: __________hari machavarapu_______________

Batch Id: _____dswdcmb 150622h__________________


Topic: Preliminaries for Data Analysis

Problem statement:
Collected data may contain duplicate entries, for example because records were not
captured at regular intervals. Building a reliable solution on such data is difficult.
The common remedies are either to remove the duplicates entirely or to replace the
affected values with logically sound ones. Several techniques exist for treating
these problems.
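
As a minimal sketch of the two remedies named above, the following uses a small made-up frame (only the column names follow the dataset; the rows are hypothetical) to flag exact duplicates, drop them, and alternatively merge repeated lines by summing their quantities. Which remedy is appropriate depends on why the duplicates arose.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow OnlineRetail.csv.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "71053", "22633"],
    "Quantity": [6, 6, 6, 6],
    "UnitPrice": [2.55, 2.55, 3.39, 1.85],
})

# Remedy 1: flag rows that repeat an earlier row exactly, then drop them.
is_dup = toy.duplicated()          # True for a row identical to an earlier one
deduped = toy.drop_duplicates()    # keeps the first occurrence by default

# Remedy 2: treat repeats as genuine repeated lines and merge them,
# summing Quantity per (InvoiceNo, StockCode, UnitPrice).
merged = (
    toy.groupby(["InvoiceNo", "StockCode", "UnitPrice"], as_index=False)["Quantity"]
       .sum()
)

print(int(is_dup.sum()))           # number of exact duplicate rows
print(len(deduped), len(merged))   # row counts after each remedy
```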

Q1. For the given dataset, perform type casting (convert the data types, e.g. float to int).
Q2. Check for duplicate values and handle them (e.g. drop them).
Q3. Perform exploratory data analysis (EDA), such as a histogram, boxplot, scatterplot, etc.
InvoiceNo  StockCode  Description                          Quantity  InvoiceDate     UnitPrice  CustomerID  Country
536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         12/1/2010 8:26  2.55       17850       United Kingdom
536365     71053      WHITE METAL LANTERN                  6         12/1/2010 8:26  3.39       17850       United Kingdom
536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         12/1/2010 8:26  2.75       17850       United Kingdom
536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         12/1/2010 8:26  3.39       17850       United Kingdom
536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         12/1/2010 8:26  3.39       17850       United Kingdom
536365     22752      SET 7 BABUSHKA NESTING BOXES         2         12/1/2010 8:26  7.65       17850       United Kingdom
536365     21730      GLASS STAR FROSTED T-LIGHT HOLDER    6         12/1/2010 8:26  4.25       17850       United Kingdom
536366     22633      HAND WARMER UNION JACK               6         12/1/2010 8:28  1.85       17850       United Kingdom
536366     22632      HAND WARMER RED POLKA DOT            6         12/1/2010 8:28  1.85       17850       United Kingdom

© 2013 - 2020 360DigiTMG. All Rights Reserved.
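
The sample rows above suggest a few type conversions beyond float to int. The sketch below is illustrative only: the frame is hypothetical, typed the way such columns often arrive from a CSV, and the month/day/year date order is an assumption based on the sample.

```python
import pandas as pd

# Hypothetical rows shaped like the sample above; dates arrive as text and
# CustomerID as float, which is an assumption about how the CSV is parsed.
sample = pd.DataFrame({
    "InvoiceNo": [536365, 536365, 536366],
    "Quantity": [6, 8, 6],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:28"],
    "UnitPrice": [2.55, 2.75, 1.85],
    "CustomerID": [17850.0, 17850.0, 17850.0],
})

typed = sample.assign(
    # Text to datetime; month/day/year order is assumed.
    InvoiceDate=pd.to_datetime(sample["InvoiceDate"], format="%m/%d/%Y %H:%M"),
    # Float to nullable integer, so missing IDs would stay missing.
    CustomerID=sample["CustomerID"].astype("Int64"),
    # An identifier rather than a quantity, so store it as text.
    InvoiceNo=sample["InvoiceNo"].astype(str),
    # Float to int truncates the fractional part; round first if that matters.
    UnitPriceInt=sample["UnitPrice"].astype(int),
)

print(typed.dtypes)
```

Casting to the nullable `Int64` type rather than plain `int` is a design choice: plain `int` raises an error when the column contains missing values.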

Code:
import pandas as pd # for data manipulation
import numpy as np # for numerical calculations
import matplotlib.pyplot as plt # for data visualization
import seaborn as sn # for data visualization

# to import and read the dataset
df = pd.read_csv("C:/Users/hudso/Downloads/DataSets-Data Pre Processing/DataSets/OnlineRetail.csv",
                 encoding = 'unicode_escape')

df.dtypes # to know the data type of each column


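# to identify duplicate rows (True marks a row that repeats an earlier one)
duplicates = df.duplicated()
duplicates
duplicates.sum() # number of duplicate rows

# to drop duplicates, keeping the first occurrence of each row
df_dup = df.drop_duplicates()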

# to count null values in each column
df_dup.isnull().sum()

# to drop unnecessary columns
df_new = df_dup.drop(columns = ['Description', 'CustomerID'])



# to create a new column from quantity and unit price
df_new['Total_Price'] = df_new.Quantity * df_new.UnitPrice

df_new.dtypes # Total_Price is float at this point

# to convert float to integer (astype truncates the fractional part)
df_new.Total_Price = df_new.Total_Price.astype('int')

df_new.dtypes
df_new.describe()

# to create a box plot (pass the column by keyword; newer seaborn versions expect this)
sn.boxplot(x = df_new.Total_Price)

# to find the IQR and the upper and lower limits
IQR = df_new.Total_Price.quantile(0.75) - df_new.Total_Price.quantile(0.25)
upper_limit = df_new.Total_Price.quantile(0.75) + (IQR * 1.5)
lower_limit = df_new.Total_Price.quantile(0.25) - (IQR * 1.5)

# winsorization to treat outliers: values beyond the IQR limits are capped
from feature_engine.outliers import Winsorizer
winsor = Winsorizer(capping_method = 'iqr', tail = 'both', fold = 1.5, variables = ['Total_Price'])
df_plot = winsor.fit_transform(df_new[['Total_Price']])
sn.boxplot(x = df_plot.Total_Price) # box plot after capping



# to plot a histogram
plt.hist(df_plot.Total_Price)

# to find skewness and kurtosis
df_plot.Total_Price.skew()
df_plot.Total_Price.kurt()
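
The listing above computes upper_limit and lower_limit by hand but never applies them. As a rough sketch of what IQR-based capping amounts to, the limits can be applied directly with pandas `Series.clip`. The values below are hypothetical, and this sketch is not claimed to reproduce the Winsorizer output exactly.

```python
import pandas as pd

# Hypothetical Total_Price values with one extreme value on each side.
total_price = pd.Series(
    [-500, 10, 12, 15, 15, 17, 20, 22, 25, 900], name="Total_Price"
)

# Same limit formulas as in the listing above.
q1 = total_price.quantile(0.25)
q3 = total_price.quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Cap values outside the limits instead of dropping the rows.
capped = total_price.clip(lower=lower_limit, upper=upper_limit)

# Count how many values were moved by the capping.
n_capped = int((capped != total_price).sum())
print(lower_limit, upper_limit, n_capped)
```

Capping keeps every row, which matters when other columns of the same row are still needed. Dropping outliers instead would shrink the dataset.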

Hints:
For each assignment, the solution should be submitted in the below format
1. Work on each feature of the dataset to create a data dictionary as displayed in the
below image:

2. Consider the OnlineRetail.csv dataset


3. Research and perform all possible steps for obtaining solution



4. All code (executable programs) should execute without errors
5. Code modularization should be followed
6. Each line of code should have comments explaining the logic and why you are using that
function
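
One possible reading of hints 1 and 5 is to wrap each step in a small function and to summarise the columns in a table. The sketch below is only an illustration: the function names (remove_duplicates, add_total_price, data_dictionary, prepare) are hypothetical, and the toy rows stand in for OnlineRetail.csv.

```python
import pandas as pd


def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without exact duplicate rows, with a fresh index."""
    return df.drop_duplicates().reset_index(drop=True)


def add_total_price(df: pd.DataFrame) -> pd.DataFrame:
    """Add Total_Price = Quantity * UnitPrice, cast to int (truncates)."""
    out = df.copy()  # avoid modifying the caller's frame
    out["Total_Price"] = (out["Quantity"] * out["UnitPrice"]).astype(int)
    return out


def data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: dtype, missing count, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "distinct": df.nunique(),
    })


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Run the cleaning steps in order."""
    return add_total_price(remove_duplicates(df))


# Hypothetical toy rows; the first two are exact duplicates.
toy = pd.DataFrame({
    "StockCode": ["85123A", "85123A", "22752"],
    "Quantity": [6, 6, 2],
    "UnitPrice": [2.55, 2.55, 7.65],
})

prepared = prepare(toy)
summary = data_dictionary(prepared)
print(prepared)
print(summary)
```

Keeping each step in its own function makes it possible to test a step on a few rows before running it on the full dataset.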
