We will explore the Car Data set and perform exploratory data analysis on it.
The major topics to be covered are:
Removing duplicates
Missing value treatment
Outlier Treatment
Normalization and Scaling (Numerical Variables)
Encoding Categorical Variables (Dummy Variables)
Univariate Analysis
Bivariate Analysis
Missing Value Treatment
import numpy as np
import pandas as pd

# Fill missing numerical values with the column medians
median1=df["INCOME"].median()
median2=df["TRAVEL TIME"].median()
median3=df["MILES CLOCKED"].median()
median4=df["CAR AGE"].median()
df["INCOME"]=df["INCOME"].fillna(median1)
df["TRAVEL TIME"]=df["TRAVEL TIME"].fillna(median2)
df["MILES CLOCKED"]=df["MILES CLOCKED"].fillna(median3)
df["CAR AGE"]=df["CAR AGE"].fillna(median4)
# Fill missing categorical values with the column modes
mode1=df["SEX"].mode().values[0]
mode2=df["MARITAL STATUS"].mode().values[0]
mode3=df["EDUCATION"].mode().values[0]
mode4=df["JOB"].mode().values[0]
mode5=df["USE"].mode().values[0]
mode6=df['CITY'].mode().values[0]
mode7=df["CAR TYPE"].mode().values[0]
mode8=df["POSTAL CODE"].mode().values[0]
df["SEX"]=df["SEX"].fillna(mode1)
df["MARITAL STATUS"]=df["MARITAL STATUS"].fillna(mode2)
df["EDUCATION"]=df["EDUCATION"].fillna(mode3)
df["JOB"]=df["JOB"].fillna(mode4)
df["USE"]=df["USE"].fillna(mode5)
df['CITY']=df['CITY'].fillna(mode6)
df["CAR TYPE"]=df["CAR TYPE"].fillna(mode7)
df["POSTAL CODE"]=df["POSTAL CODE"].fillna(mode8)
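As a sanity check after imputation, the same median/mode fill pattern can be sketched on a small self-contained toy frame (the values below are hypothetical, since the car data set itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data set (hypothetical values)
toy = pd.DataFrame({
    "INCOME": [50000.0, np.nan, 62000.0, 58000.0],
    "SEX": ["M", "F", np.nan, "F"],
})

# Median fill for the numeric column, mode fill for the categorical one
toy["INCOME"] = toy["INCOME"].fillna(toy["INCOME"].median())
toy["SEX"] = toy["SEX"].fillna(toy["SEX"].mode().values[0])

# After imputation, no column should contain missing values
print(toy.isnull().sum().sum())  # → 0
```

Checking `isnull().sum()` after the fill is a quick way to confirm that the treatment covered every column.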
Outlier Treatment
def remove_outlier(col):
    # IQR-based bounds: values beyond 1.5*IQR from the quartiles are outliers
    Q1,Q3=col.quantile([0.25,0.75])
    IQR=Q3-Q1
    lower_range=Q1-(1.5*IQR)
    upper_range=Q3+(1.5*IQR)
    return lower_range,upper_range
# Cap each numerical column at its IQR-based lower and upper bounds
lrincome,urincome=remove_outlier(df['INCOME'])
df['INCOME']=np.where(df['INCOME']>urincome,urincome,df['INCOME'])
df['INCOME']=np.where(df['INCOME']<lrincome,lrincome,df['INCOME'])
lrtravel,urtravel=remove_outlier(df['TRAVEL TIME'])
df['TRAVEL TIME']=np.where(df['TRAVEL TIME']>urtravel,urtravel,df['TRAVEL TIME'])
df['TRAVEL TIME']=np.where(df['TRAVEL TIME']<lrtravel,lrtravel,df['TRAVEL TIME'])
lrmiles,urmiles=remove_outlier(df['MILES CLOCKED'])
df['MILES CLOCKED']=np.where(df['MILES CLOCKED']>urmiles,urmiles,df['MILES CLOCKED'])
df['MILES CLOCKED']=np.where(df['MILES CLOCKED']<lrmiles,lrmiles,df['MILES CLOCKED'])
Removing Duplicates
dups = df.duplicated()   # boolean mask, True for repeated rows
df[dups]                 # inspect the duplicate rows before dropping them
df.drop_duplicates(inplace=True)
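The duplicate-handling pattern above can be demonstrated end to end on a toy frame (hypothetical values, not the car data set):

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical values)
toy = pd.DataFrame({"CAR TYPE": ["SUV", "Sedan", "SUV"],
                    "CAR AGE": [3, 5, 3]})

dups = toy.duplicated()   # marks the second (SUV, 3) row as a duplicate
print(toy[dups])          # shows the duplicate row

toy.drop_duplicates(inplace=True)
print(len(toy))           # → 2
```

By default `duplicated()` keeps the first occurrence and flags later repeats, which is why only one of the two identical rows is dropped.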
Univariate Analysis
1. sns.histplot(df["INCOME"], bins=20)  # histogram of income (distplot is deprecated in recent seaborn)
From the above figure, we can say that the INCOME variable is right-skewed.
2. sns.countplot(x="EDUCATION", hue="SEX", data=df)
From the above graph we can interpret that the majority of the people are high-school
graduates, and this holds for both males and females.
Bivariate Analysis
1. sns.pairplot(df)
plt.show()
In the above plot, scatter diagrams are plotted for all the numerical columns in the dataset. A
scatter plot is a visual representation of the degree of correlation between any two columns.
The pairplot function in seaborn makes it very easy to generate joint scatter plots for all the
numerical columns in the data.
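The degree of correlation that a scatter plot shows visually can also be computed directly. A minimal sketch with NumPy, using toy arrays rather than the actual car data:

```python
import numpy as np

# Two toy columns: miles clocked roughly tracks car age (hypothetical values)
car_age = np.array([1, 2, 3, 4, 5], dtype=float)
miles = np.array([12000, 25000, 35000, 49000, 60000], dtype=float)

# Pearson correlation coefficient between the two columns
r = np.corrcoef(car_age, miles)[0, 1]
print(round(r, 3))
```

A value of r close to +1 corresponds to the tight upward-sloping point cloud you would see in the matching panel of the pair plot.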
2. corr = df.corr(numeric_only=True)  # correlation matrix of the numerical columns
corr
Correlation Heatmap
plt.figure(figsize=(12,7))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
Normalizing and Scaling
Often the variables of a data set are on different scales, i.e., one variable is in the millions and
another only in the hundreds. For example, in our data set INCOME has values in the thousands
while CAR AGE has just two digits. Since these variables are on different scales, it is hard to
compare them.
Feature scaling (also known as data normalization) is the method used to standardize the
range of features of the data. Since the range of values may vary widely, this becomes a
necessary preprocessing step when using machine learning algorithms.
In this method, we convert variables with different scales of measurements into a single
scale.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
df['INCOME'] = sc.fit_transform(df[['INCOME']])  # z = (x - mean) / std
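A quick way to see what StandardScaler does is to check that a transformed column ends up with mean ≈ 0 and standard deviation ≈ 1. A self-contained sketch on toy income values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy income column (hypothetical values)
income = np.array([[30000.0], [45000.0], [60000.0], [85000.0]])

sc = StandardScaler()
scaled = sc.fit_transform(income)  # z = (x - mean) / std

print(round(float(scaled.mean()), 6))  # → 0.0
print(round(float(scaled.std()), 6))   # → 1.0
```

After scaling, every standardized column sits on the same unit-free scale, which is what makes variables like INCOME and CAR AGE comparable.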
ENCODING
One-hot encoding is used to create dummy variables that replace the categories of a
categorical variable with one feature per category, represented as 1 or 0 based on the
presence or absence of that category in the record.
This is required because machine learning algorithms work only on numerical data,
which is why categorical columns need to be converted into numerical ones.
get_dummies is the pandas method that creates a dummy variable for each category.
It is considered good practice to set the parameter drop_first to True whenever
get_dummies is used. It reduces the chance of multicollinearity (which will be covered in
coming courses), and the number of features is also smaller than with
drop_first=False.
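The drop_first behaviour can be sketched on a toy categorical column (hypothetical values, not the car data set):

```python
import pandas as pd

# Toy categorical column with three categories (hypothetical values)
toy = pd.DataFrame({"CAR TYPE": ["SUV", "Sedan", "Pickup", "SUV"]})

# drop_first=True keeps k-1 dummy columns for k categories, avoiding the
# perfectly collinear "dummy trap" (any one column is implied by the rest)
dummies = pd.get_dummies(toy, columns=["CAR TYPE"], drop_first=True)
print(list(dummies.columns))  # two dummy columns for three categories
```

With three categories, only two dummy columns remain; the dropped category is encoded implicitly as all zeros in the remaining columns.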