
EDA Car Data Set

We will explore the Car data set and perform exploratory data analysis (EDA) on it.
The major topics to be covered are:

 Removing duplicates
 Missing value treatment
 Outlier treatment
 Normalization and scaling (numerical variables)
 Encoding categorical variables (dummy variables)
 Univariate analysis
 Bivariate analysis

Basic Data Exploration


In this step, we will perform the operations below to check what the data set comprises.
We will check the following:

 head of the dataset
 shape of the dataset
 info of the dataset
 summary of the dataset
 check duplicates
 remove duplicates
 check for missing/null values
 univariate analysis
 bivariate analysis
 multivariate analysis
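The basic checks above can be sketched with pandas. The DataFrame below is a tiny toy stand-in for the Car data set, used only so the calls are runnable; the real data would be loaded from file instead:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Car data set (illustrative values only)
df = pd.DataFrame({
    "INCOME": [52000.0, 61000.0, 61000.0, np.nan],
    "CAR AGE": [3, 7, 7, 12],
})

print(df.head())                  # head of the dataset: first rows
print(df.shape)                   # shape: (rows, columns)
df.info()                         # info: dtypes and non-null counts
print(df.describe())              # summary statistics
print(df.duplicated().sum())      # number of duplicate rows
print(df.isnull().sum())          # missing/null values per column
```

Each of these calls is a standard pandas inspection step; the later sections apply them to the full dataset.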

Replacing NULL values in Numerical Columns using Median

# Compute the median of each numerical column, then fill the missing values with it
median1 = df["INCOME"].median()
median2 = df["TRAVEL TIME"].median()
median3 = df["MILES CLOCKED"].median()
median4 = df["CAR AGE"].median()

df["INCOME"] = df["INCOME"].fillna(median1)
df["TRAVEL TIME"] = df["TRAVEL TIME"].fillna(median2)
df["MILES CLOCKED"] = df["MILES CLOCKED"].fillna(median3)
df["CAR AGE"] = df["CAR AGE"].fillna(median4)
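As a minimal self-contained illustration of the median strategy (toy values, not the real dataset), note that the median is computed while ignoring the missing entries and then substituted for them:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"INCOME": [40000.0, 50000.0, np.nan, 90000.0]})

median_income = toy["INCOME"].median()            # median of the non-null values
toy["INCOME"] = toy["INCOME"].fillna(median_income)

print(median_income)            # 50000.0
print(toy["INCOME"].tolist())   # [40000.0, 50000.0, 50000.0, 90000.0]
```

The median is preferred over the mean here because it is not pulled toward extreme values such as the 90000.0 entry.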

Replacing NULL values in Categorical Columns using Mode

# Compute the mode (most frequent value) of each categorical column
mode1 = df["SEX"].mode().values[0]
mode2 = df["MARITAL STATUS"].mode().values[0]
mode3 = df["EDUCATION"].mode().values[0]
mode4 = df["JOB"].mode().values[0]
mode5 = df["USE"].mode().values[0]
mode6 = df["CITY"].mode().values[0]
mode7 = df["CAR TYPE"].mode().values[0]
mode8 = df["POSTAL CODE"].mode().values[0]

# Fill missing values in each column with its own mode
df["SEX"] = df["SEX"].fillna(mode1)
df["MARITAL STATUS"] = df["MARITAL STATUS"].fillna(mode2)
df["EDUCATION"] = df["EDUCATION"].fillna(mode3)
df["JOB"] = df["JOB"].fillna(mode4)
df["USE"] = df["USE"].fillna(mode5)
df["CITY"] = df["CITY"].fillna(mode6)
df["CAR TYPE"] = df["CAR TYPE"].fillna(mode7)
df["POSTAL CODE"] = df["POSTAL CODE"].fillna(mode8)

Outlier Treatment

def remove_outlier(col):
    # IQR-based fences: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are outliers
    Q1, Q3 = col.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

lrincome, urincome = remove_outlier(df['INCOME'])
df['INCOME'] = np.where(df['INCOME'] > urincome, urincome, df['INCOME'])
df['INCOME'] = np.where(df['INCOME'] < lrincome, lrincome, df['INCOME'])

lrtravel, urtravel = remove_outlier(df['TRAVEL TIME'])
df['TRAVEL TIME'] = np.where(df['TRAVEL TIME'] > urtravel, urtravel, df['TRAVEL TIME'])
df['TRAVEL TIME'] = np.where(df['TRAVEL TIME'] < lrtravel, lrtravel, df['TRAVEL TIME'])

lrmiles, urmiles = remove_outlier(df['MILES CLOCKED'])
df['MILES CLOCKED'] = np.where(df['MILES CLOCKED'] > urmiles, urmiles, df['MILES CLOCKED'])
df['MILES CLOCKED'] = np.where(df['MILES CLOCKED'] < lrmiles, lrmiles, df['MILES CLOCKED'])
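The same IQR capping can also be expressed with `Series.clip`, shown here on a toy series so the effect is easy to verify by hand:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an obvious outlier

# IQR fences, same rule as remove_outlier above
q1, q3 = s.quantile([0.25, 0.75])            # 2.0 and 4.0 for this series
iqr = q3 - q1                                # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -1.0 and 7.0

# clip() caps values outside [lower, upper] at the fence values
capped = s.clip(lower, upper)
print(capped.tolist())                       # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Capping (rather than dropping) outliers keeps the row count unchanged, which matters when other columns of the same rows are still useful.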

Check for Duplicate Records

# Check for duplicate rows
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]
Removing Duplicates
df.drop_duplicates(inplace=True)

Univariate Analysis

1. sns.histplot(df.INCOME, bins=20)  # histogram of income; histplot replaces the deprecated distplot

From the figure above, we can say that the INCOME variable is right-skewed.

2. sns.countplot(x="EDUCATION", hue="SEX", data=df)

From the graph above we can see that the majority of people are high-school graduates, and this holds for both males and females.

Bivariate Analysis

1. sns.pairplot(df)
plt.show()

In the plot above, scatter diagrams are drawn for all pairs of numerical columns in the dataset. A scatter plot is a visual representation of the degree of correlation between two columns. The pairplot function in seaborn makes it easy to generate joint scatter plots for all the numerical columns in the data.

2. corr = df.corr(numeric_only=True)
corr

Correlation Heatmap

plt.figure(figsize=(12,7))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
Normalizing and Scaling

Often the variables in a data set are on different scales: one variable may be in the millions while another is only in the hundreds. In our data set, for example, INCOME takes values in the thousands while age is just two digits. Because these variables are on different scales, they are hard to compare directly.

Feature scaling (also known as data normalization) is the method used to standardize the range of features of the data. Since the range of values may vary widely, scaling becomes a necessary data-preprocessing step for many machine learning algorithms.

In this method, we convert variables with different scales of measurement onto a single scale.

StandardScaler standardizes the data using the formula (x - mean) / standard deviation.

We will do this only for the numerical variables.
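The formula can be checked by hand on a toy column (pure NumPy here; StandardScaler applies the same computation, using the population standard deviation):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# z-score: subtract the mean, divide by the (population) standard deviation
z = (x - x.mean()) / x.std()

print(z)  # roughly [-1.2247, 0.0, 1.2247]
```

After scaling, the column has mean 0 and standard deviation 1, so variables that originally lived on very different scales become directly comparable.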

# Scale the data: essentially returns the z-score of every attribute
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# StandardScaler scales each column independently, so the numerical
# columns can be transformed together in one call
num_cols = ['INCOME', 'TRAVEL TIME', 'CAR AGE', 'MILES CLOCKED']
df[num_cols] = sc.fit_transform(df[num_cols])

ENCODING

One-hot encoding replaces the categories of a categorical variable with one feature per category, represented as 1 or 0 based on the presence or absence of that category in the record.

This is required because machine learning algorithms work only on numerical data, which is why categorical columns need to be converted into numerical ones.

get_dummies is the pandas method that creates dummy variables for each categorical variable. It is considered good practice to set the parameter drop_first to True whenever get_dummies is used: it reduces the chance of multicollinearity (which will be covered in coming courses), and the number of features is smaller than with drop_first=False.
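A toy example makes the effect of drop_first visible (the column name and values here are illustrative, not taken from the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({"SEX": ["M", "F", "F", "M"]})

# Without drop_first: one dummy column per category
full = pd.get_dummies(toy, columns=["SEX"], prefix="sex")

# With drop_first: the first category's column is dropped
reduced = pd.get_dummies(toy, columns=["SEX"], prefix="sex", drop_first=True)

print(list(full.columns))     # ['sex_F', 'sex_M']
print(list(reduced.columns))  # ['sex_M']; sex_F is implied when sex_M == 0
```

Dropping the first level loses no information, since a row with zeros in all remaining dummy columns must belong to the dropped category; this is what reduces multicollinearity among the dummies.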

cat_cols = ["MARITAL STATUS", "SEX", "EDUCATION", "JOB", "USE", "CAR TYPE", "CITY"]

dummies = pd.get_dummies(df[cat_cols], columns=cat_cols,
                         prefix=["married", "sex", "Education", "Job", "Use", "cartype", "city"],
                         drop_first=True)

df = pd.concat([df, dummies], axis=1)

# Drop the original categorical columns from df
df.drop(cat_cols, axis=1, inplace=True)
