We will explore the Car Data set and perform exploratory data analysis on it.
The major topics to be covered are:
Removing duplicates
Missing value treatment
Outlier Treatment
Normalization and Scaling (Numerical Variables)
Encoding Categorical Variables (Dummy Variables)
Univariate Analysis
Bivariate Analysis
Missing Value Treatment
import numpy as np
import pandas as pd

# Fill missing numerical values with the column medians
median1=df["INCOME"].median()
median2=df["TRAVEL TIME"].median()
median3=df["MILES CLOCKED"].median()
median4=df["CAR AGE"].median()
df["INCOME"]=df["INCOME"].fillna(median1)
df["TRAVEL TIME"]=df["TRAVEL TIME"].fillna(median2)
df["MILES CLOCKED"]=df["MILES CLOCKED"].fillna(median3)
df["CAR AGE"]=df["CAR AGE"].fillna(median4)
# Fill missing categorical values with the column modes
mode1=df["SEX"].mode().values[0]
mode2=df["MARITAL STATUS"].mode().values[0]
mode3=df["EDUCATION"].mode().values[0]
mode4=df["JOB"].mode().values[0]
mode5=df["USE"].mode().values[0]
mode6=df['CITY'].mode().values[0]
mode7=df["CAR TYPE"].mode().values[0]
mode8=df["POSTAL CODE"].mode().values[0]
df["SEX"]=df["SEX"].fillna(mode1)
df["MARITAL STATUS"]=df["MARITAL STATUS"].fillna(mode2)
df["EDUCATION"]=df["EDUCATION"].fillna(mode3)
df["JOB"]=df["JOB"].fillna(mode4)
df["USE"]=df["USE"].fillna(mode5)
df['CITY']=df['CITY'].fillna(mode6)
df["CAR TYPE"]=df["CAR TYPE"].fillna(mode7)
df["POSTAL CODE"]=df["POSTAL CODE"].fillna(mode8)
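As a sanity check after imputation, the same median/mode fill pattern can be sketched on a small self-contained toy frame (the values below are hypothetical, since the car data set itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data set (hypothetical values)
toy = pd.DataFrame({
    "INCOME": [50000.0, np.nan, 62000.0, 58000.0],
    "SEX": ["M", "F", np.nan, "F"],
})

# Median fill for the numeric column, mode fill for the categorical one
toy["INCOME"] = toy["INCOME"].fillna(toy["INCOME"].median())
toy["SEX"] = toy["SEX"].fillna(toy["SEX"].mode().values[0])

# After imputation, no column should contain missing values
print(toy.isnull().sum().sum())  # → 0
```

Checking `isnull().sum()` after the fill is a quick way to confirm that the treatment covered every column.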
Outlier Treatment
def remove_outlier(col):
    # IQR-based bounds: values beyond 1.5*IQR from the quartiles are outliers
    Q1,Q3=col.quantile([0.25,0.75])
    IQR=Q3-Q1
    lower_range=Q1-(1.5*IQR)
    upper_range=Q3+(1.5*IQR)
    return lower_range,upper_range
# Cap each numerical column at its IQR-based lower and upper bounds
lrincome,urincome=remove_outlier(df['INCOME'])
df['INCOME']=np.where(df['INCOME']>urincome,urincome,df['INCOME'])
df['INCOME']=np.where(df['INCOME']<lrincome,lrincome,df['INCOME'])
lrtravel,urtravel=remove_outlier(df['TRAVEL TIME'])
df['TRAVEL TIME']=np.where(df['TRAVEL TIME']>urtravel,urtravel,df['TRAVEL TIME'])
df['TRAVEL TIME']=np.where(df['TRAVEL TIME']<lrtravel,lrtravel,df['TRAVEL TIME'])
lrmiles,urmiles=remove_outlier(df['MILES CLOCKED'])
df['MILES CLOCKED']=np.where(df['MILES CLOCKED']>urmiles,urmiles,df['MILES CLOCKED'])
df['MILES CLOCKED']=np.where(df['MILES CLOCKED']<lrmiles,lrmiles,df['MILES CLOCKED'])
Removing Duplicates
dups = df.duplicated()   # boolean mask, True for repeated rows
df[dups]                 # inspect the duplicate rows before dropping them
df.drop_duplicates(inplace=True)
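The duplicate-handling pattern above can be demonstrated end to end on a toy frame (hypothetical values, not the car data set):

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical values)
toy = pd.DataFrame({"CAR TYPE": ["SUV", "Sedan", "SUV"],
                    "CAR AGE": [3, 5, 3]})

dups = toy.duplicated()   # marks the second (SUV, 3) row as a duplicate
print(toy[dups])          # shows the duplicate row

toy.drop_duplicates(inplace=True)
print(len(toy))           # → 2
```

By default `duplicated()` keeps the first occurrence and flags later repeats, which is why only one of the two identical rows is dropped.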
Univariate Analysis
1. sns.histplot(df["INCOME"], bins=20)  # histogram of income (distplot is deprecated in recent seaborn)
From the above figure, we can say that the INCOME variable is right-skewed.
2. sns.countplot(x="EDUCATION", hue="SEX", data=df)
From the above graph we can interpret that the majority of the people are high-school
graduates, and this holds for both males and females.
Bivariate Analysis
1. sns.pairplot(df)
plt.show()
In the above plot, scatter diagrams are plotted for all the numerical columns in the dataset. A
scatter plot is a visual representation of the degree of correlation between any two columns.
The pairplot function in seaborn makes it very easy to generate joint scatter plots for all the
numerical columns in the data.
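The degree of correlation that a scatter plot shows visually can also be computed directly. A minimal sketch with NumPy, using toy arrays rather than the actual car data:

```python
import numpy as np

# Two toy columns: miles clocked roughly tracks car age (hypothetical values)
car_age = np.array([1, 2, 3, 4, 5], dtype=float)
miles = np.array([12000, 25000, 35000, 49000, 60000], dtype=float)

# Pearson correlation coefficient between the two columns
r = np.corrcoef(car_age, miles)[0, 1]
print(round(r, 3))
```

A value of r close to +1 corresponds to the tight upward-sloping point cloud you would see in the matching panel of the pair plot.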
2. corr = df.corr(numeric_only=True)  # correlation matrix of the numerical columns
corr
Correlation Heatmap
plt.figure(figsize=(12,7))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
Normalizing and Scaling
Often the variables of a data set are on different scales, i.e., one variable is in the millions and
another only in the hundreds. For example, in our data set INCOME has values in the thousands
while CAR AGE has just two digits. Since these variables are on different scales, it is hard to
compare them.
Feature scaling (also known as data normalization) is the method used to standardize the
range of features of the data. Since the range of values may vary widely, this becomes a
necessary preprocessing step when using machine learning algorithms.
In this method, we convert variables with different scales of measurements into a single
scale.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
df['INCOME'] = sc.fit_transform(df[['INCOME']])  # z = (x - mean) / std
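A quick way to see what StandardScaler does is to check that a transformed column ends up with mean ≈ 0 and standard deviation ≈ 1. A self-contained sketch on toy income values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy income column (hypothetical values)
income = np.array([[30000.0], [45000.0], [60000.0], [85000.0]])

sc = StandardScaler()
scaled = sc.fit_transform(income)  # z = (x - mean) / std

print(round(float(scaled.mean()), 6))  # → 0.0
print(round(float(scaled.std()), 6))   # → 1.0
```

After scaling, every standardized column sits on the same unit-free scale, which is what makes variables like INCOME and CAR AGE comparable.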
ENCODING
One-hot encoding is used to create dummy variables that replace the categories of a
categorical variable with one feature per category, represented as 1 or 0 based on the
presence or absence of that category in the record.
This is required because machine learning algorithms work only on numerical data,
which is why categorical columns need to be converted into numerical ones.
get_dummies is the pandas method that creates a dummy variable for each category.
It is considered good practice to set the parameter drop_first to True whenever
get_dummies is used. It reduces the chance of multicollinearity (which will be covered in
coming courses), and the number of features is also smaller than with
drop_first=False.
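The drop_first behaviour can be sketched on a toy categorical column (hypothetical values, not the car data set):

```python
import pandas as pd

# Toy categorical column with three categories (hypothetical values)
toy = pd.DataFrame({"CAR TYPE": ["SUV", "Sedan", "Pickup", "SUV"]})

# drop_first=True keeps k-1 dummy columns for k categories, avoiding the
# perfectly collinear "dummy trap" (any one column is implied by the rest)
dummies = pd.get_dummies(toy, columns=["CAR TYPE"], drop_first=True)
print(list(dummies.columns))  # two dummy columns for three categories
```

With three categories, only two dummy columns remain; the dropped category is encoded implicitly as all zeros in the remaining columns.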