ON
ARTIFICIAL INTELLIGENCE (EDA)
(Python)
Once you are comfortable with the available data, you can start
work on the rest of the Machine Learning process model.
EDA is unavoidable and one of the major steps: you examine the given
data set(s) through different forms of analysis to understand the key
characteristics of its entities, such as columns and rows, by applying
Pandas, NumPy, statistical methods, and data visualization packages.
SOLUTION:
print("######################################")
print(" Import Key Packages ")
print("######################################")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error
from sklearn import preprocessing
1. Load CSV files
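A minimal sketch of the loading step, assuming the data arrives as a CSV with the column names used later in this document. The in-memory text stands in for a real file path, and its rows are made-up illustrative values, not rows from the original data set.

```python
import io
import pandas as pd

# Hypothetical stand-in for a real CSV file; the rows are illustrative only.
csv_text = """mpg,cylinders,displacement,horsepower,weight,acceleration,model,origin,name
20.0,6,200.0,100,3000,15.0,75,1,brand_a model_x
30.0,4,100.0,?,2000,18.0,80,3,brand_b model_y
15.0,8,350.0,160,4000,12.0,72,1,brand_c model_z
"""

# pd.read_csv accepts a file path or any file-like object.
df_cars = pd.read_csv(io.StringIO(csv_text))

# First look: size of the table and the type pandas inferred for each column.
print(df_cars.shape)
print(df_cars.dtypes)
```

Because one horsepower entry is '?', pandas cannot read that column as numeric, which is why the cleaning step is needed.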
3. Data Cleaning/Wrangling:
The process of cleaning and unifying messy and complex data sets for
easy access and analysis.
# '?' marks a missing horsepower value; regex=False treats it as a literal character
df_cars['horsepower'] = df_cars['horsepower'].str.replace('?','NaN',regex=False).astype(float)
# Fill the missing values with the column mean, then store as whole numbers
df_cars['horsepower'] = df_cars['horsepower'].fillna(df_cars['horsepower'].mean())
df_cars['horsepower'] = df_cars['horsepower'].astype(int)
print("######################################################################")
print(" After Cleaning and type conversion in the Data Set")
print("######################################################################")
df_cars.info()
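As an alternative to replacing '?' by hand, the same cleaning can be sketched with pd.to_numeric, which turns any unparseable entry into NaN. The small Series below is made-up illustrative data, and rounding the mean before the integer cast is an assumption of this sketch, not part of the original code.

```python
import pandas as pd

# Illustrative values only; '?' marks a missing entry.
horsepower = pd.Series(["130", "?", "95", "150"], name="horsepower")

# errors="coerce" converts anything that is not a number into NaN.
hp = pd.to_numeric(horsepower, errors="coerce")
n_missing = int(hp.isna().sum())

# Impute with the mean of the known values, then store as integers.
hp = hp.fillna(hp.mean()).round().astype(int)

print(n_missing)
print(hp.tolist())
```

This approach also catches placeholders other than '?', at the cost of hiding which marker was used.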
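# Normalize misspelled or abbreviated brand names (the patterns are regular expressions)
df_cars['name'] = df_cars['name'].str.replace('chevroelt|chevrolet|chevy','chevrolet',regex=True)
df_cars['name'] = df_cars['name'].str.replace('maxda|mazda','mazda',regex=True)
# Longer alternatives come first, otherwise 'mercedes-benz' would be left unchanged
df_cars['name'] = df_cars['name'].str.replace('mercedes-benz|mercedes benz|mercedes','mercedes',regex=True)
df_cars['name'] = df_cars['name'].str.replace('toyota|toyouta','toyota',regex=True)
df_cars['name'] = df_cars['name'].str.replace('vokswagen|volkswagen|vw','volkswagen',regex=True)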
df_cars.groupby(['name']).sum().head()
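The brand clean-up can also be sketched with a lookup table instead of regular expressions. This version assumes the brand is the first word of the name, which the original code does not state. The `brand_fixes` dictionary and the sample names are hypothetical illustrations.

```python
import pandas as pd

# Illustrative car names containing the misspellings handled above.
names = pd.Series([
    "chevy nova", "chevroelt chevelle", "vw rabbit", "toyouta corona",
    "mercedes-benz 280s", "maxda rx3", "ford pinto",
])

# Hypothetical lookup table: misspelling or abbreviation -> canonical brand.
brand_fixes = {
    "chevy": "chevrolet", "chevroelt": "chevrolet",
    "vw": "volkswagen", "vokswagen": "volkswagen",
    "toyouta": "toyota", "maxda": "mazda",
    "mercedes-benz": "mercedes",
}

# Assumption: the brand is the first whitespace-separated token of the name.
brand = names.str.split().str[0].replace(brand_fixes)

print(brand.value_counts())
```

Matching whole tokens avoids accidentally rewriting a short pattern such as 'vw' when it appears inside another word.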
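# sns.distplot is deprecated; histplot with kde=True draws the histogram plus density curve
sns_plot = sns.histplot(df_cars["mpg"],kde=True)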
df_cars.hist(figsize=(12,8),bins=20)
plt.show()
plt.figure(figsize=(10,6))
# numeric_only=True skips text columns such as 'name'
sns.heatmap(df_cars.corr(numeric_only=True),cmap=plt.cm.Reds,annot=True)
plt.title('Heatmap displaying the relationship between the features of the data',
          fontsize=13)
plt.show()
Relationship between the Miles Per Gallon (mpg) and the other features:
We can see that there is a relationship between the mpg variable and
the other variables, which satisfies the first assumption of linear
regression (a linear relationship between the target and the features).
There is a strong negative correlation between mpg and displacement,
horsepower, weight, and cylinders.
This implies that as any one of those variables increases, mpg
decreases.
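The reading of the heatmap's mpg row can be sketched numerically by sorting each feature's correlation with mpg. The data below is synthetic, generated only to mimic the pattern described (heavier, more powerful cars with lower mpg). The coefficients are assumptions of this sketch, not values from the data set.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not the original data set.
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(1500, 5000, n)
df = pd.DataFrame({
    "mpg": 50 - 0.008 * weight + rng.normal(0, 2, n),
    "weight": weight,
    "horsepower": 0.04 * weight + rng.normal(0, 10, n),
    "acceleration": rng.normal(15, 2.5, n),  # generated independently of mpg
})

# Pearson correlation of every feature with the target, most negative first.
corr_with_mpg = df.corr()["mpg"].drop("mpg").sort_values()
print(corr_with_mpg)
```

Values close to -1 or +1 indicate a strong linear relationship with mpg, while values near 0 indicate little linear relationship.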
There are strong positive correlations among displacement, horsepower,
weight, and cylinders.
This violates the non-multicollinearity assumption of linear
regression.
Multicollinearity hinders the performance and accuracy of our
regression model. To avoid this, we have to get rid of some of
these variables by doing feature selection.
The other variables, i.e. acceleration, model, and origin, are NOT highly
correlated with each other.
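One common way to quantify the multicollinearity described above is the variance inflation factor (VIF), where VIF_j = 1 / (1 - R_j^2) and R_j^2 comes from regressing feature j on the other features. The sketch below computes it as the diagonal of the inverse correlation matrix, which is equivalent. The helper name `vif_table` and the synthetic data are assumptions for illustration. The `variance_inflation_factor` function imported at the top of this document is a library alternative.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column, largest first (hypothetical helper)."""
    corr = features.corr().to_numpy()
    # The j-th diagonal entry of the inverse correlation matrix equals VIF_j.
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF").sort_values(ascending=False)

# Synthetic features for illustration only: three move together, one does not.
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(1500, 5000, n)
features = pd.DataFrame({
    "weight": weight,
    "displacement": 0.07 * weight + rng.normal(0, 20, n),
    "horsepower": 0.04 * weight + rng.normal(0, 10, n),
    "acceleration": rng.normal(15, 2.5, n),
})

vif = vif_table(features)
print(vif)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. Features with the highest values are candidates to drop during feature selection.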
I trust that you were able to follow the full EDA flow here. EDA offers many
more functions than the ones shown. If you carry out the EDA process clearly
and precisely, you are far more likely to get through model selection,
hyperparameter tuning, and deployment without further cleaning of your data
set. You still have to monitor the data and the model output continuously to
confirm that the model remains able to predict, classify, or cluster.
Conclusion
This is how we do Exploratory Data Analysis. Exploratory Data Analysis
(EDA) helps us look beyond the data. The more we explore the data, the
more insights we draw from it. As data analysts, we spend almost 80% of our
time understanding data and solving various business problems
through EDA.