
MAJOR PROJECT

ON
ARTIFICIAL INTELLIGENCE (EDA)
(Python)

Submitted By: Ashish Aryal


Ai-major-august-google classroom name.
ABSTRACT

When a customer wants to apply Machine Learning (ML) to an identified
business problem, the decision usually follows multiple discussions among
stakeholders from both sides: Business, Architecture, Infrastructure,
Operations, and others. This is quite normal for any new product or
application development.
In the ML world, however, things work differently. New application
development starts from a set of requirements, expressed as sprint plans
or in traditional SDLC form, and the next release plan depends on the
customer. In ML, by contrast, the first step is to identify the data.
Identify the data source(s) and Data Collection

• The organization's key application(s): internal or external applications or websites.
• Streaming data from the web (Twitter/Facebook or any other social media).

Once you’re comfortable with the available data, you can start
work on the rest of the Machine Learning process model.

Machine Learning process


Exploratory Data Analysis(EDA)

EDA is an unavoidable and major step. The given data set(s) are examined
through different forms of analysis to understand the key characteristics
of their entities, such as columns and rows, using Pandas, NumPy,
statistical methods, and data visualization packages.

The outcomes of this phase are as follows:

• Understanding the given dataset, which helps in cleaning it up.
• A clear picture of the features and the relationships between them.
• Guidelines for keeping essential variables and removing non-essential ones.
• Handling missing values and human error.
• Identifying outliers.
• Maximizing the insights gained from the dataset.

This process is time-consuming but very effective. The activities below
are involved during this phase; they vary depending on the available data
and on acceptance from the customer.
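
Two of the outcomes listed above, handling missing values and identifying outliers, can be sketched as follows. This is a minimal illustration, not the method used in this project: the 1.5 × IQR rule is just one common convention, and the helper name `iqr_outliers` and the toy values are assumptions made for the example.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy column (made-up values) with one gap and one extreme value.
values = pd.Series([10.0, 12.0, 11.0, None, 13.0, 12.0, 95.0], name="toy")

# Count the missing entries before deciding how to fill them.
print("missing:", values.isna().sum())

# Fill the gap with the median, which is less sensitive to the extreme value than the mean.
filled = values.fillna(values.median())

# Flag values outside the IQR fences.
print(iqr_outliers(filled))
```

Whether a flagged value should be dropped, capped, or kept depends on the data and on what the customer accepts, so the function only reports candidates and leaves the decision to the analyst.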
PROBLEM:
Take any dataset of your choice, perform EDA (Exploratory Data
Analysis), apply a suitable classifier, regressor, or clusterer, and
calculate the accuracy of the model.

SOLUTION:
print("######################################")
print(" Import Key Packages ")
print("######################################")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error
from sklearn import preprocessing

1. Load the CSV file
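
The loading code itself did not survive in this text, so the following is only a sketch. It assumes the cars data is a CSV file with the columns used later in this report (mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, name). The file name `auto-mpg.csv`, the helper `load_cars`, and the inline sample rows are assumptions; the sample only stands in for the real file so that the snippet runs on its own.

```python
import io
import pandas as pd

# Assumption: a few illustrative rows in the expected layout, standing in for the real file.
SAMPLE_CSV = """mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
25.0,4,98.0,?,2046,19.0,71,1,ford pinto
24.0,4,113.0,95,2372,15.0,70,3,toyota corona mark ii
"""

def load_cars(source):
    """Read the cars data from a file path or a file-like object."""
    return pd.read_csv(source)

# With the real data this would be load_cars("auto-mpg.csv") (hypothetical file name).
df_cars = load_cars(io.StringIO(SAMPLE_CSV))
print(df_cars.shape)
print(df_cars.head())
```

Note the `?` in the horsepower column: it makes pandas read the whole column as text, which is why the cleaning step later converts it to a numeric type.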

EDA (Exploratory Data Analysis)


2. Dataset Information
print("############################################")
print(" Info Of the Data Set")
print("############################################")
df_cars.info()

3. Data Cleaning/Wrangling:
The process of cleaning and unifying messy and complex data sets for
easy access and analysis.

# '?' marks missing horsepower values; coerce anything non-numeric to NaN
df_cars['horsepower'] = pd.to_numeric(df_cars['horsepower'], errors='coerce')
# Fill the gaps with the column mean, then store the column as integers
df_cars['horsepower'] = df_cars['horsepower'].fillna(df_cars['horsepower'].mean())
df_cars['horsepower'] = df_cars['horsepower'].astype(int)
print("######################################################################")
print(" After Cleaning and type conversion in the Data Set")
print("######################################################################")
df_cars.info()

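# Normalize misspelled or abbreviated brand names (the '|' alternation needs regex=True)
df_cars['name'] = df_cars['name'].str.replace('chevroelt|chevrolet|chevy', 'chevrolet', regex=True)
df_cars['name'] = df_cars['name'].str.replace('maxda|mazda', 'mazda', regex=True)
# Longer alternatives come first so that 'mercedes-benz' is fully replaced
df_cars['name'] = df_cars['name'].str.replace('mercedes-benz|mercedes benz|mercedes', 'mercedes', regex=True)
df_cars['name'] = df_cars['name'].str.replace('toyota|toyouta', 'toyota', regex=True)
df_cars['name'] = df_cars['name'].str.replace('vokswagen|volkswagen|vw', 'volkswagen', regex=True)
df_cars.groupby(['name']).sum().head()

# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns_plot = sns.histplot(df_cars["mpg"], kde=True)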

fig, ax = plt.subplots(figsize=(5, 5))
sns.countplot(x="origin", data=df_cars, ax=ax)
# The origin codes 1, 2, 3 are shown as America, Europe, Asia
ax.set_xticks(range(3))
ax.set_xticklabels(['America', 'Europe', 'Asia'])
ax.set_title("Cars manufactured by Countries")
plt.show()
# One row per numeric feature: box plot on the left, distribution on the right
num_cols = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration"]
fig, ax = plt.subplots(6, 2, figsize=(15, 13))
for i, col in enumerate(num_cols):
    sns.boxplot(x=df_cars[col], ax=ax[i, 0])
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(df_cars[col], kde=True, ax=ax[i, 1])
plt.tight_layout()
plt.show()

f,axarr = plt.subplots(4,2, figsize=(10,10))


mpgval = df_cars.mpg.values
axarr[0,0].scatter(df_cars.cylinders.values, mpgval)
axarr[0,0].set_title('Cylinders')
axarr[0,1].scatter(df_cars.displacement.values, mpgval)
axarr[0,1].set_title('Displacement')
axarr[1,0].scatter(df_cars.horsepower.values, mpgval)
axarr[1,0].set_title('Horsepower')
axarr[1,1].scatter(df_cars.weight.values, mpgval)
axarr[1,1].set_title('Weight')
axarr[2,0].scatter(df_cars.acceleration.values, mpgval)
axarr[2,0].set_title('Acceleration')
axarr[2,1].scatter(df_cars["model_year"].values, mpgval)
axarr[2,1].set_title('Model Year')
axarr[3,0].scatter(df_cars.origin.values, mpgval)
axarr[3,0].set_title('Country Mpg')
# Rename x axis labels as USA, Europe and Asia
axarr[3,0].set_xticks([1,2,3])
axarr[3,0].set_xticklabels(["USA","Europe","Asia"])
# Remove the blank plot from the subplots
axarr[3,1].axis("off")
f.text(-0.01, 0.5, 'Mpg', va='center', rotation='vertical', fontsize=12)
plt.tight_layout()
plt.show()
sns.set(rc={'figure.figsize':(11.7,8.27)})
cData_attr = df_cars.iloc[:, 0:7]
# diag_kind='kde' plots a kernel density estimate (KDE) on the diagonal
# instead of a histogram
sns.pairplot(cData_attr, diag_kind='kde')

df_cars.hist(figsize=(12,8),bins=20)
plt.show()
plt.figure(figsize=(10,6))
# numeric_only=True skips the text column 'name' (required in pandas 2.x)
sns.heatmap(df_cars.corr(numeric_only=True), cmap=plt.cm.Reds, annot=True)
plt.title('Heatmap displaying the relationship between\nthe features of the data',
          fontsize=13)
plt.show()

Relationship between the Miles Per Gallon (mpg) and the other features:

• There is a relationship between the mpg variable and the other variables, which satisfies the first assumption of linear regression.
• mpg has a strong negative correlation with displacement, horsepower, weight, and cylinders: as any one of those variables increases, mpg decreases.
• Displacement, horsepower, weight, and cylinders are strongly positively correlated with one another, which violates the non-multicollinearity assumption of linear regression. Multicollinearity hinders the performance and accuracy of our regression model. To avoid this, we have to get rid of some of these variables through feature selection.
• The other variables, i.e. acceleration, model year, and origin, are NOT highly correlated with each other.
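
The multicollinearity noted above can also be quantified with the variance inflation factor (VIF). The sketch below computes it by hand from its definition, VIF_j = 1 / (1 - R_j^2), on synthetic predictors; the helper name `vif_table` and the demo columns are assumptions made for illustration. On the real data, the `variance_inflation_factor` function imported at the top of this report could be applied to the predictor columns instead.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = X.astype(float)
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(X)), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares fit
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic predictors (not the cars data): b is nearly a multiple of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(vif_table(demo).round(2))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, which is one way to decide which of the correlated variables to drop during feature selection.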

I trust that you were able to follow the full EDA flow here, although EDA
offers many more functions than those shown. If you carry out the EDA process
clearly and precisely, you can be far more confident of building your model
selection, hyperparameter tuning, and deployment process effectively, without
further cleaning of your data set. You still have to monitor continuously that
the data and the model output remain suitable for predicting, classifying, or
clustering.
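
The problem statement also asks for a model and its accuracy, a step that is not shown in the text. The following is a minimal baseline sketch using classes imported at the top of this report (`train_test_split`, `LinearRegression`, `r2_score`, `mean_squared_error`). The helper `fit_baseline`, the choice of `weight` and `model_year` as features, and the synthetic stand-in data are all assumptions; for a regressor, "accuracy" is reported here as R² and RMSE on a held-out test split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def fit_baseline(df, features, target="mpg", seed=42):
    """Fit a linear regression on a train split and score it on a held-out test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    return model, r2_score(y_test, pred), rmse

# Synthetic stand-in for the cleaned cars data (made-up relationship, not real results).
rng = np.random.default_rng(1)
weight = rng.uniform(1600, 5000, size=300)
year = rng.integers(70, 83, size=300)
toy = pd.DataFrame({
    "weight": weight,
    "model_year": year,
    "mpg": 45 - 0.007 * weight + 0.3 * (year - 70) + rng.normal(scale=2.0, size=300),
})

model, r2, rmse = fit_baseline(toy, ["weight", "model_year"])
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

On the real data the call would be `fit_baseline(df_cars, [...])` with features chosen after the multicollinearity check, and the other imported regressors (Ridge, Lasso, tree-based models) could be compared against this baseline in the same way.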

Conclusion
This is how we do Exploratory Data Analysis (EDA). EDA helps us to look
beyond the data: the more we explore the data, the more insights we draw
from it. As data analysts, we will spend almost 80% of our time
understanding data and solving various business problems through EDA.
