
Machine Learning Notes

1. Import statements for the commonly used modules:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

2. Common commands for EDA (exploratory data analysis):

df.isna() / df.isna().sum()   # check for missing values / count them per column
df.info()
df.describe()
df.dropna(axis=0)             # axis=0 drops rows, axis=1 drops columns
df.fillna(value)              # fill missing values with a given value

• To calculate the mean of a column:
df['column_name'].mean()

• To fill missing values with the mean:

x = df['column_name'].mean()
df['column_name'] = df['column_name'].fillna(x)  # reassign; chained inplace=True is unreliable in newer pandas

• To read a CSV file and inspect a column:

df = pd.read_csv('cars.csv')
df["column_name"].unique()        # distinct values in the column
df["column_name"].value_counts()  # count of each value

• To replace a string with a NaN value and change the data type:

df['column_name'] = df['column_name'].replace("string", np.nan)
df['column_name'] = df['column_name'].astype("float")

• To create new DataFrames with specific data types:

# df_cat / df_num = DataFrames with categorical / numerical columns
df_cat = df.select_dtypes(object)
df_num = df.select_dtypes(['int64', 'float64'])

• Steps to handle missing values:

# step 1 - replace the placeholder string with NaN
df['column_name'] = df['column_name'].replace("string", np.nan)

# step 2 - change the data type to float
df['column_name'] = df['column_name'].astype("float")

# step 3 - calculate the mean of the column
x = df['column_name'].mean()

# step 4 - fill the missing values with the mean
df['column_name'] = df['column_name'].fillna(x)

• Label Encoder:
from sklearn.preprocessing import LabelEncoder

for col in df_cat:
    le = LabelEncoder()
    df_cat[col] = le.fit_transform(df_cat[col])

• To drop columns and rows:

df.drop('column_name', axis=1)                   # drop a single column
df.drop(['column_name', 'column_name'], axis=1)  # drop multiple columns
df.drop(index_number)                            # drop a row by its index
# these return a new DataFrame; reassign the result or pass inplace=True to keep the change

• To handle outliers:

# Step 1: make a boxplot with two variables
sns.boxplot(data=df, x='price', y='make')

# Step 2: filter out the outliers
df[(df['make'] == 'dodge') & (df['price'] > 10000)]

# Step 3: drop the outliers by their index
df.drop(29, inplace=True)

• Feature engineering: used to reduce the number of columns / features in the data frame by combining existing ones. E.g., if a dataset has height and width columns, we can create a new column area = height * width and then remove the height and width columns.
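
A minimal sketch of this idea, assuming hypothetical height and width columns in df:

df['area'] = df['height'] * df['width']    # combine the two features into one
df = df.drop(['height', 'width'], axis=1)  # remove the original columns
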
• Skewness and handling skewness:
from scipy.stats import skew

To find the skewness of a column:

skew(df_num['column_name'])

Using a for loop & plotting the distribution of each column:

for col in df_num:
    print(col)
    print(skew(df_num[col]))

    plt.figure()
    sns.histplot(df_num[col], kde=True)  # sns.distplot is deprecated in newer seaborn
    plt.show()

#to find correlation
df_num.corr()
sns.heatmap(df_num.corr(), annot=True)

Note: do not remove the skewness of a column that has a very high correlation with the target, because the transformation will also change its correlation with the target. Also, never remove the skewness of a column with negative values; taking the square root or log of negative numbers gives NaN values.

• To handle skewness, take the square root or the log of that column:

df_num['column_name'] = np.sqrt(df_num['column_name'])
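
A log transform works the same way; a minimal sketch, assuming the column has no negative values (np.log1p computes log(1 + x), so zeros are fine):

df_num['column_name'] = np.log1p(df_num['column_name'])  # log(1 + x); assumes the column is non-negative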

• Scaling:

1. MinMax Scaler
from sklearn.preprocessing import MinMaxScaler

for col in df_new:
    ms = MinMaxScaler()
    df_new[col] = ms.fit_transform(df_new[[col]])

2. Standard Scaler
from sklearn.preprocessing import StandardScaler

for col in df_new:
    sc = StandardScaler()
    df_new[col] = sc.fit_transform(df_new[[col]])

• Requirements for working with data in sklearn:

• Features and response should be separate objects
• Features and response should be numeric
• Features and response should be NumPy arrays
• Features should be a 2D array (n_samples, n_features); the response should be 1D (n_samples,)

x = df.iloc[:, :-1].values  # features -> independent variables
y = df.iloc[:, -1].values   # response -> dependent variable
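
A quick check of these requirements (a minimal sketch):

print(type(x), x.shape)  # numpy.ndarray, 2D: (n_samples, n_features)
print(type(y), y.shape)  # numpy.ndarray, 1D: (n_samples,)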

• Taking care of missing data:

from sklearn.impute import SimpleImputer

# step 1: define the missing value marker & the strategy
si = SimpleImputer(missing_values=np.nan, strategy='mean')

# step 2: fit on the columns that have missing values
si.fit(x[:, 1:3])

# step 3: fill the values using the transform method on the selected columns and save them back
x[:, 1:3] = si.transform(x[:, 1:3])

• Encoding categorical data (One Hot Encoder):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# select and apply the change at the same time
x = np.array(ct.fit_transform(x))

• Splitting the dataset into the training set and test set:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=1)

• Feature Scaling:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
xtrain[:, 3:] = sc.fit_transform(xtrain[:, 3:])
xtest[:, 3:] = sc.transform(xtest[:, 3:])  # only transform the test set, using the scaler fitted on the training set

• Linear regression model:

# step 1: select a model from sklearn
from sklearn.linear_model import LinearRegression

# step 2: create an object of your model
linreg = LinearRegression()

# step 3: train your model
linreg.fit(xtrain, ytrain)

# step 4: predict the values
ypred = linreg.predict(xtest)
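
A quick follow-up check (not in the original notes, but a standard sklearn call): the model's R² score on the test set gives a first sense of fit.

# step 5 (optional): evaluate the fit with the R² score
print(linreg.score(xtest, ytest))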
