
Machine Learning Notes

1. Import statements for the commonly used modules:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

2. Common commands for EDA (exploratory data analysis):

df.isna() / df.isna().sum()   # check for missing values / count them per column
df.info()
df.describe()
df.dropna(axis=0)             # axis=0 drops rows, axis=1 drops columns
df.fillna(value)              # fill missing values with a given value

• To calculate the mean of a column:
df['column_name'].mean()

• To fill missing values with the mean:

x = df['column_name'].mean()
df['column_name'] = df['column_name'].fillna(x)  # reassign; chained inplace=True is unreliable in newer pandas

• To read a CSV file and inspect a column:

df = pd.read_csv('cars.csv')
df["column_name"].unique()        # distinct values in the column
df["column_name"].value_counts()  # count of each value

• To replace a string with a NaN value and change the data type:

df['column_name'] = df['column_name'].replace("string", np.nan)
df['column_name'] = df['column_name'].astype("float")

• To create new DataFrames with specific data types:

# df_cat / df_num = DataFrames with categorical / numerical columns
df_cat = df.select_dtypes(object)
df_num = df.select_dtypes(['int64', 'float64'])

• Steps to handle missing values:

# step 1 - replace the placeholder string with NaN
df['column_name'] = df['column_name'].replace("string", np.nan)

# step 2 - change the data type to float
df['column_name'] = df['column_name'].astype("float")

# step 3 - calculate the mean of the column
x = df['column_name'].mean()

# step 4 - fill the missing values with the mean
df['column_name'] = df['column_name'].fillna(x)

• Label Encoder:
from sklearn.preprocessing import LabelEncoder

for col in df_cat:
    le = LabelEncoder()
    df_cat[col] = le.fit_transform(df_cat[col])

• To drop columns and rows:

df.drop('column_name', axis=1)                   # drop a single column
df.drop(['column_name', 'column_name'], axis=1)  # drop multiple columns
df.drop(index_number)                            # drop a row by its index
# these return a new DataFrame; reassign the result or pass inplace=True to keep the change

• To handle outliers:

# Step 1: make a boxplot with two variables
sns.boxplot(data=df, x='price', y='make')

# Step 2: filter out the outliers
df[(df['make'] == 'dodge') & (df['price'] > 10000)]

# Step 3: drop the outliers by their index
df.drop(29, inplace=True)

• Feature engineering: used to reduce the number of columns / features in the data frame by combining existing ones. E.g., if a dataset has height and width columns, we can create a new column area = height * width and then remove the height and width columns.
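
A minimal sketch of this idea, assuming hypothetical height and width columns in df:

df['area'] = df['height'] * df['width']    # combine the two features into one
df = df.drop(['height', 'width'], axis=1)  # remove the original columns
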
• Skewness and handling skewness:
from scipy.stats import skew

To find the skewness of a column:

skew(df_num['column_name'])

Using a for loop & plotting the distribution of each column:

for col in df_num:
    print(col)
    print(skew(df_num[col]))

    plt.figure()
    sns.histplot(df_num[col], kde=True)  # sns.distplot is deprecated in newer seaborn
    plt.show()

#to find correlation
df_num.corr()
sns.heatmap(df_num.corr(), annot=True)

Note: do not remove the skewness of a column that has a very high correlation with the target, because the transformation will also change its correlation with the target. Also, never remove the skewness of a column with negative values; taking the square root or log of negative numbers gives NaN values.

• To handle skewness, take the square root or the log of that column:

df_num['column_name'] = np.sqrt(df_num['column_name'])
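
A log transform works the same way; a minimal sketch, assuming the column has no negative values (np.log1p computes log(1 + x), so zeros are fine):

df_num['column_name'] = np.log1p(df_num['column_name'])  # log(1 + x); assumes the column is non-negative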

• Scaling:

1. MinMax Scaler
from sklearn.preprocessing import MinMaxScaler

for col in df_new:
    ms = MinMaxScaler()
    df_new[col] = ms.fit_transform(df_new[[col]])

2. Standard Scaler
from sklearn.preprocessing import StandardScaler

for col in df_new:
    sc = StandardScaler()
    df_new[col] = sc.fit_transform(df_new[[col]])

• Requirements for working with data in sklearn:

• Features and response should be separate objects
• Features and response should be numeric
• Features and response should be NumPy arrays
• Features should be a 2D array (n_samples, n_features); the response should be 1D (n_samples,)

x = df.iloc[:, :-1].values  # features -> independent variables
y = df.iloc[:, -1].values   # response -> dependent variable
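
A quick check of these requirements (a minimal sketch):

print(type(x), x.shape)  # numpy.ndarray, 2D: (n_samples, n_features)
print(type(y), y.shape)  # numpy.ndarray, 1D: (n_samples,)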

• Taking care of missing data:

from sklearn.impute import SimpleImputer

# step 1: define the missing value marker & the strategy
si = SimpleImputer(missing_values=np.nan, strategy='mean')

# step 2: fit on the columns that have missing values
si.fit(x[:, 1:3])

# step 3: fill the values using the transform method on the selected columns and save them back
x[:, 1:3] = si.transform(x[:, 1:3])

• Encoding categorical data (One Hot Encoder):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# select and apply the change at the same time
x = np.array(ct.fit_transform(x))

• Splitting the dataset into the training set and test set:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=1)

• Feature Scaling:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
xtrain[:, 3:] = sc.fit_transform(xtrain[:, 3:])
xtest[:, 3:] = sc.transform(xtest[:, 3:])  # only transform the test set, using the scaler fitted on the training set

• Linear regression model:

# step 1: select a model from sklearn
from sklearn.linear_model import LinearRegression

# step 2: create an object of your model
linreg = LinearRegression()

# step 3: train your model
linreg.fit(xtrain, ytrain)

# step 4: predict the values
ypred = linreg.predict(xtest)
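
A quick follow-up check (not in the original notes, but a standard sklearn call): the model's R² score on the test set gives a first sense of fit.

# step 5 (optional): evaluate the fit with the R² score
print(linreg.score(xtest, ytest))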
