
Data preprocessing

Agenda

• Why preprocess the data?


• Data cleaning
• Data transformation
• Summary
Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Handle Categorical Features: One-Hot Encoding

• import pandas as pd
• df = pd.DataFrame(data=[['male', 'blue'],
                          ['female', 'brown'],
                          ['male', 'black']], columns=['gender', 'eyes'])
• df1 = pd.get_dummies(df)
• To keep only k-1 dummy columns per feature (avoiding the dummy-variable trap), use drop_first=True:
• df2 = pd.get_dummies(df, drop_first=True)
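• A quick way to check the result (a sketch; the expected column names below assume pandas' default alphabetical ordering of the categories):
  print(df1.columns.tolist())
  # expected: ['gender_female', 'gender_male', 'eyes_black', 'eyes_blue', 'eyes_brown']
  print(df2.columns.tolist())
  # expected (drop_first=True removes the first category of each feature):
  # ['gender_male', 'eyes_blue', 'eyes_brown']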
Label Encoding

• # Import label encoder
• from sklearn import preprocessing
• # The label_encoder object encodes string labels into integers.
• label_encoder = preprocessing.LabelEncoder()
• # Encode labels in column 'species'.
• df['species'] = label_encoder.fit_transform(df['species'])
• df['species'].unique()
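• This slide does not define df; a minimal, self-contained sketch with a hypothetical 'species' column looks like this:
  import pandas as pd
  from sklearn import preprocessing

  # toy stand-in for the undefined df on this slide
  df = pd.DataFrame({'species': ['setosa', 'versicolor', 'setosa', 'virginica']})

  label_encoder = preprocessing.LabelEncoder()
  df['species'] = label_encoder.fit_transform(df['species'])
  print(df['species'].unique())    # integer codes, e.g. [0 1 2]
  print(label_encoder.classes_)    # original labels in encoded order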
Feature Engineering: Handling Missing Values

• df['gender'].unique()
• df = pd.read_csv('titanic.csv', usecols=['Embarked'])
• df.dropna(inplace=True)
• If you set inplace=True, dropna modifies the DataFrame directly: all rows with missing values are dropped from the original dataset rather than returned as a copy.
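• A short sketch of the same operation without modifying the original frame (assuming the same titanic.csv file):
  df = pd.read_csv('titanic.csv', usecols=['Embarked'])
  print(df['Embarked'].isnull().sum())    # count of missing values before dropping
  df_clean = df.dropna()                  # returns a copy; df itself is unchanged
  print(df_clean['Embarked'].isnull().sum())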
Feature Engineering: Handling Missing Values

• df=pd.read_csv('titanic.csv')
• df.isnull().sum()
• df[df['Embarked'].isnull()]
• One of the most common techniques for handling missing values is mean/median/mode replacement.
Feature Engineering: Handling Missing Values

• Mean/median/mode imputation: when should we apply it? Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN values with the mean or median of the variable (or with its most frequent value for categorical features).
• Let's look at the fraction of missing values: df.isnull().mean()
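• Expressed as percentages rather than fractions (a small sketch):
  print((df.isnull().mean() * 100).round(2))   # percentage of missing values per column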
Feature Engineering: Handling Missing Values

• # replace missing Age values with the median (or the mean)
• df['Age'].fillna(df['Age'].median(), inplace=True)
• df['Age'].fillna(df['Age'].mean(), inplace=True)
• df['Cabin'].value_counts()
• df['Cabin'].fillna('Missing', inplace=True)   # flag missing values as their own category
• df['Cabin'].mode()[0]   # most frequent value
• df['Cabin'] = df['Cabin'].fillna(df['Cabin'].value_counts().index[0])   # mode imputation
Feature Engineering: Handling Missing Values

• # using the sklearn-pandas package
• import numpy as np
• from sklearn_pandas import CategoricalImputer
• # handling NaN values (assumes df has a categorical 'Color' column)
• imputer = CategoricalImputer()
• data = np.array(df['Color'], dtype=object)
• imputer.fit_transform(data)
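• If sklearn_pandas is not available, scikit-learn's own SimpleImputer offers the same most-frequent strategy; a sketch, assuming a hypothetical 'Color' column with NaNs:
  import numpy as np
  import pandas as pd
  from sklearn.impute import SimpleImputer

  # hypothetical column standing in for df['Color']
  df = pd.DataFrame({'Color': ['red', 'blue', np.nan, 'red']})

  imputer = SimpleImputer(strategy='most_frequent')
  df['Color'] = imputer.fit_transform(df[['Color']]).ravel()
  print(df['Color'].tolist())   # the NaN is replaced by the most frequent value, 'red'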
Feature Engineering: Handling Missing Values

• # dictionary of lists
• scores = {'First Score': [100, 90, np.nan, 95],
           'Second Score': [30, 45, 56, np.nan],
           'Third Score': [np.nan, 40, 80, 98]}
• # creating a DataFrame from the dictionary
• df = pd.DataFrame(scores)
• # filling missing values using fillna()
• df.fillna(0)
• # concatenating two DataFrames row-wise
• frames = [df1, df2]
• result = pd.concat(frames)
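• Beyond a constant, fillna also accepts a per-column mapping, which is handy when each score column should get its own fill value (a sketch based on the score DataFrame above):
  fill_values = {col: df[col].mean() for col in df.columns}   # column -> its own mean
  df_filled = df.fillna(fill_values)
  print(df_filled)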
Transformation of Features

• Why is transformation of features required?
• Linear Regression uses gradient descent to reach the global minimum, which converges faster when features are on a similar scale.
• Algorithms like KNN, K-Means and Hierarchical Clustering rely on Euclidean distance; every point is a vector with a magnitude and direction, so features on larger scales dominate the distance.
• Deep Learning models are likewise sensitive to the scale of their inputs.
Transformation of Features

• Techniques: standardization and scaling
• 1. ANN ---> global minima, gradient descent
• 2. CNN
• 3. RNN
• In deep learning, image pixel values (0-255) are commonly rescaled before training.
• Types of transformation:
• 1. Normalization and standardization
• 2. Scaling to minimum and maximum values
• 3. Scaling to median and quantiles
Transformation of Features

• Standardization: we try to bring all the variables or features to a similar scale. Standardization means centering the variable at zero with unit variance.
• z = (x - x_mean) / std
• df = pd.read_csv('titanic.csv', usecols=['Pclass', 'Age', 'Fare', 'Survived'])
• df.head()
Transformation of Features

• # standardisation: we use StandardScaler from the sklearn library
• from sklearn.preprocessing import StandardScaler
• scaler = StandardScaler()
• df_scaled = scaler.fit_transform(df)
• import matplotlib.pyplot as plt
• plt.hist(df_scaled[:, 1], bins=20)
• plt.hist(df_scaled[:, 2], bins=20)
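• Note that the Titanic 'Age' column contains missing values; it is common to impute them before scaling so the histograms are complete (a small assumption, not shown on the slide):
  df = df.fillna(df.median())          # impute numeric NaNs with the column medians
  df_scaled = scaler.fit_transform(df)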
Transformation of Features

• plt.hist(df['Fare'], bins=20)
• Min-Max Scaling (commonly used in CNNs and other deep learning techniques) scales the values between 0 and 1.
• X_scaled = (X - X.min) / (X.max - X.min)
Transformation of Features

• from sklearn.preprocessing import MinMaxScaler
• min_max = MinMaxScaler()
• df_minmax = pd.DataFrame(min_max.fit_transform(df), columns=df.columns)
• df_minmax.head()
• plt.hist(df_minmax['Pclass'], bins=20)
• plt.hist(df_minmax['Age'], bins=20)
• plt.hist(df_minmax['Fare'], bins=20)
Transformation of Features

• Robust Scaler:
• The interquartile range (IQR) is the difference between the 75th and 25th percentiles:
• IQR = 75th percentile - 25th percentile
• X_scaled = (X - X.median) / IQR
Transformation of Features

• from sklearn.preprocessing import RobustScaler
• scaler = RobustScaler()
• df_robust_scaler = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
• df_robust_scaler.head()
Identifying, Cleaning and Replacing Outliers

• Outliers Identification
• There are many methods of identifying outliers; we will only use some of the most popular techniques:
• Visualization
• Skewness
• Interquartile Range
Outliers And Impact On Machine Learning

• Which machine learning models are sensitive to outliers?
• 1. Naive Bayes Classifier --- Not sensitive to outliers
• 2. SVM --- Not sensitive to outliers
• 3. Linear Regression --- Sensitive to outliers
• 4. Logistic Regression --- Sensitive to outliers
• 5. Decision Tree Regressor or Classifier --- Not sensitive
• 6. Ensemble (RF, XGBoost) --- Not sensitive
• 7. KNN --- Not sensitive
• 8. K-Means --- Sensitive
• 9. Neural Networks --- Sensitive
Outliers Identification

• Visualization
• Outliers can be detected using different visualization methods; we are going to use:
• Boxplot
• Histogram
• import seaborn as sns
• sns.boxplot(df['Feature'], data=df)   # replace 'Feature' with the column to inspect, e.g. 'Fare'
• df['Fare'].hist()
• sns.distplot(df['Age'].dropna())
• sns.distplot(df['Age'].fillna(100))
Outliers Identification

• figure = df.Age.hist(bins=50)
• figure.set_title('Age')
• figure.set_xlabel('Age')
• figure.set_ylabel('No. of passengers')
• figure = df.boxplot(column='Age')
Outliers Identification

• Q1 = df['Fare'].quantile(0.25)
• Q3 = df['Fare'].quantile(0.75)
• IQR = Q3 - Q1
• whisker_width = 1.5
• Fare_outliers = df[(df['Fare'] < Q1 - whisker_width * IQR) | (df['Fare'] > Q3 + whisker_width * IQR)]
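• To see how many rows the IQR rule flags (a small addition, not on the slide):
  print(len(Fare_outliers), 'outliers out of', len(df), 'rows')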
Outliers Identification

• data = df.copy()
• data.loc[data['Age'] >= 73, 'Age'] = 73      # cap Age outliers at 73
• data.loc[data['Fare'] >= 100, 'Fare'] = 100  # cap Fare outliers at 100
• figure = data.Age.hist(bins=50)
• figure.set_title('Age')
• figure.set_xlabel('Age')
• figure.set_ylabel('No. of passengers')
Outliers Identification

• figure = data.Fare.hist(bins=50)
• figure.set_title('Fare')
• figure.set_xlabel('Fare')
• figure.set_ylabel('No. of passengers')
• sns.barplot(x='Pclass', y='Survived', data=df)
Outliers Identification

• Skewness
• For a roughly normal distribution the skewness value should lie within the range of -1 to 1; values far outside this range may indicate the presence of outliers.
• The code below prints the skewness values of the 'Age' and 'Fare' variables.
• print('skewness value of Age: ', df['Age'].skew())
• print('skewness value of Fare: ', df['Fare'].skew())
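• To scan every numeric column at once (a sketch):
  print(df.select_dtypes(include='number').skew())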
Logistic Regression
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(data[['Age', 'Fare']].fillna(0), data['Survived'], test_size=0.3)
• # Logistic Regression
• from sklearn.linear_model import LogisticRegression
• classifier = LogisticRegression()
• classifier.fit(X_train, y_train)
• y_pred = classifier.predict(X_test)
• y_pred1 = classifier.predict_proba(X_test)
• from sklearn.metrics import accuracy_score, roc_auc_score
• print("Accuracy_score: {}".format(accuracy_score(y_test, y_pred)))
• print("roc_auc_score: {}".format(roc_auc_score(y_test, y_pred1[:, 1])))
