
Data preprocessing

Agenda

• Why preprocess the data?


• Data cleaning
• Data transformation
• Summary
Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Handle Categorical Features: One-Hot Encoding

• import pandas as pd
• df = pd.DataFrame(data=[['male', 'blue'],
                          ['female', 'brown'],
                          ['male', 'black']], columns=['gender', 'eyes'])
• df1 = pd.get_dummies(df)
• To keep only k-1 dummy columns per feature (avoiding the dummy-variable trap), use drop_first=True:
• df2 = pd.get_dummies(df, drop_first=True)
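• A quick way to check the result (a sketch; the expected column names below assume pandas' default alphabetical ordering of the categories):
  print(df1.columns.tolist())
  # expected: ['gender_female', 'gender_male', 'eyes_black', 'eyes_blue', 'eyes_brown']
  print(df2.columns.tolist())
  # expected (drop_first=True removes the first category of each feature):
  # ['gender_male', 'eyes_blue', 'eyes_brown']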
Label Encoding

• # Import label encoder
• from sklearn import preprocessing
• # The label_encoder object encodes string labels into integers.
• label_encoder = preprocessing.LabelEncoder()
• # Encode labels in column 'species'.
• df['species'] = label_encoder.fit_transform(df['species'])
• df['species'].unique()
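• This slide does not define df; a minimal, self-contained sketch with a hypothetical 'species' column looks like this:
  import pandas as pd
  from sklearn import preprocessing

  # toy stand-in for the undefined df on this slide
  df = pd.DataFrame({'species': ['setosa', 'versicolor', 'setosa', 'virginica']})

  label_encoder = preprocessing.LabelEncoder()
  df['species'] = label_encoder.fit_transform(df['species'])
  print(df['species'].unique())    # integer codes, e.g. [0 1 2]
  print(label_encoder.classes_)    # original labels in encoded order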
Feature Engineering: Handling Missing Values

• df['gender'].unique()
• df = pd.read_csv('titanic.csv', usecols=['Embarked'])
• df.dropna(inplace=True)
• If you set inplace=True, dropna modifies the DataFrame directly: all rows with missing values are dropped from the original dataset rather than returned as a copy.
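• A short sketch of the same operation without modifying the original frame (assuming the same titanic.csv file):
  df = pd.read_csv('titanic.csv', usecols=['Embarked'])
  print(df['Embarked'].isnull().sum())    # count of missing values before dropping
  df_clean = df.dropna()                  # returns a copy; df itself is unchanged
  print(df_clean['Embarked'].isnull().sum())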
Feature Engineering: Handling Missing Values

• df=pd.read_csv('titanic.csv')
• df.isnull().sum()
• df[df['Embarked'].isnull()]
• One of the most common techniques for handling missing values is mean/median/mode replacement.
Feature Engineering: Handling Missing Values

• Mean/median/mode imputation: when should we apply it? Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN values with the mean or median of the variable (or with its most frequent value for categorical features).
• Let's look at the fraction of missing values: df.isnull().mean()
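• Expressed as percentages rather than fractions (a small sketch):
  print((df.isnull().mean() * 100).round(2))   # percentage of missing values per column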
Feature Engineering: Handling Missing Values

• # replace missing Age values with the median (or the mean)
• df['Age'].fillna(df['Age'].median(), inplace=True)
• df['Age'].fillna(df['Age'].mean(), inplace=True)
• df['Cabin'].value_counts()
• df['Cabin'].fillna('Missing', inplace=True)   # flag missing values as their own category
• df['Cabin'].mode()[0]   # most frequent value
• df['Cabin'] = df['Cabin'].fillna(df['Cabin'].value_counts().index[0])   # mode imputation
Feature Engineering: Handling Missing Values

• # using the sklearn-pandas package
• import numpy as np
• from sklearn_pandas import CategoricalImputer
• # handling NaN values (assumes df has a categorical 'Color' column)
• imputer = CategoricalImputer()
• data = np.array(df['Color'], dtype=object)
• imputer.fit_transform(data)
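• If sklearn_pandas is not available, scikit-learn's own SimpleImputer offers the same most-frequent strategy; a sketch, assuming a hypothetical 'Color' column with NaNs:
  import numpy as np
  import pandas as pd
  from sklearn.impute import SimpleImputer

  # hypothetical column standing in for df['Color']
  df = pd.DataFrame({'Color': ['red', 'blue', np.nan, 'red']})

  imputer = SimpleImputer(strategy='most_frequent')
  df['Color'] = imputer.fit_transform(df[['Color']]).ravel()
  print(df['Color'].tolist())   # the NaN is replaced by the most frequent value, 'red'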
Feature Engineering: Handling Missing Values

• # dictionary of lists
• scores = {'First Score': [100, 90, np.nan, 95],
           'Second Score': [30, 45, 56, np.nan],
           'Third Score': [np.nan, 40, 80, 98]}
• # creating a DataFrame from the dictionary
• df = pd.DataFrame(scores)
• # filling missing values using fillna()
• df.fillna(0)
• # concatenating two DataFrames row-wise
• frames = [df1, df2]
• result = pd.concat(frames)
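• Beyond a constant, fillna also accepts a per-column mapping, which is handy when each score column should get its own fill value (a sketch based on the score DataFrame above):
  fill_values = {col: df[col].mean() for col in df.columns}   # column -> its own mean
  df_filled = df.fillna(fill_values)
  print(df_filled)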
Transformation of Features

• Why is transformation of features required?
• Linear Regression uses gradient descent to reach the global minimum, which converges faster when features are on a similar scale.
• Algorithms like KNN, K-Means and Hierarchical Clustering rely on Euclidean distance; every point is a vector with a magnitude and direction, so features on larger scales dominate the distance.
• Deep Learning models are likewise sensitive to the scale of their inputs.
Transformation of Features

• Techniques: standardization and scaling
• 1. ANN ---> global minima, gradient descent
• 2. CNN
• 3. RNN
• In deep learning, image pixel values (0-255) are commonly rescaled before training.
• Types of transformation:
• 1. Normalization and standardization
• 2. Scaling to minimum and maximum values
• 3. Scaling to median and quantiles
Transformation of Features

• Standardization: we try to bring all the variables or features to a similar scale. Standardization means centering the variable at zero with unit variance.
• z = (x - x_mean) / std
• df = pd.read_csv('titanic.csv', usecols=['Pclass', 'Age', 'Fare', 'Survived'])
• df.head()
Transformation of Features

• # standardisation: we use StandardScaler from the sklearn library
• from sklearn.preprocessing import StandardScaler
• scaler = StandardScaler()
• df_scaled = scaler.fit_transform(df)
• import matplotlib.pyplot as plt
• plt.hist(df_scaled[:, 1], bins=20)
• plt.hist(df_scaled[:, 2], bins=20)
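• Note that the Titanic 'Age' column contains missing values; it is common to impute them before scaling so the histograms are complete (a small assumption, not shown on the slide):
  df = df.fillna(df.median())          # impute numeric NaNs with the column medians
  df_scaled = scaler.fit_transform(df)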
Transformation of Features

• plt.hist(df['Fare'], bins=20)
• Min-Max Scaling (commonly used in CNNs and other deep learning techniques) scales the values between 0 and 1.
• X_scaled = (X - X.min) / (X.max - X.min)
Transformation of Features

• from sklearn.preprocessing import MinMaxScaler
• min_max = MinMaxScaler()
• df_minmax = pd.DataFrame(min_max.fit_transform(df), columns=df.columns)
• df_minmax.head()
• plt.hist(df_minmax['Pclass'], bins=20)
• plt.hist(df_minmax['Age'], bins=20)
• plt.hist(df_minmax['Fare'], bins=20)
Transformation of Features

• Robust Scaler:
• The interquartile range (IQR) is the difference between the 75th and 25th percentiles:
• IQR = 75th percentile - 25th percentile
• X_scaled = (X - X.median) / IQR
Transformation of Features

• from sklearn.preprocessing import RobustScaler
• scaler = RobustScaler()
• df_robust_scaler = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
• df_robust_scaler.head()
Identifying, Cleaning and Replacing Outliers

• Outliers Identification
• There are many methods of identifying outliers; we will only use some of the most popular techniques:
• Visualization
• Skewness
• Interquartile Range
Outliers And Impact On Machine Learning

• Which machine learning models are sensitive to outliers?
• 1. Naive Bayes Classifier --- Not sensitive to outliers
• 2. SVM --- Not sensitive to outliers
• 3. Linear Regression --- Sensitive to outliers
• 4. Logistic Regression --- Sensitive to outliers
• 5. Decision Tree Regressor or Classifier --- Not sensitive
• 6. Ensemble (RF, XGBoost) --- Not sensitive
• 7. KNN --- Not sensitive
• 8. K-Means --- Sensitive
• 9. Neural Networks --- Sensitive
Outliers Identification

• Visualization
• Outliers can be detected using different visualization methods; we are going to use:
• Boxplot
• Histogram
• import seaborn as sns
• sns.boxplot(df['Feature'], data=df)   # replace 'Feature' with the column to inspect, e.g. 'Fare'
• df['Fare'].hist()
• sns.distplot(df['Age'].dropna())
• sns.distplot(df['Age'].fillna(100))
Outliers Identification

• figure = df.Age.hist(bins=50)
• figure.set_title('Age')
• figure.set_xlabel('Age')
• figure.set_ylabel('No. of passengers')
• figure = df.boxplot(column='Age')
Outliers Identification

• Q1 = df['Fare'].quantile(0.25)
• Q3 = df['Fare'].quantile(0.75)
• IQR = Q3 - Q1
• whisker_width = 1.5
• Fare_outliers = df[(df['Fare'] < Q1 - whisker_width * IQR) | (df['Fare'] > Q3 + whisker_width * IQR)]
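• To see how many rows the IQR rule flags (a small addition, not on the slide):
  print(len(Fare_outliers), 'outliers out of', len(df), 'rows')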
Outliers Identification

• data = df.copy()
• data.loc[data['Age'] >= 73, 'Age'] = 73      # cap Age outliers at 73
• data.loc[data['Fare'] >= 100, 'Fare'] = 100  # cap Fare outliers at 100
• figure = data.Age.hist(bins=50)
• figure.set_title('Age')
• figure.set_xlabel('Age')
• figure.set_ylabel('No. of passengers')
Outliers Identification

• figure = data.Fare.hist(bins=50)
• figure.set_title('Fare')
• figure.set_xlabel('Fare')
• figure.set_ylabel('No. of passengers')
• sns.barplot(x='Pclass', y='Survived', data=df)
Outliers Identification

• Skewness
• For a roughly normal distribution the skewness value should lie within the range of -1 to 1; values far outside this range may indicate the presence of outliers.
• The code below prints the skewness values of the 'Age' and 'Fare' variables.
• print('skewness value of Age: ', df['Age'].skew())
• print('skewness value of Fare: ', df['Fare'].skew())
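• To scan every numeric column at once (a sketch):
  print(df.select_dtypes(include='number').skew())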
Logistic Regression
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(data[['Age', 'Fare']].fillna(0), data['Survived'], test_size=0.3)
• # Logistic Regression
• from sklearn.linear_model import LogisticRegression
• classifier = LogisticRegression()
• classifier.fit(X_train, y_train)
• y_pred = classifier.predict(X_test)
• y_pred1 = classifier.predict_proba(X_test)
• from sklearn.metrics import accuracy_score, roc_auc_score
• print("Accuracy_score: {}".format(accuracy_score(y_test, y_pred)))
• print("roc_auc_score: {}".format(roc_auc_score(y_test, y_pred1[:, 1])))
