Agenda
• import pandas as pd
• df = pd.DataFrame(data=[['male', 'blue'],
                          ['female', 'brown'],
                          ['male', 'black']],
                    columns=['gender', 'eyes'])
• df1 = pd.get_dummies(df)
• Passing drop_first=True drops the first category of each feature, which avoids perfectly collinear dummy columns (the "dummy variable trap"):
• df2 = pd.get_dummies(df, drop_first=True)
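Putting the slide's snippets together, a minimal runnable sketch of one-hot encoding with and without drop_first:

```python
import pandas as pd

# Small DataFrame matching the slide's example
df = pd.DataFrame(data=[['male', 'blue'],
                        ['female', 'brown'],
                        ['male', 'black']],
                  columns=['gender', 'eyes'])

# One indicator column per category: gender has 2 categories,
# eyes has 3, so df1 has 5 dummy columns
df1 = pd.get_dummies(df)

# drop_first=True drops one column per original feature,
# leaving 1 + 2 = 3 columns here
df2 = pd.get_dummies(df, drop_first=True)
```

With drop_first, each dropped category is still recoverable as "all remaining dummies are zero", so no information is lost.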
Label Encoding
• df['gender'].unique()
• df = pd.read_csv('titanic.csv', usecols=['Embarked'])
• df.dropna(inplace=True)
• If you set inplace=True, the dropna method modifies your DataFrame directly. That means dropna will drop all rows with missing values from your original dataset.
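The slide names label encoding but never shows the encoding step itself. A minimal sketch using pandas' factorize (one common way to label-encode), on a hypothetical stand-in for the Embarked column since titanic.csv is not reproduced here:

```python
import pandas as pd

# Hypothetical stand-in for the titanic 'Embarked' column
# (the real notebook reads it from titanic.csv)
df = pd.DataFrame({'Embarked': ['S', 'C', None, 'Q', 'S']})

# dropna(inplace=True) removes the row with the missing value
df.dropna(inplace=True)

# Label encoding: map each category to an integer code,
# in order of first appearance
df['Embarked_code'], categories = pd.factorize(df['Embarked'])
```

sklearn's LabelEncoder gives the same kind of integer codes (sorted alphabetically rather than by appearance order).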
Feature Engineering Handling Missing Values
• df=pd.read_csv('titanic.csv')
• df.isnull().sum()
• df[df['Embarked'].isnull()]
• Techniques for handling missing values:
Mean/Median/Mode replacement
• df['Age'] = df['Age'].fillna(df['Age'].median())
• df['Age'] = df['Age'].fillna(df['Age'].mean())
• df['Age'].fillna(df.Age.median(), inplace=True)
• df['Cabin'].value_counts()
• df['Cabin'].fillna('Missing', inplace=True)
• df['Cabin'] = df['Cabin'].fillna(df['Cabin'].mode()[0])
• df['Cabin'].mode()[0]
• df['Cabin'] = df['Cabin'].fillna(df['Cabin'].value_counts().index[0])
• # dictionary of lists ('scores' rather than 'dict', which would shadow the builtin)
• import numpy as np
• scores = {'First Score': [100, 90, np.nan, 95],
•          'Second Score': [30, 45, 56, np.nan],
•          'Third Score': [np.nan, 40, 80, 98]}
•
• # creating a dataframe from the dictionary
• df = pd.DataFrame(scores)
•
• # filling missing values using fillna() (returns a new DataFrame; assign it to keep the result)
• df.fillna(0)
• frames = [df1, df2]
  result = pd.concat(frames)
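The concat step can be sketched end to end with two small stand-in frames (hypothetical values, not the df1/df2 built earlier in the deck):

```python
import pandas as pd

# Two small frames standing in for df1 and df2
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# pd.concat stacks the frames vertically (axis=0 by default);
# ignore_index=True renumbers the combined index 0..n-1
frames = [df1, df2]
result = pd.concat(frames, ignore_index=True)
```

Without ignore_index=True the result keeps each frame's original index (0, 1, 0, 1 here), which is usually not what you want after stacking.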
Transformation of Features
• Techniques (Standardization, Scaling)
• Why scaling matters in deep learning:
  1. ANN: gradient descent converges toward the global minimum faster on scaled features
  2. CNN: image inputs are pixel values in the 0-255 range and are scaled before training
  3. RNN
• Types of Transformation:
  1. Normalization and Standardization
  2. Scaling to Minimum and Maximum values
  3. Scaling to Median and Quantiles
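The first technique in the list, standardization, is never shown in code in this deck. A minimal sketch with sklearn's StandardScaler on a toy matrix (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; standardization rescales each column
# to zero mean and unit variance: z = (x - mean) / std
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

After fitting, both columns have mean 0 and standard deviation 1 regardless of their original ranges, which puts features on an equal footing for gradient-based training.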
• plt.hist(df['Fare'],bins=20)
• Min Max Scaling (commonly used for CNN inputs in deep learning) scales the values between 0 and 1.
• X_scaled = (X - X.min) / (X.max - X.min)
• Robust Scaler (scales using the median and IQR, so it is less sensitive to outliers):
  from sklearn.preprocessing import RobustScaler
  scaler = RobustScaler()
  df_robust_scaler = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
  df_robust_scaler.head()
Identifying, Cleaning and replacing outliers
• Outliers Identification
• There are different methods of identifying outliers; we are only going to use some of the most popular techniques:
• Visualization
• Skewness
• Interquartile Range
Outliers And Impact On Machine Learning
• Visualization
• Outliers can be detected using different visualization methods; we are going to use:
• Boxplot
• Histogram
• sns.boxplot(x='Feature', data=df)
• df['Fare'].hist()
• sns.distplot(df['Age'].dropna())   # distplot is deprecated in newer seaborn; use histplot/displot
• sns.distplot(df['Age'].fillna(100))
Outliers Identification
• figure = df.Age.hist(bins=50)
  figure.set_title('Age')
  figure.set_xlabel('Age')
  figure.set_ylabel('No of passengers')
• figure = df.boxplot(column="Age")
• Q1 = df['Fare'].quantile(0.25)
  Q3 = df['Fare'].quantile(0.75)
  IQR = Q3 - Q1
  whisker_width = 1.5
• Fare_outliers = df[(df['Fare'] < Q1 - whisker_width*IQR) |
                     (df['Fare'] > Q3 + whisker_width*IQR)]
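The IQR rule above, run end to end on a handful of hypothetical fares (titanic.csv itself is not reproduced here):

```python
import pandas as pd

# Hypothetical fares; the real notebook uses df['Fare'] from titanic.csv
df = pd.DataFrame({'Fare': [7.25, 8.05, 13.0, 26.0, 512.33]})

Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
whisker_width = 1.5

# Rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
Fare_outliers = df[(df['Fare'] < Q1 - whisker_width * IQR) |
                   (df['Fare'] > Q3 + whisker_width * IQR)]
```

The 1.5 multiplier matches the boxplot whisker convention, so anything flagged here would also appear as a point beyond the whiskers in sns.boxplot.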
• data = df.copy()
  data.loc[data['Age'] >= 73, 'Age'] = 73       # cap extreme ages
  data.loc[data['Fare'] >= 100, 'Fare'] = 100   # cap extreme fares
• figure = data.Age.hist(bins=50)
  figure.set_title('Age')
  figure.set_xlabel('Age')
  figure.set_ylabel('No of passengers')
• figure = data.Fare.hist(bins=50)
  figure.set_title('Fare')
  figure.set_xlabel('Fare')
  figure.set_ylabel('No of passengers')
• Skewness
• For a roughly normal distribution, the skewness value should fall within the range -1 to 1; values well outside this range may indicate the presence of outliers.
• The code below prints the skewness values of the 'Age' and 'Fare' variables.
• print('skewness value of Age: ', df['Age'].skew())
  print('skewness value of Fare: ', df['Fare'].skew())
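The rule of thumb above can be seen on two tiny Series (hypothetical values, not from titanic.csv):

```python
import pandas as pd

# A symmetric sample has skewness 0; a long right tail
# (e.g. a few very large fares) pushes skewness well above 1
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 2, 3, 4, 100])

print('skewness (symmetric):', symmetric.skew())
print('skewness (right-tailed):', right_tailed.skew())
```

This is why Fare in the titanic data, with its handful of very expensive tickets, shows a much larger skew than Age.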
Logistic Regression
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(
      data[['Age', 'Fare']].fillna(0), data['Survived'], test_size=0.3)
• from sklearn.linear_model import LogisticRegression
  classifier = LogisticRegression()
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X_test)
  y_pred1 = classifier.predict_proba(X_test)
• from sklearn.metrics import accuracy_score, roc_auc_score
  print("Accuracy_score: {}".format(accuracy_score(y_test, y_pred)))
  print("roc_auc_score: {}".format(roc_auc_score(y_test, y_pred1[:, 1])))
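Since titanic.csv is not bundled with these slides, here is the same pipeline run end to end on synthetic stand-in data (the feature names and the fare-survival relationship are invented for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in for the titanic Age/Fare/Survived columns
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([rng.uniform(0, 80, n),    # "Age"
                     rng.uniform(0, 100, n)])  # "Fare"
# Invented signal: higher fare -> higher survival odds, plus noise
y = (X[:, 1] + rng.normal(0, 20, n) > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)           # hard class labels
y_proba = classifier.predict_proba(X_test)    # per-class probabilities

acc = accuracy_score(y_test, y_pred)
# roc_auc_score needs the positive-class probability, hence column 1
auc = roc_auc_score(y_test, y_proba[:, 1])
print("Accuracy_score: {}".format(acc))
print("roc_auc_score: {}".format(auc))
```

Note that accuracy is computed from the hard predictions, while ROC AUC is computed from the positive-class probabilities; passing the full predict_proba matrix to roc_auc_score would raise an error for binary targets.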