Professional Documents
Culture Documents
Random Forest Classifier
Random Forest Classifier
Model to predict whether Employee is gonna stay or leave the company using Random Forest Classifier
In [2]: 1 df=pd.read_csv("Employee_Attrition.csv")
2 df.head()
Out[2]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLe
Research &
1 49 No Travel_Frequently 279 8 1 Life Sciences 1 2 ... 4 80
Development
Research &
2 37 Yes Travel_Rarely 1373 2 2 Other 1 4 ... 2 80
Development
Research &
3 33 No Travel_Frequently 1392 3 4 Life Sciences 1 5 ... 3 80
Development
Research &
4 27 No Travel_Rarely 591 2 1 Medical 1 7 ... 4 80
Development
5 rows × 35 columns
Out[3]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOption
Research &
1465 36 No Travel_Frequently 884 23 2 Medical 1 2061 ... 3 80
Development
Research &
1466 39 No Travel_Rarely 613 6 1 Medical 1 2062 ... 1 80
Development
Research &
1467 27 No Travel_Rarely 155 4 3 Life Sciences 1 2064 ... 2 80
Development
Research &
1469 34 No Travel_Rarely 628 8 3 Medical 1 2068 ... 1 80
Development
5 rows × 35 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 Attrition 1470 non-null object
2 BusinessTravel 1470 non-null object
3 DailyRate 1470 non-null int64
4 Department 1470 non-null object
5 DistanceFromHome 1470 non-null int64
6 Education 1470 non-null int64
7 EducationField 1470 non-null object
8 EmployeeCount 1470 non-null int64
9 EmployeeNumber 1470 non-null int64
10 EnvironmentSatisfaction 1470 non-null int64
11 Gender 1470 non-null object
12 HourlyRate 1470 non-null int64
13 JobInvolvement 1470 non-null int64
14 JobLevel 1470 non-null int64
15 JobRole 1470 non-null object
16 JobSatisfaction 1470 non-null int64
17 MaritalStatus 1470 non-null object
18 MonthlyIncome 1470 non-null int64
19 MonthlyRate 1470 non-null int64
20 NumCompaniesWorked 1470 non-null int64
21 Over18 1470 non-null object
22 OverTime 1470 non-null object
23 PercentSalaryHike 1470 non-null int64
24 PerformanceRating 1470 non-null int64
25 RelationshipSatisfaction 1470 non-null int64
26 StandardHours 1470 non-null int64
27 StockOptionLevel 1470 non-null int64
28 TotalWorkingYears 1470 non-null int64
29 TrainingTimesLastYear 1470 non-null int64
30 WorkLifeBalance 1470 non-null int64
31 YearsAtCompany 1470 non-null int64
32 YearsInCurrentRole 1470 non-null int64
33 YearsSinceLastPromotion 1470 non-null int64
34 YearsWithCurrManager 1470 non-null int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
We could observe that there are totally 1470 records in the dataset, it has both numerical and categorical features and there are no null records
Out[5]: 0
Out[8]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel
0 rows × 35 columns
Out[11]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours
count 1470.000000 1470 1470 1470.000000 1470 1470.000000 1470.000000 1470 1470.0 1470.000000 ... 1470.000000 1470.0
unique NaN 2 3 NaN 3 NaN NaN 6 NaN NaN ... NaN NaN
Research &
top NaN No Travel_Rarely NaN NaN NaN Life Sciences NaN NaN ... NaN NaN
Development
freq NaN 1233 1043 NaN 961 NaN NaN 606 NaN NaN ... NaN NaN
mean 36.923810 NaN NaN 802.485714 NaN 9.192517 2.912925 NaN 1.0 1024.865306 ... 2.712245 80.0
std 9.135373 NaN NaN 403.509100 NaN 8.106864 1.024165 NaN 0.0 602.024335 ... 1.081209 0.0
min 18.000000 NaN NaN 102.000000 NaN 1.000000 1.000000 NaN 1.0 1.000000 ... 1.000000 80.0
25% 30.000000 NaN NaN 465.000000 NaN 2.000000 2.000000 NaN 1.0 491.250000 ... 2.000000 80.0
50% 36.000000 NaN NaN 802.000000 NaN 7.000000 3.000000 NaN 1.0 1020.500000 ... 3.000000 80.0
75% 43.000000 NaN NaN 1157.000000 NaN 14.000000 4.000000 NaN 1.0 1555.750000 ... 4.000000 80.0
max 60.000000 NaN NaN 1499.000000 NaN 29.000000 5.000000 NaN 1.0 2068.000000 ... 4.000000 80.0
11 rows × 35 columns
In [13]: 1 # correlation
2 df.corr()
Out[13]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... RelationshipSatisfa
Age 1.000000 0.010661 -0.001686 0.208034 NaN -0.010145 0.010146 0.024287 0.029820 0.509604 ... 0.05
DailyRate 0.010661 1.000000 -0.004985 -0.016806 NaN -0.050990 0.018355 0.023381 0.046135 0.002966 ... 0.00
DistanceFromHome -0.001686 -0.004985 1.000000 0.021042 NaN 0.032916 -0.016075 0.031131 0.008783 0.005303 ... 0.00
Education 0.208034 -0.016806 0.021042 1.000000 NaN 0.042070 -0.027128 0.016775 0.042438 0.101589 ... -0.00
EmployeeCount NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
EmployeeNumber -0.010145 -0.050990 0.032916 0.042070 NaN 1.000000 0.017621 0.035179 -0.006888 -0.018519 ... -0.06
EnvironmentSatisfaction 0.010146 0.018355 -0.016075 -0.027128 NaN 0.017621 1.000000 -0.049857 -0.008278 0.001212 ... 0.00
HourlyRate 0.024287 0.023381 0.031131 0.016775 NaN 0.035179 -0.049857 1.000000 0.042861 -0.027853 ... 0.00
JobInvolvement 0.029820 0.046135 0.008783 0.042438 NaN -0.006888 -0.008278 0.042861 1.000000 -0.012630 ... 0.03
JobLevel 0.509604 0.002966 0.005303 0.101589 NaN -0.018519 0.001212 -0.027853 -0.012630 1.000000 ... 0.02
JobSatisfaction -0.004892 0.030571 -0.003669 -0.011296 NaN -0.046247 -0.006784 -0.071335 -0.021476 -0.001944 ... -0.01
MonthlyIncome 0.497855 0.007707 -0.017014 0.094961 NaN -0.014829 -0.006259 -0.015794 -0.015271 0.950300 ... 0.02
MonthlyRate 0.028051 -0.032182 0.027473 -0.026084 NaN 0.012648 0.037600 -0.015297 -0.016322 0.039563 ... -0.00
NumCompaniesWorked 0.299635 0.038153 -0.029251 0.126317 NaN -0.001251 0.012594 0.022157 0.015012 0.142501 ... 0.05
PercentSalaryHike 0.003634 0.022704 0.040235 -0.011111 NaN -0.012944 -0.031701 -0.009062 -0.017205 -0.034730 ... -0.04
PerformanceRating 0.001904 0.000473 0.027110 -0.024539 NaN -0.020359 -0.029548 -0.002172 -0.029071 -0.021222 ... -0.03
RelationshipSatisfaction 0.053535 0.007846 0.006557 -0.009118 NaN -0.069861 0.007665 0.001330 0.034297 0.021642 ... 1.00
StandardHours NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
StockOptionLevel 0.037510 0.042143 0.044872 0.018422 NaN 0.062227 0.003432 0.050263 0.021523 0.013984 ... -0.04
TotalWorkingYears 0.680381 0.014515 0.004628 0.148280 NaN -0.014365 -0.002693 -0.002334 -0.005533 0.782208 ... 0.02
TrainingTimesLastYear -0.019621 0.002453 -0.036942 -0.025100 NaN 0.023603 -0.019359 -0.008548 -0.015338 -0.018191 ... 0.00
WorkLifeBalance -0.021490 -0.037848 -0.026556 0.009819 NaN 0.010309 0.027627 -0.004607 -0.014617 0.037818 ... 0.01
YearsAtCompany 0.311309 -0.034055 0.009508 0.069114 NaN -0.011240 0.001458 -0.019582 -0.021355 0.534739 ... 0.01
YearsInCurrentRole 0.212901 0.009932 0.018845 0.060236 NaN -0.008416 0.018007 -0.024106 0.008717 0.389447 ... -0.01
YearsSinceLastPromotion 0.216513 -0.033229 0.010029 0.054254 NaN -0.009019 0.016194 -0.026716 -0.024184 0.353885 ... 0.03
YearsWithCurrManager 0.202089 -0.026363 0.014406 0.069065 NaN -0.009197 -0.004999 -0.020123 0.025976 0.375281 ... -0.00
26 rows × 26 columns
In [14]: 1 plt.subplots(figsize=(20,15))
2 sns.heatmap(df.corr())
In [15]: 1 num_col=[]
2 cat_col=[]
3 for i in df.columns:
4 if(df[i].dtype=='O'):
5 cat_col.append(i)
6 else:
7 num_col.append(i)
Out[16]: ['Age',
'DailyRate',
'DistanceFromHome',
'Education',
'EmployeeCount',
'EmployeeNumber',
'EnvironmentSatisfaction',
'HourlyRate',
'JobInvolvement',
'JobLevel',
'JobSatisfaction',
'MonthlyIncome',
'MonthlyRate',
'NumCompaniesWorked',
'PercentSalaryHike',
'PerformanceRating',
'RelationshipSatisfaction',
'StandardHours',
'StockOptionLevel',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsInCurrentRole',
'YearsSinceLastPromotion',
'YearsWithCurrManager']
Out[17]: ['Attrition',
'BusinessTravel',
'Department',
'EducationField',
'Gender',
'JobRole',
'MaritalStatus',
'Over18',
'OverTime']
Distribution of Data
In [18]: 1 df['Attrition'].value_counts()
Out[18]: No 1233
Yes 237
Name: Attrition, dtype: int64
We could observe that 237 employees out off 1470 employees wanted to resign company
In [19]: 1 df['Gender'].value_counts()
In [31]: 1 df1
Out[31]:
Attrition Gender Count
0 No Female 501
1 No Male 732
2 Yes Female 87
1 We could see majorly Mens wanted to change the company compared to womens
In [33]: 1 df2
Out[33]:
Attrition MaritalStatus Count
0 No Divorced 294
1 No Married 589
2 No Single 350
3 Yes Divorced 33
4 Yes Married 84
We could see majorly Bachelor's are looking to change the company compared to Married and Divorced employees
In [36]: 1 df3
Out[36]:
Attrition JobRole Count
1 No Human Resources 40
3 No Manager 97
5 No Research Director 78
8 No Sales Representative 50
12 Yes Manager 5
Significantly Laboratory Technicians , Research Scientists , Sales Executives and Sales Representatives are looking for job chnage
In [40]: 1 df4
Out[40]:
Attrition BusinessTravel Count
0 No Non-Travel 138
1 No Travel_Frequently 208
2 No Travel_Rarely 887
3 Yes Non-Travel 12
4 Yes Travel_Frequently 69
Those who did travel they are significantly looking to change the company
In [43]: 1 df5
Out[43]:
Attrition Department Count
0 No Human Resources 51
2 No Sales 354
5 Yes Sales 92
Majorly Research & Development and sales departments employees are looking to change the company
In [46]: 1 df_copy.head()
Out[46]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLeve
0 41 1 2 1102 2 1 2 1 1 1 ... 1 80 0
1 49 0 1 279 1 8 1 1 1 2 ... 4 80 1
2 37 1 2 1373 1 2 2 4 1 4 ... 2 80 0
3 33 0 1 1392 1 3 4 1 1 5 ... 3 80 0
4 27 0 2 591 1 2 1 3 1 7 ... 4 80 1
5 rows × 35 columns
In [48]: 1 X.head()
Out[48]:
Age BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate ... RelationshipSatisfaction StandardHours StockOptionLev
0 41 2 1102 2 1 2 1 2 0 94 ... 1 80
1 49 1 279 1 8 1 1 3 1 61 ... 4 80
2 37 2 1373 1 2 2 4 4 1 92 ... 2 80
3 33 1 1392 1 3 4 1 4 0 56 ... 3 80
4 27 2 591 1 2 1 3 1 1 40 ... 4 80
5 rows × 32 columns
In [49]: 1 y.head()
Out[49]: 0 1
1 0
2 1
3 0
4 0
Name: Attrition, dtype: int32
In [50]: 1 X.shape,y.shape
In [54]: 1 X_train.shape,y_train.shape,X_test.shape,y_test.shape
In [56]: 1 Rf_model.fit(X_train,y_train)
Out[56]: RandomForestClassifier()
Training Accuracy
In [57]: 1 Rf_model.score(X_train,y_train)
Out[57]: 1.0
Testing Accuracy
In [58]: 1 y_pred_rf=Rf_model.predict(X_test)
In [60]: 1 accuracy_score(y_test,y_pred_rf)
Out[60]: 0.8586956521739131
Hyperparameter Tuning
In [61]: 1 # we are tuning three hyperparameters right now, we are passing the different values for both parameters
2 grid_param = {
3 "n_estimators" : [90,100,115,130],
4 'criterion': ['gini', 'entropy'],
5 'max_depth' : range(2,20,1),
6 'min_samples_leaf' : range(1,10,1),
7 'min_samples_split': range(2,10,1),
8 'max_features' : ['auto','log2']
9 }
In [63]: 1 grid_searh.fit(X_train,y_train)
In [64]: 1 grid_searh.best_params_
In [66]: 1 Rf_model_with_best_params.fit(X_train,y_train)
In [67]: 1 y_predict_rf_bp=Rf_model_with_best_params.predict(X_test)
In [68]: 1 accuracy_score(y_test,y_predict_rf_bp)
Out[68]: 0.8478260869565217
Bagging Classifier
In [72]: 1 from sklearn.svm import SVC
2 from sklearn.ensemble import BaggingClassifier
3 from sklearn.datasets import make_classification
4
5 model_bagging_svc = BaggingClassifier(base_estimator=SVC(),n_estimators=50, random_state=0).fit(X_train,y_train)
6
In [73]: 1 y_predict_bagging=model_bagging_svc.predict(X_test)
In [74]: 1 accuracy_score(y_test,y_predict_bagging)
Out[74]: 0.842391304347826
In [76]: 1 model.fit(X_train,y_train)
Out[76]: DecisionTreeClassifier()
In [77]: 1 model.score(X_train,y_train)
Out[77]: 1.0
In [79]: 1 model.get_depth()
Out[79]: 15
In [80]: 1 y_predict=model.predict(X_test)
In [82]: 1 accuracy_score(y_test,y_predict)
Out[82]: 0.8070652173913043
In this case we could see Ensemble classifier performing better than normal Decision Tree classifier
In [ ]: 1