Random Forest Classifier

2/22/23, 7:35 PM RandomForestClassifier - Jupyter Notebook
Model to predict whether Employee is gonna stay or leave the company using Random Forest Classifier
Import the libraries
In [1]: 1 import numpy as np

2 import pandas as pd
3 import matplotlib.pyplot as plt
4 %matplotlib inline
5 import seaborn as sns
6 import warnings
7 warnings.filterwarnings("ignore")
Import the Dataset
In [2]: 1 df=pd.read_csv("Employee_Attrition.csv")
2 df.head()
Out[2]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLe
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80
Research &
1 49 No Travel_Frequently 279 8 1 Life Sciences 1 2 ... 4 80
Development
Research &
2 37 Yes Travel_Rarely 1373 2 2 Other 1 4 ... 2 80
Development
Research &
3 33 No Travel_Frequently 1392 3 4 Life Sciences 1 5 ... 3 80
Development
Research &
4 27 No Travel_Rarely 591 2 1 Medical 1 7 ... 4 80
Development
5 rows × 35 columns
localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 1/18

In [3]: 1 # Last 5 records

2 df.tail()
Out[3]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOption
Research &
1465 36 No Travel_Frequently 884 23 2 Medical 1 2061 ... 3 80
Development
Research &
Development
Research &
1467 27 No Travel_Rarely 155 4 3 Life Sciences 1 2064 ... 2 80
Development
1468 49 No Travel_Frequently 1023 Sales 2 3 Medical 1 2065 ... 4 80
Research &
Development

In [4]: 1 # Summary of dataset

2 df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 Attrition 1470 non-null object
2 BusinessTravel 1470 non-null object
3 DailyRate 1470 non-null int64
4 Department 1470 non-null object
5 DistanceFromHome 1470 non-null int64
6 Education 1470 non-null int64
7 EducationField 1470 non-null object
8 EmployeeCount 1470 non-null int64
9 EmployeeNumber 1470 non-null int64
10 EnvironmentSatisfaction 1470 non-null int64
11 Gender 1470 non-null object
12 HourlyRate 1470 non-null int64
13 JobInvolvement 1470 non-null int64
14 JobLevel 1470 non-null int64
15 JobRole 1470 non-null object
16 JobSatisfaction 1470 non-null int64
17 MaritalStatus 1470 non-null object
18 MonthlyIncome 1470 non-null int64
19 MonthlyRate 1470 non-null int64
20 NumCompaniesWorked 1470 non-null int64
21 Over18 1470 non-null object
22 OverTime 1470 non-null object
23 PercentSalaryHike 1470 non-null int64
24 PerformanceRating 1470 non-null int64
25 RelationshipSatisfaction 1470 non-null int64
26 StandardHours 1470 non-null int64
27 StockOptionLevel 1470 non-null int64
28 TotalWorkingYears 1470 non-null int64
29 TrainingTimesLastYear 1470 non-null int64
30 WorkLifeBalance 1470 non-null int64
31 YearsAtCompany 1470 non-null int64
32 YearsInCurrentRole 1470 non-null int64
33 YearsSinceLastPromotion 1470 non-null int64
34 YearsWithCurrManager 1470 non-null int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
We could observe that there are totally 1470 records in the dataset, it has both numerical and categorical features and there are no null records

In [5]: 1 # checking nul values

2 df.isnull().sum().sum()
Out[5]: 0
In [8]: 1 # Checking duplicate records

2 df[df.duplicated()==True]
Out[8]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel
We could see there are no duplicate records in the dataset
In [11]: 1 # Statistical analysis of the dataset

2 df.describe(include='all')
Out[11]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours
count 1470.000000 1470 1470 1470.000000 1470 1470.000000 1470.000000 1470 1470.0 1470.000000 ... 1470.000000 1470.0
unique NaN 2 3 NaN 3 NaN NaN 6 NaN NaN ... NaN NaN
Research &
top NaN No Travel_Rarely NaN NaN NaN Life Sciences NaN NaN ... NaN NaN
Development
freq NaN 1233 1043 NaN 961 NaN NaN 606 NaN NaN ... NaN NaN
mean 36.923810 NaN NaN 802.485714 NaN 9.192517 2.912925 NaN 1.0 1024.865306 ... 2.712245 80.0
std 9.135373 NaN NaN 403.509100 NaN 8.106864 1.024165 NaN 0.0 602.024335 ... 1.081209 0.0
min 18.000000 NaN NaN 102.000000 NaN 1.000000 1.000000 NaN 1.0 1.000000 ... 1.000000 80.0
25% 30.000000 NaN NaN 465.000000 NaN 2.000000 2.000000 NaN 1.0 491.250000 ... 2.000000 80.0
50% 36.000000 NaN NaN 802.000000 NaN 7.000000 3.000000 NaN 1.0 1020.500000 ... 3.000000 80.0
75% 43.000000 NaN NaN 1157.000000 NaN 14.000000 4.000000 NaN 1.0 1555.750000 ... 4.000000 80.0
max 60.000000 NaN NaN 1499.000000 NaN 29.000000 5.000000 NaN 1.0 2068.000000 ... 4.000000 80.0

In [13]: 1 # correlation
2 df.corr()
Out[13]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... RelationshipSatisfa
Age 1.000000 0.010661 -0.001686 0.208034 NaN -0.010145 0.010146 0.024287 0.029820 0.509604 ... 0.05
DailyRate 0.010661 1.000000 -0.004985 -0.016806 NaN -0.050990 0.018355 0.023381 0.046135 0.002966 ... 0.00
DistanceFromHome -0.001686 -0.004985 1.000000 0.021042 NaN 0.032916 -0.016075 0.031131 0.008783 0.005303 ... 0.00
Education 0.208034 -0.016806 0.021042 1.000000 NaN 0.042070 -0.027128 0.016775 0.042438 0.101589 ... -0.00
EmployeeCount NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
EmployeeNumber -0.010145 -0.050990 0.032916 0.042070 NaN 1.000000 0.017621 0.035179 -0.006888 -0.018519 ... -0.06
EnvironmentSatisfaction 0.010146 0.018355 -0.016075 -0.027128 NaN 0.017621 1.000000 -0.049857 -0.008278 0.001212 ... 0.00
HourlyRate 0.024287 0.023381 0.031131 0.016775 NaN 0.035179 -0.049857 1.000000 0.042861 -0.027853 ... 0.00
JobInvolvement 0.029820 0.046135 0.008783 0.042438 NaN -0.006888 -0.008278 0.042861 1.000000 -0.012630 ... 0.03
JobLevel 0.509604 0.002966 0.005303 0.101589 NaN -0.018519 0.001212 -0.027853 -0.012630 1.000000 ... 0.02
JobSatisfaction -0.004892 0.030571 -0.003669 -0.011296 NaN -0.046247 -0.006784 -0.071335 -0.021476 -0.001944 ... -0.01
MonthlyIncome 0.497855 0.007707 -0.017014 0.094961 NaN -0.014829 -0.006259 -0.015794 -0.015271 0.950300 ... 0.02
MonthlyRate 0.028051 -0.032182 0.027473 -0.026084 NaN 0.012648 0.037600 -0.015297 -0.016322 0.039563 ... -0.00
NumCompaniesWorked 0.299635 0.038153 -0.029251 0.126317 NaN -0.001251 0.012594 0.022157 0.015012 0.142501 ... 0.05
PercentSalaryHike 0.003634 0.022704 0.040235 -0.011111 NaN -0.012944 -0.031701 -0.009062 -0.017205 -0.034730 ... -0.04
PerformanceRating 0.001904 0.000473 0.027110 -0.024539 NaN -0.020359 -0.029548 -0.002172 -0.029071 -0.021222 ... -0.03
RelationshipSatisfaction 0.053535 0.007846 0.006557 -0.009118 NaN -0.069861 0.007665 0.001330 0.034297 0.021642 ... 1.00
StandardHours NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
StockOptionLevel 0.037510 0.042143 0.044872 0.018422 NaN 0.062227 0.003432 0.050263 0.021523 0.013984 ... -0.04
TotalWorkingYears 0.680381 0.014515 0.004628 0.148280 NaN -0.014365 -0.002693 -0.002334 -0.005533 0.782208 ... 0.02
TrainingTimesLastYear -0.019621 0.002453 -0.036942 -0.025100 NaN 0.023603 -0.019359 -0.008548 -0.015338 -0.018191 ... 0.00
WorkLifeBalance -0.021490 -0.037848 -0.026556 0.009819 NaN 0.010309 0.027627 -0.004607 -0.014617 0.037818 ... 0.01
YearsAtCompany 0.311309 -0.034055 0.009508 0.069114 NaN -0.011240 0.001458 -0.019582 -0.021355 0.534739 ... 0.01
YearsInCurrentRole 0.212901 0.009932 0.018845 0.060236 NaN -0.008416 0.018007 -0.024106 0.008717 0.389447 ... -0.01
YearsSinceLastPromotion 0.216513 -0.033229 0.010029 0.054254 NaN -0.009019 0.016194 -0.026716 -0.024184 0.353885 ... 0.03
YearsWithCurrManager 0.202089 -0.026363 0.014406 0.069065 NaN -0.009197 -0.004999 -0.020123 0.025976 0.375281 ... -0.00

In [14]: 1 plt.subplots(figsize=(20,15))
2 sns.heatmap(df.corr())
Out[14]: <AxesSubplot: >



Numerical and Categorical features
In [15]: 1 num_col=[]
2 cat_col=[]
3 for i in df.columns:
4 if(df[i].dtype=='O'):
5 cat_col.append(i)
6 else:
7 num_col.append(i)
In [16]: 1 # Numerical feature

2 num_col
Out[16]: ['Age',
'DailyRate',
'DistanceFromHome',
'Education',
'EmployeeCount',
'EmployeeNumber',
'EnvironmentSatisfaction',
'HourlyRate',
'JobInvolvement',
'JobLevel',
'JobSatisfaction',
'MonthlyIncome',
'MonthlyRate',
'NumCompaniesWorked',
'PercentSalaryHike',
'PerformanceRating',
'RelationshipSatisfaction',
'StandardHours',
'StockOptionLevel',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsInCurrentRole',
'YearsSinceLastPromotion',
'YearsWithCurrManager']

In [17]: 1 # Categorical Features

2 cat_col
Out[17]: ['Attrition',
'BusinessTravel',
'Department',
'EducationField',
'Gender',
'JobRole',
'MaritalStatus',
'Over18',
'OverTime']
Distribution of Data
In [18]: 1 df['Attrition'].value_counts()
Out[18]: No 1233
Yes 237
Name: Attrition, dtype: int64
We could observe that 237 employees out off 1470 employees wanted to resign company
In [19]: 1 df['Gender'].value_counts()
Out[19]: Male 882

Female 588
Name: Gender, dtype: int64
In [30]: 1 df1=df.groupby(['Attrition', 'Gender']).size().reset_index().rename(columns={0:'Count'})
In [31]: 1 df1
Out[31]:
Attrition Gender Count
0 No Female 501
1 No Male 732
2 Yes Female 87
3 Yes Male 150
1 We could see majorly Mens wanted to change the company compared to womens

In [37]: 1 df2=df.groupby(['Attrition', 'MaritalStatus']).size().reset_index().rename(columns={0:'Count'})
In [33]: 1 df2
Out[33]:
Attrition MaritalStatus Count
0 No Divorced 294
1 No Married 589
2 No Single 350
3 Yes Divorced 33
4 Yes Married 84
5 Yes Single 120
We could see majorly Bachelor's are looking to change the company compared to Married and Divorced employees
In [41]: 1 df3=df.groupby(['Attrition', 'JobRole']).size().reset_index().rename(columns={0:'Count'})

In [36]: 1 df3
Out[36]:
Attrition JobRole Count
0 No Healthcare Representative 122
1 No Human Resources 40
2 No Laboratory Technician 197
3 No Manager 97
4 No Manufacturing Director 135
5 No Research Director 78
6 No Research Scientist 245
7 No Sales Executive 269
8 No Sales Representative 50
9 Yes Healthcare Representative 9
10 Yes Human Resources 12
11 Yes Laboratory Technician 62
12 Yes Manager 5
13 Yes Manufacturing Director 10
14 Yes Research Director 2
15 Yes Research Scientist 47
16 Yes Sales Executive 57
17 Yes Sales Representative 33
Significantly Laboratory Technicians , Research Scientists , Sales Executives and Sales Representatives are looking for job chnage
In [39]: 1 df4=df.groupby(['Attrition', 'BusinessTravel']).size().reset_index().rename(columns={0:'Count'})
In [40]: 1 df4
Out[40]:
Attrition BusinessTravel Count
0 No Non-Travel 138
1 No Travel_Frequently 208
2 No Travel_Rarely 887
3 Yes Non-Travel 12
4 Yes Travel_Frequently 69
5 Yes Travel_Rarely 156

Those who did travel they are significantly looking to change the company
In [42]: 1 df5=df.groupby(['Attrition', 'Department']).size().reset_index().rename(columns={0:'Count'})
In [43]: 1 df5
Out[43]:
Attrition Department Count
0 No Human Resources 51
1 No Research & Development 828
2 No Sales 354
3 Yes Human Resources 12
4 Yes Research & Development 133
5 Yes Sales 92
Majorly Research & Development and sales departments employees are looking to change the company
In [44]: 1 # copy the data

2 df_copy=df.copy()
Label Encoding of Categorical Features
In [45]: 1 from sklearn.preprocessing import LabelEncoder

2 le=LabelEncoder()
3 for i in cat_col:
4 df_copy[i]=le.fit_transform(df_copy[i])
In [46]: 1 df_copy.head()
Out[46]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLeve
0 41 1 2 1102 2 1 2 1 1 1 ... 1 80 0
1 49 0 1 279 1 8 1 1 1 2 ... 4 80 1
2 37 1 2 1373 1 2 2 4 1 4 ... 2 80 0
3 33 0 1 1392 1 3 4 1 1 5 ... 3 80 0
4 27 0 2 591 1 2 1 3 1 7 ... 4 80 1

In [47]: 1 X=df_copy.drop(['Attrition', 'EmployeeCount','EmployeeNumber'],axis=1)

2 y=df_copy['Attrition']
In [48]: 1 X.head()
Out[48]:
Age BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate ... RelationshipSatisfaction StandardHours StockOptionLev
0 41 2 1102 2 1 2 1 2 0 94 ... 1 80
1 49 1 279 1 8 1 1 3 1 61 ... 4 80
2 37 2 1373 1 2 2 4 4 1 92 ... 2 80
3 33 1 1392 1 3 4 1 4 0 56 ... 3 80
4 27 2 591 1 2 1 3 1 1 40 ... 4 80
In [49]: 1 y.head()
Out[49]: 0 1
1 0
2 1
3 0
4 0
Name: Attrition, dtype: int32
In [50]: 1 X.shape,y.shape
Out[50]: ((1470, 32), (1470,))
Train Test Split
In [53]: 1 from sklearn.model_selection import train_test_split

2 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
In [54]: 1 X_train.shape,y_train.shape,X_test.shape,y_test.shape
Out[54]: ((1102, 32), (1102,), (368, 32), (368,))

Training the Model

In [55]: 1 from sklearn.ensemble import RandomForestClassifier
2 Rf_model=RandomForestClassifier()
In [56]: 1 Rf_model.fit(X_train,y_train)
Out[56]: RandomForestClassifier()
Training Accuracy
In [57]: 1 Rf_model.score(X_train,y_train)
Out[57]: 1.0
Testing Accuracy
In [58]: 1 y_pred_rf=Rf_model.predict(X_test)
In [59]: 1 from sklearn.metrics import accuracy_score
In [60]: 1 accuracy_score(y_test,y_pred_rf)
Out[60]: 0.8586956521739131
Hyperparameter Tuning
In [61]: 1 # we are tuning three hyperparameters right now, we are passing the different values for both parameters
2 grid_param = {
3 "n_estimators" : [90,100,115,130],
4 'criterion': ['gini', 'entropy'],
5 'max_depth' : range(2,20,1),
6 'min_samples_leaf' : range(1,10,1),
7 'min_samples_split': range(2,10,1),
8 'max_features' : ['auto','log2']
9 }

In [62]: 1 from sklearn.model_selection import GridSearchCV

2 grid_searh=GridSearchCV(estimator=Rf_model,param_grid=grid_param,cv=3,verbose=2,n_jobs=-1)
In [63]: 1 grid_searh.fit(X_train,y_train)
Fitting 3 folds for each of 20736 candidates, totalling 62208 fits
Out[63]: GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,

param_grid={'criterion': ['gini', 'entropy'],
'max_depth': range(2, 20),
'max_features': ['auto', 'log2'],
'min_samples_leaf': range(1, 10),
'min_samples_split': range(2, 10),
'n_estimators': [90, 100, 115, 130]},
verbose=2)
In [64]: 1 grid_searh.best_params_
Out[64]: {'criterion': 'gini',

'max_depth': 15,
'max_features': 'log2',
'min_samples_leaf': 2,
'min_samples_split': 7,
'n_estimators': 100}
In [65]: 1 Rf_model_with_best_params=RandomForestClassifier(criterion='gini',max_depth= 15,max_features= 'log2',min_samples_leaf= 2,min_samples_split= 7,n_estimators=100
In [66]: 1 Rf_model_with_best_params.fit(X_train,y_train)
Out[66]: RandomForestClassifier(max_depth=15, max_features='log2', min_samples_leaf=2,

min_samples_split=7)
In [67]: 1 y_predict_rf_bp=Rf_model_with_best_params.predict(X_test)
In [68]: 1 accuracy_score(y_test,y_predict_rf_bp)
Out[68]: 0.8478260869565217

Bagging Classifier
In [72]: 1 from sklearn.svm import SVC
2 from sklearn.ensemble import BaggingClassifier
3 from sklearn.datasets import make_classification
4
5 model_bagging_svc = BaggingClassifier(base_estimator=SVC(),n_estimators=50, random_state=0).fit(X_train,y_train)
6
In [73]: 1 y_predict_bagging=model_bagging_svc.predict(X_test)
In [74]: 1 accuracy_score(y_test,y_predict_bagging)
Out[74]: 0.842391304347826
Decision Tree Classifier

In [75]: 1 from sklearn.tree import DecisionTreeClassifier
2 model=DecisionTreeClassifier()
In [76]: 1 model.fit(X_train,y_train)
Out[76]: DecisionTreeClassifier()
In [77]: 1 model.score(X_train,y_train)
Out[77]: 1.0

In [78]: 1 from sklearn import tree

2 import matplotlib.pyplot as plt
3 fig=plt.figure(figsize=(25,15))
4 tree.plot_tree(model,filled=True)
In [79]: 1 model.get_depth()
Out[79]: 15
In [80]: 1 y_predict=model.predict(X_test)
In [81]: 1 from sklearn.metrics import accuracy_score
In [82]: 1 accuracy_score(y_test,y_predict)
Out[82]: 0.8070652173913043
In this case we could see Ensemble classifier performing better than normal Decision Tree classifier
In [ ]: 1

Random Forest Classifier

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Random Forest Classifier

Uploaded by

Copyright:

Available Formats

2/22/23, 7:35 PM RandomForestClassifier - Jupyter Notebook

Import the libraries

In [1]: 1 import numpy as np

Import the Dataset

0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 1/18

In [3]: 1 # Last 5 records

1468 49 No Travel_Frequently 1023 Sales 2 3 Medical 1 2065 ... 4 80

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 2/18

In [4]: 1 # Summary of dataset

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 3/18

In [5]: 1 # checking nul values

In [8]: 1 # Checking duplicate records

We could see there are no duplicate records in the dataset

In [11]: 1 # Statistical analysis of the dataset

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 4/18

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 5/18

Out[14]: <AxesSubplot: >

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 6/18

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 7/18

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 8/18

Numerical and Categorical features

In [16]: 1 # Numerical feature

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 9/18

In [17]: 1 # Categorical Features

Out[19]: Male 882

In [30]: 1 df1=df.groupby(['Attrition', 'Gender']).size().reset_index().rename(columns={0:'Count'})

3 Yes Male 150

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 10/18

In [37]: 1 df2=df.groupby(['Attrition', 'MaritalStatus']).size().reset_index().rename(columns={0:'Count'})

5 Yes Single 120

In [41]: 1 df3=df.groupby(['Attrition', 'JobRole']).size().reset_index().rename(columns={0:'Count'})

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 11/18

0 No Healthcare Representative 122

2 No Laboratory Technician 197

4 No Manufacturing Director 135

6 No Research Scientist 245

7 No Sales Executive 269

9 Yes Healthcare Representative 9

10 Yes Human Resources 12

11 Yes Laboratory Technician 62

13 Yes Manufacturing Director 10

14 Yes Research Director 2

15 Yes Research Scientist 47

16 Yes Sales Executive 57

17 Yes Sales Representative 33

In [39]: 1 df4=df.groupby(['Attrition', 'BusinessTravel']).size().reset_index().rename(columns={0:'Count'})

5 Yes Travel_Rarely 156

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 12/18

In [42]: 1 df5=df.groupby(['Attrition', 'Department']).size().reset_index().rename(columns={0:'Count'})

1 No Research & Development 828

3 Yes Human Resources 12

4 Yes Research & Development 133

In [44]: 1 # copy the data

Label Encoding of Categorical Features

In [45]: 1 from sklearn.preprocessing import LabelEncoder

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 13/18

In [47]: 1 X=df_copy.drop(['Attrition', 'EmployeeCount','EmployeeNumber'],axis=1)

Out[50]: ((1470, 32), (1470,))

Train Test Split

In [53]: 1 from sklearn.model_selection import train_test_split

Out[54]: ((1102, 32), (1102,), (368, 32), (368,))

localhost:8888/notebooks/Downloads/Full Stack Data Science/Machine Learning/Supervised Learning/Random Forest/RandomForestClassifier.ipynb# 14/18