
GROUP ASSIGNMENT: DATA MINING

 Objective:
Build models using CART and a neural network to classify/predict employee attrition from the given dataset.
CART → decision tree using the rpart library
Neural Network → using the nnet library

 Overview of the data:

Target Variable: Attrition


Independent variables:

Age, BusinessTravel, DailyRate, Department, DistanceFromHome,
Education, EducationField, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction,
Gender, HourlyRate, JobInvolvement, JobLevel, JobRole,
JobSatisfaction, MaritalStatus, MonthlyIncome, MonthlyRate, NumCompaniesWorked,
Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction,
StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance,
YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager

 Perform Exploratory Data Analysis:

1. Data Import:

EmpData <- read.csv("HR_Employee_Attrition_Data.csv")

2. Structure of the dataset

3. Summary of the dataset

There are 2940 observations of 35 variables; the variable types are integer and factor.

Code required to perform the EDA:

library(dplyr)        # glimpse()
library(funModeling)  # df_status(), freq(), profiling_num(), plot_num()
library(Hmisc)        # describe()
library(DataExplorer) # plot_str(), plot_missing(), plot_histogram()

basic_eda <- function(EmpData)
{
glimpse(EmpData)
df_status(EmpData)
freq(EmpData)
profiling_num(EmpData)
plot_num(EmpData)
describe(EmpData)
}
basic_eda(EmpData = EmpData)
plot_str(EmpData)
plot_missing(EmpData)
plot_histogram(EmpData)
We can see there are no missing values present.
For the correlation plot, the columns are rearranged so that the numerical and categorical variables are grouped together. In the correlation plot below, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, TotalWorkingYears, MonthlyIncome and JobLevel appear to be highly correlated, which might lead to a multicollinearity problem.
Moreover, StandardHours, EmployeeCount, Over18 and EmployeeNumber show no variation, so those columns can be dropped.
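
The correlation plot itself is not produced by the code above; a minimal sketch of how it could be generated, assuming the corrplot package (not part of the original code), is:

# A minimal sketch of the correlation plot described above
# (assumes the corrplot package is installed).
library(corrplot)

NumCols <- sapply(EmpData, is.numeric)   # numeric columns only
CorMat  <- cor(EmpData[, NumCols])       # pairwise correlations

# The Years*/TotalWorkingYears/MonthlyIncome/JobLevel variables
# should appear as a block of strong positive correlations.
corrplot(CorMat, method = "circle", type = "upper", tl.cex = 0.6)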
4. Columns that are redundant:

Dropping the redundant columns:


 Over18 as there is no variability, all are Y.
 EmployeeCount as there is no variability, all are 1.
 StandardHours as there is no variability, all are 80.
 EmployeeNumber as it is an identifier.

library(caret)
nearZeroVar(EmpData)
[1] 9 22 27
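
To map these indices to column names (assuming the column order of the imported CSV matches the variable list above):

# Map the near-zero-variance indices to column names
names(EmpData)[nearZeroVar(EmpData)]
# expected, given the columns dropped below: "EmployeeCount" "Over18" "StandardHours"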

EmpData$Over18 <- NULL
EmpData$EmployeeCount <- NULL
EmpData$StandardHours <- NULL
EmpData$EmployeeNumber <- NULL

The overall attrition rate is about 16%: 474 out of 2940 employees have attrited.
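
This rate can be verified directly from the target variable:

# Verify the attrition rate quoted above
table(EmpData$Attrition)               # No: 2466, Yes: 474
prop.table(table(EmpData$Attrition))   # Yes ~ 0.16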


Some analysis of different variables w.r.t Attrition:

1. TotalWorkingYears vs Attrition:

summary(EmpData$TotalWorkingYears)
TtlWkgYrs <- cut(EmpData$TotalWorkingYears, 10, include.lowest = TRUE)
ggplot(EmpData, aes(TtlWkgYrs, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

The ratio of employees staying with the company to those leaving is roughly 5:1.

2. YearsSinceLastPromotion vs Attrition

ggplot(EmpData, aes(YearsSinceLastPromotion, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Employees promoted recently quit the company more often than those who were promoted long ago.

3. YearsWithCurrManager vs Attrition

ggplot(EmpData, aes(YearsWithCurrManager, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

As the number of years with current manager increases, Attrition decreases.

4. TrainingTimesLastYear vs Attrition:

ggplot(EmpData, aes(TrainingTimesLastYear, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Attrition is concentrated among employees trained 2-4 times a year.

5. YearsAtCompany vs Attrition:

ggplot(EmpData, aes(YearsAtCompany, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Employees with fewer years at the company tend to quit more.

6. TotalWorkingYears vs Attrition

ggplot(EmpData, aes(TotalWorkingYears, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Employees with less total working experience leave more often.

7. PercentSalaryHike vs Attrition:

ggplot(EmpData, aes(PercentSalaryHike, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Employees with a lower percentage salary hike leave the company more often.

8. OverTime vs Attrition:

prop.table(table(EmpData$OverTime))
table(EmpData$OverTime, EmpData$Attrition)
ggplot(EmpData, aes(OverTime, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")
Overall, 28% of the employees work overtime. The attrition rate among those working overtime is about 31% (254 of 832), versus about 10% (220 of 2108) for those who do not. Thus overtime contributes towards attrition.

9. WorkLifeBalance vs Attrition

ggplot(EmpData, aes(WorkLifeBalance, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

In absolute terms, attrition is highest among employees reporting a better work-life balance, which is also the largest group.

10. MaritalStatus vs Attrition


ggplot(EmpData, aes(MaritalStatus, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Attrition is highest among single employees, moderate among married employees, and lowest among divorced employees.

11. JobRole vs Attrition:

table(EmpData$JobRole)
table(EmpData$JobRole, EmpData$Attrition)
ggplot(EmpData, aes(JobRole, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")
Laboratory Technicians, followed by Sales Executives, contribute the most towards attrition in absolute terms. In percentage terms, Sales Representatives are far ahead at 66%.
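
The per-role percentages can be read off a row-wise proportion table; a quick check (not in the original code):

# Row-wise proportions give the attrition rate per job role
round(prop.table(table(EmpData$JobRole, EmpData$Attrition), margin = 1), 2)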

12. JobSatisfaction vs Attrition:

ggplot(EmpData, aes(JobSatisfaction, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Low job satisfaction is associated with more employees leaving the company.

13. Gender vs Attrition

ggplot(EmpData, aes(Gender, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Attrition is higher among male employees.

14. Hourly, Daily and Monthly Rates vs Attrition

summary(EmpData$HourlyRate)
HrlyRate<- cut(EmpData$HourlyRate, 7, include.lowest = TRUE)
ggplot(EmpData, aes(HrlyRate, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

15. Age vs Attrition

summary(EmpData$Age)
A_g_e <- cut(EmpData$Age, 8, include.lowest = TRUE)
ggplot(EmpData, aes(A_g_e, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

Attrition is highest among younger employees and shows a downward trend after about 30 years of age.

16. MonthlyIncome vs Attrition:

summary(EmpData$MonthlyIncome)
MnthlyIncome <- cut(EmpData$MonthlyIncome, 10, include.lowest = TRUE,
labels=c(1,2,3,4,5,6,7,8,9,10))
ggplot(EmpData, aes(MnthlyIncome, ..count.., fill = factor(Attrition))) +
geom_bar(position="dodge")

The attrition in absolute terms decreases as the salary increases; thus, lower salary contributes towards attrition.
 Splitting the data into DEV and Hold-Out samples (70:30):

set.seed(777)
library(caTools)
split = sample.split(EmpData$Attrition, SplitRatio = 0.70)
train = subset(EmpData, split == TRUE)
test = subset(EmpData, split == FALSE)

The imported data, after dropping the redundant variables, has been split into training and test samples, stratified on the Attrition variable with a split ratio of 0.7.
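
As a sanity check (not in the original code), both samples should show roughly the same ~16% attrition rate, since sample.split stratifies on the target:

prop.table(table(train$Attrition))
prop.table(table(test$Attrition))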

 Hypothesis Testing:

H0 : Attrition (the dependent variable) is not affected by the independent variable.
H1 : Attrition (the dependent variable) is affected by the independent variable.

The F-statistic is significant, hence we reject the null hypothesis and accept the alternate hypothesis, i.e. the dependent variable Attrition is affected by the independent variables.
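
The original does not show the test code; a minimal sketch, assuming an F-test (one-way ANOVA) for numeric predictors and a chi-square test for categorical ones:

# F-test: does mean MonthlyIncome differ between Attrition groups?
summary(aov(MonthlyIncome ~ Attrition, data = EmpData))

# Chi-square test: is OverTime associated with Attrition?
chisq.test(table(EmpData$OverTime, EmpData$Attrition))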

 CART Model:

table(test$Attrition)
740/nrow(test)   # share of the majority class "No" = baseline accuracy

library(rpart)
library(rpart.plot)   # for prp() and rpart.plot()

ModelCart = rpart(Attrition ~ ., data=train, method="class")
prp(ModelCart)
tree_full = rpart(formula = Attrition~., data = train)
rpart.plot(tree_full, cex=0.8)
#Predict the test data
PredictionCart <- predict(ModelCart, newdata=test, type="class")

#CART Accuracy
#Confusion matrix
t1 <- table(test$Attrition, PredictionCart)
t1

#CART model accuracy
(t1[1]+t1[4])/(nrow(test))

Validate CART Model:

Confusion Matrix:

The CART model improves on the majority-class baseline, but not by much.
Accuracy = 85.26% on the testing set.
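
For a fuller picture than raw accuracy, caret's confusionMatrix (caret is already loaded above) also reports sensitivity and specificity; treating "Yes" as the positive class is our assumption:

# Fuller validation metrics; positive = "Yes" is an assumption
confusionMatrix(PredictionCart, test$Attrition, positive = "Yes")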

 Random Forest:

library(randomForest)

ModelRf = randomForest(Attrition ~ ., data=train, ntree = 100, mtry = 5,
importance = TRUE)

#Print the model
print(ModelRf)
#OOB error vs no. of trees
plot(ModelRf, main="")
legend("topright", c("OOB", "No", "Yes"), text.col=1:3, lty=1:3, col=1:3)
title(main="Error Rates Random Forest")

# List the importance of the variables
ImpVar <- round(randomForest::importance(ModelRf), 2)
ImpVar[order(ImpVar[,3], decreasing=TRUE),]
TunedRf <- tuneRF(x = train[,-2],   # drop Attrition (column 2)
y=as.factor(train$Attrition),
mtryStart = 5,
ntreeTry=60,
stepFactor = 2,
improve = 0.001,
trace=TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 5,
importance=TRUE
)
ImpvarTunedRf <- TunedRf$importance
ImpvarTunedRf[order(ImpvarTunedRf[,3], decreasing=TRUE),]

PredictionRf <- predict(TunedRf, test, type="class")

#RandomForest Accuracy
#Confusion matrix
t2 <- table(test$Attrition, PredictionRf)
t2
#RandomForest model accuracy
(t2[1]+t2[4])/(nrow(test))

Random Forest improves the accuracy to 94%, showing that bagging outperforms the single-tree CART model here.
 Neural Network:

library(nnet)
set.seed(777)
ModelNN <- nnet(Attrition ~ ., data = train, size = 21, rang = 0.07,
Hess = FALSE, decay = 15e-4, maxit = 2000)
# weights: 967
initial value 1259.026759
iter 10 value 899.779328
iter 100 value 873.303544
...
iter 500 value 537.755900
...
iter1000 value 479.778580
...
iter1500 value 445.163208
...
iter2000 value 324.678115
final value 324.678115
stopped after 2000 iterations

PredictionNN <- predict(ModelNN, test, type = "class")
table(PredictionNN)
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
plot.nnet(ModelNN)

#Confusion Matrix
t3 <- table(test$Attrition, PredictionNN)
t3

#NeuralNetwork model accuracy
(t3[1]+t3[4])/(nrow(test))

The neural network gives an accuracy of approximately 90%.
