
Individual Assignment Week 3 Case B

Sanne Bor (500832)

Introduction
Companies are continually optimizing their talent recruitment strategies, especially in the field of data science
recruitment (Smaldone et al., 2022). It is crucial for companies to identify candidates who are likely
to change jobs, in order to lower talent acquisition costs, and predictive modelling can be
used in this context (Pessach et al., 2020). This assignment explores the use of the Naïve Bayes
algorithm to predict the job-changing behavior of data scientists based on information that can typically be
found on LinkedIn. A company has accumulated data on candidates it approached and is interested in
discerning patterns indicative of a data scientist's likelihood to change jobs. The goal of the research is to
assist the company in targeting candidates effectively based on the information contained in eight
variables. This leads to the following research question: Is the Naïve Bayes algorithm a good method to
predict whether a data scientist is likely to change jobs to the specific company?

Data
The dataset contains 2,000 observations of candidates' decisions, together with eight variables that
describe the candidates. A ninth variable, target, is the variable to be predicted. A description
of the variables can be found in Table 2 of Appendix A. The data consist of eight categorical variables,
including the target variable, and one numerical variable, experience. For consistency in the data types
of the features within the Naïve Bayes model, the variable experience is converted to a categorical variable
with the levels 0-5, 6-10, 11-15, 16-20, and more than 20 years of experience. Furthermore, all variables are
converted to numeric data to fit the Naïve Bayes algorithm, which expects numeric values.
When exploring the data, a correlation matrix is employed to assess the correlations among the features.
Figure 1 in Appendix A shows that the features do not exhibit high collinearity. For training and prediction
of the Naïve Bayes model, the data are randomly split into a 70% training set and a 30% test set, which
gives sufficient training data and an adequate test size.
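The binning of experience into categories and the subsequent factor-to-numeric conversion can be sketched as follows (a minimal example on hypothetical toy values; the break points mirror those used in the analysis code in Appendix B):

```r
# Toy vector of years of experience (hypothetical values)
experience <- c(2, 7, 12, 19, 25)

# Bin into the experience categories used in the report
experience_cat <- cut(experience,
                      breaks = c(0, 6, 11, 16, 21, 26),
                      labels = c("0-5", "6-10", "11-15", "16-20", ">20"),
                      include.lowest = TRUE)

# Convert the resulting factor to integer level codes for the model
experience_num <- as.numeric(experience_cat)
```

Note that `as.numeric()` on a factor returns the internal level codes (1, 2, ...), not the original values, which is the behaviour the preprocessing relies on.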

Method
The multinomial Naïve Bayes algorithm is employed, which is a probabilistic classification method within
the supervised machine learning techniques. The algorithm is based on the Bayes theorem and predicts the
target label of a new observation based on several features. The Bayes theorem is given by the
formula P(A|B) = P(A) · P(B|A) / P(B), where P(A) is the prior probability of class A, P(B|A) is the
likelihood of predictor B given class A, and P(B) is the prior probability of predictor B. The algorithm
calculates the probability of each label for a given sample and outputs the label with the highest
probability for a single observation.
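As an illustration of the Bayes rule underlying the classifier, consider a toy binary setting (hypothetical numbers, not taken from the assignment data):

```r
# Hypothetical prior and likelihoods for class A vs. not-A
p_A      <- 0.3   # P(A): prior probability of class A
p_B_A    <- 0.8   # P(B|A): probability of feature value B given class A
p_B_notA <- 0.2   # P(B|not A)

# Law of total probability for P(B), then Bayes' theorem for P(A|B)
p_B   <- p_B_A * p_A + p_B_notA * (1 - p_A)
p_A_B <- p_A * p_B_A / p_B
```

Here the observed feature value B more than doubles the probability of class A compared with its prior, which is exactly the per-class posterior the classifier compares across labels.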
The model assumes that the features are conditionally independent given the class label: the presence or
absence of one feature does not affect the presence or absence of another. As seen in the correlation plot,
the variables are not highly correlated, so under the assumed conditional independence all variables can be
included.
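The independence check can be sketched with a Spearman correlation matrix on hypothetical numeric features (mirroring the approach used in the code of Appendix B):

```r
set.seed(1)
# Hypothetical numeric feature matrix (three toy features)
toy <- data.frame(f1 = rnorm(100),
                  f2 = sample(1:5, 100, replace = TRUE),
                  f3 = rnorm(100))

# Spearman correlation matrix; off-diagonal values near 0 support the
# (approximate) conditional-independence assumption of Naive Bayes
cor_matrix <- cor(toy, method = "spearman")
max_offdiag <- max(abs(cor_matrix[upper.tri(cor_matrix)]))
```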
The model is evaluated with the confusion matrix (a table that summarizes the performance of a classification
algorithm), the accuracy ((True Positives + True Negatives) / Total Instances), the AUC score (area under the
ROC curve, where the true positive rate is plotted against the false positive rate), and the F1-score
(2 × (Precision × Recall) / (Precision + Recall)).
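These metrics can be recomputed directly from confusion-matrix counts; a minimal sketch using the counts shown in Figure 2 of Appendix A, assuming the positive class is a job change (223 correctly predicted positives, 47 false negatives, 155 false positives, 175 true negatives):

```r
# Confusion-matrix counts taken from Figure 2 (Appendix A)
tp <- 223  # actually positive, predicted positive
fn <- 47   # actually positive, predicted negative
fp <- 155  # actually negative, predicted positive
tn <- 175  # actually negative, predicted negative

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```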

Results
First, the model was fitted on the training set of the data. When having trained the model, the model is
employed as a prediction model on the test set of the data and then used to calculate the accuracy, the
F1-score, the AUC-score and the confusion matrix. The table below displays the values of the different
evaluation metrics of the Multinomial Naïve Bayes model.

Table 1: Scores Evaluation Metrics

Metric Value
Accuracy 0.663
F1 Score 0.688
Recall 0.826
Precision 0.590
AUC 0.679

This multinomial Naïve Bayes model has an accuracy of 66.3%, the proportion of correctly classified
instances. A recall of 0.826 suggests the model captures about 83% of the actual positive instances,
while a precision of 0.590 indicates that when the model predicts positive, it is correct about 59% of
the time. The F1-score of 0.688 combines precision and recall, providing a balance between
false positives and false negatives. The area under the ROC curve (Figure 3) of 0.679 assesses the model's
ability to distinguish between the classes; an AUC score closer to 1 indicates better performance.

Conclusion
In this research, the goal was to assist a company in targeting candidates in an effective way. A
Multinomial Naïve Bayes model was fitted to predict the target for potential job changers. To
answer the research question, the evaluation metrics of the model were examined. They suggest
that Multinomial Naïve Bayes performs reasonably well for predicting whether a data scientist is likely to
change jobs to the specific company, but that the data should be examined in more detail to obtain more
accurate outcomes.

References
Pessach, D., Singer, G., Avrahami, D., Ben-Gal, H. C., Shmueli, E., & Ben-Gal, I. (2020). Employees re-
cruitment: A prescriptive analytics approach via machine learning and mathematical programming. Decision
Support Systems, 134, 113290.
Smaldone, F., Ippolito, A., Lagger, J., & Pellicano, M. (2022). Employability skills: Profiling data scientists
in the digital labour market. European Management Journal, 40 (5), 671-684.

Appendix A: Tables and Figures

Table 2: Description of Dataset

Variable                Description
target                  Whether the person switched jobs or not
gender                  The gender of the candidate
relevant_experience     Whether the candidate has relevant experience for the new job or not
education_level         The level of education of the candidate
major_discipline        The type of major the candidate has followed
experience              The total number of years of experience of the candidate
company_size            The number of employees at the candidate's current company
company_type            The type of company the candidate currently works at
last_new_job            The number of years since the candidate last changed jobs

[Heatmap of pairwise correlations between the eight features; colour scale from −1 to 1.]

Figure 1: Correlation Plot of Features

Confusion matrix counts (test set, n = 600; predicted classes coded 1/2 from the factor levels):

              Predicted 1   Predicted 2
Actual 1           47            223
Actual 0          175            155

Figure 2: Confusion Matrix

[ROC curve: sensitivity plotted against specificity; AUC = 0.678.]

Figure 3: ROC-curve

Appendix B: Code

setwd("C:/Users/sanne/Downloads")

library(ggplot2)
library(caret)
library(gpairs)
library(corrplot)
library(e1071)  # naiveBayes()
library(pROC)   # roc(), auc(), plot.roc()

# Load dataset
load("C:/Users/sanne/Downloads/JobChanges.RData")

# View dataset
str(data)

summary(data)

levels(data$gender)
levels(data$relevant_experience)
levels(data$education_level)
levels(data$major_discipline)
levels(data$experience)
levels(data$company_size)
levels(data$company_type)
levels(data$last_new_job)

data$gender <- as.numeric(data$gender)


data$relevant_experience <- as.numeric(data$relevant_experience)
data$education_level <- as.numeric(data$education_level)
data$major_discipline <- as.numeric(data$major_discipline)
data$experience <- as.numeric(data$experience)
data$company_size <- as.numeric(data$company_size)
data$company_type <- as.numeric(data$company_type)
data$last_new_job <- as.numeric(data$last_new_job)

data$experience <- cut(data$experience, breaks = c(0, 6, 11, 16, 21, 26),
                       labels = c("0-5", "6-10", "11-15", "16-20", ">20"),
                       include.lowest = TRUE)
data$experience <- as.numeric(data$experience)

data <- as.data.frame(data)

# Plot relations between the features
gpairs(data[, 1:8])

# Check independence assumption: correlation matrix. High correlations between
# features might violate the independence assumption
cor_matrix <- cor(data[, 1:8], method = "spearman")
corrplot(cor_matrix, method = "color", main = "Correlation Plot Features")

# Check feature distribution: imbalances in the distribution of feature levels
# might impact model performance
ggplot(data, aes(x=gender))+
geom_bar()+
theme(axis.text.x = element_text(angle=45, hjust=1))

#----
# Fit the Naive Bayes model
#----
set.seed(123)
sample.indices <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))  # 70/30 train/test split
train_data <- data[sample.indices,]
test_data <- data[-sample.indices,]

nb_model <- naiveBayes(target ~ ., data=train_data)

#----
# Predictions of Naive Bayes model
#----

predictions <- predict(nb_model, test_data)
predictions <- as.numeric(predictions)

#----
# Evaluation of the Naive Bayes Model
#----

confusion_matrix <- table(predictions, test_data$target)

confusion_df <- as.data.frame(confusion_matrix)
names(confusion_df) <- c("Predicted", "Actual", "Count")
ggplot(confusion_df, aes(x = Predicted, y = Actual, fill = Count))+
geom_tile(color="white")+
geom_text(aes(label=Count), vjust=1)+
scale_fill_gradient(low="white", high="blue")+
theme_minimal()+
labs(title="Confusion Matrix", x="Predicted", y="Actual")

accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)

# Rows of confusion_matrix are predictions (1 = class 0, 2 = class 1),
# columns are the actual target (0, 1)
tp <- confusion_matrix[2, 2]  # predicted positive, actually positive
fp <- confusion_matrix[2, 1]  # predicted positive, actually negative (false positive)
fn <- confusion_matrix[1, 2]  # predicted negative, actually positive (false negative)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1_score <- 2 * (precision * recall) / (precision + recall)

res.roc <- roc(test_data$target, predictions)
plot.roc(res.roc, print.auc = TRUE, main = "ROC Curve", col = "blue", lwd = 2)

AUC <- as.numeric(auc(res.roc))

evaluation_table <- data.frame(
  Metric = c("Accuracy", "F1 Score", "Recall", "Precision", "AUC"),
  Value = c(round(accuracy, 3), round(f1_score, 3), round(recall, 3),
            round(precision, 3), round(AUC, 3)))
