
HR Data

Pushkar Sarkar

2022-10-25

Let’s Begin!

libraries & data

Let’s bring the tidyverse into RStudio.

library(tidyverse)

Let’s also import the HR dataset and view it. Notice that the first row of the dataset has the column names,
so don’t forget to select “Headings = Yes” when importing!

View(HRDataset_v14)
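If you prefer code to the import dialog, the same import can be sketched with base R. This is only an illustration: the tiny file written below and the name hr_demo are made up so that the example runs anywhere, and with the real data you would pass the path to your own copy of the CSV instead. header = TRUE plays the role of the "Headings = Yes" option.

```r
# Stand-in for the real CSV: a tiny file written to a temporary location.
# (Hypothetical contents; only the first three column names mimic the dataset.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("EmpID,MarriedID,Salary",
             "10026,0,62506",
             "10084,1,104437"),
           csv_path)

# header = TRUE tells R that the first row holds the column names
hr_demo <- read.csv(csv_path, header = TRUE)

names(hr_demo)   # column names taken from the first row
nrow(hr_demo)    # number of data rows (the header is not counted)
```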

Cleaning the data

Just like last time, let’s use janitor and clean_names() to get the names in a convenient format. Instead of
EmpID or MarriedID, we would like to start each word in lowercase and separate words with an underscore,
like this - emp_id and married_id. There’s nothing better or worse about any name, but it’s easier for us
to follow a single format for all our datasets. By the way, this style of naming is called snake case.

library(janitor)

##
## Attaching package: ’janitor’
## The following objects are masked from ’package:stats’:
##
##     chisq.test, fisher.test

hr <- HRDataset_v14 %>%
  clean_names()

glimpse(hr)

## Rows: 311
## Columns: 36
## $ i_employee_name <chr> "Adinolfi, Wilson K", "Ait Sidi, Karthik~
## $ emp_id <int> 10026, 10084, 10196, 10088, 10069, 10002,~

## $ married_id <int> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,~
## $ marital_status_id <int> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0,~
## $ gender_id <int> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,~
## $ emp_status_id <int> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1,~
## $ dept_id <int> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5,~
## $ perf_score_id <int> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3,~
## $ from_diversity_job_fair_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,~
## $ salary <int> 62506, 104437, 64955, 64991, 50825, 57568~
## $ termd <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,~
## $ position_id <int> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 1~
## $ position <chr> "Production Technician I", "Sr. DBA", "Pr~
## $ state <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA",~
## $ zip <int> 1960, 2148, 1810, 1886, 2169, 1844, 2110,~
## $ dob <chr> "07/10/83", "05/05/75", "09/19/88", "09/2~
## $ sex <chr> "M ", "M ", "F", "F", "F", "F", "F", "M "~
## $ marital_desc <chr> "Single", "Married", "Married", "Married"~
## $ citizen_desc <chr> "US Citizen", "US Citizen", "US Citizen",~
## $ hispanic_latino <chr> "No", "No", "No", "No", "No", "No", "No",~
## $ race_desc <chr> "White", "White", "White", "White", "Whit~
## $ dateof_hire <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7~
## $ dateof_termination <chr> "", "6/16/2016", "9/24/2012", "", "9/6/20~
## $ term_reason <chr> "N/A-StillEmployed", "career change", "ho~
## $ employment_status <chr> "Active", "Voluntarily Terminated", "Volu~
## $ department <chr> "Production ", "IT/IS", "Production~
## $ manager_name <chr> "Michael Albert", "Simon Roup", "Kissy Su~
## $ manager_id <int> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14,~
## $ recruitment_source <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed~
## $ performance_score <chr> "Exceeds", "Fully Meets", "Fully Meets", ~
## $ engagement_survey <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04,~
## $ emp_satisfaction <int> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4,~
## $ special_projects_count <int> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0,~
## $ last_performance_review_date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1~
## $ days_late_last30 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ absences <int> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 1~

This HR dataset is an academic dataset - two HR professors created this dataset about a fictional company.
Read more about it here - https://rpubs.com/rhuebner/hrd_cb_v14. You’ll also find out what each column
means (although most are self-explanatory).
After a glimpse(), we can see that this is a record of 311 employees. There are 36 columns with all kinds of
information. This is not a big dataset, but this is a “wide” dataset (for lack of a better word). Without a
specific question in our minds, it’ll be difficult to approach such a dataset.
So before we start, let’s spend a minute understanding what kind of data we have and sorting it out.
Broadly, we have the following information:

1. Employee demographic data (name, marital status, background etc.)


2. Employee general data (salary, tenure, joining date etc.)
3. Role data (role, department, manager etc.)
4. Performance & satisfaction data.

Then some miscellaneous data such as how many special projects they did, recruitment sources etc.
From this, we can see that only two things - the performance and the satisfaction of the employees - are
actual metrics in the dataset. Nothing else is really a measurement of anything; the rest is
descriptive information.

So a simple question to ask is - does performance depend on any of the other variables?

Analysing Performance

Let’s first find out how performance is recorded in this dataset. There are two performance columns:
perf_score_id and performance_score. Let’s take a look at both.

hr %>%
select(performance_score, perf_score_id) %>%
View()
# manually viewing the two columns

hr %>% count(perf_score_id)

## perf_score_id n
## 1 1 13
## 2 2 18
## 3 3 243
## 4 4 37

hr %>% count(performance_score)

## performance_score n
## 1 Exceeds 37
## 2 Fully Meets 243
## 3 Needs Improvement 18
## 4 PIP 13

They seem to be the same thing! Someone has just converted performance_score into numbers and created
perf_score_id. So let’s forget about performance_score for now.
Performance, in perf_score_id, has been recorded as a categorical variable from 1 to 4, with 1 being the
lowest performance and 4 being the highest.
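One way to make this ordering explicit is an ordered factor. The sketch below uses a made-up vector of labels (labels_demo is not part of the dataset) and assumes the mapping PIP = 1, Needs Improvement = 2, Fully Meets = 3, Exceeds = 4, which is what the matching counts in the two tables suggest.

```r
# Made-up sample of performance labels (not rows from the real dataset)
labels_demo <- c("Exceeds", "Fully Meets", "PIP", "Needs Improvement", "Fully Meets")

# Assumed order, lowest to highest, inferred from the matching counts
score_levels <- c("PIP", "Needs Improvement", "Fully Meets", "Exceeds")

score_factor <- factor(labels_demo, levels = score_levels, ordered = TRUE)

# The integer codes of the ordered factor reproduce the 1-4 scale
score_id <- as.integer(score_factor)
score_id

# Cross-tabulating labels against ids shows exactly one id per label
table(label = labels_demo, id = score_id)
```

Running the same table() on the real performance_score and perf_score_id columns is a quick way to confirm that each label really maps to a single id.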
Now that we know the performance metric we’ll be using, the next thing we’ll have to decide on is what to
compare it with. We can check employee demographic data, employee general data, employee role data, or
the other miscellaneous data.
Let’s start with the role data for now. The role data is (remember that this is just our categorisation of the
data) recorded in: position_id, position, department, manager_name, manager_id
For our performance score analysis, we’ll create a new, smaller dataset out of the original so that we can
work faster.

hr_perf <- hr %>%
  select(perf_score_id, position_id, position, department, manager_name, manager_id)

hr_perf

Printing hr_perf gives us a long, long list of rows, because a plain data frame prints everything. A quick solution is to convert this into a tibble.

hr_perf <- hr %>%
select(perf_score_id, position_id, position, department, manager_name, manager_id) %>%
as_tibble()

hr_perf

## # A tibble: 311 x 6
## perf_score_id position_id position department manager_name manager_id
## <int> <int> <chr> <chr> <chr> <int>
## 1 4 19 Production Tech~ "Producti~ Michael Alb~ 22
## 2 3 27 Sr. DBA "IT/IS" Simon Roup 4
## 3 3 20 Production Tech~ "Producti~ Kissy Sulli~ 20
## 4 3 19 Production Tech~ "Producti~ Elijiah Gray 16
## 5 3 19 Production Tech~ "Producti~ Webster But~ 39
## 6 4 19 Production Tech~ "Producti~ Amy Dunn 11
## 7 3 24 Software Engine~ "Software~ Alex Sweetw~ 10
## 8 3 19 Production Tech~ "Producti~ Ketsia Lieb~ 19
## 9 3 19 Production Tech~ "Producti~ Brannon Mil~ 12
## 10 3 14 IT Support "IT/IS" Peter Monroe 7
## # ... with 301 more rows

What is a tibble? A tibble is a nice, clean table. Since normal data looks ugly in R (unless you View(data)),
the tidyverse package added both glimpse() and as_tibble() so that we can see the data in a neat manner.
The two position columns - position and position_id - are also just like performance score: they are
duplicates. So we can work on either. But since there are many positions. . .

hr_perf %>% count(position)

## # A tibble: 32 x 2
## position n
## <chr> <int>
## 1 "Accountant I" 3
## 2 "Administrative Assistant" 3
## 3 "Area Sales Manager" 27
## 4 "BI Developer" 4
## 5 "BI Director" 1
## 6 "CIO" 1
## 7 "Data Analyst" 7
## 8 "Data Analyst " 1
## 9 "Data Architect" 1
## 10 "Database Administrator" 5
## # ... with 22 more rows

. . . we probably won’t understand position_id, since it’s all just numbers which are unordered. So let’s do
the opposite of what we did with performance score. Let’s keep position and drop position_id.

hr_perf <- hr %>%
  select(perf_score_id, position, department, manager_name, manager_id) %>%
as_tibble()

glimpse(hr_perf)

## Rows: 311
## Columns: 5
## $ perf_score_id <int> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3,~
## $ position <chr> "Production Technician I", "Sr. DBA", "Production Techni~
## $ department <chr> "Production ", "IT/IS", "Production ", "Prod~
## $ manager_name <chr> "Michael Albert", "Simon Roup", "Kissy Sullivan", "Eliji~
## $ manager_id <int> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 20, 4, 18, 22,~

Let’s start with the department. Do different departments have different overall performance rates?

hr_perf %>%
group_by(department) %>%
summarise(mean = mean(perf_score_id))

## # A tibble: 6 x 2
## department mean
## <chr> <dbl>
## 1 "Admin Offices" 3
## 2 "Executive Office" 3
## 3 "IT/IS" 3.06
## 4 "Production " 2.97
## 5 "Sales" 2.84
## 6 "Software Engineering" 3.09

No such indication. But we used the mean - is that the best measure of central tendency? Try the same for
median using median(perf_score_id)! (spoiler: nothing changes)

hr_perf %>%
group_by(department) %>%
summarise(mean = mean(perf_score_id),
n = n())

## # A tibble: 6 x 3
## department mean n
## <chr> <dbl> <int>
## 1 "Admin Offices" 3 9
## 2 "Executive Office" 3 1
## 3 "IT/IS" 3.06 50
## 4 "Production " 2.97 209
## 5 "Sales" 2.84 31
## 6 "Software Engineering" 3.09 11

Adding a bit of useful information - the number of rows, i.e. the number of employees.
This is probably something we should have done before! But that’s always easier to say in hindsight. There
are 200+ employees in Production, between 9 and 50 in each of the other departments, and only one in the Executive Office.
Let’s dig deeper into production then and see what we find.

hr_perf %>%
filter(department == "Production")

## # A tibble: 0 x 5
## # ... with 5 variables: perf_score_id <int>, position <chr>, department <chr>,
## # manager_name <chr>, manager_id <int>

Nothing happens. But there’s no error - we just get a dataset of 0 entries. This means that the filter has
worked. . . only that there is no entry labeled “Production” in the department column. We’ll have to manually
look at the data.

hr_perf %>% glimpse()

## Rows: 311
## Columns: 5
## $ perf_score_id <int> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3,~
## $ position <chr> "Production Technician I", "Sr. DBA", "Production Techni~
## $ department <chr> "Production ", "IT/IS", "Production ", "Prod~
## $ manager_name <chr> "Michael Albert", "Simon Roup", "Kissy Sullivan", "Eliji~
## $ manager_id <int> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 20, 4, 18, 22,~

The first entry is a Production entry, and right there we see something weird - there are a bunch of spaces
after Production. We can do two things:

1. Just filter on the padded value, that is “Production” followed by the same trailing spaces, instead of plain “Production”
2. Trim the whitespace

There is a handy function called trimws, so let’s use that. Note: How does one find out about trimws?
Google how to remove white spaces in R. There will be many such small edge cases, but almost always
someone will have a solution up on the internet.
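Here is a small sketch of what trimws() does, using a made-up vector (dept_demo is hypothetical and the number of trailing spaces is arbitrary). It also shows a quick way to spot which values carry extra whitespace without reading the data by eye.

```r
# Made-up department labels with the same kind of trailing spaces
dept_demo <- c("Production       ", "IT/IS", "Sales", "Production       ")

# Values that change when trimmed are the ones carrying extra whitespace
has_padding <- dept_demo != trimws(dept_demo)
has_padding

# Before trimming, an exact match finds nothing
sum(dept_demo == "Production")

# After trimming, both Production entries match
dept_clean <- trimws(dept_demo)
sum(dept_clean == "Production")
```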
Instead of writing new code, let’s go back to the creation code of hr_perf, and add trimws() to that. We’ll
have to use mutate() of course!

hr_perf <- hr %>%
  select(perf_score_id, position, department, manager_name, manager_id) %>%
mutate(department = trimws(department)) %>%
as_tibble()

glimpse(hr_perf)

## Rows: 311
## Columns: 5
## $ perf_score_id <int> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3,~
## $ position <chr> "Production Technician I", "Sr. DBA", "Production Techni~
## $ department <chr> "Production", "IT/IS", "Production", "Production", "Prod~
## $ manager_name <chr> "Michael Albert", "Simon Roup", "Kissy Sullivan", "Eliji~
## $ manager_id <int> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 20, 4, 18, 22,~

Perfect! Now let’s go back to isolating the production department.

hr_perf %>%
filter(department == "Production")

## # A tibble: 209 x 5
## perf_score_id position department manager_name manager_id
## <int> <chr> <chr> <chr> <int>
## 1 4 Production Technician I Production Michael Albert 22
## 2 3 Production Technician II Production Kissy Sullivan 20
## 3 3 Production Technician I Production Elijiah Gray 16
## 4 3 Production Technician I Production Webster Butler 39
## 5 4 Production Technician I Production Amy Dunn 11
## 6 3 Production Technician I Production Ketsia Liebig 19
## 7 3 Production Technician I Production Brannon Miller 12
## 8 3 Production Technician I Production David Stanley 14
## 9 3 Production Technician I Production Kissy Sullivan 20
## 10 3 Production Technician I Production Kelley Spirea 18
## # ... with 199 more rows

Let’s analyse based on position.

hr_perf %>%
filter(department == "Production") %>%
group_by(position) %>%
summarise(mean = mean(perf_score_id),
median = median(perf_score_id),
n = n())

## # A tibble: 4 x 4
## position mean median n
## <chr> <dbl> <dbl> <int>
## 1 Director of Operations 4 4 1
## 2 Production Manager 3 3 14
## 3 Production Technician I 2.95 3 137
## 4 Production Technician II 3 3 57

Still no trend. Every position averages about 3. The only exception is the Director of Operations, a
single person with a 4, who probably rated himself/herself.
Does the manager have anything to do with the performance rating?

hr_perf %>%
filter(department == "Production") %>%
group_by(manager_id) %>%
summarise(mean = mean(perf_score_id),
median = median(perf_score_id),
n = n())

## # A tibble: 12 x 4
## manager_id mean median n
## <int> <dbl> <dbl> <int>
## 1 2 3.07 3 15
## 2 11 2.90 3 21
## 3 12 2.82 3 22
## 4 14 3 3 21
## 5 16 3 3 22
## 6 18 3.09 3 22

## 7 19 3.05 3 21
## 8 20 2.95 3 22
## 9 22 2.86 3 21
## 10 30 3 3 1
## 11 39 2.92 3 13
## 12 NA 3.12 3 8

Again, nothing significant. Let’s step back out of the production department and see what happens.

hr_perf %>%
group_by(manager_id) %>%
summarise(mean = mean(perf_score_id),
median = median(perf_score_id),
n = n()) %>%
ggplot() + geom_col(aes(manager_id, mean)) +
geom_point(aes(manager_id, median))

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 1 rows containing missing values (geom_point).

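[Figure: column chart of mean perf_score_id (y-axis "mean") for each manager_id (x-axis, 0 to 40), with a point marking each manager's median]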
We’ve plotted the data to make it more interpretable. The columns represent the means and the points
represent the medians.
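The two warnings above most likely come from the one group whose manager_id is NA, which cannot be placed on the x-axis. A minimal sketch, on made-up data, of dropping those rows before summarising (perf_demo and by_manager are hypothetical names):

```r
library(dplyr)

# Made-up data: one employee has no manager_id recorded
perf_demo <- data.frame(
  manager_id    = c(2L, 2L, 11L, NA),
  perf_score_id = c(3, 4, 3, 3)
)

by_manager <- perf_demo %>%
  filter(!is.na(manager_id)) %>%      # drop rows without a manager id
  group_by(manager_id) %>%
  summarise(mean = mean(perf_score_id),
            median = median(perf_score_id),
            n = n())

by_manager
```

Whether to drop these employees or give them a placeholder manager is a judgment call; the point is only that the NA group is what the plot silently leaves out.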
How do the means and medians interact with each other?

If the mean and median are the same, the scores are spread roughly symmetrically around the middle (this
alone does not make them normally distributed). If the median is higher than the mean, then there are a
few outliers on the lower end (poor performers) that are dragging the mean down. And vice versa.
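A tiny made-up example shows the effect of one low score on the mean and the median (team_even and team_skew are invented, not taken from the dataset):

```r
# Made-up scores for two small teams
team_even <- c(3, 3, 3, 3, 3)   # everyone rated the same
team_skew <- c(3, 3, 3, 3, 1)   # one low score

mean(team_even)
median(team_even)

# The single 1 pulls the mean below 3, while the median stays at 3
mean(team_skew)
median(team_skew)
```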
Seems like managers 8, 11, 12, 15 and 17 have some poorer performers in their teams. Who are they? We’ll
have to go back to the original dataset and find out.

hr %>%
select(manager_id, manager_name) %>%
arrange(-manager_id)

That prints one row per employee, so each manager shows up many times. Let’s add a unique() to this.

hr %>%
select(manager_id, manager_name) %>%
unique() %>%
arrange(-manager_id)

## manager_id manager_name
## 1 39 Webster Butler
## 2 30 Michael Albert
## 3 22 Michael Albert
## 4 21 Lynn Daneault
## 5 20 Kissy Sullivan
## 6 19 Ketsia Liebig
## 7 18 Kelley Spirea
## 8 17 John Smith
## 9 16 Elijiah Gray
## 10 15 Debra Houlihan
## 11 14 David Stanley
## 12 13 Brian Champaigne
## 13 12 Brannon Miller
## 14 11 Amy Dunn
## 15 10 Alex Sweetwater
## 16 9 Board of Directors
## 17 7 Peter Monroe
## 18 6 Eric Dougall
## 19 5 Jennifer Zamora
## 20 4 Simon Roup
## 21 3 Brandon R. LeBlanc
## 22 2 Janet King
## 23 1 Brandon R. LeBlanc
## 24 NA Webster Butler
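Notice that a few names in this list appear under more than one id (Michael Albert under 22 and 30, Brandon R. LeBlanc under 1 and 3, Webster Butler under 39 and NA). A sketch of how one might flag such cases, using a made-up subset that mimics the pairs above (mgr_demo is hypothetical); distinct() is the dplyr counterpart of unique():

```r
library(dplyr)

# Made-up subset mimicking the id/name pairs printed above
mgr_demo <- data.frame(
  manager_id   = c(22L, 30L, 22L, 4L, 39L, NA),
  manager_name = c("Michael Albert", "Michael Albert", "Michael Albert",
                   "Simon Roup", "Webster Butler", "Webster Butler"),
  stringsAsFactors = FALSE
)

# distinct() keeps one row per id/name pair
pairs <- mgr_demo %>% distinct(manager_id, manager_name)

# Names that appear with more than one id (NA counts as its own value)
multi_id <- pairs %>%
  count(manager_name) %>%
  filter(n > 1)

multi_id
```

This matters when grouping: group_by(manager_id) treats the two ids of one person as two different managers, while group_by(manager_name) would merge them.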

Let’s take manager 15 - Debra Houlihan and explore a little.

hr %>%
filter(manager_id == 15) %>%
View()

Debra manages a 3-person Sales team. Her team’s performance average was pulled down by John Smith,
who is a “Needs Improvement” employee with 16 absences. In a team of 3, one poor performer is enough to
drop the average. John is quite different from the other two - he was a diversity hire and is the only married
(now divorced), the only male, and the only African-American member of Debra’s team.

Summary

So we explored the relationship between performance and role. Overall, there seems to be no clear correlation
as such. Everyone is by and large a good performer.
As always, there is more than just this much. We arbitrarily selected role data to explore, which is ideal for
exploring a dataset for the first time.
But as the John Smith example hints, performance may be related to the demographic data instead - where
you come from, what your hiring situation was, what your family needs are. One employee proves nothing
on its own, and this may be a controversial statement, but it is a valuable lead for an HR manager who
wants to bring equity into the workplace and bring out the best in each employee.
So whereas previously we had no question to ask, we now have a direction - we can probe the recruitment
sources for insights on recruitment and performance.

Analysing Recruitment Sources

Let’s look into the recruitment information. We have the main column - recruitment_source - and we can
choose to map those to any of the other types of information.
Let’s start with three things:

1. Gender
2. Department
3. Performance

hr_rec <- hr %>%
  select(recruitment_source, sex, department, perf_score_id) %>%
as_tibble()

hr_rec

## # A tibble: 311 x 4
## recruitment_source sex department perf_score_id
## <chr> <chr> <chr> <int>
## 1 LinkedIn "M " "Production " 4
## 2 Indeed "M " "IT/IS" 3
## 3 LinkedIn "F" "Production " 3
## 4 Indeed "F" "Production " 3
## 5 Google Search "F" "Production " 3
## 6 LinkedIn "F" "Production " 4
## 7 LinkedIn "F" "Software Engineering" 3
## 8 Employee Referral "M " "Production " 3
## 9 Diversity Job Fair "F" "Production " 3
## 10 Indeed "M " "IT/IS" 3
## # ... with 301 more rows

So what types of recruitment sources are there?

hr_rec %>% count(recruitment_source)

## # A tibble: 9 x 2
## recruitment_source n
## <chr> <int>
## 1 CareerBuilder 23
## 2 Diversity Job Fair 29
## 3 Employee Referral 31

## 4 Google Search 49
## 5 Indeed 87
## 6 LinkedIn 76
## 7 On-line Web application 1
## 8 Other 2
## 9 Website 13

Let’s slice these sources by gender.

hr_rec %>%
ggplot() + geom_bar(aes(recruitment_source, fill = sex))

[Figure: stacked bar chart of employee count (y-axis "count") for each recruitment_source (x-axis), filled by sex (F, M)]

Interesting: it seems like more women come from CareerBuilder (an American job portal) and from Google Search.
But we’ll have to visualise this data better - as the bar sizes differ, we can’t make out the ratios. While
we’re at it, let’s get rid of web applications and other sources, since there are only 3 of them combined.

hr_rec %>%
filter(recruitment_source != "Other" &
recruitment_source != "On-line Web application") %>%
ggplot() + geom_bar(aes(recruitment_source, fill = sex),
position = "fill")

[Figure: the same bars stacked to 100% (y-axis from 0.00 to 1.00), showing the share of F and M for each of the seven remaining recruitment sources]

# don't put position inside the aes()!
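The shares that the filled bars display can also be computed as a table, which is handy when you want the actual numbers. A minimal sketch on made-up rows (rec_demo and its counts are invented, not the real data):

```r
library(dplyr)

# Made-up data (not the real counts)
rec_demo <- data.frame(
  recruitment_source = c("Indeed", "Indeed", "Indeed", "Indeed",
                         "LinkedIn", "LinkedIn"),
  sex = c("F", "F", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

shares <- rec_demo %>%
  count(recruitment_source, sex) %>%     # employees per source and sex
  group_by(recruitment_source) %>%
  mutate(share = n / sum(n)) %>%         # share within each source
  ungroup()

shares
```

On the real hr_rec you may want to apply trimws() to sex first, since the values shown earlier include "M " with a trailing space.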

Another visualisation change - let’s add coord_flip(), which flips the x and y axes.

hr_rec %>%
filter(recruitment_source != "Other" &
recruitment_source != "On-line Web application") %>%
ggplot() + geom_bar(aes(recruitment_source, fill = sex),
position = "fill") +
coord_flip()

[Figure: the same 100% stacked bars drawn horizontally, with recruitment_source on the vertical axis (CareerBuilder to Website) and the share of F and M from 0.00 to 1.00 on the horizontal axis]

When your x-axis has words/letters (such as right now), it’s better to use a horizontal bar chart so that
people can read the labels more easily.
So women are more likely to come through CareerBuilder and Google Search (followed by Indeed and
LinkedIn), while men are more likely to get referred.
Do different recruitment sources result in better/worse performers?
We’ll have to visualise this differently.
With sex, we had a two-valued variable - M or F - so we split each recruitment source color-wise. The performance
score runs from 1 to 4 and has an order (1 lowest, 4 highest), so we can draw a histogram-style bar chart
of the counts from 1 to 4, with a separate panel for each recruitment source.

hr_rec %>%
filter(recruitment_source != "Other" &
recruitment_source != "On-line Web application") %>%
ggplot() + geom_bar(aes(perf_score_id)) +
facet_wrap(~recruitment_source)

[Figure: seven bar-chart panels, one per recruitment source (CareerBuilder, Diversity Job Fair, Employee Referral, Google Search, Indeed, LinkedIn, Website), each showing employee count (y-axis, 0 to 60) by perf_score_id (x-axis, 1 to 4)]
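Since the panels have very different heights, a numeric summary per source can make them easier to compare. A minimal sketch on made-up rows (rec_demo is invented, and share_low is a hypothetical helper column, not something in the dataset):

```r
library(dplyr)

# Made-up data (not the real scores)
rec_demo <- data.frame(
  recruitment_source = c("Indeed", "Indeed", "Indeed", "LinkedIn", "LinkedIn"),
  perf_score_id = c(3, 4, 2, 3, 3),
  stringsAsFactors = FALSE
)

source_summary <- rec_demo %>%
  group_by(recruitment_source) %>%
  summarise(n = n(),                                # employees per source
            mean_score = mean(perf_score_id),       # average score
            share_low = mean(perf_score_id <= 2))   # share scoring 1 or 2

source_summary
```

Keeping n next to the averages is a reminder that a source with only a handful of employees can swing a lot on one person.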

Wrap-up

By now it’s clear that this dataset holds many relationships and there is potential for answering many different
kinds of questions. For now, let’s just clean up and beautify the last plot.

hr_rec %>%
filter(recruitment_source != "Other" &
recruitment_source != "On-line Web application") %>%
ggplot() + geom_bar(aes(perf_score_id, fill = recruitment_source)) +
facet_wrap(~recruitment_source) +
theme_minimal() +
labs(title = "Performance Results of Employees from Different Recruitment Sources",
subtitle = "Employee data of a fictional company",
# n = 308: the 311 employees minus the 3 rows filtered out above
caption = "n = 308 | Data provided by Dr. Rich Huebner and Dr. Carla Patalano",
x = "Performance Score (1 = lowest | 4 = highest)",
y = "Number of Employees",
fill = "Recruitment Source")

[Figure: "Performance Results of Employees from Different Recruitment Sources" - the same seven panels with one fill colour per recruitment source, x-axis "Performance Score (1 = lowest | 4 = highest)", y-axis "Number of Employees", legend "Recruitment Source"]

Homework

For homework, answer the following questions:

1. What is the overall diversity profile of the organization?


2. Are there areas of the company where pay is not equitable?
3. Does working on special projects affect performance?

Bonus marks: Is there a relationship between age and performance?


Attach your code and summaries/plots that you used to answer the questions!
