
INDIVIDUAL ASSIGNMENT

TECHNOLOGY PARK MALAYSIA


CT127-3-2-PFDA
PROGRAMMING FOR DATA ANALYSIS
APU2F2206CS(CYB)

HAND OUT DATE : 25TH JULY 2022

HAND IN DATE : 12TH AUGUST 2022

WEIGHTAGE : 50%     

        
NAME : ARMILLIA KARENNA

TP NUMBER : TP060327

LECTURER : MS. MINNU HELEN JOSEPH



TABLE OF CONTENTS

INTRODUCTION & ASSUMPTION

DATA IMPORT, CLEANING, PRE-PROCESSING AND EXPLORATION

Data Import
Data Cleaning
Data Pre-processing
Data Exploration

QUESTIONS AND ANALYSIS

Question 1 : Analysis for employee attrition over the years
Analysis 1-1 : Relationship between terminated and active employees over the years.
Analysis 1-2 : Status year that has the most active employees.
Analysis 1-3 : Status year that has the most terminated employees.
Analysis 1-4 : Employee attrition rate between male and female employees.
Analysis 1-5 : Year that has the most employee hired date.
Analysis 1-6 : Year that has the most employee termination date.
Question 2 : Analysis for type of termination
Analysis 2-1 : Employee attrition rate in type of termination.
Analysis 2-2 : Type of termination that has the highest number.
Analysis 2-3 : Type of termination with the lowest number.
Analysis 2-4 : Relationship between type of termination and age.
Question 3 : Analysis for reason of termination
Analysis 3-1 : Employee attrition rate in reason of termination.
Analysis 3-2 : Reason of termination with the highest number.
Analysis 3-3 : Reason of termination with the lowest number.
Analysis 3-4 : Relationship between reason of termination and age.
Question 4 : Analysis on Cities in Canada
Analysis 4-1 : Employee attrition rate in the cities of Canada.
Analysis 4-2 : Cities with the highest number of employee attrition.
Analysis 4-3 : Cities with the lowest number of employee attrition.
Analysis 4-4 : Relationship between cities in Canada and department.
Question 5 : Analysis on Business Unit
Analysis 5-1 : Employee attrition rate in the business unit.
Analysis 5-2 : Relationship between city and business unit.
Analysis 5-3 : Relationship between type of termination and business unit.
Analysis 5-4 : Highest number of type of termination in business unit.
Analysis 5-5 : Relationship between reason of termination and business unit.
Analysis 5-6 : Highest number of reasons for termination in business unit.
Question 6 : Analysis on ages of employees
Analysis 6-1 : Relationship of the terminated employees and their age
Analysis 6-2 : Age with the highest number of terminated employees.
Analysis 6-3 : Age with the lowest number of terminated employees.
Analysis 6-4 : Relationship between employee age and gender.
Question 7 : Analysis on Job Title
Analysis 7-1 : Type of job with the highest number of terminated employees.
Analysis 7-2 : Type of job with the lowest number of terminated employees.
Question 8 : Analysis on Departments
Analysis 8-1 : Employee attrition rate in department.
Analysis 8-2 : Department with highest number of terminated employees.
Analysis 8-3 : Department with lowest number of terminated employees.
Question 9 : Analysis on length of service
Analysis 9-1 : Relationship between length of service and job titles.
Analysis 9-2 : Relationship between cities in Canada and length of service
Analysis 9-3 : Relationship between length of service and status year
Analysis 9-4 : Relationship between employees length of service and status.
Question 10 : Analysis on employee birthday
Analysis 10-1 : Counts of birthdays in month for active employees and terminated employees
Analysis 10-2 : Counts for the highest amount of birthdays in a day

EXTRA FEATURE USED

CONCLUSION

REFERENCES

INTRODUCTION & ASSUMPTION

The main objective of this assignment is to uncover hidden issues in human resources
management. I will analyse the given dataset to identify hidden problems in the
organisation and provide meaningful insight for decision making.

The dataset is explored using the data exploration, manipulation, transformation, and
visualisation techniques covered in the course. As an additional feature, further
concepts that can improve the retrieval effects must also be explored.

The dataset provided for this assignment relates to the employees’ job information
and attrition. It contains 18 columns and 49654 rows. The dataset includes the
personal details of the staff, job department, position, location, working status, and
reason of termination.

DATA IMPORT, CLEANING, PRE-PROCESSING AND EXPLORATION

Data Import

Before importing the CSV file, I installed and loaded all the packages I needed. The packages
“ggplot2”, “dplyr”, “plotrix”, “janitor”, “gridExtra” and “RColorBrewer” are installed using
install.packages() and loaded using library(). library() attaches a package to the current
session so that its functions can be used in the code. Moving on to data import, since the
data is stored in CSV format, read.csv() is used to read the file into a data frame. The
imported dataset is named “employee_attrition”.
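
The import step described above could look roughly like the sketch below. The file contents and column names here are hypothetical stand-ins written to a temporary file so that the example runs anywhere; the real assignment would pass the path of the provided CSV file instead.

```r
# Packages named in the report. Installing is only needed once per machine,
# so both lines are left commented out in this sketch.
pkgs <- c("ggplot2", "dplyr", "plotrix", "janitor", "gridExtra", "RColorBrewer")
# install.packages(pkgs)
# invisible(lapply(pkgs, library, character.only = TRUE))

# Hypothetical stand-in for the real CSV file (illustrative columns only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "EmployeeID,city_name,STATUS",
  "1318,Vancouver,ACTIVE",
  "1319,Victoria,TERMINATED"
), csv_path)

# read.csv() parses the file and returns a data frame.
employee_attrition <- read.csv(csv_path, header = TRUE)
print(dim(employee_attrition))
```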

Data Cleaning

For data cleaning, I decided to use “janitor” and “dplyr”. clean_names() from “janitor” makes
all the variable names in my dataframe consistent in one line of code. Next, remove_empty()
removes any empty columns or rows from my dataframe. Lastly, distinct() from “dplyr” removes
duplicate rows from my dataset.
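
A minimal sketch of these three cleaning calls on a toy data frame is given below. The data and the object names `raw` and `cleaned` are hypothetical and only serve to show what each function does.

```r
library(janitor)
library(dplyr)

# Toy data (hypothetical): messy names, one all-NA column,
# one all-NA row, and one exact duplicate row.
raw <- data.frame(
  "Employee ID" = c(1318, 1318, 1319, NA),
  "City.Name"   = c("Vancouver", "Vancouver", "Victoria", NA),
  "Empty Col"   = c(NA, NA, NA, NA),
  check.names = FALSE
)

cleaned <- raw %>%
  clean_names() %>%                            # e.g. "Employee ID" becomes "employee_id"
  remove_empty(which = c("rows", "cols")) %>%  # drop rows and columns that are entirely NA
  distinct()                                   # drop exact duplicate rows

print(cleaned)
```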

The main purpose of this block of code is to remove the duplicate employee ids in my dataset,
as many rows share the same employee id but hold different information overall. First, I
removed the employee_id column from my dataframe using the subset() function. Then, taking
the first original employee id as the base, I treated employee id 1318 as the first id.
Next, I created a new dataframe containing the range from 1318 to x, where x was calculated
beforehand. Then I changed the column name from “employee_id” to “emp_id” and used cbind()
to bind the new column to the existing dataset.
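
The steps above might be sketched as follows. The source does not show how x was calculated, so the formula below (first id plus the number of rows, minus one) is an assumption, and the toy data is hypothetical.

```r
# Toy data (hypothetical): duplicated employee ids with different details.
df <- data.frame(
  employee_id = c(1318, 1318, 1319, 1319, 1320),
  status_year = c(2006, 2007, 2006, 2007, 2006)
)

# 1. Drop the original id column.
df <- subset(df, select = -employee_id)

# 2. Work out the last id so that the range has one value per row
#    (assumed calculation, not taken from the report).
first_id <- 1318
x <- first_id + nrow(df) - 1

# 3. Build the replacement ids, rename the column to emp_id, and bind it on.
ids <- data.frame(employee_id = first_id:x)
names(ids)[names(ids) == "employee_id"] <- "emp_id"
df <- cbind(ids, df)

print(df)
```

Note that this gives every row its own id, so rows that originally belonged to the same employee no longer share one.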

I use sum(is.na()) to check whether there are still any missing values in the columns or rows
of my dataset. This is a double check in case I missed any NA values earlier. I applied
sum(is.na()) to all of my variables, as seen in figure 1.4.

Figure 1.6 shows some of the results, which display 0 since there are no null/NA values.
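
As a small illustration of this check, the sketch below counts missing values on a hypothetical data frame. colSums(is.na()) is shown as an optional shortcut that returns one count per column in a single call; it is not taken from the report.

```r
# Toy data (hypothetical) with two missing values.
df <- data.frame(
  age       = c(52, 53, NA),
  city_name = c("Vancouver", NA, "Victoria"),
  status    = c("ACTIVE", "ACTIVE", "TERMINATED"),
  stringsAsFactors = FALSE
)

sum(is.na(df$age))                   # missing values in one column
sum(is.na(df))                       # missing values in the whole data frame
na_per_column <- colSums(is.na(df))  # one count per column
print(na_per_column)
```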

Data Pre-processing

Using the unique() function to list the unique values of each variable in the dataset, I found
some issues. In one case, 2 values that are supposed to be the same are treated as different
values because one of them is misspelled. Another value is also misspelled. To fix them, I
replaced values in a single column of my dataframe with code of the form
df["Column Name"][df["Column Name"] == "Old Value"] <- "New Value". With this code I replaced
the value “New Westminister” in column “city_name” with the correct spelling “New
Westminster”. If this were left unfixed, any analysis that groups by city would give wrong
results. Next, I replaced the value “Resignaton” in column “termreason_desc” with the correct
spelling “Resignation”.
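
The replacement pattern can be tried on a small hypothetical data frame, as in the sketch below; only the two column names and the misspelled values come from the report.

```r
# Toy data (hypothetical rows) containing the two misspellings.
df <- data.frame(
  city_name       = c("New Westminister", "New Westminster", "Vancouver"),
  termreason_desc = c("Resignaton", "Resignation", "Retirement"),
  stringsAsFactors = FALSE
)

print(unique(df$city_name))  # the misspelled city shows up as its own value

# df["Column Name"][df["Column Name"] == "Old Value"] <- "New Value"
df["city_name"][df["city_name"] == "New Westminister"] <- "New Westminster"
df["termreason_desc"][df["termreason_desc"] == "Resignaton"] <- "Resignation"

print(unique(df$city_name))
print(unique(df$termreason_desc))
```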

For this part, I converted all the date columns in my dataframe to the Date class, as they were
previously stored as character vectors. In the line of code, I refer to the column with the
data frame name followed by the “$” symbol and the column name, and assign to it the result
of as.Date() applied to that same column. I specified the date format as month, followed by
day and year. I did this for “recorddate_key”, “birthday_key”, “orighiredate_key” and
“terminationdate_key”.
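
A minimal sketch of the conversion is shown below. The sample dates and the exact format string "%m/%d/%Y" are assumptions based on the month, day, year order described above; the real file may use a slightly different pattern.

```r
# Toy data (hypothetical): dates stored as character vectors.
df <- data.frame(
  birthday_key     = c("1/3/1954", "8/15/1980"),
  orighiredate_key = c("8/28/1989", "4/1/2008"),
  stringsAsFactors = FALSE
)

# Convert each column in place; format describes how the text is laid out.
df$birthday_key     <- as.Date(df$birthday_key, format = "%m/%d/%Y")
df$orighiredate_key <- as.Date(df$orighiredate_key, format = "%m/%d/%Y")

str(df)
```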

In this section, I created some new columns from the dates in my dataframe to be used later for
analysis. The format() function takes the column, referenced with the data frame name and
the “$” symbol, followed by the part I need (day, month or year), which is written with the
“%” symbol. Because format() returns characters, I wrapped it in as.numeric() so that the
new columns are numeric vectors, and assigned the result to a new column of the data frame.
I made the new columns “Bmonth” and “Bday” from “birthday_key”, and the new columns
“hire_year” and “terminated” from “orighiredate_key” and “terminationdate_key”
respectively.
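
The extraction of date parts could be sketched as follows, using hypothetical dates and the column names given in the report.

```r
# Toy data (hypothetical): columns already converted to Date.
df <- data.frame(
  birthday_key        = as.Date(c("1954-01-03", "1980-08-15")),
  orighiredate_key    = as.Date(c("1989-08-28", "2008-04-01")),
  terminationdate_key = as.Date(c("1900-01-01", "2014-12-30"))
)

# format() returns text such as "01"; as.numeric() turns it into a number.
df$Bmonth     <- as.numeric(format(df$birthday_key, "%m"))
df$Bday       <- as.numeric(format(df$birthday_key, "%d"))
df$hire_year  <- as.numeric(format(df$orighiredate_key, "%Y"))
df$terminated <- as.numeric(format(df$terminationdate_key, "%Y"))

print(df[, c("Bmonth", "Bday", "hire_year", "terminated")])
```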

Data Exploration

Starting with names(), this function returns and displays the variable names in my dataframe.
The code simply calls names() with the data frame name in the brackets. The result is shown
in figure 1.12.

Moving on, the dim() function tells me how many rows and columns my data frame has. Calling
dim() with my data frame in the brackets outputs the number of rows followed by the number
of columns. The result is shown below in figure 1.13.

Furthermore, the str() function compactly displays the internal structure of my dataset. It
displays the number of observations and variables, followed by information for each variable
such as its name and class. The result is seen below in figure 1.14.

Additionally, the summary() function displays a summary of each column: the mean, median,
maximum and minimum for numeric columns, and the length, class and mode for character
columns. Figure 1.15 below is the result of the function.

Moving on to the View() function, this opens a spreadsheet-style viewer that contains all my
data so that I can view it in a clean and neat way. Figure 1.16 below shows some results.

Lastly, the unique() function returns all the distinct values in the column of the data frame
that is specified in the line of code. The result for this function is shown below in
figure 1.17.
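
For reference, the exploration functions described in this section can be tried together on a small hypothetical data frame, as in the sketch below. View() is left commented out because it needs an interactive session.

```r
# Toy data (hypothetical).
df <- data.frame(
  age       = c(52, 53, 30),
  city_name = c("Vancouver", "Vancouver", "Victoria"),
  stringsAsFactors = FALSE
)

names(df)             # variable names
dim(df)               # number of rows, then number of columns
str(df)               # compact structure: observations, variables, classes
summary(df)           # per-column summary
unique(df$city_name)  # distinct values of one column
# View(df)            # opens the spreadsheet-style viewer (interactive only)
```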

QUESTIONS AND ANALYSIS

Question 1 : Analysis for employee attrition over the years

Analysis 1-1 : Relationship between terminated and active employees over the years.

Source code

In this analysis, the graph is plotted using the “ggplot” function with 4 layers. The first
layer sets the x axis to “status_year”, and sets “fill” for the inside colour and “colour”
for the outline colour, both mapped to “status” so that the values “terminated” and “active”
are seen in 2 different colours. The second layer is “stat_count”, which counts the number
of rows for each x axis value; inside it, “width” sets the width of the bars and “position”
is set to “identity”, which draws the bars for both statuses at the same x position, one
over the other, which gives the stacked look. The third layer is “geom_text”, which adds
text labels to the plot: position is set to “identity”, stat “count” counts the rows for
each x axis value, and label = count inside “aes” displays the count at the top of each bar,
while vjust adjusts the labels vertically, size sets the label font size and colour sets the
label font to black. Lastly, the fourth layer is the labs() function, with “title” for the
graph title and “x” and “y” for the x axis and y axis labels.
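
The four layers described above could be put together roughly as in the sketch below. The data, the numeric settings (width, vjust, size) and the title text are hypothetical, and after_stat(count) is used for the label, which assumes ggplot2 version 3.3.0 or later.

```r
library(ggplot2)

# Toy data (hypothetical): one row per employee record.
df <- data.frame(
  status_year = c(2006, 2006, 2006, 2007, 2007, 2007, 2007),
  status = c("ACTIVE", "ACTIVE", "TERMINATED",
             "ACTIVE", "ACTIVE", "ACTIVE", "TERMINATED")
)

p <- ggplot(df, aes(x = status_year, fill = status, colour = status)) +  # layer 1
  stat_count(width = 0.5, position = "identity") +                       # layer 2: count rows per year
  geom_text(stat = "count", aes(label = after_stat(count)),              # layer 3: count labels
            position = "identity", vjust = -0.5, size = 3, colour = "black") +
  labs(title = "Terminated and active employees over the years",         # layer 4: titles
       x = "Status year", y = "Number of employees")

print(p)
```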

Data visualisation

This graph shows the number of employees on the y axis and the years on the x axis. Each bar
has 2 colours to indicate the 2 values in the status column, “terminated” and “active”. The
bars keep increasing from 2006 to 2013 and then decrease from 2013 to 2015. The totals for
both terminated and active employees are shown above each stacked bar.

Analysis 1-2 : Status year that has the most active employees.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only difference is the title of the graph. The graph is used to find which
year has the most active employees, using status year.

Data visualisation

As seen in figure 2.4, the graph clearly shows that 2013 had the most active employees, shown
in red in the stacked bar, with 5215 of them active that year.

Analysis 1-3 : Status year that has the most terminated employees.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only difference is the title of the graph. The graph is used to find which
year has the most terminated employees, using status year.

Data visualisation

As seen in figure 2.6, the graph clearly shows that 2014 had the most terminated employees,
shown in turquoise in the stacked bar, with 253 of them terminated that year.

Analysis 1-4 : Employee attrition rate between male and female employees.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph and that fill and colour are mapped
to “gender_full”, so that the values “female” and “male” are seen in 2 different colours. I
added the facet_wrap() function, which splits the plot into panels when a variable has at
least 2 unique values, so that 2 graphs of the same kind with different values are shown for
comparison. I also added the coord_flip() function, which swaps the x axis and y axis. The
graph is used to find the employee attrition rate between male and female employees over the
years. In addition, the table function is used to show the counts for the “status_year”,
“gender_full” and “status” variables, because the numbers are too big to be displayed on the
bars without clashing with each other; the table replaces the count labels on the graph.
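
A hedged sketch of this variation is given below: the same counting layer, split into panels by status with facet_wrap(), flipped with coord_flip(), and accompanied by a three-way table(). The data and the title are hypothetical.

```r
library(ggplot2)

# Toy data (hypothetical): one row per employee record.
df <- data.frame(
  status_year = c(2013, 2013, 2013, 2014, 2014, 2014),
  gender_full = c("Female", "Male", "Male", "Female", "Female", "Male"),
  status      = c("ACTIVE", "ACTIVE", "ACTIVE", "TERMINATED", "ACTIVE", "TERMINATED")
)

p <- ggplot(df, aes(x = status_year, fill = gender_full, colour = gender_full)) +
  stat_count(width = 0.5, position = "identity") +
  facet_wrap(~status) +  # one panel per status value
  coord_flip() +         # swap the x and y axes
  labs(title = "Attrition by gender over the years",
       x = "Status year", y = "Number of employees")
print(p)

# The table gives the exact counts that would not fit on the bars.
counts <- table(df$status_year, df$gender_full, df$status)
print(counts)
```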

Data visualisation

Figure 2.8 shows how many male and female employees were active and terminated over the
years, with males in turquoise and females in red. 2 graphs are displayed to show terminated
and active employees separately, each with both genders over the years.

In the data in figure 2.9, we can see that 2013 had the most active employees for both females
and males. Below that, we can see that 2014 had the most terminated employees for both
females and males.

Analysis 1-5 : Year that has the most employee hired date.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only differences are the title of the graph and the x axis variable. The
graph is used to find the year in which the most employees were hired.

Data visualisation

As seen in figure 2.11, the graph clearly shows that 2000 is the year in which the most
employees were hired, with 2733 employees hired during that year.

Analysis 1-6 : Year that has the most employee termination date.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only differences are the title of the graph and the x axis variable. The
graph is used to find the year that has the most employee terminations.

Data visualisation

As seen in figure 2.13, the graph shows the year 1900 with by far the highest count, 42450.
There is then a big empty gap in the graph, and terminations only appear again from 2006. A
count this large in 1900, followed by such a gap, most likely means that 1900 is a
placeholder date for employees who have no termination date, rather than a year in which
terminations really happened.

Since the data is so big, the numbers cannot be seen clearly on the graph, so the table
function is used, as seen in figure 2.14, to show the numbers instead. It also shows that
the year 1900 has the highest count.

Question 2 : Analysis for type of termination

Analysis 2-1 : Employee attrition rate in type of termination.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “status_year” is involved, 3 graphs of the same
kind appear, one for each value, so that the data can be seen more clearly. The graph is
used to find the employee attrition rate by type of termination.

Data visualisation

As seen in figure 2.16, the variable “termtype_desc” has 3 values: “involuntary” in red,
“not applicable” in green and “voluntary” in blue. The years appear on the x axis and the
number of employees on the y axis. Moreover, the graph suggests that no employees were
terminated involuntarily from 2006 to 2013.

Analysis 2-2 : Type of termination that has the highest number.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only differences are the title of the graph and the x axis variable. The
graph is used to find the type of termination with the highest number.

Data visualisation

As seen in figure 2.18, between “voluntary” and “involuntary”, voluntary clearly has the
highest number, with 1270 employees being terminated voluntarily. The value “not applicable”
is not counted for this analysis since it means that those employees are still active.
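
One way to leave “not applicable” out before comparing the termination types is sketched below with base R. The data and the exact spelling of the values are hypothetical, and this filtering step is an illustration rather than the code used for the figure.

```r
# Toy data (hypothetical values and spelling).
df <- data.frame(
  termtype_desc = c("Voluntary", "Voluntary", "Involuntary",
                    "Not Applicable", "Not Applicable", "Not Applicable"),
  stringsAsFactors = FALSE
)

# Keep only rows that describe an actual termination.
terminated_only <- subset(df, termtype_desc != "Not Applicable")

counts <- table(terminated_only$termtype_desc)
print(counts)

names(which.max(counts))  # type with the highest number
names(which.min(counts))  # type with the lowest number
```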

Analysis 2-3 : Type of termination with the lowest number.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only differences are the title of the graph and the x axis variable. The
graph is used to find the type of termination with the lowest number.

Data visualisation

As seen in figure 2.18, between “voluntary” and “involuntary”, involuntary clearly has the
lowest number, with 215 employees being terminated involuntarily. The value “not applicable”
is not counted for this analysis since it means that those employees are still active.

Analysis 2-4 : Relationship between type of termination and age.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “age” is involved, 3 graphs of the same kind
appear, one for each value, so that the data can be seen more clearly. The coord_flip()
function is also added, which swaps the x axis and y axis. The graph is used to find the
relationship between type of termination and age.

Data visualisation

As seen in figure 2.22, the graph shows that many employees aged 20 to 30 chose to leave
voluntarily. The reason could be that they found a better job opportunity or are exploring,
since it is common to do so around those ages. Employees aged 60 and above also chose to
leave voluntarily, which in this case is most likely because they retired due to old age. As
for involuntary terminations, the age of 64 has the highest number, with 14 employees who
had to leave involuntarily. My best guess is that they were due to retire but chose not to
and still wanted to work.

Question 3 : Analysis for reason of termination

Analysis 3-1 : Employee attrition rate in reason of termination.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “status_year” is involved, 4 graphs of the same
kind appear, one for each value, so that the data can be seen more clearly. The graph is
used to find the employee attrition rate by reason of termination.

Data visualisation

As seen in figure 2.24, the graph shows that no terminations with layoff as the reason
happened from 2006 to 2013. For resignation, the graph shows that 2012 has the most
resignations, with 76 employees resigning. Furthermore, for retirement, the graph shows that
2008 has the most retirements, with 138 employees retiring.

Analysis 3-2 : Reason of termination with the highest number.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only differences are the title of the graph and the x axis variable.

Data visualisation

As seen in figure 2.26, the graph clearly shows that retirement has the highest number for
reason of termination, with 885 employees retiring. The value “not applicable” is not
counted for this analysis since it means that those employees are still active. The graph is
used to find the reason for termination with the highest number.

Analysis 3-3 : Reason of termination with the lowest number.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the only differences are the title of the graph and the x axis variable. The
graph is used to find the reason for termination with the lowest number.

Data visualisation

As seen in figure 2.28, the graph clearly shows that layoff has the lowest number for reason
of termination, with 215 employees being laid off. The value “not applicable” is not counted
for this analysis since it means that those employees are still active. The graph is used to
find the reason for termination with the lowest number.

Analysis 3-4 : Relationship between reason of termination and age.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “age” is involved, 4 graphs of the same kind
appear, one for each value, so that the data can be seen more clearly. The graph is used to
find the relationship between reason of termination and age.

Data visualisation

As seen in figure 2.30, the graph shows that employees of almost all ages from 20 to 64 were
laid off, with the most being 18 employees laid off at the age of 64. Resignation is similar
to layoff: employees aged 19 to 64 resigned, with the most being 75 employees who resigned
at the age of 30. Finally, for retirement, only employees aged 60 and above retired, with
591 employees retiring at the age of 65.

Question 4 : Analysis on Cities in Canada

Analysis 4-1 : Employee attrition rate in the cities of Canada.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “city_name” is involved, 2 graphs of the same kind
appear, one for each value, so that the data can be seen more clearly. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the
employee attrition rate in the cities of Canada.

Data visualisation

As seen in figure 2.32, the graph shows 2 plots, one for terminated employees and one for
active employees. The coord_flip() function was used to move the x axis variable onto the y
axis so that the city names can be read more clearly than on the x axis. Red represents
active employees while blue represents terminated employees.

Analysis 4-2 : Cities with the highest number of employee attrition.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “city_name” is involved, 2 graphs of the same kind
appear, one for each value, so that the data can be seen more clearly. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the city
with the highest number of employee attrition.

Data visualisation

As seen in figure 2.34, the graph clearly shows that Vancouver is the city with the highest
number of employee attrition. It has the most terminated employees, shown in turquoise, with
296 employees terminated in that city.

Analysis 4-3 : Cities with the lowest number of employee attrition.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “city_name” is involved, 2 graphs of the same kind
appear, one for each value, so that the data can be seen more clearly. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the city
with the lowest number of employee attrition.

Data visualisation

As seen in figure 2.36, the graph clearly shows that Blue River is the city with the lowest
number of employee attrition. It has the fewest terminated employees, shown in turquoise,
with only 1 employee terminated in that city.

Analysis 4-4 : Relationship between cities in Canada and department.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
coord_flip() function, which swaps the x axis and y axis. The theme() function is also added
to this plot to change the angle of the x axis labels: with an angle of 90 degrees, the
labels switch from horizontal to vertical. The graph is used to find the relationship
between cities in Canada and departments.
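
Rotating the axis labels with theme() might look like the sketch below. The data and the column name `department_name` are assumptions, the hjust and vjust values are illustrative, and coord_flip() is left out so that the rotated labels are easy to see.

```r
library(ggplot2)

# Toy data (hypothetical rows; department_name is an assumed column name).
df <- data.frame(
  city_name       = c("Vancouver", "Vancouver", "Victoria", "Burnaby"),
  department_name = c("Produce", "Meats", "Produce", "Dairy")
)

p <- ggplot(df, aes(x = city_name, fill = department_name)) +
  stat_count(width = 0.5) +
  # angle = 90 turns the x axis labels from horizontal to vertical.
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Departments by city", x = "City", y = "Number of employees")

print(p)
```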

Data visualisation

As seen in figure 2.38, the graph shows that Vancouver has the most departments, followed by
Victoria and Burnaby. The produce department seems to be the most dominant department, as it
can clearly be seen in almost every city with high numbers. Moreover, the dairy department
seems to dominate the city of Burnaby by a large margin. The meat department also seems to
be very present in the cities of Kamloops, Kelowna, Victoria and Vancouver.

Question 5 : Analysis on Business Unit

Analysis 5-1 : Employee attrition rate in the business unit.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph and the x axis variable. The other
difference is that I used “dodge” instead of “identity” for the position, which places the
bars side by side and gives a neater look. The graph is used to find the employee attrition
rate in the business unit.
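
A sketch of the “dodge” variant is given below. The data and the column name `business_unit` are assumptions, and position_dodge() with an explicit group is used for the labels so that each count sits above its own bar.

```r
library(ggplot2)

# Toy data (hypothetical rows; business_unit is an assumed column name).
df <- data.frame(
  status        = c("ACTIVE", "ACTIVE", "ACTIVE", "TERMINATED", "TERMINATED"),
  business_unit = c("STORES", "STORES", "HEADOFFICE", "STORES", "HEADOFFICE")
)

p <- ggplot(df, aes(x = status, fill = business_unit, colour = business_unit)) +
  stat_count(width = 0.5, position = "dodge") +  # bars side by side
  geom_text(stat = "count",
            aes(label = after_stat(count), group = business_unit),
            position = position_dodge(width = 0.5),
            vjust = -0.5, size = 3, colour = "black") +
  labs(title = "Attrition by business unit",
       x = "Status", y = "Number of employees")

print(p)
```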

Data visualisation

As seen in figure 2.40, the graph shows the relationship between business unit and status of
employees. The head office is shown in red while the stores are shown in turquoise. For
active employees, there are 516 active employees in the head office and 47652 in stores. For
terminated employees, there are 69 terminated employees in the head office and 1416 in
stores.

Analysis 5-2 : Relationship between city and business unit.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “city_name” is involved, 2 graphs of the same kind
appear, one for each value, so that the data can be seen more clearly. The theme() function
is also added to this plot to change the angle of the x axis labels: with an angle of 90
degrees, the labels switch from horizontal to vertical. The graph is used to find the
relationship between cities in Canada and business units.

Data visualisation

In figure 2.42, on the relationship between city and business unit, the head office can be
seen in only one city, Vancouver, with 585 head office records there. Vancouver also has the
most store records, with 10626. Blue River has the fewest store records, with only 9.

Analysis 5-3 : Relationship between type of termination and business unit.

Source code

This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function, which produces 3 graphs of the same kind, one for each type of
termination, so that the data can be seen more clearly. The graph is used to find the
relationship between type of termination and business units.

Data visualisation

As seen in figure 2.44, the graph is split into 3 plots, one for each type of termination:
involuntary in red, not applicable in green and voluntary in blue. The graph also shows that
there are no involuntary terminations for the head office.

Analysis 5-4 : Highest number of type of termination in business unit.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function, which splits the graph into 3 panels with the same layout, one for each
type of termination, so that the data can be seen more clearly. The graph is used to find the
type of termination with the highest number for the business units.

Data visualisation

As seen in figure 2.46, the graph is split into 3 plots, one for each type of termination:
involuntary in red, not applicable in green and voluntary in blue. Voluntary termination has
the highest number for the business units, with 69 employees terminated in the head office
and 1201 employees terminated in stores, which adds up to 1270 employees who were terminated
voluntarily.

Analysis 5-5 : Relationship between reason of termination and business unit.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function, which splits the graph into 4 panels with the same layout, one for each
reason of termination, so that the data can be seen more clearly. The graph is used to find
the relationship between reason of termination and business units.

Data visualisation

As seen in figure 2.48, the graph is split into 4 plots, one for each reason of termination:
layoff in red, not applicable in green, resignation in blue and retirement in purple. The
graph also shows that there are no layoff terminations for the head office.

Analysis 5-6 : Highest number of reasons for termination in business unit.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function, which splits the graph into 4 panels with the same layout, one for each
reason of termination, so that the data can be seen more clearly. The graph is used to find
the reason of termination with the highest number in the business units.

Data visualisation

As seen in figure 2.50, the graph is split into 4 plots, one for each reason of termination:
layoff in red, not applicable in green, resignation in blue and retirement in purple.
Retirement has the highest number for the business units, with 68 employees terminated in the
head office and 817 employees terminated in stores, which adds up to 885 employees who were
terminated through retirement.

Question 6 : Analysis on ages of employees

Analysis 6-1 : Relationship of the terminated employees and their age

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph and the x axis variable. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the relationship between terminated employees and their age.
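
As an illustration of what coord_flip() does, here is a minimal sketch. The column `age` is named in the report, but `status`, the titles and the toy values are assumptions, and this is not the original source code.

```r
library(ggplot2)

# Hypothetical sample; `age` is named in the report, `status` and the values are assumptions.
employees <- data.frame(
  age    = c(21, 21, 30, 30, 45, 60, 65, 65),
  status = c("ACTIVE", "ACTIVE", "ACTIVE", "TERMINATED",
             "ACTIVE", "TERMINATED", "TERMINATED", "TERMINATED")
)

age_plot <- ggplot(employees, aes(x = age, fill = status)) +
  geom_bar() +
  coord_flip() +  # ages now run along the vertical axis, counts along the horizontal one
  labs(title = "Terminated employees and their age",
       x = "Age", y = "Number of records")

print(age_plot)
```

The aesthetics are still declared as if the plot were upright (age on x), and the flip is applied only when the plot is drawn.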

Data visualisation

As seen in figure 2.52, 2 colours represent the status of the employees: active employees are
in red and terminated employees are in turquoise. Since the coord_flip() function is used,
the x axis has switched places with the y axis, which makes the graph look better and lets
the data be seen more clearly.

Analysis 6-2 : Age with the highest number of terminated employees.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph and the x axis variable. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the age with the highest number of terminated employees.

Data visualisation

As seen in figure 2.54, the turquoise bars, which represent terminated employees, show that
employees aged 65 have the highest number of terminations, with 591 of them being terminated.
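
Reading the tallest or shortest bar by eye can be cross-checked with a count table. This is a minimal base R sketch under the assumption that the columns are called `age` and `status` (only `age` is named in the report, and the values are toy data):

```r
# Hypothetical sample; `status` and all values are assumptions for illustration.
employees <- data.frame(
  age    = c(30, 30, 60, 63, 65, 65, 65, 45),
  status = c("TERMINATED", "TERMINATED", "TERMINATED", "TERMINATED",
             "TERMINATED", "TERMINATED", "TERMINATED", "ACTIVE")
)

# Keep only terminated records and count them per age.
terminated <- employees[employees$status == "TERMINATED", ]
age_counts <- table(terminated$age)

# Age with the most terminations (the first one, if there is a tie).
highest_age <- names(age_counts)[which.max(age_counts)]

# All ages that share the smallest count, so ties are not hidden.
lowest_ages <- names(age_counts)[age_counts == min(age_counts)]

print(highest_age)
print(lowest_ages)
```

Note that which.max() returns only the first maximum, which is why the lowest counts are selected by comparison with min() instead.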

Analysis 6-3 : Age with the lowest number of terminated employees.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph and the x axis variable. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the age with the lowest number of terminated employees.

Data visualisation

As seen in figure 2.56, the turquoise bars, which represent terminated employees, show that
employees aged 63 have the lowest number of terminations, with only 2 of them being
terminated.

Analysis 6-4 : Relationship between employee age and gender.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “age” is involved, 2 graphs with the same layout
are drawn for different values so that the data can be seen more clearly. The theme()
function is also added to change the angle of the x axis labels, turning them from horizontal
to vertical by using 90 degrees. The graph is used to find the relationship between
employees’ age and gender.

Data visualisation

As seen in figure 2.58, there are 2 graphs, one for each value of the variable status
(“active” and “terminated”), each showing the gender of the employees. The graph shows far
more male employees than female employees. Female employees become more visible from the age
of 50 and above, which suggests that most of the female employees are older. In the
terminated graph, all 591 employees who were terminated at the age of 65 appear to be female.
There also seems to be a significant number of female employees who were terminated at the
age of 30, while a significant number of male employees were terminated at the age of 60.

Question 7 : Analysis on Job Title

Analysis 7-1 : Type of job with the highest number of terminated employees.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “job_title” is involved, 2 graphs with the same
layout are drawn for different values so that the data can be seen more clearly. I also added
the coord_flip() function, which swaps the positions of the x axis and the y axis. The graph
is used to find the type of job with the highest number of terminated employees.

Data visualisation

As seen in figure 2.60, terminated employees are shown in turquoise. Based on the graph, meat
cutters have the highest number of terminations among the job titles, with 354 employees
terminated.

Analysis 7-2 : Type of job with the lowest number of terminated employees.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “job_title” is involved, 2 graphs with the same
layout are drawn for different values so that the data can be seen more clearly. I also added
the coord_flip() function, which swaps the positions of the x axis and the y axis. The graph
is used to find the type of job with the lowest number of terminated employees.

Data visualisation

As seen in figure 2.62, terminated employees are shown in turquoise. Based on the graph, the
job titles with the lowest number of terminations are CEO, Chief Information Officer, Exec
Assistant in Finance, Exec Assistant in Human Resources, Exec Assistant in Legal Counsel,
Exec Assistant in VP Stores, Legal Counsel, VP of Finance, VP of Human Resources and VP of
Stores, since no employees with these job titles were terminated.
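
Job titles with no terminations at all are easy to miss if the data is filtered to terminated records first, because those titles then disappear from the counts. The sketch below keeps them by counting on the full data with a two-way table. The column `job_title` is named in the report, while `status` and the toy values are assumptions.

```r
# Hypothetical sample; `status` and all values are assumptions for illustration.
employees <- data.frame(
  job_title = c("Meat Cutter", "Meat Cutter", "Cashier", "CEO", "Legal Counsel", "Cashier"),
  status    = c("TERMINATED", "TERMINATED", "TERMINATED", "ACTIVE", "ACTIVE", "ACTIVE")
)

# Counting on the full data keeps job titles that have zero terminations.
counts <- table(employees$job_title, employees$status)
terminated <- counts[, "TERMINATED"]

# Job title with the most terminations, and every title tied for the fewest.
highest_title <- names(terminated)[which.max(terminated)]
lowest_titles <- names(terminated)[terminated == min(terminated)]

print(highest_title)
print(lowest_titles)
```

With this approach, a long tie at zero, like the list of executive job titles above, is returned in full.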

Question 8 : Analysis on Departments

Analysis 8-1 : Employee attrition rate in department.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “department_name” is involved, 2 graphs with the
same layout are drawn for different values so that the data can be seen more clearly. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the employee attrition rate in each department.

Data visualisation

As seen in figure 2.64, the graph shows the relationship between department and status of
employees. The active employees are shown in red while the terminated employees are shown in
turquoise.
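
The graph shows counts per department. If a rate in the strict sense is wanted, one possible way to compute it is the share of terminated records within each department. This is a sketch under assumptions: `department_name` is named in the report, but `status`, the toy values and the choice of this particular definition of attrition rate are not taken from the report.

```r
# Hypothetical sample; `status` and all values are assumptions for illustration.
employees <- data.frame(
  department_name = c("Meats", "Meats", "Meats", "Meats",
                      "Executive", "Executive", "Dairy", "Dairy"),
  status          = c("TERMINATED", "ACTIVE", "ACTIVE", "ACTIVE",
                      "ACTIVE", "ACTIVE", "TERMINATED", "ACTIVE")
)

# Counts per department and status, then row-wise proportions.
counts <- table(employees$department_name, employees$status)
rates  <- prop.table(counts, margin = 1)[, "TERMINATED"]

# Share of terminated records in each department.
print(rates)
```

A department with many employees can have the highest count of terminations without having the highest rate, so the two views can rank departments differently.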

Analysis 8-2 : Department with highest number of terminated employees.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “department_name” is involved, 2 graphs with the
same layout are drawn for different values so that the data can be seen more clearly. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the department with the highest number of terminated employees.

Data visualisation

As seen in figure 2.66, the graph shows the relationship between department and status of
employees. The active employees are shown in red while the terminated employees are shown in
turquoise. The meats department has the highest number of terminated employees, with a total
of 377 employees terminated.

Analysis 8-3 : Department with lowest number of terminated employees.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “department_name” is involved, 2 graphs with the
same layout are drawn for different values so that the data can be seen more clearly. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the department with the lowest number of terminated employees.

Data visualisation

As seen in figure 2.68, the graph shows the relationship between department and status of
employees. The active employees are shown in red while the terminated employees are shown in
turquoise. The executive department has the lowest number of terminated employees, as no
employees were ever terminated from it.

Question 9 : Analysis on length of service

Analysis 9-1 : Relationship between length of service and job titles.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “length_of_service” is involved, 2 graphs with the
same layout are drawn for different values so that the data can be seen more clearly. The
graph is used to find the relationship between length of service and job titles.

Data visualisation

As seen in figure 2.70, the graph shows the relationship between the length of service and
the job titles of the employees. The number of active employees working as shelf stockers
seems to decrease as the length of service increases, and the same seems to hold for
cashiers. For meat cutters, many employees have a length of service between 12 and 24, and
for produce clerks, between 8 and 19. In the terminated graph, many employees are terminated
after a length of service between 1 and 3. Many employees were also terminated at a length
of service of 8 and 13.

Analysis 9-2 : Relationship between cities in Canada and length of service

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph and the x axis variable, and I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
other difference in this plot is that geom_point() was used. It creates a scatter plot, a
type of graph that is typically used to display the relationship between 2 continuous
variables. The graph is used to find the relationship between cities in Canada and length of
service.
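
A scatter plot of this kind could be sketched as follows. The columns `city_name` and `length_of_service` are named in the report, while `status`, the titles and the toy values are assumptions, and this is not the original source code.

```r
library(ggplot2)

# Hypothetical sample; `status` and all values are assumptions for illustration.
employees <- data.frame(
  city_name         = c("Vancouver", "Vancouver", "Victoria", "Victoria", "Blue River"),
  length_of_service = c(3, 13, 8, 20, 5),
  status            = c("ACTIVE", "TERMINATED", "TERMINATED", "ACTIVE", "ACTIVE")
)

# One point per record, coloured by status, with the axes swapped by coord_flip().
service_plot <- ggplot(employees,
                       aes(x = city_name, y = length_of_service, colour = status)) +
  geom_point() +
  coord_flip() +
  labs(title = "Cities in Canada and length of service",
       x = "City", y = "Length of service")

print(service_plot)
```

Because city is a categorical variable, points with the same city and length of service are drawn on top of each other, so the plot shows which combinations occur rather than how often.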

Data visualisation

Figure 2.72 shows the relationship between the length of service and cities in Canada for
both active and terminated values. At a length of service of 8 and 13, almost all the cities
have terminated employees, while very little termination happens from a length of service of
17 onwards.

Analysis 9-3 : Relationship between length of service and status year

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph and the x axis variable, and I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
other difference in this plot is that geom_point() was used. It creates a scatter plot, a
type of graph that is typically used to display the relationship between 2 continuous
variables. The graph is used to find the relationship between length of service and status
year.

Data visualisation

As seen in figure 2.74, the graph shows the relationship between length of service and status
year, with terminated and active values shown. As the years pass from 2006 to 2015, the
length of service increases while termination also increases, especially in 2014.

Analysis 9-4 : Relationship between employees length of service and status.

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “length_of_service” is involved, 2 graphs with the
same layout are drawn for different values so that the data can be seen more clearly. I also
added the coord_flip() function, which swaps the positions of the x axis and the y axis. The
graph is used to find the relationship between employee length of service and status.

Data visualisation

As seen in figure 2.76, the active graph shows that as the length of service increases, the
number of employees first increases and then decreases sharply. The terminated graph shows
that the highest number of terminations occurs at a length of service of 13, with 485
employees being terminated.

Question 10 : Analysis on employee birthday

Analysis 10-1 : Counts of birthdays in month for active employees and terminated employees

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “Bmonth” is involved, 2 graphs with the same layout
are drawn for different values so that the data can be seen more clearly. There is also an
added function, scale_x_continuous(), which sets the scale of a continuous x axis. The limits
are set from 0 to 13 since the months run from 1 to 12. The graph is used to find the counts
of birthdays in each month for both active and terminated employees.
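
The axis limits described above could be set as in the sketch below. The column `Bmonth` is named in the report, while `status`, the `breaks` argument, the titles and the toy values are assumptions added for illustration.

```r
library(ggplot2)

# Hypothetical sample; `status` and all values are assumptions for illustration.
employees <- data.frame(
  Bmonth = c(1, 5, 5, 6, 12, 5, 1, 12),
  status = c("ACTIVE", "ACTIVE", "TERMINATED", "ACTIVE",
             "ACTIVE", "ACTIVE", "TERMINATED", "TERMINATED")
)

month_plot <- ggplot(employees, aes(x = Bmonth, fill = status)) +
  geom_bar() +
  facet_wrap(~status) +
  # Limits of 0 to 13 leave room for the full width of the bars at months 1 and 12.
  # The breaks argument is an assumption: it labels every month from 1 to 12.
  scale_x_continuous(limits = c(0, 13), breaks = 1:12) +
  labs(title = "Counts of birthdays per month",
       x = "Birth month", y = "Number of birthdays")

print(month_plot)
```

Setting the limits to exactly 1 and 12 would cut off half of the first and last bars, which is a likely reason for padding the range by one on each side.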

Data visualisation

As seen in figure 2.78, the graph shows the number of birthdays that occur each month for
active and terminated employees. For active employees, May has the highest number of
birthdays, at 4398, and June has the lowest, at 3361. For terminated employees, May also has
the highest number of birthdays, at 152, and January has the lowest, at 99.

Analysis 10-2 : Counts for the highest amount of birthdays in a day

Source code

This analysis uses the same ggplot() function, with the same layers and inner functions, as
analysis 1-1. The differences are the title of the graph, the x axis variable and the added
facet_wrap() function: since the variable “Bmonth” is involved, 2 graphs with the same layout
are drawn for different values so that the data can be seen more clearly. There is also an
added function, scale_x_continuous(), which sets the scale of a continuous x axis. The limits
are set from 0 to 32 since the days run from 1 to 31. The graph is used to find the counts of
birthdays for days 1 to 31.

Data visualisation

As seen in figure 2.80, the graph shows the number of birthdays that occur on each day of the
month across all employees. The 11th has the highest number, with 1941 birthdays, and the
31st has the lowest, with 836 birthdays.

EXTRA FEATURE USED


1. “Janitor”

CONCLUSION
Overall, based on my analysis, there are more active employees than terminated employees and
more male employees than female employees in the dataset, while the female employees tend to
be older than the male employees. Based on the type and reason of termination, the main type
of termination seems to be voluntary, since many employees retire from working. Vancouver
has the most employees and departments, so it seems to be the main location in the country,
and the head office is located only in Vancouver as well. Based on the business unit, stores
in Vancouver seem to have the highest number of terminations, since they also have the most
employees. Blue River has the fewest employees, so it makes sense that it also has the
lowest number of terminated employees. Based on the departments, Meats seems to be the
dominant department with the most employees, so it makes sense that the job title of meat
cutter has the most employees as well. Moving on to the length of service, it seems that the
longer the employees work, the higher the employee attrition rate, although very little
termination happens from a length of service of 17 onwards. The year with the most
terminations is 2014. Lastly, for the counts of employee birthdays, May has the most
birthdays among the months, and the 11th has the most birthdays among the days of the month.

In conclusion, almost all of my analyses use the same type of plot, but the details of the
layers differ, since every analysis looks at something different. In my opinion, I could
definitely have done better on this analysis if I had started it earlier. Overall, though, I
am somewhat satisfied with my findings and analysis.

REFERENCES

Mishra, S. (2018). RPubs - Final Report on Employee Attrition. rpubs.com.
https://rpubs.com/siddhantmishra/MyFinalProjectReport

Raja, A. M. (2020). Penguins Dataset Overview – iris alternative in R. R-bloggers.com.
https://www.r-bloggers.com/2020/06/penguins-dataset-overview-iris-alternative-in-r/

Anon. (2021). How to Replace Values in a DataFrame in R - Data to Fish. datatofish.com.
https://datatofish.com/replace-values-dataframe-r/
