NAME : ARMILLIA KARENNA
TP NUMBER : TP060327
TABLE OF CONTENTS
CONCLUSION
REFERENCES
The main objective of this assignment is to uncover hidden issues in human resources
management. I will analyse the given dataset to identify hidden problems in the organisation
and provide meaningful insights for decision making.
The dataset is explored using the data exploration, manipulation, transformation, and
visualisation techniques covered in the course. As an additional feature, further concepts
that can improve the retrieval of information are also explored.
The dataset provided for this assignment relates to employees’ job information and
attrition. It contains 18 columns and 49654 rows, and includes the personal details of the
staff, job department, position, location, working status, and reason for termination.
Data Import
Before importing the CSV file, I installed and loaded all the packages I needed. The
packages “ggplot2”, “dplyr”, “plotrix”, “janitor”, “gridExtra” and “RColorBrewer” are
installed using install.packages() and loaded using library(). library() attaches a package
to the current session so that its functions can be used in the code. Moving on to the data
import, since the data is stored in CSV format, read.csv() is used to read the file into a
data frame. The imported dataset is named “employee_attrition”.
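The import step described above can be sketched as follows. This is only an illustration and not the original script: the file path and the three sample columns are placeholders, because the real CSV location is not stated in the report.

```r
# Packages named in the text; find out which ones are not installed yet.
pkgs <- c("ggplot2", "dplyr", "plotrix", "janitor", "gridExtra", "RColorBrewer")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
# install.packages(missing_pkgs)   # run once if anything is missing

# Attach every package that is available in this session.
for (p in setdiff(pkgs, missing_pkgs)) library(p, character.only = TRUE)

# Placeholder CSV (hypothetical rows) so the sketch runs on its own;
# replace csv_path with the path of the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("EmployeeID,city_name,STATUS",
             "1318,Vancouver,ACTIVE",
             "1319,Victoria,TERMINATED"), csv_path)

# read.csv() parses the file and returns a data frame.
employee_attrition <- read.csv(csv_path, stringsAsFactors = FALSE)
```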
Data Cleaning
For data cleaning, I used “janitor” and “dplyr”. clean_names() from “janitor” makes all the
variable names in my data frame consistent in one line of code. Next, remove_empty() removes
any empty columns or rows from the data frame. Lastly, distinct() from “dplyr” removes
duplicate rows from the dataset.
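A minimal sketch of these three cleaning calls is given here, assuming “janitor” and “dplyr” are installed. The toy data frame and its values are hypothetical and only serve to show the effect of each step.

```r
library(janitor)
library(dplyr)

# Toy data (hypothetical) with messy names, an all-NA column and a duplicate row.
raw <- data.frame(
  "Employee ID" = c(1318, 1318, 1319),
  "City Name"   = c("Vancouver", "Vancouver", "Victoria"),
  "Empty Col"   = c(NA, NA, NA),
  check.names = FALSE
)

cleaned <- raw %>%
  clean_names() %>%                             # "Employee ID" becomes "employee_id"
  remove_empty(which = c("rows", "cols")) %>%   # drop rows/columns that are entirely NA
  distinct()                                    # drop exact duplicate rows
```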
The main purpose of this code is to replace the duplicated employee ids in my dataset, as
many rows share the same employee id but contain different information. First, I removed the
employee_id column from my data frame using the subset() function. Then, taking the first
original employee id as the base, I treated employee id 1318 as id 1. Next, I created a new
data frame containing the range from 1318 to x, where x was calculated beforehand. Then I
changed the column name from “employee_id” to “emp_id” and used cbind() to bind the new
column to the existing dataset.
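The steps above could look roughly like the sketch below. How x was calculated is not shown in the report, so deriving it from the number of rows is an assumption, and the toy data is hypothetical.

```r
# Toy data (hypothetical): the same employee_id appears on different records.
df <- data.frame(employee_id = c(1318, 1318, 1320),
                 city_name   = c("Vancouver", "Victoria", "Burnaby"))

df <- subset(df, select = -employee_id)   # remove the old id column
x  <- 1318 + nrow(df) - 1                 # assumed: last id when counting up from 1318

ids <- data.frame(employee_id = 1318:x)   # one consecutive id per row
names(ids)[names(ids) == "employee_id"] <- "emp_id"   # rename the column

df <- cbind(ids, df)                      # bind the new ids to the existing data
```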
Using sum(is.na()), I checked whether any missing values remain in the columns or rows of my
dataset. This is a double check in case I missed any NA values during cleaning. I applied
sum(is.na()) to all of my variables, as seen in figure 1.4.
Some of the results are shown in figure 1.6. They display 0 since there are no null/NA
values.
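As a small illustration of this check, the sketch below counts missing values on a hypothetical toy data frame. colSums(is.na()) is added here as a convenient per-column variant; it is not mentioned in the report.

```r
# Toy data (hypothetical) without missing values.
df <- data.frame(age    = c(25, 40, 61),
                 status = c("ACTIVE", "ACTIVE", "TERMINATED"))

total_missing      <- sum(is.na(df))       # NA count over the whole data frame
missing_per_column <- colSums(is.na(df))   # NA count for each column separately
```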
Data Pre-processing
Using the unique() function to see all the unique values of each variable in the dataset, I
found some issues: two values that are supposed to be the same are treated as different
values because one of them is misspelt, and another value is also misspelt. To fix them, I
used the line of code above to replace values in a single column of my data frame. The
general form of the code is df["Column Name"][df["Column Name"] == "Old Value"] <- "New
Value". With this code I replaced the value “New Westminister” in column “city_name” with
the correct spelling “New Westminster”, because leaving it unfixed would give wrong results
in any later analysis. Next, I replaced the value “Resignaton” in column “termreason_desc”
with the correct spelling “Resignation”.
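The replacement pattern quoted above can be tried on a small example. The data frame here is a hypothetical stand-in that contains only the two misspelt values named in the text.

```r
# Toy data (hypothetical rows) containing the two misspellings from the text.
df <- data.frame(city_name       = c("New Westminister", "Vancouver"),
                 termreason_desc = c("Resignaton", "Retirement"),
                 stringsAsFactors = FALSE)

# df["Column Name"][df["Column Name"] == "Old Value"] <- "New Value"
df["city_name"][df["city_name"] == "New Westminister"] <- "New Westminster"
df["termreason_desc"][df["termreason_desc"] == "Resignaton"] <- "Resignation"

unique(df$city_name)   # check that only the corrected spelling remains
```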
For this part, I converted all the date columns in my data frame to the Date class, as they
were previously stored as character vectors. In the line of code, I select the column from
my data frame using the “$” symbol and assign to it the result of the as.Date() function,
which takes the same data frame column and converts its characters to dates. I specified the
format as month, followed by day and year. I did this for “recorddate_key”, “birthday_key”,
“orighiredate_key” and “terminationdate_key”.
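A short sketch of this conversion follows. The exact format string is an assumption: "%m/%d/%Y" matches the “month, day, year” order described, but the separator used in the real file is not stated, and the sample dates are hypothetical.

```r
# Toy data (hypothetical dates) stored as character vectors.
df <- data.frame(orighiredate_key    = c("8/28/1989", "1/3/2000"),
                 terminationdate_key = c("1/1/1900", "12/30/2014"),
                 stringsAsFactors = FALSE)

# Assumed input format: month/day/four-digit year.
df$orighiredate_key    <- as.Date(df$orighiredate_key, format = "%m/%d/%Y")
df$terminationdate_key <- as.Date(df$terminationdate_key, format = "%m/%d/%Y")

class(df$orighiredate_key)   # the column is now of class "Date"
```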
In this section, I created some new columns from the dates in my data frame to be used
later in the analysis. First, I formatted the dates so that only the day, month or year is
returned. The line of code consists of the format() function, followed by the data frame and
column name joined by the “$” symbol, followed by the part I need (day, month or year),
written with the “%” symbol. To store the result as a new numeric column, I wrote the data
frame and new column name, then assigned to it as.numeric() wrapped around the format()
call. In this section I made the new columns “Bmonth” and “Bday”, extracted from column
“birthday_key”. Next I made the new columns “hire_year” and “terminated”, extracted from
columns “orighiredate_key” and “terminationdate_key” respectively.
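The date-part extraction can be sketched as below, using the column names from the text on a hypothetical two-row data frame.

```r
# Toy data (hypothetical dates) already stored as Date columns.
df <- data.frame(
  birthday_key        = as.Date(c("1954-01-03", "1988-11-25")),
  orighiredate_key    = as.Date(c("1989-08-28", "2000-01-03")),
  terminationdate_key = as.Date(c("1900-01-01", "2014-12-30"))
)

# format() returns the requested part as text; as.numeric() converts it to a number.
df$Bmonth     <- as.numeric(format(df$birthday_key, "%m"))
df$Bday       <- as.numeric(format(df$birthday_key, "%d"))
df$hire_year  <- as.numeric(format(df$orighiredate_key, "%Y"))
df$terminated <- as.numeric(format(df$terminationdate_key, "%Y"))
```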
Data Exploration
Starting with names(), this function returns and displays the variable names found in my
data frame. The code simply calls names() with my data frame name in the brackets. The
result of using the function is shown in figure 1.12.
Moving on, the dim() function tells me how many rows and columns my data frame has. Calling
dim() with my data frame in the brackets outputs the number of rows followed by the number
of columns. The result is shown below as figure 1.13.
Furthermore, the str() function compactly displays the internal structure of my dataset. It
shows the number of observations and variables in the dataset, followed by some information
about each variable, such as its name and class. The result is seen below in figure 1.14.
Additionally, the summary() function displays a summary of each column, such as the mean,
median, maximum and minimum for every numeric column, and the length, class and mode for
every character column. Figure 1.15 below is the result of the function.
Moving on to the View() function, this opens a spreadsheet-style viewer that contains all
my data so I can view it in a clean and neat way. Figure 1.16 below shows some results.
Lastly, the unique() function returns all the distinct values found in the column of the
data frame specified in the line of code. The result for this function is shown below in
figure 1.17.
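The exploration functions described in this section can be tried together on a small data frame. The data below is hypothetical; View() is left commented out because it needs an interactive session.

```r
# Toy data (hypothetical) standing in for the employee data frame.
df <- data.frame(age       = c(25, 40, 61),
                 city_name = c("Vancouver", "Victoria", "Vancouver"),
                 status    = c("ACTIVE", "ACTIVE", "TERMINATED"),
                 stringsAsFactors = FALSE)

names(df)              # variable names
dim(df)                # number of rows, then number of columns
str(df)                # compact structure: observations, variables, classes
summary(df)            # per-column summary statistics
# View(df)             # spreadsheet-style viewer (interactive sessions only)
unique(df$city_name)   # distinct values of one column
```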
Analysis 1-1 : Relationship between terminated and active employees over the years.
Source code
In this analysis, the graph is plotted using the ggplot() function with 4 layers. The first
layer sets the x axis to “status_year”, with “fill” for the inside colour and “colour” for
the outline colour both mapped to “status”, so that the values “terminated” and “active”
appear in 2 different colours. Layer 2 is the stat_count() function, which counts the number
of records for each x-axis value. Inside it, “width” sets the width of the bars and
“position”, set to “identity”, controls how the values are placed within each bar. Layer 3
is the geom_text() function, which adds text to the plot with the given conditions and
aesthetics: position is set to “identity”, the count statistic counts the records at each
x-axis position, and “aes” with label = count displays the total number for each x value at
the top of the bar. “vjust” adjusts the label vertically, “size” sets the label font size
and “colour” sets the label font to black. Lastly, layer 4 is the labs() function, where
“title” sets the graph title and “x” and “y” set the labels of the x axis and y axis.
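The four layers described above might be assembled roughly as in the sketch below, which assumes ggplot2 3.3.0 or later for after_stat(). It is not the original source code: the toy data, the title and the axis labels are hypothetical, and position = "stack" is used instead of the “identity” named in the text so that the bar heights add up to the labelled totals.

```r
library(ggplot2)

# Toy data (hypothetical), one row per employee record.
df <- data.frame(
  status_year = c(2006, 2006, 2006, 2007, 2007),
  status      = c("ACTIVE", "ACTIVE", "TERMINATED", "ACTIVE", "TERMINATED")
)

p <- ggplot(df, aes(x = status_year, fill = status, colour = status)) +  # layer 1
  stat_count(width = 0.5, position = "stack") +                          # layer 2: bars of counts
  geom_text(aes(x = status_year, label = after_stat(count)),             # layer 3: total per year
            stat = "count", position = "identity", inherit.aes = FALSE,
            vjust = -0.5, size = 3, colour = "black") +
  labs(title = "Terminated and active employees over the years",        # layer 4: labels
       x = "Status year", y = "Number of employees")

# print(p)   # draw the plot in an interactive session
```

Setting inherit.aes = FALSE on the text layer keeps the labels from being split by status, so each label is the total count for its year.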
Data visualisation
This graph shows the number of employees on the y axis and the years on the x axis. Each
bar uses 2 colours to separate the 2 values of the status column, “terminated” and “active”.
The bars keep increasing from 2006 to 2013 and decrease from 2013 to 2015. The total number
of terminated and active employees is shown above each stacked bar.
Analysis 1-2 : Status year that has the most active employees.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the only
difference is the title of the graph. The graph is used to find which status year has the
most active employees.
Data visualisation
As seen in figure 2.4, 2013 is the year with the most active employees, shown in red in the
stacked bar, with 5215 employees active that year.
Analysis 1-3 : Status year that has the most terminated employees.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the only
difference is the title of the graph. The graph is used to find which status year has the
most terminated employees.
Data visualisation
As seen in figure 2.6, 2014 is the year with the most terminated employees, shown in
turquoise in the stacked bar, with 253 employees terminated that year.
Analysis 1-4 : Employee attrition rate between male and female employees.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and that fill and colour are mapped to “gender_full”,
so that the values “female” and “male” appear in 2 different colours. I added the
facet_wrap() function, which splits the plot into panels, one for each unique value of a
variable with at least 2 unique values, so that 2 graphs of the same kind with different
values can be compared. I also added the coord_flip() function, which swaps the x axis and
y axis. The graph is used to find the employee attrition rate between male and female
employees over the years. Because the numbers are too big to be displayed on the bars
without clashing with each other, I used the table() function on the “status_year”,
“gender_full” and “status” variables to show the counts in place of the labels on the graph.
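A rough sketch of the faceted, flipped plot and the accompanying table() call is shown below. The toy rows, title and axis labels are hypothetical, and the layers are an approximation of what the text describes and not the original code.

```r
library(ggplot2)

# Toy data (hypothetical), one row per employee record.
df <- data.frame(
  status_year = c(2013, 2013, 2013, 2014, 2014, 2014),
  gender_full = c("Female", "Male", "Male", "Female", "Female", "Male"),
  status      = c("ACTIVE", "ACTIVE", "ACTIVE", "TERMINATED", "ACTIVE", "TERMINATED")
)

p <- ggplot(df, aes(x = status_year, fill = gender_full, colour = gender_full)) +
  stat_count(width = 0.5) +
  facet_wrap(~ status) +   # one panel per status value
  coord_flip() +           # swap the x and y axes
  labs(title = "Attrition by gender over the years",
       x = "Status year", y = "Number of employees")

# Counts by year, gender and status, used instead of labels on the bars.
counts <- table(df$status_year, df$gender_full, df$status)
```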
Data visualisation
In figure 2.8 we can see how many male and female employees were active and terminated
over the years, where male is in turquoise and female is in red. Two graphs are displayed
to show terminated and active employees separately for both genders over the years.
In figure 2.9 we can see that 2013 had the most active employees for both females and
males, and that 2014 had the most terminated employees for both females and males.
Analysis 1-5 : Year in which the most employees were hired.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. The graph is used to find
the year in which the most employees were hired.
Data visualisation
As seen in figure 2.11, 2000 is the year in which the most employees were hired, with 2733
employees hired that year.
Analysis 1-6 : Year in which the most employees were terminated.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. The graph is used to find
the year in which the most employees were terminated.
Data visualisation
As seen in figure 2.13, 1900 is the year with the most terminations recorded, with a very
high count of 42450. It is also very noticeable that there is a big empty gap in the graph
with no terminations for many years, and terminations only appear again from 2006. The 1900
value most likely acts as a placeholder termination date for employees who have not been
terminated rather than a real termination year, which would also explain the gap. Since the
numbers cannot be read clearly on the graph, the table() function is used, as seen in
figure 2.14, to show the counts instead, and it also shows that 1900 has the highest count.
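The table() call mentioned above can be sketched as follows. The cross-tabulation with status is an addition that is not in the report; it is one possible way to check whether the 1900 rows are real terminations. The toy rows are hypothetical.

```r
# Toy data (hypothetical): two active records dated 1900 and one real termination.
df <- data.frame(
  terminationdate_key = as.Date(c("1900-01-01", "1900-01-01", "2014-12-30")),
  status              = c("ACTIVE", "ACTIVE", "TERMINATED")
)
df$terminated <- as.numeric(format(df$terminationdate_key, "%Y"))

by_year        <- table(df$terminated)              # records per termination year
by_year_status <- table(df$terminated, df$status)   # added check: split by status
```

If the 1900 row of the second table contains only active employees, that year is better treated as a missing value than as a termination year.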
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “status_year” is involved, 3 graphs of the same kind appear
for the different values so that the data can be seen more clearly. The graph is used to
find the employee attrition rate by type of termination.
Data visualisation
As seen in figure 2.16, the variable “termtype_desc” has 3 values, “involuntary” in red,
“not applicable” in green and “voluntary” in blue, which appear on the x axis, with the
number of employees on the y axis. Moreover, the graph suggests that no employees were
terminated involuntarily from 2006 to 2013.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. The graph is used to find
the type of termination with the highest number.
Data visualisation
As seen in figure 2.18, between “voluntary” and “involuntary”, voluntary clearly has the
highest number, with 1270 employees terminated voluntarily. The value “not applicable” is
not counted in this analysis since it means that those employees are still active.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. The graph is used to find
the type of termination with the lowest number.
Data visualisation
As seen in figure 2.18, between “voluntary” and “involuntary”, involuntary clearly has the
lowest number, with 215 employees terminated involuntarily. The value “not applicable” is
not counted in this analysis since it means that those employees are still active.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “age” is involved, 3 graphs of the same kind appear for the
different values so that the data can be seen more clearly. The coord_flip() function is
also added to swap the x axis and y axis. The graph is used to find the relationship
between type of termination and age.
Data visualisation
As seen in figure 2.22, many employees aged 20 to 30 chose to leave voluntarily, possibly
because they found a better job opportunity or are exploring, which is common around those
ages. Employees aged 60 and above also left voluntarily, most likely because they retired.
As for involuntary terminations, age 64 has the highest number, with 14 employees who had to
leave involuntarily; my best guess is that they were supposed to retire but chose not to
and still wanted to work.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “status_year” is involved, 4 graphs of the same kind appear
for the different values so that the data can be seen more clearly. The graph is used to
find the employee attrition rate by reason of termination.
Data visualisation
As seen in figure 2.24, no terminations with layoff as the reason happened from 2006 to
2013. For resignation, 2012 has the most resignations, with 76 employees resigning. For
retirement, 2008 has the most retirements, with 138 employees retiring.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable.
Data visualisation
As seen in figure 2.26, retirement has the highest number among the reasons for
termination, with 885 employees retiring. The value “not applicable” is not counted in this
analysis since it means that those employees are still active. The graph is used to find
the reason for termination with the highest number.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. The graph is used to find
the reason for termination with the lowest number.
Data visualisation
As seen in figure 2.28, layoff has the lowest number among the reasons for termination,
with 215 employees being laid off. The value “not applicable” is not counted in this
analysis since it means that those employees are still active. The graph is used to find
the reason for termination with the lowest number.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “age” is involved, 4 graphs of the same kind appear for the
different values so that the data can be seen more clearly. The graph is used to find the
relationship between reason of termination and age.
Data visualisation
As seen in figure 2.30, employees of almost all ages from 20 to 64 were laid off, with the
most, 18 employees, laid off at the age of 64. Resignation is similar to layoff: employees
aged 19 to 64 resigned, with the most, 75 employees, resigning at the age of 30. Finally,
only employees aged 60 and above retired, with 591 employees retiring at the age of 65.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 2 graphs of the same kind appear for
the different values so that the data can be seen more clearly. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the
employee attrition rate in cities of Canada.
Data visualisation
Figure 2.32 shows 2 plots, one for terminated employees and one for active employees. The
x-axis variable now appears on the y axis because coord_flip() was used, so that the city
names can be read more clearly than on the x axis. Red represents active employees while
blue represents terminated employees.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 2 graphs of the same kind appear for
the different values so that the data can be seen more clearly. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the
city with the highest number of employee attrition.
Data visualisation
As seen in figure 2.34, Vancouver is the city with the highest employee attrition. It has
the most terminated employees, shown in turquoise, with 296 employees terminated in that
city.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 2 graphs of the same kind appear for
the different values so that the data can be seen more clearly. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the
city with the lowest number of employee attrition.
Data visualisation
As seen in figure 2.36, Blue River is the city with the lowest employee attrition. It has
the fewest terminated employees, shown in turquoise, with only 1 employee terminated in
that city.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. I also added the
coord_flip() function, which swaps the x axis and y axis. Moreover, the theme() function is
added to this plot to change the angle of the x-axis labels from horizontal to vertical,
using an angle of 90 degrees. The graph is used to find the relationship between cities and
departments.
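The label rotation described above is usually done with theme() and element_text(). The sketch below is an illustration only: the column name “department_name”, the toy rows and the labels are assumptions, and coord_flip() is left out to keep the example short.

```r
library(ggplot2)

# Toy data (hypothetical); "department_name" is an assumed column name.
df <- data.frame(
  city_name       = c("Vancouver", "Vancouver", "Victoria", "Burnaby"),
  department_name = c("Produce", "Meats", "Produce", "Dairy")
)

p <- ggplot(df, aes(x = city_name, fill = department_name, colour = department_name)) +
  stat_count(width = 0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +   # vertical x-axis labels
  labs(title = "Departments by city", x = "City", y = "Number of employees")
```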
Data visualisation
As seen in figure 2.38, Vancouver has the most departments, followed by Victoria and
Burnaby. The produce department seems to be the most dominant, as it appears in almost
every city with high numbers. Moreover, the dairy department dominates the city of Burnaby
by a staggering amount. The meat department is also very present in the cities of Kamloops,
Kelowna, Victoria and Vancouver.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. In addition, instead of
“identity” for the position I used “dodge”, which places the bars side by side and gives a
neater look. The graph is used to find the employee attrition rate by business unit.
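The effect of position = "dodge" can be seen in a small sketch like the one below. The column name “business_unit”, the toy rows and the labels are assumptions made for illustration.

```r
library(ggplot2)

# Toy data (hypothetical); "business_unit" is an assumed column name.
df <- data.frame(
  status        = c("ACTIVE", "ACTIVE", "ACTIVE", "TERMINATED", "TERMINATED"),
  business_unit = c("HEADOFFICE", "STORES", "STORES", "HEADOFFICE", "STORES")
)

p <- ggplot(df, aes(x = status, fill = business_unit, colour = business_unit)) +
  stat_count(width = 0.5, position = "dodge") +   # bars side by side, not overlapping
  labs(title = "Employee status by business unit",
       x = "Status", y = "Number of employees")
```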
Data visualisation
Figure 2.40 shows the relationship between business unit and employee status. The head
office is shown in red while the stores are shown in turquoise. For active employees, there
are 516 in the head office and 47652 in stores. For terminated employees, there are 69 in
the head office and 1416 in stores.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 2 graphs of the same kind appear for
the different values so that the data can be seen more clearly. Moreover, the theme()
function is added to this plot to change the angle of the x-axis labels from horizontal to
vertical, using an angle of 90 degrees. The graph is used to find the relationship between
cities in Canada and business units.
Data visualisation
In figure 2.42, which shows the relationship between city and business unit, the head
office appears in only one city, Vancouver, with 585 head office records there.
Furthermore, Vancouver also has the most store records, with 10626, while Blue River has
the fewest, with only 9.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 3 graphs of the same kind appear for
the different values so that the data can be seen more clearly. The graph is used to find
the relationship between type of termination and business units.
Data visualisation
As seen in figure 2.44, the graph shows 3 plots, one for each of the values involuntary,
not applicable and voluntary, with involuntary in red, not applicable in green and
voluntary in blue. The graph also suggests that there are no involuntary terminations for
the head office.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 3 graphs of the same kind appear for
the different values so that the data can be seen more clearly. The graph is used to find
the type of termination with the highest number for each business unit.
Data visualisation
As seen in figure 2.46, the graph shows 3 plots, one for each of the values involuntary,
not applicable and voluntary, with involuntary in red, not applicable in green and
voluntary in blue. Voluntary is the type of termination with the highest number across the
business units, with 69 employees terminated in the head office and 1201 in stores, which
add up to 1270 employees who were terminated voluntarily.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 4 graphs of the same kind appear for
the different values so that the data can be seen more clearly. The graph is used to find
the relationship between reason of termination and business units.
Data visualisation
As seen in figure 2.48, the graph shows 4 plots, one for each of the values layoff, not
applicable, resignation and retirement, with layoff in red, not applicable in green,
resignation in blue and retirement in purple. The graph also suggests that there are no
layoff terminations for the head office.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph, the x-axis variable and the added facet_wrap()
function. Since the variable “city_name” is involved, 4 graphs of the same kind appear for
the different values so that the data can be seen more clearly. The graph is used to find
the reason of termination with the highest number in business units.
Data visualisation
As seen in figure 2.50, the graph shows 4 plots, one for each of the values layoff, not
applicable, resignation and retirement, with layoff in red, not applicable in green,
resignation in blue and retirement in purple. Retirement is the reason of termination with
the highest number across the business units, with 68 employees terminated in the head
office and 817 in stores, which add up to 885 employees who were terminated through
retirement.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the
relationship between terminated employees and their age.
Data visualisation
As seen in figure 2.52, 2 colours represent the status of employees, with active employees
in red and terminated employees in turquoise. Since the coord_flip() function is used, the
x axis has switched places with the y axis, which makes the graph look a bit better and the
data easier to see.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the age
with the highest number of terminated employees.
Data visualisation
As seen in figure 2.54, looking at the turquoise colour that represents terminated
employees, age 65 has the highest number of terminated employees, with 591 of them being
terminated.
Source code
This analysis uses the same ggplot() layers and inner functions as analysis 1-1; the
differences are the title of the graph and the x-axis variable. I also added the
coord_flip() function, which swaps the x axis and y axis. The graph is used to find the age
with the lowest number of terminated employees.
Data visualisation
As seen in figure 2.56, the turquoise bars, which represent terminated employees, show that
employees aged 63 have the lowest number of terminations, with only 2 of them being
terminated.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “age” is plotted for both statuses, facet_wrap()
draws two graphs of the same kind, one for each status value, so that the data can be seen
more clearly. Moreover, the theme() function is added in this plot to change the angle of the
x axis labels: with an angle of 90 degrees, the labels switch from horizontal to vertical. The
graph is used to find the relationship between employees’ age and gender.
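As a rough sketch of how facet_wrap() and theme() can be combined in this way, the example below splits a bar chart into one panel per status and rotates the x axis labels by 90 degrees. The data frame, the column names `age`, `gender` and `status`, and the choice to fill the bars by gender are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical sample; the real dataset and its column names may differ.
employees <- data.frame(
  age    = c(30, 30, 45, 60, 65, 65),
  gender = c("Female", "Male", "Male", "Male", "Female", "Female"),
  status = c("ACTIVE", "ACTIVE", "ACTIVE",
             "TERMINATED", "TERMINATED", "TERMINATED")
)

age_gender_plot <- ggplot(employees, aes(x = factor(age), fill = gender)) +
  geom_bar() +
  facet_wrap(~ status) +  # one panel for each value of status
  labs(title = "Employee age and gender by status",
       x = "Age", y = "Number of employees") +
  # Rotate the x axis labels so that they are written vertically.
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

print(age_gender_plot)
```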
Data visualisation
As seen in figure 2.58, there are two graphs, one for each value of the variable status
(“active” and “terminated”), and both show the gender of the employees. The graphs show
that there are far more male employees than female employees. Furthermore, female employees
are most visible at the age of 50 years and above, meaning that most of the female employees
are older. In the terminated graph, all 591 employees who were terminated at the age of 65
years were female. There also seems to be a significant number of female employees who were
terminated at the age of 30 years, while a significant number of male employees were
terminated at the age of 60 years.
Analysis 7-1 : Type of job with the highest number of terminated employees.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “job_title” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. I also added the coord_flip() function, which swaps the x axis and
the y axis. The graph is used to find the type of job with the highest number of terminated
employees.
Data visualisation
As seen in figure 2.60, terminated employees are shown in turquoise. Based on the graph, the
job title with the highest number of terminations is meat cutter, with 354 employees who
were terminated.
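A value read off a graph like this can be cross-checked with a short “dplyr” summary. The sketch below is only an illustration under assumed names: the data frame `employees` and the columns `job_title` and `status` are hypothetical stand-ins for the cleaned dataset, and the counts come from the sample rows, not from the real data.

```r
library(dplyr)

# Hypothetical sample rows; not the real employee data.
employees <- data.frame(
  job_title = c("Meat Cutter", "Meat Cutter", "Cashier",
                "Cashier", "Cashier", "CEO"),
  status    = c("TERMINATED", "TERMINATED", "TERMINATED",
                "ACTIVE", "ACTIVE", "ACTIVE")
)

# Count terminated employees per job title, largest count first.
terminated_by_job <- employees %>%
  filter(status == "TERMINATED") %>%
  count(job_title, name = "terminated", sort = TRUE)

# The first row is the job title with the most terminations.
top_job <- slice_head(terminated_by_job, n = 1)
print(top_job)
```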
Analysis 7-2 : Type of job with the lowest number of terminated employees.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “job_title” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. I also added the coord_flip() function, which swaps the x axis and
the y axis. The graph is used to find the type of job with the lowest number of terminated
employees.
Data visualisation
As seen in figure 2.62, terminated employees are shown in turquoise. Based on the graph, the
job titles with the lowest number of terminations are CEO, Chief Information Officer, Exec
Assistant in Finance, Exec Assistant in Human Resources, Exec Assistant in Legal Counsel,
Exec Assistant in VP Stores, Legal Counsel, VP of Finance, VP of Human Resources and VP of
Stores, since no employees with these job titles were terminated.
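Job titles with no terminations at all disappear if the data is first filtered down to terminated rows, so one way to list them is to count terminations per job title without filtering. The following is a sketch under assumed names (`employees`, `job_title`, `status`) with made-up sample rows.

```r
library(dplyr)

# Hypothetical sample rows; not the real employee data.
employees <- data.frame(
  job_title = c("Meat Cutter", "Meat Cutter", "Cashier", "CEO", "Legal Counsel"),
  status    = c("TERMINATED", "ACTIVE", "TERMINATED", "ACTIVE", "ACTIVE")
)

# Summing a logical vector counts its TRUE values, so job titles
# without any terminated employee keep a row with a count of 0.
terminations <- employees %>%
  group_by(job_title) %>%
  summarise(terminated = sum(status == "TERMINATED"), .groups = "drop")

never_terminated <- filter(terminations, terminated == 0)
print(never_terminated)
```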
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “department_name” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. I also added the coord_flip() function, which swaps the x axis and
the y axis. The graph is used to find the employee attrition by department.
Data visualisation
As seen in figure 2.64, the graph shows the relationship between department and status of
employees. The active employees are shown in red, while the terminated employees are shown
in turquoise.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “department_name” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. I also added the coord_flip() function, which swaps the x axis and
the y axis. The graph is used to find the department with the highest number of terminated
employees.
Data visualisation
As seen in figure 2.66, the graph shows the relationship between department and status of
employees. The active employees are shown in red, while the terminated employees are shown
in turquoise. Moreover, the meats department has the highest number of terminations, with a
total of 377 employees terminated.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “department_name” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. I also added the coord_flip() function, which swaps the x axis and
the y axis. The graph is used to find the department with the lowest number of terminated
employees.
Data visualisation
As seen in figure 2.68, the graph shows the relationship between department and status of
employees. The active employees are shown in red, while the terminated employees are shown
in turquoise. Moreover, the executive department has the lowest number of terminations, as
no employees were ever terminated from the executive department.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “length_of_service” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. The graph is used to find the relationship between length of
service and job title.
Data visualisation
As seen in figure 2.70, the graph shows the relationship between the length of service and
the job titles of the employees. In the active graph, the number of active employees working
as shelf stockers seems to decrease as the length of service increases, and the same holds
for cashiers. For meat cutters, many employees have a length of service between 12 and 24,
and for produce clerks, many employees have a length of service between 8 and 19. In the
terminated graph, many employees are terminated after a length of service between 1 and 3.
Lastly, many employees were also terminated at a length of service of 8 and 13.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph and the x axis variable, and I also
added the coord_flip() function, which swaps the x axis and the y axis. The other difference
in this plot is that geom_point() was used, which creates a scatter plot that displays the
relationship between two variables, typically continuous ones. The graph is used to find the
relationship between length of service and city.
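A minimal sketch of a geom_point() plot with flipped axes follows. The data frame and the column names `length_of_service`, `city_name` and `status` are assumptions for illustration, and the sample rows are made up.

```r
library(ggplot2)

# Hypothetical sample rows; the real column names may differ.
employees <- data.frame(
  length_of_service = c(1, 3, 8, 8, 13, 13, 17),
  city_name = c("Vancouver", "Victoria", "Vancouver", "Kelowna",
                "Victoria", "Vancouver", "Kelowna"),
  status = c("TERMINATED", "TERMINATED", "TERMINATED", "ACTIVE",
             "TERMINATED", "ACTIVE", "ACTIVE")
)

# One point per employee, coloured by status.
service_city_plot <- ggplot(
  employees,
  aes(x = city_name, y = length_of_service, colour = status)
) +
  geom_point() +
  labs(title = "Length of service by city",
       x = "City", y = "Length of service") +
  coord_flip()  # cities are listed down the side after the flip

print(service_city_plot)
```

Points that share the same city, length of service and status are drawn on top of each other, so a plot like this shows where terminations occur rather than how many there are.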
Data visualisation
Figure 2.72 shows the relationship between the length of service and the cities in Canada
for both active and terminated employees. As seen in the graph, almost all the cities have
terminated employees at a length of service of 8 and 13, while very little termination
happens at a length of service of 17 and onwards.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph and the x axis variable, and I also
added the coord_flip() function, which swaps the x axis and the y axis. The other difference
in this plot is that geom_point() was used, which creates a scatter plot that displays the
relationship between two continuous variables. The graph is used to find the relationship
between length of service and status year.
Data visualisation
As seen in figure 2.74, the graph shows the relationship between length of service and
status year for both terminated and active employees. As the years pass from 2006 to 2015,
the length of service increases and termination also increases, especially in 2014.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “length_of_service” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. I also added the coord_flip() function, which swaps the x axis and
the y axis. The graph is used to find the relationship between employee length of service
and status.
Data visualisation
As seen in figure 2.76, the active graph shows that as the length of service increases, the
number of employees first increases and then decreases sharply. The terminated graph shows
that the highest number of terminations occurs at a length of service of 13, with 485
employees being terminated.
Analysis 10-1 : Counts of birthdays in month for active employees and terminated employees
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “Bmonth” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. Moreover, the added function scale_x_continuous() controls the
scale of a continuous x axis. The limits are set from 0 to 13 since the months are between 1
and 12. The graph is used to find the counts of birthdays in each month for both active and
terminated employees.
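The sketch below illustrates how scale_x_continuous() can set such limits. The sample data, the `status` column and the `breaks` argument are assumptions for illustration; only the variable name “Bmonth” and the 0 to 13 limits come from the description above.

```r
library(ggplot2)

# Hypothetical sample; "Bmonth" holds the birth month as a number (1-12).
employees <- data.frame(
  Bmonth = c(1, 2, 5, 5, 11, 12, 1, 5, 6),
  status = c(rep("ACTIVE", 6), rep("TERMINATED", 3))
)

month_plot <- ggplot(employees, aes(x = Bmonth, fill = status)) +
  geom_bar() +
  facet_wrap(~ status) +  # one panel per status
  # Limits of 0 to 13 leave room for the full width of the bars at 1 and 12;
  # breaks = 1:12 labels every month on the axis.
  scale_x_continuous(limits = c(0, 13), breaks = 1:12) +
  labs(title = "Birthdays per month by status",
       x = "Birth month", y = "Number of birthdays")

print(month_plot)
```

Setting the limits slightly wider than the data matters for bar charts, because bars that extend beyond the limits of a continuous scale are dropped from the plot.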
Data visualisation
As seen in figure 2.78, the graph shows the number of birthdays that occur in each month for
active and terminated employees. For active employees, May has the highest number of
birthdays, with 4398, and for terminated employees, May also has the highest number, with
152. For active employees, June has the lowest number of birthdays, with 3361, and for
terminated employees, January has the lowest number, with 99.
Source code
This analysis uses the same “ggplot” function, with the same layers and inner functions, as
analysis 1-1; the differences are the title of the graph, the x axis variable and the added
facet_wrap() function. Since the variable “Bmonth” is plotted for both statuses,
facet_wrap() draws two graphs of the same kind, one for each status value, so that the data
can be seen more clearly. Moreover, the added function scale_x_continuous() controls the
scale of a continuous x axis. The limits are set from 0 to 32 since the days are between 1
and 31. The graph is used to find the counts of birthdays for days 1-31.
Data visualisation
As seen in figure 2.80, the graph shows the number of birthdays that occur on each day of
the month across all employees. The 11th has the highest number of birthdays, with 1941,
and the 31st has the lowest number of birthdays, with 836.
CONCLUSION
Overall, based on my analysis, there are more active employees than terminated employees
and more male employees than female employees in the dataset, while the female employees
tend to be much older than the male employees. Based on the type and reason of termination,
the main type of termination is voluntary, since many of the employees retire from working.
Moreover, Vancouver has the most employees and departments, so it appears to be the
organisation's main location, as the head office is located only in Vancouver as well. Based
on the business unit, stores in Vancouver seem to have the most terminations, since they
have the most employees as well. Blue River has the fewest employees, so it makes sense for
it to have the fewest terminated employees as well. Additionally, based on the departments,
Meats is the dominant department with the largest number of employees, so it makes sense
for the meat cutter job title to have the most employees as well. Moving on to the length
of service, it seems that the longer the employees work, the higher the employee attrition
rate. The year with the most terminations is 2014, and very little termination happens from
a length of service of 17 onwards. Lastly, for the counts of employee birthdays, May has
the most birthdays of any month, and the 11th has the most birthdays of any day of the
month.
In conclusion, almost all of my analyses use the same type of plot, but the details of the
layers differ because every analysis looks at something different. In my opinion, I could
definitely have done better on this analysis if I had started it earlier. But overall, I am
somewhat satisfied with my findings and analysis.
REFERENCES