
ANALYSIS OF HUMAN RESOURCES DATA

USING R, HADOOP (HIVE) AND TABLEAU

Apoorva Sharma

2013IPG-026
ABV-IIITM Gwalior
Gwalior-474 010, MP, India

August 24, 2017

COMPANY PROFILE

- Research Centre Imarat (RCI) is regarded as the avionics hub
  of DRDO. It is one of the three DRDO laboratories of the
  Missile Complex and houses various work centres as well as
  integration and testing facilities. RCI is a leading
  laboratory of the Defence Research and Development
  Organisation (DRDO) and is responsible for the development
  of missile systems.
- DRDO is a network of more than 50 laboratories engaged in
  developing defence technologies across various disciplines,
  such as aeronautics, armaments, electronics, combat vehicles,
  engineering systems, special materials, naval systems, life
  sciences, training, information systems and agriculture.
  Presently, the organization is backed by over 5,000
  scientists and about 25,000 other technical and supporting
  personnel.

PROBLEM STATEMENT
- To analyze the Human Resources (HR) data of the organization
  and identify trends over the years with respect to gender,
  age group, department, payroll, designation, leaves taken,
  overtime claimed and attendance, in order to quantify
  parameters such as the organizational loyalty of employees.
- The goal is to provide the organization with insights for
  managing employees effectively, so that its goals can be
  reached quickly and efficiently.
- To enable the organization to keep a check on employees
  taking undue advantage of the incentives provided to
  government employees.
- The challenge is to identify what data should be captured
  and how to use the data to model and predict capabilities,
  so that the organization gets an optimal return on
  investment on its human capital.

OBJECTIVES

- Finding the employees claiming the most and the fewest
  leaves, overtime hours and business trips, in order to
  distinguish the employees committed to their duties from
  those taking undue advantage of the incentives provided to
  employees of a central government organization.
- Finding the employees with the lowest and the highest average
  attendance over the four-year period, taking into account
  their leaves and business trips during that period.
- Analyzing the data by demographics such as age and gender.

OBJECTIVES

- Analyzing the data by grouping it according to designation,
  directorate and payroll, and performing a comparative study
  on all four types of data (attendance, leaves, overtime
  claims and business trips).
- Analyzing the data as a time series and observing trends or
  patterns over the years.
- Predicting the future behavior of employees using time series
  analysis and forecasting.

LITERATURE REVIEW

- Roberts (2015) says that the ability to combine business
  outcome data with HR data when making predictions through
  analytics is one of the most important trends for HR leaders.
- The research by Naimatullah Shah (2017) highlights a
  contextual application of big data within an HR case study
  setting. The paper considers a data sample from a large
  public sector organization and identifies the influence of
  salary, job promotion, organizational loyalty and
  organizational identity on employee job satisfaction.
- Bersin (2015) examines the state of the market for people
  analytics and reports a number of findings, including:
  i) the growth of HR analytics will be exponential, but we are
  still in the early days; ii) most companies still don't
  really know what people analytics is; iii) modelling is
  valuable, but implementing models is the key.

LITERATURE REVIEW

- White (2015), Roos (2014) and Holmes (2012) provide an
  introduction to, and an in-depth explanation of, Big Data as
  a concept, the Hadoop ecosystem and its components such as
  Hive, Apache Pig, Sqoop and Apache Spark, and the underlying
  MapReduce framework.
- Various tools were used in these works, from Python and R to
  Big Data platforms such as Apache Hadoop and Hortonworks.
  R was used instead of Python in this project due to
  significant advantages.
- For smoothing time series and forecasting results, simple
  moving average and weighted moving average were used in
  some of the mentioned works. In this project, exponential
  smoothing was used for better forecasting and predictions.
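
The three smoothing schemes named above can be compared with a minimal sketch. It is written in plain Python for illustration only (the project itself used R and Tableau); the series `monthly_leaves`, the window size, the weights and `alpha` are all hypothetical values, not taken from the project data.

```python
def simple_moving_average(series, window):
    """Unweighted mean of the last `window` observations at each position."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def weighted_moving_average(series, weights):
    """Weighted mean over a sliding window; the last weight applies to the newest value."""
    window = len(weights)
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, series[i - window + 1:i + 1])) / total
            for i in range(window - 1, len(series))]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1 - alpha)*s[t-1]."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical monthly leave counts, for illustration only.
monthly_leaves = [10, 12, 14, 12, 16, 18]

sma = simple_moving_average(monthly_leaves, window=3)
wma = weighted_moving_average(monthly_leaves, weights=[1, 2, 3])
ses = exponential_smoothing(monthly_leaves, alpha=0.5)
```

Unlike the two moving averages, exponential smoothing needs no fixed window: every past value contributes, with a weight that decays geometrically with its age.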

SOLUTION APPROACH

- SOFTWARE TOOLS USED:
    - R 3.4.1 for Windows
    - RStudio 1.0.143 for Windows
    - Cloudera's Distribution for Hadoop (CDH 5.10)
    - QuickStart VM 5.10.0.0 for CDH 5.10
    - VMware Workstation 12 Player
    - Tableau Desktop 10.3.1 (64-bit)
- SYSTEM REQUIREMENTS:
    - Windows 10 operating system (64-bit)
    - CentOS 6.9 with Cloudera VM
    - 8 GB RAM required for Cloudera VM

DATA ANALYSIS PROCEDURE

- Collect the data from various sources in the organization.
- Organize the data into tables with rows and columns (for
  structured data).
- Describe the nature of the data to be analyzed, identify the
  different data types and understand the dataset as a whole.
- Identify the questions that need to be answered by the
  analysis.
- Data cleaning: once processed and organized, the data may
  be incomplete, contain duplicates or contain errors. Common
  data cleaning tasks include record matching, identifying
  inaccurate data, filling in missing values and deduplication.
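
The cleaning tasks listed above can be sketched as follows. This is an illustrative Python example, not the procedure actually used in the project: the record layout (personnel number, month, days of leave), the sample rows and the choice of mean imputation for missing values are all assumptions.

```python
from statistics import mean

# Hypothetical leave records: (personnel number, month, days of leave).
# None marks a missing value; the second row is an exact duplicate.
raw_records = [
    ("P001", "2016-01", 2),
    ("P001", "2016-01", 2),
    ("P002", "2016-01", None),
    ("P003", "2016-01", 4),
    ("P004", "2016-01", -1),  # negative days: an obviously inaccurate entry
]


def deduplicate(records):
    """Drop exact duplicate rows while preserving the original order."""
    seen = set()
    unique = []
    for row in records:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def drop_invalid(records):
    """Remove rows whose leave count is present but negative."""
    return [r for r in records if r[2] is None or r[2] >= 0]


def fill_missing(records):
    """Replace missing leave counts with the mean of the known values (an assumed policy)."""
    known = [r[2] for r in records if r[2] is not None]
    fill_value = mean(known)
    return [(p, m, d if d is not None else fill_value) for p, m, d in records]


cleaned = fill_missing(drop_invalid(deduplicate(raw_records)))
```

Invalid rows are dropped before imputation so that an erroneous value does not distort the mean used to fill the gaps.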

DATA ANALYSIS PROCEDURE

- Exploratory analysis: once the data is cleaned, it can be
  analyzed. Exploration may result in additional data cleaning
  or additional requests for data, so these activities may be
  iterative. Descriptive statistics such as the mean, median
  and standard deviation are generated to help understand the
  data.
- Predictive analysis is performed on the time series data. We
  look for patterns or trends in the various parameters of the
  dataset over time and predict future trends in the
  organization.
- Data visualization is used to examine the data in graphical
  form and to gain additional insight into the messages within
  the data.
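
As a small illustration of the descriptive statistics mentioned above, the following Python sketch computes the mean, median and standard deviation using only the standard library. The `attendance_hours` values are hypothetical and do not come from the project data.

```python
from statistics import mean, median, stdev

# Hypothetical monthly attendance hours for one employee, for illustration only.
attendance_hours = [160, 152, 168, 144, 176]

summary = {
    "mean": mean(attendance_hours),
    "median": median(attendance_hours),
    "std_dev": stdev(attendance_hours),  # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```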

ANALYSIS IN R AND HIVE

The analysis and visualization of the data were done using the
following two sets of tools:

- R, in the RStudio environment: the data is analyzed and
  plotted as graphs using the R language in the RStudio IDE.
- Apache Hive, combined with Tableau: data is queried and
  aggregated using Hive. Tableau is used to plot and visualize
  the data in an effective, easy-to-understand manner. The
  Hive tables serve as the data source for the graphs plotted
  in Tableau.

ANALYSIS IN R, HIVE AND TABLEAU

- First, a smaller dataset of 300 MB was analyzed using R.
- The second dataset had the same attributes as the first one,
  but it was 1.5 GB in size and covered the HR data of all the
  labs that are part of DRDO.
- Since more than 1 GB of data is a large amount to process and
  analyze in R, a second approach was used for this dataset:
  the data was stored in the Hadoop Distributed File System
  (HDFS) and queried using Hive.
- The results of the queries were plotted using Tableau.

CLUSTERING AND FORECASTING IN TABLEAU

- Apart from producing insightful graphs, Tableau was used to
  perform k-means clustering, separating groups of employees
  on the basis of their characteristics and traits over the
  years.
- Cluster analysis partitions the marks in the view into
  clusters, where the marks within each cluster are more
  similar to one another than they are to marks in other
  clusters. Tableau distinguishes clusters using color. The
  k-means algorithm is used for clustering.
- Forecasting in Tableau is done using a technique known as
  exponential smoothing.
- Exponential smoothing models iteratively forecast future
  values of a regular time series from weighted averages of
  past values of the series.
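
To make the clustering idea on this slide concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. It is a generic Python illustration and not Tableau's implementation; the function name `kmeans_1d`, the evenly spaced initial centroids and the `average_leaves` values are assumptions made for the example.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on one-dimensional data.

    Initial centroids are spaced evenly between the minimum and maximum,
    so the result is deterministic (an assumption made for this sketch).
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / max(k - 1, 1)
    centroids = [lo + step * i for i in range(k)]

    def nearest(value):
        # Index of the centroid closest to this value.
        return min(range(k), key=lambda c: abs(value - centroids[c]))

    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        labels = [nearest(v) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids

    labels = [nearest(v) for v in values]
    return labels, centroids


# Hypothetical average leave days per employee, for illustration only.
average_leaves = [2, 3, 4, 20, 21, 22, 40, 41]
labels, centroids = kmeans_1d(average_leaves, k=3)
```

Values that end up with the same label are closer to their own centroid than to any other, which is the property the cluster colors in a scatter plot convey.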

RESULTS OF ANALYSIS IN R

Figure: Holt-Winters forecasting of male leaves

Figure: Designation-wise leave claim comparison

Figure: Boxplots of the average number of days on business
trips taken by officers in each month

Figure: Attendance trend of male vs. female employees, on
average per month, over the four-year period

Figure: Average attendance of officers vs. non-officers as a
yearly time series

Figure: Average overtime claims per employee in the four weeks
of a month, grouped by age group

Figure: Average overtime claims per employee in the four weeks
of a month, grouped by directorate

RESULTS OF ANALYSIS IN HADOOP AND TABLEAU
Figure: Treemap of average male and female leaves over the
four-year period, grouped by age group

Figure: Scatter plot of average leaves by PersNo of the
employee, sorted in descending order. The data is clustered
into 8 groups. The most leaves are taken by employees whose
PersNo belongs to Cluster 8 and the fewest by those in
Cluster 2.

Figure: Average total OT hours (actual and forecast) for each
employee designation. Color indicates gender. The marks are
labelled with the average total hours.

Figure: Trend of average total OT hours by week. Color
indicates gender and the forecast indicator.

Figure: Monthly trend of business trips for male and female
employees. There is a sudden increase in the number of
business trips during May-August, which is when most of the
missile launch missions take place in the organization.

Figure: Packed bubble plot of the average number of business
trip days, by PersNo. A larger bubble means more business trip
days. The top 11 business trip claimers are shown in the plot.

Figure: Scatter plot of average hours of attendance by PersNo,
clustered into 5 groups

CONCLUSION

- The results show various trends over the years with respect
  to employees' attendance, leaves, overtime claims and
  business trips.
- Employees with the highest attendance, overtime claims,
  business trips and leaves were identified. Employees with
  the highest average attendance over the four-year period
  were considered to be sincere towards their work for the
  organization.
- Comparative analysis on the basis of gender, payroll,
  designation, directorate and age group was performed on all
  four datasets.
- Predictions were made for the upcoming years using
  forecasting in Tableau.

REFERENCES

- Bersin, J.: 2015, People analytics takes off, People
  Analytics conference, San Francisco.
- Shah, N., Irani, Z. and Sharif, A. M.: 2017, Big data in an
  HR context: Exploring organizational change readiness,
  employee attitudes and behaviors, Journal of Business
  Research 70, 366-378.
- Roberts, G.: 2015, How HR can harness the power of predictive
  analytics, InsideHR.
- Roos, D. D.: 2014, Hadoop For Dummies, Wiley.
- Runkler, T. A.: 2012, Data Analytics: Models and Algorithms
  for Intelligent Data Analysis, Springer Vieweg.
