
SALARY PREDICTION

Capstone Project Final Report

Submitted by:
PRADEEP S
TV SRI HARI SUBRAMANYAM
SAI ADITI ARAVIND
BHARGAV NADUPALLI
TEJAASWII B

Under the guidance of


Mr Anurag Vishnoi

Batch: PGPMEx Jan ‘21

Year of Completion: 2022


ACKNOWLEDGEMENTS

We would like to express our sincere gratitude to the Great Lakes Institute of Management and Great
Learning Program Management team for their continuous support, patience, motivation, enthusiasm, and
immense knowledge. Their guidance helped us with our research and completion of this Capstone Project.

We would like to acknowledge and give our warmest thanks to our Mentor, Mr. Anurag Vishnoi, and the
Panellist for our Presentation, Mr Jay Narayan Das, for their invaluable inputs. We would also like to thank
our Professors: Mr. Vinit Thakur (Machine Learning and its Applications), Dr. Abhinanda Sarkar (Business
Analytics using Python), Mr Vivek Anand and Mr Raghavshyam (Data Visualization using Tableau),
whose advice carried us through the course. We would also like to thank our team members for their support.

GLOSSARY OF TERMS / ABBREVIATIONS:
CTC Cost to Company

Exp Work Experience

HR Human Resource

EDA Exploratory Data Analysis

R2 R Squared

RMSE Root Mean Squared Error

INR Indian Rupee

PG Post Graduate

NA Not Applicable / Not Available

Grad Graduate

Stat Statistics

EXECUTIVE SUMMARY

Problem Statement:
The HR Department of Delta Ltd. has to maintain a salary range for employees with similar profiles. Factors
like work experience and relevance of skill set with job requirement are also evaluated in interviews, along
with existing salary.

This helps departments like HR and Finance plan, allocate funds and search for talent accordingly. It also
scales up the organisation and shows the impact of existing salary on other variables.

With the data given, a model has to be built to predict salary to be offered to potential employees of the
company. The aim is to use historical data so as to avoid any bias and to have minimal judgement in salary
prediction among similar employee profiles.

Main Results:
After the EDA and model building, it can be inferred that the Expected Salary is always higher than the
Current Salary. The expected salary has a positive correlation with almost all fields except certifications,
which have a negative correlation coefficient. The international degree (treated as a Boolean value) does
not provide any insight when analysed against the predicted salary. It would have been clearer if the
international degree could be associated with the specialisation, and if the institute were mentioned.

The expected salary ranges from 2,03,744 to 55,99,970. The minimum and maximum salaries for freshers
are always lower than those for experienced workers, and the salary increases as experience increases.
With regard to experience and educational qualification, there has been a linear increase in the salary
with a positive gradient. The last appraisal, offer in hand, publications, certifications and International
Degree have had very little impact.

Hence the company can hire a highly educated, qualified candidate at a lower salary by negotiating on the
basis of work experience. While overall experience plays a major role in the model, experience in the
applied field has had very little impact, as many of its values are ‘0’. The regression-based salary
prediction model has an accuracy (R^2) of about 99% and a low Root Mean Squared Error of about 0.06
on the normalised values.

The five fields with the strongest impact on the expected salary are, in that order, Current CTC,
Experience, Education, Experience in the applied field and the number of companies worked. The
fresher data is challenging to interpret and contributes to the sparse set of data with zeros in many
fields; it is therefore an outlier in the dataset. The prediction model has been built with the regression
methodology, using the XGBoost and Gradient Boosted learner nodes on KNIME. MS Excel, Tableau
and Python are used for data pre-processing, Exploratory Data Analysis and Data Visualisation.

Business Insight
Location vs Current & Expected Salary

Fig:1 Location vs Current and Expected Salary


The bar diagram above compares the average current and expected CTC by location. Both remain
highest in Mangalore, with an average expected CTC of INR 22.8 lakhs and an average current CTC
of INR 17.8 lakhs. Kolkata is second highest, with an average expected CTC of INR 22.7 lakhs
and an average current CTC of INR 17 lakhs. While Western and Eastern regions of India have the
highest average CTC, North India has the lowest expected CTC in places like Ahmedabad and Delhi,
with INR 17.31 to INR 22.1 lakhs respectively.

Location       Current   Preferred
Ahmedabad      1113      1188
Bangalore      1226      1151
Bhubaneswar    1226      1148
Chennai        1135      1143
Delhi          1156      1190
Guwahati       1138      1191
Jaipur         1226      1134
Kanpur         1163      1212
Kolkata        1162      1162
Lucknow        1139      1137
Mangalore      1185      1144
Mumbai         1175      1101
Nagpur         1179      1155
Pune           1106      1179
Surat          1089      1183

Table 1: Comparison of Present and Preferred Location

Current Location vs Average Expected CTC

Fig 2: Current Location vs Average Expected CTC

In the above bubble chart, it is understood that Mangalore has the highest average expected CTC
of INR 22.84 lakhs while Ahmedabad has the lowest average expected CTC of INR 22.10 lakhs.
Chennai, Kolkata, Guwahati also have a very high Average Expected CTC.

Preferred Location vs Average Expected CTC

Fig 3: Preferred Location vs Average Expected CTC

From the above chart, Chennai remains a preferred location with the highest average expected
CTC with INR 22.82 lakhs. It is seen from this bubble chart that northern and south eastern parts
of India have the lowest average expected CTC.

Overall, location appears to have little impact when cross-verified against the median and
average values of the current and expected CTC. Most candidates chose a preferred location
different from their current location.
Education vs Expected Salary

Education     Count
Doctorate     7886
Grad          1480
PG            3692
Under Grad    4360

Table 2: Candidate Count by Education Level

Fig 4: Education vs Average Expected CTC


From the above tree map, it can be inferred that the doctorate segment has the highest average
expected salary, at INR 24.2 lakhs, while the undergraduate segment has the lowest average
expected salary, at INR 17 lakhs. The PG and Grad segments have the same average expected
salary of INR 24 lakhs per annum.

Graduation vs Average Expected CTC

Fig 5: Graduation vs Average Expected CTC


As per the horizontal bars shown above, Sociology has the highest average expected CTC, at
INR 24.7 lakhs, while Statistics has the lowest, at INR 23 lakhs. The Botany and Chemistry
specialisations also have very high average expected salaries of INR 24.4 lakhs and INR 24.6
lakhs. Arts, the most popular specialisation, has a high average expected salary of INR 24 lakhs.
Doctorate’s Average vs Current and Expected CTC

Fig 6: Doctorate’s Average vs Current and Expected CTC

Irrespective of the industries, Doctorate’s expected average salary is as high as INR 25 lakhs which
can be inferred from the side-by-side bars shown above. From this graph, it can be understood that
across industries, the salary offered to doctorates is the highest in the market.

Industry vs Average and Current CTC, Role vs Average Expected CTC, Last Appraisal
Rating vs Average Expected CTC:

Fig 7: Dashboard of Industry vs Average and Current CTC, Role vs Average Expected CTC, Last
Appraisal Rating vs Average Expected CTC:

Industry      Count    Department        Count
Analytics     1412     Accounts          597
Automobile    1401     Analytics/BI      1424
Aviation      1899     Banking           1333
BFSI          1437     Education         1371
FMCG          1424     Engineering       1352
IT            1428     HR                1399
Insurance     1452     Healthcare        1393
NA            896      IT-Software       547
Others        1333     Marketing         1645
Retail        1426     NA                896
Telecom       1398     Others            3000
Training      1912     Sales             1326
                       Top Management    1135
Table 3: Industry and Department Comparison

From the above dashboard, it can be inferred that FMCG segment’s Average Current & Expected CTC
remains the highest. With respect to role, Research Scientists’ Average Expected CTC is the highest
among all the other roles like Associate, Professor, Team lead or analyst. It is also evident that the last
appraisal rating plays a significant role in determining the expected CTC. Key performers' Average
Expected CTC is INR 27.10 lakhs which is the highest among the other ratings like A,B,C,D.

Recommendations
- The expected salary is always higher than the current salary, so the company should concentrate on
retaining existing employees and limiting attrition, as the cost of new hires would always be on the
higher side.
- The company has to collect correct data, as about 5 percent of the education data in the source is
incorrect. Empty and missing fields may be avoided by offering the right options in the forms
while collecting data. Avoiding missing and incorrect data improves the data pre-processing and
EDA. The averages of the current and expected salary, compared across current and preferred
locations, do not show any drastic variations, implying that candidates prefer a wide range of
locations. Candidates mostly choose to change their location, as very few retain their current
location as preferred.
- The minimum and maximum of the expected salary were impacted mainly by three factors, in
hierarchical order: Current CTC, Experience and Education. The dependency of expected salary
on current salary is very high, and the current salary has a strong relation with experience and
education. Hence, if the company wishes to reduce the budget, it should hire less experienced
and highly educated professionals.
- With a prediction model this accurate, the company can prepare its talent pool more precisely
during project planning and budgeting. The company can be sure about the cost of hire, and the
same would apply to a potential manager or replacement in the project.

Section 1: Introduction
Need, Objective of the study:
The objective of the study is to ensure that there is minimal bias and no discrimination while predicting the
salary offered to potential employees of Delta Ltd. A prediction model, using historical data, has been
created to ensure there is minimal manual adjustment. The prediction model would enable the decision
makers: for instance, the Finance and HR Department of Delta Ltd. to effectively manage its existing labour
pool and potential candidates by giving them satisfying emoluments. This study also indicates the impact
of existing salary on other variables such as Total Experience and Education Qualification.

Data Sources, Approach:


The data given has been extracted from the HR database of a job portal. The data clean-up and pre-
processing has been done using Python and Univariate and Bivariate analysis has been done using MS
Excel, following which the model has been built using KNIME.

On visual inspection of the data, it has been found that the data has 25,000 rows with data for 29 columns
such as Total Work Experience, Educational Qualification, Specialisation, Current CTC, Current and
Preferred Location, Number of publications, Certifications, Last Appraisal rating, etc.

During the EDA process, it has been found that a total of 7 columns, namely Applicant ID, Graduation
location, Graduation year of passing, Post-Graduation location, Post-Graduation year of passing, Doctorate
location and Doctorate year of passing, have very little relevance to the business case and have hence been
dropped from the analysis.

There were about 1918 incorrect entries in the education field. To correct them, the education field has been
updated with the highest qualification, ordered as Doctorate, PG, Graduate and Undergraduate. As a result,
1028 Undergraduates have been upgraded to Graduates and 890 Graduates have been updated to Post Graduates.
The role and designation fields which had the value ‘0’ have been modified to ‘Fresher’.
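
The correction described above can be illustrated with a minimal Python sketch. The report does not list the column headers or the exact rule used, so the field names (`education`, `graduation_specialization`, `pg_specialization`, `phd_specialization`, `role`, `designation`) and the rule "take the highest level for which details are present" are assumptions made purely for illustration.

```python
# Education levels from lowest to highest, as ordered in the report.
LEVELS = ["Under Grad", "Grad", "PG", "Doctorate"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def highest_qualification(record):
    """Return the highest level for which the record carries details.

    Field names are hypothetical; the source data may use different headers.
    """
    level = "Under Grad"
    if record.get("graduation_specialization"):
        level = "Grad"
    if record.get("pg_specialization"):
        level = "PG"
    if record.get("phd_specialization"):
        level = "Doctorate"
    return level


def clean_record(record):
    """Upgrade an understated education value and relabel '0' roles."""
    cleaned = dict(record)
    stated = cleaned.get("education", "Under Grad")
    derived = highest_qualification(cleaned)
    # Only ever upgrade; a stated level that is already higher is kept.
    if RANK[derived] > RANK.get(stated, 0):
        cleaned["education"] = derived
    # Role and designation values of '0' are relabelled as 'Fresher'.
    for field in ("role", "designation"):
        if str(cleaned.get(field)) == "0":
            cleaned[field] = "Fresher"
    return cleaned
```

For example, a record stated as `Grad` that also carries post-graduation details would be updated to `PG`, while a record stated as `Doctorate` is left unchanged.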

No modifications were made to the Current CTC and Expected CTC fields.

Limitations:
There were some limitations with the study and they are as follows:
● There was a lot of missing data.
● There was incorrect data in the Education field, which has been resolved with modifications to
the data.
● There was no time series data
● There was a lot of data which had no relevance to the business problem and was hence dropped
for analysis purposes.
There was ambiguity on whether the ‘Organisation’ field should be dropped, however, after visualisation
and clustering of the data, it has been retained for the study.

Section 2: Literature Review:


With reference to Sl.A listed under the references, the workflow and process of our project model have been
cross-verified against the employee salary prediction in Machine Learning given in datascience2000.in.
While the reference model studies the data using only Python, the project model uses Python, Excel,
Tableau and KNIME to study the data and build the model across a wider spectrum of platforms.

With reference to Sl.B, the KNIME blog aided in understanding the functionality of the features available
within the KNIME platform. The referred case study given in the link discusses the prediction of attrition
in the company and the workflow. The systematic approach used in the case study of the blog gives us
enriched knowledge on the usage of various nodes to build the prediction model.

With reference to Sl.C, the platform shows how model building has been used in the market as competition
to identify the knowledge sources worldwide. The case study has predicted the salary of any UK job ad
based on its contents and the competitors' data provided insight about the usage of the various data or fields
in the prediction model.

The KNIME node guide page for classification and predictive modelling was also used as reference in
model building for salary prediction in our project, with reference to Sl.D.

Section 3: Exploratory Data Analysis:


Univariate Analysis:
The statistical analysis of the numeric fields, such as experience, experience in the applied field, number
of companies worked, Certifications, Publications, and Current and Expected salary, has been undertaken
using Excel. The details are as follows:

Column                       Min       Mean       Median     Max        Std. Dev.     Skew     Kurtosis
Total Exp                    0         12.2102    12         25         7.55          0.02     -1.20
Total Exp (field applied)    0         6.1457     4          25         5.82          0.97     0.12
No_Of_Co worked              0         3.4161     3          6          1.73          -0.08    -0.94
Certifications               0         0.7593     0          5          1.19          1.65     2.01
Current CTC                  0         1719517    1768468    3999490    927554.59     0.06     -0.69
Expected CTC                 203744    2172380    2178330    5399311    1134978.20    0.31     -0.56
Table 4: Univariate Analysis done using Excel
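
The same summary statistics can be reproduced outside Excel. The sketch below uses only the Python standard library and assumes that the skew and kurtosis columns correspond to Excel's sample-adjusted `SKEW` and `KURT` functions and that the standard deviation is the sample standard deviation; the report does not state which Excel functions were used.

```python
import statistics


def excel_skew(values):
    """Sample skewness, using the formula of Excel's SKEW function."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def excel_kurt(values):
    """Sample excess kurtosis, using the formula of Excel's KURT function."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


def summarise(values):
    """Return the statistics reported per column in the univariate table."""
    return {
        "min": min(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "std_dev": statistics.stdev(values),
        "skew": excel_skew(values),
        "kurtosis": excel_kurt(values),
    }
```

A uniform distribution has an excess kurtosis of about -1.2, so the value of -1.20 reported for Total Exp is consistent with experience being spread almost evenly across 0 to 25 years.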

The experience column shows that freshers (0 years) form the largest single group, with fewer candidates
at each individual experience level. Most experienced candidates have worked in at least 2 or 3 companies.
The salary has a wide range, which is evident from the standard deviation of the current and expected
salary. The certifications, publications and number of companies worked have comparatively low standard
deviations. In the appraisal field, the key performer category accounted for only about 5 percent, and the
freshers have no rating (NA). The detailed split-up of the data is given in the following table:

Experience    Count    No of Companies Worked    Count
0             896      0                         896
1             680      1                         1437
2             674      2                         3503
3             685      3                         3412
4             617      4                         2776
5             699      5                         2646
6             666      6                         2748
                       Appraisal Rating          Count
7             656
8             718      A                         3716
9             666      B                         4044
10            650      C                         3820
11            665      D                         3941
12            671      Key_Performer             1001
13            662      NA                        896
                       Certifications            Count
14            651
15            691      0                         10656
16            678      1                         3246
17            643      2                         1565
18            642      3                         1199
19            667      4                         508
20            618      5                         244
21            678
22            659
23            627
24            647
25            612
Table 5: Statistics Table

Bivariate Analysis:

Correlation                  Total Exp   Total Exp         Current_CTC   No_Of_Co   Publications   Certifications   Expected_CTC
                                         (field applied)                 Worked
Total Exp                    1.0000
Total_Exp (field applied)    0.6451      1.0000
Current_CTC                  0.8465      0.5480            1.0000
No_Of_Co Worked              0.3981      0.2490            0.3797        1.0000
Publications                 -0.0005     -0.0107           -0.0064       0.0006     1.0000
Certifications               -0.0011     -0.0028           -0.1434       0.0130     0.0185         1.0000
Expected_CTC                 0.8166      0.5291            0.9867        0.3432     0.0015         -0.1740          1.0000

Table 6: Bivariate Analysis

The relationships between the fields given above have been studied using the correlation coefficient.
The correlation with the target field (expected salary) has been taken into account to identify the major
contributors. With a value of 0.987, the current salary is the most important field in determining the
expected salary. The total experience, experience in the applied field and number of companies worked
contribute, in that order, towards the expected salary. Publications and certifications have very low
correlation coefficients, which shows they have little significance for the expected salary. The total
experience and current salary also have a very strong relation, with a correlation coefficient of 0.847.
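
For reference, the correlation coefficient used in the table above is the Pearson coefficient. A minimal standard-library sketch of how such a matrix can be computed is given below; the function names and the dictionary-of-columns input format are illustrative choices, not part of the original Excel workflow.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Ranking the fields by the value in the Expected_CTC row then gives the order of contributors described above.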

Section 4: Classification/ Clustering:
Business Insight using Clustering:
Clustering is the process of grouping similar observations into smaller groups within the larger
population. It has widespread application in business analytics. One of the questions facing businesses is
how to organise the huge amounts of available data into meaningful structures. The clustering was carried
out using K-Means, Hierarchical and DBSCAN models.
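
To make the K-Means step concrete, a minimal standard-library sketch of the algorithm is shown below. It is not the KNIME node used in the project; the function name, the random initialisation and the fixed seed are assumptions for illustration. The number of clusters is not stated in the report, although the figures refer to clusters up to cluster4, which would suggest k = 5.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples.

    Returns (labels, centroids). Initial centroids are k randomly sampled points.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids
```

Applied to pairs of (current CTC, expected CTC), preferably after normalisation so that both dimensions carry comparable weight, the labels can be used to colour a scatter plot.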

KNIME CLUSTERING MODEL

Fig 8: Knime Clustering Model

CLUSTERING OF CURRENT CTC VS EXPECTED CTC:

Fig 9: Clustering of Current CTC vs Expected CTC

As per the clusters shown in the above scatter plot, it is inferred that the expected CTC increases with the
current CTC and never falls below it. It is understood that the employee's salary expectation is always
higher than the current CTC. There are no outliers in the pattern except the freshers' salaries, which form
a vertical pattern because their current salary is zero.

Scatter Plot (Expected CTC vs Education)

Fig 10: Scatter plot showing Expected CTC vs Education

The education vs expected salary plot clearly indicates that undergraduates are missing from the top
salary cluster. Doctorates and PGs occupy the top spot, as shown in the above graph.

Scatter Plot (Expected CTC vs Total Experience)

Fig 11: Scatter plot showing Total Experience vs Expected CTC

The experience gives the ideal indication of a step by step increment in the salary and the cluster for the top
salary group starts with experience of 16 years and above.

Scatter Plot (Designation vs Expected CTC)

Fig 12: Scatter plot showing Designation vs Expected CTC

The designation vs expected salary plot shows that Freshers’ salaries fall only under cluster0 (red).
Scientists are very few in number; they are not part of cluster4 (blue) and are few in cluster3
(green). A drop was observed in the Network Engineer and Medical Officer categories. The HR, Marketing
Manager, Research Analyst and Product Manager designations are major contributors to cluster4 (blue).

Scatter Plot ( Designation vs Experience)

Fig 13: Scatter plot showing Designation vs Total Work Experience

The designation versus experience pattern showed a mix of the clusters in all designations except Scientists
and Freshers, where the scientists have less than 10 years of experience and the freshers have nil experience.

Section 5: Model Building:


Prediction Models:
The salary prediction model has been built using the regression methodology on KNIME. Four types of
learner models have been used to study the training data set (70%), namely: Linear Regression,
Random Forest, Gradient Boosted Tree and XGBoost. The data was read using an Excel reader, the
values were normalised using Z-Score normalisation (Gaussian method), and the data was then
partitioned in a 70:30 ratio before being given to the learner models. The corresponding predictor
nodes were applied to the test data set, and a Numeric Scorer node has been used to find the R^2 value
and RMSE. The models executed without any warning or error messages and the final output has been
obtained.
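
The normalisation and partitioning steps can be sketched in plain Python as follows. This is an illustration of the technique rather than the KNIME nodes themselves: the use of the sample standard deviation, the random shuffling and the seed value are assumptions, since the report does not give the node settings.

```python
import random
import statistics


def z_score(values):
    """Z-score normalisation: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / std for v in values]


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and test sets (70:30 by default)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The sketch follows the order described above, normalising first and partitioning afterwards. Because the target is also normalised, error measures such as RMSE are expressed in standard-deviation units rather than in rupees.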

KNIME SALARY PREDICTOR MODEL:

Fig 14: Knime Salary Prediction Model


The output statistics of the four different types of the model is given below:

Regression Models / Statistics     Linear Regression   Gradient Boosted   Random Forest   XGBoost
                                   Model               Tree Model
R2                                 0.976747            0.996731           0.788723        0.996609
Mean Absolute Error                0.115259            0.036419           0.386197        0.038080
Mean Squared Error                 0.023071            0.003243           0.209618        0.003364
RMSE                               0.151891            0.056948           0.457841        0.058001
Mean Signed Difference             0.005497            -0.001009          0.289889        0.001798
Mean Absolute Percentage Error     0.729154            0.221670           3.464160        0.215162
Adjusted R2                        0.976747            0.996731           0.788723        0.996609

Table 7: Output Statistics of Linear Regression, Gradient Boosted Tree Model, Random Forest and
XGBoost Models
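
For readers who wish to reproduce these statistics outside KNIME, the sketch below computes them with the usual textbook definitions. The sign convention for the mean signed difference (predicted minus actual) and the skipping of zero actual values in the percentage error are assumptions; KNIME's scorer may differ in such details.

```python
import math


def regression_scores(actual, predicted):
    """Common regression error measures for paired actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    mse = ss_res / n
    # Percentage errors are undefined where the actual value is zero, so skip those.
    pct = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mean_signed_difference": sum(errors) / n,
        "mape": sum(pct) / len(pct) if pct else float("nan"),
    }
```

As a consistency check on the table above, RMSE is the square root of the Mean Squared Error: for the Linear Regression model, the square root of 0.023071 is approximately 0.151891.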

Model Tuning:
The models have been tuned to yield better results by changing the tree depth and the number of iterations.
The Gradient Boosted model and XGBoost have reached their maximum limit and yield the better results.
The efforts to improve the Random Forest model did not show any improvement. Therefore, model tuning
carried out using various combinations of these settings did not give much additional benefit, as two
successful models had already been built.

Interpretation of the most optimum model and its implication on the business:
From the analysis, the Gradient Boosted Tree model and the XGBoost model are the most optimal, as both
have an R^2 value of almost 99.7% and an RMSE value of about 0.06. The linear regression model has also
given a good result, with values of 97.7% and 0.152 respectively, but the Random Forest model has failed
to yield a good result due to the sparse dataset (zero values). The freshers having ‘0’ experience account
for 908 zero values in Experience, Experience in field applied and Current CTC. The fresher data has
become a challenge for the prediction model, as it is an outlier in all fields, with many values not
matching the rest of the values in the experienced candidate data set.

Conclusion:
The problem statement given for the capstone project was to predict the salary for Delta Ltd, and the historic
data of 25,000 candidates provided has been used as source data for the Machine Learning model. Using
various platforms, a prediction model with an accuracy (R^2) of about 99% has been built to predict the
salary of any candidate from a few parameters such as experience, education and current CTC.

Bibliography:
List of references:

Sl (A): https://www.datascience2000.in/2021/05/employee-salary-prediction-in-machine.html
Sl (B): https://www.knime.com/blog/predicting-employee-attrition-with-machine-learning?
Sl (C): https://www.kaggle.com/c/job-salary-prediction/code
Sl (D): https://www.knime.com/nodeguide/analytics/classification-and-predictive-modelling
